R Notes for Professionals
400+ pages of professional hints and tricks
Disclaimer
This is an unofficial free book created for educational purposes and is
not affiliated with official R group(s) or company(s).
All trademarks and registered trademarks are
the property of their respective owners.
Contents
About ................................................................................................................................................................................... 1
Chapter 1: Getting started with R Language .................................................................................................. 2
Section 1.1: Installing R ................................................................................................................................................... 2
Section 1.2: Hello World! ................................................................................................................................................ 3
Section 1.3: Getting Help ............................................................................................................................................... 3
Section 1.4: Interactive mode and R scripts ................................................................................................................ 3
Chapter 2: Variables .................................................................................................................................................... 7
Section 2.1: Variables, data structures and basic Operations .................................................................................. 7
Chapter 3: Arithmetic Operators ........................................................................................................................ 10
Section 3.1: Range and addition ................................................................................................................................. 10
Section 3.2: Addition and subtraction ....................................................................................................................... 10
Chapter 4: Matrices ................................................................................................................................................... 13
Section 4.1: Creating matrices .................................................................................................................................... 13
Chapter 5: Formula .................................................................................................................................................... 15
Section 5.1: The basics of formula ............................................................................................................................. 15
Chapter 6: Reading and writing strings .......................................................................................................... 17
Section 6.1: Printing and displaying strings ............................................................................................................... 17
Section 6.2: Capture output of operating system command ................................................................................. 18
Section 6.3: Reading from or writing to a file connection ....................................................................................... 19
Chapter 7: String manipulation with stringi package .............................................................................. 21
Section 7.1: Count pattern inside string ..................................................................................................................... 21
Section 7.2: Duplicating strings .................................................................................................................................. 21
Section 7.3: Paste vectors ........................................................................................................................................... 22
Section 7.4: Splitting text by some fixed pattern ...................................................................................................... 22
Chapter 8: Classes ...................................................................................................................................................... 23
Section 8.1: Inspect classes ......................................................................................................................................... 23
Section 8.2: Vectors and lists ..................................................................................................................................... 23
Section 8.3: Vectors ..................................................................................................................................................... 24
Chapter 9: Lists ............................................................................................................................................................ 25
Section 9.1: Introduction to lists .................................................................................................................................. 25
Section 9.2: Quick Introduction to Lists ..................................................................................................................... 25
Section 9.3: Serialization: using lists to pass information ........................................................................................ 27
Chapter 10: Hashmaps ............................................................................................................................................. 29
Section 10.1: Environments as hash maps ................................................................................................................ 29
Section 10.2: package:hash ........................................................................................................................................ 32
Section 10.3: package:listenv ...................................................................................................................................... 33
Chapter 11: Creating vectors ................................................................................................................................. 35
Section 11.1: Vectors from built-in constants: Sequences of letters & month names ........................................... 35
Section 11.2: Creating named vectors ........................................................................................................................ 35
Section 11.3: Sequence of numbers ............................................................................................................................ 37
Section 11.4: seq() ......................................................................................................................................................... 37
Section 11.5: Vectors .................................................................................................................................................... 38
Section 11.6: Expanding a vector with the rep() function ......................................................................................... 39
Chapter 12: Date and Time .................................................................................................................................... 41
Section 12.1: Current Date and Time .......................................................................................................................... 41
Section 12.2: Go to the End of the Month .................................................................................................................. 41
Section 12.3: Go to First Day of the Month ................................................................................................................ 42
Section 12.4: Move a date a number of months consistently by months ............................................................. 42
Chapter 13: The Date class ..................................................................................................................................... 44
Section 13.1: Formatting Dates ................................................................................................................................... 44
Section 13.2: Parsing Strings into Date Objects ........................................................................................................ 44
Section 13.3: Dates ....................................................................................................................................................... 45
Chapter 14: Date-time classes (POSIXct and POSIXlt) ............................................................................ 47
Section 14.1: Formatting and printing date-time objects ......................................................................................... 47
Section 14.2: Date-time arithmetic ............................................................................................................................. 47
Section 14.3: Parsing strings into date-time objects ................................................................................................ 48
Chapter 15: The character class .......................................................................................................................... 50
Section 15.1: Coercion .................................................................................................................................................. 50
Chapter 16: Numeric classes and storage modes ...................................................................................... 51
Section 16.1: Numeric ................................................................................................................................................... 51
Chapter 17: The logical class ................................................................................................................................. 53
Section 17.1: Logical operators ................................................................................................................................... 53
Section 17.2: Coercion ................................................................................................................................................. 53
Section 17.3: Interpretation of NAs ............................................................................................................................. 53
Chapter 18: Data frames ......................................................................................................................................... 55
Section 18.1: Create an empty data.frame ................................................................................................................ 55
Section 18.2: Subsetting rows and columns from a data frame ............................................................................ 56
Section 18.3: Convenience functions to manipulate data.frames .......................................................................... 59
Section 18.4: Introduction ............................................................................................................................................ 60
Section 18.5: Convert all columns of a data.frame to character class .................................................................. 61
Chapter 19: Split function ....................................................................................................................................... 63
Section 19.1: Using split in the split-apply-combine paradigm ............................................................................... 63
Section 19.2: Basic usage of split ............................................................................................................................... 64
Chapter 20: Reading and writing tabular data in plain-text files (CSV, TSV, etc.) ................... 67
Section 20.1: Importing .csv files ................................................................................................................................ 67
Section 20.2: Importing with data.table .................................................................................................................... 68
Section 20.3: Exporting .csv files ................................................................................................................................ 69
Section 20.4: Import multiple csv files ....................................................................................................................... 69
Section 20.5: Importing fixed-width files ................................................................................................................... 69
Chapter 21: Pipe operators (%>% and others) ............................................................................................. 71
Section 21.1: Basic use and chaining .......................................................................................................................... 71
Section 21.2: Functional sequences ........................................................................................................................... 72
Section 21.3: Assignment with %<>% .......................................................................................................................... 73
Section 21.4: Exposing contents with %$% ................................................................................................................ 73
Section 21.5: Creating side effects with %T>% ........................................................................................ 74
Section 21.6: Using the pipe with dplyr and ggplot2 ................................................................................................ 75
Chapter 22: Linear Models (Regression) ......................................................................................................... 76
Section 22.1: Linear regression on the mtcars dataset ........................................................................................... 76
Section 22.2: Using the 'predict' function .................................................................................................................. 78
Section 22.3: Weighting .............................................................................................................................................. 79
Section 22.4: Checking for nonlinearity with polynomial regression ..................................................................... 81
Section 22.5: Plotting The Regression (base) ........................................................................................................... 83
Section 22.6: Quality assessment .............................................................................................................................. 85
Chapter 23: data.table ............................................................................................................................................. 87
Section 23.1: Creating a data.table ............................................................................................................................ 87
Section 23.2: Special symbols in data.table ............................................................................................................. 88
Section 23.3: Adding and modifying columns .......................................................................................................... 89
Section 23.4: Writing code compatible with both data.frame and data.table ...................................................... 91
Section 23.5: Setting keys in data.table .................................................................................................................... 93
Chapter 24: Pivot and unpivot with data.table .......................................................................................... 95
Section 24.1: Pivot and unpivot tabular data with data.table - I ............................................................................. 95
Section 24.2: Pivot and unpivot tabular data with data.table - II ........................................................................... 96
Chapter 25: Bar Chart .............................................................................................................................................. 98
Section 25.1: barplot() function .................................................................................................................................. 98
Chapter 26: Base Plotting .................................................................................................................................... 104
Section 26.1: Density plot .......................................................................................................................................... 104
Section 26.2: Combining Plots .................................................................................................................................. 105
Section 26.3: Getting Started with R_Plots ............................................................................................................. 107
Section 26.4: Basic Plot ............................................................................................................................................. 108
Section 26.5: Histograms .......................................................................................................................................... 111
Section 26.6: Matplot ................................................................................................................................................ 113
Section 26.7: Empirical Cumulative Distribution Function ..................................................................................... 119
Chapter 27: boxplot ................................................................................................................................................. 121
Section 27.1: Create a box-and-whisker plot with boxplot() {graphics} .............................................................. 121
Section 27.2: Additional boxplot style parameters ................................................................................................ 125
Chapter 28: ggplot2 ................................................................................................................................................ 128
Section 28.1: Displaying multiple plots .................................................................................................................... 128
Section 28.2: Prepare your data for plotting ......................................................................................................... 131
Section 28.3: Add horizontal and vertical lines to plot .......................................................................................... 133
Section 28.4: Scatter Plots ........................................................................................................................................ 136
Section 28.5: Produce basic plots with qplot .......................................................................................................... 136
Section 28.6: Vertical and Horizontal Bar Chart .................................................................................................... 138
Section 28.7: Violin plot ............................................................................................................................................. 140
Chapter 29: Factors ................................................................................................................................................. 143
Section 29.1: Consolidating Factor Levels with a List ............................................................................................ 143
Section 29.2: Basic creation of factors ................................................................................................................... 144
Section 29.3: Changing and reordering factors ..................................................................................................... 145
Section 29.4: Rebuilding factors from zero ............................................................................................................ 150
Chapter 30: Pattern Matching and Replacement .................................................................................... 152
Section 30.1: Finding Matches .................................................................................................................................. 152
Section 30.2: Single and Global match ................................................................................................................... 153
Section 30.3: Making substitutions .......................................................................................................................... 154
Section 30.4: Find matches in big data sets ........................................................................................................... 154
Chapter 31: Run-length encoding ..................................................................................................................... 156
Section 31.1: Run-length Encoding with `rle` ............................................................................................................ 156
Section 31.2: Identifying and grouping by runs in base R ..................................................................................... 156
Section 31.3: Run-length encoding to compress and decompress vectors ........................................................ 157
Section 31.4: Identifying and grouping by runs in data.table ............................................................................... 158
Chapter 32: Speeding up tough-to-vectorize code ................................................................................. 159
Section 32.1: Speeding tough-to-vectorize for loops with Rcpp ........................................................................... 159
Section 32.2: Speeding tough-to-vectorize for loops by byte compiling ............................................................ 159
Chapter 33: Introduction to Geographical Maps ...................................................................................... 161
Section 33.1: Basic map-making with map() from the package maps ............................................................... 161
Section 33.2: 50 State Maps and Advanced Choropleths with Google Viz ......................................................... 164
Section 33.3: Interactive plotly maps ...................................................................................................................... 165
Section 33.4: Making Dynamic HTML Maps with Leaflet ...................................................................................... 167
Section 33.5: Dynamic Leaflet maps in Shiny applications .................................................................................. 168
Chapter 34: Set operations ................................................................................................................................. 171
Section 34.1: Set operators for pairs of vectors ..................................................................................................... 171
Section 34.2: Cartesian or "cross" products of vectors ......................................................................................... 171
Section 34.3: Set membership for vectors .............................................................................................................. 172
Section 34.4: Make unique / drop duplicates / select distinct elements from a vector .................................... 172
Section 34.5: Measuring set overlaps / Venn diagrams for vectors ................................................................... 173
Chapter 35: tidyverse ............................................................................................................................................. 174
Section 35.1: tidyverse: an overview ........................................................................................................................ 174
Section 35.2: Creating tbl_df’s ................................................................................................................................. 175
Chapter 36: Rcpp ...................................................................................................................................................... 176
Section 36.1: Extending Rcpp with Plugins .............................................................................................................. 176
Section 36.2: Inline Code Compile ............................................................................................................................ 176
Section 36.3: Rcpp Attributes ................................................................................................................................... 177
Section 36.4: Specifying Additional Build Dependencies ...................................................................................... 178
Chapter 37: Random Numbers Generator .................................................................................................. 179
Section 37.1: Random permutations ........................................................................................................................ 179
Section 37.2: Generating random numbers using various density functions ..................................................... 179
Section 37.3: Random number generator's reproducibility .................................................................................. 181
Chapter 38: Parallel processing ........................................................................................................................ 182
Section 38.1: Parallel processing with parallel package ........................................................................................ 182
Section 38.2: Parallel processing with foreach package ...................................................................................... 183
Section 38.3: Random Number Generation ............................................................................................................ 184
Section 38.4: mcparallelDo ....................................................................................................................................... 184
Chapter 39: Subsetting .......................................................................................................................................... 186
Section 39.1: Data frames ......................................................................................................................................... 186
Section 39.2: Atomic vectors .................................................................................................................................... 187
Section 39.3: Matrices ............................................................................................................................................... 188
Section 39.4: Lists ...................................................................................................................................................... 190
Section 39.5: Vector indexing ................................................................................................................................... 191
Section 39.6: Other objects ....................................................................................................................................... 192
Section 39.7: Elementwise Matrix Operations ........................................................................................................ 192
Chapter 40: Debugging ......................................................................................................................................... 194
Section 40.1: Using debug ........................................................................................................................................ 194
Section 40.2: Using browser ..................................................................................................................................... 194
Chapter 41: Installing packages ....................................................................................................................... 196
Section 41.1: Install packages from GitHub ............................................................................................................. 196
Section 41.2: Download and install packages from repositories ......................................................................... 197
Section 41.3: Install package from local source ..................................................................................................... 198
Section 41.4: Install local development version of a package .............................................................................. 198
Section 41.5: Using a CLI package manager -- basic pacman usage ................................................................. 199
Chapter 42: Inspecting packages .................................................................................................................... 200
Section 42.1: View Package Version ........................................................................................................................ 200
Section 42.2: View Loaded packages in Current Session ..................................................................................... 200
Section 42.3: View package information ................................................................................................................ 200
Section 42.4: View package's built-in data sets ..................................................................................................... 200
Section 42.5: List a package's exported functions ................................................................................................ 200
Chapter 43: Creating packages with devtools ......................................................................................... 201
Section 43.1: Creating and distributing packages .................................................................................................. 201
Section 43.2: Creating vignettes .............................................................................................................................. 203
Chapter 44: Using pipe assignment in your own package %<>%: How to? ............................................................... 204
Section 44.1: Putting the pipe in a utility-functions file .......................................................................................... 204
Chapter 45: Arima Models ................................................................................................................................... 205
Section 45.1: Modeling an AR1 Process with Arima ................................................................................................ 205
Chapter 46: Distribution Functions ................................................................................................................. 210
Section 46.1: Normal distribution ............................................................................................................................. 210
Section 46.2: Binomial Distribution .......................................................................................................................... 210
Chapter 47: Shiny ..................................................................................................................................................... 214
Section 47.1: Create an app ...................................................................................................................................... 214
Section 47.2: Checkbox Group ................................................................................................................................. 214
Section 47.3: Radio Button ....................................................................................................................................... 215
Section 47.4: Debugging ........................................................................................................................................... 216
Section 47.5: Select box ............................................................................................................................................ 216
Section 47.6: Launch a Shiny app ............................................................................................................................ 217
Section 47.7: Control widgets ................................................................................................................................... 218
Chapter 48: spatial analysis ............................................................................................................................... 220
Section 48.1: Create spatial points from XY data set ............................................................................................. 220
Section 48.2: Importing a shape file (.shp) ............................................................................................................. 221
Chapter 49: sqldf ...................................................................................................................................................... 222
Section 49.1: Basic Usage Examples ....................................................................................................................... 222
Chapter 50: Code profiling .................................................................................................................................. 224
Section 50.1: Benchmarking using microbenchmark ............................................................................................ 224
Section 50.2: proc.time() ........................................................................................................................................... 225
Section 50.3: Microbenchmark ................................................................................................................................ 226
Section 50.4: System.time ........................................................................................................................................ 227
Section 50.5: Line Profiling ....................................................................................................................................... 227
Chapter 51: Control flow structures ................................................................................................................ 229
Section 51.1: Optimal Construction of a For Loop .................................................................................................. 229
Section 51.2: Basic For Loop Construction .............................................................................................................. 230
Section 51.3: The Other Looping Constructs: while and repeat ............................................................................ 230
Chapter 52: Column wise operation ................................................................................................................ 234
Section 52.1: sum of each column ........................................................................................................................... 234
Chapter 53: JSON ..................................................................................................................................................... 236
Section 53.1: JSON to / from R objects ................................................................................................................... 236
Chapter 54: RODBC ................................................................................................................................................. 238
Section 54.1: Connecting to Excel Files via RODBC ................................................................................................ 238
Section 54.2: SQL Server Management Database connection to get individual table ...................................... 238
Section 54.3: Connecting to relational databases ................................................................................................. 238
Chapter 55: lubridate ............................................................................................................................................. 239
Section 55.1: Parsing dates and datetimes from strings with lubridate .............................................................. 239
Section 55.2: Difference between period and duration ....................................................................................... 240
Section 55.3: Instants ................................................................................................................................................ 240
Section 55.4: Intervals, Durations and Periods ....................................................................................................... 241
Section 55.5: Manipulating date and time in lubridate .......................................................................................... 242
Section 55.6: Time Zones ......................................................................................................................................... 243
Section 55.7: Parsing date and time in lubridate ................................................................................................... 243
Section 55.8: Rounding dates .................................................................................................................................. 243
Chapter 56: Time Series and Forecasting .................................................................................................... 245
Section 56.1: Creating a ts object ............................................................................................................................. 245
Section 56.2: Exploratory Data Analysis with time-series data ............................................................................ 245
Chapter 57: strsplit function ............................................................................................................................... 247
Section 57.1: Introduction .......................................................................................................................................... 247
Chapter 58: Web scraping and parsing ........................................................................................................ 248
Section 58.1: Basic scraping with rvest .................................................................................................................... 248
Section 58.2: Using rvest when login is required ................................................................................................... 248
Chapter 59: Generalized linear models ......................................................................................................... 250
Section 59.1: Logistic regression on Titanic dataset .............................................................................................. 250
Chapter 60: Reshaping data between long and wide forms ............................................................. 253
Section 60.1: Reshaping data ................................................................................................................................... 253
Section 60.2: The reshape function ......................................................................................................................... 254
Chapter 61: RMarkdown and knitr presentation ...................................................................................... 256
Section 61.1: Adding a footer to an ioslides presentation ...................................................................................... 256
Section 61.2: Rstudio example .................................................................................................................................. 257
Chapter 62: Scope of variables ......................................................................................................................... 259
Section 62.1: Environments and Functions ............................................................................................................. 259
Section 62.2: Function Exit ........................................................................................................................................ 259
Section 62.3: Sub functions ...................................................................................................................................... 260
Section 62.4: Global Assignment ............................................................................................................................. 260
Section 62.5: Explicit Assignment of Environments and Variables ...................................................................... 261
Chapter 63: Performing a Permutation Test .............................................................................................. 262
Section 63.1: A fairly general function ..................................................................................................................... 262
Chapter 64: xgboost ............................................................................................................................................... 265
Section 64.1: Cross Validation and Tuning with xgboost ....................................................................................... 265
Chapter 65: R code vectorization best practices ..................................................................................... 267
Section 65.1: By row operations ............................................................................................................................... 267
Chapter 66: Missing values .................................................................................................................................. 270
Section 66.1: Examining missing data ...................................................................................................................... 270
Section 66.2: Reading and writing data with NA values ....................................................................................... 270
Section 66.3: Using NAs of different classes ......................................................................................... 270
Section 66.4: TRUE/FALSE and/or NA .................................................................................................................... 271
Chapter 67: Hierarchical Linear Modeling ................................................................................................... 272
Section 67.1: basic model fitting ............................................................................................................................... 272
Chapter 68: *apply family of functions (functionals) ............................................................................ 273
Section 68.1: Using built-in functionals .................................................................................................................... 273
Section 68.2: Combining multiple `data.frames` (`lapply`, `mapply`) .................................................................... 273
Section 68.3: Bulk File Loading ................................................................................................................................ 275
Section 68.4: Using user-defined functionals ......................................................................................................... 275
Chapter 69: Text mining ........................................................................................................................................ 277
Section 69.1: Scraping Data to build N-gram Word Clouds .................................................................................. 277
Chapter 70: ANOVA ................................................................................................................................................. 281
Section 70.1: Basic usage of aov() ........................................................................................................................... 281
Section 70.2: Basic usage of Anova() ..................................................................................................................... 281
Chapter 71: Raster and Image Analysis ........................................................................................................ 283
Section 71.1: Calculating GLCM Texture ................................................................................................................... 283
Section 71.2: Mathematical Morphologies .............................................................................................................. 285
Chapter 72: Survival analysis ............................................................................................................................. 287
Section 72.1: Random Forest Survival Analysis with randomForestSRC ............................................................. 287
Section 72.2: Introduction - basic fitting and plotting of parametric survival models with the survival
package ............................................................................................................................................................. 288
Section 72.3: Kaplan Meier estimates of survival curves and risk set tables with survminer ........................... 289
Chapter 73: Fault-tolerant/resilient code ................................................................................................... 292
Section 73.1: Using tryCatch() .................................................................................................................................. 292
Chapter 74: Reproducible R ............................................................................................................................... 295
Section 74.1: Data reproducibility ............................................................................................................................ 295
Section 74.2: Package reproducibility ..................................................................................................................... 295
Chapter 75: Fourier Series and Transformations .................................................................................... 296
Section 75.1: Fourier Series ....................................................................................................................................... 297
Chapter 76: .Rprofile ............................................................................................................................................... 302
Section 76.1: .Rprofile - the first chunk of code executed ...................................................................................... 302
Section 76.2: .Rprofile example ................................................................................................................................ 303
Chapter 77: dplyr ...................................................................................................................................................... 304
Section 77.1: dplyr's single table verbs .................................................................................................................... 304
Section 77.2: Aggregating with %>% (pipe) operator ............................................................................................ 311
Section 77.3: Subset Observation (Rows) ............................................................................................................... 312
Section 77.4: Examples of NSE and string variables in dpylr ............................................................................... 313
Chapter 78: caret ..................................................................................................................................................... 314
Section 78.1: Preprocessing ...................................................................................................................................... 314
Chapter 79: Extracting and Listing Files in Compressed Archives .................................................. 315
Section 79.1: Extracting files from a .zip archive .................................................................................................... 315
Chapter 80: Probability Distributions with R .............................................................................................. 316
Section 80.1: PDF and PMF for different distributions in R .................................................................................. 316
Chapter 81: R in LaTeX with knitr ..................................................................................................................... 317
Section 81.1: R in LaTeX with Knitr and Code Externalization ............................................................................... 317
Section 81.2: R in LaTeX with Knitr and Inline Code Chunks ................................................................................. 317
Section 81.3: R in LaTeX with Knitr and Internal Code Chunks .............................................................................. 318
Chapter 82: Web Crawling in R .......................................................................................................................... 319
Section 82.1: Standard scraping approach using the RCurl package ................................................................. 319
Chapter 83: Creating reports with RMarkdown ........................................................................................ 320
Section 83.1: Including bibliographies ...................................................................................................................... 320
Section 83.2: Including LaTeX Preamble Commands ........................................................................................... 320
Section 83.3: Printing tables ..................................................................................................................................... 321
Section 83.4: Basic R-markdown document structure .......................................................................................... 323
Chapter 84: GPU-accelerated computing ................................................................................................... 326
Section 84.1: gpuR gpuMatrix objects ..................................................................................................................... 326
Section 84.2: gpuR vclMatrix objects ...................................................................................................................... 326
Chapter 85: heatmap and heatmap.2 ........................................................................................................... 327
Section 85.1: Examples from the official documentation .................................................................................... 327
Section 85.2: Tuning parameters in heatmap.2 ..................................................................................................... 335
Chapter 86: Network analysis with the igraph package ...................................................................... 341
Section 86.1: Simple Directed and Non-directed Network Graphing ................................................................... 341
Chapter 87: Functional programming ........................................................................................................... 343
Section 87.1: Built-in Higher Order Functions ......................................................................................................... 343
Chapter 88: Get user input .................................................................................................................................. 344
Section 88.1: User input in R ..................................................................................................................................... 344
Chapter 89: Spark API (SparkR) ........................................................................................................................ 345
Section 89.1: Setup Spark context ............................................................................................................................ 345
Section 89.2: Cache data .......................................................................................................................................... 345
Section 89.3: Create RDDs (Resilient Distributed Datasets) ................................................................................. 346
Chapter 90: Meta: Documentation Guidelines ........................................................................................... 347
Section 90.1: Style ...................................................................................................................................................... 347
Section 90.2: Making good examples ..................................................................................................................... 347
Chapter 91: Input and output ............................................................................................................................. 348
Section 91.1: Reading and writing data frames ...................................................................................................... 348
Chapter 92: I/O for foreign tables (Excel, SAS, SPSS, Stata) ............................................................ 350
Section 92.1: Importing data with rio ....................................................................................................................... 350
Section 92.2: Read and write Stata, SPSS and SAS files ....................................................................................... 350
Section 92.3: Importing Excel files ........................................................................................................................... 351
Section 92.4: Import or Export of Feather file ........................................................................................................ 354
Chapter 93: I/O for database tables .............................................................................................................. 356
Section 93.1: Reading Data from MySQL Databases ............................................................................................ 356
Section 93.2: Reading Data from MongoDB Databases ...................................................................................... 356
Chapter 94: I/O for geographic data (shapefiles, etc.) ....................................................................... 357
Section 94.1: Import and Export Shapefiles ............................................................................................................ 357
Chapter 95: I/O for raster images .................................................................................................................. 358
Section 95.1: Load a multilayer raster ..................................................................................................................... 358
Chapter 96: I/O for R's binary format ........................................................................................................... 360
Section 96.1: Rds and RData (Rda) files ................................................................................................................. 360
Section 96.2: Environments ..................................................................................................................... 360
Chapter 97: Recycling ............................................................................................................................................ 361
Section 97.1: Recycling use in subsetting ................................................................................................................ 361
Chapter 98: Expression: parse + eval ............................................................................................................. 362
Section 98.1: Execute code in string format ............................................................................................................ 362
Chapter 99: Regular Expression Syntax in R .............................................................................................. 363
Section 99.1: Use `grep` to find a string in a character vector .............................................................................. 363
Chapter 100: Regular Expressions (regex) ................................................................................................... 365
Section 100.1: Differences between Perl and POSIX regex .................................................................... 365
Section 100.2: Validate a date in a "YYYYMMDD" format ..................................................................................... 365
Section 100.3: Escaping characters in R regex patterns ....................................................................................... 366
Section 100.4: Validate US States postal abbreviations ........................................................................................ 366
Section 100.5: Validate US phone numbers ............................................................................................................ 366
Chapter 101: Combinatorics ................................................................................................................................. 368
Section 101.1: Enumerating combinations of a specified length ........................................................................... 368
Section 101.2: Counting combinations of a specified length ................................................................................. 369
Chapter 102: Solving ODEs in R .......................................................................................................................... 370
Section 102.1: The Lorenz model .............................................................................................................................. 370
Section 102.2: Lotka-Volterra or: Prey vs. predator ............................................................................................... 371
Section 102.3: ODEs in compiled languages - definition in R ................................................................................ 373
Section 102.4: ODEs in compiled languages - definition in C ................................................................................ 373
Section 102.5: ODEs in compiled languages - definition in fortran ...................................................................... 375
Section 102.6: ODEs in compiled languages - a benchmark test ......................................................................... 376
Chapter 103: Feature Selection in R -- Removing Extraneous Features ...................................... 378
Section 103.1: Removing features with zero or near-zero variance ..................................................................... 378
Section 103.2: Removing features with high numbers of NA ................................................................................ 378
Section 103.3: Removing closely correlated features ............................................................................................ 378
Chapter 104: Bibliography in RMD ................................................................................................................... 380
Section 104.1: Specifying a bibliography and cite authors .................................................................................... 380
Section 104.2: Inline references ................................................................................................................................ 381
Section 104.3: Citation styles .................................................................................................................................... 382
Chapter 105: Writing functions in R ................................................................................................................. 385
Section 105.1: Anonymous functions ........................................................................................................................ 385
Section 105.2: RStudio code snippets ...................................................................................................................... 385
Section 105.3: Named functions ............................................................................................................................... 386
Chapter 106: Color schemes for graphics .................................................................................................... 388
Section 106.1: viridis - print and colorblind friendly palettes ................................................................................. 388
Section 106.2: A handy function to glimpse a vector of colors ............................................................... 389
Section 106.3: colorspace - click&drag interface for colors .................................................................................. 390
Section 106.4: Colorblind-friendly palettes ............................................................................................................. 391
Section 106.5: RColorBrewer .................................................................................................................................... 392
Section 106.6: basic R color functions ..................................................................................................................... 393
Chapter 107: Hierarchical clustering with hclust ...................................................................................... 394
Section 107.1: Example 1 - Basic use of hclust, display of dendrogram, plot clusters ........................................ 394
Section 107.2: Example 2 - hclust and outliers ....................................................................................................... 397
Chapter 108: Random Forest Algorithm ....................................................................................................... 400
Section 108.1: Basic examples - Classification and Regression ............................................................................ 400
Chapter 109: RESTful R Services ....................................................................................................................... 402
Section 109.1: opencpu Apps .................................................................................................................................... 402
Chapter 110: Machine learning ........................................................................................................................... 403
Section 110.1: Creating a Random Forest model .................................................................................................... 403
Chapter 111: Using texreg to export models in a paper-ready way ............................................... 404
Section 111.1: Printing linear regression results ....................................................................................................... 404
Chapter 112: Publishing ........................................................................................................................................... 406
Section 112.1: Formatting tables ............................................................................................................................... 406
Section 112.2: Formatting entire documents ........................................................................................................... 406
Chapter 113: Implement State Machine Pattern using S4 Class ....................................................... 407
Section 113.1: Parsing Lines using State Machine .................................................................................................... 407
Chapter 114: Reshape using tidyr ..................................................................................................................... 419
Section 114.1: Reshape from long to wide format with spread() .......................................................................... 419
Section 114.2: Reshape from wide to long format with gather() .......................................................................... 419
Chapter 115: Modifying strings by substitution ......................................................................................... 421
Section 115.1: Rearrange character strings using capture groups ....................................................................... 421
Section 115.2: Eliminate duplicated consecutive elements .................................................................................... 421
Chapter 116: Non-standard evaluation and standard evaluation ................................................... 423
Section 116.1: Examples with standard dplyr verbs ................................................................................................ 423
Chapter 117: Randomization ................................................................................................................................ 425
Section 117.1: Random draws and permutations .................................................................................................... 425
Section 117.2: Setting the seed .................................................................................................................................. 427
Chapter 118: Object-Oriented Programming in R ..................................................................................... 428
Section 118.1: S3 .......................................................................................................................................................... 428
Chapter 119: Coercion ............................................................................................................................................. 429
Section 119.1: Implicit Coercion ................................................................................................................................. 429
Chapter 120: Standardize analyses by writing standalone R scripts ............................................ 430
Section 120.1: The basic structure of standalone R program and how to call it ................................................ 430
Section 120.2: Using littler to execute R scripts ...................................................................................................... 431
Chapter 121: Analyze tweets with R ................................................................................................................. 433
Section 121.1: Download Tweets ............................................................................................................................... 433
Section 121.2: Get text of tweets ............................................................................................................................... 433
Chapter 122: Natural language processing ................................................................................................. 435
Section 122.1: Create a term frequency matrix ...................................................................................................... 435
Chapter 123: R Markdown Notebooks (from RStudio) .......................................................................... 437
Section 123.1: Creating a Notebook ......................................................................................................................... 437
Section 123.2: Inserting Chunks ................................................................................................................................ 437
Section 123.3: Executing Chunk Code ...................................................................................................................... 438
Section 123.4: Execution Progress ............................................................................................................................ 439
Section 123.5: Preview Output .................................................................................................................................. 440
Section 123.6: Saving and Sharing ........................................................................................................................... 440
Chapter 124: Aggregating data frames ....................................................................................................... 442
Section 124.1: Aggregating with data.table ............................................................................................................. 442
Section 124.2: Aggregating with base R ................................................................................................................. 443
Section 124.3: Aggregating with dplyr ..................................................................................................................... 444
Chapter 125: Data acquisition ............................................................................................................................ 446
Section 125.1: Built-in datasets ................................................................................................................................. 446
Section 125.2: Packages to access open databases ............................................................................................. 446
Section 125.3: Packages to access restricted data ................................................................................................ 448
Section 125.4: Datasets within packages ................................................................................................................ 452
Chapter 126: R memento by examples .......................................................................................................... 454
Section 126.1: Plotting (using plot) ........................................................................................................................... 454
Section 126.2: Commonly used functions ............................................................................................................... 454
Section 126.3: Data types ......................................................................................................................................... 455
Chapter 127: Updating R version ...................................................................................................................... 457
Section 127.1: Installing from R Website .................................................................................................................. 457
Section 127.2: Updating from within R using installr Package ............................................................................. 457
Section 127.3: Deciding on the old packages ......................................................................................................... 457
Section 127.4: Updating Packages ........................................................................................................................... 459
Section 127.5: Check R Version ................................................................................................................................ 459
Credits ............................................................................................................................................................................ 460
You may also like ...................................................................................................................................................... 464
About
Please feel free to share this PDF with anyone for free.
The latest version of this book can be downloaded from:
https://goalkicker.com/RBook
This is an unofficial free book created for educational purposes and is not
affiliated with official R group(s) or company(s) nor Stack Overflow. All
trademarks and registered trademarks are the property of their respective
company owners
Windows only:
Visual Studio (starting from version 2015 Update 3) now features a development environment for R called R Tools,
which includes a live interpreter, IntelliSense, and a debugging module. If you choose this method, you won't have to
install R as specified in the following section.
For Windows
1. Go to the CRAN website, click on download R for Windows, and download the latest version of R.
2. Right-click the installer file and run it as administrator.
3. Select the language to be used during installation.
4. Follow the instructions for installation.
For macOS, the CRAN installer will install both R and the R-MacGUI. It will put the GUI in the /Applications/ folder as
R.app, where it can either be double-clicked or dragged to the Dock. When a new version is released, the (re)-installation process will overwrite
R.app but prior major versions of R will be maintained. The actual R code will be in the
/Library/Frameworks/R.Framework/Versions/ directory. Using R within RStudio is also possible and would be using
the same R code with a different GUI.
Alternative 2
1. Install homebrew (the missing package manager for macOS) by following the instructions on https://brew.sh/
2. brew install R
Those choosing the second method should be aware that the maintainer of the Mac fork advises against it, and will
not respond to questions about difficulties on the R-SIG-Mac Mailing List.
You can get the version of R corresponding to your distro via apt-get. However, this version will frequently be quite
far behind the most recent version available on CRAN. You can get a more recent version directly from CRAN by
adding CRAN to your list of recognized sources; follow the directions from CRAN for more details. Note in particular
any additional steps needed so that you can build and install packages from source.
For Archlinux
sudo pacman -S r
More info on using R under Archlinux can be found on the ArchWiki R page.
Also, check out the detailed discussion of how, when, whether and why to print a string.
The most basic way to use R is the interactive mode. You type commands and immediately get the result from R.
Using R as a calculator
Start R by typing R at the command prompt of your operating system or by executing RGui on Windows. Below you
can see a screenshot of an interactive R session on Linux:
After the > sign, expressions can be typed in. Once an expression is typed, the result is shown by R. In the
screenshot above, R is used as a calculator: type 1 + 1
to immediately see the result, 2. The leading [1] indicates that R returns a vector. In this case, the vector contains
only one number (2).
R can be used to generate plots. The following example uses the data set PlantGrowth, which comes as an example
data set along with R
Type into the R prompt all of the following lines that do not start with ##. Lines starting with ## document the
result which R will return.
data(PlantGrowth)
str(PlantGrowth)
## 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
anova(lm(weight ~ group, data = PlantGrowth))
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 3.7663 1.8832 4.8461 0.01591 *
## Residuals 27 10.4921 0.3886
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
boxplot(weight ~ group, data = PlantGrowth, ylab = "Dry weight")
data(PlantGrowth) loads the example data set PlantGrowth, which records the dry masses of plants that were
subject to two different treatment conditions or no treatment at all (control group). The data set is made available
under the name PlantGrowth. Such a name is also called a Variable.
To load your own data, the following two documentation pages might be helpful:
Reading and writing tabular data in plain-text files (CSV, TSV, etc.)
I/O for foreign tables (Excel, SAS, SPSS, Stata)
str(PlantGrowth) shows information about the data set which was loaded. The output indicates that PlantGrowth
is a data.frame, which is R's name for a table. The data.frame contains two columns and 30 rows. In this case,
each row corresponds to one plant. Details of the two columns are shown in the lines starting with $: the first
column, weight, is numeric and holds the dry weight of each plant; the second column, group, is a factor with the
three levels ctrl, trt1 and trt2, encoding which treatment each plant received.
To compare the dry masses of the three different groups, a one-way ANOVA is performed using anova(lm( ... )).
weight ~ group means "Compare the values of the column weight, grouping by the values of the column group".
This is called a Formula in R. data = ... specifies the name of the table where the data can be found.
The result shows, among other things, that there exists a significant difference (column Pr(>F), p = 0.01591) between
some of the three groups. Post-hoc tests, like Tukey's Test, must be performed to determine which groups' means
differ significantly.
boxplot(...) creates a box plot of the data. data = ... specifies where the values to be plotted come from. weight ~
group means: "Plot the values of the column weight versus the values of the column group." ylab = ... specifies the label of the y
axis. More information: Base plotting
R scripts
To document your research, it is favourable to save the commands you use for calculation in a file. For that effect,
you can create R scripts. An R script is a simple text file, containing R commands.
Create a text file with the name plants.R, and fill it with the following text, where some commands are familiar
from the code block above:
data(PlantGrowth)
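The rest of the script is only described in the text below; a minimal sketch of what plants.R might contain (the file names plant_boxplot.png and plant_result.txt are assumptions):

str(PlantGrowth)
print(anova(lm(weight ~ group, data = PlantGrowth)))
png("plant_boxplot.png", width = 400, height = 300)
boxplot(weight ~ group, data = PlantGrowth, ylab = "Dry weight")
dev.off()
# From the shell of your operating system (not the R prompt), run e.g.:
#   Rscript plants.R > plant_result.txt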
Execute the script by typing the command shown above into your terminal (the terminal of your operating system, not
an interactive R session like in the previous section!).
The file plant_result.txt contains the results of your calculation, as if you had typed them into the interactive R
prompt. Thereby, your calculations are documented.
The new commands png and dev.off are used for saving the boxplot to disk. The two commands must enclose the
plotting command, as shown in the example above. png("FILENAME", width = ..., height = ...) opens a new
PNG file with the specified file name, width and height in pixels. dev.off() will finish plotting and saves the plot to
disk. No output is saved until dev.off() is called.
Names that start with a digit or an underscore (e.g. 1a), or names that are valid numerical expressions (e.g.
.11), or names with dashes ('-') or spaces can only be used when they are quoted: `1a` and `.11`. The
names will be printed with backticks:
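For instance (a sketch):

l <- list(`1a` = 1, `.11` = 2)
l
# $`1a`
# [1] 1
#
# $`.11`
# [1] 2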
All other combinations of alphanumeric characters, dots and underscores can be used freely, where
reference with or without backticks points to the same object.
Names that begin with . are considered system names and are not always visible using the ls()-function.
Some examples of valid object names are: foobar, foo.bar, foo_bar, .foobar
In R, variables are assigned values using the infix-assignment operator <-. The operator = can also be used for
assigning values to variables, however its proper use is for associating values with parameter names in function
calls. Note that omitting spaces around operators may create confusion for users. The expression a<-1 is parsed as
assignment (a <- 1) rather than as a logical comparison (a < -1).
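For example, assuming the two assignments that the next sentence describes:

foo <- 42
fooEquals = 43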
So foo is assigned the value of 42. Typing foo within the console will output 42, while typing fooEquals will output
43.
> foo
[1] 42
> fooEquals
[1] 43
The following command assigns a value to the variable named x and prints the value simultaneously:
> (x <- 5)
[1] 5
# actually two function calls: first one to `<-`; second one to the `()`-function
> is.function(`(`)
[1] TRUE # Often used in R help page examples for its side-effect of printing.
> 5 -> x
> x
[1] 5
There are no scalar data types in R. Vectors of length-one act like scalars.
Vectors: Atomic vectors must be a sequence of same-class objects: a sequence of numbers, or a sequence of
logicals, or a sequence of characters. v <- c(2, 3, 7, 10) and v2 <- c("a", "b", "c") are both vectors.
Matrices: A matrix of numbers, logical or characters. a <- matrix(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12), nrow = 4, ncol = 3, byrow = F). Like vectors, a matrix must be made of same-class
elements. To extract elements from a matrix rows and columns must be specified: a[1,2] returns [1] 5 that
is the element on the first row, second column.
Lists: concatenation of different elements mylist <- list (course = 'stat', date = '04/07/2009',
num_isc = 7, num_cons = 6, num_mat = as.character(c(45020, 45679, 46789, 43126, 42345, 47568,
45674)), results = c(30, 19, 29, NA, 25, 26 ,27) ). Extracting elements from a list can be done by
name (if the list is named) or by index. In the given example mylist$results and mylist[[6]] obtain the
same element. Warning: if you try mylist[6], R won't give you an error, but it extracts the result as a (sub)list.
While mylist[[6]][2] is permitted (it gives you 19), mylist[6][2] does not do what you might expect: because
mylist[6] is itself a list of length one, it returns a one-element list containing NULL rather than 19.
data.frame: object with columns that are vectors of equal length, but (possibly) different types. They are not
matrices. exam <- data.frame(matr = as.character(c(45020, 45679, 46789, 43126, 42345, 47568,
45674)), res_S = c(30, 19, 29, NA, 25, 26, 27), res_O = c(3, 3, 1, NA, 3, 2, NA), res_TOT =
c(30,22,30,NA,28,28,27)). Columns can be read by name exam$matr, exam[, 'matr'] or by index exam[1],
exam[,1]. Rows can also be read by name exam['rowname', ] or index exam[1,]. Dataframes are actually
just lists with a particular structure (rownames-attribute and equal length components)
Default operations are done element by element. See ?Syntax for the rules of operator precedence. Most
operators (and many other functions in base R) have recycling rules that allow arguments of unequal length. Given
these objects:
Example objects
> a <- 1
> b <- 2
> c <- c(2,3,4)
> d <- c(10,10,10)
> e <- c(1,2,3,4)
> f <- 1:6
> W <- cbind(1:4,5:8,9:12)
> Z <- rbind(rep(0,3),1:3,rep(10,3),c(4,7,1))
When adding vectors of unequal length, such as c + e (lengths 3 and 4), R sums what it can and then recycles the
shorter vector to fill in the blanks. A warning is given only because the two lengths are not exact multiples of each
other; c + f (lengths 3 and 6) produces no warning whatsoever.
> W + c # matrix + vector... : no warnings and R does the operation in a column-wise manner
[,1] [,2] [,3]
[1,] 3 8 13
[2,] 5 10 12
[3,] 7 9 14
[4,] 6 11 16
"Private" variables
A leading dot in a name of a variable or function in R is commonly used to denote that the variable or function is
meant to be hidden.
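For example (the object names are illustrative):

foo  <- "public"
.foo <- "hidden"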
And then using the ls function to list objects will only show the first object.
> ls()
[1] "foo"
However, passing all.names = TRUE to the function will show the 'private' variable
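Continuing the sketch above (the ordering of the names may vary with the locale):

ls(all.names = TRUE)
# [1] ".foo" "foo"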
3+1:5
Gives:
[1] 4 5 6 7 8
This is because the range operator : has higher precedence than addition operator +.
3+1:5
3+c(1, 2, 3, 4, 5) expansion of the range operator to make a vector of integers.
c(4, 5, 6, 7, 8) Addition of 3 to each member of the vector.
To avoid this behavior you have to tell the R interpreter how you want it to order the operations with ( ) like this:
(3+1):5
Now R will compute what is inside the parentheses before expanding the range and gives:
[1] 4 5
We can simply enter the numbers concatenated with + for adding and - for subtracting:
> 3 + 4.5
# [1] 7.5
> 3 + 4.5 + 2
# [1] 9.5
> 3 + 4.5 + 2 - 3.8
# [1] 5.7
> 3 + NA
#[1] NA
> NA + NA
#[1] NA
> NA - NA
#[1] NA
> NaN - NA
#[1] NaN
> NaN + NA
#[1] NaN
We can assign the numbers to variables (constants in this case) and do the same operations:
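For instance (a sketch; the variable names are illustrative):

a1 <- 3
a2 <- 4.5
a3 <- 2
a4 <- 3.8
a1 + a2 + a3 - a4
# [1] 5.7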
2. Using vectors
In this case we create vectors of numbers and do the operations using those vectors, or combinations with single
numbers. In this case the operation is done considering each element of the vector:
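The vectors used in the following examples are not shown above; the values below are assumptions chosen to be consistent with the printed results (B is deliberately one element longer than A, so that recycling occurs):

A <- c(3, 4.5, 2, -3.8)
n <- length(A)               # 4
B <- c(3, 5, -3, 2.7, 1.8)   # length 5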
We can also use the function sum to add all elements of a vector:
> sum(A)
# [1] 5.7
> sum(-A)
# [1] -5.7
> sum(A[-n]) + A[n]
# [1] 5.7
We must take care with recycling, which is one of the characteristics of R: when doing math operations on vectors of
different lengths, the shorter vectors in the expression are recycled as often as needed (perhaps fractionally) until they
match the length of the longest vector. In particular, a constant is simply repeated. If the longer length is not an exact
multiple of the shorter one, a warning is shown.
In this case the correct procedure will be to consider only the elements of the shorter vector:
> B[1:n] + A
# [1] 6.0 9.5 -1.0 -1.1
> B[1:n] - A
# [1] 0.0 0.5 -5.0 6.5
When using the sum function, again all the elements inside the function are added.
> sum(A, B)
# [1] 15.2
> sum(A, -B)
# [1] -3.8
> sum(A)+sum(B)
# [1] 15.2
> sum(A)-sum(B)
# [1] -3.8
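A sketch of the matrix() call that the next paragraph describes:

matrix(data = 1:6, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6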
As you can see this gives us a matrix of all numbers from 1 to 6 with two rows and three columns. The data
parameter takes a vector of values, nrow specifies the number of rows in the matrix, and ncol specifies the number
of columns. By convention the matrix is filled by column. The default behavior can be changed with the byrow
parameter as shown below:
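For example (a sketch):

matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6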
Matrices do not have to be numeric – any vector can be transformed into a matrix. For example:
Like vectors, matrices can be stored as variables and then called later. The rows and columns of a matrix can have
names. You can look at these using the functions rownames and colnames. As shown below, the rows and columns
don't initially have names, which is denoted by NULL. However, you can assign values to them.
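A sketch (mat1 is defined here so that it is consistent with the class and coercion examples further below):

mat1 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
rownames(mat1)
## NULL
colnames(mat1)
## NULL
rownames(mat1) <- c("Row 1", "Row 2")
colnames(mat1) <- c("Col 1", "Col 2", "Col 3")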
It is important to note that similarly to vectors, matrices can only have one data type. If you try to specify a matrix
with multiple data types the data will be coerced to the higher order data class.
class(mat1)
## [1] "matrix"
is.matrix(mat1)
## [1] TRUE
as.vector(mat1)
## [1] 1 4 2 5 3 6
When running model functions like lm for linear regression, a formula is required. The formula specifies which
regression coefficients shall be estimated.
On the left side of the ~ (LHS) the dependent variable is specified, while the right hand side (RHS) contains the
independent variables. Technically the formula call above is redundant because the tilde-operator is an infix
function that returns an object with formula class:
The advantage of the formula function over ~ is that it also allows an environment for evaluation to be specified:
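A sketch using the built-in mtcars data set, which the examples in this section refer to (the coefficient values are the well-known mpg ~ wt fit):

class(mpg ~ wt)
# [1] "formula"
form <- as.formula("mpg ~ wt", env = new.env())  # attach an explicit evaluation environment
lm(mpg ~ wt, data = mtcars)
# Coefficients:
# (Intercept)           wt
#      37.285       -5.344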
In this case, the output shows that a regression coefficient for wt is estimated, as well as (per default) an intercept
parameter. The intercept can be excluded / forced to be 0 by including 0 or -1 in the formula:
Interactions between variables a and b can be added by including a:b in the formula:
As it is (from a statistical point of view) generally advisable not to have interactions in the model without the main
effects, the naive approach would be to expand the formula to a + b + a:b. This works but can be simplified by
writing a*b, where the * operator indicates factor crossing (when between two factor columns) or multiplication
when one or both of the columns are 'numeric':
Using the * notation expands a term to include all lower order effects, such that
a * b * c
will give, in addition to the intercept, 7 regression coefficients: one for the three-way interaction, three for the two-
way interactions and three for the main effects.
Or, we can use the ^ notation to specify which level of interaction we require:
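For instance, the following two specifications are equivalent, since ^ 2 expands to all main effects plus all two-way interactions (an illustrative pair):

lm(mpg ~ wt * hp, data = mtcars)        # wt + hp + wt:hp
lm(mpg ~ (wt + hp) ^ 2, data = mtcars)  # the same set of terms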
Those two formula specifications should create the same model matrix.
Finally, . is shorthand to use all available variables as main effects. In this case, the data argument is used to obtain
the available variables (which are not on the LHS). Therefore:
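A sketch:

lm(mpg ~ ., data = mtcars)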
gives coefficients for the intercept and 10 independent variables. This notation is frequently used in machine
learning packages, where one would like to use all variables for prediction or classification. Note that the meaning
of . depends on context (see e.g. ?update.formula for a different meaning).
1. G. N. Wilkinson and C. E. Rogers. Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 22, No. 3
(1973), pp. 392-399
print("Hello World")
#[1] "Hello World"
cat("Hello World\n")
#Hello World
Note the difference in both input and output for the two functions. (Note: there are no quote-characters in the
value of x created with x <- "Hello World". They are added by print at the output stage.)
cat takes one or more character vectors as arguments and prints them to the console. If the character vector has a
length greater than 1, arguments are separated by a space (by default):
cat("Hello World")
#Hello World>
The prompt for the next command appears immediately after the output. (Some consoles such as RStudio's may
automatically append a newline to strings that do not end with a newline.)
print is an example of a "generic" function, which means the class of the first argument passed is detected and a
class-specific method is used to output. For a character vector like "Hello World", the result is similar to the output
of cat. However, the character string is quoted and a number [1] is output to indicate the first element of a
character vector (In this case, the first and only element):
print("Hello World")
#[1] "Hello World"
This default print method is also what we see when we simply ask R to print a variable. Note how the output of
typing s is the same as calling print(s) or print("Hello World"):
"Hello World"
#[1] "Hello World"
If we add another character string as a second element of the vector (using the c() function to concatenate the
elements together), then the behavior of print() looks quite a bit different from that of cat:
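For example (the second string is illustrative):

print(c("Hello World", "Here I am."))
# [1] "Hello World" "Here I am."
cat(c("Hello World", "Here I am."))
# Hello World Here I am.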
Observe that the c() function does not do string-concatenation. (One needs to use paste for that purpose.) R
shows that the character vector has two elements by quoting them separately. If we have a vector long enough to
span multiple lines, R will print the index of the element starting each line, just as it prints [1] at the start of the first
line.
The particular behavior of print depends on the class of the object passed to the function.
If we call print on an object with a different class, such as "numeric" or "logical", the quotes are omitted from the
output to indicate we are dealing with an object that is not character class:
print(1)
#[1] 1
print(TRUE)
#[1] TRUE
Factor objects get printed in the same fashion as character variables which often creates ambiguity when console
output is used to display objects in SO question bodies. It is rare to use cat or print except in an interactive
context. Explicitly calling print() is particularly rare (unless you wanted to suppress the appearance of the quotes
or view an object that is returned as invisible by a function), as entering foo at the console is a shortcut for
print(foo). The interactive console of R is known as a REPL, a "read-eval-print-loop". The cat function is best saved
for special purposes (like writing output to an open file connection). Sometimes it is used inside functions (where
calls to print() are suppressed), however using cat() inside a function to generate output to the console is
bad practice. The preferred approach is to use message() or warning() for intermediate messages; they behave
similarly to cat but can be optionally suppressed by the end user. The final result should simply be returned so that
the user can assign it to store it if necessary.
message("hello world")
#hello world
suppressMessages(message("hello world"))
Base R has two functions for invoking a system command. Both require an additional parameter to capture the
output of the system command.
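These are system(), with intern = TRUE, and system2(), with stdout = TRUE; a sketch matching the top command described below:

system("top -a -b -n 1", intern = TRUE)
# or, equivalently:
system2("top", args = c("-a", "-b", "-n", "1"), stdout = TRUE)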
[1] "top - 08:52:03 up 70 days, 15:09, 0 users, load average: 0.00, 0.00, 0.00"
[2] "Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie"
[3] "Cpu(s): 0.9%us, 0.3%sy, 0.0%ni, 98.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st"
[4] "Mem: 12194312k total, 3613292k used, 8581020k free, 216940k buffers"
[5] "Swap: 12582908k total, 2334156k used, 10248752k free, 1682340k cached"
[6] ""
[7] " PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND "
For illustration, the UNIX command top -a -b -n 1 is used. This is OS specific and may need to be
amended to run the examples on your computer.
Package devtools has a function to run a system command and capture the output without an additional
parameter. It also returns a character vector.
The fread function in package data.table allows you to execute a shell command and read the output like
read.table. It returns a data.table or a data.frame.
Note that fread has automatically skipped the top 6 header lines.
Here the parameter check.names = TRUE was added to convert %CPU, %MEM, and TIME+ to syntactically
valid column names.
Establish a file connection to read from with the file() command ("r" is for read mode):
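A sketch (the file name is an assumption):

conn <- file("input.txt", "r")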
As this only establishes the file connection, the data can then be read from it as follows:
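For instance:

readLines(conn, n = 1)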
Here we are reading the data from the file connection conn line by line with n = 1. One can change the value of n (say
10, 20, etc.) to read blocks of lines for faster reading (10 or 20 line blocks read in one go). To read the complete file in
one go, set n = -1.
After data processing or, say, model execution, one can write the results back to a file connection using many different
commands such as writeLines() or cat(), which are capable of writing to a file connection. However, all of these
commands require a file connection established for writing. This can be done with the file() command as follows:
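A sketch (the file name and the written lines are illustrative; "w" opens the connection for writing):

conn_out <- file("results.txt", "w")
writeLines(c("result line 1", "result line 2"), conn_out)
close(conn_out)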
stri_count_fixed("babab", "b")
# [1] 3
stri_count_fixed("babab", "ba")
# [1] 2
stri_count_fixed("babab", "bab")
# [1] 1
Natively:
length(gregexpr("b","babab")[[1]])
# [1] 3
length(gregexpr("ba","babab")[[1]])
# [1] 2
length(gregexpr("bab","babab")[[1]])
# [1] 1
stri_count_fixed("babab", c("b","ba"))
# [1] 3 2
stri_count_fixed(c("babab","bbb","bca","abc"), c("b","ba"))
# [1] 3 0 1 0
A base R solution:
sapply(c("b","ba"),function(x)length(gregexpr(x,"babab")[[1]]))
# b ba
# 3 2
With regex
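A sketch of the regex-based counterpart and of stringi's vectorised pasting; stri_count_regex and stri_paste are the stringi functions assumed here, and the base R comparison below refers to the pasting call:

stri_count_regex("babab", "b[ab]")
# [1] 2
stri_paste(LETTERS, 1:13, sep = "-")   # same result as the base R paste() call below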
A base R solution that does the same would look like this:
> paste(LETTERS,1:13,sep="-")
#[1] "A-1" "B-2" "C-3" "D-4" "E-5" "F-6" "G-7" "H-8" "I-9" "J-10" "K-11" "L-12" "M-13"
#[14] "N-1" "O-2" "P-3" "Q-4" "R-5" "S-6" "T-7" "U-8" "V-9" "W-10" "X-11" "Y-12" "Z-13"
class(iris)
[1] "data.frame"
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 ...
class(iris$Species)
[1] "factor"
We see that iris has the class data.frame and using str() allows us to examine the data inside. The variable
Species in the iris data frame is of class factor, in contrast to the other variables which are of class numeric. The
str() function also provides the length of the variables and shows the first couple of observations, while the
class() function only provides the object's class.
x <- 1826
class(x) <- "Date"
x
# [1] "1975-01-01"
x <- as.Date("1970-01-01")
class(x)
#[1] "Date"
is(x,"Date")
#[1] TRUE
is(x,"integer")
#[1] FALSE
is(x,"numeric")
#[1] FALSE
mode(x)
#[1] "numeric"
Since functions can only return a single value, it is common to return complicated results in a list:
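A sketch of such a function, consistent with the output shown below:

f <- function(x) {
  list(xplus = x + 10, xsq = x^2)
}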
f(7)
# $xplus
# [1] 17
#
# $xsq
# [1] 49
Lists are also the underlying fundamental class for data frames. Under the hood, a data frame is a list of
vectors all having the same length:
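A quick illustration (a sketch):

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)
# [1] TRUE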
The other class of recursive vectors is R expressions, which are "language"- objects
c(1, 2, 3)
## [1] 1 2 3
c(TRUE, TRUE, FALSE)
## [1] TRUE TRUE FALSE
c("a", "b", "c")
## [1] "a" "b" "c"
x <- c(1, 2, 5)
y <- c(3, 4, 6)
z <- c(x, y)
z
## [1] 1 2 5 3 4 6
A more elaborate treatment of how to create vectors can be found in the "Creating vectors" topic
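A sketch of a list whose elements have different classes, consistent with the names() examples below:

l1 <- list(c(1, 2, 3), c("a", "b", "c"))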
Notice the vectors that make up the above list are different classes. Lists allow users to group elements of different
classes. Each element in a list can also have a name. List names are accessed by the names function, and are
assigned in the same manner row and column names are assigned in a matrix.
names(l1)
## NULL
names(l1) <- c("vector1", "vector2")
l1
## $vector1
## [1] 1 2 3
##
## $vector2
## [1] "a" "b" "c"
It is often easier and safer to declare the list names when creating the list object.
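For example (an illustrative definition matching the description below):

l2 <- list(vec = c(1, 2, 3), mat = matrix(1:4, nrow = 2))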
Above, the list has two elements, named "vec" and "mat", a vector and a matrix, respectively.
A list is able to store any type of variable in it, making it the generic object that can store any kind of
variable we might need.
In order to understand the data that was defined in the list, we can use the str function.
str(exampleList1)
str(exampleList2)
str(exampleList3)
Subsetting of lists distinguishes between extracting a slice of the list, i.e. obtaining a list containing a subset of the
elements in the original list, and extracting a single element. Using the [ operator commonly used for vectors
produces a new list.
# Returns List
exampleList3[1]
exampleList3[1:2]
# Returns Character
exampleList3[[1]]
The entries in named lists can be accessed by their name instead of their index.
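For example (an illustrative definition consistent with the calls below):

exampleList4 <- list(num = c(1, 2, 3), char = c("a", "b", "c"))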
exampleList4[['char']]
exampleList4$num
This has the advantage that it is faster to type and may be easier to read but it is important to be aware of a
potential pitfall. The $ operator uses partial matching to identify matching list elements and may produce
unexpected results.
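For instance, with a list whose only numeric element has the longer name numbers (an illustrative definition chosen so that partial matching applies):

exampleList5 <- list(numbers = 0.5, char = c("a", "b", "c"))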
exampleList4$num
# c(1, 2, 3)
exampleList5$num
# 0.5
exampleList5[['num']]
# NULL
## Numeric vector
exampleVector1 <- c(12, 13, 14)
## Character vector
exampleVector2 <- c("a", "b", "c", "d", "e", "f")
## Matrix
exampleMatrix1 <- matrix(rnorm(4), ncol = 2, nrow = 2)
## List
exampleList3 <- list('a', 1, 2)
> df
name height team fun_index title age desc Y
1 Andrea 195 Lazio 97 6 33 eccellente 1
2 Paja 165 Fiorentina 87 6 31 deciso 1
3 Roro 190 Lazio 65 6 28 strano 0
4 Gioele 70 Lazio 100 0 2 simpatico 1
5 Cacio 170 Juventus 81 3 33 duro 0
6 Edola 171 Lazio 72 5 32 svampito 1
7 Salami 175 Inter 75 3 30 doppiopasso 1
8 Braugo 180 Inter 79 5 32 gjn 0
9 Benna 158 Juventus 80 6 28 esaurito 0
10 Riggio 182 Lazio 92 5 31 certezza 1
11 Giordano 185 Roma 79 5 29 buono 1
In order to put different types of data in a dataframe, we have to use a list object together with serialization. In
particular, we have to put the data into a generic list and then store the (serialized) list in a column of a dataframe:
l <- list(df,number)
dataframe_container <- data.frame(out2 = as.integer(serialize(l, connection=NULL)))
Once we have stored the information in the dataframe, we need to deserialize it in order to use it:
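A sketch of the reverse step:

# convert the integer column back to raw bytes and unserialize to recover the list
l_restored <- unserialize(as.raw(dataframe_container$out2))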
Introduction
Although R does not provide a native hash table structure, similar functionality can be achieved by leveraging the
fact that the environment object returned from new.env (by default) provides hashed key lookups. The following
two statements are equivalent, as the hash parameter defaults to TRUE:
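A sketch of the two equivalent calls:

H <- new.env(hash = TRUE)  # explicitly requesting a hashed environment
H <- new.env()             # equivalent, because hash defaults to TRUE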
Additionally, one may specify that the internal hash table is pre-allocated with a particular size via the size
parameter, which has a default value of 29. Like all other R objects, environments manage their own memory and
will grow in capacity as needed, so while it is not necessary to request a non-default value for size, there may be a
slight performance advantage in doing so if the object will (eventually) contain a very large number of elements. It
is worth noting that allocating extra space via size does not, in itself, result in an object with a larger memory
footprint:
object.size(new.env())
# 56 bytes
object.size(new.env(size = 10e4))
# 56 bytes
Insertion
Insertion of elements may be done using either of the [[<- or $<- methods provided for the environment class, but
not by using "single bracket" assignment ([<-):
H <- new.env()
H["error"] <- 42
#Error in H["error"] <- 42 :
# object of type 'environment' is not subsettable
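Successful insertion via [[<- and $<- might look like the following sketch; the keys and values are chosen to match the lookups shown later in this section:

H[["key"]] <- rnorm(1)                    # e.g. the value 1.630631 printed below
key2 <- "xyz"
H[[key2]] <- data.frame(x = 1:3, y = c("a", "b", "c"))
H$another_key <- matrix(rep(c(TRUE, FALSE, TRUE), 3), nrow = 3)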
As with other facets of R, the first method (object[[key]] <- value) is generally preferred to the second (object$key
<- value) because in the former case a variable may be used instead of a literal value (e.g. key2 in the example
above).
As is generally the case with hash map implementations, the environment object will not store duplicate keys.
Attempting to insert a key-value pair for an existing key will replace the previously stored value:
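For example (a sketch):

H[["key3"]] <- "original value"
H[["key3"]] <- "new value"       # overwrites the value previously stored under "key3"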
H[["key3"]]
#[1] "new value"
Key Lookup
H[["key"]]
#[1] 1.630631
H$another_key
# [,1] [,2] [,3]
# [1,] TRUE TRUE TRUE
# [2,] FALSE FALSE FALSE
# [3,] TRUE TRUE TRUE
H[1]
#Error in H[1] : object of type 'environment' is not subsettable
Being just an ordinary environment, the hash map can be inspected by typical means:
names(H)
#[1] "another_key" "xyz" "key" "key3"
ls(H)
#[1] "another_key" "key" "key3" "xyz"
str(H)
#<environment: 0x7828228>
ls.str(H)
# another_key : logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ...
# key : num 1.63
# key3 : chr "new value"
# xyz : 'data.frame': 3 obs. of 2 variables:
# $ x: int 1 2 3
# $ y: chr "a" "b" "c"
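Entries can be removed again with rm(); for instance (a sketch that removes key and key3, so that only another_key and xyz remain in the listing below):

rm("key", "key3", envir = H)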
ls.str(H)
# another_key : logi [1:3, 1:3] TRUE FALSE TRUE TRUE FALSE TRUE ...
# xyz : 'data.frame': 3 obs. of 2 variables:
# $ x: int 1 2 3
Flexibility
One of the major benefits of using environment objects as hash tables is their ability to store virtually any type of
object as a value, even other environments:
H2 <- new.env()
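A sketch of assignments consistent with the listing that follows (the values are illustrative; note that an environment can even hold a reference to itself):

H2[["a"]] <- LETTERS
H2[["b"]] <- as.list(1:5)
H2[["c"]] <- head(mtcars, 3)
H2[["d"]] <- Sys.Date()
H2[["e"]] <- Sys.time()
H2[["f"]] <- H2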
ls.str(H2)
# a : chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ...
# b : List of 5
# $ : int 1
# $ : int 2
# $ : int 3
# $ : int 4
# $ : int 5
# c : 'data.frame': 3 obs. of 11 variables:
# $ mpg : num 21 21 22.8
# $ cyl : num 6 6 4
# $ disp: num 160 160 108
# $ hp : num 110 110 93
# $ drat: num 3.9 3.9 3.85
# $ wt : num 2.62 2.88 2.32
# $ qsec: num 16.5 17 18.6
# $ vs : num 0 0 1
# $ am : num 1 1 1
# $ gear: num 4 4 4
# $ carb: num 4 4 1
# d : Date[1:1], format: "2016-08-03"
# e : POSIXct[1:1], format: "2016-08-03 19:25:14"
# f : <environment: 0x91a7cb8>
ls.str(H2$f)
# a : chr [1:26] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" ...
# b : List of 5
# $ : int 1
# $ : int 2
# $ : int 3
# $ : int 4
# $ : int 5
# c : 'data.frame': 3 obs. of 11 variables:
# $ mpg : num 21 21 22.8
# $ cyl : num 6 6 4
# $ disp: num 160 160 108
# $ hp : num 110 110 93
# $ drat: num 3.9 3.9 3.85
# $ wt : num 2.62 2.88 2.32
Limitations
One of the major limitations of using environment objects as hash maps is that, unlike many aspects of R,
vectorization is not supported for element lookup / insertion:
names(H2)
#[1] "a" "b" "c" "d" "e" "f"
H2[[c("a", "b")]]
#Error in H2[[c("a", "b")]] :
# wrong arguments for subsetting an environment
Depending on the nature of the data being stored in the object, it may be possible to use vapply or list2env for
assigning many elements at once:
E1 <- new.env()
invisible({
vapply(letters, function(x) {
E1[[x]] <- rnorm(1)
logical(0)
}, FUN.VALUE = logical(0))
})
all.equal(sort(names(E1)), letters)
#[1] TRUE
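The list2env route mentioned above could look like this sketch, creating E2 so that the following check succeeds:

E2 <- list2env(setNames(as.list(rnorm(26)), letters), envir = new.env())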
all.equal(sort(names(E2)), letters)
#[1] TRUE
Neither of the above are particularly concise, but may be preferable to using a for loop, etc. when the number of
key-value pairs is large.
Consider:
1) Sequences of letters:
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
"y" "z"
> LETTERS[7:9]
[1] "G" "H" "I"
> letters[c(1,5,3,2,4)]
[1] "a" "e" "c" "b" "d"
> month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> month.name[1:4]
[1] "January" "February" "March" "April"
> month.abb[c(3,6,9,12)]
[1] "Mar" "Jun" "Sep" "Dec"
> xc
a b c d
5 6 7 8
with list:
$b
[1] 6
$c
[1] 7
$d
[1] 8
With the setNames function, two vectors of the same length can be used to create a named vector:
x <- 5:8
y <- letters[1:4]
xy <- setNames(x, y)
> xy
a b c d
5 6 7 8
You may also use the names function to get the same result:
xy <- 5:8
names(xy) <- letters[1:4]
> xy["c"]
c
7
This feature makes it possible to use such a named vector as a look-up vector/table to match values to the values of
another vector or a column in a dataframe. Consider the following dataframe:
> mydf
let
1 c
2 a
3 b
4 d
Suppose you want to create a new variable in the mydf dataframe called num with the correct values from xy in the
rows. Using the match function the appropriate values from xy can be selected:
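A sketch of that lookup:

mydf$num <- xy[match(mydf$let, names(xy))]
mydf
#   let num
# 1   c   7
# 2   a   5
# 3   b   6
# 4   d   8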
x <- 1:5
x
## [1] 1 2 3 4 5
10:4
# [1] 10 9 8 7 6 5 4
1.25:5
# [1] 1.25 2.25 3.25 4.25
or negatives
-4:4
#[1] -4 -3 -2 -1 0 1 2 3 4
The seq function creates a sequence from the start (default is 1) to the end, including that number.
seq(5)
# [1] 1 2 3 4 5
seq can optionally infer the (evenly spaced) steps when alternatively the desired length of the output (length.out)
is supplied
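For example (a sketch):

seq(1, 5, length.out = 9)
# [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0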
If the sequence needs to have the same length as another vector we can use the along.with as a shorthand for
length.out = length(x)
x = 1:8
seq(2,5,along.with = x)
# [1] 2.000000 2.428571 2.857143 3.285714 3.714286 4.142857 4.571429 5.000000
There are a few useful simplified functions in the seq family: seq_along, seq_len, and seq.int. The seq_along and
seq_len functions construct the natural (counting) numbers from 1 through N, where N is determined by the
function argument: the length of a vector or list for seq_along, and the integer argument for seq_len.
seq_along(x)
# [1] 1 2 3 4 5 6 7 8
There is also an older function, sequence, that creates a vector of concatenated sequences from a vector of non-negative integers.
sequence(4)
# [1] 1 2 3 4
sequence(c(3, 2))
# [1] 1 2 3 1 2
sequence(c(3, 2, 5))
# [1] 1 2 3 1 2 1 2 3 4 5
integer(2)   # is the same as vector('integer', 2) and creates an integer vector with two elements
character(2) # is the same as vector('character', 2) and creates a character vector with two elements
logical(2)   # is the same as vector('logical', 2) and creates a logical vector with two elements
Creating vectors with values, other than the default values, is also possible. Often the function c() is used for this.
The c is short for combine or concatenate.
Important to note here is that R interprets any integer (e.g. 1) as an integer vector of size one. The same holds for
numerics (e.g. 1.1), logicals (e.g. T or F), or characters (e.g. 'a'). Therefore, you are in essence combining vectors,
which in turn are vectors.
Pay attention that you always have to combine similar vectors. Otherwise, R will try to convert the vectors into vectors
of the same type.
c(1, 1.1, 'a', T) # all types (integer, numeric, character and logical) are converted to the most
flexible type, which is character.
Finally, the : operator (short for the function seq()) can be used to quickly create a vector of numbers.
This can also be used to subset vectors (from easy to more complex subsets)
The each argument is especially useful for expanding a vector of statistics of observational/experimental units into
a column of a data.frame with repeated observations of these units.
This should expose the possibility of allowing an external function to feed the second argument of rep in order to
dynamically construct a vector that expands according to the data.
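For instance (illustrative):

rep(c("A", "B"), each = 3)
# [1] "A" "A" "A" "B" "B" "B"
rep(c("A", "B"), times = c(2, 4))
# [1] "A" "A" "B" "B" "B" "B"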
As with seq, faster, simplified versions of rep are rep_len and rep.int. These drop some attributes that rep
maintains and so may be most useful in situations where speed is a concern and additional aspects of the repeated
vector are unnecessary.
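The current date, the current time (here shown as seconds since the Unix epoch) and the session's time zone can be queried as sketched below; the three output lines that follow correspond, in order, to calls like these:

Sys.Date()
as.integer(Sys.time())
Sys.timezone()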
## [1] "2016-07-21"
## [1] 1469113479
## [1] "Australia/Melbourne"
Use OlsonNames() to view the time zone names in Olson/IANA database on the current system:
str(OlsonNames())
## chr [1:589] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa" "Africa/Algiers"
"Africa/Asmara" "Africa/Asmera" "Africa/Bamako" ...
Test:
x <- seq(as.POSIXct("2000-12-10"),as.POSIXct("2001-05-10"),by="months")
> data.frame(before=x,after=eom(x))
before after
1 2000-12-10 2000-12-31
2 2001-01-10 2001-01-31
3 2001-02-10 2001-02-28
4 2001-03-10 2001-03-31
5 2001-04-10 2001-04-30
6 2001-05-10 2001-05-31
>
> eom('2000-01-01')
[1] "2000-01-31"
It consistently moves the month part of the date, adjusting the day in case the original date refers to the last day of the
month.
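One possible definition of moveNumOfMonths, consistent with the calls below and reusing the eom() sketch above (the original definition may differ):

moveNumOfMonths <- function(date, num) {
  date  <- as.Date(date)
  first <- as.Date(format(date, "%Y-%m-01"))
  # first day of the month `num` months away (seq.Date accepts negative steps such as "-2 month")
  target_first <- seq(first, by = paste(num, "month"), length.out = 2)[2]
  if (date == eom(date)) {
    eom(target_first)                              # last-day dates stay on the last day
  } else {
    target_first + as.integer(format(date, "%d")) - 1
  }
}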
For example:
> moveNumOfMonths("2017-10-30",-1)
[1] "2017-09-30"
> moveNumOfMonths("2017-10-30",-2)
[1] "2017-08-30"
> moveNumOfMonths("2017-02-28", 2)
[1] "2017-04-30"
It moves two months from the last day of February, therefore the last day of April.
Let's see how it works for backward and forward operations when it is the last day of the month:
> moveNumOfMonths("2016-11-30", 2)
[1] "2017-01-31"
> moveNumOfMonths("2017-01-31", -2)
[1] "2016-11-30"
Because November has 30 days, we get the same date back in the backward operation, but such a round trip is not guaranteed for every starting date.
as.Date('March 23rd, 2016', '%B %drd, %Y') # add separators and literals to format
## [1] "2016-03-23"
as.Date(' 2016-08-01 foo') # leading whitespace and all trailing characters are ignored
## [1] "2016-08-01"
as.Date(c('2016-01-01', '2016-01-02'))
# [1] "2016-01-01" "2016-01-02"
The as.Date() function allows you to provide a format argument. The default is %Y-%m-%d, which is Year-month-
day.
The format string can be placed either within a pair of single quotes or double quotes. Dates are usually expressed
in a variety of forms such as: "d-m-yy" or "d-m-YYYY" or "m-d-yy" or "m-d-YYYY" or "YYYY-m-d" or "YYYY-d-m".
These formats can also be expressed by replacing "-" by "/". Further, dates are also expressed in the forms, say,
"Nov 6, 1986" or "November 6, 1986" or "6 Nov, 1986" or "6 November, 1986" and so on. The as.Date() function
accepts all such character strings and when we mention the appropriate format of the string, it always outputs the
date in the form "YYYY-m-d".
#
# It tries to interpret the string as YYYY-m-d
#
> as.Date("9-6-1962")
[1] "0009-06-19" #interprets as "%Y-%m-%d"
>
as.Date("9/6/1962")
[1] "0009-06-19" #again interprets as "%Y-%m-%d"
>
# It has no problem in understanding, if the date is in form YYYY-m-d or YYYY/m/d
#
> as.Date("1962-6-9")
[1] "1962-06-09" # no problem
> as.Date("1962/6/9")
[1] "1962-06-09" # no problem
>
By specifying the correct format of the input string, we can get the desired results. We use the following codes for
specifying the formats to the as.Date() function.
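The usual strptime-style codes apply; a brief sketch:

# %d day of the month,        %m month (number),
# %b abbreviated month name,  %B full month name,
# %y year without century,    %Y year with century
as.Date("6-9-1962", format = "%d-%m-%Y")
# [1] "1962-09-06"
as.Date("9/6/1962", format = "%m/%d/%Y")
# [1] "1962-09-06"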
Sometimes, the names of the months, abbreviated to the first three characters, are used in writing the dates. In
that case we use the format specifier %b.
> as.Date("6Nov1962","%d%b%Y")
[1] "1962-11-06"
>
Note that there are neither '-' nor '/' nor white spaces between the members in the date string. The format string
should exactly match the input string. Consider the following example:
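For instance (a sketch):

as.Date("6 November, 1986", "%d %B, %Y")
# [1] "1986-11-06"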
Note that there is a comma in the date string and hence a comma in the format specification too. If the comma is
omitted in the format string, it results in an NA. An example usage of the %B format specifier is as follows:
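A sketch:

as.Date("November 6, 1986", "%B %d, %Y")
# [1] "1986-11-06"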
The %y format is system specific and hence should be used with caution. Other parameters used with this function are
origin and tz (time zone).
See ?strptime for details on the format strings here, as well as other formats.
More formally, as.difftime can be used to specify time periods to add to a date or datetime object. E.g.:
as.POSIXct("2016-01-01") +
as.difftime(3, units="hours") +
as.difftime(14, units="mins") +
as.difftime(15, units="secs")
# [1] "2016-01-01 03:14:15 AEDT"
as.POSIXct("11 AM",
format = "%I %p")
## [1] "2016-07-21 11:00:00 CDT"
as.POSIXct("2016-07-21 00:00:00",
format = "%F %T") # shortcut tokens for "%Y-%m-%d" and "%H:%M:%S"
Notes
Missing elements
If a date element is not supplied, then that from the current date is used.
If a time element is not supplied, then that from midnight is used, i.e. 0s.
If no timezone is supplied in either the string or the tz parameter, the local timezone is used.
Time zones
x <- "The quick brown fox jumps over the lazy dog"
class(x)
[1] "character"
is.character(x)
[1] TRUE
Note that numerics can be coerced to characters, but attempting to coerce a character to numeric may result in NA.
as.numeric("2")
[1] 2
as.numeric("fox")
[1] NA
Warning message:
NAs introduced by coercion
x <- 12.3
y <- 12L
#confirm types
typeof(x)
[1] "double"
typeof(y)
[1] "integer"
# logical to numeric
as.numeric(TRUE)
[1] 1
Doubles are R's default numeric value. They are double precision vectors, meaning that they take up 8 bytes of
memory for each value in the vector. R has no single precision data type and so all real numbers are stored in the
double precision format.
is.double(1)
TRUE
is.double(1.0)
TRUE
is.double(1L)
FALSE
Integers are whole numbers that can be written without a fractional component. Integers are represented by a
number with an L after it. Any number without an L after it will be considered a double.
typeof(1)
[1] "double"
class(1)
[1] "numeric"
typeof(1L)
[1] "integer"
class(1L)
[1] "integer"
Though in most cases using an integer or double will not matter, sometimes replacing doubles with integers will
use less memory and running time. A comparison with microbenchmark might look like this:
library(microbenchmark)
microbenchmark(
for( i in 1:100000){
2L * i
10L + i
},
for( i in 1:100000){
2.0 * i
10.0 + i
}
)
Unit: milliseconds
                                   expr      min       lq     mean   median       uq      max neval
 for (i in 1:1e+05) { 2L * i 10L + i } 40.74775 42.34747 50.70543 42.99120 65.46864 94.11804   100
  for (i in 1:1e+05) { 2 * i 10 + i } 41.07807 42.38358 53.52588 44.26364 65.84971 83.00456   100
Note that the || operator evaluates the left condition and if the left condition is TRUE the right side is never
evaluated. This can save time if the first is the result of a complex operation. The && operator will likewise return
FALSE without evaluation of the second argument when the first element of the first argument is FALSE.
> x <- 5
> x > 6 || stop("X is too small")
Error: X is too small
> x > 3 || stop("X is too small")
[1] TRUE
To check whether a value is a logical you can use the is.logical() function.
> x <- 2
> z <- x > 4
> z
[1] FALSE
> class(x)
[1] "numeric"
> as.logical(2)
[1] TRUE
When applying as.numeric() to a logical, a double will be returned. NA is a logical value and a logical operator with
an NA will return NA if the outcome is ambiguous.
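For example:
NA & FALSE
# [1] FALSE
NA & TRUE
# [1] NA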
But this is unusual. It is more common for a data.frame to have many columns and many rows. Here is a
data.frame with three rows and two columns (a is numeric class and b is character class):
In order for the data.frame to print, we will need to supply some row names. Here we use just the numbers 1:3:
Now it becomes obvious that we have a data.frame with 3 rows and 2 columns. You can check this using nrow(),
ncol(), and dim():
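A sketch of how this might look using structure() (the particular values here are made up):
df <- structure(list(a = c(1, 2, 3), b = c("x", "y", "z")),
                class = "data.frame", row.names = 1:3)
df
nrow(df)  # 3
ncol(df)  # 2
dim(df)   # 3 2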
R provides two other functions (besides structure()) that can be used to create a data.frame. The first is called,
intuitively, data.frame(). It checks to make sure that the column names you supplied are valid, that the list
elements are all the same length, and supplies some automatically generated row names. This means that the
output of data.frame() might not always be exactly what you expect:
The other function is called as.data.frame(). This can be used to coerce an object that is not a data.frame into
being a data.frame by running it through data.frame(). As an example, consider a matrix:
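# a small character matrix (chosen to match the output below)
m <- matrix(letters[1:9], nrow = 3)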
> as.data.frame(m)
V1 V2 V3
1 a d g
2 b e h
3 c f i
> str(as.data.frame(m))
'data.frame': 3 obs. of 3 variables:
$ V1: Factor w/ 3 levels "a","b","c": 1 2 3
$ V2: Factor w/ 3 levels "d","e","f": 1 2 3
$ V3: Factor w/ 3 levels "g","h","i": 1 2 3
This topic covers the most common syntax to access specific rows and columns of a data frame. These are [, [[, and $.
Using the built in data frame mtcars, we can extract rows and columns using [] brackets with a comma included.
Indices before the comma are rows:
As shown above, if either rows or columns are left blank, all will be selected. mtcars[1, ] indicates the first row
with all the columns.
mtcars["Mazda Rx4", ]
# 2nd and 5th row of the mpg, cyl, and disp columns
mtcars[c(2, 5), c("mpg", "cyl", "disp")]
When using these methods, if you extract multiple columns, you will get a data frame back. However, if you extract
a single column, you will get a vector, not a data frame under the default options.
There are two ways around this. One is to treat the data frame as a list (see below), the other is to add a drop =
FALSE argument. This tells R to not "drop the unused dimensions":
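For example:
# returns a one-column data frame instead of a vector
mtcars[, "mpg", drop = FALSE]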
Note that matrices work the same way - by default a single column or row will be a vector, but if you specify drop =
FALSE you can keep it as a one-column or one-row matrix.
Like a list
Data frames are essentially lists, i.e., they are a list of column vectors (that all must have the same length). Lists
can be subset using single brackets [ for a sub-list, or double brackets [[ for a single element.
When you use single brackets and no commas, you will get column back because data frames are lists of columns.
mtcars["mpg"]
mtcars[c("mpg", "cyl", "disp")]
my_columns <- c("mpg", "cyl", "hp")
mtcars[my_columns]
The difference between data[columns] and data[, columns] is that when treating the data.frame as a list (no
comma in the brackets) the object returned will be a data.frame. If you use a comma to treat the data.frame like a
matrix then selecting a single column will return a vector but selecting multiple columns will return a data.frame.
To extract a single column as a vector when treating your data.frame as a list, you can use double brackets [[.
This will only work for a single column at a time.
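For example:
mtcars[["mpg"]]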
A single column can be extracted using the magical shortcut $ without using a quoted column name:
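mtcars$mpg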
The $ can be a convenient shortcut, especially if you are working in an environment (such as RStudio) that will auto-
complete the column name in this case. However, $ has drawbacks as well: it uses non-standard evaluation to avoid
the need for quotes, which means it will not work if your column name is stored in a variable.
Due to these concerns, $ is best used in interactive R sessions when your column names are constant. For
programmatic use, for example in writing a generalizable function that will be used on different data sets with
different column names, $ should be avoided.
Also note that, by default, $ uses partial matching when extracting from recursive objects (except environments).
Whenever we have the option to use numbers for a index, we can also use negative numbers to omit certain
indices or a boolean (logical) vector to indicate exactly which items to keep.
We can use a condition such as < to generate a logical vector, and extract only the rows that meet the condition:
# logical vector indicating TRUE when a row has mpg less than 15
# FALSE when a row has mpg >= 15
test <- mtcars$mpg < 15
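# rows of mtcars where the condition holds
mtcars[test, ]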
subset
The subset() function allows you to subset a data.frame in a more convenient way (subset also works with other
classes):
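subset(mtcars, subset = cyl == 6, select = c(mpg, hp))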
In the code above we are asking only for the rows in which cyl == 6 and for the columns mpg and hp. You could achieve
the same result using [ ] with the following code:
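mtcars[mtcars$cyl == 6, c("mpg", "hp")]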
The transform() function is a convenience function to change columns inside a data.frame. For instance the
following code adds another column named mpg2 with the result of mpg^2 to the mtcars data.frame:
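mtcars <- transform(mtcars, mpg2 = mpg^2)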
Both with() and within() let you evaluate expressions inside the data.frame environment, allowing a
somewhat cleaner syntax, saving you the use of some $ or [].
For example, if you want to create, change and/or remove multiple columns in the airquality data.frame:
aq <- within(airquality, {
lOzone <- log(Ozone) # creates new column
Month <- factor(month.abb[Month]) # changes Month Column
cTemp <- round((Temp - 32) * 5/9, 1) # creates new column
S.cT <- Solar.R / cTemp # creates new column
rm(Day, Temp) # removes columns
})
Data frame objects do not print with quotation marks, so the class of the columns is not always obvious.
Without further investigation, the "x" columns in df1 and df2 cannot be differentiated. The str function can be
used to describe objects with more detail than class.
str(df1)
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
str(df2)
## 'data.frame': 3 obs. of 2 variables:
Here you see that df1 is a data.frame and has 3 observations of 2 variables, "x" and "y." Then you are told that "x"
has the data type integer (not important for this class, but for our purposes it behaves like a numeric) and "y" is a
factor with three levels (another data class we are not discussing). It is important to note that, by default, data
frames coerce characters to factors. The default behavior can be changed with the stringsAsFactors parameter:
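For example:
df2 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df2)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: chr "a" "b" "c"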
Now the "y" column is a character. As mentioned above, each "column" of a data frame must have the same length.
Trying to create a data.frame from vectors with different lengths will result in an error. (Try running data.frame(x
= 1:3, y = 1:4) to see the resulting error.)
As test-cases for data frames, some data is provided by R by default. One of them is iris, loaded as follows:
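data(iris)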
The best time to do this is when the data is read in - almost all input methods that create data frames have a
stringsAsFactors option which can be set to FALSE.
If the data has already been created, factor columns can be converted to character columns as shown below.
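One way to convert every factor column of a data frame df at once (a sketch):
df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)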
Let's consider a data analysis where we want to obtain the two cars with the best miles per gallon (mpg) for each
cylinder count (cyl) in the built-in mtcars dataset. First, we split the mtcars data frame by the cylinder count:
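spl <- split(mtcars, mtcars$cyl)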
This has returned a list of data frames, one for each cylinder count. As indicated by the output, we could obtain the
relevant data frames with spl$`4`, spl$`6`, and spl$`8` (some might find it more visually appealing to use
spl$"4" or spl[["4"]] instead).
Now, we can use lapply to loop through this list, applying our function that extracts the cars with the best 2 mpg
values from each of the list elements:
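# keep the two rows with the highest mpg in each group (ordered to match the combined output below)
best2 <- lapply(spl, function(d) tail(d[order(d$mpg), ], 2))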
do.call(rbind, best2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 4.Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# 4.Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# 6.Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
# 6.Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
# 8.Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 8.Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
This returns the result of rbind (argument 1, a function) with all the elements of best2 (argument 2, a list) passed as
arguments.
With simple analyses like this one, it can be more compact (and possibly much less readable!) to do the whole split-
apply-combine in a single line of code:
It is also worth noting that the lapply(split(x,f), FUN) combination can be alternatively framed using the ?by
function:
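For example:
by(mtcars, mtcars$cyl, function(d) tail(d[order(d$mpg), ], 2))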
testdata <- c("e", "o", "r", "g", "a", "y", "w", "q", "i", "s", "b", "v", "x", "h", "u")
The objective is to separate those letters into vowels and consonants, i.e. to split the vector according to letter type.
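One way to build the grouping vector (treating "y" as a vowel, to match the output below):
letter_type <- ifelse(testdata %in% c("a", "e", "i", "o", "u", "y"), "vowels", "consonants")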
Note that letter_type has the same length as our vector testdata. Now we can split this test data into the two
groups, vowels and consonants:
split(testdata, letter_type)
#$consonants
#[1] "r" "g" "w" "q" "s" "b" "v" "x" "h"
#$vowels
#[1] "e" "o" "a" "y" "i" "u"
data(iris)
By using split, one can create a list containing one data.frame per iris species (variable: Species):
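liris <- split(iris, iris$Species)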
One example operation would be to compute the correlation matrix per iris species; one would then use lapply:
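lcor <- lapply(liris, function(d) cor(d[, 1:4]))
lcor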
$setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 0.7425467 0.2671758 0.2780984
Sepal.Width 0.7425467 1.0000000 0.1777000 0.2327520
Petal.Length 0.2671758 0.1777000 1.0000000 0.3316300
Petal.Width 0.2780984 0.2327520 0.3316300 1.0000000
$versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 0.5259107 0.7540490 0.5464611
Sepal.Width 0.5259107 1.0000000 0.5605221 0.6639987
Petal.Length 0.7540490 0.5605221 1.0000000 0.7866681
Petal.Width 0.5464611 0.6639987 0.7866681 1.0000000
$virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 0.4572278 0.8642247 0.2811077
Sepal.Width 0.4572278 1.0000000 0.4010446 0.5377280
Petal.Length 0.8642247 0.4010446 1.0000000 0.3221082
Petal.Width 0.2811077 0.5377280 0.3221082 1.0000000
Then we can retrieve, per group, the best pair of correlated variables: the correlation matrix is reshaped/melted,
the diagonal is filtered out, and the best record is selected:
> library(reshape)
> (topcor <- lapply(lcor, FUN=function(cormat){
correlations <- melt(cormat,variable_name="correlation");
filtered <- correlations[correlations$X1 != correlations$X2,];
filtered[which.max(filtered$correlation),]
}))
$versicolor
X1 X2 correlation
12 Petal.Width Petal.Length 0.7866681
$virginica
X1 X2 correlation
3 Petal.Length Sepal.Length 0.8642247
Note that once computations are performed on such a groupwise level, one may be interested in stacking the results,
which can be done with:
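do.call(rbind, topcor)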
X1 X2 correlation
setosa Sepal.Width Sepal.Length 0.7425467
versicolor Petal.Width Petal.Length 0.7866681
virginica Petal.Length Sepal.Length 0.8642247
Comma separated value files (CSVs) can be imported using read.csv, which wraps read.table, but uses sep = ","
to set the delimiter to a comma.
df <- read.csv(csv_path)
df
## Var1 Var2
## 1 2.70 A
## 2 3.14 B
## 3 10.00 A
## 4 -7.00 A
df <- read.csv(file.choose())
Notes
Unlike read.table, read.csv defaults to header = TRUE, and uses the first row as column names.
All these functions will convert strings to factor class by default unless either as.is = TRUE or
stringsAsFactors = FALSE.
The read.csv2 variant defaults to sep = ";" and dec = "," for use on data from countries where the
comma is used as a decimal point and the semicolon as a field separator.
The readr package's read_csv function offers much faster performance, a progress bar for large files, and more
popular default options than standard read.csv, including stringsAsFactors = FALSE.
library(readr)
df <- read_csv(csv_path)
df
## # A tibble: 4 x 2
## Var1 Var2
## <dbl> <chr>
## 1 2.70 A
## 2 3.14 B
## 3 10.00 A
## 4 -7.00 A
library(data.table)
dt <- fread(csv_path)
dt
## Var1 Var2
## 1: 2.70 A
## 2: 3.14 B
## 3: 10.00 A
## 4: -7.00 A
fread returns an object of class data.table that inherits from class data.frame, suitable for use with the
data.table's usage of []. To return an ordinary data.frame, set the data.table parameter to FALSE:
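df <- fread(csv_path, data.table = FALSE)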
class(df)
## [1] "data.frame"
df
## Var1 Var2
## 1 2.70 A
## 2 3.14 B
## 3 10.00 A
## 4 -7.00 A
Notes
fread does not have all the same options as read.table; one missing argument is na.comment, which may lead to unexpected behaviour with some files.
write.csv(mtcars, "mtcars.csv")
readr::write_csv is significantly faster than write.csv and does not write row names.
library(readr)
write_csv(mtcars, "mtcars.csv")
This reads every file and adds it to a list. Afterwards, if all the data.frames have the same structure, they can be combined
into one big data.frame:
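A sketch, assuming the CSV files sit in the working directory and share the same columns:
files <- list.files(pattern = "\\.csv$")
dfs <- lapply(files, read.csv)      # read every file into a list
big_df <- do.call(rbind, dfs)       # combine into one big data.frame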
An example:
Let's assume this data table exists in the local file constants.txt in the working directory.
When read with base R's read.fwf, the resulting data frame gets the default column names V1 to V5.
library(readr)
df <- read_fwf('constants.txt',
fwf_cols(Year = 8, Name = 10, Importance = 18, Value = 7, Doubled = 8),
skip = 1)
df
#> # A tibble: 3 x 5
#> Year Name Importance Value Doubled
#> <int> <chr> <chr> <dbl> <dbl>
#> 1 1647 pi 'important' 3.14159 6.28318
#> 2 1731 euler 'quite important' 2.71828 5.43656
#> 3 1979 answer 'The Answer.' 42.00000 42.00000
Note:
readr's fwf_* helper functions offer alternative ways of specifying column lengths, including automatic
guessing (fwf_empty)
readr is faster than base R
Column titles cannot be automatically imported from data file
Pipe operators, available in magrittr, dplyr, and other R packages, process a data-object using a sequence of
operations by passing the result of one step as input for the next step using infix-operators rather than the more
typical R method of nested function calls.
Note that the intended aim of pipe operators is to increase human readability of written code. See Remarks section
for performance considerations.
library(magrittr)
1:10 %>% mean
# [1] 5.5
# is equivalent to
mean(1:10)
# [1] 5.5
The pipe can be used to replace a sequence of function calls. Multiple pipes allow us to read and write the
sequence from left to right, rather than from inside to out. For example, suppose we have years defined as a factor
but want to convert it to a numeric. To prevent possible information loss, we first convert to character and then to
numeric:
# nesting
as.numeric(as.character(years))
# piping
years %>% as.character %>% as.numeric
If we don't want the LHS (Left Hand Side) used as the first argument on the RHS (Right Hand Side), there are
workarounds, such as naming the arguments or using . to indicate where the piped input goes.
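For example:
"hello" %>% paste("world")       # the LHS is used as the first argument
# [1] "hello world"
"hello" %>% paste("world", .)    # . marks where the LHS goes
# [1] "world hello"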
. %>% RHS
As an example, suppose we have factor dates and want to extract the year:
# Creating a dataset
df <- data.frame(now = "2015-11-11", before = "2012-01-01")
# now before
# 1 2015-11-11 2012-01-01
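The functional sequence referred to below can be defined like this (assuming lubridate's year()):
library(lubridate)
read_year <- . %>% as.character %>% as.Date %>% year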
We can review the composition of the function by typing its name or using functions:
read_year
# Functional sequence with the following components:
#
# 1. as.character(.)
# 2. as.Date(.)
# 3. year(.)
#
# Use 'functions' to extract the individual functions.
read_year[[2]]
Generally, this approach may be useful when clarity is more important than speed.
library(magrittr)
library(dplyr)
df <- mtcars
Instead of writing
or
The compound assignment operator will both pipe and reassign df:
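For example, with dplyr's filter:
# instead of
df <- df %>% filter(cyl > 4)
# or
df = df %>% filter(cyl > 4)
# the compound assignment pipe does both in one step
df %<>% filter(cyl > 4)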
The exposition pipe operator %$% allows a user to avoid breaking a pipeline when needing to refer to column
names. For instance, say you want to filter a data.frame and then run a correlation test on two columns with
cor.test:
library(magrittr)
library(dplyr)
mtcars %>%
filter(wt > 2) %$%
cor.test(hp, mpg)
#>
#> Pearson's product-moment correlation
#>
#> data: hp and mpg
#> t = -5.9546, df = 26, p-value = 2.768e-06
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> -0.8825498 -0.5393217
#> sample estimates:
#> cor
Here the standard %>% pipe passes the data.frame through to filter(), while the %$% pipe exposes the column
names to cor.test().
The exposition pipe works like a pipe-able version of the base R with() functions, and the same left-hand side
objects are accepted as inputs.
%T>% (tee operator) allows you to forward a value into a side-effect-producing function while keeping the original
lhs value intact. In other words: the tee operator works like %>%, except the return values is lhs itself, and not the
result of the rhs function/expression.
Example: Create, pipe, write, and return an object. If %>% were used in place of %T>% in this example, then the
variable all_letters would contain NULL rather than the value of the sorted object.
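A sketch of such an example (the file name is made up):
all_letters <- c(letters, LETTERS) %>%
  sort %T>%
  write.csv(file = "all_letters.csv")
head(all_letters)
# [1] "a" "A" "b" "B" "c" "C"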
Warning: Piping an unnamed object to save() will produce an object named . when loaded into the workspace
with load(). However, a workaround using a helper function is possible (which can also be written inline as an
anonymous function).
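A sketch of how the problem arises (file name made up): the unnamed piped value is saved, then loaded into a new environment e:
c(letters, LETTERS) %>% sort %T>% save(file = "all_letters.RData")
e <- new.env()
load("all_letters.RData", envir = e)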
get("all_letters", envir = e)
# Error in get("all_letters", envir = e) : object 'all_letters' not found
get(".", envir = e)
# [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F" "g" "G" "h" "H" "i" "I" "j" "J"
# [21] "k" "K" "l" "L" "m" "M" "n" "N" "o" "O" "p" "P" "q" "Q" "r" "R" "s" "S" "t" "T"
# [41] "u" "U" "v" "V" "w" "W" "x" "X" "y" "Y" "z" "Z"
# Work-around
save2 <- function(. = ., name, file = stop("'file' must be specified")) {
assign(name, .)
call_save <- call("save", ... = name, file = file)
eval(call_save)
}
library(dplyr)
library(ggplot2)
diamonds %>%
filter(depth > 60) %>%
group_by(cut) %>%
summarize(mean_price = mean(price)) %>%
ggplot(aes(x = cut, y = mean_price)) +
geom_bar(stat = "identity")
If we are interested in the relationship between fuel efficiency (mpg) and weight (wt) we may start plotting those
variables with:
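plot(mpg ~ wt, data = mtcars)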
The plot shows a (linear) relationship. Then, if we want to perform linear regression to determine the coefficients
of a linear model, we would use the lm function:
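fit <- lm(mpg ~ wt, data = mtcars)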
The ~ here means "explained by", so the formula mpg ~ wt means we are predicting mpg as explained by wt. The
most helpful way to view the output is with:
summary(fit)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
the estimated slope of each coefficient (wt and the y-intercept), which suggests the best-fit prediction of mpg
is 37.2851 + (-5.3445) * wt
The p-value of each coefficient, which suggests that the intercept and weight are probably not due to chance
Overall estimates of fit such as R^2 and adjusted R^2, which show how much of the variation in mpg is
explained by the model
We could add a line to our first plot to show the predicted mpg:
abline(fit,col=3,lwd=2)
It is also possible to add the equation to that plot. First, get the coefficients with coef. Then, using paste0, we
collapse the coefficients with the appropriate variables and +/- signs to build the equation. Finally, we add it to the plot
using mtext:
bs <- round(coef(fit), 3)
lmlab <- paste0("mpg = ", bs[1],
ifelse(sign(bs[2])==1, " + ", " - "), abs(bs[2]), " wt ")
mtext(lmlab, 3, line=-2)
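For the prediction example below, a simple model using displacement can be fit first; its printed coefficients follow:
my_mdl <- lm(mpg ~ disp, data = mtcars)
my_mdl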
Call:
lm(formula = mpg ~ disp, data = mtcars)
Coefficients:
(Intercept) disp
29.59985 -0.04122
If I had a new data source with displacement I could see the estimated miles per gallon.
set.seed(1234)
newdata <- sample(mtcars$disp, 5)
newdata
[1] 258.0 71.1 75.7 145.0 400.0
The most important part of the process is to create a new data frame with the same column names as the original
data. In this case, the original data had a column labeled disp, I was sure to call the new data that same name.
predict(my_mdl, newdata)
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
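Wrapping the new values in a data frame whose column is named disp avoids the error:
newdf <- data.frame(disp = newdata)
predict(my_mdl, newdf)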
Accuracy
To check the accuracy of the prediction you will need the actual y values of the new data. In this example, newdf will
need a column for 'mpg' and 'disp'.
Analytic Weights: Reflect the different levels of precision of different observations. For example, if analyzing
data where each observation is the average results from a geographic area, the analytic weight is
proportional to the inverse of the estimated variance. Useful when dealing with averages in data by providing
a proportional weight given the number of observations. Source
Sampling Weights (Inverse Probability Weights - IPW): a statistical technique for calculating statistics
standardized to a population different from that in which the data was collected. Study designs with a
disparate sampling population and population of target inference (target population) are common in
application. Useful when dealing with data that have missing values. Source
Test Data
Analytic Weights
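Assuming the test data is a data frame named data with the columns used below and an analytic-weight column weight, the fit shown in the output can be produced with something like:
lm.aw <- lm(lexptot ~ progvillm + sexhead + agehead, data = data, weights = weight)
summary(lm.aw)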
Output
Call:
lm(formula = lexptot ~ progvillm + sexhead + agehead, data = data,
weights = weight)
Weighted Residuals:
1 2 3 4 5
9.249e-02 5.823e-01 0.000e+00 -6.762e-01 -1.527e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054 1.744293 5.742 0.110
progvillm -0.781204 1.344974 -0.581 0.665
sexhead 0.306742 1.040625 0.295 0.818
agehead -0.005983 0.032024 -0.187 0.882
library(survey)
data$X <- 1:nrow(data) # Create unique id
# Build survey design object with unique id, ipw, and data.frame
des1 <- svydesign(id = ~X, weights = ~weight, data = data)
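The weighted model itself can then be fit with, for example:
prog.wt <- svyglm(lexptot ~ progvillm + sexhead + agehead, design = des1)
summary(prog.wt)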
Output
Call:
svyglm(formula = lexptot ~ progvillm + sexhead + agehead, design = des1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054 0.183942 54.452 0.0117 *
progvillm -0.781204 0.640372 -1.220 0.4371
sexhead 0.306742 0.397089 0.772 0.5813
agehead -0.005983 0.014747 -0.406 0.7546
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let's fit a quadratic model for the mtcars dataset. For a linear model see Linear regression on the mtcars dataset.
First we make a scatter plot of the variables mpg (Miles/gallon), disp (Displacement (cu.in.)), and wt (Weight (1000
lbs)). The relationship among mpg and disp appears non-linear.
plot(mtcars[,c("mpg","disp","wt")])
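A first, purely linear model (whose coefficient table is shown below) might be:
fit0 <- lm(mpg ~ wt + disp, data = mtcars)
summary(fit0)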
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 34.96055 2.16454 16.151 4.91e-16 ***
#wt -3.35082 1.16413 -2.878 0.00743 **
#disp -0.01773 0.00919 -1.929 0.06362 .
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 2.917 on 29 degrees of freedom
#Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
Then, to get the result of a quadratic model, we added I(disp^2). The new model appears better when looking at
R^2 and all variables are significant.
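For example:
fit1 <- lm(mpg ~ wt + disp + I(disp^2), data = mtcars)
summary(fit1)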
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 41.4019837 2.4266906 17.061 2.5e-16 ***
#wt -3.4179165 0.9545642 -3.581 0.001278 **
#disp -0.0823950 0.0182460 -4.516 0.000104 ***
#I(disp^2) 0.0001277 0.0000328 3.892 0.000561 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 2.391 on 28 degrees of freedom
mpg = 41.4020-3.4179*wt-0.0824*disp+0.0001277*disp^2
Another way to specify polynomial regression is using poly with parameter raw=TRUE, otherwise orthogonal
polynomials will be considered (see help(poly) for more information). We get the same result using:
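fit1b <- lm(mpg ~ wt + poly(disp, 2, raw = TRUE), data = mtcars)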
Finally, what if we need to show a plot of the estimated surface? Well there are many options to make 3D plots in R.
Here we use Fit3d from the p3d package.
library(p3d)
Init3d(family="serif", cex = 1)
Plot3d(mpg ~ disp+wt, mtcars)
Axes3d()
Fit3d(fit1)
Then plot the two variables of interest and add the regression line within the definition domain:
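For example (a sketch, fitting mpg against wt):
fit <- lm(mpg ~ wt, data = mtcars)
plot(mpg ~ wt, data = mtcars)
lines(sort(mtcars$wt), fitted(fit)[order(mtcars$wt)], col = "blue")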
Almost there! The last step is to add to the plot the regression equation, the R-squared, as well as the correlation
coefficient. This is done using the vector function:
rp = vector('expression',3)
rp[1] = substitute(expression(italic(y) == MYOTHERVALUE3 + MYOTHERVALUE4 %*% x),
list(MYOTHERVALUE3 = format(fit$coefficients[1], digits = 2),
MYOTHERVALUE4 = format(fit$coefficients[2], digits = 2)))[2]
rp[2] = substitute(expression(italic(R)^2 == MYVALUE),
list(MYVALUE = format(summary(fit)$adj.r.squared,dig=3)))[2]
rp[3] = substitute(expression(Pearson-R == MYOTHERVALUE2),
list(MYOTHERVALUE2 = format(cor(mtcars$wt,mtcars$mpg), digits = 2)))[2]
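The expressions can then be placed on the plot with, for example:
legend("topright", legend = rp, bty = "n")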
Note that you can add any other parameter such as the RMSE by adapting the vector function. Imagine you want a
legend with 10 elements. The vector definition would be the following:
rp = vector('expression',10)
These plots check for two assumptions that were made while building the model:
1. That the expected value of the predicted variable (in this case mpg) is given by a linear combination of the
predictors (in this case wt). We expect this estimate to be unbiased. So the residuals should be centered
around the mean for all values of the predictors. In this case we see that the residuals tend to be positive at
the ends and negative in the middle, suggesting a non-linear relationship between the variables.
2. That the actual predicted variable is normally distributed around its estimate. Thus, the residuals should be
approximately normally distributed; the normal Q-Q plot is used to check this.
Build
library(data.table)
DT <- data.table(
x = letters[1:5],
y = 1:5,
z = (1:5) > 3
)
# x y z
# 1: a 1 FALSE
# 2: b 2 FALSE
# 3: c 3 FALSE
# 4: d 4 TRUE
# 5: e 5 TRUE
sapply(DT, class)
# x y z
# "character" "integer" "logical"
Read in
dt <- fread("my_file.csv")
Modify a data.frame
For efficiency, data.table offers a way of altering a data.frame or list to make a data.table in-place (without making a
copy or changing its memory location):
# example data.frame
DF <- data.frame(x = letters[1:5], y = 1:5, z = (1:5) > 3)
# modification
setDT(DF)
Note that we do not <- assign the result, since the object DF has been modified in-place. The class attributes of the columns are preserved:
sapply(DF, class)
# x y z
# "factor" "integer" "logical"
If you have a list, data.frame, or data.table, you should use the setDT function to convert to a data.table
because it does the conversion by reference instead of making a copy (which as.data.table does). This is
important if you are working with large datasets.
If you have another R object (such as a matrix), you must use as.data.table to coerce it to a data.table.
DT <- as.data.table(mat)
# or
DT <- data.table(mat)
.SD refers to the subset of the data.table for each group, excluding all columns used in by.
.SD along with lapply can be used to apply any function to multiple columns by group in a data.table
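A sketch (first converting mtcars to a data.table, as the examples below assume):
library(data.table)
mtcars <- as.data.table(mtcars)
mtcars[, lapply(.SD, mean), by = cyl]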
Apart from cyl, there are other categorical columns in the dataset such as vs, am, gear and carb. It doesn't really
make sense to take the mean of these columns. So let's exclude these columns. This is where .SDcols comes into
the picture.
.SDcols
.SDcols specifies the columns of the data.table that are included in .SD.
Mean of all columns (continuous columns) in the dataset by number of gears gear, and number of cylinders, cyl,
arranged by gear and cyl:
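For example (the continuous columns chosen here are one possibility):
mtcars[, lapply(.SD, mean), keyby = .(gear, cyl),
       .SDcols = c("mpg", "disp", "hp", "drat", "wt", "qsec")]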
Maybe we don't want to calculate the mean by groups. To calculate the mean for all the cars in the dataset, we don't
specify the by variable.
Note:
It is not necessary to define cols_chosen beforehand. .SDcols can directly take column names
.SDcols can also directly take a vector of column numbers. In the above example this would be mtcars[ ,
lapply(.SD, mean), .SDcols = c(1,3:7)]
.N
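For example, counting the rows of iris per species:
as.data.table(iris)[, .(count = .N), by = Species]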
# Species count
#1: setosa 50
#2: versicolor 50
#3: virginica 50
If the columns are dependent and must be defined in sequence, one way is:
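mtcars[, mpg_sq := mpg^2][, mpg_sq_rt := sqrt(mpg_sq)]  # the second call can use the column created by the first (column names here are just examples)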
The .() syntax is used when the right-hand side of LHS := RHS is a list of columns.
vn = "mpg_sq"
mtcars[, (vn) := mpg^2]
Columns can also be modified with set, though this is rarely necessary:
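set(mtcars, j = "wt_lbs", value = mtcars$wt * 1000)  # add a column without the [.data.table overhead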
As in a data.frame, we can subset using row numbers or logical tests. It is also possible to use a "join" in i, but that
more complicated task is covered in another example.
Functions that edit attributes, such as levels<- or names<-, actually replace an object with a modified copy. Even if
only used on one column in a data.table, the entire object is copied and replaced.
To modify an object without copies, use setnames to change the column names of a data.table or data.frame and
setattr to change an attribute for any object.
Be aware that these changes are made by reference, so they are global. Changing them within one environment affects the object in every environment.
A data.table is one of several two-dimensional data structures available in R, besides data.frame, matrix and (2D)
array. All of these classes use a very similar but not identical syntax for subsetting, the A[rows, cols] schema.
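For concreteness, suppose the three objects below (the values are made up; the columns are named X, Y and Z):
library(data.table)
ma <- matrix(1:12, nrow = 4, dimnames = list(NULL, c("X", "Y", "Z")))
df <- data.frame(X = 1:4, Y = 5:8, Z = 9:12)
dt <- data.table(X = 1:4, Y = 5:8, Z = 9:12)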
ma[2:3] #---> returns the 2nd and 3rd items, as if 'ma' were a vector (because it is!)
df[2:3] #---> returns the 2nd and 3rd columns
dt[2:3] #---> returns the 2nd and 3rd rows!
ma[2:3, ] # \
df[2:3, ] # }---> returns the 2nd and 3rd rows
dt[2:3, ] # /
But, if you want to subset columns, some cases are interpreted differently. All three can be subset the same way
with integer or character indices not stored in a variable.
ma[, 2:3] # \
df[, 2:3] # \
dt[, 2:3] # }---> returns the 2nd and 3rd columns
ma[, c("Y", "Z")] # /
df[, c("Y", "Z")] # /
dt[, c("Y", "Z")] # /
In the last case, mycols is evaluated as the name of a column. Because dt cannot find a column named mycols, an
error is raised.
Note: For versions of the data.table package prior to 1.9.8, this behavior was slightly different. Anything in the
column index would have been evaluated using dt as an environment, so both dt[, 2:3] and dt[, mycols] would
have returned the values 2:3 and c("Y", "Z") themselves rather than the corresponding columns.
There are many reasons to write code that is guaranteed to work with data.frame and data.table. Maybe you are
forced to use data.frame, or you may need to share some code that you don't know how will be used. So, there are
some main strategies for achieving this, in order of convenience:
Subset rows. It's simple, just use the [, ] selector, with the comma:
A[1:10, ]
A[A$var > 17, ] # A[var > 17, ] just works for data.table
Subset columns. If you want a single column, use the $ or the [[ ]] selector:
A$var
colname <- 'var'
A[[colname]]
A[[1]]
If you want a uniform way to grab more than one column, it's necessary to work around the differences a bit:
Subset 'indexed' rows. While data.frame has row.names, data.table has its unique key feature. The best thing is
to avoid row.names entirely and take advantage of the existing optimizations in the case of data.table when
possible.
B <- A[A$var != 0, ]
# or...
B <- with(A, A[var != 0, ]) # data.table will silently index A by var before subsetting
Get a 1-column table, get a row as a vector. These are easy with what we have seen until now:
In the past (pre 1.9.6), your data.table was sped up by setting columns as keys to the table, particularly for large
tables. [See intro vignette page 5 of September 2015 version, where speed of search was 544 times better.] You
may find older code making use of this setting keys with 'setkey' or setting a 'key=' column when setting up the
table.
library(data.table)
DT <- data.table(
x = letters[1:5],
y = 5:1,
z = (1:5) > 3
)
#> DT
# x y z
#1: a 5 FALSE
#2: b 4 FALSE
#3: c 3 FALSE
#4: d 2 TRUE
#5: e 1 TRUE
Set your key with the setkey command. You can have a key with multiple columns.
setkey(DT, y)
tables()
> tables()
NAME NROW NCOL MB COLS KEY
[1,] DT 5 3 1 x,y,z y
Total: 1MB
#> DT
# x y z
#1: e 1 TRUE
#2: d 2 TRUE
#3: c 3 FALSE
#4: b 4 FALSE
#5: a 5 FALSE
Now it is unnecessary
Prior to v1.9.6 you had to have set a key for certain operations, especially joining tables. The developers of
data.table have since sped these up and introduced an "on=" feature that can replace the dependency on keys. See the SO answer
here for a detailed discussion.
In January 2017, the developers wrote a vignette around secondary indices which explains the "on" syntax and
allows for other columns to be identified for fast indexing.
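For example, a secondary index on x can be set with:
setindex(DT, x)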
This does not sort the table (unlike key), but does allow for quick indexing using the "on" syntax. Note there can be
only one key, but you can use multiple secondary indices, which saves having to rekey and resort the table. This will
speed up your subsetting when changing the columns you want to subset on.
DT
# x y z
# 1: e 1 TRUE
# 2: d 2 TRUE
# 3: c 3 FALSE
# 4: b 4 FALSE
# 5: a 5 FALSE
# old way would have been rekeying DT from y to x, doing subset and
# perhaps keying back to y (now we save two sorts)
# This is a toy example above but would have been more valuable with big data sets
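A fast subset using the secondary index looks like, for example:
DT["c", on = "x"]
#    x y     z
# 1: c 3 FALSE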
data("USArrests")
head(USArrests)
Use ?USArrests to find out more. First, convert to data.table. The names of states are row names in the original
data.frame.
library(data.table)
DT <- as.data.table(USArrests, keep.rownames=TRUE)
This is data in the wide form. It has a column for each variable. The data can also be stored in long form without
loss of information. The long form has one column that stores the variable names. Then, it has another column for
the variable values. The long form of USArrests looks like so.
We use the melt function to switch from wide form to long form.
By default, melt treats all columns with numeric data as variables with values. In USArrests, the variable UrbanPop
represents the percentage urban population of a state. It is different from the other variables, Murder, Assault and
Rape, which are violent crimes reported per 100,000 people. Suppose we want to retain UrbanPop column. We
achieve this by setting id.vars as follows.
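For example (the object and column names below are one possible choice):
DTm <- melt(DT, id.vars = c("rn", "UrbanPop"),
            variable.name = "Crime", value.name = "Rate")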
Note that we have specified the names of the column containing category names (Murder, Assault, etc.) with
variable.name and the column containing the values with value.name. Our data looks like so.
Generating summaries with a split-apply-combine style approach is a breeze. For example, how do we summarize violent
crimes by state?
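One way, using the long-form table built above:
DTm[, .(ViolentCrime = sum(Rate)), by = .(State = rn)]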
This gives:
State ViolentCrime
1: Alabama 270.4
2: Alaska 317.5
3: Arizona 333.1
4: Arkansas 218.3
5: California 325.6
6: Colorado 250.6
To recover data from the previous example, use dcast like so.
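For example:
dcast(DTm, rn + UrbanPop ~ Crime, value.var = "Rate")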
When the operation produces a list of values in each cell, dcast provides a fun.aggregate method to handle the
situation. Say I am interested in states with similar urban population when investigating crime rates. I add a column
Decile with computed information.
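One way to compute such a column (a sketch):
DTm[, Decile := cut(UrbanPop, quantile(UrbanPop, seq(0, 1, 0.1)),
                    include.lowest = TRUE, labels = FALSE)]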
Now, casting Decile ~ Crime produces multiple values per cell. I can use fun.aggregate to determine how these
are handled; both text and numerical values can be handled this way.
There are multiple states in each decile of the urban population. Use fun.aggregate to specify how these should be
handled.
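For example:
dcast(DTm, Decile ~ Crime, value.var = "Rate", fun.aggregate = sum)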
This sums over the data for like states, giving the following.
The barplot() function is in the graphics package of R's system library. The barplot() function must be
supplied at least one argument. The R help calls this heights, which must be either a vector or a matrix. If it is a
vector, its members determine the heights of the bars, one bar per element.
> grades<-c("A+","A-","B+","B","C")
> Marks<-sample(grades,40,replace=T,prob=c(.2,.3,.25,.15,.1))
> Marks
[1] "A+" "A-" "B+" "A-" "A+" "B" "A+" "B+" "A-" "B" "A+" "A-"
[13] "A-" "B+" "A-" "A-" "A-" "A-" "A+" "A-" "A+" "A+" "C" "C"
[25] "B" "C" "B+" "C" "B+" "B+" "B+" "A+" "B+" "A-" "A+" "A-"
[37] "A-" "B" "C" "A+"
>
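A bar plot of these marks can then be drawn from their frequency table, for example:
barplot(table(Marks))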
Notice that the barplot() function places the factor levels on the x-axis in the lexicographical order of the levels.
Using the parameter names.arg, the bars in the plot can be placed in the order stated in the vector grades.
The size of the factor-level names on the x-axis can be increased using the cex.names parameter.
> gradTab
Algorithms Operating Systems Discrete Math
A- 13 10 7
A+ 10 7 2
B 4 2 14
B+ 8 19 12
C 5 2 5
plot(density(rnorm(100)),main="Normal density",xlab="x")
x=rnorm(100)
hist(x,prob=TRUE,main="Normal density + histogram")
lines(density(x),lty="dotted",col="red")
par()
par uses the arguments mfrow or mfcol to create a matrix of nrows and ncols c(nrows, ncols) which will serve as
a grid for your plots. The following example shows how to combine four plots in one graph:
par(mfrow=c(2,2))
plot(cars, main="Speed vs. Distance")
hist(cars$speed, main="Histogram of Speed")
boxplot(cars$dist, main="Boxplot of Distance")
boxplot(cars$speed, main="Boxplot of Speed")
The layout() is more flexible and allows you to specify the location and the extent of each plot within the final
combined graph. This function expects a matrix object as an input:
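For example:
layout(matrix(c(1, 1, 2, 3), nrow = 2, byrow = TRUE))  # plot 1 spans the whole top row
hist(cars$speed)
boxplot(cars$dist)
plot(cars)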
If you want to make a plot which has the y_values on the vertical axis and the x_values on the horizontal axis, you can use
the following commands:
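plot(x_values, y_values)  # x_values and y_values are any two numeric vectors of equal length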
You can type ?plot() in the console to read about more options.
Boxplot
You have some variables and you want to examine their Distributions
Histograms
Pie_charts
hist(ldeaths)
Here is an example of a matrix containing four sets of random draws, each with a different mean.
xmat <- cbind(rnorm(100, -3), rnorm(100, -1), rnorm(100, 1), rnorm(100, 3))
head(xmat)
# [,1] [,2] [,3] [,4]
# [1,] -3.072793 -2.53111494 0.6168063 3.780465
# [2,] -3.702545 -1.42789347 -0.2197196 2.478416
# [3,] -2.890698 -1.88476126 1.9586467 5.268474
# [4,] -3.431133 -2.02626870 1.1153643 3.170689
# [5,] -4.532925 0.02164187 0.9783948 3.162121
# [6,] -2.169391 -1.42699116 0.3214854 4.480305
One way to plot all of these observations on the same graph is to do one plot call followed by three more points
or lines calls.
Much more convenient in this situation is to use the matplot function, which only requires one call and
automatically takes care of axis limits and changing the aesthetics for each column to make them distinguishable.
Like plot, if given only one object, matplot assumes it's the y variable and uses the indices for x. However, x and y
can be specified explicitly.
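For example:
matplot(xmat, type = "l")                                # indices on the x-axis
matplot(x = seq_len(nrow(xmat)), y = xmat, type = "l")   # x given explicitly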
plot(ecdf(rnorm(100)),main="Cumulative distribution",xlab="x")
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
boxplot(iris[,1],xlab="Sepal.Length",ylab="Length(in centimeters)",
main="Summary Charateristics of Sepal.Length(Iris Data)")
boxplot(Sepal.Length~Species,data = iris)
Bring order
To change the order of the boxes in the plot, you have to change the order of the categorical variable's levels.
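For example, putting virginica first (an arbitrary order, on a copy of the data):
iris2 <- iris
iris2$Species <- factor(iris2$Species, levels = c("virginica", "versicolor", "setosa"))
boxplot(Sepal.Length ~ Species, data = iris2)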
If you want to specify a better name for your groups, you can use the names parameter. It takes a vector of the size of
the levels of the categorical variable.
boxplot(Sepal.Length~Species,data = iris,col=c("green","yellow","orange"))
Median
Whisker
Staple
Outliers
Example
par(mfrow=c(1,2))
# Default
boxplot(Sepal.Length ~ Species, data=iris)
# Modified
boxplot(Sepal.Length ~ Species, data=iris,
boxlty=2, boxlwd=3, boxfill="cornflowerblue", boxcol="darkblue",
medlty=2, medlwd=2, medcol="red", medpch=21, medcex=1, medbg="white",
whisklty=2, whisklwd=3, whiskcol="darkblue",
staplelty=2, staplelwd=2, staplecol="red",
outlty=3, outlwd=3, outcol="grey", outpch=NA
)
set.seed(47)
sweetsWide <- data.frame(date = 1:20,
chocolate = runif(20, min = 2, max = 4),
iceCream = runif(20, min = 0.5, max = 1),
candy = runif(20, min = 1, max = 3))
head(sweetsWide)
## date chocolate iceCream candy
## 1 1 3.953924 0.5890727 1.117311
## 2 2 2.747832 0.7783982 1.740851
## 3 3 3.523004 0.7578975 2.196754
## 4 4 3.644983 0.5667152 2.875028
## 5 5 3.147089 0.8446417 1.733543
## 6 6 3.382825 0.6900125 1.405674
To convert sweetsWide to long format for use with ggplot2, several useful functions from base R, and the packages
reshape2, data.table and tidyr (in chronological order) can be used:
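For example, with reshape2 (one of the packages listed above):
library(reshape2)
sweetsLong <- melt(sweetsWide, id.vars = "date",
                   variable.name = "sweet", value.name = "price")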
head(sweetsLong)
## date sweet price
## 1 1 chocolate 3.953924
## 2 2 chocolate 2.747832
## 3 3 chocolate 3.523004
## 4 4 chocolate 3.644983
## 5 5 chocolate 3.147089
## 6 6 chocolate 3.382825
See also Reshaping data between long and wide forms for details on converting data between long and wide format.
The resulting sweetsLong has one column of prices and one column describing the type of sweet. Now plotting is
much simpler:
library(ggplot2)
ggplot(sweetsLong, aes(x = date, y = price, colour = sweet)) + geom_line()
library(ggplot2)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
geom_point()
This gives:
basic qplot
adding a smoother
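A sketch of what those two variants might look like:
qplot(Petal.Width, Petal.Length, data = iris)                # basic qplot
qplot(Petal.Width, Petal.Length, data = iris,
      geom = c("point", "smooth"))                           # adding a smoother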
set.seed(1)
colorful <- sample(c("red", "Red", "RED", "blue", "Blue", "BLUE", "green", "gren"),
size = 20,
replace = TRUE)
colorful <- factor(colorful)
table(colorful)
colorful
blue Blue BLUE green gren red Red RED
3 1 4 2 4 1 3 2
This table, however, doesn't represent the true distribution of the data, and the categories may effectively be
reduced to three types: Blue, Green, and Red. Three examples are provided. The first illustrates what seems like an
obvious solution, but won't actually provide a solution. The second gives a working solution, but is verbose and
computationally expensive. The third is not an obvious solution, but is relatively compact and computationally
efficient.
[1] Green Blue Red Red Blue Red Red Red Blue Red Green Green Green Blue
Red Green
[17] Red Green Green Red
Levels: Blue Blue Blue Green Green Red Red Red
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
Notice that there are duplicated levels. We still have three categories for "Blue", which doesn't complete our task of
consolidating levels. Additionally, there is a warning that duplicated levels are deprecated, meaning that this code
may generate an error in the future.
[1] Green Blue Red Red Blue Red Red Red Blue Red Green Green Green Blue
Red Green
This code generates the desired result, but requires the use of nested ifelse statements. While there is nothing
wrong with this approach, managing nested ifelse statements can be a tedious task and must be done carefully.
A less obvious way of consolidating levels is to use a list where the name of each element is the desired category
name, and the element is a character vector of the levels in the factor that should map to the desired category. This
has the added advantage of working directly on the levels attribute of the factor, without having to assign new
objects.
levels(colorful) <-
list("Blue" = c("blue", "Blue", "BLUE"),
"Green" = c("green", "gren"),
"Red" = c("red", "Red", "RED"))
[1] Green Blue Red Red Blue Red Red Red Blue Red Green Green Green Blue
Red Green
[17] Red Green Green Red
Levels: Blue Green Red
The time required to execute each of these approaches is summarized below. (For the sake of space, the code to
generate this summary is not shown)
Unit: microseconds
expr min lq mean median uq max neval cld
factor 78.725 83.256 93.26023 87.5030 97.131 218.899 100 b
ifelse 104.494 107.609 123.53793 113.4145 128.281 254.580 100 c
list_approach 49.557 52.955 60.50756 54.9370 65.132 138.193 100 a
The list approach runs about twice as fast as the ifelse approach. However, except in times of very, very large
amounts of data, the differences in execution time will likely be measured in either microseconds or milliseconds.
With such small time differences, efficiency need not guide the decision of which approach to use. Instead, use an
approach that is familiar and comfortable, and which you and your collaborators will understand on future review.
> f
[1] n n n c c c
Levels: c n
> levels(f)
If you want to change the ordering of the levels, then one option is to specify the levels manually:
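For example:
f <- factor(f, levels = c("n", "c"))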
Factors have a number of properties. For example, levels can be given labels:
When a level of the factor is no longer used, you can drop it using the droplevels() function:
In some situations the treatment of the default ordering of levels (alphabetic/lexical order) will be acceptable. For
example, if one justs want to plot the frequencies, this will be the result:
plot(f,col=1:length(levels(f)))
When it is possible, we can recreate the factor using the levels parameter with the order we want.
When the input levels are different than the desired output levels, we use the labels parameter which causes the
levels parameter to become a "filter" for acceptable input values, but leaves the final values of "levels" for the
factor vector as the argument to labels:
When there is one specific level that needs to be the first we can use relevel. This happens, for example, in the
context of statistical analysis, when a base category is necessary for testing hypothesis.
all.equal(f, g)
# [1] "Attributes: < Component “levels”: 2 string mismatches >"
all.equal(f, g, check.attributes = F)
# [1] TRUE
3. Reordering factors
There are cases when we need to reorder the levels based on a number, a partial result, a computed statistic, or
previous calculations. Let's reorder based on the frequencies of the levels
table(g)
# g
# n c W
# 20 14 17
The reorder function is generic (see help(reorder)), but in this context needs: x, in this case the factor; X, a
numeric value of the same length as x; and FUN, a function to be applied to X and computed by level of the x, which
determines the levels order, by default increasing. The result is the same factor with its levels reordered.
When there is a quantitative variable related to the factor variable, we could use other functions to reorder the
levels. Let's take the iris data (help("iris") for more information), reordering the Species factor by using its
mean Sepal.Width.
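Working on a copy called miris:
miris <- iris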
miris$Species.o<-with(miris,reorder(Species,-Sepal.Width))
levels(miris$Species.o)
# [1] "setosa" "virginica" "versicolor"
The usual boxplot (say, with(miris, boxplot(Petal.Width~Species))) will show the species in this order: setosa,
versicolor, and virginica. But using the ordered factor we get the species ordered by their mean Sepal.Width:
f1<-f
levels(f1)
# [1] "c" "n" "W"
levels(f1) <- c("upper","upper","CAP") #rename and grouping
levels(f1)
# [1] "upper" "CAP"
f2<-f1
levels(f2) <- c("upper","CAP", "Number") #add Number level, which is empty
levels(f2)
# [1] "upper" "CAP" "Number"
f2[length(f2):(length(f2)+5)]<-"Number" # add cases for the new level
table(f2)
# f2
# upper CAP Number
# 33 17 6
- Ordered factors
Finally, we know that ordered factors are different from factors: the former are used to represent ordinal data,
the latter to work with nominal data. At first, it does not make sense to change the order of levels for
ordered factors, but we can change their labels.
of1<-of
levels(of1)<- c("LOW", "MEDIUM", "HIGH")
levels(of1)
# [1] "LOW" "MEDIUM" "HIGH"
is.ordered(of1)
# [1] TRUE
of1
# [1] LOW LOW LOW LOW LOW LOW LOW MEDIUM MEDIUM HIGH HIGH HIGH HIGH
# Levels: LOW < MEDIUM < HIGH
Factors are used to represent variables that take values from a set of categories, known as Levels in R. For example,
some experiment could be characterized by the energy level of a battery, with four levels: empty, low, normal, and
full. Then, for 5 different sampling sites, those levels could be identified, in those terms, as follows:
Typically, in databases or other information sources, the handling of these data is by arbitrary integer indices
associated with the categories or levels. If we assume that, for the given example, we would assign the indices as
follows: 1 = empty, 2 = low, 3 = normal, 4 = full, then the 5 samples could be coded as:
4, 4, 3, 1, 2
It could happen that, from your source of information, e.g. a database, you only have the encoded list of integers,
and the catalog associating each integer with each level-keyword. How can a factor of R be reconstructed from that
information?
Solution
set.seed(18)
ii <- sample(1:4, 20, replace=T)
ii
[1] 4 3 4 1 1 3 2 3 2 1 3 4 1 2 4 1 3 1 4 1
The first step is to make a factor, from the previous sequence, in which the levels or categories are exactly the
numbers from 1 to 4.
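For example (calling the factor fii):
fii <- factor(ii)
fii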
[1] 4 3 4 1 1 3 2 3 2 1 3 4 1 2 4 1 3 1 4 1
Levels: 1 2 3 4
Now simply, you have to dress the factor already created with the index tags:
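levels(fii) <- c("empty", "low", "normal", "full")
fii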
[1] full normal full empty empty normal low normal low empty
[11] normal full empty low full empty normal empty full empty
Levels: empty low normal full
Is there a match?
grepl() is used to check whether a word or regular expression exists in a string or character vector. The function
returns a TRUE/FALSE (or "Boolean") vector.
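Suppose, for example, the two test sentences used below:
test_sentences <- c("The quick brown fox quickly", "jumps over the lazy dog")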
Notice that we can check each string for the word "fox" and receive a Boolean vector in return.
grepl("fox", test_sentences)
#[1] TRUE FALSE
Match locations
grep takes in a character string and a regular expression. It returns a numeric vector of indexes. This will return
which sentence contains the word "fox" in it.
grep("fox", test_sentences)
#[1] 1
Matched values
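To get the matching values themselves rather than their indexes, use value = TRUE:
grep("fox", test_sentences, value = TRUE)
#[1] "The quick brown fox quickly"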
Details
Since the "fox" pattern is just a word, rather than a regular expression, we could improve performance (with either
grep or grepl) by specifying fixed = TRUE.
To select sentences that don't match a pattern, one can use grep with invert = TRUE; or follow subsetting rules
with -grep(...) or !grepl(...).
In both grepl(pattern, x) and grep(pattern, x), the x parameter is vectorized, the pattern parameter is not. As
a result, you cannot use these directly to match pattern[1] against x[1], pattern[2] against x[2], and so on.
Summary of matches
After performing e.g. the grepl command, maybe you want to get an overview of how many matches were
TRUE or FALSE. This is useful e.g. in case of big data sets. In order to do so, run the summary command:
# find matches
matches <- grepl("fox", test_sentences)
# overview
summary(matches)
In R, matching and replacement functions have two versions: first match and global match:
sub(pattern,replacement,text) will replace the first occurrence of pattern by replacement in text
gsub(pattern,replacement,text) will do the same as sub but for each occurrence of pattern
regexpr(pattern,text) will return the position of match for the first instance of pattern
gregexpr(pattern,text) will return the positions of all matches
set.seed(123)
teststring <- paste0(sample(letters,20),collapse="")
# teststring
#[1] "htjuwakqxzpgrsbncvyo"
Let's see how this works if we want to replace vowels by something else:
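# replace the first vowel, then every vowel, with "_" (the replacement character is arbitrary)
sub("[aeiou]", "_", teststring)
#[1] "htj_wakqxzpgrsbncvyo"
gsub("[aeiou]", "_", teststring)
#[1] "htj_w_kqxzpgrsbncvy_"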
Now let's see how we can find a consonant immediately followed by one or more vowel:
regexpr("[^aeiou][aeiou]+",teststring)
#[1] 3
#attr(,"match.length")
#[1] 2
#attr(,"useBytes")
#[1] TRUE
gregexpr("[^aeiou][aeiou]+",teststring)
#[[1]]
#[1] 3 5 19
#attr(,"match.length")
#[1] 2 2 2
All this is really great, but it only gives us the positions of the match, and it's not so easy to get what was matched. This is
where regmatches comes in; its sole purpose is to extract the string matched from regexpr, but it has a different syntax.
Let's save our matches in a variable and then extract them from the original string:
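m <- gregexpr("[^aeiou][aeiou]+", teststring)
regmatches(teststring, m)
#[[1]]
#[1] "ju" "wa" "yo"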
This may sound strange not to have a shortcut, but it allows extraction from another string using the matches of our
first one (think of comparing two long vectors where you know there is a common pattern for the first but not for the
second; this allows an easy comparison):
Attention note: by default the pattern is not a Perl Compatible Regular Expression, so some things like lookarounds are
not supported; however, each function presented here allows a perl=TRUE argument to enable them.
sub("brown","red", test_sentences)
#[1] "The quick red fox quickly" "jumps over the lazy dog"
Now, let's make the "fast" fox act "fastly". This won't do it:
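sub("quick", "fast", test_sentences)
#[1] "The fast brown fox quickly" "jumps over the lazy dog"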
sub only makes the first available replacement, we need gsub for global replacement:
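gsub("quick", "fast", test_sentences)
#[1] "The fast brown fox fastly"  "jumps over the lazy dog"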
The first acceleration is the usage of the perl = TRUE option. Even faster is the option fixed = TRUE. A complete
example would be:
In case of text mining, often a corpus gets used. A corpus cannot be used directly with grepl. Therefore, consider
this function:
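Consider, for example, the following vector (chosen to match the encoding shown below):
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)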
r <- rle(dat)
r
# Run Length Encoding
# lengths: int [1:6] 1 3 1 1 2 2
# values : num [1:6] 1 2 3 1 4 1
r$values
# [1] 1 2 3 1 4 1
This captures that we first saw a run of 1's, then a run of 2's, then a run of 3's, then a run of 1's, and so on.
r$lengths
# [1] 1 3 1 1 2 2
We see that the initial run of 1's was of length 1, the run of 2's that followed was of length 3, and so on.
The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1
with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean
values are 1.5, 4, and 6).
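The data might look like this:
dat <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6)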
In base R, we would first compute the run-length encoding of the x variable using rle:
(r <- rle(dat$x))
# Run Length Encoding
# lengths: int [1:3] 2 3 1
# values : num [1:3] 1 2 1
Now we can use tapply to compute the mean y value for each run by grouping on the run id:
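run_id <- rep(seq_along(r$lengths), r$lengths)   # 1 1 2 2 2 3
tapply(dat$y, run_id, mean)
#   1   2   3
# 1.5 4.0 6.0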
set.seed(144)
dat <- sample(rep(0:1, c(1, 1e5)), 1e7, replace=TRUE)
table(dat)
# 0 1
# 103 9999897
Storing 10 million entries will require significant space, but we can instead create a data frame with the run-length
encoding of this vector:
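rle.df <- with(rle(dat), data.frame(values, lengths))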
From the run-length encoding, we see that the first 52,818 values in the vector are 1's, followed by a single 0,
followed by 219,329 consecutive 1's, followed by a 0, and so on. The run-length encoding only has 207 entries,
requiring us to store only 414 values instead of 10 million values. As rle.df is a data frame, it can be stored using
standard functions like write.csv.
Decompressing a vector in run-length encoding can be accomplished in two ways. The first method is to simply call
rep, passing the values element of the run-length encoding as the first argument and the lengths element of the
run-length encoding as the second argument:
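dat.inv <- rep(rle.df$values, rle.df$lengths)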
We can confirm that our decompressed data is identical to our original data:
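identical(dat.inv, dat)
# [1] TRUE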
The second method is to use R's built-in inverse.rle function on the rle object, for instance:
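dat.inv <- inverse.rle(rle(dat))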
We can confirm again that this produces exactly the original dat:
identical(dat.inv, dat)
# [1] TRUE
library(data.table)
(DT <- data.table(x = c(1, 1, 2, 2, 2, 1), y = 1:6))
# x y
# 1: 1 1
# 2: 1 2
# 3: 2 3
# 4: 2 4
# 5: 2 5
# 6: 1 6
The variable x has three runs: a run of length 2 with value 1, a run of length 3 with value 2, and a run of length 1
with value 1. We might want to compute the mean value of variable y in each of the runs of variable x (these mean
values are 1.5, 4, and 6).
The data.table rleid function provides an id indicating the run id of each element of a vector:
rleid(DT$x)
# [1] 1 1 2 2 2 3
One can then easily group on this run ID and summarize the y data:
DT[,mean(y),by=.(x, rleid(x))]
# x rleid V1
# 1: 1 1 1.5
# 2: 2 2 4.0
# 3: 1 3 6.0
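The function discussed next might look like the following (a base R sketch mirroring the Rcpp version shown further down):
repeatedCosPlusOne <- function(first, len) {
  x <- numeric(len)
  x[1] <- first
  for (i in 2:len) {
    x[i] <- cos(x[i - 1] + 1)
  }
  x
}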
This code involves a for loop with a fast operation (cos(x[i-1]+1)), which often benefits from vectorization.
However, it is not trivial to vectorize this operation with base R, since R does not have a "cumulative cosine of x+1"
function.
One possible approach to speeding this function would be to implement it in C++, using the Rcpp package:
library(Rcpp)
cppFunction("NumericVector repeatedCosPlusOneRcpp(double first, int len) {
NumericVector x(len);
x[0] = first;
for (int i=1; i < len; ++i) {
x[i] = cos(x[i-1]+1);
}
return x;
}")
This often provides significant speedups for large computations while yielding the exact same results:
In this case, the Rcpp code generates a vector of length 1 million in 0.03 seconds instead of 1.31 seconds with the
base R approach.
One simple approach to speeding up such a function without rewriting a single line of code is byte compiling the
code using the R compile package:
library(compiler)
repeatedCosPlusOneCompiled <- cmpfun(repeatedCosPlusOne)
The resulting function will often be significantly faster while still returning the same results:
In this case, byte compiling sped up the tough-to-vectorize operation on a vector of length 1 million from 1.20
seconds to 0.34 seconds.
Remark
The essence of repeatedCosPlusOne, as the cumulative application of a single function, can be expressed more
transparently with Reduce:
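A sketch of such a definition (named to match the benchmark below):
repeatedCosPlusOne_vec <- function(first, len) {
  Reduce(function(prev, ignored) cos(prev + 1),
         x = seq_len(len - 1), init = first, accumulate = TRUE)
}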
library(microbenchmark)
microbenchmark(
repeatedCosPlusOne(1, 1e4),
repeatedCosPlusOne_vec(1, 1e4)
)
#> Unit: milliseconds
#>                              expr       min        lq     mean   median       uq      max neval cld
#>      repeatedCosPlusOne(1, 10000)  8.349261  9.216724 10.22715 10.23095 11.10817 14.33763   100  a
#>  repeatedCosPlusOne_vec(1, 10000) 14.406291 16.236153 17.55571 17.22295 18.59085 24.37059   100   b
require(maps)
map()
The color of the outline can be changed by setting the color parameter, col, to either the character name or hex
value of a color:
require(maps)
map(col = "cornflowerblue")
require(maps)
map(fill = TRUE, col = c("cornflowerblue"))
require(maps)
map(fill = TRUE, col = c("cornflowerblue", "limegreen", "hotpink"))
In the example above colors from col are assigned arbitrarily to polygons in the map representing regions and
colors are recycled if there are fewer colors than polygons.
We can also use color coding to represent a statistical variable, which may optionally be described in a legend. A
map created as such is known as a "choropleth".
The following choropleth example sets the first argument of map() (the database) to "county" and "state" in order to
color code unemployment using data from the built-in datasets unemp and county.fips, while overlaying state lines
in white:
require(maps)
if(require(mapproj)) { # mapproj is used for projection="polyconic"
# color US county map by 2009 unemployment rate
# match counties to map using FIPS county codes
# Based on J's solution to the "Choropleth Challenge"
# Code improvements by Hack-R (hack-r.github.io)
# load data
# unemp includes data for some counties not on the "lower 48 states" county
# map, such as those in Alaska, Hawaii, Puerto Rico, and some tiny Virginia
# cities
data(unemp)
data(county.fips)
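# The original example builds the color buckets at this point; a sketch of that step
# (the bucket boundaries and palette below are assumptions):
colors <- c("#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043")
unemp$colorBuckets <- as.numeric(cut(unemp$unemp, c(0, 2, 4, 6, 8, 10, 100)))
leg.txt <- c("<2%", "2-4%", "4-6%", "6-8%", "8-10%", ">10%")
# match counties on the map to rows of unemp via FIPS codes
cnty.fips <- county.fips$fips[match(map("county", plot = FALSE)$names,
                                    county.fips$polyname)]
colorsmatched <- unemp$colorBuckets[match(cnty.fips, unemp$fips)]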
# draw map
par(mar=c(1, 1, 2, 1) + 0.1)
map("county", col = colors[colorsmatched], fill = TRUE, resolution = 0,
lty = 0, projection = "polyconic")
map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.1,
projection="polyconic")
title("unemployment by county, 2009")
legend("topright", leg.txt, horiz = TRUE, fill = colors, cex=0.6)
}
Creating an attractive 50 state map is simple when leveraging Google Maps. Interfaces to Google's API include the
packages googleVis, ggmap, and RgoogleMaps.
require(googleVis)
The function gvisGeoChart() requires far less coding to create a choropleth compared to older mapping methods,
such as map() from the package maps. The colorvar parameter allows easy coloring of a statistical variable, at a
level specified by the locationvar parameter. The various options passed to options as a list allow customization
of the map's details such as size (height), shape (markers), and color coding (colorAxis and colors).
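A sketch of such a call, assuming a hypothetical data frame states with columns state.name and unemployment:
G <- gvisGeoChart(states, locationvar = "state.name", colorvar = "unemployment",
                  options = list(region = "US", displayMode = "regions",
                                 resolution = "provinces"))
plot(G)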
library(plotly)
map_data("county") %>%
group_by(group) %>%
plot_ly(x = ~long, y = ~lat) %>%
add_polygons() %>%
layout(
xaxis = list(title = "", showgrid = FALSE, showticklabels = FALSE),
yaxis = list(title = "", showgrid = FALSE, showticklabels = FALSE)
)
The next example is a "strictly native" approach that leverages the layout.geo attribute to set the aesthetics and
zoom level of the map. It also uses the database world.cities from maps to filter the Brazilian cities and plot them
on top of the "native" map.
The main variables: poph is a text with the city and its population (which is shown upon mouse hover); q is an ordered
factor derived from the population's quantiles. ge holds information for the layout of the map. See the package
documentation for more information.
library(maps)
dfb <- world.cities[world.cities$country.etc=="Brazil",]
library(plotly)
dfb$poph <- paste(dfb$name, "Pop", round(dfb$pop/1e6,2), " millions")
dfb$q <- with(dfb, cut(pop, quantile(pop), include.lowest = T))
levels(dfb$q) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
dfb$q <- as.ordered(dfb$q)
ge <- list(
scope = 'south america',
showland = TRUE,
landcolor = toRGB("gray85"),
subunitwidth = 1,
countrywidth = 1,
subunitcolor = toRGB("white"),
countrycolor = toRGB("white")
)
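A sketch of the call that draws the cities on top of the native map using these variables and the geo layout (marker sizing and title are illustrative choices):
plot_geo(dfb, lon = ~long, lat = ~lat, text = ~poph, color = ~q,
         marker = ~list(size = sqrt(pop/10000) + 1, line = list(width = 0))) %>%
  layout(title = 'Populations of Brazilian cities', geo = ge)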
The interface is piped, using a leaflet() function to initialize a map and subsequent functions adding (or
removing) map layers. Many kinds of layers are available, from markers with popups to polygons for creating
choropleth maps. Variables in the data.frame passed to leaflet() are accessed via function-style ~ quotation.
library(leaflet)
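A minimal sketch of such a piped map (coordinates and popup text are illustrative):
leaflet() %>%
  addTiles() %>%                                        # default OpenStreetMap tiles
  addMarkers(lng = 144.96, lat = -37.81, popup = "A marker")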
In the ui you call leafletOutput() and in the server you call renderLeaflet()
library(shiny)
library(leaflet)
ui <- fluidPage(
leafletOutput("my_leaf")
)
server <- function(input, output, session){
  output$my_leaf <- renderLeaflet({
    leaflet() %>%
      addProviderTiles('Hydda.Full') %>%
      setView(lat = -37.8, lng = 144.8, zoom = 10)
  })
}
shinyApp(ui, server)
However, reactive inputs that affect the renderLeaflet expression will cause the entire map to be redrawn each
time the reactive element is updated.
Normally you use leaflet to create the static aspects of the map, and leafletProxy to manage the dynamic
elements, for example:
library(shiny)
library(leaflet)
ui <- fluidPage(
sliderInput(inputId = "slider",
label = "values",
min = 0,
max = 100,
value = 0,
step = 1),
leafletOutput("my_leaf")
)
server <- function(input, output, session){
  output$my_leaf <- renderLeaflet({
    leaflet() %>%
      addProviderTiles('Hydda.Full') %>%
      setView(lat = -37.8, lng = 144.8, zoom = 8)
  })
  ## filter data
  df_filtered <- reactive({
    df[df$value >= input$slider, ]
  })
  ## update the map via leafletProxy instead of redrawing it
  observe({
    leafletProxy("my_leaf", data = df_filtered()) %>%
      clearMarkers() %>%
      addMarkers(lng = ~lng, lat = ~lat)
  })
}
shinyApp(ui, server)
v = "A"
w = c("A", "A")
However, a set contains only one copy of each element. R treats a vector like a set by taking only its distinct
elements, so the two vectors above are regarded as the same:
setequal(v, w)
# TRUE
Combining sets
x = c(1, 2, 3)
y = c(2, 4)
union(x, y)
# 1 2 3 4
intersect(x, y)
# 2
setdiff(x, y)
# 1 3
X = c(1, 1, 2)
Y = c(4, 5)
expand.grid(X, Y)
# Var1 Var2
# 1 1 4
# 2 1 4
# 3 2 4
# 4 1 5
# 5 1 5
# 6 2 5
The result is a data.frame with one column for each vector passed to it. Often, we want to take the Cartesian
product of sets rather than to expand a "grid" of vectors. We can use unique, lapply and do.call:
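For example, taking the distinct values of X and Y first:
(cp <- do.call(expand.grid, lapply(list(X = X, Y = Y), unique)))
#   X Y
# 1 1 4
# 2 2 4
# 3 1 5
# 4 2 5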
If you then want to apply a function to each resulting combination f(x,y), it can be added as another column:
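For instance, with a hypothetical f(x, y) = x * y:
cp$f <- with(cp, X * Y)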
This approach works for as many vectors as we need, but in the special case of two, it is sometimes a better fit to
have the result in a matrix, which can be achieved with outer:
uX = unique(X)
uY = unique(Y)
names(uX) <- uX          # names become dimnames in the result of outer
names(uY) <- uY
outer(uX, uY)            # default FUN is "*", so each entry is a product
#   4  5
# 1 4  5
# 2 8 10
v = "A"
w = c("A", "A")
w %in% v
# TRUE TRUE
v %in% w
# TRUE
Each element on the left is treated individually and tested for membership in the set associated with the vector on
the right (consisting of all its distinct elements).
x = c(2, 1, 1, 2, 1)
unique(x)
# 2 1
duplicated(x)
# FALSE FALSE TRUE TRUE TRUE
anyDuplicated(x) > 0L is a quick way of checking whether a vector contains any duplicates.
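The example below uses a small helper that cross-tabulates membership in two sets; a minimal sketch of such a function:
xtab_set <- function(A, B){
  both <- union(A, B)
  inA  <- both %in% A
  inB  <- both %in% B
  table(inA, inB)
}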
A = 1:20
B = 10:30
xtab_set(A, B)
# inB
# inA FALSE TRUE
# FALSE 0 10
# TRUE 9 11
A Venn diagram, offered by various packages, can be used to visualize overlap counts across multiple sets.
tidyverse is a fast and elegant way to turn base R into an enhanced tool, designed by Hadley Wickham/RStudio. The
development of all packages included in the tidyverse follows the principles of The tidy tools manifesto. But first,
let the authors describe their masterpiece:
The tidyverse is a set of packages that work in harmony because they share common data
representations and API design. The tidyverse package is designed to make it easy to install and load core
packages from the tidyverse in a single command.
The best place to learn about all the packages in the tidyverse and how they fit together is R for Data
Science. Expect to hear more about the tidyverse in the coming months as I work on improved package
websites, making citation easier, and providing a common home for discussions about data analysis with
the tidyverse.
(source)
Just with the ordinary R packages, you need to install and load the package.
install.packages("tidyverse")
library("tidyverse")
The difference is that a single command installs/loads a couple of dozen packages. As a bonus, one may
rest assured that all the installed/loaded packages are of compatible versions.
Data import:
And modelling:
modelr: provides functions that help you create elegant pipelines when modelling
broom: easily extract the models into tidy data
knitr: the amazing general-purpose literate programming engine, with lightweight APIs designed to give
users full control of the output without heavy coding work.
rmarkdown: RStudio's package for reproducible programming.
library(tibble)
mtcars_tbl <- as_data_frame(mtcars)
One of the most notable differences between data.frames and tbl_dfs is how they print:
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# ... with 22 more rows
The printed output includes a summary of the dimensions of the table (32 x 11)
It includes the type of each column (dbl)
It prints a limited number of rows. (To change this use options(tibble.print_max = [number])).
Many functions in the dplyr package work naturally with tbl_dfs, such as group_by().
// [[Rcpp::plugins(name)]]
// built-in C++1y plugin for C++14 and C++17 standard under development
// [[Rcpp::plugins(cpp1y)]]
Below is an example of compiling a C++ function within R. Note the use of "" to surround the source.
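A sketch of such a call; the function exfun here is a hypothetical example:
library(Rcpp)
cppFunction("
  double exfun(NumericVector x, int i) {
    // multiply the vector by i and return the sum of the result
    x = x * i;
    return sum(x);
  }
")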
# Calling function in R
exfun(1:5, 3)
// [[Rcpp::attribute]]
// [[Rcpp::export]]
that is placed directly above a declared function header when reading in a C++ file via sourceCpp().
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double varRcpp(NumericVector x, bool bias = true){
  int n = x.size();
  double m = mean(x);                    // Rcpp sugar mean
  double sq_dev = sum(pow(x - m, 2.0));  // sum of squared deviations
  // with the default bias = true this matches R's var(), which divides by n - 1
  return bias ? sq_dev / (n - 1) : sq_dev / n;
}
require(Rcpp)
all.equal(muRcpp(x), mean(x))
## TRUE
all.equal(varRcpp(x), var(x))
## TRUE
// [[Rcpp::depends(Rcpp<PACKAGE>)]]
Examples:
sample(5)
# [1] 4 5 3 1 2
sample(10:15)
# [1] 11 15 12 10 14 13
randperm(a, k)
# Generates one random permutation of k of the elements a, if a is a vector,
# or of 1:a if a is a single integer.
# a: integer or numeric vector of some length n.
# k: integer, smaller than a or length(a).
# Examples
library(pracma)
randperm(1:10, 3)
[1] 3 7 9
randperm(10, 10)
[1] 4 5 10 8 2 7 6 9 3 1
Log-normal distribution with 0 mean and standard deviation of 1 (on log scale)
rlnorm(5, meanlog=0, sdlog=1)
[1] 0.8725009 2.9433779 0.3329107 2.5976206 2.8171894
Multinomial distribution with 5 objects and 3 boxes using the specified probabilities
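For example (the probabilities below are assumed for illustration):
rmultinom(n = 1, size = 5, prob = c(0.2, 0.3, 0.5))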
> sample(1:10,5)
[1] 6 9 2 7 10
> sample(1:10,5)
[1] 7 6 1 2 10
> rnorm(5)
[1] 0.4874291 0.7383247 0.5757814 -0.3053884 1.5117812
> rnorm(5)
[1] 0.38984324 -0.62124058 -2.21469989 1.12493092 -0.04493361
However, if we set the seed to something identical in both cases (most people use 1 for simplicity), we get two
identical samples:
> set.seed(1)
> sample(letters,2)
[1] "g" "j"
> set.seed(1)
> sample(letters,2)
[1] "g" "j"
> set.seed(1)
> rexp(5)
[1] 0.7551818 1.1816428 0.1457067 0.1397953 0.4360686
> set.seed(1)
> rexp(5)
[1] 0.7551818 1.1816428 0.1457067 0.1397953 0.4360686
First, a function appropriate for parallelization must be created. Consider the mtcars dataset. A regression on mpg
could be improved by creating a separate regression model for each level of cyl.
Create a function that can loop through all the possible iterations of zlevels. This is still in serial, but is an
important step as it determines the exact process that will be parallelized.
Parallel computing using parallel cannot access the global environment. Luckily, each function creates a local
environment parallel can access. Creation of a wrapper function allows for parallelization. The function to be
applied also needs to be placed within the environment.
parallel::stopCluster(parallelcluster)
The parallel package includes the entire apply() family, prefixed with par.
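A minimal sketch of that workflow with a toy function (function and values are illustrative, not the mtcars example discussed above):
library(parallel)
parallelcluster <- makeCluster(detectCores() - 1)

square_plus <- function(i) i^2 + 1          # toy wrapper function
clusterExport(parallelcluster, "square_plus")  # export what the workers need

parLapply(parallelcluster, 1:10, square_plus)
parallel::stopCluster(parallelcluster)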
A simple use of the foreach loop is to calculate the sum of the square root and the square of all numbers from 1 to
100000.
library(foreach)
library(doSNOW)
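A sketch of such a loop, assuming a SNOW backend with two workers (the setup is illustrative):
cl <- makeCluster(2)
registerDoSNOW(cl)
result <- foreach(i = 1:100000, .combine = "+") %dopar% {
  k <- sqrt(i) + i^2   # k, the last expression evaluated, is what each iteration returns
  k
}
stopCluster(cl)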
The structure of the output of foreach is controlled by the .combine argument. The default output structure is a
list. In the code above, c is used to return a vector instead. Note that a calculation function (or operator) such as "+"
may also be used to perform a calculation and return a further processed object.
It is important to mention that the result of each foreach iteration is the value of the last expression in its body. Thus, in this example k will be added
to the result.
Parameter Details
.combine Function. Determines how the results of the loop are combined. Possible values are c, cbind, rbind, "+", "*"...
A set of seeds must be generated and sent to each parallel process. This is automatically done in some packages
(parallel, snow, etc.), but must be explicitly addressed in others.
library(parallel)
RNGkind("L'Ecuyer-CMRG")   # nextRNGStream() requires the L'Ecuyer-CMRG generator
s <- .Random.seed
numofcores <- 4            # example value
for (i in 1:numofcores) {
  s <- nextRNGStream(s)
  # send s to worker i as .Random.seed
}
Example
Create data
data(ToothGrowth)
The result from mcparallelDo returns in your targetEnvironment, e.g. .GlobalEnv, when it is complete with a
message (by default)
summary(interactionPredictorModel)
Other Examples
# Example of not returning a value until we return to the top level
for (i in 1:10) {
if (i == 1) {
> df3 <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
> df3
## x y
## 1 1 a
## 2 2 b
## 3 3 c
> is.data.frame(df3[1])
## TRUE
> is.list(df3[1])
## TRUE
Subsetting a dataframe into a column vector can be accomplished using double brackets [[ ]] or the dollar sign
operator $.
> typeof(df3$x)
## "integer"
> is.vector(df3$x)
## TRUE
Subsetting a data frame as a two-dimensional matrix can be accomplished using i and j indices.
Note: Subsetting by j (column) alone simplifies to the variable's own type, but subsetting by i alone returns a
data.frame, as the different variables may have different types and classes. Setting the drop parameter to FALSE
keeps the data frame.
> is.data.frame(df3[2, ])
## TRUE
The [ operator can also take a vector as the argument. For example, to select the first and third elements:
v1[c(1, 3)]
## [1] "a" "c"
Sometimes we may want to omit a particular value from the vector. This can be achieved using a negative sign (-)
before the index of that value. For example, to omit the first value from v1, use v1[-1]. This can be
extended to more than one value in a straightforward way. For example, v1[-c(1,3)].
> v1[-1]
[1] "b" "c" "d"
> v1[-c(1,3)]
[1] "b" "d"
On some occasions, especially when the vector is long, we may want to know the index at which a particular value occurs:
> v1=="c"
[1] FALSE FALSE TRUE FALSE
> which(v1=="c")
[1] 3
If the atomic vector has names (a names attribute), it can be subset using a character vector of names:
v <- 1:3
names(v) <- c("one", "two", "three")
v
## one two three
## 1 2 3
v["two"]
## two
## 2
The [[ operator can also be used to index atomic vectors, with the differences that it accepts an indexing vector with a
length of one and strips any names present:
v[[c(1, 2)]]
## Error in v[[c(1, 2)]] :
## attempt to select more than one element in vectorIndex
v[["two"]]
## [1] 2
Vectors can also be subset using a logical vector. In contrast to subsetting with numeric and character vectors, the
logical vector used to subset does not have to be the same length as the vector whose elements are extracted: if a logical
vector y is used to subset x, i.e. x[y], and length(y) < length(x), then y will be recycled to match length(x):
v[FALSE] # handy to discard elements but save the vector's type and basic structure
## named integer(0)
## a sample matrix
mat <- matrix(1:6, nrow = 2, dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))
mat[i,j] is the element in the i-th row, j-th column of the matrix mat. For example, an i value of 2 and a j value of
1 gives the number in the second row and the first column of the matrix. Omitting i or j returns all values in that
dimension.
mat[ , 3]
## row1 row2
## 5 6
mat[1, ]
# col1 col2 col3
# 1 3 5
When the matrix has row or column names (not required), these can be used for subsetting:
mat[ , 'col1']
# row1 row2
# 1 2
By default, the result of a subset will be simplified if possible. If the subset only has one dimension, as in the
examples above, the result will be a one-dimensional vector rather than a two-dimensional matrix. This default can
be overridden with the drop = FALSE argument to [:
Of course, dimensions cannot be dropped if the selection itself has two dimensions:
It is also possible to use a Nx2 matrix to select N individual elements from a matrix (like how a coordinate system
works). If you wanted to extract, in a vector, the entries of a matrix in the (1st row, 1st column), (1st row, 3rd
column), (2nd row, 3rd column), (2nd row, 1st column) this can be done easily by creating a index matrix
with those coordinates and using that to subset the matrix:
mat
# col1 col2 col3
# row1 1 3 5
# row2 2 4 6
ind <- rbind(c(1, 1), c(1, 3), c(2, 3), c(2, 1))
mat[ind]
# [1] 1 5 6 2
In the above example, the 1st column of the ind matrix refers to rows in mat, the 2nd column of ind refers to
columns in mat.
l1[1]
## [[1]]
## [1] 1 2 3
l1['two']
## $two
## [1] "a" "b" "c"
l1[[2]]
## [1] "a" "b" "c"
l1[['two']]
## [1] "a" "b" "c"
Note the result of l1[2] is still a list, as the [ operator selects elements of a list, returning a smaller list. The [[
operator extracts list elements, returning an object of the type of the list element.
Elements can be indexed by number or a character string of the name (if it exists). Multiple elements can be
selected with [ by passing a vector of numbers or strings of names. Indexing with a vector of length > 1 in [ and
[[ returns a "list" with the specified elements and a recursive subset (if available), respectively:
l1[c(3, 1)]
## [[1]]
## [[1]][[1]]
## [1] 10
##
## [[1]][[2]]
## [1] 20
##
Compared to:
l1[[c(3, 1)]]
## [1] 10
l1[[3]][[1]]
## [1] 10
The $ operator allows you to select list elements solely by name, but unlike [ and [[, does not require quotes. As an
infix operator, $ can only take a single name:
l1$two
## [1] "a" "b" "c"
l1$t
## [1] "a" "b" "c"
l1[["t"]]
## NULL
l1[["t", exact = FALSE]]
## [1] "a" "b" "c"
Setting options(warnPartialMatchDollar = TRUE), a "warning" is given when partial matching happens with $:
l1$t
## [1] "a" "b" "c"
## Warning message:
## In l1$t : partial match of 't' to 'two'
R vectors are 1-indexed, so for example x[1] will return 11. We can also extract a sub-vector of x by passing a
vector of indices to the bracket operator:
> x[c(2,4,6)]
[1] 12 14 16
If we pass a vector of negative indices, R will return a sub-vector with the specified indices excluded:
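For example, assuming x is 11:20 as in the surrounding examples:
> x[-c(1, 3)]
[1] 12 14 15 16 17 18 19 20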
We can also pass a boolean vector to the bracket operator, in which case it returns a sub-vector corresponding to
the coordinates where the indexing vector is TRUE:
> x[c(rep(TRUE,5),rep(FALSE,5))]
[1] 11 12 13 14 15
If the indexing vector is shorter than the length of the array, then it will be repeated, as in:
> x[c(TRUE,FALSE)]
[1] 11 13 15 17 19
> x[c(TRUE,FALSE,FALSE)]
[1] 11 14 17 20
For example, this is the case with "data.frame" (is.object(iris)) objects, where [.data.frame and [[.data.frame
methods are defined to exhibit both "matrix"-like and "list"-like subsetting. By forcing an error
when subsetting a "data.frame", we see that a function [.data.frame was actually called when we just used [.
iris[invalidArgument, ]
## Error in `[.data.frame`(iris, invalidArgument, ) :
## object 'invalidArgument' not found
x[c(3, 2, 4)]
## We'd expect 'x[c(3, 2, 4)]' to be returned, but this is a custom `[` method and should have a
## `?[.myClass` help page documenting its behaviour
## NULL
We can overcome the method dispatching of [ by using the equivalent non-generic .subset (and .subset2 for [[).
This is especially useful and efficient when programming our own "class"es and want to avoid work-arounds (like
unclass(x)) when computing on our "class"es efficiently (avoiding method dispatch and copying objects):
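A minimal sketch with a hypothetical classed object:
y <- structure(1:5, class = "myClass")
.subset(y, c(3, 2, 4))   # extracts elements without dispatching to any `[` method
## [1] 3 2 4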
Operator A op B Meaning
+ A + B Addition of corresponding elements of A and B
- A - B Subtracts the elements of B from the corresponding elements of A
/ A / B Divides the elements of A by the corresponding elements of B
* A * B Multiplies the elements of A by the corresponding elements of B
^ A ^ B Raises each element of A to the power given by the corresponding element of B; for example, A^(-1) gives a matrix whose elements are the reciprocals of A
For "true" matrix multiplication, as seen in Linear Algebra, use %*%. For example, multiplication of A with B is: A %*%
B. The dimensional requirement is that the ncol() of A be the same as the nrow() of B.
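For example:
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 3)   # 3 x 2
A %*% B                      # 2 x 2 matrix product
#      [,1] [,2]
# [1,]   22   49
# [2,]   28   64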
debug(mean)
mean(1:3)
All subsequent calls to the function will enter debugging mode. You can disable this behavior with undebug.
undebug(mean)
mean(1:3)
If you know you only want to enter the debugging mode of a function once, consider the use of debugonce.
debugonce(mean)
mean(1:3)
mean(1:3)
Once browser() is hit in the code the interactive interpreter will start. Any R code can be run as normal, and in
addition the following commands are present,
Command Meaning
c Exit browser and continue program
f Finish current loop or function
n Step Over (evaluate next statement, stepping over function calls)
s Step Into (evaluate next statement, stepping into function calls)
where Print stack trace
r Invoke "resume" restart
Q Exit browser and quit
toDebug <- function() {
  a <- 1   # starting values chosen for illustration
  b <- 2
  browser()
  for(i in 1:100) {
    a = a * b
  }
}
toDebug()
library(devtools)
install_github("authorName/repositoryName")
devtools::install_github("tidyverse/ggplot2")
The above command will install the version of ggplot2 that corresponds to the master branch. To install from a
different branch of a repository use the ref argument to provide the name of the branch. For example, the
following command will install the dev_general branch of the googleway package.
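A sketch of that call (the repository owner shown is an assumption):
devtools::install_github("SymbolixAU/googleway", ref = "dev_general")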
Another option is to use the ghit package. It provides a lightweight alternative for installing packages from github:
install.packages("ghit")
ghit::install_github("google/CausalImpact")
To install a package that is in a private repository on Github, generate a personal access token at
http://www.github.com/settings/tokens/ (See ?install_github for documentation on the same). Follow these steps:
1. install.packages(c("curl", "httr"))
2. config = httr::config(ssl_verifypeer = FALSE)
3. install.packages("RCurl")
options(RCurlOptions = c(getOption("RCurlOptions"),ssl.verifypeer = FALSE, ssl.verifyhost =
FALSE ) )
4. getOption("RCurlOptions")
ssl.verifypeer ssl.verifyhost
FALSE FALSE
5. library(httr)
This prevents the common error: "Peer certificate cannot be authenticated with given CA certificates"
install_github("username/package_name",auth_token="abc")
Sys.setenv(GITHUB_PAT = "access_token")
devtools::install_github("organisation/package_name")
The PAT generated in GitHub is only visible once, i.e., when created initially, so it's prudent to save that token in
.Rprofile. This is also helpful if the organisation has many private repositories.
Using CRAN
install.packages("dplyr")
More than one package can be installed in one go by passing a character vector of package names created with the
combine function c():
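For example (the package names are illustrative):
install.packages(c("dplyr", "data.table", "ggplot2"))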
In some cases, install.packages may prompt for a CRAN mirror or fail, depending on the value of
getOption("repos"). To prevent this, specify a CRAN mirror as repos argument:
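For example, using the main CRAN mirror:
install.packages("dplyr", repos = "https://cloud.r-project.org/")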
Using the repos argument it is also possible to install from other repositories. For complete information about all
the available options, run ?install.packages.
Most packages require functions that are implemented in other packages (e.g. the package data.table). In
order to install a package (or multiple packages) together with all the packages it uses, the
argument dependencies should be set to TRUE:
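For example:
install.packages("data.table", dependencies = TRUE)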
Using Bioconductor
Bioconductor hosts a substantial collection of packages related to Bioinformatics. They provide their own package
management centred around the biocLite function:
By default this installs a subset of packages that provide the most commonly used functionality. Specific packages
can be installed by passing a vector of package names. For example, to install RImmPort from Bioconductor:
source("https://bioconductor.org/biocLite.R")
biocLite("RImmPort")
Another command that opens a window to choose downloaded zip or tar.gz source files is:
install.packages(file.choose(), repos=NULL)
Step 1: Go to Tools.
Step 3: In the Install From set it as Package Archive File (.zip; .tar.gz)
Step 4: Then Browse to your package file (say crayon_1.3.1.zip), and once it shows the package path
and file name in the Package Archive tab, click Install.
Another way to install R package from local source is using install_local() function from devtools package.
library(devtools)
install_local("~/Downloads/dplyr-master.zip")
and then installing it in R. Any running R sessions with previous version of the package loaded will need to reload it.
unloadNamespace("my_package")
library(my_package)
A more convenient approach uses the devtools package to simplify the process. In an R session with the working
directory set to the package directory, the package can be rebuilt and reinstalled with a single call such as devtools::install().
pacman allows a user to compactly load all desired packages, installing any which are missing (and their
dependencies), with a single command, p_load. pacman does not require the user to type quotation marks around a
package name. Basic usage is as follows:
The only package requiring a library, require, or install.packages statement with this approach is pacman itself:
library(pacman)
p_load(data.table, dplyr, ggplot2)
In addition to saving time by requiring less code to manage packages, pacman also facilitates the construction of
reproducible code by installing any needed packages if and only if they are not already installed.
Since you may not be sure if pacman is installed in the library of a user who will use your code (or by yourself in
future uses of your own code) a best practice is to include a conditional statement to install pacman if it is not
already loaded:
if (!require(pacman)) install.packages("pacman")
pacman::p_load(data.table, dplyr, ggplot2)
packageVersion("seqinr")
# [1] ‘3.3.3’
packageVersion("RWeka")
# [1] ‘0.4.29’
search()
OR
(.packages())
help(package = "dplyr")
data(package = "dplyr")
library(dplyr)
ls("package:dplyr")
The directory where your code lives will be referred to as ./, and all the commands are meant to be executed from
an R prompt in this folder.
The documentation for your code has to be in a format which is very similar to LaTeX.
However, we will use a tool named roxygen in order to simplify the process:
install.packages("devtools")
library("devtools")
install.packages("roxygen2")
library("roxygen2")
The full man page for roxygen is available here. It is very similar to doxygen.
It is also recommended to create a vignette (see the topic Creating vignettes), which is a full guide to your
package.
Assuming that your code is written for instance in files ./script1.R and ./script2.R, launch the following
command in order to create the file tree of your package:
package.skeleton(name="MyPackage", code_files=c("script1.R","script2.R"))
Then delete all the files in ./MyPackage/man/. You have now to compile the documentation:
roxygenize("MyPackage")
You should also generate a reference manual from your documentation using R CMD Rd2pdf MyPackage from a
command prompt started in ./.
Modify ./MyPackage/DESCRIPTION according to your needs. The fields Package, Version, License, Description,
Title, Author and Maintainer are mandatory, the other are optional.
If your package depends on other packages, specify them in a field named Depends (R version < 3.2.0) or Imports (R
version > 3.2.0).
2. Optional folders
Once you launched the skeleton build, ./MyPackage/ only had R/ and man/ subfolders. However, it can have some
others:
data/: here you can place the data that your library needs and that isn't code. It must be saved as dataset
with the .RData extension, and you can load it at runtime with data() and load()
tests/: all the code files in this folder will be run at install time. If there is any error, the installation will fail.
src/: for C/C++/Fortran source files you need (using Rcpp...).
exec/: for other executables.
misc/: for nearly everything else.
To build your package as a source tarball, you need to execute the following command, from a command prompt in
./ : R CMD build MyPackage
Simply create a new repository called MyPackage and upload everything in MyPackage/ to the master branch. Here
is an example.
devtools::install_github("your_github_username/MyPackage")
Through CRAN
Your package needs to comply with the CRAN Repository Policy. This includes, but is not limited to: your package must be
cross-platform (except in some very special cases), and it should pass the R CMD check test.
Here is the submission form. You must upload the source tarball.
A vignette is a long-form guide to your package. Function documentation is great if you know the name of
the function you need, but it’s useless otherwise. A vignette is like a book chapter or an academic paper:
it can describe the problem that your package is designed to solve, and then show the reader how to
solve it.
Requirements
Rmarkdown: install.packages("rmarkdown")
Pandoc
Vignette creation
devtools::use_vignette("MyVignette", "MyPackage")
The only addition to the original Markdown, is a tag that takes R code, runs it, captures the output, and translates it
into formatted Markdown:
```{r}
# Add two numbers together
add <- function(a, b) a + b
add(10, 20)
```
Thus, all the packages you will use in your vignettes must be listed as dependencies in ./DESCRIPTION.
set.seed(1); n <- 100    # example setup; seed and series length chosen for illustration
w <- rnorm(n); x <- w    # w is the white-noise innovation series
# loop to create x (an AR(1) series with coefficient 0.7)
for (t in 2:n) x[t] <- 0.7 * x[t-1] + w[t]
plot(x, type = 'l')
Fitting an AR(1) model to the simulated series, notice that the estimated coefficient is close to the true value (0.7) used to generate the data.
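A minimal sketch of such a fit:
fit <- arima(x, order = c(1, 0, 0))
fit$coef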
acf(fit$resid)
dnorm(0)
In the same way pnorm(0) gives .5. Again, this makes sense, because half of the distribution is to the left of 0.
rnorm(10)
If you want to change the parameters of a given distribution, simply change them like so
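For example (the mean and standard deviation below are illustrative):
rnorm(10, mean = 5, sd = 2)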
The dbinom() function gives the probabilities for various values of the binomial variable. Minimally it requires three
arguments. The first argument for this function must be a vector of quantiles(the possible values of the random
variable X). The second and third arguments are the defining parameters of the distribution, namely, n(the
number of independent trials) and p(the probability of success in each trial). For example, for a binomial
distribution with n = 5, p = 0.5, the possible values for X are 0,1,2,3,4,5. That is, the dbinom(x,n,p) function
gives the probability values P( X = x ) for x = 0, 1, 2, 3, 4, 5.
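For example:
dbinom(0:5, size = 5, prob = 0.5)
# [1] 0.03125 0.15625 0.31250 0.31250 0.15625 0.03125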
The binomial probability distribution plot can be displayed as in the following figure:
Note that the binomial distribution is symmetric when p = 0.5. To demonstrate that the binomial distribution is
negatively skewed when p is larger than 0.5, consider the following example:
When p is smaller than 0.5 the binomial distribution is positively skewed as shown below.
We will now illustrate the usage of the cumulative distribution function pbinom(). This function can be used to
calculate probabilities such as P( X <= x ). The first argument to this function is a vector of quantiles(values of x).
# Calculating Probabilities
# P(X <= 2) in a Bin(n=5,p=0.5) distribution
> pbinom(2,5,0.5)
[1] 0.5
rbinom() is used to generate random samples of a specified size for given parameter values.
# Simulation
> xVal <- table(rbinom(1000, 8, .5))
> barplot(as.vector(xVal), names.arg = names(xVal),
    main = "Simulated Binomial Distribution\n (n=8,p=0.5)")
in one .R file, or
in two files: ui.R and server.R.
ui: A user interface script, controlling the layout and appearance of the application.
server: A server script which contains code to allow the application to react.
One file
library(shiny)
# Create the UI
ui <- shinyUI(fluidPage(
  # Application title
  titlePanel("Hello World!")
))
# Create the server
server <- shinyServer(function(input, output){})
# Run the app
shinyApp(ui = ui, server = server)
Two files
Create ui.R file
library(shiny)
ui <- shinyUI(fluidPage(
  titlePanel("Hello World!")
))
Create server.R file
library(shiny)
server <- shinyServer(function(input, output){})
label : title
choices : selected values
selected : The initially selected value (NULL for no selection)
inline : horizontal or vertical
width
library(shiny)
ui <- fluidPage(
radioButtons("radio",
label = HTML('<FONT color="red"><FONT size="5pt">Welcome</FONT></FONT><br> <b>Your
favorite color is red ?</b>'),
choices = list("TRUE" = 1, "FALSE" = 2),
selected = 1,
inline = T,
width = "100%"),
Showcase mode
Showcase mode displays your app alongside the code that generates it and highlights lines of code in server.R as it
runs them.
Launch Shiny app with the argument display.mode = "showcase", e.g., runApp("MyApp", display.mode =
"showcase").
Create file called DESCRIPTION in your Shiny app folder and add this line in it: DisplayMode: Showcase.
Reactive Log Visualizer provides an interactive browser-based tool for visualizing reactive dependencies and
execution in your application. To enable Reactive Log Visualizer, execute options(shiny.reactlog=TRUE) in R
console or add that line of code to your server.R file. To start Reactive Log Visualizer, hit Ctrl+F3 on Windows or
Command+F3 on Mac when your app is running. Use left and right arrow keys to navigate in Reactive Log Visualizer.
library(shiny)
ui <- fluidPage(
selectInput("id_selectInput",
label = HTML('<B><FONT size="3">What is your favorite color ?</FONT></B>'),
multiple = TRUE,
choices = list("red" = "red", "green" = "green", "blue" = "blue", "yellow" = "yellow"),
selected = NULL),
br(), br(),
fluidRow(column(3, textOutput("text_choice"))))
label : title
choices : selected values
selected : The initially selected value (NULL for no selection)
multiple : TRUE or FALSE
width
size
selectize: TRUE or FALSE (for use or not selectize.js, change the display)
Your two files ui.R and server.R have to be in the same folder. You can then launch your app by running the
shinyAppDir() function in the console and passing it the path of the directory that contains the Shiny app.
shinyAppDir("path_to_the_folder_containing_the_files")
You can also launch the app directly from RStudio by pressing the Run App button that appears in RStudio when
you have a ui.R or server.R file open.
Or you can simply write runApp() on the console if your working directory is Shiny App directory.
If you create your app in a single R file you can also launch it with the shinyApp() function, or
in the console by passing the path of a .R file containing the Shiny application to shinyAppFile() via the parameter appFile:
shinyAppFile(appFile = "path_to_my_R_file_containing_the_app")
# Create the UI
ui <- shinyUI(fluidPage(
titlePanel("Basic widgets"),
fluidRow(
column(3,
h3("Buttons"),
actionButton("action", label = "Action"),
br(),
br(),
submitButton("Submit")),
column(3,
h3("Single checkbox"),
checkboxInput("checkbox", label = "Choice A", value = TRUE)),
column(3,
checkboxGroupInput("checkGroup",
label = h3("Checkbox group"),
choices = list("Choice 1" = 1,
"Choice 2" = 2, "Choice 3" = 3),
selected = 1)),
column(3,
dateInput("date",
label = h3("Date input"),
fluidRow(
column(3,
dateRangeInput("dates", label = h3("Date range"))),
column(3,
fileInput("file", label = h3("File input"))),
column(3,
h3("Help text"),
helpText("Note: help text isn't a true widget,",
"but it provides an easy way to add text to",
"accompany other widgets.")),
column(3,
numericInput("num",
label = h3("Numeric input"),
value = 1))
),
fluidRow(
column(3,
radioButtons("radio", label = h3("Radio buttons"),
choices = list("Choice 1" = 1, "Choice 2" = 2,
"Choice 3" = 3),selected = 1)),
column(3,
selectInput("select", label = h3("Select box"),
choices = list("Choice 1" = 1, "Choice 2" = 2,
"Choice 3" = 3), selected = 1)),
column(3,
sliderInput("slider1", label = h3("Sliders"),
min = 0, max = 100, value = 50),
sliderInput("slider2", "",
min = 0, max = 100, value = c(25, 75))
),
column(3,
textInput("text", label = h3("Text input"),
value = "Enter text..."))
)
))
Often, spatial data is avaliable as an XY coordinate data set in tabular form. This example will show how to create a
spatial data set from an XY data set.
The packages rgdal and sp provide powerful functions. Spatial data in R can be stored as Spatial*DataFrame
(where * can be Points, Lines or Polygons).
At first, the working directory has to be set to the folder of the downloaded CSV data set. Furthermore, the package
rgdal has to be loaded.
setwd("D:/GeocodeExample/")
library(rgdal)
Afterwards, the CSV file storing cities and their geographical coordinates is loaded into R as a data.frame
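A sketch of that step (the file name is a placeholder):
xy <- read.csv("cities.csv", stringsAsFactors = FALSE)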
Often, it is useful to get a glimpse of the data and its structure (e.g. column names, data types etc.).
head(xy)
str(xy)
This shows that the latitude and longitude columns are interpreted as character values, since they hold entries like
"-33.532". Yet, the later used function SpatialPointsDataFrame() which creates the spatial data set requires the
coordinate values to be of the data type numeric. Thus the two columns have to be converted.
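For example (column names assumed from the data set described above):
xy$latitude  <- as.numeric(xy$latitude)
xy$longitude <- as.numeric(xy$longitude)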
A few of the values cannot be converted into numeric data and thus NA values are created. They have to be removed.
xy <- xy[!is.na(xy$longitude),]
Finally, the XY data set can be converted into a spatial data set. This requires the coordinates and the specification
of the Coordinate Reference System (CRS) in which the coordinates are stored.
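A sketch of the conversion, assuming longitude/latitude coordinates in WGS84:
library(sp)   # loaded automatically with rgdal; listed here for completeness
xySPoints <- SpatialPointsDataFrame(coords = xy[, c("longitude", "latitude")],
                                    data = xy,
                                    proj4string = CRS("+proj=longlat +ellps=WGS84"))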
The basic plot function can easily be used to get a sneak peek at the produced spatial points.
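plot(xySPoints, pch = ".")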
ESRI shape files can easily be imported into R by using the function readOGR() from the rgdal package.
library(rgdal)
shp <- readOGR(dsn = "/path/to/your/file", layer = "filename")
It is important to know that the dsn must not end with / and that the layer does not take the file extension (e.g. .shp).
raster
Another possible way of importing shapefiles is via the raster library and the shapefile function:
library(raster)
shp <- shapefile("path/to/your/file.shp")
Note how the path definition is different from the rgdal import statement.
tmap
library(tmap)
shp <- read_shape("path/to/your/file.shp")
To select the first 10 rows of the "diamonds" dataset from the package ggplot2, for example:
data("diamonds")
head(diamonds)
# A tibble: 6 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
require(sqldf)
sqldf("select * from diamonds limit 10")
Notice in the example above that quoted strings within the SQL query are quoted using '' if the overall query is
quoted with "" (this also works in reverse).
sqldf("select count(*) from diamonds where carat > 1 and color = 'E'")
count(*)
1 1892
sqldf("select *, count(*) as cnt_big_E_colored_stones from diamonds where carat > 1 and color = 'E'
group by clarity")
To find the maximum price of a diamond for each cut:
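A sketch of that query (its shape follows from the output below):
sqldf("select cut, max(price) from diamonds group by cut")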
cut max(price)
1 Fair 18574
2 Good 18788
3 Ideal 18806
4 Premium 18823
5 Very Good 18818
In this example we are comparing the speeds of six equivalent data.table expressions for updating elements in a
group, based on a certain condition.
More specifically:
A data.table with 3 columns: id, time and status. For each id, I want to find the record with the
maximum time; then, for that record, if the status is true and the time is > 7, I want to set the status to false.
library(microbenchmark)
library(data.table)
set.seed(20160723)
dt <- data.table(id = c(rep(seq(1:10000), each = 10)),
time = c(rep(seq(1:10000), 10)),
status = c(sample(c(TRUE, FALSE), 10000*10, replace = TRUE)))
setkey(dt, id, time) ## create copies of the data so the 'updates-by-reference' don't affect other
expressions
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)
dt5 <- copy(dt)
dt6 <- copy(dt)
microbenchmark(
expression_1 = {
dt1[ dt1[order(time), .I[.N], by = id]$V1, status := status * time < 7 ]
},
expression_2 = {
dt2[,status := c(.SD[-.N, status], .SD[.N, status * time > 7]), by = id]
},
expression_3 = {
dt3[dt3[,.N, by = id][,cumsum(N)], status := status * time > 7]
},
expression_4 = {
y <- dt4[,.SD[.N],by=id]
dt4[y, status := status & time > 7]
},
expression_5 = {
y <- dt5[, .SD[.N, .(time, status)], by = id][time > 7 & status]
dt5[y, status := FALSE]
},
expression_6 = {
dt6[ dt6[, .I == .I[which.max(time)], by = id]$V1 & time > 7, status := FALSE]
},
times = 10
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# expression_1 11.646149 13.201670 16.808399 15.643384 18.78640 26.321346 10
# expression_2 8051.898126 8777.016935 9238.323459 8979.553856 9281.93377 12610.869058 10
# expression_3 3.208773 3.385841 4.207903 4.089515 4.70146 5.654702 10
# expression_4 15.758441 16.247833 20.677038 19.028982 21.04170 36.373153 10
# expression_5 7552.970295 8051.080753 8702.064620 8861.608629 9308.62842 9722.234921 10
# expression_6 18.403105 18.812785 22.427984 21.966764 24.66930 28.607064 10
proc.time()
This is particularly useful for benchmarking specific lines of code. For example:
t1 <- proc.time()
fibb <- function (n) {
if (n < 3) {
return(c(0,1)[n])
} else {
return(fibb(n - 2) + fibb(n -1))
}
}
print("Time one")
print(proc.time() - t1)
t2 <- proc.time()
fibb(30)
print("Time two")
print(proc.time() - t2)
source('~/.active-rstudio-document')
system.time() is a wrapper for proc.time() that returns the elapsed time for a particular command/expression.
Note that the returned object, of class proc_time, is slightly more complicated than it appears on the surface:
str(t1)
## Class 'proc_time' Named num [1:5] 0 0 0.002 0 0
## ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
system.time(print("hello world"))
This is because system.time is essentially a wrapper function for proc.time, which measures in seconds. As
printing "hello world" takes well under a second, the time taken appears to be 0 seconds; however, this is
not true. To see this we can use the package microbenchmark:
library(microbenchmark)
microbenchmark(print("hello world"))
# Unit: microseconds
# expr min lq mean median uq max neval
# print("hello world") 26.336 29.984 44.11637 44.6835 45.415 158.824 100
Here we can see after running print("hello world") 100 times, the average time taken was in fact 44
microseconds. (Note that running this code will print "hello world" 100 times onto the console.)
We can compare this against an equivalent procedure, cat("hello world\n"), to see if it is faster than
print("hello world"):
microbenchmark(cat("hello world\n"))
# Unit: microseconds
# expr min lq mean median uq max neval
# cat("hello world\\n") 14.093 17.6975 23.73829 19.319 20.996 119.382 100
Alternatively one can compare two procedures within the same microbenchmark call:
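For example:
microbenchmark(print("hello world"), cat("hello world\n"))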
system.time(print("hello world"))
system.time({
library(numbers)
Primes(1,10^5)
})
system.time(fibb(30))
library(lineprof)
library(forecast)
l <- lineprof(auto.arima(AirPassengers))
shine(l)
This will provide you with a shiny app, which allows you to delve deeper into every function call, making it easy
to see what is causing your R code to slow down.
Each of these options will be shown in code; a comparison of the computational time to execute each option will be
shown; and lastly a discussion of the differences will be given.
vapply Function
column_mean_vapply <- vapply(mtcars, mean, numeric(1))
colMeans Function
column_mean_colMeans <- colMeans(mtcars)
Efficiency comparison
The results of benchmarking these four approaches are shown below (code not displayed):
Unit: microseconds
expr min lq mean median uq max neval cld
poor 240.986 262.0820 287.1125 275.8160 307.2485 442.609 100 d
optimal 220.313 237.4455 258.8426 247.0735 280.9130 362.469 100 c
vapply 107.042 109.7320 124.4715 113.4130 132.6695 202.473 100 a
colMeans 155.183 161.6955 180.2067 175.0045 194.2605 259.958 100 b
Notice that the optimized for loop edged out the poorly constructed for loop. The poorly constructed for loop is
constantly increasing the length of the output object, and at each change of the length, R is reevaluating the class of
the object.
Some of this overhead burden is removed by the optimized for loop by declaring the type of output object and its
length before starting the loop.
In this example, however, the use of a vapply function doubles the computational efficiency, largely because the looping is done in compiled code rather than in an R-level for loop.
Use of the colMeans function is a touch slower than the vapply function. This difference is attributable to some
error checks performed in colMeans and mainly to the as.matrix conversion (because mtcars is a data.frame) that
weren't performed in the vapply function.
class(squared_deviance)
length(squared_deviance)
What if we want a data.frame as a result? Well, there are many options for transforming a list into other objects.
However, maybe the simplest in this case is to store the for loop results in a data.frame.
The result will be the same even though we use the character option (B).
while (condition) {
## do something
## in loop body
}
where condition is evaluated prior to entering the loop body. If condition evaluates to TRUE, the code inside of the
loop body is executed, and this process repeats until condition evaluates to FALSE (or a break statement is
reached; see below). Unlike the for loop, if a while loop uses a variable to perform incremental iterations, the
variable must be declared and initialized ahead of time, and must be updated within the loop body. For example,
the following loops accomplish the same task:
for (i in 0:4) {
cat(i, "\n")
}
# 0
# 1
# 2
# 3
# 4
i <- 0
while (i < 5) {
cat(i, "\n")
i <- i + 1
}
# 0
# 1
# 2
# 3
# 4
In the while loop above, the line i <- i + 1 is necessary to prevent an infinite loop.
Additionally, it is possible to terminate a while loop with a call to break from inside the loop body:
iter <- 0
while (TRUE) {
if (runif(1) < 0.25) {
break
} else {
iter <- iter + 1
}
}
iter
#[1] 4
In this example, condition is always TRUE, so the only way to terminate the loop is with a call to break inside the
body. Note that the final value of iter will depend on the state of your PRNG when this example is run, and should
produce different results (essentially) each time the code is executed.
The repeat construct is essentially the same as while (TRUE) { ## something }, and has the following form:
repeat ({
## do something
## in loop body
})
Neither the () nor the {} are strictly required for a single-expression body, but the {} are needed to group a multi-statement body. Rewriting the previous example using repeat:
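A sketch mirroring the while example above:
iter <- 0
repeat {
  if (runif(1) < 0.25) {
    break
  } else {
    iter <- iter + 1
  }
}
iter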
More on break
It's important to note that break will only terminate the immediately enclosing loop. That is, the following is an infinite
loop:
while (TRUE) {
while (TRUE) {
cat("inner loop\n")
break
}
cat("outer loop\n")
}
With a little creativity, however, it is possible to break entirely from within a nested loop. As an example, consider
the following expression, which, in its current state, will loop infinitely:
while (TRUE) {
cat("outer loop body\n")
while (TRUE) {
cat("inner loop body\n")
x <- runif(1)
if (x < .3) {
break
} else {
cat(sprintf("x is %.5f\n", x))
}
}
}
One possibility is to recognize that, unlike break, the return expression does have the ability to return control
across multiple levels of enclosing loops. However, since return is only valid when used within a function, we
cannot simply replace break with return() above, but also need to wrap the entire expression as an anonymous
function:
(function() {
while (TRUE) {
cat("outer loop body\n")
while (TRUE) {
cat("inner loop body\n")
x <- runif(1)
if (x < .3) {
return()
} else {
cat(sprintf("x is %.5f\n", x))
}
}
}
})()
Alternatively, we can create a dummy variable (exit) prior to the expression, and activate it via <<- from the inner
loop when we are ready to terminate:
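A sketch of that approach:
exit <- FALSE
while (TRUE) {
  cat("outer loop body\n")
  while (TRUE) {
    cat("inner loop body\n")
    x <- runif(1)
    if (x < .3) {
      exit <<- TRUE   # flip the flag defined outside the loops
      break
    } else {
      cat(sprintf("x is %.5f\n", x))
    }
  }
  if (exit) break
}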
set.seed(20)
df1 <- data.frame(ID = rep(c("A", "B", "C"), each = 3), V1 = rnorm(9), V2 = rnorm(9))
m1 <- as.matrix(df1[-1])
There are many ways to do this. Using base R, the best option would be colSums
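For instance (matching the description below):
colSums(df1[-1], na.rm = TRUE)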
Here, we removed the first column as it is non-numeric and did the sum of each column, specifying the na.rm =
TRUE (in case there are any NAs in the dataset)
Or
For matrices, if we want to loop over the columns, we can use apply with MARGIN = 2:
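apply(m1, 2, sum, na.rm = TRUE)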
library(dplyr)
df1 %>%
summarise_at(vars(matches("^V\\d+")), sum, na.rm = TRUE)
Here, we are passing a regular expression to match the column names that we need to get the sum in
summarise_at. The regex will match all columns that start with V followed by one or more numbers (\\d+).
A data.table option is
library(data.table)
setDT(df1)[, lapply(.SD, sum, na.rm = TRUE), .SDcols = 2:ncol(df1)]
We convert the 'data.frame' to 'data.table' (setDT(df1)), specified the columns to be applied the function in
.SDcols and loop through the Subset of Data.table (.SD) and get the sum.
df1 %>%
group_by(ID) %>%
summarise_at(vars(matches("^V\\d+")), sum, na.rm = TRUE)
In cases where we need the sum of all the columns, summarise_each can be used instead of summarise_at
df1 %>%
group_by(ID) %>%
summarise_each(funs(sum(., na.rm = TRUE)))
library(jsonlite)
## vector to JSON
toJSON(c(1,2,3))
# [1,2,3]
fromJSON('[1,2,3]')
# [1] 1 2 3
toJSON(list(myVec = c(1,2,3)))
# {"myVec":[1,2,3]}
fromJSON('{"myVec":[1,2,3]}')
# $myVec
# [1] 1 2 3
## list structures
lst <- list(a = c(1,2,3),
b = list(letters[1:6]))
toJSON(lst)
# {"a":[1,2,3],"b":[["a","b","c","d","e","f"]]}
fromJSON('{"a":[1,2,3],"b":[["a","b","c","d","e","f"]]} ')
# $a
# [1] 1 2 3
#
# $b
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "a" "b" "c" "d" "e" "f"
## data.frame (reconstructed from the output below)
df <- data.frame(id = 1:10, val = letters[1:10])
toJSON(df)
# [{"id":1,"val":"a"},{"id":2,"val":"b"},{"id":3,"val":"c"},{"id":4,"val":"d"},{"id":5,"val":"e"},
#  {"id":6,"val":"f"},{"id":7,"val":"g"},{"id":8,"val":"h"},{"id":9,"val":"i"},{"id":10,"val":"j"}]
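jsonlite can also read JSON directly from a URL; a sketch of how the object used below might be created:
googleway_issues <- fromJSON("https://api.github.com/repos/SymbolixAU/googleway/issues")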
googleway_issues$url
# [1] "https://api.github.com/repos/SymbolixAU/googleway/issues/20" "https://api.github.com/repos/SymbolixAU/googleway/issues/19"
# [3] "https://api.github.com/repos/SymbolixAU/googleway/issues/14" "https://api.github.com/repos/SymbolixAU/googleway/issues/11"
# [5] "https://api.github.com/repos/SymbolixAU/googleway/issues/9"  "https://api.github.com/repos/SymbolixAU/googleway/issues/5"
# [7] "https://api.github.com/repos/SymbolixAU/googleway/issues/2"
require(RODBC)
con = odbcConnectExcel("myfile.xlsx") # open a connection to the Excel file
sqlTables(con)$TABLE_NAME # show all sheets
df = sqlFetch(con, "Sheet1") # read a sheet
df = sqlQuery(con, "select * from [Sheet1 $]") # read a sheet (alternative SQL syntax)
close(con) # close the connection to the file
library(RODBC)
cn <- odbcDriverConnect(connection="Driver={SQL
Server};server=localhost;database=Atilla;trusted_connection=yes;")
tbl <- sqlQuery(cn, 'select top 10 * from table_1')
This will connect to a SQL Server instance. For more information on what your connection string should look like,
visit connectionstrings.com
Also, if no database is specified in the connection string, you should make sure you fully qualify the object you want to query,
like this: databasename.schema.objectname
e.g. ymd() for parsing a date with the year followed by the month followed by the day, e.g. "2016-07-22", or
ymd_hms() for parsing a datetime in the order year, month, day, hours, minutes, seconds, e.g. "2016-07-22
13:04:47".
The functions are able to recognize most separators (such as /, -, and whitespace) without additional arguments.
They also work with inconsistent separators.
Dates
library(lubridate)
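For example (all of these parse to the same Date):
ymd("2016-07-22")
## [1] "2016-07-22"
ymd("2016/07/22")
## [1] "2016-07-22"
dmy("22.07.2016")
## [1] "2016-07-22"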
Datetimes
Utility functions
Datetimes can be parsed using ymd_hms variants including ymd_hm and ymd_h. All datetime functions can accept a tz
timezone argument akin to that of as.POSIXct or strptime, but which defaults to "UTC" instead of the local
timezone.
x <- c("2016-07-24 13:01:02", "2016-07-23 14:02:01", "2016-07-25 15:03:00")
ymd_hms(x)
## [1] "2016-07-24 13:01:02 UTC" "2016-07-23 14:02:01 UTC"
## [3] "2016-07-25 15:03:00 UTC"
lubridate also includes three functions for parsing datetimes with a formatting string like as.POSIXct or strptime:
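These are parse_date_time(), parse_date_time2() and fast_strptime(); for instance:
parse_date_time("22-07-2016 13:04:47", orders = "dmy HMS")
## [1] "2016-07-22 13:04:47 UTC"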
library(lubridate)
is.instant("helloworld")
## [1] FALSE
is.instant(60)
## [1] FALSE
Durations measure the exact amount of time that occurs between two instants.
duration(60, "seconds")
## [1] "60s"
duration(2, "minutes")
## [1] "120s (~2 minutes)"
Note: Units larger than weeks are not used due to their variability.
Durations can be created using dseconds, dminutes and other duration helper functions.
Run ?quick_durations for complete list.
dseconds(60)
## [1] "60s"
dhours(2)
## [1] "7200s (~2 hours)"
dyears(1)
## [1] "31536000s (~365 days)"
today_start + dhours(5)
as.duration(span)
[1] "43199s (~12 hours)"
Periods measure the change in clock time that occurs between two instants.
Periods can be created using period function as well other helper functions like seconds, hours, etc. To get a
complete list of period helper functions, Run ?quick_periods.
period(1, "hour")
## [1] "1H 0M 0S"
hours(1)
## [1] "1H 0M 0S"
period(6, "months")
## [1] "6m 0d 0H 0M 0S"
months(6)
## [1] "6m 0d 0H 0M 0S"
years(1)
## [1] "1y 0m 0d 0H 0M 0S"
is.period(years(1))
## [1] TRUE
is.period(dyears(1))
## [1] FALSE
year(date)
## 2016
minute(date)
## 42
day(date) <- 31
## "2016-07-31 03:42:35 IST"
force_tz returns the date-time that has the same clock time as x, but in the new time zone.
## [1] "2016-07-21"
## "2016-07-21 UTC"
## [1] "2016-07-21"
round_date() takes a date-time object and rounds it to the nearest integer value of the specified time unit.
round_date(now_dt, "minute")
## [1] "2016-07-22 13:53:00 IST"
round_date(now_dt, "hour")
round_date(now_dt, "year")
## [1] "2017-01-01 IST"
floor_date() takes a date-time object and rounds it down to the nearest integer value of the specified time unit.
floor_date(now_dt, "minute")
## [1] "2016-07-22 13:53:00 IST"
floor_date(now_dt, "hour")
## [1] "2016-07-22 13:00:00 IST"
floor_date(now_dt, "year")
## [1] "2016-01-01 IST"
ceiling_date() takes a date-time object and rounds it up to the nearest integer value of the specified time unit.
ceiling_date(now_dt, "minute")
## [1] "2016-07-22 13:54:00 IST"
ceiling_date(now_dt, "hour")
## [1] "2016-07-22 14:00:00 IST"
ceiling_date(now_dt, "year")
## [1] "2017-01-01 IST"
#Convert this vector to a ts object with 100 monthly observations starting in July
x <- ts(x, start = c(1900, 7), freq = 12)
#Convert this vector to a ts object with 100 daily observations and weekly frequency starting in
#the first week of 1900
x <- ts(x, start = c(1900, 1), freq = 7)
#Call all weeks including and after the 10th week of 1900
window(x, start = c(1900, 10))
1 "ts"
In the spirit of Exploratory Data Analysis (EDA) a good first step is to look at a plot of your time-series data:
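For example, using the built-in AirPassengers series used below:
plot(AirPassengers)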
cycle(AirPassengers)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 1 2 3 4 5 6 7 8 9 10 11 12
1950 1 2 3 4 5 6 7 8 9 10 11 12
1951 1 2 3 4 5 6 7 8 9 10 11 12
1952 1 2 3 4 5 6 7 8 9 10 11 12
1953 1 2 3 4 5 6 7 8 9 10 11 12
1954 1 2 3 4 5 6 7 8 9 10 11 12
1955 1 2 3 4 5 6 7 8 9 10 11 12
1956 1 2 3 4 5 6 7 8 9 10 11 12
1957 1 2 3 4 5 6 7 8 9 10 11 12
1958 1 2 3 4 5 6 7 8 9 10 11 12
1959 1 2 3 4 5 6 7 8 9 10 11 12
1960 1 2 3 4 5 6 7 8 9 10 11 12
Here is a common usage of strsplit: break a character vector along a comma separator:
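A sketch of an input that produces the output below, assuming comma-plus-space separators (compare with temp2 further down):
temp <- c("this, that, other", "hat, scarf, food", "woman, man, child")
(myList <- strsplit(temp, split = ", "))
[[1]]
[1] "this"  "that"  "other"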
[[2]]
[1] "hat" "scarf" "food"
[[3]]
[1] "woman" "man" "child"
As hinted above, the split argument is not limited to characters, but may follow a pattern dictated by a regular
expression. For example, temp2 is identical to temp above except that the separators have been altered for each
item. We can take advantage of the fact that the split argument accepts regular expressions to alleviate the
irregularity in the vector.
temp2 <- c("this, that, other", "hat,scarf ,food", "woman; man ; child")
myList2 <- strsplit(temp2, split=" ?[,;] ?")
myList2
[[1]]
[1] "this" "that" "other"
[[2]]
[1] "hat" "scarf" "food"
[[3]]
[1] "woman" "man" "child"
Notes:
1. breaking down the regular expression syntax is out of scope for this example.
2. Sometimes matching regular expressions can slow down a process. As with many R functions that allow the
use of regular expressions, the fixed argument is available to tell R to match on the split characters literally.
To scrape the table of milestones from the Wikipedia page on R, the code would look like
library(rvest)
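One possible version (the URL is the English Wikipedia page on R; the CSS selector is an assumption):
url <- "https://en.wikipedia.org/wiki/R_(programming_language)"
milestones <- url %>%
  read_html() %>%
  html_node("table.wikitable") %>%   # first wiki table on the page
  html_table()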
While this returns a data.frame, note that as is typical for scraped data, there is still further data cleaning to be
done: here, formatting dates, inserting NAs, and so on.
Note that data in a less consistently rectangular format may take looping or other further munging to successfully
parse. If the website makes use of jQuery or other means to insert content, read_html may be insufficient to
scrape, and a more robust scraper like RSelenium may be necessary.
This example was created to track my answers posted to Stack Overflow. The overall flow is to log in, go to
a web page, collect information, add it to a dataframe and then move to the next page.
#Dataframe Clean-up
names(results)<-c("Votes", "Answer", "Date", "Accepted", "HyperLink")
results$Votes<-as.integer(as.character(results$Votes))
results$Accepted<-ifelse(results$Accepted=="answer-votes default", 0, 1)
The loop in this case is limited to only 5 pages; this needs to change to fit your application. I replaced the user
specific values with ******; hopefully this will provide some guidance for your problem.
The name comes from the link function used, the logit or log-odds function. The inverse function of the logit is called
the logistic function and is given by logistic(x) = 1 / (1 + exp(-x)).
This function takes a value in (-Inf, +Inf) and returns a value between 0 and 1; i.e. the logistic function takes a
linear predictor and returns a probability.
Logistic regression can be performed using the glm function with the option family = binomial (shortcut for
family = binomial(link="logit"); the logit being the default link function for the binomial family).
In this example, we try to predict the fate of the passengers aboard the RMS Titanic.
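The fitting call is not shown in this excerpt; based on the Call printed in the output below, it would be along these
lines (assuming a data frame named titanic with columns survived, pclass, sex and age):
titanic.train <- glm(survived ~ pclass + sex + age, family = binomial, data = titanic)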
summary(titanic.train)
The output:
Call:
glm(formula = survived ~ pclass + sex + age, family = binomial, data = titanic)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6452 -0.6641 -0.3679 0.6123 2.5615
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.552261 0.342188 10.381 < 2e-16 ***
pclass2nd -1.170777 0.211559 -5.534 3.13e-08 ***
pclass3rd -2.430672 0.195157 -12.455 < 2e-16 ***
The first thing displayed is the call. It is a reminder of the model and the options specified.
Next we see the deviance residuals, which are a measure of model fit. This part of output shows the
distribution of the deviance residuals for individual cases used in the model.
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a
Wald z-statistic), and the associated p-values.
The qualitative variables are "dummified": one level is treated as the reference. The reference
level can be changed, for example with relevel() on the factor used in the formula.
All four predictors are statistically significant at a 0.1 % level.
The logistic regression coefficients give the change in the log odds of the outcome for a one unit
increase in the predictor variable.
To see the odds ratio (multiplicative change in the odds of survival per unit increase in a predictor
variable), exponentiate the parameter.
To see the confidence interval (CI) of the parameter, use confint.
Below the table of coefficients are fit indices, including the null and deviance residuals and the Akaike
Information Criterion (AIC), which can be used for comparing model performance.
When comparing models fitted by maximum likelihood to the same data, the smaller the AIC, the
better the fit.
One measure of model fit is the significance of the overall model. This test asks whether the model
with predictors fits significantly better than a model with just an intercept (i.e., a null model).
exp(coef(titanic.train)[3])
pclass3rd
0.08797765
With this model, compared to the first class, the 3rd class passengers have about a tenth of the odds of survival.
confint(titanic.train)
The test statistic is distributed chi-squared with degrees of freedom equal to the difference in degrees of freedom
between the current and the null model (i.e., the number of predictor variables in the model).
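A sketch of this test using the quantities stored in the fitted glm object (this code is not in the original excerpt):
# difference in deviance between the null model and the fitted model,
# compared against a chi-squared distribution
with(titanic.train,
     pchisq(null.deviance - deviance,
            df.null - df.residual,
            lower.tail = FALSE))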
However, sometimes it is more convenient to have a long format, in which all variables are in one column and the
values are in a second column.
Base R, as well as third party packages, can be used to simplify this process. For each of the options, the mtcars
dataset will be used. By default, this dataset is in a wide format. In order for the packages to work, we will insert the
row names as the first column.
Base R
There are two functions in base R that can be used to convert between wide and long format: stack() and
unstack().
However, these functions can become very complex for more advanced use cases. Luckily, there are other options
using third party packages.
This package uses gather() to convert from wide to long and spread() to convert from long to wide.
library(tidyr)
long <- gather(data, variable, value, 2:12) # where variable is the name of the
# variable column, value indicates the name of the value column and 2:12 refers to
# the columns to be stacked
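The reverse operation, going back from long to wide with spread(), is a sketch along these lines:
wide <- spread(long, variable, value)  # variable holds the column names, value the cell values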
The data.table package extends the reshape2 functions and uses the function melt() to go from wide to long and
dcast() to go from long to wide.
library(data.table)
long <- melt(data,'observation',2:12,'variable', 'value')
long # shows the long result
wide <- dcast(long, observation ~ variable)
wide # shows the wide result (~data)
Note that the data.frame is unbalanced, that is, unit 2 is missing an observation in the first period, while units 3 and
4 are missing observations in the second period. Also, note that there are two variables that vary over the periods:
counts and values, and two that do not vary: identifier and location.
Long to Wide
Notice that the missing time periods are filled in with NAs.
In reshaping wide, the "v.names" argument specifies the columns that vary over time. If the location variable is not
necessary, it can be dropped prior to reshaping with the "drop" argument. In dropping the only non-varying / non-
id column from the data.frame, the v.names argument becomes unnecessary.
Wide to Long
reshape(df.wide, direction="long")
Now the simple syntax will produce an error about undefined columns.
With column names that are more difficult for the reshape function to automatically parse, it is sometimes
necessary to add the "varying" argument which tells reshape to group particular variables in wide format for the
transformation into long format. This argument takes a list of vectors of variable names or indices.
reshape(df.wide, idvar="identifier",
varying=list(c(3,5,7), c(4,6,8)), direction="long")
In reshaping long, the "v.names" argument can be provided to rename the resulting varying variables.
Sometimes the specification of "varying" can be avoided by use of the "sep" argument which tells reshape what part
of the variable name specifies the value argument and which specifies the time argument.
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.min.js"></script>
Now we can use jQuery to alter the DOM (document object model) of our presentation. In other words: we alter the
HTML structure of the document. As soon as the presentation is loaded ($(document).ready(function() { ...
})), we select all slides, that do not have the class attributes .title-slide, .backdrop, or .segue and add the tag
<footer></footer> right before each slide is 'closed' (so before </slide>). The attribute label carries the content
that will be displayed later on.
(the other properties can be ignored but might have to be modified if the presentation uses a different style
template).
---
title: "Adding a footer to presentaion slides"
author: "Martin Schmelzer"
date: "26 Juli 2016"
output: ioslides_presentation
---
This is slide 1.
## Slide 2
This is slide 2
# Test
## Slide 3
And slide 3.
To knit the script, either use the render function or use the shortcut button in RStudio.
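For example, assuming the file shown below is saved as example.Rmd:
rmarkdown::render("example.Rmd")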
---
title: "Rstudio exemple of a rmd file"
author: 'stack user'
date: "22 July 2016"
The header is used to define the general parameters and the metadata.
## R Markdown
```{r cars}
summary(cars)
```
## Including Plots
```{r echo=FALSE}
plot(pressure)
```
x <- 1
Variables passed into a function and then reassigned are overwritten, but only inside the function.
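The function used below is not shown in this excerpt; a minimal definition consistent with the description might be:
foo <- function(x) {
    x <- 2    # reassignment only affects the copy of x inside foo
    x
}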
foo(1)
x
Variables assigned in a higher environment than a function exist within that function, without being passed.
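Again, a minimal sketch of such a function (x is not an argument here; it is found in the global environment):
foo <- function() {
    y <- x + 1  # x is looked up in the enclosing environment
    y
}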
foo()
Some parameters, especially those for graphics, can only be set globally; small helper functions that set them are common.
This fails:
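The definitions are not reproduced in this excerpt; a sketch consistent with the fragments below is a bar() defined
in the global environment, which therefore cannot see y defined inside foo():
bar <- function() {
    x + y       # y is not defined here, nor in the global environment
}
foo <- function() {
    y <- 3
    z <- bar()
    return(z)
}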
foo()
This works:
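Defining bar() inside foo() gives it access to y through lexical scoping; the fragment below completes a definition
along these lines:
foo <- function() {
    bar <- function() {
        x + y   # x from the global environment, y from foo()
    }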
y <- 3
z <- bar()
return(z)
}
foo()
foo()
Global assignment is highly discouraged. Use of a wrapper function or explicitly calling variables from another local
environment is greatly preferred.
A commonly created environment is one which encloses package:base or a subenvironment within package:base.
Since e2 inherits from e1, a is 3 in both e1 and e2. However, assigning a within e2 does not change the value of a in
e1.
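A minimal sketch of the environments described (the names e1 and e2 follow the text; the values are assumptions):
e1 <- new.env()
e2 <- new.env(parent = e1)  # e2 inherits from e1
assign("a", 3, envir = e1)
get("a", envir = e2)        # 3, found in the parent environment e1
assign("a", 5, envir = e2)  # creates a separate a in e2 ...
get("a", envir = e1)        # ... so a in e1 is still 3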
This function takes two vectors, shuffles their contents together, then performs the function testStat on the
shuffled vectors. The result of testStat is added to trials, which is the return value.
It does this N = 10^5 times. Note that the value N could very well have been a parameter to the function.
This leaves us with a new set of data, trials, the set of means that might result if there truly is no relationship
between the two variables.
Let's see what our observation looks like on a histogram of our test statistic.
hist(result)
abline(v=observedMeanDifference, col = "blue")
It doesn't look like our observed result is very likely to occur by random chance...
We want to calculate the p-value, the likelihood of the original observed result if there is no relationship between
the two variables.
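A sketch of the calculation described below, using the result vector and the observed mean difference from above:
p.value <- 2 * mean(result >= observedMeanDifference)
p.value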
The comparison yields TRUE every time the value of result is greater than or equal to the observed mean difference.
The function mean will interpret this vector as 1 for TRUE and 0 for FALSE, and give us the percentage of 1's in the
mix, i.e. the number of times our shuffled vector mean difference surpassed or equalled what we observed.
Finally, we multiply by 2 because the distribution of our test statistic is highly symmetric, and we really want to
know which results are "more extreme" than our observed result.
###############################################################################
# Load data from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
library(RCurl)
urlfile <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
x <- getURL(urlfile, ssl.verifypeer = FALSE)
adults <- read.csv(textConnection(x), header=F)
# adults <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
#                    header=F)
names(adults)=c('age','workclass','fnlwgt','education','educationNum',
'maritalStatus','occupation','relationship','race',
'sex','capitalGain','capitalLoss','hoursWeek',
'nativeCountry','income')
# clean up data
adults$income <- ifelse(adults$income==' <=50K',0,1)
# binarize all factors
library(caret)
dmy <- dummyVars(" ~ .", data = adults)
adultsTrsf <- data.frame(predict(dmy, newdata = adults))
###############################################################################
# what we're trying to predict adults that make more than 50k
outcomeName <- c('income')
# list of features
predictors <- names(adultsTrsf)[!names(adultsTrsf) %in% outcomeName]
# play around with settings of xgboost - eXtreme Gradient Boosting (Tree) library
# https://github.com/tqchen/xgboost/wiki/Parameters
# max.depth - maximum depth of the tree
# nrounds - the max number of iterations
# train
bst <- xgboost(data = as.matrix(trainSet[,predictors]),
label = trainSet[,outcomeName],
max.depth=depth, nround=rounds,
objective = "reg:linear", verbose=0)
gc()
# predict
predictions <- predict(bst, as.matrix(testSet[,predictors]), outputmargin=TRUE)
err <- rmse(as.numeric(testSet[,outcomeName]), as.numeric(predictions))
cv <- 30
trainSet <- adultsTrsf[1:trainPortion,]
cvDivider <- floor(nrow(trainSet) / (cv+1))
###########################################################################
# Test both models out on full data set
That means that when approaching a problem that at first glance requires "by row operations", such as calculating
the means of each row, one needs to ask themselves:
What are the classes of the data sets I'm dealing with?
Is there an existing compiled code that can achieve this without the need of repetitive evaluation of R
functions?
If not, can I do these operations by columns instead of by row?
Finally, is it worth spending a lot of time on developing complicated vectorized code instead of just running a
simple apply loop? In other words, is the data big/sophisticated enough that R can't handle it efficiently using
a simple loop?
Putting aside the memory pre-allocation issue and growing objects in loops, we will focus in this example on how to
possibly avoid apply loops, method dispatching or re-evaluating R functions within loops.
apply(mtcars, 1, mean)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
Valiant Duster 360
29.90727 29.98136 23.59818 38.73955 53.66455
35.04909 59.72000
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
Merc 450SL Merc 450SLC
24.63455 27.23364 31.86000 31.78727 46.43091
46.50000 46.35000
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic
Toyota Corolla Toyota Corona
66.23273 66.05855 65.97227 19.44091 17.74227
18.81409 24.88864
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
Porsche 914-2 Lotus Europa
47.24091 46.00773 58.75273 57.37955 18.92864
24.77909 24.88027
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
60.97182 34.50818 63.15545 26.26273
1. First, we converted a data.frame to a matrix. (Note that this happens within the apply function.) This is both
inefficient and dangerous. A matrix can't hold several column types at a time. Hence, such conversion will
probably lead to loss of information and sometimes to misleading results (compare apply(iris, 2, class)
with str(iris) or with sapply(iris, class)).
2. Second of all, we performed an operation repetitively, one time for each row. Meaning, we had to evaluate
some R function nrow(mtcars) times. In this specific case, mean is not a computationally expensive function,
hence R could likely easily handle it even for a big data set, but what would happen if we need to calculate
the standard deviation by row (which involves an expensive square root operation)? Which brings us to the
next point:
rowMeans(mtcars)
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
Valiant Duster 360
29.90727 29.98136 23.59818 38.73955 53.66455
35.04909 59.72000
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE
Merc 450SL Merc 450SLC
24.63455 27.23364 31.86000 31.78727 46.43091
46.50000 46.35000
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic
Toyota Corolla Toyota Corona
66.23273 66.05855 65.97227 19.44091 17.74227
18.81409 24.88864
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9
Porsche 914-2 Lotus Europa
47.24091 46.00773 58.75273 57.37955 18.92864
24.77909 24.88027
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
60.97182 34.50818 63.15545 26.26273
This involves no by-row operations and therefore no repetitive evaluation of R functions. However, we still
converted a data.frame to a matrix. Though rowMeans has an error handling mechanism and it won't run on a data
set that it can't handle, it still has an efficiency cost.
rowMeans(iris)
Error in rowMeans(iris) : 'x' must be numeric
But still, can we do better? Instead of a matrix conversion with error handling, we could try a different method that
allows us to use mtcars as a vector (because a data.frame is essentially a list, and a list is a vector).
Reduce(`+`, mtcars)/ncol(mtcars)
[1] 29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000 24.63455 27.23364 31.86000
31.78727 46.43091 46.50000 46.35000 66.23273 66.05855
[17] 65.97227 19.44091 17.74227 18.81409 24.88864 47.24091 46.00773 58.75273 57.37955 18.92864
24.77909 24.88027 60.97182 34.50818 63.15545 26.26273
Now, for a possible speed gain, we lost column names and error handling (including NA handling).
Another example would be calculating the mean by group; using base R we could try:
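The original call is not shown in this excerpt; one common base R approach is a sketch like:
aggregate(. ~ cyl, mtcars, mean)   # group means of every column, by number of cylinders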
Still, we are basically evaluating an R function in a loop, but the loop is now hidden in an internal C function (it
matters little whether it is a C or an R loop).
Could we avoid it? Well there is a compiled function in R called rowsum, hence we could do:
rowsum(mtcars[-2], mtcars$cyl)/table(mtcars$cyl)
mpg disp hp drat wt qsec vs am gear carb
4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
At this point we may question whether our current data structure is the most appropriate one. Is a data.frame the
best practice? Or should one just switch to a matrix data structure in order to gain efficiency?
By-row operations will get more and more expensive (even in matrices) as we start to evaluate expensive functions
each time. Let us consider a variance calculation by row as an example.
set.seed(100)
m <- matrix(sample(1e2), 10)
m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 8 33 39 86 71 100 81 68 89 84
[2,] 12 16 57 80 32 82 69 11 41 92
[3,] 62 91 53 13 42 31 60 70 98 79
[4,] 66 94 29 67 45 59 20 96 64 1
[5,] 36 63 76 6 10 48 85 75 99 2
[6,] 18 4 27 19 44 56 37 95 26 40
[7,] 3 24 21 25 52 51 83 28 49 17
[8,] 46 5 22 43 47 74 35 97 77 65
[9,] 55 54 78 34 50 90 30 61 14 58
[10,] 88 73 38 15 9 72 7 93 23 87
apply(m, 1, var)
[1] 871.6556 957.5111 699.2111 941.4333 1237.3333 641.8222 539.7889 759.4333 500.4889
1255.6111
On the other hand, one could also completely vectorize this operation by following the formula of variance
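A sketch of such a fully vectorized version, applying the usual variance formula sum((x - mean(x))^2) / (n - 1) row-wise:
RowVar <- function(x) {
    rowSums((x - rowMeans(x))^2) / (ncol(x) - 1)
}
RowVar(m)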
anyNA(vec)
# [1] TRUE
is.na(vec)
# [1] FALSE FALSE FALSE TRUE FALSE
is.na returns a logical vector that is coerced to integer values under arithmetic operations (with FALSE=0, TRUE=1).
We can use this to find out how many missing values there are:
sum(is.na(vec))
# [1] 1
Extending this approach, we can use colSums and is.na on a data frame to count NAs per column:
colSums(is.na(airquality))
# Ozone Solar.R Wind Temp Month Day
# 37 7 0 0 0 0
The naniar package (currently on github but not CRAN) offers further tools for exploring missing values.
It is also possible to indicate that more than one symbol needs to be read as NA:
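For example, with read.csv (the file name and the extra symbols here are hypothetical):
dat <- read.csv("data.csv", na.strings = c("NA", "-", "?"))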
Similarly, NAs can be written with customized strings using the na argument to write.csv. Other tools for reading
and writing tables have similar options.
class(NA)
#[1] "logical"
This is convenient, since it can easily be coerced to other atomic vector types, and is therefore usually the only NA you will need.
If you do need a single NA value of another type, use NA_character_, NA_integer_, NA_real_ or NA_complex_. For
missing values of fancy classes, subsetting with NA_integer_ usually works; for example, to get a missing-value
Date:
class(Sys.Date()[NA_integer_])
# [1] "Date"
NA | TRUE
# [1] TRUE
# TRUE | TRUE is TRUE and FALSE | TRUE is also TRUE.
NA | FALSE
# [1] NA
# TRUE | FALSE is TRUE but FALSE | FALSE is FALSE.
NA & TRUE
# [1] NA
# TRUE & TRUE is TRUE but FALSE & TRUE is FALSE.
NA & FALSE
# [1] FALSE
# TRUE & FALSE is FALSE and FALSE & FALSE is also FALSE.
These properties are helpful if you want to subset a data set based on some columns that contain NA.
df <- data.frame(v1=0:9,
v2=c(rep(1:2, each=4), NA, NA),
v3=c(NA, letters[2:10]))
df[df$v2 == 1, ]
# v1 v2 v3
#1 0 1 <NA>
#2 1 1 b
#3 2 1 c
#4 3 1 d
#NA NA NA <NA>
#NA.1 NA NA <NA>
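To drop those unwanted NA rows, the condition can be wrapped in which(), or the missing values can be excluded
explicitly; a minimal sketch:
df[which(df$v2 == 1), ]
df[!is.na(df$v2) & df$v2 == 1, ]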
The primary packages for fitting hierarchical (alternatively "mixed" or "multilevel") linear models in R are nlme
(older) and lme4 (newer). These packages differ in many minor ways but should generally result in very similar fitted
models.
library(nlme)
library(lme4)
m1.nlme <- lme(Reaction~Days,random=~Days|Subject,data=sleepstudy,method="REML")
m1.lme4 <- lmer(Reaction~Days+(Days|Subject),data=sleepstudy,REML=TRUE)
all.equal(fixef(m1.nlme),fixef(m1.lme4))
## [1] TRUE
Differences to consider:
The unofficial GLMM FAQ provides more information, although it is focused on generalized linear mixed models
(GLMMs).
R comes with built-in functionals, of which perhaps the most well-known are the apply family of functions. Here is a
description of some of the most common apply functions:
lapply() = takes a list as an argument and applies the specified function to the list.
sapply() = the same as lapply() but attempts to simplify the output to a vector or a matrix.
vapply() = a variant of sapply() in which the output object's type must be specified.
mapply() = like lapply() but can pass multiple vectors as input to the specified function. Can be simplified
like sapply().
Map() is an alias to mapply() with SIMPLIFY = FALSE.
lapply()
lapply(variable, FUN)
lapply(seq_along(variable), FUN)
sapply()
mapply()
mapply() works much like lapply() except it can take multiple vectors as input (hence the m for multivariate).
At this point, we can take two approaches to inserting the names into the data.frame.
If you're a fan of magrittr style pipes, you can accomplish the entire task in a single chain (though it may not be
prudent to do so if you need any of the intermediary objects, such as the model objects themselves):
library(magrittr)
library(broom)
Combined <- lapply(1:4,
function(i) mtcars[sample(1:nrow(mtcars),
size = nrow(mtcars),
replace = TRUE), ]) %>%
lapply(function(BD) lm( mpg ~ qsec + wt + factor(am), data = BD)) %>%
lapply(tidy) %>%
setNames(paste0("Boot", seq_along(.))) %>%
mapply(function(nm, dframe) cbind(nm, dframe),
nm = names(.),
dframe = .,
SIMPLIFY = FALSE) %>%
do.call("rbind", .)
Firstly, a vector of the file names to be accessed must be created; there are multiple options for this:
Using list.files() with a regex search term for the file type, requires knowledge of regular expressions
(regex) if other files of same type are in the directory.
readRDS is specific to .rds files and will change depending on the application of the process.
This is not necessarily faster than a for loop from testing but allows all files to be an element of a list without
assigning them explicitly.
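A minimal sketch of this pattern (the file pattern is an assumption):
file_list <- list.files(pattern = "\\.rds$")   # all .rds files in the working directory
data_list <- lapply(file_list, readRDS)        # read each file into one element of a list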
Finally, we often need to load multiple packages at once. This trick can do it quite easily by applying library() to all
libraries that we wish to import:
lapply(c("jsonlite","stringr","igraph"),library,character.only=TRUE)
Users can create their own functionals to varying degrees of complexity. The following examples are from
Functionals by Hadley Wickham:
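The first definition, reproduced here as a sketch following the description below, is simply:
randomise <- function(f) f(runif(1e3))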
In the first case, randomise accepts a single argument f, and calls it on a sample of Uniform random variables. To
demonstrate equivalence, we call set.seed below:
set.seed(123)
randomise(mean)
set.seed(123)
mean(runif(1e3))
#[1] 0.4972778
set.seed(123)
randomise(max)
#[1] 0.9994045
set.seed(123)
max(runif(1e3))
#[1] 0.9994045
The second example is a re-implementation of base::lapply, which uses functionals to apply an operation (f) to
each element in a list (x). The ... parameter allows the user to pass additional arguments to f, such as the na.rm
option in the mean function:
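A sketch of such a re-implementation (the function name lapply2 is an assumption):
lapply2 <- function(x, f, ...) {
    out <- vector("list", length(x))
    for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], ...)   # extra arguments such as na.rm are passed through ...
    }
    out
}
lapply2(list(1:3, c(4, NA)), mean, na.rm = TRUE)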
require(RWeka)
require(tau)
require(tm)
require(tm.plugin.webmining)
require(wordcloud)
inspect(corpus)
wordlist <- c("lfvn", "lifevantage", "protandim", "truescience", "company", "fiscal", "nasdaq")
We can make a major leap to n-gram word clouds and in doing so we’ll see how to make almost any text-mining
analysis flexible enough to handle n-grams by transforming our TDM.
The initial difficulty you run into with n-grams in R is that tm, the most popular package for text mining, does not
inherently support tokenization of bi-grams or n-grams. Tokenization is the process of representing a word, part of
a word, or group of words (or symbols) as a single data element called a token.
Fortunately, we have some hacks which allow us to continue using tm with an upgraded tokenizer. There's more
than one way to achieve this: we can write our own simple tokenizer using the textcnt() function from tau, or use
RWeka's NGramTokenizer(), as below:
# BigramTokenizer
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
From this point you can proceed much as in the 1-gram case:
The example above is reproduced with permission from Hack-R's data science blog. Additional commentary may be
found in the original article.
In order to complete the analysis, data must be in long format (see the reshaping data topic). aov() is a wrapper around
the lm() function, using Wilkinson-Rogers formula notation y~f where y is the response (dependent) variable and
f is a factor (categorical) variable representing group membership. If f is numeric rather than a factor variable, aov()
will report the results of a linear regression in ANOVA format, which may surprise inexperienced users.
The aov() function uses Type I (sequential) Sum of Squares. This type of Sum of Squares tests all of the (main and
interaction) effects sequentially. The result is that the first effect tested is also assigned shared variance between it
and other effects in the model. For the results from such a model to be reliable, data should be balanced (all groups
are of the same size).
When the assumptions for Type I Sum of Squares do not hold, Type II or Type III Sum of Squares may be applicable.
Type II Sum of Squares test each main effect after every other main effect, and thus controls for any overlapping
variance. However, Type II Sum of Squares assumes no interaction between the main effects.
Lastly, Type III Sum of Squares tests each main effect after every other main effect and every interaction. This
makes Type III Sum of Squares a necessity when an interaction is present.
Type II and Type III Sums of Squares are implemented in the Anova() function from the car package.
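The model object used below is not defined in this excerpt; a minimal example consistent with the calls that follow
might be:
mtCarsAnovaModel <- aov(wt ~ factor(cyl), data = mtcars)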
summary(mtCarsAnovaModel)
One can also extract the coefficients of the underlying lm() model:
coefficients(mtCarsAnovaModel)
Using the mtcars data set as an example, we demonstrate the difference between Type II and Type III Sums of
Squares when an interaction is tested.
Response: wt
Sum Sq Df F value Pr(>F)
(Intercept) 25.8427 1 82.4254 1.524e-09 ***
factor(cyl) 4.0124 2 6.3988 0.005498 **
factor(am) 1.7389 1 5.5463 0.026346 *
factor(cyl):factor(am) 0.0668 2 0.1065 0.899371
Residuals 8.1517 26
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(glcm)
library(raster)
plot(rglcm)
The textural features can also be calculated in all 4 directions (0°, 45°, 90° and 135°) and then combined to one
rotation-invariant texture. The key for this is the shift parameter:
plot(rglcm1)
library(raster)
library(mmand)
Afterwards, the raster layer has to be converted into an array which is used as input for the erode() function.
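A sketch of these steps (assuming an existing RasterLayer called r and a disc-shaped structuring element):
rMat <- as.matrix(r)                            # raster layer to a plain matrix
kernel <- shapeKernel(c(3, 3), type = "disc")   # structuring element from mmand
rErode <- raster(erode(rMat, kernel))           # erode and wrap the result back into a raster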
Besides erode(), also the morphological functions dilate(), opening() and closing() can be applied like this.
plot(rErode)
In the example below a survival model is fit and used for prediction, scoring, and performance analysis using the
package randomForestSRC from CRAN.
require(randomForestSRC)
x1 y
1 0.9604353 1.3549648
2 0.3771234 0.2961592
3 0.7844242 0.6942191
4 0.9860443 1.5348900
5 0.1942237 0.4629535
6 0.7442532 -0.0672639
In the example below we plot 2 predicted curves and vary sex between the 2 sets of new data, to visualize its effect:
require(survival)
s <- with(lung,Surv(time,status))
install.packages('survminer')
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.clinical") # data for examples
library(RTCGA.clinical)
survivalTCGA(BRCA.clinical, OV.clinical,
extract.cols = "admin.disease_code") -> BRCAOV.survInfo
library(survival)
fit <- survfit(Surv(times, patient.vital_status) ~ admin.disease_code,
data = BRCAOV.survInfo)
library(survminer)
ggsurvplot(fit, risk.table = TRUE)
ggsurvplot(
fit, # survfit object with calculated statistics.
risk.table = TRUE, # show risk table.
pval = TRUE, # show p-value of log-rank test.
conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,2000), # present narrower X axis, but not affect
# survival estimates.
break.time.by = 500, # break X axis in time intervals by 500.
ggtheme = theme_RTCGA(), # customize plot and risk table with a theme.
risk.table.y.text.col = T, # colour risk table text annotations.
risk.table.y.text = FALSE # show bars instead of names in text annotations
# in legend of risk table
)
http://r-addict.com/2016/05/23/Informative-Survival-Plots.html
########################################################
# Try part: define the expression(s) you want to "try" #
########################################################
{
# Just to highlight:
# If you want to use more than one R expression in the "try part"
# then you'll have to use curly brackets.
# Otherwise, just write the single expression you want to try, and the
# curly brackets can be omitted.
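    # A sketch of the expression being tried (an assumption, consistent with
    # the HTML lines shown in the output further below):
    message("This is the 'try' part")
    readLines(con = url, warn = FALSE)
},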
########################################################################
# Condition handler part: define how you want conditions to be handled #
########################################################################
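# A sketch of handlers consistent with the rest of the example (the invalid
# URL ends up as NA in the results, so the error handler presumably returns NA):
error = function(cond) {
    message(paste("URL does not seem to exist:", url))
    message("Here's the original error message:")
    message(cond)
    return(NA)        # value returned in case of error
},
warning = function(cond) {
    message(paste("URL caused a warning:", url))
    message("Here's the original warning message:")
    message(cond)
    return(NULL)      # value returned in case of warning
},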
###############################################
# Final part: define what should happen AFTER #
# everything has been tried and/or handled #
###############################################
finally = {
message(paste("Processed URL:", url))
message("Some message at the end\n")
}
)
return(out)
}
Let's define a vector of URLs where one element isn't a valid URL
urls <- c(
"http://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html",
"http://en.wikipedia.org/wiki/Xz",
"I'm no URL"
)
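Applying the reader function over this vector (the function name readUrl is an assumption; it stands for whatever
the function defined above was called):
y <- lapply(urls, readUrl)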
head(y[[1]])
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">"
# [2] "<html><head><title>R: Functions to Manipulate Connections</title>"
# [3] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
# [4] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\">"
# [5] "</head><body>"
y[[3]]
# [1] NA
The easiest way to share a (preferably small) data frame is to use the basic function dput(). It will export an R object
in a plain text form.
Note: Before making the example data below, make sure you're in an empty folder you can write to. Run getwd() and
read ?setwd if you need to change folders.
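A minimal sketch, using a small piece of a built-in dataset as the data frame to share:
df <- mtcars[1:5, 1:4]       # hypothetical small data frame
dput(df, file = "df.txt")    # write a plain text representation to disk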
Then, anyone can load the precise R object to their GlobalEnvironment using the dget() function.
df <- dget('df.txt')
For larger R objects, there are a number of ways of saving them reproducibly. See Input and output .
Starting from 2014-09-17, the authors of the package make daily copies of the whole CRAN package repository to
their own mirror repository -- Microsoft R Archived Network. So, to avoid package reproducibility issues when
creating a reproducible R project, all you need is to:
1. Make sure that all your packages (and R version) are up-to-date.
2. Include checkpoint::checkpoint('YYYY-MM-DD') line in your code.
checkpoint will create a directory .checkpoint in your R_home directory ("~/"). To this technical directory it will
install all the packages that are used in your project. That means, checkpoint looks through all the .R files in your
project directory to pick up all the library() or require() calls and install all the required packages in the form
they existed at CRAN on the specified date.
The Fourier transform is called the frequency domain representation of the original signal. The term Fourier
transform refers to both the frequency domain representation and the mathematical operation that associates the
frequency domain representation to a function of time. The Fourier transform is not limited to functions of time,
but in order to have a unified language, the domain of the original function is commonly referred to as the time
domain. For many functions of practical interest one can define an operation that reverses this: the inverse Fourier
transformation, also called Fourier synthesis, of a frequency domain representation combines the contributions of
all the different frequencies to recover the original function of time.
Linear operations performed in one domain (time or frequency) have corresponding operations in the other
domain, which are sometimes easier to perform. The operation of differentiation in the time domain corresponds
to multiplication by the frequency, so some differential equations are easier to analyze in the frequency domain.
Also, convolution in the time domain corresponds to ordinary multiplication in the frequency domain. Concretely,
this means that any linear time-invariant system, such as an electronic filter applied to a signal, can be expressed
relatively simply as an operation on frequencies. So significant simplification is often achieved by transforming time
functions to the frequency domain, performing the desired operations, and transforming the result back to time.
Harmonic analysis is the systematic study of the relationship between the frequency and time domains, including
the kinds of functions or operations that are "simpler" in one or the other, and has deep connections to almost all
areas of modern mathematics.
Functions that are localized in the time domain have Fourier transforms that are spread out across the frequency
domain and vice versa. The critical case is the Gaussian function, of substantial importance in probability theory
and statistics as well as in the study of physical phenomena exhibiting normal distribution (e.g., diffusion), which
with appropriate normalizations goes to itself under the Fourier transform. Joseph Fourier introduced the
transform in his study of heat transfer, where Gaussian functions appear as solutions of the heat equation.
The Fourier transform can be formally defined as an improper Riemann integral, making it an integral transform,
although this definition is not suitable for many applications requiring a more sophisticated integration theory.
For example, many relatively simple applications use the Dirac delta function, which can be treated formally as if it
were a function, but the justification requires a mathematically more sophisticated viewpoint. The Fourier
transform can also be generalized to functions of several variables on Euclidean space, sending a function of 3-
dimensional space to a function of 3-dimensional momentum (or a function of space and time to a function of 4-
momentum).
This idea makes the spatial Fourier transform very natural in the study of waves, as well as in quantum mechanics,
where it is important to be able to represent wave solutions as functions of either space or momentum and
sometimes both. In general, functions to which Fourier methods are applicable are complex-valued, and possibly
vector-valued. Still further generalization is possible to functions on groups, which, besides the original Fourier
transform on ℝ or ℝn (viewed as groups under addition), notably includes the discrete-time Fourier transform
(DTFT, group = ℤ), the discrete Fourier transform (DFT, group = ℤ mod N) and the Fourier series or circular Fourier
transform (group = S1, the unit circle ≈ closed finite interval with endpoints identified). The latter is routinely employed to handle periodic functions.
# Sine waves
xs <- seq(-2*pi,2*pi,pi/100)
wave.1 <- sin(3*xs)
wave.2 <- sin(10*xs)
par(mfrow = c(1, 2))
plot(xs,wave.1,type="l",ylim=c(-1,1)); abline(h=0,lty=3)
plot(xs,wave.2,type="l",ylim=c(-1,1)); abline(h=0,lty=3)
# Complex Wave
wave.3 <- 0.5 * wave.1 + 0.25 * wave.2
plot(xs,wave.3,type="l"); title("Eg complex wave"); abline(h=0,lty=3)
Some concepts:
The fundamental period, T, is the period of all the samples taken, the time between the first sample and the
last
The sampling rate, sr, is the number of samples taken over a time period (aka acquisition frequency). For
simplicity we will make the time interval between samples equal. This time interval is called the sample
interval, si, which is the fundamental period time divided by the number of samples N. So, si = T/N.
The fundamental frequency, f0, which is 1/T. The fundamental frequency is the frequency of the repeating
pattern, or how long the wavelength is. In the previous waves, the fundamental frequency was 1/(2π).
title("Repeating pattern")
points(repeat.xs,wave.3.repeat,type="l",col="red");
abline(h=0,v=c(-2*pi,0),lty=3)
Important note: if you use RStudio, you can have a separate .Rprofile in every RStudio project directory.
Here are some examples of code that you might include in an .Rprofile file.
This will allow you to not have to install all the packages again with each R version update.
# library location
.libPaths("c:/R_home/Rpackages/win")
Sometimes it is useful to have a shortcut for a long R expression. A common example of this is setting an active
binding to access the last top-level expression result without having to type out .Last.value:
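A sketch of such a binding (the shortcut name .L is an arbitrary choice):
makeActiveBinding(".L", function() .Last.value, .GlobalEnv)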
This is bad practice and should generally be avoided because it separates package loading code from the scripts
where those packages are actually used.
See Also
Options
# Select default CRAN mirror for package installation.
options(repos=c(CRAN="https://cran.gis-lab.info/"))
# No scientific notation.
options(scipen=10)
# No graphics in menus.
options(menu.graphics=FALSE)
Custom Functions
# Invisible environment to mask defined functions
.env = new.env()
dplyr's philosophy is to have small functions that do one thing well. The five simple functions (filter, arrange,
select, mutate, and summarise) can be used to reveal new ways to describe data. When combined with group_by,
these functions can be used to calculate group wise summary statistics.
Syntax commonalities
We will use the built-in mtcars dataset to explore dplyr's single table verbs. Before converting the type of mtcars to
tbl_df (since it makes printing cleaner), we add the rownames of the dataset as a column using rownames_to_column
function from the tibble package.
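A sketch of that preparation step (the exact calls are assumptions, chosen to be consistent with the printed tibble
below):
library(dplyr)
library(tibble)
mtcars_df  <- rownames_to_column(mtcars, var = "cars")  # row names become the "cars" column
mtcars_tbl <- tbl_df(mtcars_df)                          # cleaner printing
head(mtcars_tbl)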
# A tibble: 6 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
filter
filter helps subset rows that match certain criteria. The first argument is the name of the data.frame and the
second (and subsequent) arguments are the criteria that filter the data (these criteria should evaluate to either TRUE
or FALSE)
filter(mtcars_tbl, cyl == 4)
# A tibble: 11 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#2 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
We can pass multiple criteria separated by a comma. To subset the cars which have either 4 or 6 cylinders - cyl and
have 5 gears - gear:
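The corresponding call (not shown above) would be along the lines of:
filter(mtcars_tbl, cyl == 4 | cyl == 6, gear == 5)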
# A tibble: 3 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
#2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
#3 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
filter selects rows based on criteria, to select rows by position, use slice. slice takes only 2 arguments: the first
one is a data.frame and the second is integer row values.
slice(mtcars_tbl, 6:9)
# A tibble: 4 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
#2 Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
#3 Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
#4 Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
Or:
arrange
arrange is used to sort the data by a specified variable(s). Just like the previous verb (and all other functions in
dplyr), the first argument is a data.frame, and consequent arguments are used to sort the data. If more than one
variable is passed, the data is first sorted by the first variable, and then by the second variable, and so on..
arrange(mtcars_tbl, hp)
# A tibble: 32 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#2 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#3 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#4 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#5 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
To arrange the data by miles per gallon - mpg in descending order, followed by number of cylinders - cyl:
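A call consistent with the output below:
arrange(mtcars_tbl, desc(mpg), cyl)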
# A tibble: 32 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#2 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#3 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#4 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#5 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#6 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# ... with 26 more rows
select
select is used to select only a subset of variables. To select only mpg, disp, wt, qsec, and vs from mtcars_tbl:
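A call consistent with the output below:
select(mtcars_tbl, mpg, disp, wt, qsec, vs)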
# A tibble: 32 x 5
# mpg disp wt qsec vs
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 21.0 160.0 2.620 16.46 0
#2 21.0 160.0 2.875 17.02 0
#3 22.8 108.0 2.320 18.61 1
#4 21.4 258.0 3.215 19.44 1
#5 18.7 360.0 3.440 17.02 0
#6 18.1 225.0 3.460 20.22 1
# ... with 26 more rows
: notation can be used to select consecutive columns. To select columns from cars through disp and vs through
carb:
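A call consistent with the output below:
select(mtcars_tbl, cars:disp, vs:carb)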
# A tibble: 32 x 8
# cars mpg cyl disp vs am gear carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 0 1 4 4
#2 Mazda RX4 Wag 21.0 6 160.0 0 1 4 4
#3 Datsun 710 22.8 4 108.0 1 1 4 1
#4 Hornet 4 Drive 21.4 6 258.0 1 0 3 1
#5 Hornet Sportabout 18.7 8 360.0 0 0 3 2
#6 Valiant 18.1 6 225.0 1 0 3 1
# ... with 26 more rows
or select(mtcars_tbl, -(hp:qsec))
For datasets that contain several columns, it can be tedious to select several columns by name. To make life easier,
there are a number of helper functions (such as starts_with(), ends_with(), contains(), matches(),
num_range(), one_of(), and everything()) that can be used in select(). To learn more about how to use them, see
?select_helpers and ?select.
Note: While referring to columns directly in select(), we use bare column names, but quotes should be used while referring to columns in helper functions.
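select() can also rename the columns it keeps; a sketch consistent with the output below:
select(mtcars_tbl, cylinders = cyl, displacement = disp)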
# A tibble: 32 x 2
# cylinders displacement
# <dbl> <dbl>
#1 6 160.0
#2 6 160.0
#3 4 108.0
#4 6 258.0
#5 8 360.0
#6 6 225.0
# ... with 26 more rows
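To rename columns while keeping all the others, use rename() instead; a sketch consistent with the output below:
rename(mtcars_tbl, cylinders = cyl, displacement = disp)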
# A tibble: 32 x 12
# cars mpg cylinders displacement hp drat wt qsec vs
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0
#2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0
#3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1
#4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1
#5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1
# ... with 26 more rows, and 3 more variables: am <dbl>, gear <dbl>, carb <dbl>
mutate
mutate can be used to add new columns to the data. Like all other functions in dplyr, mutate doesn't add the newly
created columns to the original data. Columns are added at the end of the data.frame.
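A sketch of a mutate call consistent with the output below (wt is weight in 1000 lbs, so half of it gives US tons; note
that mutate can reference a column it has just created):
mutate(mtcars_tbl, weight_ton = wt / 2, weight_pounds = weight_ton * 2000)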
# A tibble: 32 x 14
# cars mpg cyl disp hp drat wt qsec vs am gear carb weight_ton
weight_pounds
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
<dbl>
#1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 1.3100
2620
#2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 1.4375
2875
#3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1.1600
2320
#4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1.6075
3215
#5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1.7200
3440
#6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1.7300
3460
# ... with 26 more rows
To retain only the newly created columns, use transmute instead of mutate:
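A sketch consistent with the output below:
transmute(mtcars_tbl, weight_ton = wt / 2, weight_pounds = weight_ton * 2000)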
# A tibble: 32 x 2
# weight_ton weight_pounds
# <dbl> <dbl>
#1 1.3100 2620
#2 1.4375 2875
#3 1.1600 2320
#4 1.6075 3215
#5 1.7200 3440
#6 1.7300 3460
# ... with 26 more rows
summarise
summarise calculates summary statistics of variables by collapsing multiple values to a single value. It can calculate
multiple statistics and we can name these summary columns in the same statement.
To calculate the mean and standard deviation of mpg and disp of all cars in the dataset:
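A call consistent with the output below:
summarise(mtcars_tbl,
          mean_mpg = mean(mpg), sd_mpg = sd(mpg),
          mean_disp = mean(disp), sd_disp = sd(disp))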
# A tibble: 1 x 4
# mean_mpg sd_mpg mean_disp sd_disp
# <dbl> <dbl> <dbl> <dbl>
#1 20.09062 6.026948 230.7219 123.9387
group_by
group_by can be used to perform group wise operations on data. When the verbs defined above are applied on this
grouped data, they are automatically applied to each group separately.
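For example, computing the same kind of summary per number of cylinders (a sketch consistent with the output
below):
summarise(group_by(mtcars_tbl, cyl),
          mean_mpg = mean(mpg), sd_mpg = sd(mpg))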
# A tibble: 3 x 3
# cyl mean_mpg sd_mpg
# <dbl> <dbl> <dbl>
#1 4 26.66364 4.509828
#2 6 19.74286 1.453567
#3 8 15.10000 2.560048
We select the columns from cars through hp, order the rows by cyl and from highest to lowest mpg, group the
data by cyl, and finally keep only those cars that have mpg > 20 and hp > 75.
If we are not interested in the intermediate results, we can achieve the same result as above by nesting the
function calls:
filter(
group_by(
arrange(
select(
mtcars_tbl, cars:hp
), cyl, desc(mpg)
), cyl
),mpg > 20, hp > 75
)
This can be a little difficult to read. So, dplyr operations can be chained using the pipe %>% operator. The above
code translates to:
mtcars_tbl %>%
select(cars:hp) %>%
arrange(cyl, desc(mpg)) %>%
group_by(cyl) %>%
filter(mpg > 20, hp > 75)
mtcars_tbl %>%
summarise_all(n_distinct)
# A tibble: 1 x 12
# cars mpg cyl disp hp drat wt qsec vs am gear carb
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 32 25 3 27 22 22 29 30 2 2 3 6
mtcars_tbl %>%
group_by(cyl) %>%
summarise_all(n_distinct)
# A tibble: 3 x 12
# cyl cars mpg disp hp drat wt qsec vs am gear carb
# <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 4 11 9 11 10 10 11 11 2 2 3 2
#2 6 7 6 5 4 5 6 7 2 2 3 3
Note that we just had to add the group_by statement and the rest of the code is the same. The output now consists
of three rows - one for each unique value of cyl.
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"), mean)
# A tibble: 3 x 4
# cyl mpg disp hp
# <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636
#2 6 19.74286 183.3143 122.28571
#3 8 15.10000 353.1000 209.21429
helper functions (?select_helpers) can be used in place of column names to select specific columns
To apply multiple functions, either pass the function names as a character vector:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
c("mean", "sd"))
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
funs(mean, sd))
# A tibble: 3 x 7
# cyl mpg_mean disp_mean hp_mean mpg_sd disp_sd hp_sd
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.509828 26.87159 20.93453
#2 6 19.74286 183.3143 122.28571 1.453567 41.56246 24.26049
#3 8 15.10000 353.1000 209.21429 2.560048 67.77132 50.97689
Column names are now appended with the function names to keep them distinct. In order to change this, pass the
name to be appended with the function:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
c(Mean = "mean", SD = "sd"))
mtcars_tbl %>%
group_by(cyl) %>%
summarise_at(c("mpg", "disp", "hp"),
funs(Mean = mean, SD = sd))
# A tibble: 3 x 7
# cyl mpg_Mean disp_Mean hp_Mean mpg_SD disp_SD hp_SD
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.509828 26.87159 20.93453
Take the mean of all columns that are numeric grouped by cyl:
mtcars_tbl %>%
group_by(cyl) %>%
summarise_if(is.numeric, mean)
# A tibble: 3 x 11
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727
#2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
# ... with 4 more variables: vs <dbl>, am <dbl>, gear <dbl>,
# carb <dbl>
However, some variables are discrete, and mean of these variables doesn't make sense.
mtcars_tbl %>%
group_by(cyl) %>%
summarise_if(function(x) is.numeric(x) & n_distinct(x) > 6, mean)
# A tibble: 3 x 7
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727
#2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214
library(dplyr)
library(magrittr)
df <- mtcars
df$cars <- rownames(df) #just add the cars names to the df
df <- df[,c(ncol(df),1:(ncol(df)-1))] # and place the names in the first column
To compute statistics we use summarize and the appropriate functions. In this case n() is used for counting the
number of cases.
df %>%
summarize(count=n(),mean_mpg = mean(mpg, na.rm = TRUE),
min_weight = min(wt),max_weight = max(wt))
It is possible to compute the statistics by groups of the data. In this case by Number of cylinders and Number of
forward gears
df %>%
group_by(cyl, gear) %>%
summarize(count=n(),mean_mpg = mean(mpg, na.rm = TRUE),
min_weight = min(wt),max_weight = max(wt))
But if we want to use other features, such as summarize or filter, we need to use the interp() function from the
lazyeval package.
caret helps in these scenarios, independent of the actual learning algorithms used.
The heart of the preProcess() function is the method argument. Method operations are applied in this order:
1. Zero-variance filter
2. Near-zero variance filter
3. Box-Cox/Yeo-Johnson/exponential transformation
4. Centering
5. Scaling
6. Range
7. Imputation
8. PCA
9. ICA
10. Spatial Sign
Below, we take the mtcars data set and perform centering, scaling, and a spatial sign transform.
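A sketch of those steps with preProcess() and predict():
library(caret)
pp <- preProcess(mtcars, method = c("center", "scale", "spatialSign"))
mtcars_pp <- predict(pp, newdata = mtcars)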
This will extract all files in "bar.zip" to the "foo" directory, which will be created if necessary. Tilde expansion is
done automatically from your working directory. Alternatively, you can pass the whole path name to the zipfile.
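The call itself (not shown above) is simply:
unzip(zipfile = "bar.zip", exdir = "foo")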
Suppose that a fair die is rolled 10 times. What is the probability of throwing exactly two sixes?
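This is a binomial probability; with dbinom():
dbinom(2, size = 10, prob = 1/6)
# [1] 0.29071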
The number of sandwiches ordered in a restaurant on a given day is known to follow a Poisson distribution with a
mean of 20. What is the probability that exactly eighteen sandwiches will be ordered tomorrow?
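This is a Poisson probability; with dpois():
dpois(18, lambda = 20)
# [1] 0.08439355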
To find the value of the pdf at x=2.5 for a normal distribution with a mean of 5 and a standard deviation of 2, use
the command:
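With dnorm():
dnorm(2.5, mean = 5, sd = 2)
# [1] 0.09132454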
# r-noweb-file.Rnw
\documentclass{article}
<<echo=FALSE,cache=FALSE>>=
knitr::opts_chunk$set(echo=FALSE, cache=TRUE)
knitr::read_chunk('r-file.R')
@
\begin{document}
This is an Rnw file (R noweb). It contains a combination of LaTeX and R.
Once we have called the read\_chunk command above we can reference sections of code in the r-file.R
script.
<<Chunk1>>=
@
\end{document}
When using this approach we keep our code in a separate R file as shown below.
## r-file.R
## note the specific comment style of a single pound sign followed by four dashes
x <- seq(1:10)
y <- rev(seq(1:10))
plot(x,y)
# r-noweb-file.Rnw
\documentclass{article}
\begin{document}
<<my-label>>=
print("This is an R Code Chunk")
x <- seq(1:10)
@
\end{document}
# r-noweb-file.Rnw
\documentclass{article}
\begin{document}
This is an Rnw file (R noweb). It contains a combination of LaTeX and R.
<<code-chunk-label>>=
print("This is an R Code Chunk")
x <- seq(1:10)
y <- seq(1:10)
plot(x,y) # simple scatterplot of x against y
@
\end{document}
R> library(RCurl)
R> library(XML)
R> url <- "http://www.imdb.com/chart/top"
R> top <- getURL(url)
R> parsed_top <- htmlParse(top, encoding = "UTF-8")
R> top_table <- readHTMLTable(parsed_top)[[1]]
R> head(top_table[1:10, 1:3])
---
title: "Including Bibliography"
author: "John Doe"
output: pdf_document
bibliography: references.bib
---
# Abstract
@R_Core_Team_2016
# References
---
title: "Including LaTeX Preample Commands in RMarkdown"
header-includes:
- \renewcommand{\familydefault}{cmss}
- \usepackage[cm, slantedGreek]{sfmath}
- \usepackage[T1]{fontenc}
output: pdf_document
---
# Section 1
As you can see, this text uses the Computer Modern font!
---
title: "Including LaTeX Preample Commands in RMarkdown"
output:
pdf_document:
includes:
in_header: includes.tex
---
# Section 1
As you can see, this text uses the Computer Modern Font!
Here, the content of includes.tex are the same three commands we included with header-includes.
A possible third option is to write your own LaTex template and include it with template. But this covers a lot more
of the structure than only the preamble.
---
title: "My Template"
author: "Martin Schmelzer"
output:
pdf_document:
template: myTemplate.tex
---
knitr
xtable
pander
---
title: "Printing Tables"
author: "Martin Schmelzer"
date: "29 Juli 2016"
output: pdf_document
---
How can I stop xtable printing the comment ahead of each table?
options(xtable.comment = FALSE)
R-markdown is a markdown file with embedded blocks of R code called chunks. There are two types of R code
chunks: inline and block.
`r 2*2`

```{r}
2*2
```
And they come with several possible options. Here are the main ones (but there are many others):
echo (boolean) controls whether the code inside the chunk will be included in the document
include (boolean) controls whether the output should be included in the document
fig.width (numeric) sets the width of the output figures
fig.height (numeric) sets the height of the output figures
fig.cap (character) sets the figure captions
They are written in a simple tag=value format like in the example above.
Below is a basic example of R-markdown file illustrating the way R code chunks are embedded inside r-markdown.
# Title #

```{r}
gain
```
The R knitr package can be used to evaluate R chunks inside R-markdown file and turn it into a regular markdown
file.
The following steps are needed in order to turn an R-markdown file into pdf/html: first, convert the R-markdown
file into a regular markdown file with knit(); then convert the markdown file into pdf/html, for example with pandoc.
In addition to the above, the knitr package has wrapper functions knit2html() and knit2pdf() that can be used to
produce the final document without the intermediate step of manually converting it to the markdown format:
If the above example file was saved as income.Rmd it can be converted to a pdf file using the following R commands:
library(knitr)
knit2pdf("income.Rmd", "income.pdf")
x <- as.matrix(mtcars)
One can use heatmap.2 - a more recent optimized version of heatmap, by loading the following library:
require(gplots)
heatmap.2(x)
To add a title, x- or y-label to your heatmap, you need to set the main, xlab and ylab:
heatmap.2(x, main = "My main title: Overview of car features", xlab="Car features", ylab = "Car
brands")
If you wish to define your own color palette for your heatmap, you can set the col parameter by using the
colorRampPalette function:
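For example (the colors chosen here are arbitrary):
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 75)
heatmap.2(x, col = my_palette)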
As you can notice, the labels on the y axis (the car names) don't fit in the figure. In order to fix this, the user can
tune the margins parameter:
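For example (the values are chosen for illustration):
heatmap.2(x, margins = c(6, 10))   # margins for column labels (bottom) and row labels (right)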
Non-Directed Network
Directed Network
Reduce(`*`, 1:10)
Filter given a predicate function and a list of values returns a filtered list containing only values for whom
predicate function is TRUE.
Filter(is.character, list(1,"a",2,"b",3,"c"))
Find given a predicate function and a list of values returns the first value for which the predicate function is TRUE.
Find(is.character, list(1,"a",2,"b",3,"c"))
Position given a predicate function and a list of values returns the position of the first value in the list for which the
predicate function is TRUE.
Position(is.character, list(1,"a",2,"b",3,"c"))
Negate inverts a predicate function making it return FALSE for values where it returned TRUE and vice versa.
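For example, combining it with Filter to keep only the non-character elements:
Filter(Negate(is.character), list(1,"a",2,"b",3,"c"))
# returns list(1, 2, 3)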
One can ask for user input using the readline command:
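A minimal sketch (the prompt text is arbitrary):
answer <- readline(prompt = "Enter a number: ")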
The user can then give any answer, such as a number, a character, or a vector; scanning the result makes
sure that the user has given a proper answer. For example:
However, note that such code can get stuck in a never-ending loop, as user input is saved as a character.
To start working with Spark's distributed dataframes, you must connect your R program with an existing Spark
Cluster.
library(SparkR)
sc <- sparkR.init() # connection to Spark context
sqlContext <- sparkRSQL.init(sc) # connection to SQL context
There is an Apache Spark introduction topic with install instructions. Basically, you can employ a Spark Cluster
locally via java (see instructions) or use (non-free) cloud applications (e.g. Microsoft Azure [topic site], IBM).
Caching can optimize computation in Spark. Caching stores data in memory and is a special case of persistence.
Here is explained what happens when you cache an RDD in Spark.
Why:
Basically, caching saves an interim partial result - usually after transformations - of your original data. So, when you
use the cached RDD, the already transformed data from memory is accessed without recomputing the earlier
transformations.
How:
Here is an example of how to quickly access large data (here, a 3 GB csv) from in-memory storage when accessing it
more than once:
library(SparkR)
# next line is needed for direct csv import:
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-
shell"')
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
From csv:
For csv's, you need to add the csv package to the environment before initiating the Spark context:
Then, you can load the csv either by inferring the data schema of the data in the columns:
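A sketch of such a call for SparkR 1.x with the spark-csv package loaded as above (the path is hypothetical):
df <- read.df(sqlContext, "/path/to/file.csv",
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")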
If you want your code to be copy-pastable, remove prompts such as R>, >, or + at the beginning of each new line.
Some Docs authors prefer to not make copy-pasting easy, and that is okay.
Console output
Console output should be clearly distinguished from code. Common approaches include:
Assignment
= and <- are fine for assigning R objects. Use white space appropriately to avoid writing code that is difficult to
parse, such as x<-1 (ambiguous between x <- 1 and x < -1)
Code comments
Be sure to explain the purpose and function of the code itself. There isn't any hard-and-fast rule on whether this
explanation should be in prose or in code comments. Prose may be more readable and allows for longer
explanations, but code comments make for easier copy-pasting. Keep both options in mind.
Sections
Many examples are short enough to not need sections, but if you use them, start with H1.
Make it minimal and get to the point. Complications and digressions are counterproductive.
Include both working code and prose explaining it. Neither one is sufficient on its own.
Don't rely on external sources for data. Generate data or use the datasets library if possible:
library(help = "datasets")
Refer to built-in docs like ?data.frame whenever relevant. The SO Docs are not an attempt to replace the
built-in docs. It is important to make sure new R users know that the built-in docs exist as well as how to find
them.
This example illustrates a couple of common situations. See the links at the end for other resources.
Writing
Before making the example data below, make sure you're in a folder you want to write to. Run getwd() to verify the folder
you're in and read ?setwd if you need to change folders.
set.seed(1)
for (i in 1:3)
write.table(
data.frame(id = 1:2, v = sample(letters, 2)),
file = sprintf("file201%s.csv", i)
)
Reading
We have three similarly-formatted files (from the last section) to read in. Since these files are related, we should
store them together in a single list after reading them in:
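A sketch of that reading step, reusing the file names generated above:

file_names <- sprintf("file201%s.csv", 1:3)
file_contents <- lapply(setNames(file_names, file_names), read.table)
file_contents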
# $file2011.csv
# id v
# 1 1 g
# 2 2 j
#
# $file2012.csv
# id v
# 1 1 o
# 2 2 w
#
# $file2013.csv
# id v
# 1 1 f
# 2 2 w
To work with this list of files, first examine the structure with str(file_contents), then read about stacking the list
with ?rbind or iterating over the list with ?lapply.
Further resources
import() can also read from compressed directories, URLs (HTTP or HTTPS), and the clipboard. A comprehensive
list of all supported file formats is available on the rio package github repository.
It is even possible to specify some further parameters related to the specific file format you are trying to read,
passing them directly within the import() function:
import("example.csv", format = ",") #for csv file where comma is used as separator
import("example.csv", format = ";") #for csv file where semicolon is used as separator
Section 92.2: Read and write Stata, SPSS and SAS files
The packages foreign and haven can be used to import and export files from a variety of other statistical packages
like Stata, SPSS and SAS and related software. There is a read function for each of the supported data types to
import the files.
The foreign package can read in Stata (.dta) files for versions of Stata 7-12. According to the development page,
read.dta is more or less frozen and will not be updated for reading in versions 13+. For more recent versions of
Stata, you can use either the readstata13 package or haven. For readstata13, the files are read with read.dta13().
The SAScii package provides functions that will accept SAS SET import code and construct a text file that can be
processed with read.fwf. It has proved very robust for import of large public-released datasets. Support is at
https://github.com/ajdamico/SAScii
To export data frames to other statistical packages you can use the function write.foreign(). This will write
2 files, one containing the data and one containing instructions the other package needs to read the data.
Files stored by SPSS can also be read with read.spss in this way:
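A minimal sketch (the file name is an assumption):

mydata <- foreign::read.spss("survey.sav", to.data.frame = TRUE, use.value.labels = FALSE)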
R package    Uses
xlsx         Java
XLconnect    Java
openxlsx     C++
readxl       C++
RODBC        ODBC
gdata        Perl
For the packages that use Java or ODBC it is important to know details about your system, because you may have
compatibility issues depending on your R version and OS. For instance, if you are using 64-bit R then you must also
have 64-bit Java to use xlsx or XLconnect.
Some examples of reading excel files with each package are provided below. Note that many of the packages have
the same or very similar function names. Therefore, it is useful to state the package explicitly, like
package::function. The package openxlsx requires prior installation of RTools.
xlsx::read.xlsx("Book1.xlsx", sheetIndex=1)
xlsx::read.xlsx("Book1.xlsx", sheetName="Sheet1")
XLConnect automatically imports the pre-defined Excel cell-styles embedded in Book1.xlsx. This is useful when you
wish to format your workbook object and export a perfectly formatted Excel document. Firstly, you will need to
create the desired cell formats in Book1.xlsx and save them, for example, as myHeader, myBody and myPcts. Then,
after loading the workbook in R (see above):
The cell styles are now saved in your R environment. In order to assign the cell styles to certain ranges of your data,
you need to define the range and then assign the style:
Note that XLConnect is easy, but can become extremely slow in formatting. A much faster, but more cumbersome
formatting option is offered by openxlsx.
library(openxlsx)
#colNames: If TRUE, the first row of data will be used as column names.
#rowNames: If TRUE, first column of data will be used as row names.
The sheet, which should be read into R can be selected either by providing its position in the sheet argument:
openxlsx::read.xlsx("spreadsheet1.xlsx", sheet = 1)
Additionally, openxlsx can detect date columns in a read sheet. In order to allow automatic detection of dates, an
argument detectDates should be set to TRUE:
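For example, reusing the file from above:

openxlsx::read.xlsx("spreadsheet1.xlsx", sheet = 1, detectDates = TRUE)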
Excel files can be imported as a data frame into R using the readxl package.
library(readxl)
readxl::read_excel("spreadsheet1.xls")
readxl::read_excel("spreadsheet2.xlsx")
readxl::read_excel("spreadsheet.xls", sheet = 1)
readxl::read_excel("spreadsheet.xls", sheet = "summary")
The argument col_names = TRUE sets the first row as the column names.
The argument col_types can be used to specify the column types in the data as a vector.
library(RODBC)
With this approach, connecting through an SQL engine, Excel worksheets can be queried similarly to database tables,
including JOIN and UNION operations. Syntax follows the JET/ACE SQL dialect. NOTE: Only data access DML
statements, specifically SELECT, can be run on workbooks, which are considered non-updateable queries.
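A minimal sketch on Windows, assuming a workbook Book1.xlsx with a worksheet named Sheet1:

library(RODBC)
ch <- odbcConnectExcel2007("Book1.xlsx")        # requires the ACE/Jet ODBC driver
df <- sqlQuery(ch, "SELECT * FROM [Sheet1$]")   # worksheets are addressed as [Name$]
odbcClose(ch)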
Even other workbooks can be queried from the same ODBC channel pointing to the current workbook.
library(feather)
df <- mtcars                          # example data (assumed)
write_feather(df, "mtcars.feather")
df2 <- read_feather("mtcars.feather") # read back as a tibble
head(df2)
## A tibble: 6 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note to users: Feather should be treated as alpha software. In particular, the file format is likely to evolve
over the coming year. Do not use Feather for long-term data storage.
Using the package RMySQL we can easily query MySQL as well as MariaDB databases and store the result in an R
dataframe:
library(RMySQL)
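A minimal sketch (host, credentials, database and table names are placeholders):

con <- dbConnect(RMySQL::MySQL(), host = "HOSTNAME", user = "USERNAME",
                 password = "PASSWORD", dbname = "DBNAME")
df <- dbGetQuery(con, "SELECT * FROM tablename")
dbDisconnect(con)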
Using limits
It is also possible to define a limit, e.g. getting only the first 100,000 rows. In order to do so, just add a LIMIT
clause to the SQL query. Example:
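For example, reusing the connection from above (the table name is a placeholder):

df <- dbGetQuery(con, "SELECT * FROM tablename LIMIT 100000")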
The code connects to the server HOSTNAME as USERNAME with PASSWORD, tries to open the database TweetCollector
and read the collection Tweets. The query tries to read the field i.e. column Text.
The result is a dataframe with columns corresponding to the yielded data set. In the case of this example, the
dataframe contains the column Text, e.g. documents$Text.
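The package used for the connection is not shown in this excerpt; a sketch with the mongolite package (host and
credentials are placeholders):

library(mongolite)
coll <- mongo(collection = "Tweets", db = "TweetCollector",
              url = "mongodb://USERNAME:PASSWORD@HOSTNAME")
documents <- coll$find(fields = '{"Text" : 1, "_id" : 0}')
head(documents$Text)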
library(rgdal)
readOGR(dsn = "path/to/the/folder/containing/the/shapefile", layer = "map")
To export a shapefile use the writeOGR function. The first argument is the spatial object produced in R. dsn and
layer are the same as above. The obligatory fourth argument is the driver used to generate the shapefile. The function
ogrDrivers() lists all available drivers. If you want to export a shapefile to ArcGIS or QGIS you could use driver =
"ESRI Shapefile".
The tmap package has a very convenient function read_shape(), which is a wrapper for rgdal::readOGR(). The
read_shape() function simplifies the process of importing a shapefile a lot. On the downside, tmap is quite heavy.
library(raster)
r <- stack("C:/Program Files/R/R-3.2.3/doc/html/logo.jpg")
plot(r)
plot(r[[1]])
saveRDS/readRDS only handle a single R object. However, they are more flexible than the multi-object storage
approach in that the object name of the restored object need not be the same as the object name when the object
was stored.
Using an .rds file, for example to save the iris dataset, we would use:
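A sketch (the file name is an assumption):

saveRDS(iris, file = "my_iris.rds")
iris2 <- readRDS("my_iris.rds")   # the restored object may be given any name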
To load objects that were stored with save() in an .Rdata file:
load("myIrisAndCarsData.Rdata")
Subsetting
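A sketch that would produce the output below (the vector and the logical index are assumptions):

x <- 1:10
x[c(TRUE, FALSE)]   # the logical index is recycled to the length of x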
[1] 1 3 5 7 9
Here the logical expression was expanded to the length of the vector.
# the string
str <- "1+1"
eval(str)
[1] "1+1"
parsed.str <- parse(text = str)  # parse the string into an (unevaluated) expression
is.expression(parsed.str)
[1] TRUE
eval(parsed.str)
[1] 2
grep('5', mystring)
# [1] 1
grep('@', mystring)
# [1] 5
grep('number', mystring)
# [1] 1 2 3
grep('5|8', mystring)
# [1] 1 2
grep('com|org', mystring)
# [1] 5 6
To match a literal character, you have to escape the string with a backslash (\). However, R tries to look for escape
characters when creating strings, so you actually need to escape the backslash itself (i.e. you need to double escape
regular expression characters.)
grep('\.org', tricky)
# Error: '\.' is an unrecognized escape in character string starting "'\."
grep('\\.org', tricky)
# [1] 1
If you want to match one of several characters, you can wrap those characters in brackets ([])
grep('[13]', mystring)
It may be useful to indicate character sequences. E.g. [0-4] will match 0, 1, 2, 3, or 4, [A-Z] will match any
uppercase letter, [A-z] will match any uppercase or lowercase letter, and [A-z0-9] will match any letter or number
(i.e. all alphanumeric characters)
grep('[0-4]', mystring)
# [1] 3 4
grep('[A-Z]', mystring)
# [1] 1 2 4 5 6
R also has several shortcut classes that can be used in brackets. For instance, [:lower:] is short for a-z, [:upper:]
is short for A-Z, [:alpha:] is A-z, [:digit:] is 0-9, and [:alnum:] is A-z0-9. Note that these whole expressions
must be used inside brackets; for instance, to match a single digit, you can use [[:digit:]] (note the double
brackets). As another example, [@[:digit:]/] will match the characters @, / or 0-9.
grep('[[:digit:]]', mystring)
# [1] 1 2 3 4
grep('[@[:digit:]/]', mystring)
# [1] 1 2 3 4 5 7
Brackets can also be used to negate a match with a carat (^). For instance, [^5] will match any character other than
"5".
Look-ahead/look-behind
"(?<=A)B" matches an appearance of the letter B only if it's preceded by A, i.e. "ABACADABRA" would be
matched, but "abacadabra" and "aBacadabra" would not.
\\d{4}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])
The above expression considers dates from year: 0000-9999, months between: 01-12 and days 01-31.
For example:
Note: It validates the date syntax, but we can have a wrong date with a valid syntax, for example 20170229 (2017 is
not a leap year).
If you want to validate a date, it can be done via this user defined function:
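A sketch of such a function, relying on as.Date() to reject impossible dates:

is_valid_date <- function(x, format = "%Y%m%d") {
  !is.na(as.Date(as.character(x), format = format))
}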
Then
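For instance, applying it to the example above:

is_valid_date("20170229")   # expected FALSE: 2017 is not a leap year
is_valid_date("20170228")   # expected TRUE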
Note that the pattern argument (whose name can be omitted if it appears first, and which only needs partial spelling
when named) is the only argument to require this doubling or pairing. The replacement argument does not require the
doubling of characters needing to be escaped. If you wanted all the linefeeds and 4-space occurrences replaced with
tabs it would be:
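For example (txt is a hypothetical character vector):

gsub("\n|    ", "\t", txt)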
us.states.pattern <-
  "(A[LKSZR])|(C[AOT])|(D[EC])|(F[ML])|(G[AU])|(HI)|(I[DLNA])|(K[SY])|(LA)|(M[EHDAINSOT])|(N[EVHJMYCD])|(MP)|(O[HKR])|(P[WAR])|(RI)|(S[CD])|(T[NX])|(UT)|(V[TIA])|(W[AVIY])"
For example:
> test <- c("AL", "AZ", "AR", "AJ", "AS", "DC", "FM", "GU","PW", "FL", "AJ", "AP")
> grepl(us.states.pattern, test)
[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
>
Note:
If you want to verify only the 50 states, then we recommend using the R dataset state.abb from the state datasets, for
example:
> data(state)
> test %in% state.abb
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
We get TRUE only for 50-States abbreviations: AL, AZ, AR, FL.
Validates a phone number in the form of: +1-xxx-xxx-xxxx, including optional leading/trailing blanks at the
beginning/end of each group of numbers, but not in the middle, for example: +1-xxx-xxx-xx xx is not valid. The -
delimiter can be replaced by blanks: xxx xxx xxx or without delimiter: xxxxxxxxxx. The +1 prefix is optional.
Valid cases:
Invalid cases:
Note:
combn(LETTERS, 3)
With replacement
For the special case of pairs, outer can be used, putting each vector into a cell:
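A minimal sketch (three letters assumed as the input set):

outer(letters[1:3], letters[1:3], FUN = paste, sep = "")   # all ordered pairs, with replacement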
With replacement
length(letters)^5
[1] 11881376
library(deSolve)
## -----------------------------------------------------------------------------
## Define R-function
## ----------------------------------------------------------------------------
## -----------------------------------------------------------------------------
## Define parameters and variables
## -----------------------------------------------------------------------------
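## The model function and its inputs are omitted in this excerpt; a sketch of the
## classic Lorenz setup used in the deSolve documentation (values are assumptions):
Lorenz <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Z
    list(c(dX, dY, dZ))
  })
}
parms <- c(a = -8/3, b = -10, c = 28)
yini  <- c(X = 1, Y = 1, Z = 1)
times <- seq(0, 100, by = 0.01)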
## Solve the ODEs
## -----------------------------------------------------------------------------
out <- ode(y = yini, times = times, func = Lorenz, parms = parms)
## -----------------------------------------------------------------------------
## Plot the results
## -----------------------------------------------------------------------------
plot(out, lwd = 2)
plot(out[,"X"], out[,"Y"],
type = "l", xlab = "X",
ylab = "Y", main = "butterfly")
## -----------------------------------------------------------------------------
## Define R-function
## -----------------------------------------------------------------------------
dP <- rG * P * (1 - P/K) - rI * P * C
dC <- rI * P * C * AE - rM * C
## -----------------------------------------------------------------------------
## Define parameters and variables
## -----------------------------------------------------------------------------
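## The wrapper function for the derivatives above and the parameter values are
## omitted in this excerpt; a sketch (values assumed from the standard deSolve
## predator-prey example):
LV <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dP <- rG * P * (1 - P/K) - rI * P * C
    dC <- rI * P * C * AE - rM * C
    list(c(dP, dC), sum = C + P)   # extra output so the plot can show prey, predator and sum
  })
}
parms <- c(rI = 0.2, rG = 1.0, rM = 0.2, AE = 0.5, K = 10)
yini  <- c(P = 1, C = 2)
times <- seq(0, 200, by = 1)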
## -----------------------------------------------------------------------------
## Solve the ODEs
## -----------------------------------------------------------------------------
out <- ode(y = yini, times = times, func = LV, parms = parms)
## -----------------------------------------------------------------------------
## Plot the results
## -----------------------------------------------------------------------------
matplot(out[ ,1], out[ ,2:4], type = "l", xlab = "time", ylab = "Conc",
main = "Lotka-Volterra", lwd = 2)
legend("topright", c("prey", "predator", "sum"), col = 1:3, lty = 1:3)
## -----------------------------------------------------------------------------
## Define parameters and variables
## -----------------------------------------------------------------------------
## -----------------------------------------------------------------------------
## Define R-function
## -----------------------------------------------------------------------------
yb <- r * sin(w * t)
xb <- sqrt(L * L - yb * yb)
Ll <- sqrt(xl^2 + yl^2)
Lr <- sqrt((xr - xb)^2 + (yr - yb)^2)
dxl <- ul; dyl <- vl; dxr <- ur; dyr <- vr
c1 <- xb * xl + yb * yl
c2 <- (xl - xr)^2 + (yl - yr)^2 - L * L
return(list(c(dxl, dyl, dxr, dyr, dul, dvl, dur, dvr, c1, c2)))
})
}
#include <R.h>
/*----------------------------------------------------------------------
initialising the parameter common block
----------------------------------------------------------------------
*/
void init_C(void (* daeparms)(int *, double *)) {
int N = 8;
daeparms(&N, parms);
}
/* Compartments */
#define xl y[0]
#define yl y[1]
#define xr y[2]
#define yr y[3]
#define lam1 y[8]
#define lam2 y[9]
/*----------------------------------------------------------------------
the residual function
----------------------------------------------------------------------
*/
void caraxis_C (int *neq, double *t, double *y, double *ydot,
double *yout, int* ip)
{
double yb, xb, Lr, Ll;
yb = r * sin(w * *t) ;
xb = sqrt(L * L - yb * yb);
Ll = sqrt(xl * xl + yl * yl) ;
Lr = sqrt((xr-xb)*(xr-xb) + (yr-yb)*(yr-yb));
ydot[0] = y[4];
ydot[1] = y[5];
ydot[2] = y[6];
ydot[3] = y[7];
ydot[8] = xb * xl + yb * yl;
ydot[9] = (xl-xr) * (xl-xr) + (yl-yr) * (yl-yr) - L*L;
}
", fill = TRUE)
sink()
system("R CMD SHLIB caraxis_C.c")
dyn.load(paste("caraxis_C", .Platform$dynlib.ext, sep = ""))
dllname_C <- dyn.load(paste("caraxis_C", .Platform$dynlib.ext, sep = ""))[[1]]
external daeparms
integer, parameter :: N = 8
double precision parms(N)
common /myparms/parms
c----------------------------------------------------------------
c rate of change
c----------------------------------------------------------------
subroutine caraxis_fortran(neq, t, y, ydot, out, ip)
implicit none
integer neq, IP(*)
double precision t, y(neq), ydot(neq), out(*)
double precision eps, M, k, L, L0, r, w, g
common /myparms/ eps, M, k, L, L0, r, w, g
double precision xl, yl, xr, yr, ul, vl, ur, vr, lam1, lam2
double precision yb, xb, Ll, Lr, dxl, dyl, dxr, dyr
double precision dul, dvl, dur, dvr, c1, c2
yb = r * sin(w * t)
xb = sqrt(L * L - yb * yb)
Ll = sqrt(xl**2 + yl**2)
Lr = sqrt((xr - xb)**2 + (yr - yb)**2)
dxl = ul
dyl = vl
dxr = ur
dyr = vr
c1 = xb * xl + yb * yl
c2 = (xl - xr)**2 + (yl - yr)**2 - L * L
sink()
system("R CMD SHLIB caraxis_fortran.f")
dyn.load(paste("caraxis_fortran", .Platform$dynlib.ext, sep = ""))
dllname_fortran <- dyn.load(paste("caraxis_fortran", .Platform$dynlib.ext, sep = ""))[[1]]
library(microbenchmark)
R <- function(){
out <- ode(y = yini, times = times, func = caraxis_R,
parms = parameter)
}
C <- function(){
out <- ode(y = yini, times = times, func = "caraxis_C",
initfunc = "init_C", parms = parameter,
dllname = dllname_C)
}
all.equal(tail(R()), tail(fortran()))
all.equal(R()[,2], fortran()[,2])
all.equal(R()[,2], C()[,2])
Make a benchmark (Note: On your machine the times are, of course, different):
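A sketch of the benchmark call (the number of repetitions is an assumption):

bench <- microbenchmark::microbenchmark(R(), C(), fortran(), times = 10)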
summary(bench)
We see clearly that the pure R version is slow in contrast to the definitions in C and Fortran. For big models it is
worth translating the problem into a compiled language. The package cOde is one possibility to translate ODEs from R to C.
You can manually detect numerical variance below your own threshold:
data("GermanCredit")
variances<-apply(GermanCredit, 2, var)
variances[which(variances<=0.0025)]
Or, you can use the caret package to find near zero variance. An advantage here is that it defines near zero
variance not by the numerical calculation of variance, but rather as a function of rarity:
"nearZeroVar diagnoses predictors that have one unique value (i.e. are zero variance predictors) or
predictors that are have both of the following characteristics: they have very few unique values relative to
the number of samples and the ratio of the frequency of the most common value to the frequency of the
second most common value is large..."
library(caret)
names(GermanCredit)[nearZeroVar(GermanCredit)]
library(VIM)
data(sleep)
colMeans(is.na(sleep))
In this case, we may want to remove NonD and Dream, which each have around 20% missing values (your cutoff
may vary)
correlationMatrix <- cor(mtcars)   # example data (assumed); the text below refers to mtcars columns
# pick only one out of each highly correlated pair's mirror image
correlationMatrix[upper.tri(correlationMatrix)]<-0
# find features that are highly correlated with another feature at the +- 0.85 level
apply(correlationMatrix,2, function(x) any(abs(x)>=0.85))
I'll want to look at what MPG is correlated to so strongly, and decide what to keep and what to toss. Same for cyl
and disp. Alternatively, I might need to combine some strongly correlated features.
---
title: "Writing an academic paper in R"
author: "Author"
date: "Date"
output:
pdf_document:
number_sections: yes
toc: yes
bibliography: bibliography.bib
---
@ARTICLE{Meyer2000,
AUTHOR="Bernd Meyer",
TITLE="A constraint-based framework for diagrammatic reasoning",
JOURNAL="Applied Artificial Intelligence",
VOLUME= "14",
ISSUE = "4",
PAGES= "327--344",
YEAR=2000
}
To cite an author mentioned in your .bib file, write @ followed by the bibkey, e.g. @Meyer2000.
# Introduction
# Summary
# References
Rendering the RMD file via RStudio (Ctrl+Shift+K) or via the console with rmarkdown::render("<path-to-your-RMD-file>")
results in the following output:
# Introduction
# Summary
# References
Rendering this file results in the same output as in the example "Specifying a bibliography".
To use another style than the default one, the following code is used:
---
title: "Writing an academic paper in R"
author: "Author"
date: "Date"
output:
pdf_document:
number_sections: yes
toc: yes
bibliography: bibliography.bib
csl: elsevier-harvard.csl
---
# Summary
# Reference
Create a sequence of step-length one from the smallest to the largest value for each row in a matrix.
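A sketch with an anonymous function inside apply() (the matrix is an assumption):

m <- matrix(c(1, 5, 2, 4, 3, 6), nrow = 2)
apply(m, 1, function(row) seq(min(row), max(row)))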
(function() { 1 })()
[1] 1
is equivalent to
f <- function() { 1 }
f()
[1] 1
One can easily define their own snippet template, i.e. like the one below
The option is Edit Snippets in the Global Options -> Code menu.
A function can be very simple, to the point of being pretty much pointless. It doesn't even need to take an
argument:
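For example, a trivial sketch (the name one is reused in the prose further below):

one <- function() { 1 }
one()
# [1] 1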
What's between the curly braces { } is the function proper. As long as you can fit everything on a single line they
aren't strictly needed, but can be useful to keep things organized.
A function can be very simple, yet highly specific. This function takes as input a vector (vec in this example) and
outputs the same vector with the vector's length (6 in this case) subtracted from each of the vector's elements.
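A sketch of such a function (the vector values are assumptions; its length is 6 as in the text):

vec <- c(3, 6, 9, 12, 15, 18)
subtract.length <- function(x) { x - length(x) }
subtract.length(vec)
# [1] -3  0  3  6  9 12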
Notice that length() is in itself a pre-supplied (i.e. Base) function. You can of course use a previously self-made
function within another self-made function, as well as assign variables and perform other operations while
spanning several lines:
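A sketch consistent with the call and output shown below (the body of msdf and the values of vec2 are
reconstructed, not taken verbatim from the original):

msdf <- function(x, multiplier = 4) {
  mult <- x * multiplier
  subl <- x - length(x)
  data.frame(mult, subl)
}
vec2 <- c(2, 2.5, 3, 3.5)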
msdf(vec2, 5)
mult subl
1 10.0 -2.0
2 12.5 -1.5
3 15.0 -1.0
4 17.5 -0.5
multiplier=4 makes sure that 4 is the default value of the argument multiplier, if no value is given when calling the function.
The above are all examples of named functions, so called simply because they have been given names (one, two,
subtract.length etc.)
There are 4 variants of color schemes: magma, plasma, inferno, and viridis (default). They are chosen with the
option parameter and are coded as A, B, C, and D, correspondingly. To have an impression of the 4 color schemes,
look at the maps:
A nice feature of the viridis color scheme is its integration with ggplot2. Within the package two ggplot2-specific
functions are defined: scale_color_viridis() and scale_fill_viridis(). See the example below:
library(viridis)
library(ggplot2)
library(cowplot)
# example plots (data assumed): the same map drawn with viridis options "B" and "D"
gg1 <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) + geom_tile() + scale_fill_viridis(option = "B")
gg2 <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) + geom_tile() + scale_fill_viridis(option = "D")
output <- plot_grid(gg1, gg2, labels = c('B', 'D'), label_size = 20)
print(output)
An example of use
color_glimpse(blues9)
The output is a function that takes n (number) as input and produces a color vector of length n according to the
selected palette.
pal(10)
[1] "#023FA5" "#6371AF" "#959CC3" "#BEC1D4" "#DBDCE0" "#E0DBDC" "#D6BCC0" "#C6909A" "#AE5A6D"
"#8E063B"
The Color Universal Design from the University of Tokyo proposes the following palettes:
An example of use
library(ggplot2)
colorRampPalette creates a function that interpolates a set of given colors to create new color palettes. This output
function takes n (a number) as input and produces a color vector of length n interpolating the initial colors.
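For example, a minimal sketch (the two endpoint colors are assumptions):

pal <- colorRampPalette(c("white", "darkgreen"))
pal(5)   # five colors interpolated between white and dark green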
rgb(0,1,0)
hclust expects a distance matrix, not the original data. We compute the tree using the default parameters:
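A sketch of the omitted call (ruspini ships with the cluster package):

library(cluster)
ruspini_hc_defaults <- hclust(dist(ruspini))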
Cut the tree to give four clusters and replot the data coloring the points by cluster. k is the desired number of
clusters.
rhc_def_4 = cutree(ruspini_hc_defaults,k=4)
plot(ruspini, pch=20, asp=1, col=rhc_def_4)
scaled_ruspini_hc_defaults = hclust(dist(scale(ruspini)))
srhc_def_4 = cutree(scaled_ruspini_hc_defaults,4)
plot(ruspini, pch=20, asp=1, col=srhc_def_4)
set.seed(656)
x = c(rnorm(150, 0, 1), rnorm(150,9,1), rnorm(150,4.5,1))
y = c(rnorm(150, 0, 1), rnorm(150,0,1), rnorm(150,5,1))
XYdf = data.frame(x,y)
plot(XYdf, pch=20)
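The single-linkage clustering call is omitted in this excerpt; a sketch consistent with the object names used below:

XY_sing <- hclust(dist(XYdf), method = "single")
XYs3 <- cutree(XY_sing, k = 3)
table(XYs3)   # with a small k, single linkage tends to isolate outliers as tiny clusters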
hclust found two outliers and put everything else into one big cluster. To get the "real" clusters, you may need to set
k higher.
XYs6 = cutree(XY_sing,k=6)
table(XYs6)
XYs6
1 2 3 4 5 6
148 150 1 149 1 1
plot(XYdf, pch=20, col=XYs6)
######################################################
## RF Classification Example
set.seed(656) ## for reproducibility
S_RF_Class = randomForest(Gp ~ ., data=Soils[,c(4,6:14)])
Gp_RF = predict(S_RF_Class, Soils[,6:14])
length(which(Gp_RF != Soils$Gp)) ## No Errors
This example tested on the training data, but illustrates that RF can make very good models.
######################################################
## RF Regression Example
set.seed(656) ## for reproducibility
S_RF_Reg = randomForest(pH ~ ., data=Soils[,6:14])
pH_RF = predict(S_RF_Reg, Soils[,6:14])
library(opencpu)
opencpu$start(port = 5936)
After this code is executed, you can use URLs to access the functions of the R session. The result could be XML,
html, JSON or some other defined formats.
The call is asynchronous, meaning that the R session is not blocked while waiting for the call to finish (contrary to
shiny).
Random Forest classifier objects can be created in R by preparing the class variable as factor, which is already
apparent in the iris data set. Therefore we can easily create a Random Forest by:
library(randomForest)
rf <- randomForest(x = iris[, 1:4], y = iris$Species, ntree = 500, do.trace = 100)
rf
# Call:
# randomForest(x = iris[, 1:4], y = iris$Species, ntree = 500, do.trace = 100)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 4%
# Confusion matrix:
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 3 47 0.06
parameters  Description
x           a data frame holding the describing variables of the classes
y           the classes of the individual observations; if this vector is a factor, a classification model is created, if not, a regression model is created
ntree       the number of individual CART trees built
do.trace    every i-th step, the out-of-bag errors overall and for each class are returned
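The texreg examples below assume three fitted models; a sketch (the formulas are assumptions):

fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
fit3 <- lm(mpg ~ wt + hp + disp, data = mtcars)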
# export to html
texreg::htmlreg(list(fit1,fit2,fit3),file='models.html')
# export to doc
texreg::htmlreg(list(fit1,fit2,fit3),file='models.doc')
There are several additional handy parameters in texreg::htmlreg() function. Here is a use case for the most
helpful parameters.
# export to html
texreg::htmlreg(list(fit1,fit2,fit3),file='models.html',
single.row = T,
custom.model.names = LETTERS[1:3],
leading.zero = F,
digits = 3)
Printing (as seen in the console) might suffice for a plain-text document to be viewed in monospaced font:
Note: Before making the example data below, make sure you're in an empty folder you can write to. Run getwd() and
read ?setwd if you need to change folders.
..w = options()$width
options(width = 500) # reduce text wrapping
sink(file = "mytab.txt")
summary(mtcars)
sink()
options(width = ..w)
rm(..w)
Writing to CSV (or another common format) and then opening in a spreadsheet editor to apply finishing touches is
another option:
Note: Before making the example data below, make sure you're in an empty folder you can write to. Run getwd() and
read ?setwd if you need to change folders.
write.csv(mtcars, file="mytab.csv")
Further resources
knitr::kable
stargazer
tables::tabular
texreg
xtable
Further Resources
R provides several mechanisms to simulate the OO paradigm; let's apply the S4 object system to implement this
pattern.
PROBLEM ENUNCIATION
We need to parse a file where each line provides information about a person, using a delimiter (";"), but some
information provided is optional, and instead of providing an empty field, it is missing. On each line we can have the
following information: Name;[Address;]Phone. Where the address information is optional, sometimes we have it
and sometimes don’t, for example:
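GREGORY BROWN; 25 NE 25TH; +1-786-987-6543
DAVID SMITH;786-123-4567

(the same sample lines used in the test at the end of this example)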
The second line does not provide address information. Therefore the number of delimiters may be different, as in this
case with one delimiter while the other lines have two delimiters. Because the number of delimiters may vary, one
way to attack this problem is to recognize the presence or not of a given field based on its pattern. In such a case we
can use a regular expression for identifying such patterns. For example:
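These are the patterns used later in the state classes:

Name:    "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$"
Address: "^\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$"
Phone:   "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$"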
Notes:
I am considering the most common pattern of US addresses and phones; it can be easily extended to consider
more general situations.
In R the sign "\" has special meaning for character variables, therefore we need to escape it.
In order to simplify the process of defining regular expressions a good recommendation is to use the
following web page: regex101.com, so you can play with it, with a given example, until you get the expected
result for all possible combinations.
The idea is to identify each line field based on previously defined patterns. The State pattern define the following
entities (classes) that collaborate to control the specific behavior (The State Pattern is a behavior pattern):
Context: Stores the context information of the parsing process, i.e. the current state and handles the entire
State Machine Process. For each state, an action is executed (handle()), but the context delegates it, based
on the state, on the action method defined for a particular state (handle() from State class). It defines the
interface of interest to clients. Our Context class can be defined like this:
Attributes: state
Methods: handle(), ...
State: The abstract class that represents any state of the State Machine. It defines an interface for
encapsulating the behavior associated with a particular state of the context. It can be defined like this:
Attributes: name, pattern
Methods: doAction(), isState (using pattern attribute verify whether the input argument belong to
this state pattern or not), …
Concrete States (state sub-classes): Each subclass of the class State that implements a behavior associated
with a state of the Context. Our sub-classes are: InitState, NameState, AddressState, PhoneState. Such
classes just implements the generic method using the specific logic for such states. No additional attributes
are required.
Note: It is a matter of preference how to name the method that carries out the action: handle(), doAction() or
goNext(). The method name doAction() could be the same for both classes (State or Context); we preferred to name it
handle() in the Context class to avoid confusion when defining two generic methods with the same input
arguments but different classes.
PERSON CLASS
setClass(Class = "Person",
slots = c(name = "character", address = "character", phone = "character")
)
It is a good recommendation to initialize the class attributes. The setClass documentation suggests using a generic
method labeled as "initialize", instead of using deprecated attributes such as: prototype, representation.
setMethod("initialize", "Person",
definition = function(.Object, name = NA_character_,
address = NA_character_, phone = NA_character_) {
.Object@name <- name
.Object@address <- address
.Object@phone <- phone
.Object
}
)
Because the initialize method is already a standard generic method of the methods package, we need to respect its
original argument definition. We can check it by typing the method name in the console:
> initialize
It returns the entire function definition; you can see at the top how the function is defined, i.e. function (.Object, ...).
Therefore when we use setMethod we need to follow exactly the same syntax (.Object).
Another existing generic method is show; it is equivalent to the toString() method from Java and it is a good idea to have
a specific implementation for the class domain:
Note: We use the same convention as in the default toString() Java implementation.
Let's say we want to save the parsed information (a list of Person objects) into a dataset; then we should first be able
to convert the list of objects into something R can transform (for example, coerce the object to a list). We
can define the following additional method (for more detail about this see the post):
R does not provide syntactic sugar for OO because the language was initially conceived to provide valuable
functions for statisticians. Therefore each user method requires two parts: 1) the definition part (via setGeneric)
and 2) the implementation part (via setMethod), like in the above example.
STATE CLASS
setMethod("initialize", "State",
definition = function(.Object, name = NA_character_, pattern = NA_character_) {
.Object@name <- name
.Object@pattern <- pattern
.Object
}
)
Every sub-class of State will have an associated name and pattern, but also a way to identify whether a given
input belongs to this state or not (the isState() method), and will also implement the corresponding actions for this
state (the doAction() method).
In order to understand the process, let's define the transition matrix for each state based on the input received:
Note: The cell [row, col]=[i,j] represents the destination state for the current state j, when it receives the input
i.
It means that under the state Name it can receive two inputs: an address or a phone number. Another way to
represent the transition table is via the following UML state machine diagram:
STATE SUB-CLASSES
Init State:
setMethod("initialize", "InitState",
definition = function(.Object, name = "init", pattern = NA_character_) {
.Object@name <- name
.Object@pattern <- pattern
.Object
}
)
In R, the way to indicate that a class is a sub-class of another class is by using the attribute contains with the class
name of the parent class.
The initial state does not have an associated pattern; it just represents the beginning of the process, so we initialize
the class with an NA value.
Now let's implement the generic methods from the State class:
For this particular state (without a pattern), the idea is that it just initializes the parsing process, expecting the first
field to be a name; otherwise it will be an error.
The doAction method provides the transition and updates the context with the information extracted. Here we are
accessing the context information via the @-operator. Instead, we could define get/set methods to encapsulate this
process (as mandated by OO best practices: encapsulation), but that would add four more methods per get/set pair
without adding value for the purpose of this example.
It is a good recommendation, in all doAction implementations, to add a safeguard for when the input argument is not
properly identified.
Name State
setMethod("initialize","NameState",
definition=function(.Object, name="name",
pattern = "^([A-Z]'?\\s+)* *[A-Z]+(\\s+[A-Z]{1,2}\\.?,? +)*[A-Z]+((-|\\s+)[A-Z]+)*$") {
.Object@pattern <- pattern
.Object@name <- name
.Object
}
)
We use the function grepl to verify that the input matches a given pattern.
setMethod(f="isState", signature="NameState",
definition=function(obj, input) {
result <- grepl(obj@pattern, input, perl=TRUE)
return(result)
}
)
Here we consider two possible transitions: one to the Address state and the other one to the Phone state. In all cases we
update the context information:
The way to identify the state is to invoke the method isState() for a particular state. We create default instances of the
specific states (addressState, phoneState) and then ask for a particular validation.
The logic for the other sub-classes (one per state) implementation is very similar.
Address State
setMethod("initialize", "AddressState",
definition = function(.Object, name="address",
pattern = "^\\s[0-9]{1,4}(\\s+[A-Z]{1,2}[0-9]{1,2}[A-Z]{1,2}|[A-Z\\s0-9]+)$") {
.Object@pattern <- pattern
.Object@name <- name
.Object
  }
)
setMethod(f="isState", signature="AddressState",
definition=function(obj, input) {
result <- grepl(obj@pattern, input, perl=TRUE)
return(result)
}
)
Phone State
setMethod("initialize", "PhoneState",
definition = function(.Object, name = "phone",
pattern = "^\\s*(\\+1(-|\\s+))*[0-9]{3}(-|\\s+)[0-9]{3}(-|\\s+)[0-9]{4}$") {
.Object@pattern <- pattern
.Object@name <- name
.Object
}
)
Here is where we add the person information into the list of persons of the context.
CONTEXT CLASS
Now let's explain the Context class implementation. We can define it considering the following attributes:
setClass(Class = "Context",
slots = c(state = "State", persons = "list", person = "Person")
)
Where
Note: Optionally, we can add a name to identify the context by name in case we are working with more than one
parser type.
setMethod(f="initialize", signature="Context",
definition = function(.Object) {
.Object@state <- new("InitState")
.Object@persons <- list()
.Object@person <- new("Person")
return(.Object)
}
)
With such generic methods, we control the entire behavior of the parsing process:
handle() method, delegates on doAction() method from the current state of the context:
First, we split the original line in an array using the delimiter to identify each element via the R-function strsplit(),
then iterate for each element as an input value for a given state. The handle() method returns again the context
with the updated information (state, person, persons attribute).
Because R makes a copy of the input argument, we need to return the context (obj):
Finally, let's test the entire solution. Define the lines to parse, where for the second line the address information is
missing.
s <- c(
"GREGORY BROWN; 25 NE 25TH; +1-786-987-6543",
"DAVID SMITH;786-123-4567",
"ALAN PEREZ; 25 SE 50TH; +1-786-987-5553"
)
df <- as.df(context)
> df
name address phone
1 GREGORY BROWN 25 NE 25TH +1-786-987-6543
2 DAVID SMITH <NA> 786-123-4567
3 ALAN PEREZ 25 SE 50TH +1-786-987-5553
> show(context@persons[[1]])
Person@[name='GREGORY BROWN', address='25 NE 25TH', phone='+1-786-987-6543']
>show(new("PhoneState"))
PhoneState@[name='phone', pattern='^\s*(\+1(-|\s+))*[0-9]{3}(-|\s+)[0-9]{3}(-|\s+)[0-9]{4}$']
$address
[1] "25 NE 25TH"
$phone
[1] "+1-786-987-6543"
>
CONCLUSION
This example shows how to implement the State pattern using one of the mechanisms available in R for the OO
paradigm. Nevertheless, the R OO solution is not user-friendly and differs a lot from other OOP languages. You need
to switch your mindset because the syntax is completely different; it is more reminiscent of the functional
programming paradigm. For example, instead of object.setID("A1") as in Java/C#, in R you have to invoke the
method this way: setID(object, "A1"). Therefore you always have to include the object as an input argument to
provide the context of the function. In the same way, there is no special this class attribute, nor a "." notation for
accessing methods or attributes of a given class. It is more error-prone because referring to a class or method is
done via a character value ("Person", "isState", etc.).
That said, the S4 class solution requires many more lines of code than traditional Java/C# for doing simple tasks.
Anyway, the State pattern is a good and generic solution for such kinds of problems. It simplifies the process by
delegating the logic to a particular state. Instead of having a big if-else block controlling all situations, we have
smaller if-else blocks inside each State sub-class implementing the action to carry out in each state.
## example data
set.seed(123)
df <- data.frame(
name = rep(c("firstName", "secondName"), each=4),
numbers = rep(1:4, 2),
value = rnorm(8)
)
df
# name numbers value
# 1 firstName 1 -0.56047565
# 2 firstName 2 -0.23017749
# 3 firstName 3 1.55870831
# 4 firstName 4 0.07050839
# 5 secondName 1 0.12928774
# 6 secondName 2 1.71506499
# 7 secondName 3 0.46091621
# 8 secondName 4 -1.26506123
spread(data = df,
key = numbers,
value = value)
# name 1 2 3 4
# 1 firstName -0.5604756 -0.2301775 1.5587083 0.07050839
# 2 secondName 0.1292877 1.7150650 0.4609162 -1.26506123
spread(data = df,
key = name,
value = value)
# numbers firstName secondName
# 1 1 -0.56047565 0.1292877
# 2 2 -0.23017749 1.7150650
# 3 3 1.55870831 0.4609162
# 4 4 0.07050839 -1.2650612
library(tidyr)
## example data
df <- read.table(text = " numbers firstName secondName
1 1 1.5862639 0.4087477
2 2 0.1499581 0.9963923
3 3 0.4117353 0.3740009
4 4 -0.4926862 0.4437916
", header = TRUE)
We can gather the columns together using 'numbers' as the key column:
gather(data = df,
       key = numbers,
       value = myValue,
       firstName, secondName)
# numbers numbers myValue
# 1 1 firstName 1.5862639
# 2 2 firstName 0.1499581
# 3 3 firstName 0.4117353
# 4 4 firstName -0.4926862
# 5 1 secondName 0.4087477
# 6 2 secondName 0.9963923
# 7 3 secondName 0.3740009
# 8 4 secondName 0.4437916
The following example shows how you can reorder a vector of names of the form "surname, forename" into a
vector of the form "forename surname".
library(randomNames)
set.seed(1)
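A sketch of the reordering (the randomNames() output format and the regex groups are assumptions):

names_vec <- randomNames(5)                      # e.g. "Surname, Forename"
sub("^(.+),\\s*(.+)$", "\\2 \\1", names_vec)     # swap the two captured groups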
If you only need the surname you could just address the first pair of parentheses.
2,14,14,14,19
2,14,19
gsub("(\\d+)(,\\1)+","\\1", "2,14,14,14,19")
[1] "2,14,19"
It works also for more than one different repetition, for example:
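For instance (a sketch with two different repeated numbers):

gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19,19,21")
# expected: "2,14,19,21"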
1. (\\d+): A group 1 delimited by () that finds any digit (at least one). Remember we need to use the double
backslash (\\) here because in a character variable a backslash is the escape character for literal string delimiters
(\" or \'). \\d is equivalent to [0-9].
2. ,: A punctuation sign: , (we can include spaces or any other delimiter)
3. \\1: An identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern
doesn't match.
one,two,two,three,four,four,five,six
Then, just replace \\d by \\w, where \\w matches any word character, including any letter, digit or underscore. It is
equivalent to [a-zA-Z0-9_]:
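For example, applied to the string above:

gsub("(\\w+)(,\\1)+", "\\1", "one,two,two,three,four,four,five,six")
# expected: "one,two,three,four,five,six"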
The above pattern then includes the duplicated-digits case as a particular case.
For instance, the summarise() function uses non-standard evaluation but relies on summarise_(), which uses
standard evaluation.
The lazyeval library makes it easy to turn standard evaluation function into NSE functions.
library(dplyr)
library(lazyeval)
Filtering
NSE version
filter(mtcars, cyl == 8)
filter(mtcars, cyl < 6)
filter(mtcars, cyl < 6 & vs == 1)
Summarise
NSE version
summarise(mtcars, mean(disp))
summarise(mtcars, mean_disp = mean(disp))
SE version
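A sketch of the standard-evaluation counterpart (using the character and formula interfaces of the underscore verbs):

summarise_(mtcars, "mean(disp)")
summarise_(mtcars, .dots = list(mean_disp = ~ mean(disp)))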
Mutate
NSE version
mutate(mtcars, displ_l = disp / 61.0237)
SE version
mutate_(
.data = mtcars,
.dots = list(
"displ_l" = lazyeval::interp(
~ x / 61.0237, x = quote(disp)
)
)
)
Note that throughout this example, set.seed is used to ensure that the example code is reproducible. However,
sample will work without explicitly calling set.seed.
Random permutation
In the simplest form, sample creates a random permutation of a vector of integers. This can be accomplished with:
set.seed(1251)
sample(x = 10)
[1] 7 1 4 8 6 3 10 5 2 9
When given no other arguments, sample returns a random permutation of the vector from 1 to x. This can be useful
when trying to randomize the order of the rows in a data frame. This is a common task when creating
randomization tables for trials, or when selecting a random subset of rows for analysis.
library(datasets)
set.seed(1171)
iris_rand <- iris[sample(x = 1:nrow(iris)),]
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> head(iris_rand)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7 2.5 virginica
5 5.0 3.6 1.4 0.2 setosa
85 5.4 3.0 4.5 1.5 versicolor
137 6.3 3.4 5.6 2.4 virginica
128 6.1 3.0 4.9 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
Using sample, we can also simulate drawing from a set with and without replacement. To sample without
replacement (the default), you must provide sample with a set to be drawn from and the number of draws. The set
to be drawn from is given as a vector.
set.seed(7043)
sample(x = LETTERS,size = 7)
Note that if the argument to size is the same as the length of the argument to x, you are creating a random
permutation. Also note that you cannot specify a size greater than the length of x when doing sampling without
replacement.
set.seed(7305)
sample(x = letters,size = 26)
[1] "x" "z" "y" "i" "k" "f" "d" "s" "g" "v" "j" "o" "e" "c" "m" "n" "h" "u" "a" "b" "l" "r" "w" "t"
"q" "p"
To make random draws from a set with replacement, you use the replace argument to sample. By default, replace
is FALSE. Setting it to TRUE means that each element of the set being drawn from may appear more than once in the
final result.
set.seed(5062)
sample(x = c("A","B","C","D"),size = 8,replace = TRUE)
By default, when you use sample, it assumes that the probability of picking each element is the same. Consider it as
a basic "urn" problem. The code below is equivalent to drawing a colored marble out of an urn 20 times, writing
down the color, and then putting the marble back in the urn. The urn contains one red, one blue, and one green
marble, meaning that the probability of drawing each color is 1/3.
set.seed(6472)
sample(x = c("Red","Blue","Green"),
size = 20,
replace = TRUE)
Suppose that, instead, we wanted to perform the same task, but our urn contains 2 red marbles, 1 blue marble, and
1 green marble. One option would be to change the argument we send to x to add an additional Red. However, a
better choice is to use the prob argument to sample.
The prob argument accepts a vector with the probability of drawing each element. In our example above, the
probability of drawing a red marble would be 1/2, while the probability of drawing a blue or a green marble would
be 1/4.
set.seed(28432)
sample(x = c("Red","Blue","Green"),
size = 20,
replace = TRUE,
prob = c(0.50,0.25,0.25))
Counter-intuitively, the argument given to prob does not need to sum to 1. R will always transform the given
arguments into probabilities that total to 1. For instance, consider our above example of 2 Red, 1 Blue, and 1 Green.
set.seed(28432)
frac_prob_example <- sample(x = c("Red","Blue","Green"),
size = 200,
replace = TRUE,
prob = c(0.50,0.25,0.25))
set.seed(28432)
numeric_prob_example <- sample(x = c("Red","Blue","Green"),
size = 200,
replace = TRUE,
prob = c(2,1,1))
> identical(frac_prob_example,numeric_prob_example)
[1] TRUE
The major restriction is that you cannot set all the probabilities to be zero, and none of them can be less than zero.
You can also utilize prob when replace is set to FALSE. In that situation, after each element is drawn, the
proportions of the prob values for the remaining elements give the probability for the next draw. In this situation,
you must have enough non-zero probabilities to reach the size of the sample you are drawing. For example:
set.seed(21741)
sample(x = c("Red","Blue","Green"),
size = 2,
replace = FALSE,
prob = c(0.8,0.19,0.01))
In this example, Red is drawn in the first draw (as the first element). There was an 80% chance of Red being drawn,
a 19% chance of Blue being drawn, and a 1% chance of Green being drawn.
For the next draw, Red is no longer in the urn. The total of the probabilities among the remaining items is 20% (19%
for Blue and 1% for Green). For that draw, there is a 95% chance the item will be Blue (19/20) and a 5% chance it will
be Green (1/20).
set.seed(1643)
samp1 <- sample(x = 1:5,size = 200,replace = TRUE)
set.seed(1643)
samp2 <- sample(x = 1:5,size = 200,replace = TRUE)
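Since the seed is reset to the same value before each call, the two draws are identical:

identical(samp1, samp2)
# [1] TRUE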
Note that parallel processing requires special treatment of the random seed, described more elsewhere.
The four systems are: S3, S4, Reference Classes, and R6.
Section 118.1: S3
The S3 object system is a very simple OO system in R.
Every object has an S3 class. It can be retrieved with the function class.
> class(3)
[1] "numeric"
When using a generic function, R uses the first element of the class that has an available generic.
For example:
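A sketch of that dispatch rule (the class names and method are made up for illustration):

obj <- structure(list(), class = c("first", "second"))
print.second <- function(x, ...) cat("method for class 'second'\n")
print(obj)   # no print.first exists, so R falls through to print.second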
x = 1:3
x
[1] 1 2 3
typeof(x)
#[1] "integer"
x[2] = "hi"
x
#[1] "1" "hi" "3"
typeof(x)
#[1] "character"
Notice that at first, x is of type integer. But when we assigned x[2] = "hi", all the elements of x were coerced into
character, as vectors in R can only hold data of a single type.
Standalone R scripts are not executed by the program R (R.exe under Windows), but by a program called Rscript
(Rscript.exe), which is included in your R installation by default.
To hint at this fact, standalone R scripts start with a special line called the Shebang line, which holds the following
content: #!/usr/bin/env Rscript. Under Windows, an additional measure is needed, which is detailed later.
The following simple standalone R script saves a histogram under the file name "hist.png" from numbers it receives
as input:
#!/usr/bin/env Rscript
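# A sketch of the rest of the script, reconstructed from the description below
# (the prompt wording is an assumption):
cat("Enter numbers separated by spaces, then press Enter:\n")
numbers <- scan(file("stdin"), what = numeric(), nlines = 1, quiet = TRUE)
png("hist.png")
hist(numbers)
dev.off()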
You can see several key elements of a standalone R script. In the first line, you see the Shebang line. Followed by
that, cat("....\n") is used to print a message to the user. Use file("stdin") whenever you want to specify "User
input on console" as a data origin. This can be used instead of a file name in several data reading functions (scan,
read.table, read.csv,...). After the user input is converted from strings to numbers, the plotting begins. There, it
can be seen, that plotting commands which are meant to be written to a file must be enclosed in two commands.
These are in this case png(.) and dev.off(). The first function depends on the desired output file format (other
common choices being jpeg(.) and pdf(.)). The second function, dev.off() is always required. It writes the plot
to the file and ends the plotting process.
The standalone script's file must first be made executable. This can happen by right-clicking the file, opening
"Properties" in the opening menu and checking the "Executable" checkbox in the "Permissions" tab. Alternatively,
chmod +x PATH/TO/SCRIPT/SCRIPTNAME.R
Windows
For each standalone script, a batch file must be written with the following contents:
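"C:\...\Rscript.exe" "%~dp0\XXX.R" %*

(This single line is reconstructed from the explanation further below; the two quoted paths are placeholders you
have to adapt.)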
A batch file is a normal text file, but with a *.bat extension instead of a *.txt extension. Create it using a text
editor like Notepad (not Word) or similar and put the file name into quotation marks (e.g. "FILENAME.bat") in the save
dialog. To edit an existing batch file, right-click on it and select "Edit".
You have to adapt the code shown above everywhere XXX... is written:
Explanation of the elements in the code: The first part "C:\...\Rscript.exe" tells Windows where to find the
Rscript.exe program. The second part "%~dp0\XXX.R" tells Rscript to execute the R script you've written which
resides in the same folder as the batch file (%~dp0 stands for the batch file folder). Finally, %* forwards any
command line arguments you give to the batch file to the R script.
If you double-click on the batch file, the R script is executed. If you drag files on the batch file, the corresponding file
names are given to the R script as command line arguments.
Installing littler
From R:
install.packages("littler")
ln -s /home/*USER*/R/x86_64-pc-linux-gnu-library/3.4/littler/bin/r /usr/local/bin/r
With r from littler it is possible to execute standalone R scripts without any changes to the script. Example script:
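A sketch of such a script (the original contents are not shown in this excerpt):

# hist.r
png("hist.png")
hist(rnorm(100))
dev.off()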
Note that no shebang is at the top of the scripts. When saved as for example hist.r, it is directly callable from the
system command:
r hist.r
It is also possible to create executable R scripts with littler, with the use of the shebang
#!/usr/bin/env r
at the top of the script. The corresponding R script has to be made executable with chmod +x /path/to/script.r
and is directly callable from the system terminal.
In particular I found the following two links useful (last checked in May 2017):
Link 1
Link 2
R Libraries
library("devtools")
library("twitteR")
library("ROAuth")
Supposing you have your keys, you have to run the following code:
api_key <- "XXXXXXXXXXXXXXXXXXXXXX"
api_secret <- "XXXXXXXXXXXXXXXXXXXXXX"
setup_twitter_oauth(api_key, api_secret)
Change XXXXXXXXXXXXXXXXXXXXXX to your keys (if you have Setup your tweeter account you know which keys I
mean).
Let's now suppose we want to download tweets on coffee. The following code will do it
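A sketch (the number of tweets is an assumption):

coffee_tweets <- searchTwitter("coffee", n = 1000)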
and you can check your tweets with the head function.
head(coffee_tweets)
A term frequency is a dictionary in which each token is assigned a weight. In the first example, we construct a
term frequency matrix from a corpus (a collection of documents) with the R package tm.
require(tm)
doc1 <- "drugs hospitals doctors"
doc2 <- "smog pollution environment"
doc3 <- "doctors hospitals healthcare"
doc4 <- "pollution environment water"
corpus <- c(doc1, doc2, doc3, doc4)
tm_corpus <- Corpus(VectorSource(corpus))
In this example, we created a corpus of class Corpus, defined by the package tm, with two functions Corpus and
VectorSource, which returns a VectorSource object from a character vector. The object tm_corpus is a list of our
documents with additional (and optional) metadata describing each document.
str(tm_corpus)
List of 4
$ 1:List of 2
..$ content: chr "drugs hospitals doctors"
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-03 00:31:34"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[truncated]
Once we have a Corpus, we can proceed to preprocess the tokens contained in the Corpus to improve the quality of
the final output (the term frequency matrix). To do this we use the tm function tm_map, which, similarly to the apply
family of functions, transforms the documents in the corpus by applying a function to each document.
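A sketch of typical preprocessing steps consistent with the stemmed output below:

tm_corpus <- tm_map(tm_corpus, content_transformer(tolower))
tm_corpus <- tm_map(tm_corpus, removePunctuation)
tm_corpus <- tm_map(tm_corpus, removeWords, stopwords("english"))
tm_corpus <- tm_map(tm_corpus, stemDocument)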
Following these transformations, we finally create the term frequency matrix with
tdm <- TermDocumentMatrix(tm_corpus)
which gives
as.matrix(tdm)
Docs
Terms character(0) character(0) character(0) character(0)
doctor 1 0 1 0
drug 1 0 0 0
environ 0 1 0 1
healthcar 0 0 1 0
hospit 1 0 1 0
pollut 0 1 0 1
smog 0 1 0 0
water 0 0 0 1
Each row represents the frequency of each token - which, as you noticed, have been stemmed (e.g. environment to
environ) - in each document (4 documents, 4 columns).
In the previous lines, we have weighted each pair token/document with the absolute frequency (i.e. the number of
instances of the token that appear in the document).
The output from all the lines in the chunk will appear beneath the chunk.
Since a chunk produces its output beneath the chunk, when having multiple lines of code in a single chunk that
produces multiples outputs it is often helpful to split into multiple chunks such that each chunk produces one
output.
To do this, select the code to you want to split into a new chunk and press Ctrl + Alt + I (OS X: Cmd + Option + I)
Running or Re-Running individual chunks by pressing Run for all the chunks present in a document can be painful.
We can use Run All from the Insert menu in the toolbar to Run all the chunks present in the notebook. Keyboard
shortcut is Ctrl + Alt + R (OS X: Cmd + Option + R)
There's also a Restart R and Run All Chunks command (available in the Run menu on the editor toolbar),
which gives you a fresh R session prior to running all the chunks.
We also have options like Run All Chunks Above and Run All Chunks Below to run chunks Above or Below from a
selected chunk.
You can change the type of output by setting the output option to "pdf_document" or "html_notebook".
aggregate(formula, data, FUN)
The following code shows various ways of using the aggregate function.
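A sketch of a few common calls (mtcars is used as assumed example data):

aggregate(mpg ~ cyl, data = mtcars, FUN = mean)                  # mean mpg per number of cylinders
aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)  # several responses, several groups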
> library(dplyr)
>
> df = data.frame(group=c("Group 1","Group 1","Group 2","Group 2","Group 2"), subgroup =
c("A","A","A","A","B"),value = c(2,2.5,1,2,1.5))
> print(df)
group subgroup value
1 Group 1 A 2.0
2 Group 1 A 2.5
3 Group 2 A 1.0
https://vincentarelbundock.github.io/Rdatasets/datasets.html
Example
Swiss Fertility and Socioeconomic Indicators (1888) Data. Let's check the difference in fertility based on rurality and
the dominance of the Catholic population.
library(tidyverse)
swiss %>%
ggplot(aes(x = Agriculture, y = Fertility,
color = Catholic > 50))+
geom_point()+
stat_ellipse()
Eurostat
Even though the eurostat package has a function search_eurostat(), it does not find all the relevant datasets
available. Thus, it's more convenient to browse the code of a dataset manually at the Eurostat website: Countries
Database, or Regional Database. If the automated download does not work, the data can be grabbed manually
via the Bulk Download Facility.
library(tidyverse)
library(lubridate)
library(forcats)
library(eurostat)
library(geofacet)
library(viridis)
library(ggthemes)
library(extrafont)
neet %>%
  filter(geo %>% paste %>% nchar == 2,
         sex == "T", age == "Y18-24") %>%
  group_by(geo) %>%
  mutate(avg = values %>% mean()) %>%
  ungroup() %>%
  ggplot(aes(x = time %>% year(),
             y = values)) +
  geom_path(aes(group = 1)) +
  geom_point(aes(fill = values), pch = 21) +
  scale_x_continuous(breaks = seq(2000, 2015, 5),
                     labels = c("2000", "'05", "'10", "'15")) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 40)) +
  scale_fill_viridis("NEET, %", option = "B") +
  facet_geo(~ geo, grid = "eu_grid1") +
  labs(x = "Year",
       y = "NEET, %",
       title = "Young people neither in employment nor in education and training in Europe",
       subtitle = "Data: Eurostat Regional Database, 2000-2016",
       caption = "ikashnitsky.github.io") +
  theme_few(base_family = "Roboto Condensed", base_size = 15) +
  theme(axis.text = element_text(size = 10),
        panel.spacing.x = unit(1, "lines"),
        legend.position = c(0, 0),
        legend.justification = c(0, 0))
Human Mortality Database is a project of the Max Planck Institute for Demographic Research that gathers and
pre-processes human mortality data for those countries where more or less reliable statistics are available.
Please note, the arguments user_hmd and pass_hmd are the login credentials at the website of the Human Mortality
Database. In order to access the data, one needs to create an account at http://www.mortality.org/ and provide
their own credentials to the readHMDweb() function.
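A single country's data can be fetched like this (a sketch only; readHMDweb() is provided by the HMDHFDplus
package, and the country code and item name are merely illustrative):

library(HMDHFDplus)

exposures_swe <- readHMDweb(CNTRY = "SWE", item = "Exposures_1x1",
                            username = user_hmd, password = pass_hmd)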
# collect the 2012 sex ratios by age for each country in `exposures`
sr_age <- list()  # initialise the list that the loop fills (needed for the loop to run)

for (i in 1:length(exposures)) {
  di <- exposures[[i]]
  sr_agei <- di %>% select(Year, Age, Female, Male) %>%
    filter(Year %in% 2012) %>%
    select(-Year) %>%
    transmute(country = names(exposures)[i],
              age = Age, sr_age = Male / Female * 100)
  sr_age[[i]] <- sr_agei
}
sr_age <- bind_rows(sr_age)
# finally - plot (`df_plot` is the data frame prepared above, e.g. derived from `sr_age`)
df_plot %>%
  ggplot(aes(age, sr_age, color = country, group = country)) +
  geom_hline(yintercept = 100, color = 'grey50', size = 1) +
  geom_line(size = 1) +
  scale_y_continuous(limits = c(0, 120), expand = c(0, 0), breaks = seq(0, 120, 20)) +
  scale_x_continuous(limits = c(0, 90), expand = c(0, 0), breaks = seq(0, 80, 20)) +
  xlab('Age') +
  ylab('Sex ratio, males per 100 females') +
  facet_wrap(~ country, ncol = 6) +
  theme_minimal(base_family = "Roboto Condensed", base_size = 15) +
  theme(legend.position = 'none',
        panel.border = element_rect(size = .5, fill = NA))
Gapminder
library(tidyverse)
library(gapminder)
gapminder %>%
  ggplot(aes(x = year, y = lifeExp,
             color = continent)) +
  geom_jitter(size = 1, alpha = .2, width = .75) +
  stat_summary(geom = "path", fun.y = mean, size = 1) +
  theme_minimal()
Let's see how the world has converged in male life expectancy at birth over 1950-2015.
library(tidyverse)
library(forcats)
library(wpp2015)
library(ggjoy)
library(viridis)
data(UNlocations)
data(e0M)
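# NOTE: the `countries` vector used below is not defined in this excerpt.
# A plausible definition (an assumption) keeps the country-level rows of
# UNlocations (location_type == 4):
countries <- UNlocations %>%
  filter(location_type == 4) %>%
  pull(name) %>%
  paste()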
e0M %>%
  filter(country %in% countries) %>%
  select(-last.observed) %>%
  gather(period, value, 3:15) %>%
  ggplot(aes(x = value, y = period %>% fct_rev())) +
  geom_joy(aes(fill = period)) +
  scale_fill_viridis(discrete = T, option = "B", direction = -1,
                     begin = .1, end = .9) +
  labs(x = "Male life expectancy at birth",
       y = "Period",
       title = "The world convergence in male life expectancy at birth since 1950",
       subtitle = "Data: UNPD World Population Prospects 2015 Revision",
       caption = "ikashnitsky.github.io") +
  theme_minimal(base_family = "Roboto Condensed", base_size = 15) +
  theme(legend.position = "none")
Result:
# Sum the elements of a vector
x <- c(1, 2, 3)   # example vector; `x` is not defined elsewhere in this excerpt
sum(x)

# Print a string
print("hello world")
a <- c("x", "y", "z")   # `a` is not defined in this excerpt; any length-3 vector works
d <- c(1, 0, 1)
only_1_3 <- a[d == 1]   # logical subsetting keeps the 1st and 3rd elements
Matrices
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)  # 2 x 3, so the three column names fit
dimnames(mat) <- list(c(), c("a", "b", "c"))
mat[, ] == mat   # an empty [row, column] subset returns the whole matrix
Dataframes
df <- data.frame(qualifiers = c("Buy", "Sell", "Sell"),
symbols = c("AAPL", "MSFT", "GOOGL"),
values = c(326.0, 598.3, 201.5))
df$symbols == df[[2]]            # a column selected by position
df$symbols == df[["symbols"]]    # a column selected by name
df[[1, 2]] == "AAPL"             # a single element: row 1, column 2
Lists
l <- list(a = 500, "aaa", 98.2)
length(l) == 3
class(l[1]) == "list"        # single brackets return a (sub)list
class(l[[1]]) == "numeric"   # double brackets return the element itself
class(l$a) == "numeric"      # $ also returns the element itself
Environments
env <- new.env()
env[["foo"]] <- "bar"
env2 <- env                        # environments are not copied on assignment
env2[["foo"]] <- "BAR"             # this assignment is implied by the checks below
env[["foo"]] == "BAR"              # TRUE: env and env2 refer to the same environment
get("foo", envir = env) == "BAR"
rm("foo", envir = env)
is.null(env[["foo"]])              # TRUE; `env[["foo"]] == NULL` would return logical(0)
Open the R Console (NOT RStudio; this doesn't work from RStudio) and run the following code to install the package
and initiate the update.
install.packages("installr")
library("installr")
updateR()
Now it asks if you want to copy your packages from the older version of R to the newer version of R. Once you
choose yes, all the packages are copied to the newer version of R.
You can even move your Rprofile.site from the older version to keep all your customised settings.
version