0% found this document useful (0 votes)
28 views2 pages

Data Import

The document provides an overview of how to import tabular data into R using the readr package. It demonstrates how to read in csv, tsv, and fixed width files and customize aspects of the import such as column selection, type specification, and handling missing values. The back page shows additional packages for importing other types of data like databases, JSON, and Excel files.

Uploaded by

Florence Cheang
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views2 pages

Data Import

The document provides an overview of how to import tabular data into R using the readr package. It demonstrates how to read in csv, tsv, and fixed width files and customize aspects of the import such as column selection, type specification, and handling missing values. The back page shows additional packages for importing other types of data like databases, JSON, and Excel files.

Uploaded by

Florence Cheang
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

Data import with the tidyverse : : CHEATSHEET

Read Tabular Data with readr


read_*(file, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale, n_max = Inf, One of the first steps of a project is to import OTHER TYPES OF DATA
skip = 0, na = c("", "NA"), guess_max = min(1000, n_max), show_col_types = TRUE) See ?read_delim outside data into R. Data is o en stored in Try one of the following
tabular formats, like csv files or spreadsheets. packages to import other types of files:
A|B|C
A B C read_delim("file.txt", delim = "|") Read files with any delimiter. If no The front page of this sheet shows • haven - SPSS, Stata, and SAS files
1 2 3 delimiter is specified, it will automatically guess. how to import and save text files into • DBI - databases
1|2|3 4 5 NA To make file.txt, run: write_file("A|B|C\n1|2|3\n4|5|NA", file = "file.txt")
4|5|NA R using readr. • jsonlite - json
The back page shows how to import • xml2 - XML
A B C read_csv("file.csv") Read a comma delimited file with period • httr - Web APIs
A,B,C spreadsheet data from Excel files
1 2 3 decimal marks. • rvest - HTML (Web Scraping)
1,2,3 4 5 NA write_file("A,B,C\n1,2,3\n4,5,NA", file = "file.csv") using readxl or Google Sheets using
4,5,NA googlesheets4. • readr::read_lines() - text data

read_csv2("file2.csv") Read semicolon delimited files with comma


Column Specification with readr
A B C
A;B;C
1.5 2 3 decimal marks.
1,5;2;3 4.5 5 NA write_file("A;B;C\n1,5;2;3\n4,5;5;NA", file = "file2.csv")
4,5;5;NA Column specifications define what data type each
USEFUL COLUMN ARGUMENTS
column of a file will be imported as. By default
A B C read_tsv("file.tsv") Read a tab delimited file. Also read_table(). readr will generate a column spec when a file is Hide col spec message
ABC read_*(file, show_col_types = FALSE)
1 2 3 read_fwf("file.tsv", fwf_widths(c(2, 2, NA))) Read a fixed width file. read and output a summary.
123 4 5 NA write_file("A\tB\tC\n1\t2\t3\n4\t5\tNA\n", file = "file.tsv")
4 5 NA spec(x) Extract the full column specification for Select columns to import
the given imported data frame. Use names, position, or selection helpers.
read_*(file, col_select = c(age, earn))
USEFUL READ ARGUMENTS spec(x)
# cols(
A B C No header 1 2 3 Skip lines # age = col_integer(), age is an
integer Guess column types
1 2 3 read_csv("file.csv", col_names = FALSE) 4 5 NA read_csv("file.csv", skip = 1) # edu = col_character(),
4 5 NA # earn = col_double() To guess a column type, read_ *() looks at the
A B C Read a subset of lines # ) first 1000 rows of data. Increase with guess_max.
x y z Provide header 1 2 3 read_csv("file.csv", n_max = 1) edu is a read_*(file, guess_max = Inf)
A B C read_csv("file.csv", earn is a double (numeric) character
1 2 3 col_names = c("x", "y", "z")) A B C Read values as missing
4 5 NA
NA 2 3 read_csv("file.csv", na = c("1")) COLUMN TYPES DEFINE COLUMN SPECIFICATION
4 5 NA
Read multiple files into a single table Each column type has a function and Set a default type
read_csv(c(“f1.csv”, “f2.csv”, “f3.csv"), Specify decimal marks corresponding string abbreviation. read_csv(
A;B;C
id = "origin_file") read_delim("file2.csv", locale = file,
1,5;2;3,0 locale(decimal_mark = ",")) • col_logical() - "l"
col_type = list(.default = col_double())
• col_integer() - "i" )
• col_double() - "d"
Save Data with readr • col_number() - "n"
Use column type or string abbreviation
read_csv(
• col_character() - "c" file,
write_*(x, file, na = "NA", append, col_names, quote, escape, eol, num_threads, progress) • col_factor(levels, ordered = FALSE) - "f" col_type = list(x = col_double(), y = "l", z = "_")
• col_datetime(format = "") - "T" )
A B C write_delim(x, file, delim = " ") Write files with any delimiter. • col_date(format = "") - "D" Use a single string of abbreviations
A,B,C • col_time(format = "") - "t"
1 2 3
write_csv(x, file) Write a comma delimited file. # col types: skip, guess, integer, logical, character
4 5 NA 1,2,3 • col_skip() - "-", "_" read_csv(
4,5,NA write_csv2(x, file) Write a semicolon delimited file. • col_guess() - "?" file,
col_type = "_?ilc"
write_tsv(x, file) Write a tab delimited file. )

CC BY SA Posit So ware, PBC • [email protected] • posit.co • readr.tidyverse.org • readxl.tidyverse.org • googlesheets4.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • readr 2.1.4 • readxl 1.4.3 • googlesheets4 1.1.1 • Updated: 2023-07
ft
ft
Import Spreadsheets
with readxl with googlesheets4
READ EXCEL FILES READ SHEETS
A B C D E A B C D E
1 x1 x2 x3 x4 x5 x1 x2 x3 x4 x5 1 x1 x2 x3 x4 x5 x1 x2 x3 x4 x5
2 x z 8 x NA z 8 NA 2 x z 8 x NA z 8 NA
3 y 7 9 10 y 7 NA 9 10 READXL COLUMN SPECIFICATION 3 y 7 9 10 y 7 NA 9 10 GOOGLESHEETS4 COLUMN SPECIFICATION
s1 s1
Column specifications define what data type Column specifications define what data type
each column of a file will be imported as. each column of a file will be imported as.
read_excel(path, sheet = NULL, range = NULL) read_sheet(ss, sheet = NULL, range = NULL)
Read a .xls or .xlsx file based on the file extension. Read a sheet from a URL, a Sheet ID, or a dribble
Use the col_types argument of read_excel() to Use the col_types argument of read_sheet()/
See front page for more read arguments. Also from the googledrive package. See front page for
set the column specification. range_read() to set the column specification.
read_xls() and read_xlsx(). more read arguments. Same as range_read().
read_excel("excel_file.xlsx")
Guess column types Guess column types
To guess a column type, read_ excel() looks at SHEETS METADATA To guess a column type read_sheet()/
READ SHEETS the first 1000 rows of data. Increase with the URLs are in the form: range_read() looks at the first 1000 rows of data.
guess_max argument. https://docs.google.com/spreadsheets/d/ Increase with guess_max.
A B C D E read_excel(path, sheet = read_excel(path, guess_max = Inf) read_sheet(path, guess_max = Inf)
NULL) Specify which sheet SPREADSHEET_ID/edit#gid=SHEET_ID
to read by position or name. Set all columns to same type, e.g. character gs4_get(ss) Get spreadsheet meta data. Set all columns to same type, e.g. character
read_excel(path, sheet = 1) read_excel(path, col_types = "text") read_sheet(path, col_types = "c")
s1 s2 s3
read_excel(path, sheet = "s1") gs4_find(...) Get data on all spreadsheet files.
Set each column individually sheet_properties(ss) Get a tibble of properties Set each column individually
read_excel( for each worksheet. Also sheet_names(). # col types: skip, guess, integer, logical, character
excel_sheets(path) Get a
vector of sheet names. path, read_sheets(ss, col_types = "_?ilc")
s1 s2 s3
col_types = c("text", "guess", "guess","numeric") WRITE SHEETS
excel_sheets("excel_file.xlsx")
) A B C write_sheet(data, ss =
1 x 4 1 1 x 4 NULL, sheet = NULL) COLUMN TYPES
A B C D E To read multiple sheets: 2 y 5 2 2 y 5
Write a data frame into a
COLUMN TYPES l n c D L
A B C D E 1. Get a vector of sheet 3 z 6 3 3 z 6
new or existing Sheet. TRUE 2 hello 1947-01-08 hello
s1
names from the file path. logical numeric text date list FALSE 3.45 world 1956-10-21 1
A B C D E gs4_create(name, ...,
2. Set the vector names to TRUE 2 hello 1947-01-08 hello
s1 s2 A B C D sheets = NULL) Create a
be the sheet names. FALSE 3.45 world 1956-10-21 1 • skip - "_" or "-" • date - "D"
1 new Sheet with a vector
s1 s2 3. Use purrr::map() and • guess - "?" • datetime - "T"
• skip • logical • date 2 of names, a data frame,
s1 s2 s3 purrr::list_rbind() to read • logical - "l" • character - "c"
• guess • numeric • list s1 or a (named) list of data
multiple files into one • integer - "i" • list-column - "L"
• text frames.
path <- "your_file_path.xlsx" data frame. • double - "d" • cell - "C" Returns
A B C
path |> 1 x1 x2 x3
sheet_append(ss, data, • numeric - "n" list of raw cell data.
x1 x2 x3
excel_sheets() |> Use list for columns that include multiple data 2 1 x 4 sheet = 1) Add rows to
2 y 5
set_names() |> types. See tidyr and purrr for list-column data. 3 z 6 3 2 y 5 the end of a worksheet. Use list for columns that include multiple data
map(read_excel, path = path) |> 4 3 z 6 types. See tidyr and purrr for list-column data.
s1
list_rbind()

OTHER USEFUL EXCEL PACKAGES CELL SPECIFICATION FOR READXL AND GOOGLESHEETS4 FILE LEVEL OPERATIONS
For functions to write data to Excel files, see: Use the range argument of readxl::read_excel() or googlesheets4 also o ers ways to modify other
• openxlsx googlesheets4::read_sheet() to read a subset of cells from a aspects of Sheets (e.g. freeze rows, set column
• writexl A B C D E sheet. width, manage (work)sheets). Go to
1 1 2 3 4 5 2 3 4 read_excel(path, range = "Sheet1!B1:D2") googlesheets4.tidyverse.org to read more.
For working with non-tabular Excel data, see: 2 x y z NA y z read_sheet(ss, range = "B1:D2")
• tidyxl 3 6 7 9 10 For whole-file operations (e.g. renaming, sharing,
s1 Also use the range argument with cell specification functions placing within a folder), see the tidyverse
cell_limits(), cell_rows(), cell_cols(), and anchored(). package googledrive at
googledrive.tidyverse.org.

CC BY SA Posit So ware, PBC • [email protected] • posit.co • readr.tidyverse.org • readxl.tidyverse.org • googlesheets4.tidyverse.org • HTML cheatsheets at pos.it/cheatsheets • readr 2.1.4 • readxl 1.4.3 • googlesheets4 1.1.1 • Updated: 2023-07
ft
ff

You might also like