
Learn R - Learn R - Data Cleaning Cheatsheet - Codecademy

The document is a cheatsheet for data cleaning in R, detailing various functions such as gsub(), distinct(), str(), and as.numeric() for manipulating and cleaning data. It also covers combining data from multiple files, creating tidy datasets, and using dplyr and tidyr packages for effective data management. Key functions like separate() and gather() are highlighted for reshaping data and managing string values.

Uploaded by

snarficus
Copyright
© All Rights Reserved

23-01-2025, 11:24 Learn R: Learn R: Data Cleaning Cheatsheet | Codecademy


Learn R: Data Cleaning

gsub() R Function

The base R gsub() function searches for a regular expression in a string and replaces every match. The function receives the pattern to replace, a replacement value, and the object that contains the text to search. We can use it to replace substrings within a single string or in each string in a vector.
When combined with dplyr's mutate() function, a column of a data frame can be cleaned to enable analysis.

# Replace the element "1" with the empty string in the teams vector in order to
# get the teams_clean vector with the correct names.
teams <- c("Fal1cons", "Cardinals", "Seah1awks", "Vikings", "Bro1nco", "Patrio1ts")

teams_clean <- gsub("1", "", teams)

print(teams_clean)

# Output:
# "Falcons" "Cardinals" "Seahawks" "Vikings" "Bronco" "Patriots"

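The combination of gsub() with mutate() mentioned above could look like the following minimal sketch. The products data frame and its column names are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical data frame: prices stored as text with a leading "$"
products <- data.frame(
  item = c("pen", "notebook", "stapler"),
  price = c("$1.50", "$3.25", "$7.00"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "$" as a literal character instead of a regex anchor
products_clean <- products %>%
  mutate(price = gsub("$", "", price, fixed = TRUE))

print(products_clean$price)
```

The cleaned column is still of type character; converting it to numbers is a separate step.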
distinct() dplyr

The distinct() function from the dplyr package keeps only the unique rows of a data frame. If there are duplicate rows, the function preserves only the first one. It can remove fully duplicated rows, or remove rows based on the unique values of one column or the unique combinations of values in several columns.

# Keep unique rows in the match_statistics data frame
distinct(match_statistics)

# Keep only rows with different values in the prices column of the trips data frame
# (only the prices column is returned unless .keep_all = TRUE is set)
distinct(trips, prices)
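To make the difference between these calls concrete, here is a small sketch on made-up data. The contents of trips are an assumption for illustration; .keep_all is a documented argument of distinct().

```r
library(dplyr)

# Hypothetical trips data with one fully duplicated row
trips <- data.frame(
  destination = c("Lima", "Lima", "Oslo", "Cairo"),
  prices = c(300, 300, 450, 300),
  stringsAsFactors = FALSE
)

# Drops the second, fully duplicated Lima row
unique_rows <- distinct(trips)

# Returns only the prices column, one row per distinct price
unique_prices <- distinct(trips, prices)

# Keeps every column, using the first row found for each distinct price
first_per_price <- distinct(trips, prices, .keep_all = TRUE)
```

With this data, unique_rows has three rows, while unique_prices and first_per_price each have two.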

https://www.codecademy.com/learn/learn-r/modules/learn-r-data-cleaning/cheatsheet 1/4

str() Function

The str() function displays the internal structure of the R object that is passed to it. The output shows the data structure of the object as well as its elements. When the object is a data frame, the output lists the data type of each column, the number of observations, and the number of variables.
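As a quick illustration, the sketch below calls str() on a small made-up data frame. The columns of match_statistics are assumptions, since the cheatsheet does not show its contents.

```r
# Hypothetical data frame with one character and one integer column
match_statistics <- data.frame(
  team = c("Falcons", "Vikings"),
  points = c(21L, 17L),
  stringsAsFactors = FALSE
)

# Prints the number of observations and variables, then one line per column
# with its type (chr for team, int for points) and its first values
str(match_statistics)

# str() prints its report rather than returning it; capture.output() can
# collect the printed lines if they are needed as text
structure_lines <- capture.output(str(match_statistics))
```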

Combining Data with R

Data from multiple files can be combined into one data frame using the base R functions list.files() and lapply(), with readr's read_csv() and dplyr's bind_rows() functions. Consider the following steps:
1. Get the list of files. The following code will get a list of all files in the current directory that match the pattern "file_.*csv"

files <- list.files(pattern = "fi

2. Read in the files. The following code applies read_csv(), a function from readr, to each file, and adds the resulting data frames to the list df_list.

df_list <- lapply(files, read_csv)

3. Combine the file data. Below, bind_rows(), a dplyr function, is used to combine the data from each data frame in the list into one data frame.

df <- bind_rows(df_list)
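The three steps can be tried end to end with the sketch below. It writes two small hypothetical CSV files to a temporary directory so that it runs anywhere; the file contents, the tmp directory, and the use of full.names and show_col_types are choices made for this illustration, not part of the original cheatsheet.

```r
library(readr)
library(dplyr)

# Create two small hypothetical CSV files in a temporary directory
tmp <- tempfile("csv_demo_")
dir.create(tmp)
write_csv(data.frame(id = 1:2, score = c(90, 85)), file.path(tmp, "file_1.csv"))
write_csv(data.frame(id = 3:4, score = c(70, 95)), file.path(tmp, "file_2.csv"))

# 1. List the matching files; full.names = TRUE returns paths read_csv() can open
files <- list.files(path = tmp, pattern = "file_.*csv", full.names = TRUE)

# 2. Read each file into a list of data frames (column-type messages silenced)
df_list <- lapply(files, read_csv, show_col_types = FALSE)

# 3. Stack the data frames into a single data frame
df <- bind_rows(df_list)
print(df)
```

bind_rows() matches columns by name, so files with the same columns in a different order still stack correctly.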


R as.numeric() Function

The base R as.numeric() function can coerce character string objects into numeric types.
This function is useful because numbers are often stored as characters, which do not allow arithmetic operations or analysis. The function receives the object to be transformed as a parameter and converts it to numeric.
When this function is combined with the mutate() function from dplyr, new columns of a data frame can be created with the numeric data type.
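A minimal sketch of that combination follows. The students data frame is hypothetical, and the "unknown" entry is included to show that text which cannot be parsed as a number becomes NA.

```r
library(dplyr)

# Hypothetical data frame where ages were read in as text
students <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age = c("21", "19", "unknown"),
  stringsAsFactors = FALSE
)

# as.numeric() warns when it produces NA; suppressWarnings() hides that
# warning here because the NA is expected
students <- students %>%
  mutate(age_num = suppressWarnings(as.numeric(age)))

# Arithmetic now works on the new column; na.rm = TRUE skips the NA entry
mean_age <- mean(students$age_num, na.rm = TRUE)
print(mean_age)
```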

str_sub() function

The str_sub() function from the stringr package extracts part of a string by index position, which makes it possible to separate combined data values into their individual components. The function uses the start= and end= arguments to select the characters to extract. This function can be used with mutate() from dplyr in order to generate multiple new columns on a data frame based on split string values of a particular column.

# This command extracts the first through the fifth character of the string.
str_sub('Marya1984', start=1, end=5)
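Used with mutate(), str_sub() can split one combined column into two, as in the sketch below. The users data frame and its column names are assumptions; negative positions, which stringr counts from the end of the string, are used for the year.

```r
library(dplyr)
library(stringr)

# Hypothetical column that combines a five-letter name with a birth year
users <- data.frame(
  name_year = c("Marya1984", "Jonas1990"),
  stringsAsFactors = FALSE
)

users <- users %>%
  mutate(
    name = str_sub(name_year, start = 1, end = 5),   # characters 1 to 5
    birth_year = str_sub(name_year, start = -4)      # last four characters
  )

print(users)
```

Fixed positions only work when every value has the same layout; names of varying length would need a different approach.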

Tidy Dataset

In a tidy dataset, each variable is represented by a column and each row is a separate observation. Tidy datasets make data analysis more straightforward: when data adheres to the tidy standard, it is easier for an analyst to extract information from it. Datasets that are not tidy have structural issues, such as one column storing multiple variables, the values of one variable spread across multiple columns, or variables stored in both rows and columns.


The dplyr and tidyr packages

The dplyr and tidyr packages provide functions that solve common data cleaning challenges in R.
Data cleaning and preparation should be performed on a "messy" dataset before any analysis can occur. This process can include:
- diagnosing the "tidiness" of the data
- reshaping the data
- combining multiple files of data
- changing the data types of values
- manipulating strings to better represent the data

separate() Function

The separate() function from the tidyr package is used to separate a single character column of a data frame into multiple columns. The arguments of this function are, in order, a data frame, the column used to create the new columns (column name or column position in the data frame), the new column names that will be used, and the separator argument. The default separator will match any non-alphanumeric sequence, such as a space or semicolon.

# This function would separate the complete_name column into new columns
# called names and surnames on the individuals data frame.
separate(individuals, complete_name, c("names", "surnames"))
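The call above can be run on made-up data as in the following sketch. The contents of individuals are hypothetical, and sep = " " is passed explicitly here to show the separator argument rather than relying on the default.

```r
library(tidyr)

# Hypothetical data frame with first and last names in one column
individuals <- data.frame(
  complete_name = c("Ada Lovelace", "Alan Turing"),
  stringsAsFactors = FALSE
)

# Split complete_name at the space into two new columns;
# the original column is dropped by default
individuals_split <- separate(
  individuals, complete_name, c("names", "surnames"), sep = " "
)

print(individuals_split)
```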

gather() tidyr

The gather() function from the tidyr package gathers columns of a data frame into key-value pairs, changing the shape of the data frame from wide to long. The original data frame has multiple columns that can be gathered into a single key-value structure, with all values in one column and the original column names in another column.
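The wide-to-long reshaping can be seen in the sketch below. The scores_wide data frame and the key and value names are assumptions for illustration. Note that current tidyr documentation marks gather() as superseded by pivot_longer(), although gather() remains available.

```r
library(tidyr)

# Hypothetical wide data: one column per subject
scores_wide <- data.frame(
  student = c("Ana", "Ben"),
  math = c(90, 75),
  art = c(80, 95),
  stringsAsFactors = FALSE
)

# Gather the math and art columns: their names go into "subject",
# their values go into "score"; student is repeated for each pair
scores_long <- gather(scores_wide, key = "subject", value = "score", math, art)

print(scores_long)
```

The result has one row per student and subject, which matches the tidy layout of one observation per row.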
