0% found this document useful (0 votes)
3 views

ProgrammingForDS16_Rdatamanipulation

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views

ProgrammingForDS16_Rdatamanipulation

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

Data Manipulation (R)

Liana Harutyunyan
Programming for Data Science
April 25, 2024
American University of Armenia
liana.harutyunyan@aua.am

1
Data Manipulation

• dplyr is a package that helps you to solve most


common data manipulation tasks.
• Again for installing and importing the package, you
need:

install.packages("dplyr")

library(dplyr)

2
Data Manipulation

These are the most common functions from the library:

• filter — filters rows;


• select - filters columns;
• arrange — sorts a dataframe;
• mutate — creates new columns;
• group by — groups data by a specific key;
• summarize - performs aggregate functions

3
Data Manipulation

• Not necessary, but dplyr works the best with pipe like
operator from magrittr package.
• % > % operator takes the object from its left hand side
and uses it as an argument in the function on the right
hand side.
• Understand how to use it by replacing the pipe
operation with ‘then’ (in your mind, not in the code)
• example: filter(...) % > % select(...) - FILTER,
THEN on the filtered SELECT

4
Mutate

mutate - adds new columns or modifies current variables


in the dataset.
Adding new columns filled with the same value:

diamonds %>%
mutate(JustOne = 1,
Values = "something",
Simple = TRUE)

5
Mutate

Can use existing columns:

diamonds %>%
mutate(price_discounted = price * 0.9)

Can modify the existing ones:

diamonds %>%
mutate(price = price * 0.9,
mean_price = mean(price))

6
ifelse

ifelse returns a value with the same shape as test which is


filled with elements selected from either yes or no
depending on whether the element of test is TRUE or FALSE.

ifelse(test, yes, no)

Example:

vector <- c(1:10)


vector <- ifelse(vector > 5, "high", "low")

7
ifelse with mutate

Will change / create a new column depending on another


column’s values.

practice %>%
mutate(Health = ifelse(Subject == 1,
"sick",
"healthy"))

8
Filter

Only retain specific rows of data that meet the specified


requirement(s).

diamonds %>%
filter(cut == "Fair")

Will return only those rows that have ”cut” equal to ”Fair”.
Equivalent to:

diamonds[diamonds$cut == "Fair", ]

9
Filter

To have multiple conditions,


• for OR, you can use ”|”
• for AND, you can use ”,” (comma).

diamonds %>%
filter(cut == "Fair" | cut == "Good",
price <= 600)

Same as

diamonds %>%
filter(cut %in% c("Fair", "Good"),
price <= 600)

10
Select

• Select only the columns that you want to see. Gets rid of
all other columns.
• Can use columns positions or by name.
• The order in which you list the column names/positions
is the order that the columns will be displayed.
diamonds %>%
select(cut, color)
Same as:
diamonds %>%
select(1:5)

11
Select with negative sign

To exclude columns, you can use minus sign, both with


position numbers or column names.
Examples:

diamonds %>% select(-cut)

diamonds %>% select(-cut, -color)

diamonds %>% select(-c(cut, color))

diamonds %>% select(-c(2, 3))

12
group by and summarize

Groups variables different categories together for future


operations.

data %>%
group_by(Country) %>%
summarize(m = mean(Score),
s = sd(Score),
n = n()) # calculate the count

Equivalent to Python’s:

data.groupby("Country")["Score"].agg(["mean", "std", "count

Can also groupby with multiple columns, using


group by(Country, City)
13
arrange

Allows you arrange values within a variable in ascending or


descending order.
This can apply to both numerical and non-numerical
(alphabetical order) values.

diamonds %>%
arrange(cut)

diamonds %>%
arrange(desc(price))

14
Data Manipulation - examples set 1

• Exercise: Filter the data to have a diamond subset,


which is the original diamond data’s only Ideal cuts.

15
Data Manipulation - examples set 1

• Exercise: Filter the data to have a diamond subset,


which is the original diamond data’s only Ideal cuts.
• Exercise: Add a new column *volume* that will be
multiplication result of columns x, y and z.
• Exercise: During last class, we calculated diamond price
means for each type of cut, using for loop. Do the same
with dplyr package.

15
Data Manipulation - examples set 1

• Exercise: Filter the data to have a diamond subset,


which is the original diamond data’s only Ideal cuts.
• Exercise: Add a new column *volume* that will be
multiplication result of columns x, y and z.
• Exercise: During last class, we calculated diamond price
means for each type of cut, using for loop. Do the same
with dplyr package.
• Exercise: Use the previous point to plot a barplot on
obtained table.
• Exercise: Calculate how many of each cut there is in the
dataset.

15
Data Manipulation - examples set 2

• Read summer.csv dataset in R.


• Filter the dataset to have only USA data. Now count the
type of medals USA received.
• Count how many medals received all the countries and
sort by the most medals to least. Take top 10.
• Take the data of countries ”USA”, ”FRA” and ”GBR”.
Group by by 2 columns Country and Medal and take
count for each.
• Plot the data from last point in a graph. What is the best
graph for this.

16
Data Manipulation - examples set 3

• Load *airquality* built-in dataset from R.


• Filter to have only those rows that do not complete NA
values and plot a scaterplot of *wind* and
*temperature*.
• Change the temperature from Fahrenheit to Celsius.
• Create a new column, where if temperature is bigger
than the mean temperature, write ”high”, otherwise
write ”low”.
• Calculate mean temperature for each month.
• Plot graph of multiple line plots, where each line is the
month and y is the temperature value.

17
Summary

Reading
https://bookdown.org/yih huynh/GuidetoRBook/basicdata-
management.html

Questions?

18

You might also like