02 Data Processing
Luiza Andrade, Rob Marty, Rony Rodriguez-Ramirez, Luis Eduardo San Martin, Leonardo Viotti
The World Bank – DIME | WB Github
April 2021
Table of contents
1. Introduction
2. ID variables
3. Create variables
4. Saving a dataframe
5. Factor variables
6. Reshaping
Introduction
Goals of this session
To understand what RStudio projects are and how they can be used in your daily work.
To organize data so that it is easier to analyze and to communicate.
Introduction
In this session, you'll be introduced to some basic concepts of data cleaning in R. We will cover:
1. RStudio projects;
2. Exploring a dataset;
3. Creating new variables;
4. Filtering and subsetting datasets;
5. Merging datasets;
6. Dealing with factor variables;
7. Saving data.
There are many other tasks that we usually perform as part of data cleaning that are beyond the scope of this
session.
Introduction: RStudio Projects
What are RStudio projects?
Projects should be reproducible, and if you code in R, RStudio Projects will make your work more reproducible and
easier to manage in the long run.
RStudio projects are simply working directories associated with a .Rproj file.
If you open a .Rproj file, your working directory will be set automatically. Therefore, in this directory you should
create folders containing your data, code, notes, and other material.
To create an RStudio project, go to File > New Project and click on New Directory:
Introduction: RStudio Projects
Click on New Project.
Add a name to your project and directory, and click on Create Project.
Introduction: RStudio Projects
RStudio projects give you a solid workflow:
They let you use relative paths.
You keep your data files there.
You will have a history of the functions you have used.
They can load the last state of your environment.
Introduction: Packages
Another important aspect to consider is R packages. Consider the following:
R is a new phone; R packages are the apps on your phone.
Introduction: Packages
To install a package you can run the following command:
# To Install
install.packages("tidyverse")
Unlike in Stata, R packages need to be loaded in each R session that will use them.
That means that, for example, a function that comes from the tidyverse cannot be used if the tidyverse package has not
been installed and loaded first.
Notice that we use double quotes for installing but not for loading a package.
Introduction
Before we start, let's make sure we are all set:
1. Start a fresh RStudio session.
2. Open your RStudio project.
Let's load the following three packages:
# If you haven't installed the packages, uncomment the next lines
# install.packages("tidyverse")
# install.packages("here")
# install.packages("janitor")
library(tidyverse)
library(here) # A package to work with relative file paths
library(janitor) # Additional data cleaning tools
Notes: Remember you should always load your packages before you start coding.
File paths
The here package lets you build file paths relative to your project. It looks for the closest R Project and uses its location as
the root of every path you build with here(). That's why it is important to set up your RStudio project correctly.
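A minimal sketch of how here() builds a path, assuming the folder layout used later in this session. The folders do not need to exist for the path to be built, and if no project is found here() falls back to the current working directory.

```r
library(here)

# here() joins its arguments onto the project root, using the correct
# path separator for the operating system
csv_path <- here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2015.csv")

# The result is an absolute path that ends with the pieces we supplied
print(csv_path)
```

Because the path is built from the project root, the same code should work on any computer where the project folder has the same internal structure.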
Loading a dataset in R
Before we start wrangling our data, let's read our dataset. In R, we can use the read.csv function from base R, or read_csv
from the readr package, if we want to load a CSV file. For this exercise, we are going to use the World Happiness Report
(2015-2018).
How to do it?
whr15 <- read_csv(here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2015.csv")) %>% clean_names()
whr16 <- read_csv(here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2016.csv")) %>% clean_names()
whr17 <- read_csv(here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2017.csv")) %>% clean_names()
Notice the clean_names() function and the pipe (%>%) operator. More on this in the next slide.
The pipe %>% operator
"Piping" in R can be seen as "chaining": we call several functions one after another.
Every time you call a function, it returns an object, and the pipe passes that object as the first argument of the next function.
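As a small illustration, the sketch below writes the same operation twice, once with nested calls and once with the pipe. It uses the built-in mtcars data instead of the session's data so that it runs on its own; the object names nested and piped are made up for this example.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(arrange(mtcars, mpg), 3)

# Piped calls: read from top to bottom, each result feeds the next function
piped <- mtcars %>%
  arrange(mpg) %>%
  head(3)

# Both versions produce the same object
identical(nested, piped)
```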
The clean_names() function
The clean_names() function helps a lot when our variable names are messy. For example, if we have a variable
that is called GDP_per_CApita_2015, the clean_names() function will help us fix those messy names.
Pro tip: Use the clean_names() function in a pipe after you load a dataset, as we did in the last slide. It will make
sure column names are well-suited for use in R.
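A minimal sketch of what clean_names() does, using a small made-up data frame (messy is not part of the session's data) whose column names mimic the raw World Happiness Report headers:

```r
library(janitor)

# check.names = FALSE keeps the spaces and parentheses in the column names
messy <- data.frame(
  "Happiness Score" = c(7.587, 7.561),
  "Economy (GDP per Capita)" = c(1.39651, 1.30232),
  check.names = FALSE
)

# clean_names() converts the names to lower-case snake_case
clean <- clean_names(messy)
names(clean)
```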
Exploring your data
Exploring a data set
Some useful functions from base R:
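The list of functions on this slide was not preserved, so the sketch below is only a plausible selection of base R functions commonly used for a first look at a data frame. The toy data frame is made up for illustration.

```r
# A small made-up data frame, used only for illustration
toy <- data.frame(
  country = c("A", "B", "C"),
  score   = c(7.5, 6.1, 4.9)
)

dim(toy)      # number of rows and columns
names(toy)    # column names
str(toy)      # type and first values of each column
summary(toy)  # summary statistics for each column
head(toy, 2)  # first two rows
```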
Load and show a dataset
We can just show our dataset using the name of the object; in this case, whr15 .
whr15
## # A tibble: 158 x 12
## country region happiness_rank happiness_score standard_error economy_gdp_per~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switze~ Weste~ 1 7.59 0.0341 1.40
## 2 Iceland Weste~ 2 7.56 0.0488 1.30
## 3 Denmark Weste~ 3 7.53 0.0333 1.33
## 4 Norway Weste~ 4 7.52 0.0388 1.46
## 5 Canada North~ 5 7.43 0.0355 1.33
## 6 Finland Weste~ 6 7.41 0.0314 1.29
## 7 Nether~ Weste~ 7 7.38 0.0280 1.33
## 8 Sweden Weste~ 8 7.36 0.0316 1.33
## 9 New Ze~ Austr~ 9 7.29 0.0337 1.25
## 10 Austra~ Austr~ 10 7.28 0.0408 1.33
## # ... with 148 more rows, and 6 more variables: family <dbl>,
## # health_life_expectancy <dbl>, freedom <dbl>,
## # trust_government_corruption <dbl>, generosity <dbl>,
## # dystopia_residual <dbl>
Glimpse your data
Use glimpse() to get information about your variables (e.g., type, number of rows, number of columns).
whr15 %>%
glimpse()
## Rows: 158
## Columns: 12
## $ country <chr> "Switzerland", "Iceland", "Denmark", "Norw~
## $ region <chr> "Western Europe", "Western Europe", "Weste~
## $ happiness_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
## $ happiness_score <dbl> 7.587, 7.561, 7.527, 7.522, 7.427, 7.406, ~
## $ standard_error <dbl> 0.03411, 0.04884, 0.03328, 0.03880, 0.0355~
## $ economy_gdp_per_capita <dbl> 1.39651, 1.30232, 1.32548, 1.45900, 1.3262~
## $ family <dbl> 1.34951, 1.40223, 1.36058, 1.33095, 1.3226~
## $ health_life_expectancy <dbl> 0.94143, 0.94784, 0.87464, 0.88521, 0.9056~
## $ freedom <dbl> 0.66557, 0.62877, 0.64938, 0.66973, 0.6329~
## $ trust_government_corruption <dbl> 0.41978, 0.14145, 0.48357, 0.36503, 0.3295~
## $ generosity <dbl> 0.29678, 0.43630, 0.34139, 0.34699, 0.4581~
## $ dystopia_residual <dbl> 2.51738, 2.70201, 2.49204, 2.46531, 2.4517~
ID variables
Desired properties of an ID variable: uniquely and fully identifying.
ID variables
Let's look at the dimensions of the dataset first:
dim(whr15)
## [1] 158 12
The $ sign is a subsetting operator. In R, we have three subsetting operators ([[, [, and $). The $ operator is often used to
access variables in a dataframe.
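A short sketch of the three subsetting operators on a made-up data frame (toy and the by_* names are only for illustration):

```r
toy <- data.frame(
  country = c("A", "B", "C"),
  score   = c(7.5, 6.1, 4.9)
)

by_dollar <- toy$score       # a numeric vector
by_double <- toy[["score"]]  # the same numeric vector
by_single <- toy["score"]    # a data frame with one column

class(by_dollar)
class(by_single)
```

The $ and [[ operators extract the column itself, while [ keeps the data frame structure.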
Missing values in R
Quick Note:
Missings in R are treated differently than in Stata. They are represented by the NA symbol.
Impossible values are represented by the symbol NaN, which means 'not a number.'
R uses the same symbol for character and numeric data.
NA is not a string or a numeric value, but an indicator of missingness.
NAs are contagious. This means that if you compare a number with NAs you will get NAs.
Therefore, always remember the na.rm = TRUE argument if needed.
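The points above can be checked with a few lines of base R. The vector x is made up for illustration.

```r
x <- c(1, NA, 3)

# Comparing a number with NA gives NA
compare_na <- NA > 1

# Functions such as mean() also return NA when any value is missing
mean_default <- mean(x)

# na.rm = TRUE drops the missing values before computing
mean_no_na <- mean(x, na.rm = TRUE)

# is.na() flags which values are missing
is.na(x)
```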
ID variables
We can use n_distinct() to count the number of unique values of a variable (or of any vector).
We include na.rm = TRUE so that missing values are not counted. Let's check two candidate ID variables:
1. Region
2. Country
How to do it?
n_distinct(whr15$country, na.rm = TRUE)
## [1] 158
n_distinct(whr15$region, na.rm = TRUE)
## [1] 10
ID variables
We can also test whether the number of rows is equal to the number of distinct values in a specific variable as follows:
nrow(whr15)
## [1] 158
We can use the two functions ( nrow and n_distinct ) together to test if their result is the same.
## [1] TRUE
## [1] TRUE
## [1] TRUE
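The code for this comparison was not preserved on the slide. A minimal sketch of the idea, using a made-up data frame rather than the session's data, could look like this:

```r
library(dplyr)

# Made-up data: country is unique, region is not
toy <- data.frame(
  country = c("A", "B", "C"),
  region  = c("X", "X", "Y")
)

# A variable uniquely identifies rows if it has as many distinct values as rows
country_is_unique <- nrow(toy) == n_distinct(toy$country, na.rm = TRUE)
region_is_unique  <- nrow(toy) == n_distinct(toy$region, na.rm = TRUE)

# It fully identifies rows only if it also has no missing values
country_is_complete <- !anyNA(toy$country)
```

Here country passes both checks, while region fails the uniqueness check.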
Wrangling your data
dplyr::filter
The filter function is used to subset rows in a dataset.
whr15 %>%
  filter(region == "Western Europe")
## # A tibble: 21 x 12
## country region happiness_rank happiness_score standard_error economy_gdp_per~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switze~ Weste~ 1 7.59 0.0341 1.40
## 2 Iceland Weste~ 2 7.56 0.0488 1.30
## 3 Denmark Weste~ 3 7.53 0.0333 1.33
## 4 Norway Weste~ 4 7.52 0.0388 1.46
## 5 Finland Weste~ 6 7.41 0.0314 1.29
## 6 Nether~ Weste~ 7 7.38 0.0280 1.33
## 7 Sweden Weste~ 8 7.36 0.0316 1.33
## 8 Austria Weste~ 13 7.2 0.0375 1.34
## 9 Luxemb~ Weste~ 17 6.95 0.0350 1.56
## 10 Ireland Weste~ 18 6.94 0.0368 1.34
## # ... with 11 more rows, and 6 more variables: family <dbl>,
## # health_life_expectancy <dbl>, freedom <dbl>,
## # trust_government_corruption <dbl>, generosity <dbl>,
## # dystopia_residual <dbl>
dplyr::filter
Exercise 3: Filter the dataset.
Use filter() to extract only rows in one of these regions: (1) Eastern Asia and (2) North America.
whr15 %>%
  filter(region == "Eastern Asia" | region == "North America")
A more elegant approach would be to use the %in% operator (equivalent to inlist() in Stata):
whr15 %>%
  filter(region %in% c("Eastern Asia", "North America"))
dplyr::filter missing cases
In case you want to remove (or identify) the missing cases for a specific variable, you can use is.na().
This function returns TRUE or FALSE for each value in a vector.
If the value is NA, is.na() returns TRUE; otherwise, it returns FALSE.
In this way, we can flag NA values and use the result in other functions.
We can also negate the function using !is.na(), which indicates that we want to return those observations with no
missing values in a specific variable.
DATA %>%
  filter(is.na(VAR))
This returns the observations that have missing values for the variable VAR.
dplyr::filter missing cases
Let's try filtering the whr15 data. Let's keep those observations that have region information, i.e., no missing values.
whr15 %>%
  filter(!is.na(region)) %>%
head(5)
## # A tibble: 5 x 12
## country region happiness_rank happiness_score standard_error economy_gdp_per~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switzer~ Weste~ 1 7.59 0.0341 1.40
## 2 Iceland Weste~ 2 7.56 0.0488 1.30
## 3 Denmark Weste~ 3 7.53 0.0333 1.33
## 4 Norway Weste~ 4 7.52 0.0388 1.46
## 5 Canada North~ 5 7.43 0.0355 1.33
## # ... with 6 more variables: family <dbl>, health_life_expectancy <dbl>,
## # freedom <dbl>, trust_government_corruption <dbl>, generosity <dbl>,
## # dystopia_residual <dbl>
Creating new variables
In the tidyverse, we refer to creating variables as mutating, so we use the mutate() function.
Let's say we want to create an interaction term:
whr15 %>%
mutate(
hap_hle = happiness_score * health_life_expectancy,
) %>%
select(country:happiness_score, health_life_expectancy, hap_hle) %>%
head(5)
## # A tibble: 5 x 6
## country region happiness_rank happiness_score health_life_expectancy hap_hle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switzerland Western Europe 1 7.59 0.941 7.14
## 2 Iceland Western Europe 2 7.56 0.948 7.17
## 3 Denmark Western Europe 3 7.53 0.875 6.58
## 4 Norway Western Europe 4 7.52 0.885 6.66
## 5 Canada North America 5 7.43 0.906 6.73
Creating new variables: Dummy variables
whr15 %>%
mutate(happiness_score_6 = (happiness_score > 6))
whr15 %>%
mutate(happiness_score_6 = as.numeric((happiness_score > 6)))
whr15 %>%
mutate(happiness_high_mean = as.numeric((happiness_score > mean(happiness_score))))
Some notes: mutate() vs transmute()
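The comparison on this slide was not preserved. As a hedged sketch of the usual distinction: mutate() keeps all existing columns and adds the new ones, while transmute() keeps only the columns you create. The toy data frame is made up for illustration.

```r
library(dplyr)

toy <- data.frame(
  country = c("A", "B"),
  happiness_score = c(7.5, 5.2)
)

# mutate() returns the original columns plus the new one
with_mutate <- toy %>%
  mutate(high = happiness_score > 6)

# transmute() returns only the new column
with_transmute <- toy %>%
  transmute(high = happiness_score > 6)

names(with_mutate)
names(with_transmute)
```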
Creating variables by groups
Let's imagine now that we want to create a variable at the region level (recall bys gen in Stata). In R, we can use group_by()
before we mutate. So for this example, we are going to pipe the following functions:
whr15 %>%
group_by(region) %>%
mutate(
mean_hap = mean(happiness_score)
) %>%
select(country, region, happiness_score, mean_hap)
36 / 60
Creating multiple variables at the same time
We can create multiple variables in an easy way. So, let's imagine that we want to estimate the mean value for the variables
happiness_score, health_life_expectancy, and trust_government_corruption.
We can use the function across(). It behaves this way: across(VARS that you want to transform, FUNCTION to
execute).
across() should always be used inside summarise() or mutate().
# Names of the variables we want to summarize
vars <- c("happiness_score", "health_life_expectancy", "trust_government_corruption")
whr15 %>%
  group_by(region) %>%
  summarize(
    across(all_of(vars), mean)
  )
## # A tibble: 3 x 4
## region happiness_score health_life_expecta~ trust_government_corr~
## <chr> <dbl> <dbl> <dbl>
## 1 Australia and New~ 7.28 0.920 0.393
## 2 Central and Easte~ 5.33 0.719 0.0867
## 3 Eastern Asia 5.63 0.877 0.128
Creating variables
Exercise 5: Create a variable called year that equals the year of each dataframe.
Use mutate()
Remember to assign it to the same dataframe.
How to do it?
whr15 <- whr15 %>%
mutate(
year = 2015
)
Appending and merging data sets
Now that we can identify the observations, we can combine the data sets. Here are two functions to append objects by row:
rbind(df1, df2, df3) # The base R function
bind_rows(df1, df2, df3) # The tidyverse function, making some improvements to base R
How to do it?
bind_rows(whr15, whr16, whr17)
Appending and merging data sets
Exercise 7: Fixing our variables and appending the dfs correctly.
Exercise 7a:
Appending and merging data sets
We can use the left_join() function to merge two dataframes. The function syntax is: left_join(a_df, another_df, by =
c("id_col1")).
A left join takes all the values from the first table, and looks for matches in the second table. If it finds a match, it
adds the data from the second table; if not, it adds missing values. It is the equivalent of merge, keep(master
matched) in Stata.
Exercise 7b:
Now, we join the regions dataframe with the whr17 dataframe.
Notes: Look at the everything() function. It takes all the remaining variables from the dataframe and puts them after country
and region. In this way, select can be used to order columns!
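A minimal sketch of a left join followed by select() with everything(), using two made-up data frames (scores and regions here are illustrative stand-ins, not the session's objects):

```r
library(dplyr)

# The "master" data: three countries
scores <- data.frame(
  country = c("A", "B", "C"),
  happiness_score = c(7.5, 6.1, 4.9)
)

# The lookup data: region is known for only two of them
regions <- data.frame(
  country = c("A", "B"),
  region  = c("North", "South")
)

joined <- scores %>%
  left_join(regions, by = "country") %>%   # keep every row of scores
  select(country, region, everything())    # country and region first

joined
```

Country C has no match in the lookup table, so its region is NA, and every row of the first table is kept.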
Appending and merging data sets
Exercise 7c:
Check if there is any other country without region info:
whr17 %>%
  filter(is.na(region))
## # A tibble: 2 x 14
## country region happiness_rank happiness_score whisker_high whisker_low
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Taiwan Provinc~ <NA> 33 6.42 6.49 6.35
## 2 Hong Kong S.A.~ <NA> 71 5.47 5.55 5.39
## # ... with 8 more variables: economy_gdp_per_capita <dbl>, family <dbl>,
## # health_life_expectancy <dbl>, freedom <dbl>, generosity <dbl>,
## # trust_government_corruption <dbl>, dystopia_residual <dbl>, year <dbl>
So we ended up with two countries with NAs.
This is due to the names of the countries. The regions dataset has neither "Taiwan Province of China" nor "Hong Kong S.A.R.,
China", but "Taiwan" and "Hong Kong."
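One possible way to deal with this, not necessarily the one used in the session, is to harmonize the country names before joining. The sketch below uses made-up stand-ins for the two data frames, and the region values in the lookup table are illustrative.

```r
library(dplyr)

# Made-up stand-in for the 2017 data
scores <- data.frame(
  country = c("Taiwan Province of China", "Hong Kong S.A.R., China", "Norway")
)

# Made-up stand-in for the regions lookup table
regions <- data.frame(
  country = c("Taiwan", "Hong Kong", "Norway"),
  region  = c("Eastern Asia", "Eastern Asia", "Western Europe")
)

fixed <- scores %>%
  mutate(
    # Rename the two problematic countries, leave the others untouched
    country = case_when(
      country == "Taiwan Province of China" ~ "Taiwan",
      country == "Hong Kong S.A.R., China"  ~ "Hong Kong",
      TRUE                                  ~ country
    )
  ) %>%
  left_join(regions, by = "country")

fixed
```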
Appending and merging data sets
Finally, let's keep those relevant variables first and bind those rows.
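The code for this step was not preserved. A hedged sketch of the idea, with made-up yearly data frames and a hypothetical keep_vars vector, is to select the shared variables from each data frame and then append them:

```r
library(dplyr)

# Made-up yearly data with one column that differs between years
whr15_toy <- data.frame(country = "A", region = "X", year = 2015,
                        happiness_score = 7.5, standard_error = 0.03)
whr16_toy <- data.frame(country = "A", region = "X", year = 2016,
                        happiness_score = 7.4, lower_confidence_interval = 7.3)

# Hypothetical list of variables to keep in the panel
keep_vars <- c("country", "region", "year", "happiness_score")

panel_toy <- bind_rows(
  select(whr15_toy, all_of(keep_vars)),
  select(whr16_toy, all_of(keep_vars))
)

panel_toy
```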
Saving a dataset
The data set you have now is the same data set we've been using for earlier sessions, so we can save it now.
As mentioned before, R data sets are often saved as CSV files.
To save a dataset we can use the write_csv function from the tidyverse, or write.csv from base R.
Saving a dataset
Exercise 9: Save the dataset.
Use write_csv()
Use here()
write_csv(whr_panel,
here("DataWork", "DataSets", "Final", "whr_panel.csv"))
The problem with CSVs is that they cannot differentiate between strings and factors.
They also don't save factor orders.
Data attributes (which are beyond the scope of this training, but also useful to document data sets) are also lost in CSV
data.
Saving a dataset
The R equivalent of a .dta file is a .rds file. It can be saved and loaded using the following commands:
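The commands themselves were not preserved on the slide. A minimal sketch, assuming the readr functions write_rds() and read_rds() (base R offers saveRDS() and readRDS() for the same purpose), and using a temporary file and a made-up data frame instead of the session's panel:

```r
library(readr)

# Made-up data with a factor whose level order matters
toy <- data.frame(
  level = factor(c("low", "high"), levels = c("low", "high"))
)

# Save to a temporary .rds file and load it back
rds_path <- tempfile(fileext = ".rds")
write_rds(toy, rds_path)
toy_back <- read_rds(rds_path)

# Unlike a CSV round trip, the factor and its level order survive
levels(toy_back$level)
```

In the session's folder structure, the path would typically be built with here() instead of tempfile().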
Thank you~~
Appendix
Other relevant functions: slice, subset, select
Arrange
whr15 %>%
arrange(region, country) %>%
head(5)
## # A tibble: 5 x 8
## country region year happiness_rank happiness_score economy_gdp_per_c~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia Australia a~ 2015 10 7.28 1.33
## 2 New Zeal~ Australia a~ 2015 9 7.29 1.25
## 3 Albania Central and~ 2015 95 4.96 0.879
## 4 Armenia Central and~ 2015 127 4.35 0.768
## 5 Azerbaij~ Central and~ 2015 80 5.21 1.02
## # ... with 2 more variables: health_life_expectancy <dbl>, freedom <dbl>
Other relevant functions: slice, subset, select
Slice
whr15 %>%
  slice(1:5) # to select the first 5 rows
## # A tibble: 5 x 8
## country region year happiness_rank happiness_score economy_gdp_per_ca~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switzerla~ Western E~ 2015 1 7.59 1.40
## 2 Iceland Western E~ 2015 2 7.56 1.30
## 3 Denmark Western E~ 2015 3 7.53 1.33
## 4 Norway Western E~ 2015 4 7.52 1.46
## 5 Canada North Ame~ 2015 5 7.43 1.33
## # ... with 2 more variables: health_life_expectancy <dbl>, freedom <dbl>
You can also use slice_head and slice_tail to select the first or last rows, respectively, or slice_sample to randomly draw n
rows.
Other relevant functions: slice, subset, select
Select
whr15 %>%
select(region, country, happiness_rank)
## # A tibble: 158 x 3
## region country happiness_rank
## <chr> <chr> <dbl>
## 1 Western Europe Switzerland 1
## 2 Western Europe Iceland 2
## 3 Western Europe Denmark 3
## 4 Western Europe Norway 4
## 5 North America Canada 5
## 6 Western Europe Finland 6
## 7 Western Europe Netherlands 7
## 8 Western Europe Sweden 8
## 9 Australia and New Zealand New Zealand 9
## 10 Australia and New Zealand Australia 10
Other relevant functions: slice, subset, select
Combining functions
whr15 %>%
  arrange(region, country) %>% # Sort by region and country
  filter(!is.na(region)) %>% # Keep the observations with non-missing region, if any are missing
  select(country, region, starts_with("happin")) %>% # Select country, region, and variables that start with happin
  slice_head() # Get the first row
## # A tibble: 1 x 4
## country region happiness_rank happiness_score
## <chr> <chr> <dbl> <dbl>
## 1 Australia Australia and New Zealand 10 7.28
Using ifelse when creating a variable
We can also create a dummy variable with the ifelse() function. Its syntax is ifelse(test, yes, no).
We can also use another function called case_when() .
whr15 %>%
mutate(
latin_america_car = ifelse(region == "Latin America and Caribbean", 1, 0)
) %>%
arrange(-latin_america_car) %>%
head(5)
## # A tibble: 5 x 9
## country region year happiness_rank happiness_score economy_gdp_per_c~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Costa R~ Latin Americ~ 2015 12 7.23 0.956
## 2 Mexico Latin Americ~ 2015 14 7.19 1.02
## 3 Brazil Latin Americ~ 2015 16 6.98 0.981
## 4 Venezue~ Latin Americ~ 2015 23 6.81 1.04
## 5 Panama Latin Americ~ 2015 25 6.79 1.06
## # ... with 3 more variables: health_life_expectancy <dbl>, freedom <dbl>,
## # latin_america_car <dbl>
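The slide mentions case_when() without showing it. A minimal sketch, using a made-up data frame and arbitrary cut-offs chosen only for illustration, shows how it handles more than two categories:

```r
library(dplyr)

toy <- data.frame(happiness_score = c(7.5, 5.8, 4.1, NA))

toy <- toy %>%
  mutate(
    # Conditions are evaluated in order; the first TRUE one wins
    level = case_when(
      happiness_score >= 7 ~ "high",
      happiness_score >= 5 ~ "medium",
      happiness_score < 5  ~ "low"
    )
  )

toy
```

Rows that match no condition, such as the missing score here, get NA.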
Factor variables
When we imported this data set, we told R explicitly not to read strings as factors.
We did that because we knew that we'd have to fix the country names.
The region variable, however, should be a factor.
str(whr_panel$region)
Factor variables
To create a factor variable, we use the factor() function (or as_factor() from the forcats package).
factor(x, levels, labels, ordered): turns numeric or string vector x into a factor vector.
levels: a vector containing the possible values of x.
labels: a vector of strings containing the labels you want to apply to your factor variable.
ordered: logical flag to determine if the levels should be regarded as ordered (in the order given).
If your categorical variable does not need to be ordered, and your string variable already has the labels you want, making the
conversion is quite easy.
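A short sketch of the arguments described above, using a made-up character vector:

```r
sizes <- c("medium", "low", "high", "low")

sizes_fct <- factor(
  sizes,
  levels  = c("low", "medium", "high"),  # possible values, in the order we want
  labels  = c("Low", "Medium", "High"),  # labels applied to those values
  ordered = TRUE                         # treat the levels as ordered
)

levels(sizes_fct)

# With an ordered factor, comparisons follow the level order
is_greater <- sizes_fct[1] > sizes_fct[2]  # "Medium" compared with "Low"
```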
Factor variables
Exercise 10: Turn a string variable into a factor.
Use the mutate function to create a variable called region_cat containing a categorical version of the region variable.
TIP: to do this, you only need the first argument of the factor function.
How to do it?
whr_panel <- mutate(whr_panel, region_cat = factor(region))
class(whr_panel$region_cat)
## [1] "factor"
Reshaping a dataset
Finally, let's try to reshape our dataset using the tidyverse functions. No more reshape from Stata. We can use pivot_wider or
pivot_longer. Let's first reshape our panel to wide format. To reuse the result, assign it to an object called whr_panel_wide
(without the final head(5)).
whr_panel %>%
select(country, region, year, happiness_score) %>%
pivot_wider(
names_from = year,
values_from = happiness_score
) %>%
head(5)
## # A tibble: 5 x 5
## country region `2015` `2016` `2017`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Switzerland Western Europe 7.59 7.51 7.49
## 2 Iceland Western Europe 7.56 7.50 7.50
## 3 Denmark Western Europe 7.53 7.53 7.52
## 4 Norway Western Europe 7.52 7.50 7.54
Reshaping a dataset
To go back to long format, we use pivot_longer on the wide format panel, whr_panel_wide:
whr_panel_wide %>%
pivot_longer(
cols = `2015`:`2017`,
names_to = "year",
values_to = "happiness_score"
) %>%
head(5)
## # A tibble: 5 x 4
## country region year happiness_score
## <chr> <chr> <chr> <dbl>
## 1 Switzerland Western Europe 2015 7.59
## 2 Switzerland Western Europe 2016 7.51
## 3 Switzerland Western Europe 2017 7.49
## 4 Iceland Western Europe 2015 7.56