

Session 2: Data Processing

R for Stata Users

Luiza Andrade, Rob Marty, Rony Rodriguez-Ramirez, Luis Eduardo San Martin, Leonardo Viotti
The World Bank – DIME | WB Github
April 2021
Table of contents
1. Introduction
2. Exploring your data
3. ID variables
4. Wrangling your data
5. Create variables
6. Appending and merging
7. Saving a dataframe
8. Factor variables
9. Reshaping

Introduction

Goals of this session
To understand what RStudio projects are and how they can be used in your daily work.

To organize data so that it is easier to analyze and communicate.

To get familiar with the packages bundled into the tidyverse.

Things to keep in mind

We'll take you through the same steps we've taken when we were preparing the datasets used in this course.

In most cases, your datasets won't be tidy.

Tidy data: A dataset is said to be tidy if it satisfies the following conditions:

1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.

Introduction
In this session, you'll be introduced to some basic concepts of data cleaning in R. We will cover:

1. RStudio projects;
2. Exploring a dataset;
3. Creating new variables;
4. Filtering and subsetting datasets;
5. Merging datasets;
6. Dealing with factor variables;
7. Saving data.

There are many other tasks that we usually perform as part of data cleaning that are beyond the scope of this session.

Introduction: RStudio Projects
What are RStudio projects?
Your work should be reproducible. If you code in R, RStudio projects will make your science reproducible and easier to manage in the long run.
RStudio projects are simply working directories associated with a .Rproj file.
If you open a .Rproj file, your working directory will be set automatically. Therefore, in this directory you should create folders containing your data, code, notes, and other material.
To create an RStudio project, go to File > New Project and click on New Directory:

Introduction: RStudio Projects

Click on New Project.
Add a name to your project and directory, and click on Create Project.

Introduction: RStudio Projects
RStudio projects give you a solid workflow:
They use relative paths.
You can keep your data files there.
You will have a history of the functions you have used.
RStudio can reload the last state of your environment.

Introduction: Packages
Another important aspect to consider is R packages. Consider the following:
R is a new phone; R packages are the apps on your phone.

Introduction: Packages
To install a package you can run the following command:
# To Install
install.packages("tidyverse")

Unlike in Stata, R packages need to be loaded in each R session that will use them.
That means that, for example, a function that comes from the tidyverse cannot be used if the tidyverse package has not been installed and loaded first.

To load a package you can run the following command:

# To load
library(tidyverse)

Notice that we use double quotes for installing but not for loading a package.

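A common pattern that follows from the install/load distinction above is to install a package only when it is missing and then load it. The sketch below is only an illustration: the helper name load_or_install is hypothetical and not part of the course materials, and it is called on stats (which ships with R) so that nothing is downloaded.

```r
# Hypothetical helper (not from the course): install a package only if needed, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # the package name is passed as a string
  }
  library(pkg, character.only = TRUE)  # character.only lets library() accept a string
}

load_or_install("stats")               # stats ships with R, so no download happens
loaded <- "package:stats" %in% search()
print(loaded)
```

Replacing "stats" with "tidyverse" would trigger an installation the first time and a plain load afterwards.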
Introduction
Before we start, let's make sure we are all set:
1. Start a fresh RStudio session.
2. Open your RStudio project.

Notes for this session

Most of our exercises for today focus on the tidyverse.
The setup is a bit different than yesterday.
Since we are "exploring" our data, we sometimes don't need to assign everything to an object.

Let's load the following three packages:
# If you haven't installed the packages uncomment the next lines
# install.packages("tidyverse")
# install.packages("here")
# install.packages("janitor")
library(tidyverse)
library(here) # A package to work with relative file paths
library(janitor) # Additional data cleaning tools

Notes: Remember you should always load your packages before you start coding.

File paths
The here package allows you to interact with your working directory. It looks for the closest RStudio project and uses its location as the root for building file paths. That's why it is important to set up your RStudio project correctly.

The goal of this package is to:

Easily reference your files in project-oriented workflows.

Loading a dataset in R
Before we start wrangling our data, let's read our dataset. In R, we can use the read.csv function from base R, or read_csv from the readr package, if we want to load a CSV file. For this exercise, we are going to use the World Happiness Report (2015-2018).

Exercise 1: Load Data. This is a recap from yesterday's session.

Use either of the functions mentioned above and load the three WHR datasets from the DataWork/DataSets/Raw/Un WHR folder.
Use the following notation for each dataset: whrYY.

Remember to use here() to simplify the folder path.

How to do it?
whr15 <- read_csv(here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2015.csv")) %>% clean_names()
whr16 <- read_csv(here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2016.csv")) %>% clean_names()
whr17 <- read_csv(here("DataWork", "DataSets", "Raw", "Un WHR", "WHR2017.csv")) %>% clean_names()

Notice the clean_names() function and the pipe (%>%) operator. More on this in the next slide.

The pipe %>% operator
"Piping" in R can be seen as "chaining": we call several functions in sequence.
Every function you call returns an object, and the pipe passes that object on as the first argument of the next function.

With pipes:

rony %>%
  wake_up(time = "6:30") %>%
  get_out_of_bed() %>%
  do_exercise() %>%
  shower() %>%
  get_dressed() %>%
  eat(meal = "breakfast", coffee = TRUE) %>%
  brush_teeth() %>%
  work(effort = "minimum")

Without pipes (nested calls):

work(
  brush_teeth(
    eat(
      get_dressed(
        shower(
          do_exercise(
            get_out_of_bed(
              wake_up(rony, time = "6:30")
            )
          )
        )
      ), meal = "breakfast", coffee = TRUE
    )
  ), effort = "minimum"
)

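The morning-routine functions above are only an analogy and do not exist in R. A minimal runnable sketch of the same idea, using real functions on a made-up vector, could look like this:

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

x <- c(1, 4, 9)  # toy values chosen for illustration

# Nested calls: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped calls: read from left to right, each result feeds the next function
piped <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

print(identical(nested, piped))
```

Both versions perform the same three steps; the pipe only changes the order in which we read them.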
The clean_names() function
The clean_names() function is a big help when our variable names are messy. For example, if we have a variable that is called GDP_per_CApita_2015, the clean_names() function will help us fix those messy names.

Pro tip: Use the clean_names() function in a pipe after you load a dataset, as we did in the last slide. It will make sure column names are well-suited for use in R.

If we want to rename our variable manually, we could use:

whr15 <- whr15 %>%
  rename(
    var_newname = var_oldname
  )

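To see clean_names() and rename() side by side, here is a small sketch on a made-up data frame. The column names and values are invented for illustration and are not the WHR data.

```r
library(janitor)
library(dplyr)

# Toy data with awkward column names (check.names = FALSE keeps them as typed)
raw <- data.frame(
  `Happiness Score` = c(7.5, 6.9),
  `Country` = c("A", "B"),
  check.names = FALSE
)

# clean_names() converts the names to lowercase snake_case
clean <- raw %>% clean_names()
print(names(clean))

# rename() changes a single name by hand: new_name = old_name
renamed <- clean %>% rename(hap_score = happiness_score)
print(names(renamed))
```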
Exploring your data

Exploring a data set
Some useful functions from base R:

View(): opens the data set in a viewer.

class(): reports the object type or the type of data stored.
dim(): reports the size of each of an object's dimensions.
names(): returns the variable names of a dataset.
str(): general information on an R object.
summary(): summary information about the variables in a data frame.
head(): shows the first few observations in the dataset.
tail(): shows the last few observations in the dataset.

Some other useful functions from the tidyverse:

glimpse(): get a glimpse of your data.

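As a quick illustration of the functions listed above, the sketch below applies them to a tiny made-up tibble (not the WHR data), so it can be run without any files.

```r
library(dplyr)

# Toy dataset invented for this example
toy <- tibble(
  country = c("A", "B", "C"),
  score   = c(7.1, 6.4, 5.2)
)

print(class(toy))   # object type
print(dim(toy))     # number of rows and columns
print(names(toy))   # variable names
str(toy)            # structure of the object
print(summary(toy)) # summary statistics per variable
print(head(toy, 2)) # first two observations
glimpse(toy)        # tidyverse overview of the variables
```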
Load and show a dataset
We can just show our dataset using the name of the object; in this case, whr15 .

whr15

## # A tibble: 158 x 12
## country region happiness_rank happiness_score standard_error economy_gdp_per~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switze~ Weste~ 1 7.59 0.0341 1.40
## 2 Iceland Weste~ 2 7.56 0.0488 1.30
## 3 Denmark Weste~ 3 7.53 0.0333 1.33
## 4 Norway Weste~ 4 7.52 0.0388 1.46
## 5 Canada North~ 5 7.43 0.0355 1.33
## 6 Finland Weste~ 6 7.41 0.0314 1.29
## 7 Nether~ Weste~ 7 7.38 0.0280 1.33
## 8 Sweden Weste~ 8 7.36 0.0316 1.33
## 9 New Ze~ Austr~ 9 7.29 0.0337 1.25
## 10 Austra~ Austr~ 10 7.28 0.0408 1.33
## # ... with 148 more rows, and 6 more variables: family <dbl>,
## # health_life_expectancy <dbl>, freedom <dbl>,
## # trust_government_corruption <dbl>, generosity <dbl>,
## # dystopia_residual <dbl>
Glimpse your data
Use glimpse() to get information about your variables (e.g., type, rows, columns).

whr15 %>%
glimpse()

## Rows: 158
## Columns: 12
## $ country <chr> "Switzerland", "Iceland", "Denmark", "Norw~
## $ region <chr> "Western Europe", "Western Europe", "Weste~
## $ happiness_rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
## $ happiness_score <dbl> 7.587, 7.561, 7.527, 7.522, 7.427, 7.406, ~
## $ standard_error <dbl> 0.03411, 0.04884, 0.03328, 0.03880, 0.0355~
## $ economy_gdp_per_capita <dbl> 1.39651, 1.30232, 1.32548, 1.45900, 1.3262~
## $ family <dbl> 1.34951, 1.40223, 1.36058, 1.33095, 1.3226~
## $ health_life_expectancy <dbl> 0.94143, 0.94784, 0.87464, 0.88521, 0.9056~
## $ freedom <dbl> 0.66557, 0.62877, 0.64938, 0.66973, 0.6329~
## $ trust_government_corruption <dbl> 0.41978, 0.14145, 0.48357, 0.36503, 0.3295~
## $ generosity <dbl> 0.29678, 0.43630, 0.34139, 0.34699, 0.4581~
## $ dystopia_residual <dbl> 2.51738, 2.70201, 2.49204, 2.46531, 2.4517~

ID variables

Desired properties of an ID variable: uniquely and fully identifying.

An ID variable cannot have duplicates

An ID variable may never be missing
The ID variable must be constant across a project
The ID variable must be anonymous

ID variables
Let's see first:

Dimensions of your data:

dim(whr15)

## [1] 158 12

The number of distinct values of a particular variable:

n_distinct(DATASET$variable, na.rm = TRUE)

The $ sign is a subsetting operator. In R, we have three subsetting operators ([[, [, and $). The $ operator is often used to access variables in a dataframe.

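The three subsetting operators mentioned above behave slightly differently. A minimal sketch on a made-up data frame (names and values are invented):

```r
toy <- data.frame(country = c("A", "B"), score = c(7, 5))

by_dollar <- toy$score       # a vector
by_double <- toy[["score"]]  # the same vector, column name given as a string
by_single <- toy["score"]    # still a data frame, with one column

print(by_dollar)
print(class(by_single))
```

In short, $ and [[ pull a column out, while [ keeps the data frame structure.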
Missing values in R
Quick Note:
Missings in R are treated differently than in Stata. They are represented by the NA symbol.
Impossible values are represented by the symbol NaN, which means 'not a number.'
R uses the same symbol for character and numeric data.
NA is not a string or a numeric value, but an indicator of missingness.
NAs are contagious. This means that if you compare a number with NAs you will get NAs.
Therefore, always remember the na.rm = TRUE argument if needed.

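The points in the note above can be checked directly in the console. This short base R sketch uses a made-up vector:

```r
x <- c(1, NA, 3)  # toy vector with one missing value

compare_na   <- NA > 1                 # comparisons with NA return NA
mean_with_na <- mean(x)                # NA is contagious in summaries too
mean_no_na   <- mean(x, na.rm = TRUE)  # drop missings before computing
not_a_number <- 0 / 0                  # an impossible value gives NaN
n_missing    <- sum(is.na(x))          # count missing values

print(compare_na)
print(mean_with_na)
print(mean_no_na)
print(not_a_number)
print(n_missing)
```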
ID variables
In the last example, we used n_distinct. This allows us to count the number of unique values of a variable (that is, the number of distinct elements of a vector).
We included na.rm = TRUE, so we don't count missing values.

Exercise 2: Identify the ID.

Using the n_distinct function, can you tell if any of the following variables is an ID variable of the whr15 data set?

1. Region
2. Country

How to do it?
n_distinct(whr15$country, na.rm = TRUE)

## [1] 158

n_distinct(whr15$region, na.rm = TRUE)

## [1] 10
ID variables
We can also test whether the number of rows is equal to the number of distinct values in a specific variable as follows:

nrow(whr15)

## [1] 158

We can use the two functions ( nrow and n_distinct ) together to test if their result is the same.

n_distinct(whr15$country, na.rm = TRUE) == nrow(whr15)

## [1] TRUE

n_distinct(whr16$country, na.rm = TRUE) == nrow(whr16)

## [1] TRUE

n_distinct(whr17$country, na.rm = TRUE) == nrow(whr17)

## [1] TRUE

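The checks above cover uniqueness. Since an ID variable should also never be missing, both properties can be wrapped into one small function. This is only a sketch: the name is_id is hypothetical and the vectors are made up.

```r
# Hypothetical helper: TRUE if x has no missing values and no duplicates
is_id <- function(x) {
  !anyNA(x) && anyDuplicated(x) == 0
}

good_id    <- is_id(c("Norway", "Iceland", "Denmark"))  # unique and complete
duplicated <- is_id(c("Norway", "Norway", "Denmark"))   # repeated value
missing    <- is_id(c("Norway", NA, "Denmark"))         # missing value

print(c(good_id, duplicated, missing))
```

With the course data, the call would be is_id(whr15$country).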
Wrangling your data

dplyr::filter
The filter function is used to subset rows in a dataset.

whr15 %>% filter(region == "Western Europe")

## # A tibble: 21 x 12
## country region happiness_rank happiness_score standard_error economy_gdp_per~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switze~ Weste~ 1 7.59 0.0341 1.40
## 2 Iceland Weste~ 2 7.56 0.0488 1.30
## 3 Denmark Weste~ 3 7.53 0.0333 1.33
## 4 Norway Weste~ 4 7.52 0.0388 1.46
## 5 Finland Weste~ 6 7.41 0.0314 1.29
## 6 Nether~ Weste~ 7 7.38 0.0280 1.33
## 7 Sweden Weste~ 8 7.36 0.0316 1.33
## 8 Austria Weste~ 13 7.2 0.0375 1.34
## 9 Luxemb~ Weste~ 17 6.95 0.0350 1.56
## 10 Ireland Weste~ 18 6.94 0.0368 1.34
## # ... with 11 more rows, and 6 more variables: family <dbl>,
## # health_life_expectancy <dbl>, freedom <dbl>,
## # trust_government_corruption <dbl>, generosity <dbl>,
## # dystopia_residual <dbl>
dplyr::filter
Exercise 3: Filter the dataset.
Use filter() to extract only rows in one of these regions: (1) Eastern Asia and (2) North America.

The or operator ( | ) is one way to do it:

whr15 %>%
  filter(region == "Eastern Asia" | region == "North America")

A more elegant approach would be to use the %in% operator (equivalent to inlist() in Stata):

whr15 %>%
  filter(region %in% c("Eastern Asia", "North America"))

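To confirm that the two approaches above select the same rows, here is a small sketch on a made-up tibble (the countries are placeholders, and one region is deliberately missing):

```r
library(dplyr)

toy <- tibble(
  country = c("A", "B", "C", "D"),
  region  = c("Eastern Asia", "North America", "Western Europe", NA)
)

# Option 1: the or operator
with_or <- toy %>%
  filter(region == "Eastern Asia" | region == "North America")

# Option 2: the %in% operator
with_in <- toy %>%
  filter(region %in% c("Eastern Asia", "North America"))

print(with_in)
```

In both cases the row with a missing region is dropped, because filter() keeps only rows where the condition is TRUE.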
dplyr::filter missing cases
In case you want to remove (or identify) the missing cases for a specific variable, you can use is.na().

This function returns TRUE or FALSE for each value in a data set.
If the value is NA, is.na() returns TRUE; otherwise, it returns FALSE.
In this way, we can flag NA values so that other functions can use them.
We can also negate the function using !is.na(), which indicates that we want to return those observations with no missing values in a specific variable.

The function syntax in a pipeline is as follows:

DATA %>%
  filter(is.na(VAR))

What are we returning here?

The observations that have missing values for the variable VAR.

dplyr::filter missing cases
Let's try filtering the whr15 data. Let's keep those observations that have information per region, i.e., no missing values.

whr15 %>%
  filter(!is.na(region)) %>%
  head(5)

## # A tibble: 5 x 12
## country region happiness_rank happiness_score standard_error economy_gdp_per~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switzer~ Weste~ 1 7.59 0.0341 1.40
## 2 Iceland Weste~ 2 7.56 0.0488 1.30
## 3 Denmark Weste~ 3 7.53 0.0333 1.33
## 4 Norway Weste~ 4 7.52 0.0388 1.46
## 5 Canada North~ 5 7.43 0.0355 1.33
## # ... with 6 more variables: family <dbl>, health_life_expectancy <dbl>,
## # freedom <dbl>, trust_government_corruption <dbl>, generosity <dbl>,
## # dystopia_residual <dbl>

Notice that we are negating the function, i.e., !is.na().

In case we want to keep the observations that contain missing information, we would use is.na() alone.
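Because whr15 has no missing regions, the effect of is.na() is easier to see on a made-up tibble where one region is missing. A minimal sketch:

```r
library(dplyr)

toy <- tibble(
  country = c("A", "B", "C"),
  region  = c("West", NA, "East")
)

# Keep only the rows where region is missing
missing_region <- toy %>% filter(is.na(region))

# Keep only the rows where region is not missing
with_region <- toy %>% filter(!is.na(region))

print(missing_region)
print(with_region)
```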
Creating new variables

In the tidyverse, we refer to creating variables as mutating, so we use the mutate() function. Let's say we want to have interactions:

whr15 %>%
mutate(
hap_hle = happiness_score * health_life_expectancy,
) %>%
select(country:happiness_score, health_life_expectancy, hap_hle) %>%
head(5)

## # A tibble: 5 x 6
## country region happiness_rank happiness_score health_life_expectancy hap_hle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switzerland Western Europe 1 7.59 0.941 7.14
## 2 Iceland Western Europe 2 7.56 0.948 7.17
## 3 Denmark Western Europe 3 7.53 0.875 6.58
## 4 Norway Western Europe 4 7.52 0.885 6.66
## 5 Canada North America 5 7.43 0.906 6.73

Creating new variables: Dummy variables
whr15 %>%
mutate(happiness_score_6 = (happiness_score > 6))

What do you think is happening to this variable?

The variable we created contains either TRUE or FALSE .

If we wanted to have it as a numeric (1 or 0, respectively), we could include as.numeric() . However, the point of
having logical variables is to treat them as numbers when relevant (for example as dummy variables in a
regression) and as categories when relevant (for example in graphs)

whr15 %>%
mutate(happiness_score_6 = as.numeric((happiness_score > 6)))

Finally, instead of using an arbitrary number such as 6, we can do the following:

whr15 %>%
mutate(happiness_high_mean = as.numeric((happiness_score > mean(happiness_score))))

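The sketch below repeats the three dummy-variable steps above on a made-up tibble, so the results are small enough to inspect by eye. The scores are invented and chosen so that their mean is exactly 6.

```r
library(dplyr)

toy <- tibble(
  country         = c("A", "B", "C", "D"),
  happiness_score = c(7.5, 6.2, 5.9, 4.4)
)

toy <- toy %>%
  mutate(
    high       = happiness_score > 6,                     # logical: TRUE / FALSE
    high_num   = as.numeric(high),                        # numeric: 1 / 0
    above_mean = happiness_score > mean(happiness_score)  # threshold from the data
  )

print(toy)
print(sum(toy$high))  # logicals count as 1/0 when summed
```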
Some notes: mutate() vs transmute()

Similar in nature but:

1. mutate() returns original and new columns (variables).
2. transmute() returns only the new columns (variables).

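The difference between the two functions is easiest to see by comparing the column names they return. A minimal sketch on made-up data:

```r
library(dplyr)

toy <- tibble(country = c("A", "B"), score = c(7, 5))

with_mutate    <- toy %>% mutate(score_sq = score^2)     # keeps country and score
with_transmute <- toy %>% transmute(score_sq = score^2)  # keeps only score_sq

print(names(with_mutate))
print(names(with_transmute))
```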
Creating variables by groups
Let's imagine now that we want to create a variable at the region level (recall bys gen in Stata). In R, we can use group_by() before we mutate. So for this example, we are going to pipe the following functions.

1. Group our data by the region variable.
2. Create a variable that would be the mean of happiness_score by each region.
3. Select the variables country, region, happiness_score, mean_hap.

whr15 %>%
group_by(region) %>%
mutate(
mean_hap = mean(happiness_score)
) %>%
select(country, region, happiness_score, mean_hap)

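The same three steps can be traced on a made-up tibble where the group means are easy to verify by hand. The final ungroup() is an addition that is not in the slide: it removes the grouping so that later operations are not applied by region by accident.

```r
library(dplyr)

toy <- tibble(
  country         = c("A", "B", "C"),
  region          = c("West", "West", "East"),
  happiness_score = c(7, 6, 5)
)

by_region <- toy %>%
  group_by(region) %>%                             # 1. group by region
  mutate(mean_hap = mean(happiness_score)) %>%     # 2. regional mean, repeated on each row
  select(country, region, happiness_score, mean_hap) %>%  # 3. keep the relevant variables
  ungroup()                                        # drop the grouping (added for safety)

print(by_region)
```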
Creating multiple variables at the same time
We can create multiple variables in an easy way. Let's imagine that we want to estimate the mean value for the variables happiness_score, health_life_expectancy, and trust_government_corruption.

How can we do it?

We can use the function across(). It behaves this way: across(VARS that you want to transform, FUNCTION to execute).
across() should always be used inside summarise() or mutate().

vars <- c("happiness_score", "health_life_expectancy", "trust_government_corruption")

whr15 %>%
group_by(region) %>%
summarize(
across(all_of(vars), mean)
)


## # A tibble: 3 x 4
## region happiness_score health_life_expecta~ trust_government_corr~
## <chr> <dbl> <dbl> <dbl>
## 1 Australia and New~ 7.28 0.920 0.393
## 2 Central and Easte~ 5.33 0.719 0.0867
## 3 Eastern Asia 5.63 0.877 0.128

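A small self-contained sketch of across() on made-up data follows. The second call, which applies two functions at once and builds the new names with .names, goes beyond what the slide shows and is included only as an illustration.

```r
library(dplyr)

toy <- tibble(
  region          = c("W", "W", "E"),
  happiness_score = c(7, 6, 5),
  freedom         = c(0.6, 0.4, 0.3)
)
vars <- c("happiness_score", "freedom")

# One function applied to several columns, by region
by_region <- toy %>%
  group_by(region) %>%
  summarise(across(all_of(vars), mean), .groups = "drop")

# Several functions at once; {.col} and {.fn} build the new column names
by_region_multi <- toy %>%
  group_by(region) %>%
  summarise(
    across(all_of(vars), list(mean = mean, max = max), .names = "{.col}_{.fn}"),
    .groups = "drop"
  )

print(by_region)
print(names(by_region_multi))
```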
Creating variables
Exercise 5: Create a variable called year that equals the year of each dataframe.
Use mutate()
Remember to assign it to the same dataframe.

How to do it?
whr15 <- whr15 %>%
  mutate(
    year = 2015
  )

whr16 <- whr16 %>%
  mutate(
    year = 2016
  )

whr17 <- whr17 %>%
  mutate(
    year = 2017
  )
Appending and merging data sets

Now that we can identify the observations, we can combine the data sets. Here are two functions to append objects by row:

rbind(df1, df2, df3) # The base R function

bind_rows(df1, df2, df3) # The tidyverse function, making some improvements to base R

Exercise 6: Append data sets.

Use the function bind_rows to append the three WHR datasets:

How to do it?
bind_rows(whr15, whr16, whr17)

What problems do you think we can have with this approach?

One of the problems with binding rows like this is that, sometimes, some columns are not compatible.

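To make the compatibility problem concrete, the sketch below uses two made-up tibbles, one of which lacks the region column. It also shows a type mismatch, which bind_rows() refuses to combine in recent dplyr versions (1.0 or later is assumed here).

```r
library(dplyr)

df15 <- tibble(country = "A", region = "West", score = 7.1)
df17 <- tibble(country = "A", score = 7.0)  # no region column

# bind_rows() matches columns by name and fills the gaps with NA
stacked <- bind_rows(df15, df17)
print(stacked)

# rbind() stops because the number of columns differs
rbind_result <- tryCatch(rbind(df15, df17), error = function(e) "rbind failed")
print(rbind_result)

# Same column name, different types: bind_rows() raises an error
type_result <- tryCatch(
  bind_rows(tibble(year = 2015), tibble(year = "2016")),
  error = function(e) "incompatible types"
)
print(type_result)
```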
Appending and merging data sets
Exercise 7: Fixing our variables and appending the dfs correctly.
Exercise 7a:

Load the R data set regions.RDS from DataWork/DataSets/Raw/Un WHR

regions <- read_rds(here("DataWork", "DataSets", "Raw", "Un WHR", "regions.RDS"))

Appending and merging data sets
We can use the left_join() function to merge two dataframes. The function syntax is: left_join(a_df, another_df, by = c("id_col1")).

A left join takes all the values from the first table, and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values. It is the equivalent of merge, keep(master matched) in Stata.

Exercise 7b:
Now, we join the regions dataframe with the whr17 dataframe.

whr17 <- whr17 %>%

left_join(regions, by = "country") %>%
select(country, region, everything())

Notes: Look at the everything() function. It takes all the remaining variables from the dataframe and puts them after country and region. In this way, select can be used to order columns!

Appending and merging data sets
Exercise 7c:
Check if there is any other country without region info:

Only use pipes %>%
And filter()
Do not assign it to an object.

whr17 %>%
  filter(is.na(region))

## # A tibble: 2 x 14
## country region happiness_rank happiness_score whisker_high whisker_low
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Taiwan Provinc~ <NA> 33 6.42 6.49 6.35
## 2 Hong Kong S.A.~ <NA> 71 5.47 5.55 5.39
## # ... with 8 more variables: economy_gdp_per_capita <dbl>, family <dbl>,
## # health_life_expectancy <dbl>, freedom <dbl>, generosity <dbl>,
## # trust_government_corruption <dbl>, dystopia_residual <dbl>, year <dbl>

So we ended up with two countries with NAs
This is due to the names of the countries. The regions dataset doesn't have "Taiwan Province of China" or "Hong Kong S.A.R., China" but "Taiwan" and "Hong Kong."

How do you think we should solve this?

My approach would be to:

1. fix the names of these countries in the whr17 dataset and;
2. merge (left_join) it with the regions dataset.

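One possible way to carry out the two steps above is sketched here on small stand-in tibbles (whr_toy and regions_toy are hypothetical names, and the rows are a reduced illustration rather than the real files). It uses case_when() to recode the name, which is an assumption about how the fix could be done, not necessarily how the authors did it.

```r
library(dplyr)

whr_toy <- tibble(
  country         = c("Norway", "Taiwan Province of China"),
  happiness_score = c(7.54, 6.42)
)
regions_toy <- tibble(
  country = c("Norway", "Taiwan"),
  region  = c("Western Europe", "Eastern Asia")
)

# Before the fix: the left join leaves one country without a region
unmatched <- whr_toy %>%
  left_join(regions_toy, by = "country") %>%
  filter(is.na(region))

# Step 1: fix the name; Step 2: join again
fixed <- whr_toy %>%
  mutate(
    country = case_when(
      country == "Taiwan Province of China" ~ "Taiwan",
      TRUE ~ country
    )
  ) %>%
  left_join(regions_toy, by = "country")

print(unmatched)
print(fixed)
```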
Appending and merging data sets
Finally, let's keep those relevant variables first and bind those rows.

Exercise 8: Bind all rows and create a panel called: whr_panel .

Use rbind()
Select the variables: country , region , year , happiness_rank , happiness_score , economy_gdp_per_capita ,
health_life_expectancy , freedom for each df, i.e., whr15 , whr16 , whr17 .

keepvars <- c("country", "region", "year", "happiness_rank",

"happiness_score", "economy_gdp_per_capita",
"health_life_expectancy", "freedom")

whr15 <- select(whr15, all_of(keepvars))

whr16 <- select(whr16, all_of(keepvars))
whr17 <- select(whr17, all_of(keepvars))

whr_panel <- rbind(whr15, whr16, whr17) # or bind_rows

Saving a dataset

The data set you have now is the same data set we've been using for earlier sessions, so we can save it now.
As mentioned before, R data sets are often saved as CSV files.
To save a dataset we can use the write_csv function from the tidyverse, or write.csv from base R.

The function takes the following structure:

write_csv(x, file, append = FALSE):

x: the object (usually a data frame) you want to export to CSV
file: the file path to where you want to save it, including the file name and the format (".csv")
append: If FALSE, will create a new file. If TRUE, will append to an existing file.

Saving a dataset
Exercise 9: Save the dataset.
Use write_csv()
Use here()

# Save the whr data set

write_csv(whr_panel,
here("DataWork", "DataSets", "Final", "whr_panel.csv"))

The problem with CSVs is that they cannot differentiate between strings and factors.
They also don't save factor orders.
Data attributes (which are beyond the scope of this training, but also useful to document data sets) are also lost in CSV data.

Saving a dataset
The R equivalent of a .dta file is a .rds file. It can be saved and loaded using the following commands:

write_rds(object, file = ""): Writes a single R object to a file.

read_rds(file): Loads a single R object from a file.

# Save the data set

write_rds(whr_panel, file = here("DataWork", "DataSets", "Final", "whr_panel.Rds"))

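The difference between the two formats can be checked with a round trip. This sketch writes a made-up data frame with a factor column to temporary files (so no project folders are needed) and reads it back from each format.

```r
library(readr)

toy <- data.frame(
  country = c("A", "B", "C"),
  region  = factor(c("West", "East", "West"), levels = c("West", "East"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(toy, csv_path)
write_rds(toy, rds_path)

from_csv <- read_csv(csv_path, col_types = cols())  # cols() lets readr guess types quietly
from_rds <- read_rds(rds_path)

print(class(from_csv$region))   # the factor comes back as plain text
print(class(from_rds$region))   # the factor is preserved
print(levels(from_rds$region))  # and so is the order of its levels
```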
Thank you~~

Appendix

Other relevant functions: slice, subset, select
Arrange: allows you to order rows by specific columns.

whr15 %>%
arrange(region, country) %>%
head(5)

## # A tibble: 5 x 8
## country region year happiness_rank happiness_score economy_gdp_per_c~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia Australia a~ 2015 10 7.28 1.33
## 2 New Zeal~ Australia a~ 2015 9 7.29 1.25
## 3 Albania Central and~ 2015 95 4.96 0.879
## 4 Armenia Central and~ 2015 127 4.35 0.768
## 5 Azerbaij~ Central and~ 2015 80 5.21 1.02
## # ... with 2 more variables: health_life_expectancy <dbl>, freedom <dbl>

Slice: allows you to select, remove, and duplicate rows.

whr15 %>%
  slice(1:5) # to select the first 5 rows

## # A tibble: 5 x 8
## country region year happiness_rank happiness_score economy_gdp_per_ca~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Switzerla~ Western E~ 2015 1 7.59 1.40
## 2 Iceland Western E~ 2015 2 7.56 1.30
## 3 Denmark Western E~ 2015 3 7.53 1.33
## 4 Norway Western E~ 2015 4 7.52 1.46
## 5 Canada North Ame~ 2015 5 7.43 1.33
## # ... with 2 more variables: health_life_expectancy <dbl>, freedom <dbl>

You can also use slice_head and slice_tail to select the first or last rows, respectively, or slice_sample to randomly draw n rows.

Select: allows you to select specific columns.

whr15 %>%
select(region, country, happiness_rank)

## # A tibble: 158 x 3
## region country happiness_rank
## <chr> <chr> <dbl>
## 1 Western Europe Switzerland 1
## 2 Western Europe Iceland 2
## 3 Western Europe Denmark 3
## 4 Western Europe Norway 4
## 5 North America Canada 5
## 6 Western Europe Finland 6
## 7 Western Europe Netherlands 7
## 8 Western Europe Sweden 8
## 9 Australia and New Zealand New Zealand 9
## 10 Australia and New Zealand Australia 10

Combining functions: you can chain these functions in a single pipeline.

whr15 %>%
  arrange(region, country) %>% # Sort by region and country
  filter(!is.na(region)) %>% # Keep observations with non-missing region, if any
  select(country, region, starts_with("happin")) %>% # Select country, region, and variables that start with "happin"
  slice_head() # Get the first row

## # A tibble: 1 x 4
## country region happiness_rank happiness_score
## <chr> <chr> <dbl> <dbl>
## 1 Australia Australia and New Zealand 10 7.28

Using ifelse when creating a variable
We can also create a dummy variable with the ifelse() function. The way we use this function is as: ifelse(test, yes, no) .
We can also use another function called case_when() .

whr15 %>%
mutate(
latin_america_car = ifelse(region == "Latin America and Caribbean", 1, 0)
) %>%
arrange(-latin_america_car) %>%
head(5)

## # A tibble: 5 x 9
## country region year happiness_rank happiness_score economy_gdp_per_c~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Costa R~ Latin Americ~ 2015 12 7.23 0.956
## 2 Mexico Latin Americ~ 2015 14 7.19 1.02
## 3 Brazil Latin Americ~ 2015 16 6.98 0.981
## 4 Venezue~ Latin Americ~ 2015 23 6.81 1.04
## 5 Panama Latin Americ~ 2015 25 6.79 1.06
## # ... with 3 more variables: health_life_expectancy <dbl>, freedom <dbl>,
## # latin_america_car <dbl>
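The slide mentions case_when() without showing it. As a hedged illustration, the sketch below uses it to build a variable with more than two categories on made-up scores; the cut-offs 7 and 5 are arbitrary and chosen only for this example.

```r
library(dplyr)

toy <- tibble(
  country         = c("A", "B", "C"),
  happiness_score = c(7.2, 5.5, 3.9)
)

toy <- toy %>%
  mutate(
    level = case_when(
      happiness_score >= 7 ~ "high",    # conditions are checked in order
      happiness_score >= 5 ~ "medium",
      TRUE                 ~ "low"      # everything else
    )
  )

print(toy)
```

ifelse() is convenient for a single yes/no split, while case_when() avoids nesting several ifelse() calls when there are more categories.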
Factor variables

When we imported this data set, its strings were not read as factors (read_csv() never converts strings to factors).
That was convenient because we knew that we'd have to fix the country names.
The region variable, however, should be a factor.

str(whr_panel$region)

## chr [1:470] "Western Europe" "Western Europe" "Western Europe" ...

Factor variables
To create a factor variable, we use the factor() function (or as_factor() from the forcats package).

factor(x, levels, labels, ordered): turns numeric or string vector x into a factor vector.
levels: a vector containing the possible values of x.
labels: a vector of strings containing the labels you want to apply to your factor variable.
ordered: logical flag to determine if the levels should be regarded as ordered (in the order given).

If your categorical variable does not need to be ordered, and your string variable already has the label you want, making the conversion is quite easy.

Factor variables
Exercise 10: Turn a string variable into a factor.
Use the mutate function to create a variable called region_cat containing a categorical version of the region variable.
TIP: to do this, you only need the first argument of the factor function.

How to do it?
whr_panel <- mutate(whr_panel, region_cat = factor(region))

And now we can check the class of our variable.

class(whr_panel$region_cat)

## [1] "factor"

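The exercise only needs the first argument of factor(). For cases where the order and the labels do matter, a sketch using the remaining arguments on a made-up vector could look like this:

```r
sizes <- c("medium", "low", "high", "low")  # toy values for illustration

sizes_f <- factor(
  sizes,
  levels  = c("low", "medium", "high"),  # the possible values, in the desired order
  labels  = c("Low", "Medium", "High"),  # the labels to display
  ordered = TRUE                         # treat the levels as ordered
)

print(levels(sizes_f))
print(as.integer(sizes_f))  # underlying codes follow the level order
print(sizes_f < "High")     # comparisons work because the factor is ordered
```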
Reshaping a dataset

Finally, let's try to reshape our dataset using the tidyverse functions. No more reshape from Stata. We can use pivot_wider or pivot_longer. Let's assign our wide format panel to an object called whr_panel_wide.

Long to wide:

whr_panel_wide <- whr_panel %>%
  select(country, region, year, happiness_score) %>%
  pivot_wider(
    names_from = year,
    values_from = happiness_score
  )

whr_panel_wide %>%
  head(5)

## # A tibble: 5 x 5
## country region `2015` `2016` `2017`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Switzerland Western Europe 7.59 7.51 7.49
## 2 Iceland Western Europe 7.56 7.50 7.50
## 3 Denmark Western Europe 7.53 7.53 7.52
## 4 Norway Western Europe 7.52 7.50 7.54

Wide to long:

whr_panel_wide %>%
pivot_longer(
cols = `2015`:`2017`,
names_to = "year",
values_to = "happiness_score"
) %>%
head(5)

## # A tibble: 5 x 4
## country region year happiness_score
## <chr> <chr> <chr> <dbl>
## 1 Switzerland Western Europe 2015 7.59
## 2 Switzerland Western Europe 2016 7.51
## 3 Switzerland Western Europe 2017 7.49
## 4 Iceland Western Europe 2015 7.56
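Notice in the output above that year comes back as a character column after pivot_longer(), because it is rebuilt from column names. The sketch below does the full round trip on made-up data and, as one possible fix, uses the names_transform argument to turn year back into a number.

```r
library(dplyr)
library(tidyr)

long <- tibble(
  country         = c("A", "A", "B", "B"),
  year            = c(2015, 2016, 2015, 2016),
  happiness_score = c(7.5, 7.4, 6.1, 6.3)
)

# Long to wide: one column per year
wide <- long %>%
  pivot_wider(names_from = year, values_from = happiness_score)

# Wide to long: back to one row per country-year, with year as numeric
back <- wide %>%
  pivot_longer(
    cols = `2015`:`2016`,
    names_to = "year",
    values_to = "happiness_score",
    names_transform = list(year = as.numeric)
  )

print(wide)
print(back)
```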
Thank you~~

