Bt1101 l1 Lab - Basics of R Ay2425


BT1101

Basics of R Tutorial Part I

@2023 NUS. The contents contained in the document may not be distributed or reproduced in any form or by any means without the written permission of NUS
Lab session contents
• Review related concepts
• Cover Part 1 of tutorial
o Discuss strategy and approach to the assignment questions
o Hands-on coding in R
o Discussion of answers

Everyone is here to learn. If you don't know, please ask. Let's learn together

Learning Objectives
• To be able to download and install R and RStudio on your laptop
• To become familiar with the RStudio interface, and be able to create
an R script file and open and export data files
• To know some common functions used for quick checks
on a dataset
• To be able to do some basic data manipulation with base
R and dplyr on a small dataset (a built-in dataset will be
introduced and used)

Basics of R

Setting up R & RStudio
Link to download R: https://cran.rstudio.com/

Link to download RStudio Desktop (free): https://posit.co/download/rstudio-desktop/

Data Types in R
A basic concept in programming is variables.
• Variables allow you to store information such as values (e.g. “2”) or objects (e.g.
dataframes, functions) in R.
• Calling a variable’s name retrieves the stored information.
• Variable names are case-sensitive!
• Every variable has a data type (class):
- Numeric
- Integer
- Logical
- Character
- Factor
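
The classes listed above can be checked with class(). The following is a minimal sketch; the variable names (x_num, x_int and so on) are made up for illustration and are not part of the tutorial.

```r
# Hypothetical variables, one per data type listed above
x_num <- 2.5                                   # numeric (double)
x_int <- 2L                                    # integer (note the L suffix)
x_lgl <- TRUE                                  # logical
x_chr <- "two"                                 # character
x_fct <- factor(c("small", "large", "small"))  # factor

class(x_num)   # "numeric"
class(x_int)   # "integer"
class(x_lgl)   # "logical"
class(x_chr)   # "character"
class(x_fct)   # "factor"

# Variable names are case-sensitive: X_num is a different (undefined) name
exists("X_num")   # FALSE
```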

Data Structures in R
1. Vectors. Can contain only one datatype (e.g. numeric, character, logical), 1D.
• y1 <- c(1, 2, 2, 3, 4, 5)
• y2 <- c("small", "medium", "large", "large")

2. Matrix. Like vectors, can only contain one datatype (usually numeric). Data is arranged into a
fixed number of rows and columns, 2D.
• mat1 <- matrix(1:4, nrow=2, ncol=2)

3. Arrays. Multidimensional data structures. A matrix can be thought of as a special case of an
array: one with exactly two dimensions.

4. Lists. Can contain one or more datatypes.
• list1 <- list(1, "two", y1, y2)

5. Dataframes. Tabular data.
• data_frame <- data.frame(int_vec, char_vec, bool_vec)
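
The five structures can be built and inspected side by side. This is only a sketch: the contents of int_vec, char_vec and bool_vec and the array arr1 are assumed here for illustration, since the slide does not define them.

```r
y1 <- c(1, 2, 2, 3, 4, 5)
y2 <- c("small", "medium", "large", "large")
mat1 <- matrix(1:4, nrow = 2, ncol = 2)
arr1 <- array(1:24, dim = c(2, 3, 4))     # assumed example of a 3D array
list1 <- list(1, "two", y1, y2)

# Assumed contents for the three vectors used in the data frame
int_vec  <- c(1L, 2L, 3L)
char_vec <- c("a", "b", "c")
bool_vec <- c(TRUE, FALSE, TRUE)
data_frame <- data.frame(int_vec, char_vec, bool_vec)

class(y1)           # "numeric"
dim(mat1)           # 2 2
is.array(mat1)      # TRUE: a matrix is a 2D array
dim(arr1)           # 2 3 4
length(list1)       # 4
dim(data_frame)     # 3 3

# A vector holds one datatype only: mixing types coerces them
class(c(1, "two"))  # "character"
```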

Part 1
1) We will start by exploring the built-in dataset called ToothGrowth. To find out more
about this dataset, type ?ToothGrowth in the R command line.

Where is the help menu and what information can you retrieve from it?


What is the difference between the output of summary versus str?
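
One way to explore the question is to run both functions on ToothGrowth and compare. A minimal sketch, not a model answer:

```r
# str() shows the structure: number of rows and columns,
# each column's type, and its first few values
str(ToothGrowth)

# summary() shows descriptive statistics per column:
# min, quartiles, mean and max for numeric columns, counts for factors
summary(ToothGrowth)

# Related quick checks
dim(ToothGrowth)     # 60 3
names(ToothGrowth)   # "len" "supp" "dose"
```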


How about the difference between head and tail?
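
A short sketch of the two functions; the names first_rows and last_rows are chosen here only for illustration.

```r
first_rows <- head(ToothGrowth)         # first 6 rows by default
last_rows  <- tail(ToothGrowth)         # last 6 rows by default

first_rows
last_rows

# The n argument changes how many rows are returned
head(ToothGrowth, n = 3)
tail(ToothGrowth, n = 3)
```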

Part 1
2) Selecting data
There are several variables in ToothGrowth. Using Base R and dplyr functions, can you
perform (i), (ii) and (iii)?
i. Extract the column supp
ii. Extract rows where supp is equal to “VC” and dose is less than 1 and assign
the output to df2
iii. Extract the values of len where supp is equal to “VC”
iv. Try to perform the above operations (i, ii, iii) again but this time, assign the
output to df2.1, df2.2 and df2.3 respectively.
v. Use the class function to check the class attribute for each of the outputs. Use
the is.data.frame function to check whether the output is a dataframe or a
vector.

Indexing and selection with base R
Cheatsheet: https://github.com/rstudio/cheatsheets/blob/main/base-r.pdf

Hint: df [row, column]
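
Following the df[row, column] hint, questions 2(i) to 2(iii) could be approached in base R roughly as below. This is one possible sketch rather than the official answer; supp_vec, supp_df and len_vc are names assumed for illustration.

```r
# (i) Extract the column supp
supp_vec <- ToothGrowth$supp                      # returns a factor vector
supp_df  <- ToothGrowth[, "supp", drop = FALSE]   # stays a one-column data frame

# (ii) Rows where supp is "VC" and dose is less than 1
#      (row condition before the comma, all columns after it)
df2 <- ToothGrowth[ToothGrowth$supp == "VC" & ToothGrowth$dose < 1, ]

# (iii) Values of len where supp is "VC"
len_vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

nrow(df2)        # 10
length(len_vc)   # 30
```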

dplyr Package
• Data manipulation library in R
• Lets you subset, reshape, join and summarize data typically using less code than would
be required in base R
• Part of the R tidyverse
• Install the package (if not already installed), then load the dplyr library.
- install.packages("tidyverse")
- library(tidyverse)

Packages need to be loaded each time your environment is restarted. If successful, the package
should be reflected in Global Environment.

However, packages only need to be installed ONCE. Comment out the installation code
afterwards, or use the console to run the installation code rather than writing it in your R
scripts; this might save you some errors later on.
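
One way to avoid reinstalling on every run is to install only when the package is missing. The sketch below assumes you only need dplyr rather than the whole tidyverse; it is an optional pattern, not part of the tutorial.

```r
# Install dplyr only if it is not already available (assumption: dplyr alone is enough)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Loading is still needed in every new R session
library(dplyr)

# Check that the package is attached to the current session
"package:dplyr" %in% search()
```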

dplyr Package
Documentation: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
Cheatsheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf

… plus some others …

This will be covered more in Wk 4 lecture materials.

dplyr Package
• dplyr introduces pipes: %>% (the operator itself comes from the magrittr package, which dplyr re-exports)
• Allows you to use the result of one function as the input to another
function that comes after the pipe.
• Essentially, pipes allow you to chain several functions together
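
To illustrate chaining, the sketch below filters and then selects in one pipeline, and compares it with the equivalent nested call. The object names piped and nested are assumed for illustration.

```r
library(dplyr)

# With pipes: read left to right, one step at a time
piped <- ToothGrowth %>%
  filter(supp == "VC") %>%   # step 1: keep only the VC rows
  select(len)                # step 2: keep only the len column

# Without pipes: the same steps, nested inside out
nested <- select(filter(ToothGrowth, supp == "VC"), len)

identical(piped, nested)     # TRUE
dim(piped)                   # 30 1
```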

Part 1
2i) Extract the column supp.

[Figure: the original ToothGrowth data, the result after the first %>%, and the result after both pipes (%>%)]

What is the dplyr %>% doing here?

Part 1
2ii) Extract rows where supp is equal to “VC” and dose is less than 1 and assign the
output to df2

Part 1
2iii) Extract the values of len where supp is equal to “VC”
Part 1
2iv) Try to perform the above operations (i, ii, iii) again but this time, assign the output
to df2.1, df2.2 and df2.3 respectively.
2 v) Use the class function to check the class attribute for each of the outputs.
Use is.data.frame function to check whether the output is a dataframe or a vector.
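
A possible dplyr version of 2(i) to 2(v) is sketched below. It is one approach among several; in particular, pull() is used for 2(iii) on the assumption that a plain vector of values is wanted, whereas select(len) would return a one-column data frame instead.

```r
library(dplyr)

df2.1 <- ToothGrowth %>% select(supp)                       # (i) one-column data frame
df2.2 <- ToothGrowth %>% filter(supp == "VC", dose < 1)     # (ii) rows matching both conditions
df2.3 <- ToothGrowth %>% filter(supp == "VC") %>% pull(len) # (iii) numeric vector of len values

# (v) Check the class of each output
class(df2.1)           # "data.frame"
class(df2.2)           # "data.frame"
class(df2.3)           # "numeric"

is.data.frame(df2.1)   # TRUE
is.data.frame(df2.3)   # FALSE
is.vector(df2.3)       # TRUE
```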

Part 1

Why do you need to assign the output to a label/name?

Part 1
2vi) Use the `slice` function to extract the maximum and minimum values of `len`. Also
use `slice` to extract the 5th to 10th rows of observations.
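
A sketch of 2(vi) using slice() with row positions. Here which.max() and which.min() supply the position of the first maximum and minimum; dplyr also offers slice_max() and slice_min(), which keep ties by default. The names max_row, min_row and rows_5_10 are assumed for illustration.

```r
library(dplyr)

max_row <- ToothGrowth %>% slice(which.max(len))   # row with the largest len
min_row <- ToothGrowth %>% slice(which.min(len))   # row with the smallest len
rows_5_10 <- ToothGrowth %>% slice(5:10)           # 5th to 10th rows

max_row
min_row
rows_5_10
```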

Part 1
3) Adding/Removing/Changing data columns for ToothGrowth data.
i. Change the variable name from len to length and assign the output to df3.1
ii. Increase the value of len by 0.5 if supp is equal to “OJ” and assign the output to
df3.2
iii. Remove the column dose from the data and assign the output to df3.3
iv. Increase the value of dose by 0.1 for all records, rename dose to dose.new, and
assign the output to df3.4
v. Create a new variable high.dose and assign it a value of “TRUE” if dose is more
than 1 and “FALSE” if dose is less than or equal to 1. Assign the dataframe with
the new variable high.dose to df3.5. Export df3.5 to a csv file. Discuss the R
code needed to export it as an Excel file (.xlsx).

Part 1
3i) Change the variable name from len to length and assign the output to df3.1

With the base R option, the results are stored.
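
A sketch of 3(i) in both styles. The note above about results being stored is read here as meaning that the base R replacement form changes the data frame in place, so the work is done on a copy first; that reading, and the name df3.1_base, are assumptions.

```r
library(dplyr)

# dplyr: rename(new_name = old_name) returns a new data frame
df3.1 <- ToothGrowth %>% rename(length = len)

# Base R: copy first, then overwrite the matching column name in place
df3.1_base <- ToothGrowth
names(df3.1_base)[names(df3.1_base) == "len"] <- "length"

names(df3.1)          # "length" "supp" "dose"
names(ToothGrowth)    # the original still has "len"
```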

Part 1
3ii) Increase the value of len by 0.5 if supp is equal to OJ and assign the output to df3.2
Part 1
3iii) Remove the column dose from the data and assign the output to df3.3
Part 1
3iv) Increase the value of dose by 0.1 for all records and rename dose to dose.new and
assign output to df3.4
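
Questions 3(ii) to 3(iv) can be sketched with mutate(), select() and rename(). This is one possible approach, not necessarily the one shown in the lab.

```r
library(dplyr)

# (ii) Add 0.5 to len only where supp is "OJ"; other rows keep their value
df3.2 <- ToothGrowth %>%
  mutate(len = ifelse(supp == "OJ", len + 0.5, len))

# (iii) Drop the dose column (a minus sign inside select() removes a column)
df3.3 <- ToothGrowth %>% select(-dose)

# (iv) Add 0.1 to dose for every row, then rename the column
df3.4 <- ToothGrowth %>%
  mutate(dose = dose + 0.1) %>%
  rename(dose.new = dose)

names(df3.3)   # "len" "supp"
names(df3.4)   # "len" "supp" "dose.new"
```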
Part 1
3v) Create a new variable high.dose and assign it a value of “TRUE” if dose is more than 1
and “FALSE” if dose is less than or equal to 1. Assign the dataframe with the new variable
high.dose to df3.5. Export df3.5 to a csv file. Discuss the R code needed to export it as an
Excel file (.xlsx).
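
A sketch of 3(v) using ifelse() and write.csv(). The file names and the use of tempdir() are assumptions for illustration. For .xlsx, base R has no built-in writer; the writexl package is shown as one option (openxlsx is another), and the call is skipped if writexl is not installed.

```r
library(dplyr)

# New column holding the strings "TRUE" / "FALSE", as the question asks
df3.5 <- ToothGrowth %>%
  mutate(high.dose = ifelse(dose > 1, "TRUE", "FALSE"))

# Export to csv (hypothetical file name, written to a temporary folder)
csv_path <- file.path(tempdir(), "df3_5.csv")
write.csv(df3.5, csv_path, row.names = FALSE)

# Export to Excel, assuming the writexl package is available
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(df3.5, file.path(tempdir(), "df3_5.xlsx"))
}
```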

You can also use mutate(... case_when()):

# dfTooth is assumed to be a copy of the data, e.g. dfTooth <- ToothGrowth
df3.5 <- dfTooth %>% mutate(high.dose = case_when(dose > 1 ~ "TRUE",
                                                  dose <= 1 ~ "FALSE"))

Part 1
4) Sorting
i. There are two functions in Base R “sort” and “order” to perform sorting. How do
these two functions differ? Try to do a sort with each function on
ToothGrowth$len.
ii. Using a base R function (e.g. order), how can you sort the dataframe
ToothGrowth in decreasing order of len?
iii. What dplyr functions can you use to sort ToothGrowth in increasing order of len?
Can you also sort the dataframe in decreasing order of len?

Sorting
Base R
• sort returns the values of the original object, sorted in ascending order by default.
• order returns the indices (positions) that would put the object in sorted order, also ascending by default.

In dplyr, arrange() orders the rows of a data frame by the values of selected
columns in ascending order by default.
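
The difference between sort and order is easiest to see on a tiny vector. The vector x below is made up for illustration.

```r
x <- c(30, 10, 20)

sort(x)                     # 10 20 30  (the sorted values)
order(x)                    # 2 3 1     (positions of the values in sorted order)
x[order(x)]                 # 10 20 30  (indexing by order() reproduces sort())
sort(x, decreasing = TRUE)  # 30 20 10

# The same two calls on the tutorial data
head(sort(ToothGrowth$len))
head(order(ToothGrowth$len))
```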

Part 1
4i) There are two functions in Base R “sort” and “order” to perform sorting. How do
these two functions differ? Try to do a sort with each function on ToothGrowth$len.

Hint: How sorting can be done using Base R and dplyr

Part 1
4ii) Using a base R function (e.g. order), how can you sort the dataframe ToothGrowth
in decreasing order of len?
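
A base R sketch for 4(ii): order() gives the row positions, which are then used as the row index of the data frame. The name df_desc is assumed for illustration.

```r
# Row positions in decreasing order of len, used as the row index
df_desc <- ToothGrowth[order(ToothGrowth$len, decreasing = TRUE), ]

# For a numeric column, negating the values is an equivalent alternative
df_desc_alt <- ToothGrowth[order(-ToothGrowth$len), ]

head(df_desc)
```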

Part 1
4iii) What dplyr function can you use to sort ToothGrowth in increasing order of len?
Can you also sort the dataframe in decreasing order of len?
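
A dplyr sketch for 4(iii), using arrange() with and without desc(). The names df_asc and df_dec are assumed for illustration.

```r
library(dplyr)

df_asc <- ToothGrowth %>% arrange(len)         # increasing order of len (default)
df_dec <- ToothGrowth %>% arrange(desc(len))   # decreasing order of len

head(df_asc)
head(df_dec)
```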
Part 1
5) Factors
i. Check if supp is a factor vector. First type ToothGrowth$supp. What do you
observe with the output?
ii. Next use is.factor() and is.ordered() to check if supp is a factor and, if
so, whether it is an ordered factor.
iii. Now suppose we find that vitamin C (VC) is a superior supplement compared to
orange juice (OJ), and we want to order supp such that VC is a higher level than
OJ, how could we do this?

Part 1
5i) Check if supp is a factor vector. First type ToothGrowth$supp. What do you
observe with the output?

What are factors and their levels? Why do you think factors might be helpful when performing
data analysis?

Part 1
5ii) Next use is.factor() and is.ordered() to check if supp is a factor and, if so,
whether it is an ordered factor.
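
The checks in 5(i) and 5(ii) can be run directly on the column. A minimal sketch:

```r
ToothGrowth$supp                # the values are printed, followed by "Levels: OJ VC"

levels(ToothGrowth$supp)        # "OJ" "VC"
is.factor(ToothGrowth$supp)     # TRUE
is.ordered(ToothGrowth$supp)    # FALSE: the levels have no ranking yet
table(ToothGrowth$supp)         # counts per level
```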
Part 1
5iii) Now suppose we find that vitamin C (VC) is a superior supplement compared to
orange juice (OJ), and we want to order supp such that VC is a higher level than OJ, how
could we do this? (Hint: Assign factor_supp to ToothGrowth$supp)
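
One possible sketch for the last factor question: rebuild the factor with the levels listed from lowest to highest and ordered = TRUE. The name factor_supp follows the hint; working on a copy called tg rather than overwriting ToothGrowth is an assumption made here to keep the built-in data unchanged.

```r
# Levels are listed from lowest to highest, so VC ranks above OJ
factor_supp <- factor(ToothGrowth$supp, levels = c("OJ", "VC"), ordered = TRUE)

is.ordered(factor_supp)    # TRUE
levels(factor_supp)        # "OJ" "VC"

# Ordered factors support comparisons: row 1 is VC, row 31 is OJ
factor_supp[1] > factor_supp[31]   # TRUE

# Assign the ordered factor back, here on a copy of the data
tg <- ToothGrowth
tg$supp <- factor_supp
str(tg$supp)
```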

R & RStudio tips
Clearing your workspace can prevent it from becoming messy.
• To clear the console: ctrl + L
• To clear a variable from your environment: rm(variable)
• To clear all variables (use with caution!): rm(list=ls())

Keyboard shortcuts help you work more efficiently. Some common ones are:
• Run the current line of code: cmd+enter (Mac), ctrl+enter (Windows)
• Insert the <- operator: option + - (Mac), Alt + - (Windows)
• Insert the %>% operator: cmd+shift+M (Mac), ctrl+shift+M (Windows)

Documentation provides information about functions in R and examples of how they’re used.
• To read R documentation about a function: help(function_name) or
?function_name

Next week
• Lecture and tutorials as per usual next week
• Basics of R Part 2 due 9th Sep, 9am — answers must be submitted in
the form of an R script.
