Course 7
R Programming
M1: Programming & Data analytics
M2: Programming using Rstudio
M3: Working with data in R
M4: More about Viz, Aesthetics, and annotations
M5: Documentation & Reports
M1: Programming & Data analytics
Part 1: The exciting world of programming
• Computer programming: giving instructions to a computer to perform an action or
set of actions.
• R is a programming language used for statistical analysis, visualization, and other
data analysis
• Programming languages: the words and symbols we use to write instructions for
computers to follow.
=> a bridge that connects humans and computers
• Syntax: the rules for how the words and symbols of a language should be used
• Coding: writing instructions to the computer in the syntax of a specific
programming language
• Benefits of using any programming language to work with your data:
+ clarify the steps of your analysis,
+ saves time,
+ reproduce (research data and code are made available so that others are
Benefits of R
• R: cleaning, analysis, visualization, and reporting
• Clarify: Programming languages have specific rules and guidelines for
giving instructions to the computer
• Saves time: With one line of code, you can create a separate dataset
without any missing values.
• Reproduce and share your work: Data analysis is most useful when
you can reproduce your work and share it with other people
Part 2
Programming as a data analyst
Ways to learn about programming
• A data analyst collects, transforms, and organizes data to draw conclusions, make
predictions, and drive informed decision-making.
=> R and Python
• R offers convenient statistical features for data analysis and is useful for creating
advanced data visualizations.
• Python is a general-purpose language that you can use to create what you need for
data analysis
• Tips for learning programming languages
- Define a practice project and use the language to help you complete it. This makes
the learning process more practical and engaging.
- Keep previous concepts and coding principles in mind. Many of these are
transferable between programming languages. So, after you have learned one
language, learning a second or third programming language tends to be much easier.
- Create and keep good notes and cheat sheets in whatever format (handwritten or
typed) that works best for you.
- Create an online filing system for information that you can easily access while you
From spreadsheets to SQL to R
• https://posit.co/download/rstudio-desktop/#download
-> WINDOWS
Note: Save the installation folder so you can create a shortcut
• Install and load packages: install and load packages in your RStudio
Desktop console (like RStudio Cloud)
1. > install.packages("tidyverse")
2. > library(lubridate)
When to use RStudio
• Why RStudio?
- RStudio is designed to handle large datasets, which spreadsheets
might not be able to handle as well.
- RStudio also makes it easy to reproduce your work on different
datasets: input your code once, then rerun it on new data
• When RStudio truly shines
- When the data is spread across multiple categories or groups
Ex: analyzing sales data for every city across an entire country
+ it is easy to take a specific analysis step and perform it for each group
(every city) using basic code
+ allows for flexible data visualization
+ create an output of summary stats—or even your visualized plots—for
each group.
M2: Programming using Rstudio
R documentation
R has built-in documentation for all functions and packages. To learn more about any R function, just run the
code ?function_name.
?geom_bar
Vectors and lists in R
• Data structure: a format for organizing and storing data (vectors, data frames,
matrices, and arrays)
=> Single data elements don’t give you much information, but when data elements
are combined into vectors, data frames, and other data structures -> to solve a
business challenge.
• Vector: a group of data elements stored in a one-dimensional sequence. There are
two kinds: atomic vectors and lists
=> Atomic vectors can only contain data of one type; lists can contain elements of different types.
- Atomic vectors: can be logical, numeric (integer or double), or character
Create vectors
Or x["b"] (select the element named "b")
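The vector ideas above can be sketched in base R as follows. The variable names and values are made up for illustration.

```r
# Atomic vectors hold elements of a single type.
num_vec <- c(2.5, 48.1, 101.5)          # numeric (double)
chr_vec <- c("Sara", "Lisa", "Anna")    # character
lgl_vec <- c(TRUE, FALSE, TRUE)         # logical

# Inspect the type and length of a vector.
typeof(num_vec)
length(chr_vec)

# Name the elements, then select one by position or by name.
x <- c(a = 1, b = 2, c = 3)
x[2]      # second element
x["b"]    # same element, selected by name

# A list can mix types in a single object.
mixed <- list("a", 1L, 1.5, TRUE)
str(mixed)
```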
Lists
• When you’re doing data analysis, you won’t usually create a data
frame yourself. Instead, you’ll import data from another source, such
as a .csv file, a relational database, or a software program
• Create a file: file.create()
ex: file.create("new_csv_file.csv")
=> returns [1] TRUE if the file was created successfully, [1] FALSE if it failed
• Copy a file: file.copy("new_text_file.txt", "destination_folder")
• Delete a file: unlink("some_.file.csv")
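A minimal sketch of the three file functions, run inside a temporary folder so no project files are touched. The folder name file_demo is an assumption made for this example.

```r
# Work inside a temporary folder so nothing in the project is touched.
work_dir <- file.path(tempdir(), "file_demo")
dest_dir <- file.path(work_dir, "destination_folder")
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

csv_path <- file.path(work_dir, "new_csv_file.csv")

# file.create() returns TRUE on success and FALSE on failure.
created <- file.create(csv_path)

# file.copy() copies the file into an existing folder and also returns TRUE/FALSE.
copied <- file.copy(csv_path, dest_dir, overwrite = TRUE)

# unlink() deletes the file; it returns 0 on success and 1 on failure.
deleted <- unlink(csv_path)

file.exists(csv_path)                                  # the original is gone
file.exists(file.path(dest_dir, "new_csv_file.csv"))   # the copy remains
```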
Matrices
End of Part 1
Part 2: Explore coding
• Operators & Calculation:
if statement
if (x > 0) {print("x is a positive number")}
=> the code inside the braces is executed only if the condition is TRUE
else statement
if (x > 0) {print("x is a positive number")
} else {print("x is either a negative number or zero")}
else if statement
if (x < 0) {print("x is a negative number")
} else if (x == 0) {print("x is zero")
} else {print("x is a positive number")}
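To try the three branches with different inputs, the chain can be wrapped in a small function. describe_sign() is a hypothetical helper written for this sketch, not part of the course material.

```r
# Hypothetical helper that wraps the if / else if / else chain.
describe_sign <- function(x) {
  if (x < 0) {
    "x is a negative number"
  } else if (x == 0) {
    "x is zero"
  } else {
    "x is a positive number"
  }
}

describe_sign(-3)
describe_sign(0)
describe_sign(7)
```

Note that `else` sits on the same line as the closing brace. At the top level of a script, R treats an `else` that starts a new line as a syntax error.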
Hands-On Activity: R sandbox
• https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 2 -> Lesson3_Sandbox.Rmd
• Carefully read the instructions in the comments of the Rmd file and
complete each step
Content:
- install and load `R packages`; functions
- viewing, cleaning, and visualizing data;
- using `R markdown` to export your work
Note:
- code chunk: a block of code
End of P2
P3: Learning about R packages
• Package: units of reproducible R code -> use to add more functionality
to R
• Packages include:
- reusable R functions
- documentation about the functions
- sample datasets
- tests for checking your code
• R: includes a set of packages called base R that are available to use in
RStudio when you start your first programming session.
• Check packages: installed.packages()
Check packages
• Base: the package is already installed and loaded
• Recommended: package is installed but not loaded
• Check by: a brief description of each package
Ex: class package has a check next to it: successfully loaded for use
• One of the most commonly used sources of packages is CRAN
(comprehensive R archive network)
Available R packages
End of P3
P4: Explore the tidyverse
• 8 core tidyverse packages:
- ggplot2: used for data visualization, specifically plots
- tidyr: used for data cleaning to make tidy data
- readr: used for importing data
- dplyr: offers a consistent set of functions that help you complete some
common data manipulation tasks
- tibble: works with data frames
- purrr: works with functions and vectors
- stringr: works with strings
- forcats: provides tools that solve common problems with factors (store
categorical data in R)
• 4 packages that are an essential part of the workflow for data analysts:
ggplot2, dplyr, tidyr and readr
Use pipes to nest code
• Pipe: a tool in R for expressing a sequence of multiple operations (written %>%; shortcut Ctrl +
Shift + M)
=> it takes the output of one statement and makes it the input of the next statement
Ex: + normal (nested) code
+ Pipe:
+ colnames()
+ mutate(): adds a new column (carat_2) to the diamonds data frame and calculates
its values
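A hedged sketch of nested code versus piped code, assuming the dplyr package is installed. It uses the built-in ToothGrowth dataset as a stand-in for the diamonds data, and len_2 is a hypothetical column name chosen for the example.

```r
library(dplyr)

data("ToothGrowth")

# Nested version: read from the inside out.
nested <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Piped version: read from top to bottom.
piped <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

# Both approaches produce the same data frame.
identical(nested, piped)

# mutate() adds a column computed from an existing one (hypothetical len_2).
ToothGrowth %>%
  mutate(len_2 = len * 2) %>%
  colnames()
```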
Hands-on Activity: Create your own data frame
• Open: https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 3 ->
Lesson2_Dataframe.Rmd
• Read the instructions in the
comments of the .Rmd file and
complete each step:
+ There are 3 common sources for data:
- A `package`, by loading that `package`
- An external file like a spreadsheet or CSV that can be imported
- Data that has been generated from scratch using `R` code
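The third source, data generated from scratch, might look like this in base R. The names and ages are invented sample values.

```r
# Build each column as a vector of the same length.
person_name <- c("Peter", "Jennifer", "Julie", "Alex")
age         <- c(15, 19, 21, 25)

# Combine the vectors into a data frame; each vector becomes a column.
people <- data.frame(person_name, age)

str(people)        # column names and types
head(people)       # first rows
mean(people$age)   # summary of one column
```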
Hands-on: Creating and using data frames
=> reduces the risk of errors and data mishandling, a critical consideration in data analysis.
• The readr package comes with some sample files from built-in
datasets. To list the sample files, you can run the
readr_example()
• use the read_csv() function to read the "mtcars.csv" file:
read_csv(readr_example("mtcars.csv"))
=> prints the name and type of each column and returns the data as a tibble
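A short sketch of the import step, assuming the readr package is installed. The argument show_col_types = FALSE only silences the column-type message and is not required.

```r
library(readr)

# Path to a sample file that ships with readr.
csv_path <- readr_example("mtcars.csv")

# Read the file; the result is a tibble.
cars <- read_csv(csv_path, show_col_types = FALSE)

class(cars)
dim(cars)
spec(cars)   # column names and the type guessed for each one
```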
• the readxl package: To import spreadsheet data => transfer data
from Excel into R
- The readxl package is part of the tidyverse but is not a core tidyverse
package, so you need to load readxl in R by library(readxl)
- readxl_example() to see the list
- read_excel(readxl_example("type-me.xlsx"))
Hands-On Activity: Importing and working with data
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 3 -> Lesson2_Import.Rmd
• Read the instructions in the comments of the Rmd file and complete each step
## The scenario: clean a .csv file that was created after querying a database to combine
two different tables from different hotels
- Step 1: use the `read_csv()` function to import data from a .csv in the project folder
called "hotel_bookings.csv" and save it as a data frame called `bookings_df`
bookings_df <- read_csv("hotel_bookings.csv")
- Step 2: Inspect & clean data
+ head(bookings_df)
+ create another data frame using `bookings_df` that focuses on the average daily
rate (`adr`) in the data frame, and `adults`:
new_df <- select(bookings_df, `adr`, adults)
+ create new variables: use the `mutate()` function. This will make changes to the
data frame, but not to the original data set you imported. That source data will remain
unchanged: mutate(new_df, total = `adr` / adults)
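The select() and mutate() steps can be tried without the course file. This sketch assumes dplyr is installed and uses a tiny hypothetical stand-in for bookings_df with invented values.

```r
library(dplyr)

# Hypothetical stand-in for bookings_df (the real file has many more columns).
bookings_df <- tibble(
  hotel  = c("City Hotel", "Resort Hotel", "City Hotel"),
  adr    = c(100, 90, 150),
  adults = c(2, 1, 3)
)

# Keep only the two columns of interest.
new_df <- select(bookings_df, adr, adults)

# mutate() returns a new data frame; new_df is unchanged unless you reassign it.
with_total <- mutate(new_df, total = adr / adults)

colnames(new_df)
with_total$total
```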
Hand_on (cont.)
- Step 3: Import your own data
+ import and save the file in the current working directory (the Cloud project
folder):
Files -> Upload button -> choose the Target directory -> Choose File (E:\Tina\Data
Science\Course 7\product_inf.csv)
+ Import: read_csv("product_inf.csv")
+ Inspect: View(), …
End of P1
P2: Cleaning data
Cleaning up with the basics: how to clean up columns
• install 3 packages:
+ here: makes referencing files easier
+ skimr: makes summarizing data easier
+ janitor: simplifies data cleaning tasks
• make sure the dplyr package is loaded
• load the "penguins" dataset from the "palmerpenguins" package
• skim_without_charts(penguins) gives us a pretty comprehensive summary of
a dataset
• glimpse() to get a really quick idea of what's in this dataset
• head() to get a preview of the column names and the first few rows
• select() to specify certain columns or to exclude columns we don't need
Ex:
how to clean up columns
• clean_names() in the janitor package will automatically make sure that the column
names are unique and consistent
=> ensures that there are only characters, numbers, and underscores in the names
Ex:
File-naming conventions
• organize and filter data: use the arrange(), group_by(), and filter() functions
1. arrange(): sorts our data
• filter()
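A hedged sketch of the three functions, assuming dplyr is installed. penguins_mini is a hypothetical five-row miniature of the penguins data with invented values, used so the example does not depend on palmerpenguins.

```r
library(dplyr)

# Hypothetical miniature of the penguins data.
penguins_mini <- tibble(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Dream", "Biscoe", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, NA, 46.1, 50.0, 46.5)
)

# arrange(): sort rows; a minus sign (or desc()) sorts in descending order.
arrange(penguins_mini, -bill_length_mm)

# filter(): keep only the rows that meet a condition.
gentoo_only <- filter(penguins_mini, species == "Gentoo")

# group_by() + summarize(): one summary row per group, ignoring missing values.
by_species <- penguins_mini %>%
  group_by(species) %>%
  summarize(mean_bill = mean(bill_length_mm, na.rm = TRUE))
by_species
```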
Hands-On Activity: Cleaning data in R
• Open https://posit.cloud/content/6208304
-> Save a Permanent Copy -> Course 7 -> Week 3 -> Lesson3_Clean.Rmd.
• read the instructions in the comments of the Rmd file and complete each step:
## The scenario: clean a .csv file that was created after querying a database to
combine two different tables from different hotels
## Step 1: Load packages (`tidyverse`, `skimr`, and `janitor`)
## Step 2: Import data
+ Calculate the total number of canceled bookings and the average lead time for
booking: Make a column called 'number_canceled' to represent the total number of
canceled bookings. Then, make a column called 'average_lead_time' to represent the
average lead time. Use the `summarize()`
Hint: example_df <- bookings_df %>% … coding…
Solution: see Lesson3_Clean_Solutions.Rmd
Manually create a data frame
Note: sep = " " separates the name column at the first blank space
It also works without sep = " "
Transforming data (cont.)
• unite(): merges columns together => the opposite of separate()
• mutate(): used earlier to clean and organize our data, but mutate()
can also be used to add columns with calculations
Wide to long with tidyr
=> pivot_longer()
• Long data: all the observations in a single column
=> pivot_wider()
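The transformation functions in this part can be sketched together, assuming the tidyr package is installed. The employee and sales tables below are hypothetical, built by hand for the example.

```r
library(tidyr)

# Hypothetical employee table built by hand.
employee <- data.frame(
  id        = 1:3,
  name      = c("John Mendes", "Rob Stewart", "Rachel Abrahamson"),
  job_title = c("Professional", "Programmer", "Management")
)

# separate(): split name into two columns at the blank space.
split_names <- separate(employee, name,
                        into = c("first_name", "last_name"), sep = " ")

# unite(): the opposite, merge the two columns back into one.
rejoined <- unite(split_names, "name", first_name, last_name, sep = " ")

# pivot_longer(): wide -> long. pivot_wider(): long -> wide.
wide <- data.frame(id = 1:2, q1 = c(10, 20), q2 = c(30, 40))
long <- pivot_longer(wide, cols = c(q1, q2),
                     names_to = "quarter", values_to = "sales")
back <- pivot_wider(long, names_from = quarter, values_from = sales)
```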
End of P2
P3: Take a closer look at the data
• Anscombe's quartet: has 4 datasets that have nearly identical
summary statistics
• install.packages("Tmisc") & load
• Load dataset: data("quartet"): 4 set (I,II,III,IV) & x,y
• get a summary of these statistical measures
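As an alternative that needs no extra package, base R ships the same quartet as the `anscombe` dataset (columns x1..x4 and y1..y4 rather than a set column). This sketch swaps it in for Tmisc's quartet and computes the summary statistics per set.

```r
# Base R ships Anscombe's quartet as `anscombe` (columns x1..x4 and y1..y4).
data("anscombe")

# Compute the same statistics for each of the four sets.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), sd_x = sd(x),
             mean_y = mean(y), sd_y = sd(y),
             cor_xy = cor(x, y))
}))

round(quartet_summary, 2)
```

The rounded rows come out nearly the same for all four sets, which is why the notes go on to plot the data.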
Anscombe's quartet (cont.)
=> The summary statistics are nearly identical, but sometimes just looking at the summarized data can be misleading
• Let's put together some simple graphs to help us visualize this data and check if the datasets
are actually identical
ggplot(quartet,aes(x,y)) + geom_point() + geom_smooth(method=lm, se=FALSE) +
facet_wrap(~set)
(Study the plotting part carefully later!)
Check: the 4 datasets appear quite different when we visualize them
=> If we had just gone with the statistical summaries, we never would have known
that the data is actually really different
The datasauRus package
• The datasauRus package: creates plots with the Anscombe data in different shapes
• Shapes include the famous dinosaur, a bull's eye, a star, … => R is a pretty powerful
visualization tool. You could use the relationships between data points to
create many other shapes
• install.packages("datasauRus") (not available in the R Desktop / Cloud installation
currently in use)
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) + geom_point() +
theme_void() + theme(legend.position = "none") +
facet_wrap(~dataset, ncol = 3)
The bias function
Ex2: compare ordering amount to their actual sales -> they could find
out if they are ordering new stock according to their actual needs
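The notes do not show the bias call itself, so this is only a base R sketch of the underlying idea: the average difference between actual and predicted values. The sales and order figures are hypothetical, and mean_bias is computed by hand rather than with a packaged function.

```r
# Hypothetical numbers: units actually sold vs. units ordered (predicted need).
actual_sales    <- c(150, 203, 137, 247, 116, 287)
predicted_sales <- c(200, 300, 150, 250, 150, 300)

# Average difference between actual and predicted values.
# A result near 0 suggests little bias; a negative result means the
# predictions (orders) are consistently higher than the actual sales.
mean_bias <- mean(actual_sales - predicted_sales)
mean_bias
```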
Work with biased data
• Open: https://posit.cloud/content/6208304
-> Save a Permanent Copy -> Course 7 -> Week 3 -> Lesson3_Change.Rmd
• read the instructions in the comments of the Rmd file and complete each step
• use statistical summaries to explore your data, and gain initial insights for your
stakeholders
• Using RStudio Desktop
## The Scenario: clean a .csv file that was created after querying a database to
combine two different tables from different hotels
## Step 1: Load packages `tidyverse`, `skimr`, and `janitor`
## Step 2: Import data
+ Copy “hotel_bookings.csv” into working directory
C:/Users/DELL/Document/Basic
+ hotel_bookings <- read_csv("hotel_bookings.csv")
## Step 3: Getting to know your data head() / View (),….
Hands-On Activity (cont.)
## Manipulating your data
+ arrange the data by most lead_time to least lead_time because
you want to focus on bookings that were made far in advance
arrange(hotel_bookings, desc(lead_time)), or use a minus sign (-) instead of
desc(): arrange(hotel_bookings, -lead_time)
Note: without saving your data to a new data frame, it does not alter
the existing data frame, check by head(hotel_bookings)
Hands-On Activity (cont.)
+ want to know what the average lead_time before booking is for just
city hotels
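One possible way to answer the city-hotel question, assuming dplyr is installed. The four-row hotel_bookings table is a hypothetical stand-in with invented values for the real file.

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(120, 30, 80, 10)
)

# Sort from longest to shortest lead time; the original data frame is not modified.
sorted <- arrange(hotel_bookings, desc(lead_time))

# Keep only city hotels, then average their lead time.
city_avg <- hotel_bookings %>%
  filter(hotel == "City Hotel") %>%
  summarize(average_lead_time = mean(lead_time))
city_avg
```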
End of M3
M4: More about Viz, Aesthetics, and annotations
• Create a plot in ggplot2: plot the relationship between body mass and flipper length in the
three penguin species
=> A scatterplot of points would be an effective way to display the relationship between the two
variables (flipper length on the x-axis and body mass on the y-axis). There are 3 steps:
1. Start with ggplot() and choose a dataset to work with
2. Add a geom function to display your data
3. Map the variables you want to plot in the arguments of aes()
ggplot(data = penguins) + geom_point(mapping =
aes(x = flipper_length_mm, y = body_mass_g))
Or
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
Explain the code
• ggplot(data = penguins): creates a coordinate system
data: dataset “penguins”
• +: add a “+” symbol to add a new layer to your plot. You complete
your plot by adding one or more layers to ggplot()
• geom_point(): use points to create scatterplots (geom_bar(): create
bar charts),…
+ (mapping = aes(x = flipper_length_mm, y = body_mass_g)):
mapping argument:
- define how variables in your dataset are mapped to visual properties.
- always paired with the aes(): The x and y arguments specify which
variables to map to the x-axis and the y-axis of the coordinate system.
Hands-On Activity: Using ggplot
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy-> Course 7 -> Week 4 -> Lesson2_GGPlot.Rmd.
• read the instructions in the comments of the Rmd file and complete
each step
## The Scenario: create some simple data visualizations with the `ggplot2` package
## Step 1: Import your data hotel_bookings <- read.csv("hotel_bookings.csv")
## Step 2: Look at a sample of your data colnames(hotel_bookings)
## Step 3: Begin creating a plot
Statement: "I want to target people who book early, and I have a hypothesis that
people with children have to book in advance."
=> create a visualization to see how true that statement is-- or isn't.
Hands-On Activity (cont.)
ggplot(data = hotel_bookings) + geom_point(mapping = aes(x =
lead_time, y = children))
End of P1
P2: Explore aesthetics in analysis
• Enhancing visualizations
=> we can't tell which data points
refer to each of the three penguin species
ggplot(data = penguins) +
geom_point(mapping = aes(
x = flipper_length_mm, y = body_mass_g,
color = species))
Combine (OR / AND): color, shape & size
If we want to change the color of all the
points to purple, we code outside of the aes():
ggplot(data = penguins) + geom_point(mapping =
aes(x = flipper_length_mm, y = body_mass_g),
color = "purple")
geom()
• geom_point; geom_bar; geom_line; geom_smooth
ggplot(data = penguins) + geom_smooth(mapping = aes(x =
flipper_length_mm, y = body_mass_g))
=> enables the detection of a data trend
ggplot(data = penguins) + geom_smooth(mapping = aes(x =
flipper_length_mm, y = body_mass_g, linetype=species))
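The difference between mapping an aesthetic inside aes() and setting it outside can be sketched as follows, assuming ggplot2 is installed. penguins_mini is a hypothetical five-row stand-in for the penguins data with invented values.

```r
library(ggplot2)

# Hypothetical miniature of the penguins data.
penguins_mini <- data.frame(
  species           = c("Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"),
  flipper_length_mm = c(181, 190, 196, 217, 230),
  body_mass_g       = c(3750, 3650, 3800, 5000, 5700)
)

# Inside aes(): color is mapped to a variable, so each species gets
# its own color and a legend is drawn.
mapped <- ggplot(data = penguins_mini) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           color = species))

# Outside aes(): color is a fixed setting applied to every point.
fixed <- ggplot(data = penguins_mini) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g),
             color = "purple")

mapped
fixed
```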
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy-> Course 7 -> Week 4 -> Lesson3_Aesthetics.Rmd
• read the instructions in the comments of the Rmd file and complete
each step
## The Scenario: creating visualizations that highlight different aspects of
the data
## Step 1: Import your data …. Step 3
## Step 4: Making a Bar Chart: how many of the transactions are occurring
for each different distribution type
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes
(x = distribution_channel))
Hands-On Activity (cont.)
## Practice quiz: what distribution type has the most number of bookings?
## Step 5: Diving deeper into bar charts
Explore whether the number of bookings for each distribution type differs depending
on whether or not there was a deposit, or on what market segment the bookings
represent
+ deposit
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x =
distribution_channel,fill=deposit_type ))
+ Market segment
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel,fill=market_segment ))
Hands-On Activity (cont.)
## Step 6: Facets galore => create separate charts for each deposit type
and market segment
+ deposit type:
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel)) +
facet_wrap(~ deposit_type )
=> it's hard to read the x-axis labels
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel)) + facet_wrap(~deposit_type) +
theme(axis.text.x = element_text(angle = 45))
=> rotates the text to 45 degrees
to make it easier to read.
Hands-On Activity (cont.)
• The facet_grid () does something similar. The main difference is that
`facet_grid` will include plots even if they are empty
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel)) + facet_grid(~deposit_type) + theme(axis.text.x
= element_text(angle = 45))
• You should have 3 bar charts. Now, you could put all of this in one
chart and explore the differences by deposit type and market
segment.
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel)) + facet_wrap(~deposit_type~market_segment)
+ theme(axis.text.x = element_text(angle = 45))
=> These charts are probably overwhelming and too hard to read, but it
can be useful if you are exploring your data through visualizations.
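To see the facet_wrap() / facet_grid() difference on a small scale, here is a sketch assuming ggplot2 is installed. hotel_mini is a hypothetical six-row stand-in for hotel_bookings, and the two-variable facet_grid() formula is chosen here to make the empty panels visible.

```r
library(ggplot2)

# Hypothetical miniature of the hotel bookings data.
hotel_mini <- data.frame(
  distribution_channel = c("TA/TO", "Direct", "TA/TO", "Corporate", "Direct", "TA/TO"),
  deposit_type   = c("No Deposit", "No Deposit", "Non Refund",
                     "No Deposit", "Refundable", "No Deposit"),
  market_segment = c("Online TA", "Direct", "Groups",
                     "Corporate", "Direct", "Online TA")
)

base_plot <- ggplot(data = hotel_mini) +
  geom_bar(mapping = aes(x = distribution_channel)) +
  theme(axis.text.x = element_text(angle = 45))

# facet_wrap(): one panel per deposit type, wrapped into rows and columns as needed.
by_deposit <- base_plot + facet_wrap(~deposit_type)

# facet_grid(): rows x columns; every combination gets a panel, even if it is empty.
grid <- base_plot + facet_grid(deposit_type ~ market_segment)

by_deposit
grid
```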
Hands-On Activity: Filters and plots
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy-> Course 7 -> Week 4 -> Lesson3_Filters.Rmd
• how to make use of the filters and facets features of ggplot2 to create
custom visualizations based on different criteria.
## The Scenario: clean hotel booking data, create visualizations with
`ggplot2` to gain insight into the data, and present different facets of the
data through visualization
## Step 1- 3: Import & load packages
## Step 4: Making many different charts
To run a family-friendly promotion targeting key market segments.
=> Managers want to know which market segments generate the largest
number of bookings, and where these bookings are made (city hotels or
resort hotels).
Hands-On Activity (cont.)
• First, you decide to create a bar chart showing each hotel type and
market segment
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = hotel, fill = market_segment))
• this bar chart: difficult to compare the size of the market segments at
the top of the bars.
=> to compare each segment clearly,
use facet_wrap() to create a separate plot for each market
segment
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = hotel)) +
facet_wrap(~ market_segment)
Hands-On Activity (cont.)
## Step 5: Filtering
To send the promotion to families that make online bookings for city hotels. The online
segment is the fastest growing segment, and families tend to spend more at city hotels
than other types of guests.
=> create a plot that shows the relationship between lead time and guests traveling with
children for online bookings at city hotels
1. filter data: create a data set that only includes the data you want
onlineta_city_hotels <- filter(hotel_bookings, (hotel == "City Hotel" &
market_segment == "Online TA"))
View(onlineta_city_hotels)
2. plot filtered data:
ggplot(data = onlineta_city_hotels) + geom_point(mapping = aes(x = lead_time, y =
children))
=> bookings with children tend to have a shorter lead time, and bookings with 3 children
have a significantly shorter lead time => promotions targeting families can be made
closer to the valid booking dates.
P3: Annotation layer
• Annotate: add notes to a document or diagram to explain or comment upon it
• Label: labs(title, subtitle, caption)
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y =
body_mass_g, color=species))+ labs(title="Palmer Penguins: Body vs Flipper")
+ Include: subtitle, caption (source)
labs(title="Palmer Penguins: Body vs Flipper", subtitle="Sample of 3 Penguins",
caption="Data collected by Dr. Kristen")
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y =
body_mass_g, color=species))+
labs(title="Palmer Penguins: Body vs Flipper", subtitle="Sample of 3
Penguins",caption="Data collected by Dr.Kristen") +
annotate("text", x= 220, y=3500, label="The Gentoos are the largest", color
="purple", fontface="bold", size=4.5, angle=25)
Saving your visualizations
• Export:
• ggsave()
ggsave("Palmer Penguins.png")
View the file: Files ->
Hands-On Activity: Annotating and saving viz
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy-> Course 7 -> Week 4 -> Lesson4_Annotations.Rmd
## The Scenario: add annotations to your visualizations
## Step 1-3: Import your data & load "hotel_bookings.csv" dataset
## Step 4: Annotating your chart
+ create a viz that compares market segments between city hotels and resort hotels
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = market_segment)) +
facet_wrap(~hotel)
=> it is unclear where the data is from, what the main takeaway is, or even what the data is
showing
=> Annotations
mindate <- min(hotel_bookings$arrival_date_year)
maxdate <- max(hotel_bookings$arrival_date_year)
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = market_segment)) +
facet_wrap(~hotel) + labs(title="Comparison of market segments by hotel type for hotel
bookings", caption=paste0("Data from: ", mindate, " to ", maxdate), x="Market
Hands-On Activity (cont.)
END OF M4
M5: Documentation & Reports
• create R Markdown documents to record your analysis => keep track of your data
analysis process and share your work with others
• Open an Rmd file: File -> New File -> R Markdown
+ YAML (yet another markup language, a language used in data files to improve human
readability) header section => can change the information;
each line has a number associated with it
=> easy to reference a location in the notebook
+ code chunk: the gray background
=> Run code; a chunk starts with ```{r} and ends with ```
• Edit :
• Knit (render):
+ *italics works*
+ **bold is useful**
+ create a header: # Conclusion
+ The more hashtags you add (up to 6), the smaller the header
+ Tick marks format the text to appear as code even though the text is not in a
code chunk
END OF P1
P2: Create R Markdown documents
• Knit
End of P2
P3: Understand code chunks & exports
• Delimiter: is a character that indicates the beginning or end of a data
item. In RMarkdown, ```{r } and ``` can be used as delimiters for code
chunks.
• Write a code chunk: Code -> Insert Chunk
=>
Or
Hands-On Activity: Adding code chunks to R Markdown notebooks
• Export:
End of M5