Course 7

The document outlines a course on Data Analysis with R Programming, covering programming concepts, RStudio usage, and data manipulation techniques. It emphasizes the benefits of R for statistical analysis and visualization, and provides hands-on activities for downloading R, using RStudio, and working with data frames. Key topics include basic programming concepts, R packages, and the Tidyverse collection of packages for data analysis.

Uploaded by

student230116

Course 7: Data Analysis with R Programming
M1: Programming & Data analytics
M2: Programming using RStudio
M3: Working with data in R
M4: More about Viz, Aesthetics, and annotations
M5: Documentation & Reports
M1: Programming & Data analytics
Part 1: The exciting world of programming
• Computer programming: giving instructions to a computer to perform an action or
set of actions.
• R is a programming language used for statistical analysis, visualization, and other
data analysis
• Programming languages: the words and symbols we use to write instructions for
computers to follow.
=> a bridge that connects humans and computers
• Syntax: the rules for how words and symbols should be used
• Coding: writing instructions to the computer in the syntax of a specific
programming language
• Benefits of using any programming language to work with your data:
+ clarify the steps of your analysis,
+ saves time,
+ reproduce (research data and code are made available so that others are
Benefits of R
• R: cleaning, analysis, visualization, and reporting
• Clarify: Programming languages have specific rules and guidelines for
giving instructions to the computer
• Saves time: With one line of code, you can create a separate dataset
without any missing values.
• Reproduce and share your work: Data analysis is most useful when
you can reproduce your work and share it with other people
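The "one line of code" claim above can be sketched in base R. This is a minimal illustration only: the `survey` data frame and its values are hypothetical, and `na.omit()` is one of several ways to drop rows with missing values.

```r
# Hypothetical survey data with some missing values (NA)
survey <- data.frame(
  id    = 1:5,
  score = c(4.5, NA, 3.0, 5.0, NA),
  city  = c("Hanoi", "Hue", NA, "Da Nang", "Hanoi")
)

# One line: create a separate dataset that keeps only complete rows
survey_complete <- na.omit(survey)

print(survey_complete)
```

The original `survey` object is left unchanged, so the cleaned copy can be shared while the raw data stays available for checking.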

Part 2
Programming as a data analyst
Ways to learn about programming

• A data analyst collects, transforms, and organizes data to draw conclusions, make
predictions, and drive informed decision-making.
=> R and Python
• R offers convenient statistical features for data analysis and is useful for creating
advanced data visualizations.
• Python is a general-purpose language that you can use to create what you need for
data analysis
• Tips for learning programming languages
- Define a practice project and use the language to help you complete it. This makes
the learning process more practical and engaging.
- Keep previous concepts and coding principles in mind. Many of these are
transferable between programming languages. So, after you have learned one
language, learning a second or third programming language tends to be much easier.
- Create and keep good notes and cheat sheets in whatever format (handwritten or
typed) that works best for you.
- Create an online filing system for information that you can easily access while you
From spreadsheets to SQL to R

• all of the tools (spreadsheets to SQL to R): often used together.


SQL: pull a specific dataset,
R: clean and organize ,
and then export it as a Spreadsheet for quick insights
Introduction to R

• In the 1990s, Ross Ihaka and Robert Gentleman developed R at the
University of Auckland, New Zealand
• Why people who work with data love R
- accessible: anyone can use it
- data-centric: solve problems that involve data
- open-source: the code is freely available and may be modified
and shared by the people who use it
- community: This vibrant, diverse and accessible community is so
supportive of new learners
• Specific situations of R for data analysis:
- reproducing your analysis,
- processing lots of data,
- creating data visualizations
Hands-On Activity: Downloading and installing R &
R Console

• Download R: https://mirrors.cicku.me/cran/ -> R for Windows -> base
-> R-4.4.3
• Install R
• Using R: R console (the program window in R where you make use of
the R programming language. It is an interface that lets you view,
write, edit, and execute your R code.)
Part 3: Learn Programming using RStudio

• IDE (Integrated Development Environment): a software application that brings
together all the tools you might want to use in a single place.
=> RStudio: built specifically for use with R
Hands-On Activity: Cloud access to RStudio

• Access RStudio Cloud (Posit Cloud): https://posit.cloud/plans/free ->
Sign up -> New Project -> RStudio Cloud console
• Install and load packages (Packages are units of reproducible R code.
Members of the R community create packages to keep track of the R
functions that they write and reuse. Packages offer a helpful
combination of code, reusable R functions, descriptive documentation,
tests for checking your code, and sample data sets.):
1. install the core tidyverse packages: > install.packages("tidyverse")
2. Load the tidyverse library: > library(tidyverse)
3. Load the lubridate package: > library(lubridate)
Note: You only need to install a package once, but you need to reload it
every time you start a new session > library(tidyverse)
Hands-On Activity: Get started in RStudio Desktop

• https://posit.co/download/rstudio-desktop/#download
-> WINDOWS
Note: Keep track of the installation folder so that you can create a shortcut
• Install and load packages: install and load packages in your RStudio
Desktop console (like RStudio Cloud)
1. > install.packages("tidyverse")
2. > library(lubridate)
When to use RStudio

• Why RStudio?
- RStudio is designed to handle large data sets, which spreadsheets
might not be able to handle as well.
- RStudio also makes it easy to reproduce your work on different
datasets: input your code
• When RStudio truly shines
- When the data is spread across multiple categories or groups
Ex: analyzing sales data for every city across an entire country
+ it is easy to take a specific analysis step and perform it for each group
(every city) using basic code
+ allows for flexible data visualization
+ create an output of summary stats—or even your visualized plots—for
each group.
M2: Programming using RStudio

Part 1: Understanding basic programming concepts


• The basic concepts of R:
1. Function: a body of reusable code used to perform specific tasks in R
function (arguments)
2. Comment: describe or explain what's going on in your code (# …)
3. Variable: a representation of a value in R that can be stored for use later during
programming (variables can also be called objects)
=> This lets us call up the value any time we need it with just the variable name
Ex: assign a value of a particular data type (such as numeric) to a variable
4. Data types: numeric, date, time,…
5. Vector: a group of data elements of the same type stored in a sequence
6. Pipe: a tool for expressing a sequence of multiple operations, represented by %>%
=> All 6 work together as a foundation for using R
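A minimal base R sketch of how several of these concepts appear together in a few lines. The variable name `daily_sales` and its values are hypothetical; the pipe is left out here because `%>%` is only available once a package such as the tidyverse is loaded.

```r
# Comment: lines starting with # explain the code and are not executed

# Variable + vector: store three numeric values (hypothetical daily sales)
daily_sales <- c(120.5, 98.0, 143.25)

# Function: reusable code, called as function(arguments)
total_sales   <- sum(daily_sales)
average_sales <- mean(daily_sales)

# Data type: check what kind of values the variable holds
sales_class <- class(daily_sales)

print(total_sales)
print(average_sales)
print(sales_class)
```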
Example of The basic concepts of R

R documentation
R has built-in documentation for all functions and packages. To learn more about any R function, just run the
code ?function_name.
?geom_bar
Vectors and lists in R

• Data structure: a format for organizing and storing data (vectors, data frames,
matrices, and arrays)
=> Single data elements don’t give you much information, but when data elements
are combined into vectors, data frames, and other data structures, they can help solve a
business challenge.
• Vector: a group of data elements stored in a one-dimensional sequence;
there are two kinds, atomic vectors and lists
=> Atomic vectors can only contain data of one type.
- Atomic vectors: can be logical, numeric, or character
Create vectors

• Use the c() function to store numeric data in a vector
Ex: c(2.5, 48.5, 101.5)
Note: to create a vector of integers, place the letter L directly after each
number: c(1L, 5L, 15L)
- create a character vector: c("Sara" , "Lisa" , "Anna")
- create a logical vector: c(TRUE, FALSE, TRUE)
- create a vector of a sequence of numbers: c(4:10)
• Determine vector properties: type and length.
+ Determine a vector's type with the typeof() function: Place the code for the vector
inside the parentheses of the function
Ex: typeof(c("a" , "b")) => the output "character"
Note: check if a vector is a specific type by using an is function: is.logical(),
is.double(), is.integer(), or is.character()
Ex:1. x <- c(2L, 5L, 11L)
2. is.integer(x)
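The vector-creation and type-checking steps above can be tried together in base R. This is a small sketch; the variable names are illustrative and not part of the course material.

```r
# Create one vector of each kind described above
numbers  <- c(2.5, 48.5, 101.5)          # double (numeric)
counts   <- c(1L, 5L, 15L)               # integer, note the L suffix
people   <- c("Sara", "Lisa", "Anna")    # character
flags    <- c(TRUE, FALSE, TRUE)         # logical
sequence <- c(4:10)                      # sequence of integers 4 to 10

# typeof() reports the storage type of each vector
types <- c(typeof(numbers), typeof(counts), typeof(people),
           typeof(flags), typeof(sequence))
print(types)

# is.*() functions test for one specific type
x <- c(2L, 5L, 11L)
print(is.integer(x))
print(is.character(x))
```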
Vector length

• Determine the length of an existing vector (the number of elements it
contains) with the length() function
Ex:
• Name vectors: You can name elements in vectors of any type with the
names() function
Ex: 1. assign the variable x to a new vector with three elements
2. use the names() function to assign a different name to each element of the vector
3. display x => returns the vector with its names
• Extract a subset of a vector

Or x["b"]
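A short base R sketch of the length, naming, and subsetting steps described above. The values stored in `x` are hypothetical.

```r
# 1. Assign x to a new vector with three elements
x <- c(33.5, 57.75, 120.05)

# Number of elements in the vector
x_length <- length(x)
print(x_length)

# 2. Give each element a name
names(x) <- c("a", "b", "c")

# 3. Display x: the names are printed above the values
print(x)

# Extract one element, by position or by name
by_position <- x[2]
by_name     <- x["b"]
print(by_position)
print(by_name)
```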
Lists

• Lists: elements can be of any type, including characters, integers, and
logical values. Lists can even contain other lists, matrices, vectors, or
data frames
• Create lists: list(); list("a", 1L, 1.5, TRUE)
• Determine a list's structure: str(); str(list("a", 1L, 1.5, TRUE))
• Name list elements:
list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)
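The list functions above can be combined in one base R sketch. The `nested` list at the end is a hypothetical example added only to show that lists can contain other lists.

```r
# A list can mix types: character, integer, double, logical
mixed <- list("a", 1L, 1.5, TRUE)

# str() shows the structure: one line per element with its type
str(mixed)

# Named list elements can be accessed with $ or [[ ]]
city_ids <- list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)
print(city_ids$Chicago)
print(city_ids[["New York"]])

# Hypothetical nested list: a list stored inside another list
nested <- list(label = "outer", inner = list(1, "two", FALSE))
str(nested)
```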
Dates and times

• Load the tidyverse and lubridate packages in your current R session
- Install package: install.packages("tidyverse")
- Load tidyverse: library(tidyverse)
- Load the lubridate package (library(lubridate)): its tools automatically detect the
date/time format
Alternative: check (tick) the package in the package list
(no need to run the library() command)

• Work with dates and times: 3 types
- date ("2016-08-16")
- time within a day ("20:11:59 UTC")
- date-time ("2018-03-31 18:15:48 UTC")
• Commands: today(); now()
• Identify the order of year, month, and day: ymd("2023-01-20") / mdy("January 20th, 2023") /
ymd(20210120) for unquoted numbers / ymd_hms("2021-01-20 20:11:59")
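The parsing functions above can be sketched as follows, assuming the lubridate package is already installed. The variable names are illustrative only.

```r
library(lubridate)

# The function name states the order of the components in the input
d1 <- ymd("2023-01-20")             # year-month-day string
d2 <- mdy("January 20th, 2023")     # month-day-year written out
d3 <- ymd(20210120)                 # unquoted number
dt <- ymd_hms("2021-01-20 20:11:59")  # date-time, UTC by default

print(d1)
print(d2)
print(d3)
print(dt)

# Drop the time part of a date-time to get a plain date
print(as_date(dt))
```

Because `d1` and `d2` describe the same day in different formats, they parse to the same Date value.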
Data frames

• data frame: a collection of columns containing data (similar to a
spreadsheet or SQL table).
- Data frames can include many different types of data, including
numeric, logical, or character.
- Data frames can have only one element in each cell.
- Each column should be named.
- Each column should consist of elements of the same data type.
• create a data frame: data.frame()
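A minimal base R sketch of `data.frame()` that follows the rules listed above (named columns, one type per column). The column names and values are hypothetical.

```r
# Each argument becomes a named column; all columns have the same length
products <- data.frame(
  x        = c(1, 2, 3),
  y        = c(1.5, 5.5, 7.5),
  in_stock = c(TRUE, FALSE, TRUE)
)

print(products)

# str() confirms that each column holds a single data type
str(products)
```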
Files

• When you’re doing data analysis, you won’t usually create a data
frame yourself. Instead, you’ll import data from another source, such
as a .csv file, a relational database, or a software program
• Create a file: file.create()
ex: file.create("new_csv_file.csv")
returns [1] TRUE if the file was created successfully, or [1] FALSE if it was not
• Copy a file: file.copy("new_text_file.txt", "destination_folder")
• Delete a file: unlink("some_.file.csv")
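The three file functions above can be tried safely inside a temporary folder. This is a sketch under assumptions: the folder name `file_demo` is hypothetical, and `tempdir()` is used so that nothing in your project is touched.

```r
# Work inside R's temporary directory
work_dir <- file.path(tempdir(), "file_demo")
dest_dir <- file.path(work_dir, "destination_folder")
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

# Create an empty file; the result is TRUE on success
src <- file.path(work_dir, "new_text_file.txt")
created <- file.create(src)
print(created)

# Copy the file into the destination folder
copied <- file.copy(src, dest_dir, overwrite = TRUE)
copy_path <- file.path(dest_dir, "new_text_file.txt")
print(file.exists(copy_path))

# Delete the original file; the copy remains
unlink(src)
print(file.exists(src))
```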
Matrices

• Matrix: a two-dimensional collection of data elements that can only contain a
single data type
=> has both rows and columns
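A short base R sketch of creating a matrix with `matrix()`. The values 3 to 8 are arbitrary; the point is that a vector is laid out into rows and columns, filled column by column.

```r
# Lay out six values in 2 rows (so R infers 3 columns)
m_rows <- matrix(c(3:8), nrow = 2)
print(m_rows)

# The same values with the number of columns given instead
m_cols <- matrix(c(3:8), ncol = 2)
print(m_cols)

# dim() returns c(rows, columns)
print(dim(m_rows))
print(dim(m_cols))
```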

End of Part 1
Part 2: Explore coding
• Operators & Calculation:

• Logical operators and conditional statements: AND, OR, and NOT


Note (this differs from the course instructions):
data("airquality")
View(airquality)
- AND:
airquality[, "Solar.R"] > 150 &
airquality[, "Wind"] > 10

The same approach works for OR (|) and NOT (!)
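A base R sketch that applies all three logical operators to the built-in `airquality` dataset. The thresholds 150 and 10 come from the note above; the variable names are illustrative. `Solar.R` contains missing values, so some results are `NA`.

```r
data("airquality")

# One logical value per row for each condition
sunny <- airquality[, "Solar.R"] > 150
windy <- airquality[, "Wind"] > 10

sunny_and_windy <- sunny & windy    # AND: both conditions hold
sunny_or_windy  <- sunny | windy    # OR: at least one condition holds
not_sunny       <- !sunny           # NOT: reverses TRUE and FALSE

# Count matching rows, ignoring rows where the answer is NA
print(sum(sunny_and_windy, na.rm = TRUE))
print(sum(sunny_or_windy, na.rm = TRUE))

# Show the first rows that satisfy the AND condition
print(head(airquality[which(sunny_and_windy), c("Solar.R", "Wind")]))
```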


Conditional statements

if statement
if (x > 0) {print("x is a positive number")}
=> the code in braces is executed if the condition is TRUE

else statement
if (x > 0) {print("x is a positive number")
} else {print("x is either a negative number or zero")}

else if statement
if (x < 0) {print("x is a negative number")
} else if (x == 0) {print("x is zero")
} else {print("x is a positive number")}
Hands-On Activity: R sandbox

• https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 2 -> Lesson3_Sandbox.Rmd
• Carefully read the instructions in the comments of the Rmd file and
complete each step
Content:
- install and load `R packages`; functions
- viewing, cleaning, and visualizing data;
- using `R markdown` to export your work

Note:
- code chunk: a block of code
End of P2
P3: Learning about R packages
• Package: units of reproducible R code -> use to add more functionality
to R
• Packages include:
- reusable R functions
- documentation about the functions
- sample datasets
- tests for checking your code
• R: includes a set of packages called base R that are available to use in
RStudio when you start your first programming session.
• Check packages: installed.packages()
Check packages
• Base: the package is already installed and loaded
• Recommended: package is installed but not loaded
• Check by: a brief description of each package

Ex: class package has a check next to it: successfully loaded for use
• One of the most commonly used sources of packages is CRAN
(comprehensive R archive network)
Available R packages

• R community creates and shares packages so that other users can


access them
• Packages can be found in repositories: CRAN (https://cran.r-project.org/),...
=> Choosing the right packages
- Tidyverse: a collection of R packages specifically designed for
working with data -> a standard library for most data analysts
- Quick list of useful R packages
(https://support.posit.co/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages):
RStudio Support’s list of useful packages
- CRAN task views (https://cran.r-project.org/web/views/): an
index of CRAN packages sorted by task
tidyverse

• Tidyverse: actually a collection of packages in R with a common
design philosophy for data manipulation, exploration, and visualization
• Conflicts happen when packages have functions with the same names as
other functions
• Workflow:

• Update: install.packages("package name")


Hands-On Activity: Installing and loading tidyverse

• Install the tidyverse: install.packages("tidyverse")
• Load the tidyverse: library(tidyverse)
• Read tidyverse vignettes: a vignette is documentation that acts as a guide to an R
package
Ex: browseVignettes("ggplot2")
Note: type the command at the > prompt: browse….
- vignette: a brief description
End of P3
P4: Explore the tidyverse
• 8 core tidyverse packages:
- ggplot2: used for data visualization, specifically plots
- tidyr: used for data cleaning to make tidy data
- readr: used for importing data
- dplyr: offers a consistent set of functions that help you complete some
common data manipulation tasks
- tibble: works with data frames
- purrr: works with functions and vectors
- stringr: work with strings
- forcats: provides tools that solve common problems with factors (store
categorical data in R)

• 4 packages that are an essential part of the workflow for data analysts:
ggplot2, dplyr, tidyr and readr
Use pipes to nest code
• Pipe: a tool in R for expressing a sequence of multiple operations (%>%, keyboard
shortcut Ctrl + Shift + M)
=> it takes the output of one statement and makes it the input of the next statement
Ex: +normal code

+Pipe:

• Nested: describes code that performs a particular function and is
contained within code that performs a broader function
Ex:
Example 1_ pipe: Sort data
Example 2_ pipe: Summarise data
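The difference between nested code and piped code can be sketched as follows, assuming the dplyr package is installed. The built-in `ToothGrowth` dataset is used here as a stand-in; the filter value `dose == 0.5` and the column name `mean_len` are choices made for this illustration.

```r
library(dplyr)
data("ToothGrowth")

# Nested code: read from the inside out (filter first, then arrange)
nested_result <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Piped code: the same steps, read top to bottom
piped_result <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

# Sort and summarise in one pipeline: mean length per supplement
summary_by_supp <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  group_by(supp) %>%
  summarise(mean_len = mean(len, na.rm = TRUE), .groups = "drop")

print(head(piped_result))
print(summary_by_supp)
```

Both versions return the same rows; the piped form mainly helps readability once more than two steps are chained.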
END OF M2
M3: Working with data in R

P1: Explore data


• Data frame, tibble
+ head()
+ str(): Structure

+ colnames()

+ mutate(): add a new column (carat_2) to the diamonds data frame and calculate
its values
Hands-on Activity: Create your own data frame

• Open: https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 3 ->
Lesson2_Dataframe.Rmd
• Read the instructions in the
comments of the .Rmd file and
complete each step:
+ There are 3 common sources for data:
- A `package`, by loading that `package`
- An external file like a spreadsheet or CSV that can be imported
- Data that has been generated from scratch using `R` code
Hands-on: Creating and using data frames

## Step 1: Create a data frame called `people`
- create a vector: names <- c("A", "B", "C", "D")
- create a vector of ages: age <- c(80, 52, 22, 17)
- create a new data frame: people <- data.frame(names, age)
## Step 2: Inspect the data frame: head(people); str(people); glimpse(people); colnames(people)
- create a new variable that would capture each person's age in
twenty years: mutate(people, age_in_20 = age + 20)
## Step 3: Try it yourself: create your data frame, called `fruit_ranks`
- create a vector of any five different fruits
- create a new vector with a number representing your own personal
rank for each fruit: 1-5 (1 like the most)
- create a data frame `fruit_ranks`
Tibbles
• Tibbles are a streamlined variation of data frames.
• Tibbles automatically pull up only the first 10 rows of a dataset and only as many columns
as can fit on your screen => view a small snapshot & also includes the type of data in each
column.
Ex:
• Tibbles increase efficiency:
- Efficiently explore data: tibbles automatically present a manageable preview
of the data.
- Maintain consistency and data integrity: tibbles maintain the consistency of
variable names and data types, ensuring data integrity through the analysis process.

=> reduces the risk of errors and data mishandling, a critical consideration in data analysis.

Note: "streamlined" means organized in a sensible, efficient way


Create a Tibble

• Use the function as_tibble() to create a tibble from an existing
data frame or matrix. Specify the data frame you’d like to convert to
a tibble in the function:
as_tibble(diamonds)
This prints the tibble but won’t save it.
To save the diamonds dataset as a tibble, assign it to a new object
with the following code:
diamonds_tibble <- as_tibble(diamonds)
Examine it with the code: diamonds_tibble
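A small sketch of converting a plain data frame to a tibble, assuming the tibble package is installed. The built-in `mtcars` data frame is used instead of `diamonds` so that the example does not depend on ggplot2; the column name `model` is a choice made here to keep the row names.

```r
library(tibble)

# mtcars is an ordinary data frame
print(class(mtcars))

# Convert and save to a new object; row names become a "model" column
mtcars_tibble <- as_tibble(mtcars, rownames = "model")
print(class(mtcars_tibble))

# Printing a tibble shows only the first 10 rows plus the column types
print(mtcars_tibble)
```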
Data-import basics

1. preloaded datasets from the datasets package
• List datasets: data(); load the “diamonds” dataset: data("diamonds")
2. import data from other sources
• rectangular data: each column referring to a single variable and each
row referring to a single observation: .csv (comma separated
values), .tsv (tab separated values), .fwf (fixed width
files), .log (a .log file is a computer-generated file that records
events from operating systems and other software programs.)
• readr package: a great tool for reading rectangular data
Reading a .csv file with readr

• The readr package comes with some sample files from built-in
datasets. To list the sample files, you can run the
readr_example()
• use the read_csv() function to read the "mtcars.csv" file:
read_csv(readr_example("mtcars.csv"))
=> gives the name and type of each column & tibble
• the readxl package: To import spreadsheet data => transfer data
from Excel into R
- The readxl package is part of the tidyverse but is not a core tidyverse
package, so you need to load readxl in R by library(readxl)
- readxl_example() to see the list
- read_excel(readxl_example("type-me.xlsx"))
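The readr steps above can be run end to end as follows, assuming the readr package is installed. The `show_col_types = FALSE` argument is added here only to silence the column-specification message; the readxl calls follow the same pattern and are not repeated.

```r
library(readr)

# List the sample files that ship with readr
print(readr_example())

# Build the full path to one sample file, then read it
mtcars_path <- readr_example("mtcars.csv")
mtcars_df <- read_csv(mtcars_path, show_col_types = FALSE)

# read_csv() returns a tibble with one typed column per variable
print(mtcars_df)
print(dim(mtcars_df))
```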
Hands-On Activity: Importing and working with data

• Open https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 3 -> Lesson2_Import.Rmd
• Read the instructions in the comments of the Rmd file and complete each step
## The scenario: clean a .csv file that was created after querying a database to combine
two different tables from different hotels
- Step 1: use the `read_csv()` function to import data from a .csv in the project folder
called "hotel_bookings.csv" and save it as a data frame called `bookings_df`
bookings_df <- read_csv("hotel_bookings.csv")
- Step 2: Inspect & clean data
+ head(bookings_df)
+ create another data frame using `bookings_df` that focuses on the average daily
rate (`adr`) in the data frame, and `adults`:
new_df <- select(bookings_df, `adr`, adults)
+ create new variables: use the `mutate()` function. This will make changes to the
data frame, but not to the original data set you imported. That source data will remain
unchanged: mutate(new_df, total = `adr` / adults)
Hands-on (cont.)
- Step 3: Import your own data
+ import and save the file in the current working directory (the Cloud project
folder):
Files -> Upload button -> choose the Target directory -> Choose File (E:\Tina\Data
Science\Course 7\product_inf.csv)
+ Import: read_csv("product_inf.csv")
+ Inspect: View(), ….

End of P1
P2: Cleaning data
Cleaning up with the basics: how to clean up columns
• install 3 packages:
+ here: makes referencing files easier
+ skimr: makes summarizing data easier
+ janitor: simplifies data cleaning tasks
• make sure the dplyr package is loaded
• load the “penguins” dataset from the “palmerpenguins” package
• skim_without_charts(penguins) gives us a pretty comprehensive summary of
a dataset
• glimpse() to get a really quick idea of what's in this dataset
• head() to get a preview of the column names and the first few rows
• select() to specify certain columns or to exclude columns we don't need
Ex:
how to clean up columns

• rename() makes it easy to change column names
Ex:

• rename_with() can change column names to be more consistent
Ex:

• clean_names() in the janitor package automatically makes sure that the column
names are unique and consistent
=> ensures that there are only characters, numbers, and underscores in the names
Ex:
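The column-cleaning functions from dplyr can be sketched on a small stand-in table, assuming dplyr is installed. The data frame `penguins_sample` and the new column name `island_new` are hypothetical; `clean_names()` is left out so that the sketch needs only one package.

```r
library(dplyr)

# Hypothetical three-row stand-in for the penguins dataset
penguins_sample <- data.frame(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, 46.5)
)

# select(): drop a column we do not need
no_species <- penguins_sample %>% select(-species)

# rename(): new name on the left, old name on the right
renamed <- penguins_sample %>% rename(island_new = island)

# rename_with(): apply a function to every column name
upper_names <- rename_with(penguins_sample, toupper)
lower_names <- rename_with(upper_names, tolower)

print(names(no_species))
print(names(renamed))
print(names(upper_names))
```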
File-naming conventions

• file names: should be accurate, consistent, and easy to read
• Example of a good filename: 2020-04-10_march-attendance.R
• Example of a filename to avoid: _20210320*newcustomeridsforfebonly.csv
More on R operators

• Operators: 4 main types


1. Assignment: assign values to variables
Ex: x <- 2, y <- 5 => assigns 2 to x and 5 to y
2. Arithmetic: perform basic math operations, include
+ Modulus %%: returns the remainder after division (y%%x => 1)
+ Integer division %/%: returns an integer value after division
(y%/%x => 2)
+ Exponent ^ (y^x => 25)
3. Relational (or comparators): compare values (less than <, equal to ==, or greater than >, not
equal !=) => output for relational operators is either TRUE or FALSE (which is a logical data type,
or boolean)
Ex: x < y => TRUE
4. Logical: combine logical statements (AND &; OR |; NOT !) and return a logical value, TRUE/
FALSE
Ex: x > 1 & x < 3 => TRUE
y < 5 | y > 6 => FALSE
!(x > 3) => TRUE
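The four operator types can be checked in a few lines of base R, using the same values as the examples above (x is 2, y is 5). The result variable names are illustrative.

```r
# 1. Assignment
x <- 2
y <- 5

# 2. Arithmetic
remainder <- y %% x     # modulus: remainder after division
int_div   <- y %/% x    # integer division
power     <- y ^ x      # exponent

# 3. Relational: the result is a logical value
is_less <- x < y

# 4. Logical: combine or reverse logical values
both_true   <- x > 1 & x < 3
either_true <- y < 5 | y > 6
negated     <- !(x > 3)

print(c(remainder, int_div, power))
print(c(is_less, both_true, either_true, negated))
```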
Organize your data

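• organize and filter data: use the arrange(), group_by(), and filter() functions
1. arrange(): sorts the data

• save cleaned data without losing information from the
original dataset: assign the result to a new data frame
2. group_by(): groups rows by the values of one or more columns (usually followed
by summarize())
filter data

• filter(): keeps only the rows that meet a condition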
Hands-On Activity: Cleaning data in R

• Open https://posit.cloud/content/6208304
-> Save a Permanent Copy -> Course 7 -> Week 3 -> Lesson3_Clean.Rmd.
• read the instructions in the comments of the Rmd file and complete each step:
## The scenario: clean a .csv file that was created after querying a database to
combine two different tables from different hotels
## Step 1: Load packages (`tidyverse`, `skimr`, and `janitor`)
## Step 2: Import data

## Step 3: Getting to know your data: head(bookings_df), `str()`, `glimpse()`,
colnames(), skim_without_charts()
## Step 4: Cleaning your data
+ Create a new data frame with just the columns 'hotel', 'is_canceled', and
'lead_time': trimmed_df <- bookings_df %>% select(hotel, is_canceled, lead_time)
+ combine the arrival month and year into one column using unite():
example_df <- bookings_df %>% select(arrival_date_year, arrival_date_month) %>%
unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"), sep = " ")
Hands-On Activity: Cleaning data in R (cont.)

## Step 5: Another way of doing things


+ create a new column that summed up all the adults, children, and babies on a
reservation for the total number of people:
example_df <- bookings_df %>% mutate(guests = adults + children + babies)

+ Calculate the total number of canceled bookings and the average lead time for
booking: Make a column called 'number_canceled' to represent the total number of
canceled bookings. Then, make a column called 'average_lead_time' to represent the
average lead time. Use the `summarize()`
Hint: example_df <- bookings_df %>% … coding…
Solution: see Lesson3_Clean_Solutions.Rmd
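One possible shape for the `mutate()` and `summarize()` steps in Step 5 is sketched below on a tiny made-up table, assuming dplyr is installed. This is not taken from the solutions file: the data frame `bookings_sample` and its values are hypothetical, and the sketch assumes `is_canceled` is coded as 1 for canceled and 0 otherwise.

```r
library(dplyr)

# Hypothetical five-booking stand-in for bookings_df
bookings_sample <- data.frame(
  is_canceled = c(0, 1, 1, 0, 1),   # assumption: 1 = canceled
  lead_time   = c(10, 200, 35, 7, 90),
  adults      = c(2, 2, 1, 2, 3),
  children    = c(0, 1, 0, 2, 0),
  babies      = c(0, 0, 0, 1, 0)
)

# New column: total number of people on each reservation
with_guests <- bookings_sample %>%
  mutate(guests = adults + children + babies)

# One-row summary: canceled bookings and average lead time
cancel_summary <- bookings_sample %>%
  summarize(number_canceled   = sum(is_canceled),
            average_lead_time = mean(lead_time))

print(with_guests)
print(cancel_summary)
```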
Manually create a data frame

• Create the data frame “employee”
id <- c(1:10)
name <- c("John Mendes", "Rob Stewart", "Rachel Abrahamson",
"Christy Hickman", "Johnson Harper", "Candace Miller",
"Carlson Landy", "Pansy Jordan", "Darius Berry", "Claudia Garcia")
job_title <- c("Professional", "Programmer", "Management", "Clerical",
"Developer", "Programmer", "Management", "Clerical", "Developer",
"Programmer")
employee <- data.frame(id, name, job_title)
Check: print(employee)
Transforming data

• break up a variable across multiple columns, combine existing
columns, or even add new values to your data frame
=> separate(), unite() and mutate() functions to transform our data
Using “employee” data frame
• separate(): splits the first and last names into separate columns

Note: sep = ' ' separates the name column at the blank space;
leaving out sep = ' ' also works here
Transforming data (cont.)

• unite(): merges columns together => the opposite of separate()

• mutate(): used earlier to clean and organize our data, but mutate()
can also be used to add columns with calculations
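The `separate()` and `unite()` steps can be sketched as a round trip, assuming the tidyr package is installed. To keep the block self-contained it rebuilds a three-row version of the employee table; the column names `first_name` and `last_name` are choices made for this illustration.

```r
library(tidyr)

# Three-row version of the employee data frame
employee <- data.frame(
  id        = 1:3,
  name      = c("John Mendes", "Rob Stewart", "Rachel Abrahamson"),
  job_title = c("Professional", "Programmer", "Management")
)

# separate(): split name into two columns at the blank space
employee_split <- separate(employee, name,
                           into = c("first_name", "last_name"),
                           sep = " ")
print(employee_split)

# unite(): merge the two columns back into one, joined by a space
employee_united <- unite(employee_split, "name",
                         first_name, last_name, sep = " ")
print(employee_united)
```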
Wide to long with tidyr

• Wide data: observations spread across several columns
=> pivot_longer() converts wide data to long
• Long data: all the observations in a single column
=> pivot_wider() converts long data to wide
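A minimal sketch of reshaping with tidyr, assuming the package is installed. The `sales_wide` table, its city names, and its values are hypothetical.

```r
library(tidyr)

# Wide data: one column per year
sales_wide <- data.frame(
  city       = c("Hanoi", "Hue"),
  sales_2022 = c(100, 80),
  sales_2023 = c(120, 95)
)

# Wide to long: the year columns become rows
sales_long <- pivot_longer(sales_wide,
                           cols = c(sales_2022, sales_2023),
                           names_to = "year",
                           values_to = "sales")
print(sales_long)

# Long to wide: each distinct year becomes a column again
sales_wide_again <- pivot_wider(sales_long,
                                names_from = year,
                                values_from = sales)
print(sales_wide_again)
```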

End of P2
P3: Take a closer look at the data
• Anscombe's quartet: has 4 datasets that have nearly identical
summary statistics
• install.packages("Tmisc") & load
• Load dataset: data("quartet"): 4 set (I,II,III,IV) & x,y
• get a summary of these statistical measures
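The summary step can be sketched without installing anything by using base R's built-in `anscombe` data frame instead of `quartet` from Tmisc. This is a substitution made for this sketch: `anscombe` stores the same four sets in wide form, as columns x1 to x4 and y1 to y4.

```r
data("anscombe")

x_cols <- anscombe[, paste0("x", 1:4)]
y_cols <- anscombe[, paste0("y", 1:4)]

# Mean and standard deviation of x and y for each of the four sets
x_means <- sapply(x_cols, mean)
y_means <- sapply(y_cols, mean)
x_sds   <- sapply(x_cols, sd)
y_sds   <- sapply(y_cols, sd)

# Correlation between x and y within each set
correlations <- mapply(cor, x_cols, y_cols)

print(round(x_means, 2))
print(round(y_means, 2))
print(round(x_sds, 2))
print(round(y_sds, 2))
print(round(correlations, 3))
```

The printed values are nearly the same across all four sets, which is exactly why the summaries alone cannot tell the datasets apart.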
Anscombe's quartet (cont.)
=> the summary statistics of the datasets are nearly identical, but sometimes just looking at the summarized data can be misleading
• Let's put together some simple graphs to help us visualize this data and check if the datasets
are actually identical
ggplot(quartet,aes(x,y)) + geom_point() + geom_smooth(method=lm, se=FALSE) +
facet_wrap(~set)
(Study the plotting section in detail later!)
Check:
The 4 datasets appear quite different when we visualize them
=> If we had just gone with the statistical summaries, we never would have known
that this data is actually really different
The datasauRus package
• The datasauRus package: creates plots with the Anscombe data in different shapes
• the famous dinosaur, a bull's eye, a star, … => R is a pretty powerful
visualization tool. You could use the relationships between data points to
create many other shapes
• install.packages("datasauRus") (not available in the currently installed R desktop /
Cloud)
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) + geom_point() +
theme_void() + theme(legend.position = "none") + facet_wrap(~dataset, ncol = 3)
The bias function

• bias(a, b): finds the average amount of (a - b)
=> If the model is unbiased, the outcome is close to 0.
A high result (far away from 0) -> data might be biased
• Install: “SimDesign” package
Ex1: use the bias function to compare forecasted temperatures with
actual temperatures

Ex2: compare ordering amount to their actual sales -> they could find
out if they are ordering new stock according to their actual needs
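The idea behind Ex1 can be sketched in base R by computing the average of (a - b) directly, as defined above. The helper `average_difference()` is a hypothetical stand-in written for this sketch, not the SimDesign function itself, and the temperature values are made up for illustration.

```r
# Hypothetical actual and forecasted temperatures for six days
actual_temp    <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)

# Stand-in for bias(a, b): the average amount of (a - b)
average_difference <- function(a, b) {
  mean(a - b)
}

temp_bias <- average_difference(actual_temp, predicted_temp)
print(temp_bias)

# A positive result means the actual values tend to be higher
# than the forecasts, so the forecasts run low on average
print(temp_bias > 0)
```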
Work with biased data

• understand how to identify and manage biased data whenever possible.


• sample(): allows you to take a random sample of elements from a data set
Ex: showing a group of users with 3 ads side-by-side for the same mobile app
design.
After viewing 3 ads, the users complete a survey to determine their
preferences
=> We were seeing consistent bias in favor of the ad viewed first
Solution: We decided to add randomization to the position of the ads using
R (sample(), SMOTE, and NearMiss algorithms).
Presented the ads to users again, and this time, the position of the ads was
random and controlled for bias
=> Less bias (meant that the survey was more effective because the data
was more reliable)
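Randomizing the position of the ads with `sample()` can be sketched in base R as follows. The ad labels, the seed, and the number of users are hypothetical; SMOTE and NearMiss are not shown.

```r
# Fix the seed only so that the sketch is repeatable
set.seed(42)

ads <- c("ad_A", "ad_B", "ad_C")

# sample() without a size argument returns a random ordering
shown_order <- sample(ads)
print(shown_order)

# One random ordering per user: each column is one user's ad order
orders_for_users <- replicate(5, sample(ads))
print(orders_for_users)
```

Every user still sees all three ads, but no ad is systematically shown first.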
Hands-On Activity: Changing your data

• Open: https://posit.cloud/content/6208304
-> Save a Permanent Copy -> Course 7 -> Week 3 -> Lesson3_Change.Rmd
• read the instructions in the comments of the Rmd file and complete each step
• use statistical summaries to explore your data, and gain initial insights for your
stakeholders
• Using RStudio Desktop
## The Scenario: clean a .csv file that was created after querying a database to
combine two different tables from different hotels
## Step 1: Load packages `tidyverse`, `skimr`, and `janitor`
## Step 2: Import data
+ Copy “hotel_bookings.csv” into working directory
C:/Users/DELL/Document/Basic
+ hotel_bookings <- read_csv("hotel_bookings.csv")
## Step 3: Getting to know your data head() / View (),….
Hands-On Activity (cont.)
## Manipulating your data
+ arrange the data by most lead_time to least lead_time because
you want to focus on bookings that were made far in advance
arrange(hotel_bookings, desc(lead_time)), or use a minus sign (-) in place of
desc(): arrange(hotel_bookings, -lead_time)

Note: without saving your data to a new data frame, it does not alter
the existing data frame, check by head(hotel_bookings)
Hands-On Activity (cont.)

+ If you wanted to create a new data frame that had those
changes saved, you would use the assignment operator, <-
hotel_bookings_v2 <- arrange(hotel_bookings, desc(lead_time))

+ find out the maximum and minimum lead_times without
sorting the whole dataset using `arrange()`:
max(hotel_bookings$lead_time)
min(hotel_bookings$lead_time)
+ the average: mean(hotel_bookings$lead_time)
Hands-On Activity (cont.)

+ want to know what the average lead_time before booking is for just
city hotels

+ know a lot more information about city hotels, including the


maximum and minimum lead time. They are also interested in how they are
different from resort hotels
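The two questions above can be sketched with `filter()`, `group_by()`, and `summarise()`, assuming dplyr is installed. The table `hotel_sample` is a made-up stand-in for `hotel_bookings`, and the sketch assumes the `hotel` column uses the labels "City Hotel" and "Resort Hotel".

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings
hotel_sample <- data.frame(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel",
                "Resort Hotel", "City Hotel"),
  lead_time = c(10, 5, 50, 95, 120)
)

# Average lead time for city hotels only
city_average <- hotel_sample %>%
  filter(hotel == "City Hotel") %>%
  summarise(average_lead_time = mean(lead_time))

# Average, minimum, and maximum lead time for each hotel type
hotel_summary <- hotel_sample %>%
  group_by(hotel) %>%
  summarise(average_lead_time = mean(lead_time),
            min_lead_time     = min(lead_time),
            max_lead_time     = max(lead_time),
            .groups = "drop")

print(city_average)
print(hotel_summary)
```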

End of M3
M4: More about Viz, Aesthetics, and annotations

P1: Create data Viz

• some different visualization packages:
+ Base package
+ Others: ggplot2, Plotly, Lattice, RGL (focuses on specific solutions like 3D visuals),
Dygraphs, Leaflet, Highcharter, Patchwork, gganimate and ggridges
+ ggplot2: create all kinds of different plots
More info: ggplot2 was originally created by the statistician and developer Hadley Wickham in
2005. Wickham's inspiration for creating ggplot2 came from the 1999 book The
Grammar of Graphics, a scholarly study of data visualization by computer scientist
Leland Wilkinson. The first two letters of ggplot2 actually stand for grammar of
graphics. And in the same way the grammar of a human language gives us rules to build
any kind of sentence, the grammar of graphics gives us rules to build any kind of visual
See “data-visualization_Cheat_sheet.pdf”
ggplot2
• core concepts in ggplot2: aesthetics, geoms, facets, labels and annotations
1. Aesthetics: a visual property of an object in your plot
Ex: in a scatter plot, aesthetics include the size, shape, color, or location (x-axis,
y-axis) of your data points

2. A geom: the geometric object used to represent your data
Ex: Points (to create a scatter plot), Bars (to create a bar chart), or Lines (to
create a line diagram)
3. Facets: let you display smaller groups or subsets of your data
=> create separate plots for all the variables in your dataset.
4. Labels and annotations: let you customize your plot
=> add text like titles, subtitles and captions to communicate the purpose of
your plot or highlight important data
Hands-On Activity: Visualizing data with ggplot2

• Dataset: “palmerpenguins” (install.packages("palmerpenguins"))
• Load the “penguins” dataset: data(penguins)

• Create a plot in ggplot2: plot the relationship between body mass and flipper length in the
three penguin species
=> A scatterplot of points would be an effective way to display the relationship between the two
variables (flipper length on the x-axis and body mass on the y-axis). There are 3 steps:
1. Start with the ggplot() and choose a dataset to work with
2. Add a geom_function to display your data
3. Map the variables you want to plot in the argument of the aes()
ggplot(data = penguins) + geom_point(mapping =
aes(x = flipper_length_mm, y = body_mass_g))
Or
ggplot(data = penguins, mapping =
Explain the code
• ggplot(data = penguins): creates a coordinate system that layers are added to
data: the dataset “penguins”
• +: adds a new layer to your plot. You complete your plot by adding one or
more layers to ggplot()
• geom_point(): uses points to create a scatterplot (geom_bar() creates bar
charts, and so on)
• mapping = aes(x = flipper_length_mm, y = body_mass_g): the mapping argument
- defines how variables in your dataset are mapped to visual properties
- is always paired with aes(): the x and y arguments specify which
variables to map to the x-axis and the y-axis of the coordinate system
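The layering idea explained above can be checked directly on the plot object. This is a minimal sketch, assuming the ggplot2 and palmerpenguins packages are installed; the names base_plot and scatter are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # provides the `penguins` data frame

# Step 1: ggplot() alone only sets up the data and an empty coordinate system.
base_plot <- ggplot(data = penguins)

# Steps 2 and 3: "+" adds a layer; aes() maps columns to the x and y axes.
scatter <- base_plot +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))

# Each "+ geom_*()" call appends one layer to the plot object.
length(base_plot$layers)
length(scatter$layers)

# Draws the scatterplot; rows with missing measurements are dropped with a warning.
print(scatter)
```

Storing the plot in a variable is optional, but it makes it easy to try several geoms on the same base without retyping the ggplot() call.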
Hands-On Activity: Using ggplot
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 4 -> Lesson2_GGPlot.Rmd
• Read the instructions in the comments of the Rmd file and complete
each step
## The Scenario: create some simple data visualizations with the `ggplot2` package
## Step 1: Import your data
hotel_bookings <- read.csv("hotel_bookings.csv")
## Step 2: Look at a sample of your data
colnames(hotel_bookings)
## Step 3: Begin creating a plot
Statement: "I want to target people who book early, and I have a hypothesis that
people with children have to book in advance."
=> create a visualization to see how true that statement is, or isn't.
Hands-On Activity (cont.)
ggplot(data = hotel_bookings) + geom_point(mapping = aes(x =
lead_time, y = children))

The plot reveals that the hypothesis is incorrect.
You report back to your stakeholder that many of the advance bookings are
being made by people with 0 children.
## Step 5: Try it on your own
Statement: “guests without children book
the most weekend nights” Is this true?
Hands-On Activity (cont.)
=> Try mapping 'stays_in_weekend_nights' on the x-axis and 'children'
on the y-axis
ggplot(data = hotel_bookings) + geom_point(mapping = aes(x =
stays_in_weekend_nights, y = children))

True / False ???

End of P1
P2: Explore aesthetics in analysis
• Enhancing visualizations
=> in the basic scatterplot we can't tell which data points refer to each of
the three penguin species, so map color to species:
ggplot(data = penguins) +
geom_point(mapping = aes(
x = flipper_length_mm, y = body_mass_g,
color = species))
The color, shape and size aesthetics can be used alone or combined.
If we want to change the color of all the points to purple, we set color
outside of aes():
ggplot(data = penguins) + geom_point(mapping =
aes(x = flipper_length_mm, y = body_mass_g),
color = "purple")
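The difference between mapping an aesthetic and setting it can be sketched side by side. This assumes ggplot2 and palmerpenguins are installed; the names mapped and fixed are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)

# Inside aes(): color is mapped to a column, so each species gets its own
# color and a legend is drawn.
mapped <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           color = species))

# Outside aes(): color is a fixed setting applied to every point, no legend.
fixed <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g),
             color = "purple")

print(mapped)
print(fixed)
```

A common slip is writing aes(color = "purple"): ggplot2 then treats "purple" as a data value, so the points get a default color and a legend entry named "purple" instead of purple points.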
geom()
• geom_point; geom_bar; geom_line; geom_smooth
ggplot(data = penguins) + geom_smooth(mapping = aes(x =
flipper_length_mm, y = body_mass_g))
=> a smoothed line makes it easier to detect a trend in the data
ggplot(data = penguins) + geom_smooth(mapping = aes(x =
flipper_length_mm, y = body_mass_g, linetype=species))
ggplot(data = penguins) + geom_smooth(mapping = aes(x =
flipper_length_mm, y = body_mass_g)) + geom_point(mapping = aes(x =
flipper_length_mm, y = body_mass_g))
geom(cont.)

• ggplot(data=diamonds) + geom_bar(mapping=aes(x= cut))
Notice that we didn't supply a variable for the y-axis: when you use
geom_bar(), R automatically counts how many times each x-value appears in
the data, and then shows the counts on the y-axis (the default: count rows)
• geom_bar() uses several aesthetics (color, size, fill…)
ggplot(data=diamonds) + geom_bar(mapping=aes(x= cut,color=cut))
ggplot(data=diamonds) + geom_bar(mapping=aes(x= cut,fill=cut))
What is the difference between color and fill?
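A short sketch of both points about geom_bar(): the automatic counting, and color versus fill. It only assumes ggplot2 is installed (the diamonds dataset ships with it); the variable names are illustrative.

```r
library(ggplot2)

# geom_bar() tallies the rows for each x value instead of needing a y variable.
bars <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

# layer_data() exposes the values ggplot2 computed; they match a plain table().
computed <- layer_data(bars)
print(computed$count)
print(table(diamonds$cut))

# color draws the outline of each bar, fill paints the inside of each bar.
outlined <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))
filled <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

print(outlined)
print(filled)
```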
facets

• The tilde (~) operator is used to define the relationship between the dependent
variable and the independent variables in a statistical model formula.
• The variable on the left-hand side of the tilde operator is the dependent
variable and the variable(s) on the right-hand side of the tilde operator
is/are called the independent variable(s).
• The tilde operator states that the dependent variable depends on the
independent variable(s) on the right-hand side.
Ex: facet_wrap(~species): facet by 1 variable, “species”
ggplot(data=penguins, aes(x=flipper_length_mm, y=body_mass_g))+
geom_point(aes(color=species))+
facet_wrap(~species)
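In facet_wrap() the tilde simply names the variable to split by. ggplot2 also accepts vars() for the same purpose, which some people find easier to read. A minimal sketch, assuming ggplot2 and palmerpenguins are installed:

```r
library(ggplot2)
library(palmerpenguins)

base_plot <- ggplot(data = penguins,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Formula notation: one panel per value of the variable after the tilde.
by_formula <- base_plot + facet_wrap(~species)

# Equivalent notation with vars(), without the formula syntax.
by_vars <- base_plot + facet_wrap(vars(species))

print(by_formula)
```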
Facets (cont.)

• Facets are useful when a plot has too many variables or too many levels
within a variable
• Ex: diamonds dataset: the number of diamonds for each category of
cut (Fair, Good, Very Good, Premium, and Ideal)
ggplot(data=diamonds) + geom_bar(mapping=aes(x= color, fill=cut)) +
facet_wrap(~cut)
Facets (cont.)
• 2 variables: facet_grid(sex~species)
sex: splits the plots vertically (rows) & species: splits them horizontally (columns)
ggplot(data = penguins) + geom_point(mapping = aes(x =
flipper_length_mm, y = body_mass_g, color=species))+
facet_grid(sex~species)
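The row and column roles can be spelled out with the rows and cols arguments of facet_grid(). The sketch below also drops penguins with no recorded sex, since otherwise an extra "NA" row of panels appears; that filtering step is an addition for illustration, not part of the course code.

```r
library(ggplot2)
library(palmerpenguins)

# Some penguins have no recorded sex; dropping them avoids an "NA" facet row.
penguins_sexed <- penguins[!is.na(penguins$sex), ]

grid_plot <- ggplot(data = penguins_sexed) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g,
                           color = species)) +
  # Same layout as facet_grid(sex ~ species): sex gives rows, species columns.
  facet_grid(rows = vars(sex), cols = vars(species))

print(grid_plot)
```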
Hands-On Activity: Aesthetics and visualizations

• Open https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 4 -> Lesson3_Aesthetics.Rmd
• Read the instructions in the comments of the Rmd file and complete
each step
## The Scenario: creating visualizations that highlight different aspects of
the data
## Step 1-3: Import your data
## Step 4: Making a Bar Chart: how many of the transactions are occurring
for each different distribution type
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel))
Hands-On Activity (cont.)
## Practice quiz: which distribution type has the most bookings?
## Step 5: Diving deeper into bar charts
Is the number of bookings for each distribution type different depending
on whether or not there was a deposit, or on what market segment they
represent?
+ Deposit type
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x =
distribution_channel,fill=deposit_type ))
+ Market segment
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel,fill=market_segment ))
Hands-On Activity (cont.)
## Step 6: Facets galore => create separate charts for each deposit type
and market segment
+ Deposit type:
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = distribution_channel)) +
facet_wrap(~ deposit_type)
=> it's hard to read the x-axis labels
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel)) + facet_wrap(~deposit_type) +
theme(axis.text.x = element_text(angle = 45))
=> rotates the text to 45 degrees to make it easier to read.
Hands-On Activity (cont.)
• facet_grid() does something similar. The main difference is that
`facet_grid` will include plots even if they are empty
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel)) + facet_grid(~deposit_type) + theme(axis.text.x
= element_text(angle = 45))
• You should have 3 bar charts. Now, you could put all of this in one
chart and explore the differences by deposit type and market
segment.
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x =
distribution_channel)) + facet_wrap(~deposit_type~market_segment)
+ theme(axis.text.x = element_text(angle = 45))
=> These charts are probably overwhelming and too hard to read, but it
can be useful if you are exploring your data through visualizations.
Hands-On Activity: Filters and plots
• Open https://posit.cloud/content/6208304
• Save a Permanent Copy -> Course 7 -> Week 4 -> Lesson3_Filters.Rmd
• Learn how to use filters and facets with ggplot2 to create
custom visualizations based on different criteria.
## The Scenario: clean hotel booking data, create visualizations with
`ggplot2` to gain insight into the data, and present different facets of the
data through visualization
## Step 1- 3: Import & load packages
## Step 4: Making many different charts
Goal: run a family-friendly promotion targeting key market segments.
=> Managers want to know which market segments generate the largest
number of bookings, and where these bookings are made (city hotels or
resort hotels).
Hands-On Activity (cont.)
• First, you decide to create a bar chart showing each hotel type and
market segment
ggplot(data = hotel_bookings) +
geom_bar(mapping = aes(x = hotel, fill = market_segment))
• In this bar chart it is difficult to compare the size of the market
segments at the top of the bars.
=> to compare each segment clearly, use facet_wrap() to create a separate
plot for each market segment
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = hotel)) +
facet_wrap(~ market_segment)
Hands-On Activity (cont.)
## Step 5: Filtering
Goal: send the promotion to families that make online bookings for city hotels. The online
segment is the fastest growing segment, and families tend to spend more at city hotels
than other types of guests.
=> create a plot that shows the relationship between lead time and guests traveling with
children for online bookings at city hotels
1. Filter data: create a data set that only includes the data you want
onlineta_city_hotels <- filter(hotel_bookings, hotel == "City Hotel" &
market_segment == "Online TA")
View(onlineta_city_hotels)
2. Plot filtered data:
ggplot(data = onlineta_city_hotels) + geom_point(mapping = aes(x = lead_time, y =
children))
=> bookings with children tend to have a shorter lead time, and bookings with 3 children
have a significantly shorter lead time => promotions targeting families can be made
closer to the valid booking dates.
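The filtering step can be tried without the course file. The sketch below assumes the dplyr package is installed and uses a tiny made-up data frame, toy_bookings, that only mimics the column names of hotel_bookings.csv; its values are invented for illustration and say nothing about the real data.

```r
library(dplyr)

# Hypothetical toy rows, NOT taken from hotel_bookings.csv.
toy_bookings <- data.frame(
  hotel          = c("City Hotel", "City Hotel", "Resort Hotel", "City Hotel"),
  market_segment = c("Online TA", "Direct", "Online TA", "Online TA"),
  lead_time      = c(12, 40, 7, 90),
  children       = c(0, 1, 2, 0)
)

# Conditions separated by commas are combined with AND, just like `&`.
onlineta_city_toy <- filter(
  toy_bookings,
  hotel == "City Hotel",
  market_segment == "Online TA"
)

# The same filter written as a pipeline with a single `&` condition.
onlineta_city_toy2 <- toy_bookings %>%
  filter(hotel == "City Hotel" & market_segment == "Online TA")

print(onlineta_city_toy)
```

Inside filter() the column names can be used directly, so there is no need to write the data frame name and `$` in front of each column.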
P3: Annotation layer
• Annotate: add notes to a document or diagram to explain or comment upon it
• Label: labs(title, subtitle, caption)
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y =
body_mass_g, color=species))+ labs(title="Palmer Penguins: Body vs Flipper")
+ Include: subtitle, caption (source)
labs(title="Palmer Penguins: Body vs Flipper", subtitle="Sample of 3 Penguins",
caption="Data collected by Dr. Kristen")
+ Add text inside the plot area with annotate():
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y =
body_mass_g, color=species))+
labs(title="Palmer Penguins: Body vs Flipper", subtitle="Sample of 3
Penguins", caption="Data collected by Dr. Kristen") +
annotate("text", x=220, y=3500, label="The Gentoos are the largest",
color="purple", fontface="bold", size=4.5, angle=25)
Saving your visualizations

• Export:

• ggsave()
ggsave("Palmer Penguins.png")
See the file: Files ->
Hands-On Activity: Annotating and saving viz

• Open https://posit.cloud/content/6208304
• Save a Permanent Copy-> Course 7 -> Week 4 -> Lesson4_Annotations.Rmd
## The Scenario: add annotations to your visualizations
## Step 1-3: Import your data & load “hotel_bookings.csv” dataset
## Step 4: Annotating your chart
+ create a viz that compares market segments between city hotels and resort hotels
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = market_segment)) +
facet_wrap(~hotel)
=> it is unclear where the data is from, what the main takeaway is, or even what the
data is showing
=> add annotations
mindate <- min(hotel_bookings$arrival_date_year)
maxdate <- max(hotel_bookings$arrival_date_year)
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = market_segment)) +
facet_wrap(~hotel) + labs(title="Comparison of market segments by hotel type for hotel
bookings", caption=paste0("Data from: ", mindate, " to ", maxdate), x="Market
Segment")
Hands-On Activity (cont.)

+ save it as a .png file: ggsave('hotel_booking_chart.png')


## Practice quiz: What are the default dimensions that `ggsave()` saved
your image as?
=> specify the height and width
ggsave('hotel_booking_chart.png', width=16, height=8)
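A minimal sketch of saving a plot with explicit dimensions, assuming ggplot2 is installed. It uses the built-in diamonds data and writes to a temporary folder so it can be run anywhere; the file name example_chart.png is made up for illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

# Save into R's temporary directory so the working directory stays clean.
out_file <- file.path(tempdir(), "example_chart.png")

# Naming the plot explicitly avoids relying on "the last plot displayed".
ggsave(out_file, plot = p, width = 16, height = 8, units = "in", dpi = 300)

file.exists(out_file)
```

When the plot argument is left out, ggsave() saves the most recently displayed plot, which is why the shorter calls in this activity work right after a chart has been drawn.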

END OF M4
M5: Documentation & Reports

P1: Develop documentation & reports in Rstudio


• R Markdown (.Rmd): a file format for making dynamic documents
=> use an R Markdown file as a code notebook to save, organize, and
document your analysis using code chunks, comments, and other
features.
• R Markdown documents are written in Markdown (a syntax for
formatting plain text files)
• R Notebook (dynamic documents): includes an interactive option that
lets users run your code and show the graphs and charts that visualize
the code
• R Markdown lets you convert your files into: HTML, PDF, Word
documents, slide presentation or dashboard
=> easy to share the same analysis in a variety of ways
install.packages("rmarkdown")
• Downloaded package files are stored in a temporary folder, e.g.
C:\Users\DELL\AppData\Local\Temp\RtmpKSJOIZ\downloaded_packages
• Open a new R Markdown: File -> New File -> R Markdown
• Knit: save an R Markdown file as a shareable HTML report
Hands-On Activity: Your R Markdown notebook

• Create R Markdown documents to record your analysis => keep track of your data
analysis process and share your work with others
• Open an Rmd file: File -> New File -> R Markdown
+ YAML header section: YAML (“yet another markup language”) is a language used in
data files to improve human readability => you can change the information in the header;
each line has a number associated with it
=> easy to reference a location in the notebook
+ Code chunk: the gray background
=> to run code, start the chunk with ```{r} and end it with ```
+ Add a code chunk:
1. Open Rmd file (Lesson3_Clean.Rmd in Course7> Week3)
2. Edit Rmd file
Some basic formatting options

• Edit :

• Knit (render):

+ *italics works*
+ **bold is useful**
+ create a header: # Conclusion
+ The more hashtags you add (up to 6), the smaller the header
+ Backticks (`) format the text to appear as code even though the text is not in a
code chunk
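The pieces described above (YAML header, a header line, italics, bold, and a code chunk) can be put together and knitted from the console. This is only a sketch, assuming the rmarkdown package is installed; the file name example_report.Rmd and its contents are made up for illustration, and rendering is skipped when pandoc is not available.

```r
library(rmarkdown)

fence <- strrep("`", 3)  # three backticks, the code chunk delimiter

# A tiny, hypothetical R Markdown document built line by line.
rmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "# Conclusion",
  "",
  "*italics works* and **bold is useful**.",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

rmd_file <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_file)

# render() does what the Knit button does; it needs pandoc (bundled with RStudio).
if (pandoc_available()) {
  html_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(html_file)
}
```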
END OF P1
P2: Create R Markdown documents

• Open R Markdown Intr.Rmd file


• Edit:

• Knit

End of P2
P3: Understand code chunks & exports
• Delimiter: a character that indicates the beginning or end of a data
item. In R Markdown, ```{r} and ``` are used as delimiters for code
chunks.
• Write a code chunk: Code -> Insert Chunk
=>

Or
Hands-On Activity: Adding code chunks to R Markdown notebooks

• create two new chunks:


+ ```{r ggplot for penguin data}
library(ggplot2)
library(palmerpenguins)
data(penguins)
View(penguins)
```
+ ```{r ggplot for penguin data visualization}
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
```
Slide Presentations

• R Markdown renders files to specific presentation formats when you
use the following output settings:
+ beamer_presentation: for PDF presentations with beamer
+ ioslides_presentation: for HTML presentations with ioslides
+ slidy_presentation: for HTML presentations with Slidy
+ powerpoint_presentation: for PowerPoint presentations
+ revealjs::revealjs_presentation: for HTML presentations with
reveal.js (a framework for creating HTML presentations; this format
requires the revealjs package)
Ex:
Dashboards

• Dashboards are a useful way to quickly communicate a lot of
information. The flexdashboard package lets you publish a group of
related data visualizations as a dashboard. Flexdashboard also
provides tools for creating sidebars, tabsets, value boxes, and gauges.
Ex:
Hands-On Activity: Exporting your R Markdown notebook

• Export:

End of M5
