R programming
R was developed by statisticians and data analysts as an interactive environment for data
analysis.
Some of the advantages of R are that
(1) it is free and open source
(2) it has the capability to save scripts
(3) there are numerous resources for learning
(4) it is easy for developers to share software implementations.
Expressions are evaluated in the R console when you type the expression into the console
and hit Return.
A great advantage of R over point and click analysis software is that you can save your
work as scripts.
“Base R” is what you get after you first install R. Additional components are available via
packages.
# installing the dslabs package
install.packages("dslabs")
# loading the dslabs package into the R session
library(dslabs)
The base version of R is quite minimal, but you can supplement its functions by installing
additional packages.
We will be using tidyverse and dslabs packages for this course.
Install packages from R console: install.packages("pkg_name")
Install packages from RStudio interface: Tools > Install Packages (allows autocomplete)
Once installed, we can use library(pkg_name) to load a package each time we want to
use it
library(tidyverse)
library(dslabs)
data(murders)
murders %>%
  ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
2. R BASICS
2.2 OBJECTS
Typing ls() shows all the variables saved in the workspace.
To define a variable, we may use the assignment symbol, <-.
There are two ways to see the value stored in a variable:
(1) type the variable name into the console and hit Return,
(2) use the print() function by typing print(variable_name) and hitting Return.
Objects are things that are stored in named containers in R. They can be variables,
functions, etc.
The ls() function shows the names of the objects saved in your workspace.
# assigning values to variables
a <- 1
b <- 1
c <- -1
# solving the quadratic equation
(-b + sqrt(b^2 - 4*a*c))/(2*a)
(-b - sqrt(b^2 - 4*a*c))/(2*a)
2.1 FUNCTIONS
Examples of functions: sqrt(), log(), etc.
To evaluate most functions, parentheses are used, but not always (operators such as + are functions too).
Functions can be nested; nested functions are evaluated from the inside out, e.g., log(exp(1)).
Help files are the user manuals for functions: help("log") or ?log. For operators, add quotes, e.g., ?"+".
Arguments are the inputs a function expects: args(log) lists them, and they can be changed to your liking, e.g., log(8, base = 2).
Example: the base argument of log() defaults to base = exp(1), making it the natural log; log() takes x and base.
All available data sets can be seen by typing data().
There are prebuilt data and mathematical objects, e.g., the CO2 data set, pi, and Inf (infinity).
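A short sketch of these points in code (base R only):
# arguments and defaults
args(log)          # shows the arguments of log(): x and base
log(8, base = 2)   # change the default base; returns 3
log(exp(1))        # nested call, evaluated from the inside out; returns 1
# help files
help("log")        # same as ?log
# prebuilt data sets and mathematical objects
data()             # list the available data sets
pi                 # 3.141593...
Inf                # infinity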
Creating and saving scripts simplifies the coding work
2.3: DATA TYPES
2.4: VECTORS
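The snippets below assume a named vector codes has been defined. A minimal sketch consistent with the course's country-code example, extended to five entries (the last two are merely illustrative) so that codes[1:5] works:
codes <- c(italy = 380, canada = 124, egypt = 818, usa = 1, france = 33)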
We can get more than one entry by using a multi-entry vector as an index:
codes[1:5] # to access a range of entries, 1 through 5
codes[c(1, 3)] # to access specific entries
Subsetting by name gives the same result as subsetting by position:
codes["canada"]
codes[c("egypt","italy")]
In seq(7, 49, 7), the first argument defines the start, the second defines the end, and the third defines the size of the jump.
length.out is a seq() argument that lets us generate a sequence of evenly spaced values with a specific length.
For example, x <- seq(0, 100, length.out = 5) generates the five numbers 0, 25, 50, 75, 100.
Integer class: you can create an integer by adding the letter L after a number, e.g., 3L.
The main difference is that integers take up less space in the computer's memory, so for large operations using integers can have a substantial impact.
The class of seq(1, 10) is integer.
The class of x <- seq(0, 100, length.out = 5) is numeric.
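A quick check of these classes:
class(3L)                           # "integer"
class(seq(1, 10))                   # "integer"
class(seq(0, 100, length.out = 5))  # "numeric"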
2.5 SORTING
murders$state[1:10]
murders$abb[1:10]
There is a simpler way to find the extreme values: the max() function returns the largest value and min() the smallest.
Rank function:
x <- c(31, 4, 15, 92, 65)
x
rank(x) # returns ranks (smallest to largest)
Original – the raw data
Sort – arranged from smallest to largest
Order – the indices needed to obtain the sorted data
Rank – the rank of each entry of the original vector (smallest = 1)

Original  Sort  Order  Rank
31        4     2      3
4         15    3      1
15        31    1      2
92        65    5      5
65        92    4      4
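Applying the three functions to the same vector x reproduces the columns of the table:
x <- c(31, 4, 15, 92, 65)
sort(x)   # 4 15 31 65 92
order(x)  # 2 3 1 5 4  (indices that sort x)
rank(x)   # 3 1 2 5 4  (rank of each original entry)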
- Suppose we want the states with murder rates lower than 0.71 (Italy's rate). Then:
murder_rate <- murders$total / murders$population * 100000
index <- murder_rate < 0.71
# or, to include 0.71 itself:
index <- murder_rate <= 0.71
index
murders$state[index]
To count how many entries are TRUE, the sum() function is used; logicals are coerced to numeric, with TRUE as 1 and FALSE as 0.
sum(index) # the number of states with murder rates below 0.71
We previously tried to calculate the average using mean(na_example) and got NA. This is because the mean() function returns NA if it finds even one NA. As a consequence, a common operation is to remove the NA entries before applying functions like mean(). The operator ! is logical negation: !TRUE becomes FALSE and !FALSE becomes TRUE. In the previous exercise, we defined ind as the logical vector that tells us which entries are NA. We'll use ind here again.
x <- c(1, 2, 3)
ind <- c(FALSE, TRUE, FALSE)
x[!ind]
# result is 1 and 3
# calculate the average of `na_example` after removing the NA entries
# using the `!` operator on `ind`
ind <- is.na(na_example)
mean(na_example[!ind])
We use & (logical "and") to get a vector of logicals that satisfies both conditions:
west <- murders$region == "West"
safe <- murder_rate <= 1
index <- safe & west
murders$state[index]
It does not make sense to compare total murders in California to those in other states, given its large population, so instead we compute murders per capita using vector arithmetic.
Vector arithmetic occurs element-wise: each state's murder total is divided by its own population, then multiplied by 100,000 to express the rate per 100,000 people.
# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]
2.8 INDEXING
# create an `ind` vector for states located in the Northeast with murder rates lower than 1
ind <- murders$region == "Northeast" & murder_rate < 1
a) Scatter plots
population_in_millions <- murders$population / 10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)
b) Histogram
hist(murders$rate)
c) Box plots
boxplot(rate~region, data = murders)
DATA WRANGLING
This section covers R commands and techniques that help you wrangle and analyze data.
To manipulate data tables, we use the dplyr package. It introduces functions that perform the most common manipulations and gives these functions names that are relatively easy to remember.
To change a data table by adding a new column, or changing an existing one, we use the mutate() function.
To subset the data by selecting specific columns, we use the select() function.
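A minimal sketch of both functions on the murders data (the rate column name follows the later examples in these notes):
library(dslabs)
library(dplyr)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)  # add a new column
new_table <- select(murders, state, region, rate)             # keep selected columns
head(new_table)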
- Some summary statistics are the mean, median, and standard deviation.
- The summarize() function from dplyr provides an easy way to compute summary
statistics.
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)
# minimum, median, and maximum murder rate for the states in the West region
s <- murders %>%
  filter(region == "West") %>%
  summarize(minimum = min(rate),
            median = median(rate),
            maximum = max(rate))
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)
# minimum, median, and maximum murder rate for the states in the West region using quantile
# note that this returns a vector
murders %>%
  filter(region == "West") %>%
  summarize(range = quantile(rate, c(0, 0.5, 1)))
Like most dplyr functions, summarize always returns a data frame. This might be problematic if you want to use the result with functions that require a numeric value.
To get numeric values or vectors, we can use the accessor (the dollar sign) or the dplyr pull() function.
If we want to save the number directly with just one line of code, we can do the whole operation in a single pipe, as sketched below. The resulting US murder rate object is then numeric, which we can check with the class() function.
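A sketch of both steps, following the US murder rate example mentioned above (the exact computation, summing totals and populations across states, is assumed):
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^5) %>%
  pull(rate)
class(us_murder_rate)  # "numeric"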
Although not needed for the functions presented here, it is often convenient to have a way to access the data frame being piped.
Think of the dot as a placeholder for the data being passed through the pipe.
Because that object is a data frame, we can access its columns with the dollar sign, the accessor.
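A sketch of the dot placeholder combined with the dollar-sign accessor (using the same assumed US murder rate computation as above):
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^5) %>%
  .$rate
class(us_murder_rate)  # "numeric"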
Group then summarize
A common operation in data exploration is to first split the data into groups and then compute summaries for each group.
For example, we may want to compute the median murder rate in each region of the
United States.
The group_by() function helps us do this. If we type murders %>% group_by(region), the result does not look very different from the original murders data table, except that when we print the object we see the grouping information: region, followed by [4], the number of groups.
Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame.
And dplyr functions, in particular summarize, will behave differently when acting on this
object. Conceptually, you can think of this table as many tables with the same columns
but not necessarily the same number of rows stacked together in one object.
Now see what happens when we summarize the data after grouping: the summarize function applies the summarization to each group separately.
Group by followed by summarize is one of the most used operations in data analysis.
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)
# group by region
murders %>% group_by(region)
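For example, the median murder rate in each region, which is the group_by-then-summarize pattern described above:
murders %>%
  group_by(region) %>%
  summarize(median_rate = median(rate))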
To order an entire table by a chosen column, we use the arrange() function from dplyr. For example, to see the states ordered by population size:
# order the states by population size
murders %>% arrange(population) %>% head()
Note that the default behavior is to order in ascending order; in dplyr, the desc() function transforms a vector so that arrange() sorts it in descending order.
If we are ordering by a column with ties, we can use a second column to break the ties.
Similarly, a third column can be used to break ties between first and second columns, and so
on.
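A sketch of descending order and of breaking ties with a second column:
# order states by murder rate, largest first
murders %>% arrange(desc(rate)) %>% head()
# order by region first, then by rate within each region
murders %>% arrange(region, rate) %>% head()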
To find the sex of the shortest individual in the heights data, we index the sex column by the position of the minimum height, e.g., heights$sex[which.min(heights$height)].
Data table
The data.table package is designed to analyze large data sets faster and more efficiently, with a focus on analytical and statistical work.
It is a separate package, so we install it and then load it with library(data.table). We will use the data.table equivalents of the dplyr verbs mutate, filter, select, group_by, and summarize.
The first step in using data.table is to convert the data frame into a data.table object using the setDT() function.
data.table uses an approach that avoids making a new copy when assigning, called update by reference (the := operator). This can help with large data sets that take up most of your computer's memory.
Also, .() tells data.table that the things inside the parentheses are column names, not variables in the calling environment.
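A minimal sketch of these steps (the rate computation mirrors the earlier dplyr examples):
library(dslabs)
library(data.table)
data(murders)
setDT(murders)                                # convert to a data.table by reference
murders[, rate := total / population * 10^5]  # mutate equivalent: update by reference
head(murders[, .(state, rate)])               # .() marks state and rate as column names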
In this course, we often use tidyverse packages to illustrate because these packages
tend to have code that is very readable for beginners.
There are other approaches to wrangling and analyzing data in R that are faster and
better at handling large objects, such as the data.table package.
Because the data.table package is designed to avoid wasting memory, assigning a table to a new name does not create a new copy of the object. The := operator changes the table by reference. If you want to make an actual copy, you need to use the copy() function.
# selecting in dplyr
select(murders, state, region)
# selecting in data.table
murders[, .(state, region)]
# x and y refer to the same data.table, so := modifies both (update by reference)
x <- data.table(a = 1)
y <- x
x[, a := 2]
y  # a is now 2
y[, a := 1]
x  # a is now 1
With data.table, we again use an approach similar to subsetting matrices, except data.table knows that rate refers to a column name and not to an object in the R environment.
Look at this simple line of code, which returns the rows for states with rates below 0.7. Combining it with column selection (shown after the snippet below) gives just the state names and rates.
# subsetting in dplyr
filter(murders, rate <= 0.7)
# subsetting in data.table
murders[rate <= 0.7]
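The filtering and the column selection can also be combined in a single data.table call, which gives exactly the state names and rates described above:
# filter rows and select columns in one data.table expression
murders[rate <= 0.7, .(state, rate)]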
The lines of code below show how simple summarizing is with data.table compared to dplyr.
In data.table we can call functions inside .() and they will be applied to the columns.
# summarizing in dplyr
s <- heights %>%
  summarize(average = mean(height), standard_deviation = sd(height))
# summarizing in data.table
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]
The data table package also makes it easier to order rows based on the values of a
column.
Here's some code that orders the murders data set based on the population size of
states. We simply type the following.
# order by population
murders[order(population)] |> head()
Similarly, we can perform nested ordering by including more than one variable in order(). For example, to order by region and then, within region, by murder rate, we simply type this code:
# order by region and then murder rate
murders[order(region, rate)]
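To sort in descending order instead, order() accepts a decreasing argument:
# order by population, largest first
murders[order(population, decreasing = TRUE)] |> head()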
D: Tibbles
# compare what happens when accessing a column that doesn't exist
# in a regular data frame versus in a tibble
murders$State             # a data frame returns NULL silently (the column is "state")
as_tibble(murders)$State  # a tibble returns NULL but also prints a warning
# create a tibble; tibbles can hold complex entries such as functions
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
PROGRAMMING BASICS
Section 3 introduces general programming features used to perform exploratory data analysis, build data analysis pipelines, and prepare visualizations that communicate the results.
A: CONDITIONALS
- The most basic feature of programming is the if-else statement.
- The basic idea of the example below is to print the reciprocal of a unless a is 0.
If...else conditions have this general form:
if(boolean condition){
  expressions
} else{
  alternative expressions
}
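A minimal sketch of the reciprocal example described above:
a <- 0
if(a != 0){
  print(1/a)
} else{
  print("No reciprocal for 0.")
}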
Using the murders data frame, the code below prints the state (if any) with a murder rate lower than 0.5; otherwise it prints that no state has a rate that low. The same code can be rerun with the threshold changed to 0.25.
# an example that tells us which states, if any, have a murder rate less than 0.5
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5){
  print(murders$state[ind])
} else{
  print("No state has murder rate that low")
}
The ifelse() function helps us replace NAs with some other value. Let's use the dslabs vector na_example, which mixes numeric values with NAs (145 of its entries are NA); example code is below.
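A sketch of the ifelse() replacement (na_example comes from dslabs; replacing each NA with 0 here is just an illustrative choice):
library(dslabs)
data(na_example)
sum(is.na(na_example))                              # count the NAs
no_nas <- ifelse(is.na(na_example), 0, na_example)  # replace each NA with 0
sum(is.na(no_nas))                                  # now 0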
any() returns TRUE if any entry of a logical vector is TRUE, while all() returns TRUE only if every entry is TRUE.
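For example, with a small logical vector:
z <- c(TRUE, TRUE, FALSE)
any(z)  # TRUE
all(z)  # FALSE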
We can use the dplyr group_by() function to create a special grouped data frame to
facilitate such summaries.
Review exercise:
Compute the sum of the integers 1 to 1,000 using the seq and sum functions.
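One way to do this:
sum(seq(1, 1000))  # 500500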