ProfessiR programming

The document provides an introduction to R programming, highlighting its advantages such as being free, open-source, and script-saving capabilities. It covers basic operations like installing and loading packages, defining variables, using functions, and understanding data types and structures. Additionally, it discusses data manipulation techniques using the dplyr package for data wrangling and analysis.

Uploaded by

amosnatalia37
Copyright
© © All Rights Reserved

Section 1: R Basics - Motivation and Getting Started

 R was developed by statisticians and data analysts as an interactive environment for data
analysis.
 Some of the advantages of R are that
(1) it is free and open source
(2) it has the capability to save scripts
(3) there are numerous resources for learning
(4) it is easy for developers to share software implementations.
 Expressions are evaluated in the R console when you type the expression into the console
and hit Return.
 A great advantage of R over point and click analysis software is that you can save your
work as scripts.
 “Base R” is what you get after you first install R. Additional components are available via
packages.
# installing the dslabs package
install.packages("dslabs")
# loading the dslabs package into the R session
library(dslabs)

 To install a package, you use the code install.packages("package_name", dependencies = TRUE).
 To load a package, you use the code library(package_name).
 If you also want to use a dataset from a loaded package, use the code data(dataset_name).
 To see the dataset, you can take the additional step of View(dataset_name).

1.1 INSTALLING PACKAGES

 The base version of R is quite minimal, but you can supplement its functions by installing
additional packages.
 We will be using tidyverse and dslabs packages for this course.
 Install packages from R console: install.packages("pkg_name")
 Install packages from RStudio interface: Tools > Install Packages (allows autocomplete)
 Once installed, we can use library(pkg_name) to load a package each time we want to
use it

install.packages("dslabs") # to install a single package
install.packages(c("tidyverse", "dslabs")) # to install two packages at the same time
installed.packages() # to see the list of all installed packages

1.2 RUNNING COMMANDS BY EDITING SCRIPTS


Keyboard shortcuts in RStudio on a Mac (on Windows and Linux, use Ctrl in place of Command):
Save script: Command + S
Run entire script: Command + Shift + Return
Run single line: Command + Return
Open new script: Command + Shift + N

library(tidyverse)
library(dslabs)
data(murders)

murders %>%
ggplot(aes(population, total, label=abb, color=region)) +
geom_label()

2. R BASICS

2.2 OBJECTS
 To define a variable, we use the assignment operator, <-.
 There are two ways to see the value stored in a variable:
(1) type the variable name into the console and hit Return,
(2) use the print() function by typing print(variable_name) and hitting Return.
 Objects are things that are stored in named containers in R. They can be variables,
functions, etc.
 The ls() function shows the names of the objects saved in your workspace.
# assigning values to variables
a <- 1
b <- 1
c <- -1
# solving the quadratic equation ax^2 + bx + c = 0
(-b + sqrt(b^2 - 4*a*c))/(2*a)
(-b - sqrt(b^2 - 4*a*c))/(2*a)
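
As a quick check of the two expressions above, the sketch below stores each root and substitutes it back into the equation; a value very close to zero suggests the root is correct. The names root_1, root_2, check_1 and check_2 are illustrative and not part of the original notes.

```r
# coefficients of x^2 + x - 1 = 0
a <- 1
b <- 1
c <- -1

# the two roots from the quadratic formula (illustrative names)
root_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
root_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)

# substituting a root back into the equation should give a value close to 0
check_1 <- a*root_1^2 + b*root_1 + c
check_2 <- a*root_2^2 + b*root_2 + c

print(c(root_1, root_2))
print(c(check_1, check_2))
```

Note that R still finds the function c() even though a variable named c exists, because a name followed by parentheses is looked up as a function.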

2.1 FUNCTIONS
 Examples of functions are sqrt() and log().
 To evaluate a function, we follow its name with parentheses, e.g. sqrt(4). Typing the name without parentheses shows the code of the function instead.
 Functions can be nested. Nested functions are evaluated from the inside out, e.g. log(exp(1)).
 Help files are user manuals for functions: help("log") or ?log. For operators, add quotes, e.g. help("+").
 Arguments are the inputs a function expects. args(log) shows them, and we can set them to our liking, e.g. log(8, base = 2).
 Some arguments have default values: the base argument of log() defaults to exp(1), making it the natural log.
 All available data sets can be listed by typing data().
 Some data and mathematical objects are prebuilt, e.g. the CO2 data set, pi and Inf (infinity).
 Creating and saving scripts simplifies the coding work.
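
The points about nesting, arguments and defaults can be tried out with a few lines. This is a minimal sketch using only base R; the variable names are illustrative.

```r
# nested calls are evaluated from the inside out: exp(1) first, then log()
nested <- log(exp(1))

# base defaults to exp(1), so log() is the natural log unless base is set
log_base_2 <- log(8, base = 2)

# arguments can be matched by name, in any order
same_by_name <- log(base = 2, x = 8)

# args() shows the arguments a function expects and their defaults
print(args(log))
print(c(nested, log_base_2, same_by_name))
```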
2.3: DATA TYPES

 In a data table, rows represent observations and columns represent variables.
 The class() function tells us the type of an object, e.g. class(a) returns "numeric" after a <- 1.
 Data can be numeric, categorical (character or factor) or logical.
 Data frames are the most common way of storing data sets in R; they are like tables with rows and columns.
 A data frame combines variables of different types into one object.
 To load the murders data frame: data(murders) or data("murders").
 To find out more about the structure of the object murders, we use the str() function.
 The head() function shows the first 6 lines of the data frame: head(murders).
Analysing the data
 For the analysis, we need to access the variables, which are represented by the columns. We use the accessor, the dollar sign $, e.g. murders$population.
 The names() function returns the names of the columns, e.g. names(murders).
 The result of the accessor $ preserves the order of the rows.
 There are multiple ways to access variables in a data frame. For example, we can use two square brackets [[ instead of the accessor $, e.g. abbs <- murders[["abb"]].
 Objects such as murders$population are not a single number but contain multiple entries; they are called vectors.
 The length() function tells us how many entries a vector has, e.g. length(pop).
 In character vectors, quotes are used to distinguish character strings from variable names: "a" is a character string, while a is a variable.
 An example of a character vector is the state name column.
 Logical vectors hold entries that must be either TRUE or FALSE.
 == is a relational operator that tests for equality and returns TRUE or FALSE, whereas = assigns a value.
 Factors are commonly confused with character vectors, e.g. the region column in our murders data frame.
 Factors store categorical data. In this data set the US is divided into only 4 regions, and each state is in one of the four.
 We can see the categories of a factor using the levels() function, e.g. levels(murders$region) returns "Northeast", "South", "North Central" and "West".
 Use the class() function to check the type of an object, to avoid confusing factors with characters.
# loading the dslabs package and the murders dataset
library(dslabs)
data(murders)

# determining that the murders dataset is of the "data frame" class


class(murders)
# finding out more about the structure of the object
str(murders)
# showing the first 6 lines of the dataset
head(murders)

# using the accessor operator to obtain the population column


murders$population
# displaying the variable names in the murders dataset
names(murders)
# determining how many entries are in a vector
pop <- murders$population
length(pop)
# vectors can be of class numeric and character
class(pop)
class(murders$state)

# logical vectors are either TRUE or FALSE


z <- 3 == 2
z
class(z)

# factors are another type of class


class(murders$region)
# obtaining the levels of a factor
levels(murders$region)
 The "c" in `c()` is short for "concatenate," which is the action of connecting items into a chain
 The `c()` function connects all the strings into a single vector, which we can assign to `x`
 The table function takes a vector and returns the frequency of each unique element:
x <- c("a", "a", "b", "b", "b", "c")
table(x)
r <- murders$region
table(r)

2.4: VECTORS

 A vector is the most basic unit for storing data in R.
 Complex data sets can usually be broken down into components that are vectors.
 The function c(), which stands for concatenate, is used to create vectors.
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt") # character vectors use quotes
 The quotes tell R that these are character strings, not variable names. We can also assign names to the entries to make the results easier to read. Each of the following gives the same named vector:
codes <- c(italy = 380, canada = 124, egypt = 818)
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
names(codes) <- country
 The seq() function is also used to create vectors.
 Subsetting: to access elements of a vector we use square brackets, [ ]. E.g. codes[2] returns the second element, canada 124.
 We can get more than one entry by using a multi-entry vector as an index:
codes[1:2] # to access a range, here elements 1 to 2
codes[c(1,3)] # to access specific elements
 If the entries have names, indexing by name works the same way as indexing by number:
codes["canada"]
codes[c("egypt","italy")]
 In seq(7, 49, 7), the first argument defines the start, the second defines the end, and the third defines the size of the jump.
 length.out is a seq() argument that generates a sequence of a specific length, with equal increments between the start and the end.
x <- seq(0, 100, length.out = 5) # returns 0, 25, 50, 75, 100
 Integer class: you can create an integer by adding the letter L after a whole number, e.g. 3L.
 The main difference is that integers take up less space in the computer's memory. So for large operations, using integers can have a substantial impact.
class(seq(1, 10)) is "integer"
class(seq(0, 100, length.out = 5)) is "numeric"
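
The subsetting and sequence rules above can be checked with a short sketch. It reuses the codes vector from the notes; the other variable names are illustrative.

```r
# a named numeric vector, as defined in the notes
codes <- c(italy = 380, canada = 124, egypt = 818)

# subsetting by position and by name
second  <- codes[2]                    # the canada entry
by_name <- codes[c("egypt", "italy")]  # entries come back in the order requested

# a sequence of fixed length between 0 and 100
x <- seq(0, 100, length.out = 5)

# classes: seq(1, 10) gives integers, length.out gives numerics, L marks an integer
class_int <- class(seq(1, 10))
class_num <- class(x)
class_L   <- class(3L)

print(second)
print(by_name)
print(x)
print(c(class_int, class_num, class_L))
```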

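2.4.1 VECTOR COERCION

 Coercion is an attempt by R to be flexible with data types.
 When an entry does not match the expected type, R tries to guess what we meant before throwing an error. For example, c(1, "canada", 3) gives a character vector.
 as.character() converts numbers into character strings, and as.numeric() converts them back.
 NA stands for missing data. It is common in coercion, because R returns NA (with a warning) when it cannot convert a value.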
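
A small sketch of coercion in both directions, using base R only. The vectors are made up for illustration, and suppressWarnings() is used here only to keep the output quiet; without it R prints a warning that NAs were introduced by coercion.

```r
# mixing numbers and strings: R coerces everything to character
mixed <- c(1, "canada", 3)
print(class(mixed))

# numbers to strings
as_text <- as.character(1:3)

# strings back to numbers: "b" cannot be converted, so it becomes NA
back <- suppressWarnings(as.numeric(c("1", "b", "3")))

print(as_text)
print(back)
```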

2.5 SORTING

 The sort() function sorts a vector in increasing order.
 For example, we can use it to arrange the total gun murders in the states from least to most: sort(murders$total). But it only shows the totals, not the states.
 The order() function returns the indices that sort its vector argument. We store them in an object, index:
index <- order(x)
x[index]

murders$state[1:10]
murders$abb[1:10]

index <- order(murders$total)


murders$abb[index] # order abbreviations by total murders

sort(x) and x[order(x)] produce the same result.

 There is a simpler way to find the extremes: the max() function returns the largest value and the min() function the smallest, while which.max() and which.min() return their positions.

max(murders$total) # highest number of total murders


i_max <- which.max(murders$total) # index with highest number of murders
murders$state[i_max] # state name with highest number of total murders

 Rank function:
x <- c(31, 4, 15, 92, 65)
x
rank(x) # returns ranks (smallest to largest)
 Original – the raw data
 Sort – the values arranged from smallest to largest
 Order – the indices needed to obtain the sorted data
 Rank – the position each original entry would take in the sorted vector

Original  Sort  Order  Rank
31        4     2      3
4         15    3      1
15        31    1      2
92        65    5      5
65        92    4      4
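
The table above can be reproduced directly, which is a useful way to see how the three functions relate. This is a minimal sketch; the name summary_table is illustrative.

```r
x <- c(31, 4, 15, 92, 65)

# one column per function, in the same layout as the table
summary_table <- data.frame(
  original = x,
  sort     = sort(x),
  order    = order(x),
  rank     = rank(x)
)
print(summary_table)

# indexing by order(x) gives the same result as sort(x)
print(identical(x[order(x)], sort(x)))
```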

2.6 LOGICAL OPERATORS TO INDEX VECTORS

- Suppose we want the states with murder rates lower than or equal to 0.71, the rate in Italy. Then we write:

murder_rate <- murders$total / murders$population * 100000
index <- murder_rate < 0.71 # or
index <- murder_rate <= 0.71
index
murders$state[index]

 To know how many entries are TRUE, we use the sum() function. TRUE and FALSE are
converted to the numbers 1 and 0.
sum(index) # the number of states with murder rates of at most 0.71
 We previously tried to calculate the average using mean(na_example) and got NA. This
is because the mean function returns NA if it finds at least one NA. As a consequence, a
common operation is to remove entries that are NA before applying functions like mean.
The operator ! can help us with this. The operator ! is logical negation: !TRUE becomes
FALSE and !FALSE becomes TRUE. In the previous exercise, we defined ind as the logical
vector that tells us which entries are NA, ind <- is.na(na_example). We'll use ind here again.
x <- c(1, 2, 3)
ind <- c(FALSE, TRUE, FALSE)
x[!ind]
# result is 1 and 3
# Calculate the average of `na_example` after removing the
# `NA` entries by using the `!` operator on `ind`
ind <- is.na(na_example)
mean(na_example[!ind])
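
The same idea on a small made-up vector, so the numbers can be checked by hand. The vector x here is a toy example, not na_example. The last line uses the na.rm argument of mean(), which is a standard base R alternative to indexing with !.

```r
# toy vector with two missing entries (not the na_example data)
x <- c(2, NA, 4, NA, 6)

ind <- is.na(x)                  # TRUE where an entry is missing
n_missing <- sum(ind)            # TRUE counts as 1, FALSE as 0

mean_with_na <- mean(x)          # NA, because at least one entry is NA
mean_clean   <- mean(x[!ind])    # keep only the entries that are not NA
mean_na_rm   <- mean(x, na.rm = TRUE)  # same result using the na.rm argument

print(n_missing)
print(c(mean_with_na, mean_clean, mean_na_rm))
```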
 We use the & operator to get a vector of logicals that tells us which states satisfy both conditions:
west <- murders$region == "West"
safe <- murder_rate <= 1
index <- safe & west
murders$state[index]
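
To see how & combines two conditions entry by entry, here is a sketch on made-up data. The state names, regions and rates below are invented for illustration and are not taken from the murders dataset.

```r
# made-up data: four fictional states
state  <- c("Alpha", "Beta", "Gamma", "Delta")
region <- c("West", "South", "West", "West")
rate   <- c(0.5, 0.9, 1.8, 1.0)

west <- region == "West"   # TRUE FALSE TRUE TRUE
safe <- rate <= 1          # TRUE TRUE FALSE TRUE

# & is TRUE only where both conditions are TRUE
index <- safe & west

print(state[index])   # names of the states meeting both conditions
print(sum(index))     # how many states meet both conditions
print(which(index))   # their positions in the vector
```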

2.7 VECTOR ARITHMETIC


library(dslabs)
data(murders)
# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)] # the state with the largest population (California)

max(murders$population) # the size of that population

It would not be fair to compare total murders in California with those in other states, because of
its large population, so we compute murders per capita using vector arithmetic.

Vector arithmetic occurs element-wise: each state's murder total is divided by that state's
population, and the result is multiplied by 100,000 to express the rate per 100,000 people.

# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]

# how to obtain the murder rate


murder_rate <- murders$total / murders$population * 100000

# ordering the states by murder rate, in decreasing order


murders$state[order(murder_rate, decreasing=TRUE)]
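
Element-wise arithmetic is easier to follow on a few hand-checkable numbers. The sketch below uses three fictional states with invented totals and populations; none of these values come from the murders dataset.

```r
# made-up data for three fictional states
state      <- c("Alpha", "Beta", "Gamma")
total      <- c(10, 200, 30)
population <- c(1000000, 40000000, 500000)

# each total is divided by the population in the same position,
# then scaled to a rate per 100,000 people
rate <- total / population * 100000

# the state with the most murders is not the one with the highest rate
most_total <- state[which.max(total)]
ranked     <- state[order(rate, decreasing = TRUE)]

print(rate)
print(most_total)
print(ranked)
```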

2.8 INDEXING

# Create an `ind` vector for states located in the Northeast and with rates of
# homicide lower than 1.
low <- murder_rate < 1
ind <- low & murders$region == "Northeast"

2.9 BASIC PLOTS

a) Scatter plots
population_in_millions <- murders$population/10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)

b) Histogram
murders$rate <- murders$total / murders$population * 100000 # add the rate column first
hist(murders$rate)

c) Box plots
boxplot(rate~region, data = murders)
DATA WRANGLING
This section covers R commands and techniques that help you wrangle and analyze data. In this
section, you will learn to:

 Wrangle data tables using functions in the dplyr package.

 Use summarize() to facilitate summarizing data in dplyr.

 Learn how to subset and summarize data using data.table.

 Learn how to sort data frames using data.table.

A: Basic Data Wrangling

For the purpose of manipulating data tables, we use the dplyr package. It introduces functions
that perform the most common manipulations and uses names for these functions that are
relatively easy to remember.

 To change a data table by adding a new column, or changing an existing one, we use
the mutate() function.

 To filter the data by subsetting rows, we use the function filter( ).

 To subset the data by selecting specific columns, we use the select( ) function.

 We can perform a series of operations by sending the results of one function to
another function using the pipe operator, %>%.
# installing and loading the dplyr package
install.packages("dplyr")
library(dplyr)

# adding a column with mutate


library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)

# subsetting with filter


filter(murders, rate <= 0.71)

# selecting columns with select


new_table <- select(murders, state, region, rate)

# using the pipe


murders %>% select(state, region, rate) %>% filter(rate <= 0.71)

filter – chooses rows

select – chooses columns


 filter with not equal
# Use `filter` with the != operator to create a new data
# frame without the southern region and call it `no_south`
no_south <- filter(murders, region != "South")

 filter with the %in% operator

# Create a new data frame called `murders_nw` with only
# northeastern and western states
murders_nw <- filter(murders, region %in% c("Northeast", "West"))

 filter given two conditions


# Add rate column
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))

# Create a data frame and call it `my_states` that satisfies both
# conditions: region is Northeast or West, and homicide rate is less than 1
my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
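
The sketch below tries the two-condition filter on a small made-up data frame, assuming the dplyr package is installed. The data frame toy and its values are invented for illustration. It also shows that passing the conditions as separate arguments to filter() combines them with AND, which gives the same rows as joining them with &.

```r
library(dplyr)

# made-up data frame with four fictional states
toy <- data.frame(
  state  = c("Alpha", "Beta", "Gamma", "Delta"),
  region = c("Northeast", "South", "West", "West"),
  rate   = c(0.8, 0.6, 0.9, 2.5)
)

# region is one of two values AND rate is below 1
with_and <- filter(toy, region %in% c("Northeast", "West") & rate < 1)

# separate arguments are also combined with AND
with_comma <- filter(toy, region %in% c("Northeast", "West"), rate < 1)

print(with_and)
```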

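 Using the pipe

The native pipe |> (like %>%) passes the result on its left to the function on its right as the
first argument, so the data frame argument of mutate() and select() no longer needs to be typed.
murders |>
  mutate(rate = total / population * 100000, rank = rank(-rate)) |>
  select(state, rate, rank)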

B: Creating Data Frames

 We can use the data.frame() function to create data frames.

 Formerly, the data.frame() function turned characters into factors by default. To
avoid this, we could utilize the stringsAsFactors argument and set it equal to FALSE. As
of R 4.0, it is no longer necessary to include the stringsAsFactors argument, because
R no longer turns characters into factors by default.

# creating a data frame with stringsAsFactors = FALSE


grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90),
stringsAsFactors = FALSE)

Why do we add stringsAsFactors = FALSE? In versions of R before 4.0, class(grades$names)
would otherwise return "factor"; setting stringsAsFactors = FALSE keeps the names as characters.
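
The effect of the argument can be seen by building the same data frame twice. This is a minimal sketch in base R; the object names as_chr and as_fct are illustrative.

```r
student <- c("John", "Juan", "Jean", "Yao")

# names kept as character strings
as_chr <- data.frame(names = student,
                     exam_1 = c(95, 80, 90, 85),
                     stringsAsFactors = FALSE)

# names converted to a factor (the default before R 4.0)
as_fct <- data.frame(names = student,
                     exam_1 = c(95, 80, 90, 85),
                     stringsAsFactors = TRUE)

print(class(as_chr$names))
print(class(as_fct$names))
print(levels(as_fct$names))
```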
C: Summarizing with dplyr

The Summarize Function

- Some summary statistics are the mean, median, and standard deviation.
- The summarize() function from dplyr provides an easy way to compute summary
statistics.

library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)

# minimum, median, and maximum murder rate for the states in the West region
s <- murders %>%
filter(region == "West") %>%
summarize(minimum = min(rate),
median = median(rate),
maximum = max(rate))

# accessing the components with the accessor $


s$median
s$maximum

# average rate unadjusted by population size


mean(murders$rate)

# average rate adjusted by population size


us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5)
us_murder_rate

Summarizing with more than one value

library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)

# minimum, median, and maximum murder rate for the states in the West region using quantile
# note that this returns a vector
murders %>%
filter(region == "West") %>%
summarize(range = quantile(rate, c(0, 0.5, 1)))

# returning minimum, median, and maximum as a data frame


my_quantile <- function(x){
r <- quantile(x, c(0, 0.5, 1))
data.frame(minimum = r[1], median = r[2], maximum = r[3])
}
murders %>%
filter(region == "West") %>%
summarize(my_quantile(rate))
Pull to access columns

Like most dplyr functions, summarize() always returns a data frame. This might be
problematic if you want to use the result with functions that require a numeric value.

To get numeric values or vectors, we can use the accessor, the dollar sign, or the
dplyr pull() function.

Here's an example of how we could use the pull function.

# the pull function can return it as a numeric value


us_murder_rate %>% pull(rate)

If we want to save the number directly with just one line of code, we can do the
whole operation like this.

# using pull to save the number directly


us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5) %>%
pull(rate)
us_murder_rate

# us_murder_rate is now stored as a number


class(us_murder_rate)

Using the class function like this, we see that the us_murder_rate object is now numeric.

The Dot Placeholder

Although not needed for the functions presented here, it is often convenient to have
a way to access the data frame being piped.

For this, we use the dot.

Here is an example that imitates the pull function, but using the dot instead.


# using the dot to access the rate
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^5) %>%
.$rate
us_murder_rate

Think of the dot as a placeholder for the data that is being passed through the pipe.
Because the data object is a data frame, we can access its columns with the dollar
sign, the accessor.
Group then summarize

A common operation in data exploration is to first split data into groups and then
compute summaries for each group.

For example, we may want to compute the median murder rate in each region of the
United States.

The group_by() function helps us do this. If we type murders %>% group_by(region), the result
does not look very different from the original murders data table, except that we see a line
such as "Groups: region [4]" when we print the object.

Although not immediately obvious from its appearance, this is now a special data frame
called a grouped data frame.

dplyr functions, in particular summarize(), behave differently when acting on this
object. Conceptually, you can think of this table as many tables with the same columns
but not necessarily the same number of rows stacked together in one object.

Now see what happens when we summarize the data after grouping. If we apply summarize()
to the grouped data frame, the summary is computed for each group separately.

group_by() followed by summarize() is one of the most used operations in data analysis.

library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)

# group by region
murders %>% group_by(region)

# summarize after grouping


murders %>%
group_by(region) %>%
summarize(median = median(rate))
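
To make the per-group behaviour concrete, here is a sketch on a tiny made-up table, assuming the dplyr package is installed. The data frame toy and its values are invented so the medians can be checked by hand; n() is the dplyr helper that counts the rows in each group.

```r
library(dplyr)

# made-up data: two regions with a different number of rows each
toy <- data.frame(
  region = c("West", "West", "South", "South", "South"),
  rate   = c(1, 3, 2, 4, 9)
)

# one summary row per group instead of one row for the whole table
by_region <- toy %>%
  group_by(region) %>%
  summarize(median = median(rate), n = n())

print(by_region)
```

Here the South group has three rows (median 4) and the West group has two (median 2), which illustrates that grouped tables need not have the same number of rows per group.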

Sorting Data Tables

 To order an entire table, we can use the dplyr function arrange().

 We can also use nested sorting to order by additional columns.

 The function head() returns only the first few lines of a table.

 The function top_n() returns the top n rows of a table.


library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)

To see the states ordered by population size, we can type this.
# order the states by population size
murders %>% arrange(population) %>% head()

Note that the default behavior is to order in ascending order. In dplyr, the function
desc() transforms a vector so that it is sorted in descending order.

# order the states by murder rate - the default is ascending order
murders %>% arrange(rate) %>% head()

# order the states by murder rate in descending order


murders %>% arrange(desc(rate)) %>% head()

If we are ordering by a column with ties, we can use a second column to break the ties.
Similarly, a third column can be used to break ties between first and second columns, and so
on.

# order the states by region and then by murder rate within region
murders %>% arrange(region, rate) %>% head()

# return the top 10 states by murder rate


murders %>% top_n(10, rate)

# return the top 10 states ranked by murder rate, sorted by murder rate
murders %>% arrange(desc(rate)) %>% top_n(10, rate)

To find the individual with the minimum height in the heights dataset and look up their sex,
we index as follows:

min_heights <- min(heights$height)

index_min_height <- match(min_heights, heights$height)

individual_sex <- heights$sex[index_min_height]

individual_sex
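
The lookup above can be traced on a small made-up vector. The height and sex values below are invented and are not from the heights dataset. The sketch also shows that match() returns only the first position when the minimum occurs more than once, which is the same position which.min() gives.

```r
# made-up data: five fictional individuals, with a tie for the minimum
height <- c(70, 62, 50, 68, 50)
sex    <- c("Male", "Female", "Male", "Female", "Female")

# position of the first entry equal to the minimum
i_match <- match(min(height), height)
i_which <- which.min(height)

# all positions equal to the minimum
all_min <- which(height == min(height))

print(c(i_match, i_which))
print(sex[i_match])
print(all_min)
```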
Data table

A: Introduction to Data table

The data.table package is used to analyse large data sets faster and more easily, with a focus
on analytical and statistical work.

It is a separate package, so we install it and then load it with library(data.table). We will show
the data.table approach to the dplyr operations mutate, filter, select, group_by and summarize.

The first step when using data.table is to convert the data frame into a data.table object using
the setDT() function.

data.table uses an approach that avoids a new assignment, called update by reference, written
with the := operator. This can help with large data sets that take up most of your computer's memory.

Also, .() tells data.table that the things inside the parentheses are column names, not objects
in the R environment.

# load packages and datasets and prepare the data


library(tidyverse)
library(dplyr)
library(data.table)
library(dslabs)
data(murders)
murders <- setDT(murders)
murders[, rate := total / population * 100000]

 In this course, we often use tidyverse packages to illustrate because these packages
tend to have code that is very readable for beginners.

 There are other approaches to wrangling and analyzing data in R that are faster and
better at handling large objects, such as the data.table package.

 Selecting in data.table uses notation similar to that used with matrices.

 To add a column in data.table, you can use the := function.

 Because the data.table package is designed to avoid wasting memory, when you
make a copy of a table, it does not create a new object. The := function changes by
reference. If you want to make an actual copy, you need to use the copy() function.

# install the data.table package before you use it!


install.packages("data.table")

# load data.table package


library(data.table)

# load other packages and datasets


library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
# convert the data frame into a data.table object
murders <- setDT(murders)

# selecting in dplyr
select(murders, state, region)

# selecting in data.table - 2 methods


murders[, c("state", "region")] |> head()
murders[, .(state, region)] |> head()

# adding or changing a column in dplyr


murders <- mutate(murders, rate = total / population * 10^5)

# adding or changing a column in data.table


murders[, rate := total / population * 100000]
head(murders)
murders[, `:=`(rate = total / population * 100000, rank = rank(population))] # for multiple columns

# y is referring to x and := changes by reference


x <- data.table(a = 1)
y <- x

x[,a := 2]
y

y[,a := 1]
x

# use copy to make an actual copy


x <- data.table(a = 1)
y <- copy(x)
x[,a := 2]
y

B: Subsetting data table

With data table, we again use an approach similar to sub-setting with matrices, except
data table knows that rate refers to a column name and not to an object in the R
environment.

Look at this simple line of code that achieves the same thing.

murders[rate <= 0.7, .(state, rate)]


Notice that we can combine the filter and select into one succinct command.

The results will be the state names and rates for those states that have rates below 0.7,
just this very simple command.

# load packages and prepare the data


library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
library(data.table)
murders <- setDT(murders)
murders <- mutate(murders, rate = total / population * 10^5)
murders[, rate := total / population * 100000]

# subsetting in dplyr
filter(murders, rate <= 0.7)

# subsetting in data.table
murders[rate <= 0.7]

# combining filter and select in data.table


murders[rate <= 0.7, .(state, rate)]

# combining filter and select in dplyr


murders %>% filter(rate <= 0.7) %>% select(state, rate)

B: Summarizing with data table

The lines of code below show how simple summarizing is with data.table
compared to dplyr.

 In data.table we can call functions inside .() and they will be applied to the columns, over the selected rows.

 The group_by followed by summarize in dplyr is performed in one line
in data.table using the by argument.

# load packages and prepare the data - heights dataset
library(tidyverse)
library(dplyr)
library(data.table)
library(dslabs)
data(heights)
heights <- setDT(heights)

# summarizing in dplyr
s <- heights %>%
summarize(average = mean(height), standard_deviation = sd(height))

# summarizing in data.table
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]

# subsetting and then summarizing in dplyr


s <- heights %>%
filter(sex == "Female") %>%
summarize(average = mean(height), standard_deviation = sd(height))

# subsetting and then summarizing in data.table


s <- heights[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]

# previously defined function


median_min_max <- function(x){
qs <- quantile(x, c(0.5, 0, 1))
data.frame(median = qs[1], minimum = qs[2], maximum = qs[3])
}

# multiple summaries in data.table


heights[, .(median_min_max(height))]

# grouping then summarizing in data.table


heights[, .(average = mean(height), standard_deviation = sd(height)), by = sex]

C: Sorting Data Tables

The data table package also makes it easier to order rows based on the values of a
column.
Here's some code that orders the murders data set based on the population size of
states. We simply type the following.
# order by population
murders[order(population)] |> head()

To sort the table in descending order, we can order by the negative of population, or use
the decreasing argument, like this.
# order by population in descending order
murders[order(population, decreasing = TRUE)]

Similarly, we can perform nested ordering by including more than one variable in the
function order.
So, for example, if we want to order by region, and then within region by rate,
we can simply type this code
# order by region and then murder rate
murders[order(region, rate)]

D: Tibbles

- A tbl (pronounced "tibble") is a special kind of data frame.


- Tibbles are the default data frame in the tidyverse.
- Tibbles display better than regular data frames.
- Subsets of tibbles are tibbles, which is useful because tidyverse functions require
data frames as inputs.
- Tibbles will warn you if you try to access a column that doesn't exist.
- Entries in tibbles can be complex - they can be lists or functions.
- The function group_by() returns a grouped tibble, which is a special kind of tibble.

# view the dataset


murders %>% group_by(region)

# see the class


murders %>% group_by(region) %>% class()

# compare the print output of a regular data frame to a tibble


gapminder
as_tibble(gapminder)

# compare subsetting a regular data frame and a tibble


class(murders[,1])
class(as_tibble(murders)[,1])

# access a column vector not as a tibble using $


class(as_tibble(murders)$state)

# compare what happens when accessing a column that doesn't exist in a regular data frame to in a tibble
murders$State
as_tibble(murders)$State

# create a tibble
tibble(id = c(1, 2, 3), func = c(mean, median, sd))

PROGRAMMING BASICS

Section 3 introduces general programming features that help you perform exploratory data analysis,
build data analysis pipelines, and prepare data visualizations to communicate the results. You will learn to:

 Use basic conditional expressions to perform different operations.

 Check if any or all elements of a logical vector are TRUE.

 Define and call functions to perform various operations.

 Pass arguments to functions, and return variables/objects from functions.

 Use for-loops to perform repeated operations.

A: CONDITIONALS
- The most basic feature of programming is the if-else statement.
- The basic idea of the first example is to print the reciprocal of a unless a is 0.

 if-else statements

#an example showing the general structure of an if-else statement


a <- 2
if(a!=0){
print(1/a)
} else{
print("No reciprocal for 0.")
}
- The general form of an if-else statement:

if(boolean condition){
expressions
} else{
alternative expressions
}
Using the murders data frame, the first example prints the state with the lowest murder rate if
that rate is below 0.5, and otherwise prints a message saying that no state has a rate that low.
The second example lowers the threshold to 0.25.

# an example that tells us which states, if any, have a murder rate less than 0.5
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5){
print(murders$state[ind])
} else{
print("No state has a murder rate that low.")
}

# changing the condition to < 0.25 changes the result


if(murder_rate[ind] < 0.25){
print(murders$state[ind])
} else{
print("No state has a murder rate that low.")
}

 Ifelse conditions (one word)

# the ifelse() function works similarly to an if-else conditional


a <- 0
ifelse(a > 0, 1/a, NA)

- This function is particularly useful because it works on vectors: it examines each
element of the logical vector and returns the corresponding answer.

# the ifelse() function is particularly useful on vectors


a <- c(0,1,2,-4,5)
result <- ifelse(a > 0, 1/a, NA)

The ifelse() function also helps us replace NAs with some other value. As an example, we use
na_example from the dslabs package, a vector of 1,000 numbers of which 145 are NA:

#the ifelse() function is also helpful for replacing missing values


data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
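
The two uses of ifelse() can be checked by hand on short vectors. In this sketch, a is the vector from the notes and x is a made-up vector standing in for na_example.

```r
# element-wise: 1/a where a is positive, NA elsewhere
a <- c(0, 1, 2, -4, 5)
result <- ifelse(a > 0, 1/a, NA)
print(result)

# replacing missing values with 0 in a toy vector (not na_example)
x <- c(2, NA, 4, NA)
no_nas <- ifelse(is.na(x), 0, x)
print(no_nas)

# no NAs remain after the replacement
remaining <- sum(is.na(no_nas))
print(remaining)
```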
 The any() and all() functions

- The any() function takes a vector of logicals and returns TRUE if any of the entries is TRUE.
- The all() function takes a vector of logicals and returns TRUE only if all of the entries are TRUE.
Example:

# the any() and all() functions evaluate logical vectors


z <- c(TRUE, TRUE, FALSE)
any(z)
all(z)
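
Because any() and all() reduce a logical vector to a single TRUE or FALSE, they fit naturally inside an if-else statement. The sketch below assumes a made-up rate vector; the values and the variable msg are illustrative only.

```r
z <- c(TRUE, TRUE, FALSE)
any_z <- any(z)   # at least one entry is TRUE
all_z <- all(z)   # not every entry is TRUE

# made-up rates, used to build a single condition for if-else
rate <- c(0.9, 0.4, 2.1)
if (any(rate < 0.5)) {
  msg <- "At least one rate is below 0.5"
} else {
  msg <- "No rate is below 0.5"
}
all_positive <- all(rate > 0)

print(c(any_z, all_z, all_positive))
print(msg)
```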

- The nchar() function returns the number of characters in a character string, e.g.
nchar("Leo")
nchar("Pedro")
char_len <- nchar(murders$state)
head(char_len)
 Splitting data into groups and then computing summaries for each group is a
common operation in data exploration.

 We can use the dplyr group_by() function to create a special grouped data frame to
facilitate such summaries.

Review am

Compute the sum of the integers 1 to 1,000 using the seq and sum functions.

# Calculate the sum of integers from 1 to 1000 using


# the `seq` and `sum` functions.
n <- seq(1,1000)
x <- c(n)
x
sum(x)
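
One way to check the result is to compare it with the closed-form formula for the sum of the first n integers, n(n + 1)/2. This is a minimal sketch; the names total and formula_total are illustrative.

```r
n <- 1000

# sum computed with seq() and sum(), as in the exercise
x <- seq(1, n)
total <- sum(x)

# closed-form value n(n + 1)/2 for comparison
formula_total <- n * (n + 1) / 2

print(total)
print(total == formula_total)
```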

ind <- is.na(na_example)
mean(na_example) # returns NA because the vector contains NAs
mean(na_example[!ind]) # mean after removing the NA entries
