R programming
R was developed by statisticians and data analysts as an interactive environment for data
analysis.
Some of the advantages of R are that
(1) it is free and open source
(2) it has the capability to save scripts
(3) there are numerous resources for learning
(4) it is easy for developers to share software implementations.
Expressions are evaluated in the R console when you type the expression into the console
and hit Return.
A great advantage of R over point and click analysis software is that you can save your
work as scripts.
“Base R” is what you get after you first install R. Additional components are available via
packages.
# installing the dslabs package
install.packages("dslabs")
# loading the dslabs package into the R session
library(dslabs)
The base version of R is quite minimal, but you can supplement its functions by installing
additional packages.
We will be using tidyverse and dslabs packages for this course.
Install packages from R console: install.packages("pkg_name")
Install packages from RStudio interface: Tools > Install Packages (allows autocomplete)
Once installed, we can use library(pkg_name) to load a package each time we want to
use it
library(tidyverse)
library(dslabs)
data(murders)
murders %>%
  ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
2. R BASICS
2.2 OBJECTS
Typing ls() shows all the variables saved in the workspace.
To define a variable, we may use the assignment symbol, <-.
There are two ways to see the value stored in a variable:
(1) type the variable name into the console and hit Return,
(2) use the print() function by typing print(variable_name) and hitting Return.
Objects are things that are stored in named containers in R. They can be variables,
functions, etc.
The ls() function shows the names of the objects saved in your workspace.
# assigning values to variables
a <- 1
b <- 1
c <- -1
# solving the quadratic equation
(-b + sqrt(b^2 - 4*a*c))/(2*a)
(-b - sqrt(b^2 - 4*a*c))/(2*a)
2.1 FUNCTIONS
Examples of functions: sqrt(), log(), etc.
To evaluate most functions, parentheses are used, but not always (operators such as + are functions too).
Functions can be nested; nested functions are evaluated from the inside out, e.g., log(exp(1)).
Help files are the user manuals for functions: help("log") or ?log. For operators, add quotes, e.g., ?"+".
Arguments are the inputs a function expects: args(log) lists them, and they can be changed to your liking, e.g., log(8, base = 2).
Example: the base argument of log() defaults to base = exp(1), making it the natural log; log() takes x and base.
All available data sets can be seen by typing data().
There are prebuilt data and mathematical objects, e.g., the CO2 data set, pi, and Inf (infinity).
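A short sketch of these points in code (base R only):
# arguments and defaults
args(log)          # shows the arguments of log(): x and base
log(8, base = 2)   # change the default base; returns 3
log(exp(1))        # nested call, evaluated from the inside out; returns 1
# help files
help("log")        # same as ?log
# prebuilt data sets and mathematical objects
data()             # list the available data sets
pi                 # 3.141593...
Inf                # infinity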
Creating and saving scripts simplifies the coding work
2.3: DATA TYPES
2.4: VECTORS
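The snippets below assume a named vector codes has been defined. A minimal sketch consistent with the course's country-code example, extended to five entries (the last two are merely illustrative) so that codes[1:5] works:
codes <- c(italy = 380, canada = 124, egypt = 818, usa = 1, france = 33)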
We can get more than one entry by using a multi-entry vector as an index:
codes[1:5] # to access a range of entries, 1 through 5
codes[c(1, 3)] # to access specific entries
Subsetting by name gives the same result as subsetting by position:
codes["canada"]
codes[c("egypt","italy")]
In seq(7, 49, 7), the first argument defines the start, the second defines the end, and the third defines the size of the jump.
length.out is a seq() argument that lets us generate a sequence of evenly spaced values with a specific length.
For example, x <- seq(0, 100, length.out = 5) generates the five numbers 0, 25, 50, 75, 100.
Integer class: you can create an integer by adding the letter L after a number, e.g., 3L.
The main difference is that integers take up less space in the computer's memory, so for large operations using integers can have a substantial impact.
The class of seq(1, 10) is integer.
The class of x <- seq(0, 100, length.out = 5) is numeric.
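A quick check of these classes:
class(3L)                           # "integer"
class(seq(1, 10))                   # "integer"
class(seq(0, 100, length.out = 5))  # "numeric"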
2.5 SORTING
murders$state[1:10]
murders$abb[1:10]
There is a simpler way to find the extreme values: the max() function returns the largest value and min() the smallest.
Rank function:
x <- c(31, 4, 15, 92, 65)
x
rank(x) # returns ranks (smallest to largest)
Original – the raw data
Sort – arranged from smallest to largest
Order – the indices needed to obtain the sorted data
Rank – the rank of each entry of the original vector (smallest = 1)

Original  Sort  Order  Rank
31        4     2      3
4         15    3      1
15        31    1      2
92        65    5      5
65        92    4      4
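Applying the three functions to the same vector x reproduces the columns of the table:
x <- c(31, 4, 15, 92, 65)
sort(x)   # 4 15 31 65 92
order(x)  # 2 3 1 5 4  (indices that sort x)
rank(x)   # 3 1 2 5 4  (rank of each original entry)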
- Suppose we want the states with murder rates lower than 0.71 (Italy's rate). Then:
murder_rate <- murders$total / murders$population * 100000
index <- murder_rate < 0.71
# or, to include 0.71 itself:
index <- murder_rate <= 0.71
index
murders$state[index]
To count how many entries are TRUE, the sum() function is used; logicals are coerced to numeric, with TRUE as 1 and FALSE as 0.
sum(index) # the number of states with murder rates below 0.71
We previously tried to calculate the average using mean(na_example) and got NA. This is because the mean() function returns NA if it finds even one NA. As a consequence, a common operation is to remove the NA entries before applying functions like mean(). The operator ! is logical negation: !TRUE becomes FALSE and !FALSE becomes TRUE. In the previous exercise, we defined ind as the logical vector that tells us which entries are NA. We'll use ind here again.
x <- c(1, 2, 3)
ind <- c(FALSE, TRUE, FALSE)
x[!ind]
# result is 1 and 3
# calculate the average of `na_example` after removing the NA entries
# using the `!` operator on `ind`
ind <- is.na(na_example)
mean(na_example[!ind])
We use & (logical "and") to get a vector of logicals that satisfies both conditions:
west <- murders$region == "West"
safe <- murder_rate <= 1
index <- safe & west
murders$state[index]
It does not make sense to compare total murders in California to those in other states, given its large population, so instead we compute murders per capita using vector arithmetic.
Vector arithmetic occurs element-wise: each state's murder total is divided by its own population, then multiplied by 100,000 to express the rate per 100,000 people.
# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]
2.8 INDEXING
# create an `ind` vector for states located in the Northeast with murder rates lower than 1
ind <- murders$region == "Northeast" & murder_rate < 1
a) Scatter plots
population_in_millions <- murders$population / 10^6
total_gun_murders <- murders$total
plot(population_in_millions, total_gun_murders)
b) Histogram
hist(murders$rate)
c) Box plots
boxplot(rate~region, data = murders)
DATA WRANGLING
This section covers R commands and techniques that help you wrangle and analyze data.
To manipulate data tables, we use the dplyr package. It introduces functions that perform the most common manipulations and gives these functions names that are relatively easy to remember.
To change a data table by adding a new column, or changing an existing one, we use the mutate() function.
To subset the data by selecting specific columns, we use the select() function.
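A minimal sketch of both functions on the murders data (the rate column name follows the later examples in these notes):
library(dslabs)
library(dplyr)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)  # add a new column
new_table <- select(murders, state, region, rate)             # keep selected columns
head(new_table)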
- Some summary statistics are the mean, median, and standard deviation.
- The summarize() function from dplyr provides an easy way to compute summary
statistics.
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)
# minimum, median, and maximum murder rate for the states in the West region
s <- murders %>%
  filter(region == "West") %>%
  summarize(minimum = min(rate),
            median = median(rate),
            maximum = max(rate))
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)
# minimum, median, and maximum murder rate for the states in the West region using quantile
# note that this returns a vector
murders %>%
  filter(region == "West") %>%
  summarize(range = quantile(rate, c(0, 0.5, 1)))
Like most dplyr functions, summarize always returns a data frame. This might be problematic if you want to use the result with functions that require a numeric value.
To get numeric values or vectors, we can use the accessor (the dollar sign) or the dplyr pull() function.
If we want to save the number directly with just one line of code, we can do the whole operation in a single pipe, as sketched below. The resulting US murder rate object is then numeric, which we can check with the class() function.
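A sketch of both steps, following the US murder rate example mentioned above (the exact computation, summing totals and populations across states, is assumed):
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^5) %>%
  pull(rate)
class(us_murder_rate)  # "numeric"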
Although not needed for the functions presented here, it is often convenient to have a way to access the data frame being piped.
Think of the dot as a placeholder for the data being passed through the pipe.
Because that object is a data frame, we can access its columns with the dollar sign, the accessor.
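A sketch of the dot placeholder combined with the dollar-sign accessor (using the same assumed US murder rate computation as above):
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 10^5) %>%
  .$rate
class(us_murder_rate)  # "numeric"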
Group then summarize
A common operation in data exploration is to first split the data into groups and then compute summaries for each group.
For example, we may want to compute the median murder rate in each region of the
United States.
The group_by() function helps us do this. If we type murders %>% group_by(region), the result does not look very different from the original murders data table, except that when we print the object we see the grouping information: region, followed by [4], the number of groups.
Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame.
And dplyr functions, in particular summarize, will behave differently when acting on this
object. Conceptually, you can think of this table as many tables with the same columns
but not necessarily the same number of rows stacked together in one object.
Now see what happens when we summarize the data after grouping: the summarize function applies the summarization to each group separately.
Group by followed by summarize is one of the most used operations in data analysis.
library(tidyverse)
library(dplyr)
library(dslabs)
data(murders)
murders <- mutate(murders, rate = total / population * 10^5)
# group by region
murders %>% group_by(region)
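For example, the median murder rate in each region, which is the group_by-then-summarize pattern described above:
murders %>%
  group_by(region) %>%
  summarize(median_rate = median(rate))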
To order an entire table by a chosen column, we use the arrange() function from dplyr. For example, to see the states ordered by population size:
# order the states by population size
murders %>% arrange(population) %>% head()
Note that the default behavior is to order in ascending order; in dplyr, the desc() function transforms a vector so that arrange() sorts it in descending order.
If we are ordering by a column with ties, we can use a second column to break the ties.
Similarly, a third column can be used to break ties between first and second columns, and so
on.
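A sketch of descending order and of breaking ties with a second column:
# order states by murder rate, largest first
murders %>% arrange(desc(rate)) %>% head()
# order by region first, then by rate within each region
murders %>% arrange(region, rate) %>% head()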
To find the sex of the shortest individual in the heights data, we index the sex column by the position of the minimum height, e.g., heights$sex[which.min(heights$height)].
Data table
The data.table package is designed to analyze large data sets faster and more efficiently, with a focus on analytical and statistical work.
It is a separate package, so we install it and then load it with library(data.table). We will use the data.table equivalents of the dplyr verbs mutate, filter, select, group_by, and summarize.
The first step in using data.table is to convert the data frame into a data.table object using the setDT() function.
data.table uses an approach that avoids making a new copy when assigning, called update by reference (the := operator). This can help with large data sets that take up most of your computer's memory.
Also, .() tells data.table that the things inside the parentheses are column names, not variables in the calling environment.
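A minimal sketch of these steps (the rate computation mirrors the earlier dplyr examples):
library(dslabs)
library(data.table)
data(murders)
setDT(murders)                                # convert to a data.table by reference
murders[, rate := total / population * 10^5]  # mutate equivalent: update by reference
head(murders[, .(state, rate)])               # .() marks state and rate as column names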
In this course, we often use tidyverse packages to illustrate because these packages
tend to have code that is very readable for beginners.
There are other approaches to wrangling and analyzing data in R that are faster and
better at handling large objects, such as the data.table package.
Because the data.table package is designed to avoid wasting memory, assigning a table to a new name does not create a new copy of the object. The := operator changes the table by reference. If you want to make an actual copy, you need to use the copy() function.
# selecting in dplyr
select(murders, state, region)
# selecting in data.table
murders[, .(state, region)]
# x and y refer to the same data.table, so := modifies both (update by reference)
x <- data.table(a = 1)
y <- x
x[, a := 2]
y  # a is now 2
y[, a := 1]
x  # a is now 1
With data.table, we again use an approach similar to subsetting matrices, except data.table knows that rate refers to a column name and not to an object in the R environment.
Look at this simple line of code, which returns the rows for states with rates below 0.7. Combining it with column selection (shown after the snippet below) gives just the state names and rates.
# subsetting in dplyr
filter(murders, rate <= 0.7)
# subsetting in data.table
murders[rate <= 0.7]
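The filtering and the column selection can also be combined in a single data.table call, which gives exactly the state names and rates described above:
# filter rows and select columns in one data.table expression
murders[rate <= 0.7, .(state, rate)]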
The lines of code below show how simple summarizing is with data.table compared to dplyr.
In data.table we can call functions inside .() and they will be applied to the columns.
# summarizing in dplyr
s <- heights %>%
  summarize(average = mean(height), standard_deviation = sd(height))
# summarizing in data.table
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]
The data table package also makes it easier to order rows based on the values of a
column.
Here's some code that orders the murders data set based on the population size of
states. We simply type the following.
# order by population
murders[order(population)] |> head()
Similarly, we can perform nested ordering by including more than one variable in order(). For example, to order by region and then, within region, by murder rate, we simply type this code:
# order by region and then murder rate
murders[order(region, rate)]
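To sort in descending order instead, order() accepts a decreasing argument:
# order by population, largest first
murders[order(population, decreasing = TRUE)] |> head()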
D: Tibbles
# compare what happens when accessing a column that doesn't exist
# in a regular data frame versus in a tibble
murders$State             # a data frame returns NULL silently (the column is "state")
as_tibble(murders)$State  # a tibble returns NULL but also prints a warning
# create a tibble; tibbles can hold complex entries such as functions
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
PROGRAMMING BASICS
Section 3 introduces general programming features used to perform exploratory data analysis, build data analysis pipelines, and prepare visualizations that communicate the results.
A: CONDITIONALS
- The most basic feature of programming is the if-else statement.
- The basic idea of the example below is to print the reciprocal of a unless a is 0.
If...else conditions have this general form:
if(boolean condition){
  expressions
} else{
  alternative expressions
}
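A minimal sketch of the reciprocal example described above:
a <- 0
if(a != 0){
  print(1/a)
} else{
  print("No reciprocal for 0.")
}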
Using the murders data frame, the code below prints the state (if any) with a murder rate lower than 0.5; otherwise it prints that no state has a rate that low. The same code can be rerun with the threshold changed to 0.25.
# an example that tells us which states, if any, have a murder rate less than 0.5
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5){
  print(murders$state[ind])
} else{
  print("No state has murder rate that low")
}
The ifelse() function helps us replace NAs with some other value. Let's use the dslabs vector na_example, which mixes numeric values with NAs (145 of its entries are NA); example code is below.
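A sketch of the ifelse() replacement (na_example comes from dslabs; replacing each NA with 0 here is just an illustrative choice):
library(dslabs)
data(na_example)
sum(is.na(na_example))                              # count the NAs
no_nas <- ifelse(is.na(na_example), 0, na_example)  # replace each NA with 0
sum(is.na(no_nas))                                  # now 0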
any() returns TRUE if any entry of a logical vector is TRUE, while all() returns TRUE only if every entry is TRUE.
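For example, with a small logical vector:
z <- c(TRUE, TRUE, FALSE)
any(z)  # TRUE
all(z)  # FALSE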
We can use the dplyr group_by() function to create a special grouped data frame to
facilitate such summaries.
Review exercise:
Compute the sum of the integers 1 to 1,000 using the seq and sum functions.
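One way to do this:
sum(seq(1, 1000))  # 500500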