R Tutorial
2. Relational Operators
Relational operators evaluate the relationships between two values, and they return logical values
(TRUE/FALSE).
• > greater than
• >= greater than or equal to
• < less than
• <= less than or equal to
• == equal to (== is different from =)
• != not equal to
1 > 0
## [1] TRUE
1 == 2
## [1] FALSE
"HELLO" == "hello" # R is case sensitive
## [1] FALSE
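The remaining operators in the list behave the same way. The following is a minimal sketch with a made-up value (the name `a` is only for illustration); it also shows how a single `=` (assignment) differs from a double `==` (comparison).

```r
a = 3    # a single = assigns the value 3 to a (same effect as a <- 3)
a == 3   # a double == compares a with 3
## [1] TRUE
a != 3   # not equal to
## [1] FALSE
a >= 3   # greater than or equal to
## [1] TRUE
a <= 2   # less than or equal to
## [1] FALSE
```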
• When applied to a vector, the operators evaluate each element of the vector.
• We can use square brackets [ ] to index the values in a vector: inside [ ], we supply a logical vector of
the same length, with one logical value for each element.
• The elements whose indexing value is TRUE are extracted.
x <- c(1,2,5,7)
x
## [1] 1 2 5 7
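x>5 # a vector of logical values: TRUE where the element is greater than 5
## [1] FALSE FALSE FALSE  TRUE
x[x>5] # extract the elements that are greater than 5
## [1] 7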
x[c(TRUE,FALSE,FALSE,TRUE)] # extract the first and the last elements
## [1] 1 7
y <- c(11,0,5,-2)
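x==max(x) # a vector of logical values: only the element where x is maximized is TRUE
## [1] FALSE FALSE FALSE  TRUE
y[x==max(x)] # extract the element of y at the position where x is maximized
## [1] -2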
y[c(FALSE,FALSE,FALSE,TRUE)] # alternative
## [1] -2
mean(resume$call[resume$sex == "female"]) # callback rate for females
## [1] 0.08248799
Decomposition:
• resume$sex
– use the $ operator to access an individual variable in a data frame
– obtain individual variable “sex” in the “resume” data frame, which is a vector
• vec1 <- resume$sex == "female"
– “==” is the relational operator “equal to”
– “==” evaluates each element of the “sex” column to see if it is equal to “female”
– if the element is equal to “female”, return a logical value of “TRUE”; otherwise “FALSE”
– vec1 is a vector of logical values
• vec2 <- resume$call[resume$sex == "female"]
– the square brackets index the values in the “call” column using the corresponding logical value in
the “vec1” vector
– extract elements whose indexing value is TRUE
– subset of the “call” column: only females
• mean(resume$call[resume$sex == "female"])
– use the function mean( ) to calculate the sample mean of the subsetted vector
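The decomposition above can be traced step by step on a small data frame. This is only a sketch: `toy` and its values are hypothetical and are not the actual resume data.

```r
# hypothetical toy data, not the actual resume data set
toy <- data.frame(sex  = c("female", "male", "female", "female", "male"),
                  call = c(1, 0, 0, 1, 1))

vec1 <- toy$sex == "female"  # logical vector: TRUE FALSE TRUE TRUE FALSE
vec2 <- toy$call[vec1]       # keep call values where vec1 is TRUE: 1 0 1
mean(vec2)                   # share of females who received a callback
## [1] 0.6666667
```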
Alternatively, the calculation of callback rate for females can be done in two steps.
• We first subset a data frame object so that it contains only the resumes of females (with all columns)
and then compute the callback rate.
• Notice that we use square brackets [ , ] to index the rows and columns of a data frame.
• Unlike in the case of indexing vectors, we use a comma to separate row and column indexes.
• This comma is important and forgetting to include it will lead to an error.
• Here, we do not specify a column index after the comma because we want to keep all columns.
# step1: subset only females, keep all columns (do not specify a column index)
resume_f <- resume[resume$sex == "female", ]
dim(resume)
## [1] 4870 4
dim(resume_f) # fewer rows/observations
## [1] 3746 4
table(resume$sex)
##
## female   male
##   3746   1124
# step2: callback rate for females
mean(resume_f$call)
## [1] 0.08248799
(optional) We can also use the subset() function to construct a data frame that contains just some of the
original observations and just some of the original variables.
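As a rough illustration of subset(), the sketch below keeps only some rows and some columns of a small data frame. The data frame `toy` and its values are hypothetical; the `subset` and `select` arguments are those of the base R function.

```r
# hypothetical toy data, not the actual resume data set
toy <- data.frame(sex  = c("female", "male", "female"),
                  race = c("black", "white", "white"),
                  call = c(1, 0, 0))

# keep only the rows for females and only the columns call and race
toy_f <- subset(toy, subset = (sex == "female"), select = c("call", "race"))
dim(toy_f)  # two rows (the females) and two columns (call, race)
## [1] 2 2
```

Inside subset(), column names such as sex can be written without the `toy$` prefix.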
resume$BlackFemale <- ifelse(resume$race == "black" &
                             resume$sex == "female", 1, 0)
Decomposition:
• resume$race == "black"
– a vector of logical values
– “==” evaluates each element of the “race” column to see if it is equal to “black”
– if the element is equal to “black”, return a logical value of “TRUE”; otherwise “FALSE”
• resume$race == "black" & resume$sex == "female"
– a vector of logical values
– an element is TRUE only when both comparisons are TRUE: race is black and sex is female
• ifelse(resume$race == "black" & resume$sex == "female", 1, 0)
– a vector of 0/1 values
– for elements of resume$race == "black" & resume$sex == "female" that are TRUE, return a
value of 1; for elements that are FALSE, return a value of 0
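The same three steps can be followed on short vectors. This is a minimal sketch with hypothetical values; the names `is_black`, `is_female`, and `both` are introduced only for illustration.

```r
# hypothetical values, not the actual resume data set
race <- c("black", "black", "white", "white")
sex  <- c("female", "male", "female", "male")

is_black  <- race == "black"   # TRUE TRUE FALSE FALSE
is_female <- sex == "female"   # TRUE FALSE TRUE FALSE
both <- is_black & is_female   # TRUE only where both are TRUE
ifelse(both, 1, 0)             # 1 where both is TRUE, 0 otherwise
## [1] 1 0 0 0
```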
5. Factor Variables
A factor variable (or a categorical variable) takes a finite number of distinct values or levels.
• e.g., we wish to create a factor variable that takes one of the four values, i.e., BlackFemale, BlackMale,
WhiteFemale, and WhiteMale.
• We specify each type using the characteristics of the applicants.
##
## BlackFemale   BlackMale WhiteFemale   WhiteMale
##        1886         549        1860         575
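One possible way to build such a factor is sketched below. It is an assumption about the approach, not necessarily the code that produced the table above; the data frame `toy`, its values, and the column name `type` are hypothetical.

```r
# hypothetical toy data, not the actual resume data set
toy <- data.frame(race = c("black", "black", "white", "white", "white"),
                  sex  = c("female", "male", "female", "male", "female"))

toy$type <- NA  # start with an empty column, then fill in each type
toy$type[toy$race == "black" & toy$sex == "female"] <- "BlackFemale"
toy$type[toy$race == "black" & toy$sex == "male"]   <- "BlackMale"
toy$type[toy$race == "white" & toy$sex == "female"] <- "WhiteFemale"
toy$type[toy$race == "white" & toy$sex == "male"]   <- "WhiteMale"

toy$type <- as.factor(toy$type)  # convert the character column to a factor
levels(toy$type)                 # the four levels, in alphabetical order
table(toy$type)                  # number of observations at each level
```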
• To drop the missing values from the calculation, we use na.rm = TRUE, which means “remove the NA
values”.
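A short sketch of the effect of na.rm, using a hypothetical vector `z` with one missing value:

```r
z <- c(2, 4, NA, 6)    # hypothetical values with one missing entry
mean(z)                # the NA propagates to the result
## [1] NA
mean(z, na.rm = TRUE)  # the NA is removed first: (2 + 4 + 6) / 3
## [1] 4
```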
3. plot(density(x)): to plot the density curve of a vector x
• The probability density function of a variable x, denoted by f(x), describes how likely the variable is to
take values near x; density(x) estimates this function from the observed values in the vector x.
• You can use a different color by setting the col argument.
• You can use the lines() function to add another curve to an existing plot.
x <- rnorm(n=50,mean=20,sd=1)
y <- rnorm(n=50,mean=21,sd=2)
plot(density(x), col="green")
lines(density(y))
[Figure: density curves of x (green) and y. Title: density(x = x); x-axis label: N = 50 Bandwidth = 0.3599; y-axis label: Density.]
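Because plot() chooses the axis limits from the first curve only, a second curve added with lines() can be cut off. The sketch below is one way to handle this, under a few assumptions: the seed, the colors, the title, and the names `dx` and `dy` are illustrative choices, not part of the original example.

```r
set.seed(1)                            # hypothetical seed, for reproducibility only
x <- rnorm(n = 50, mean = 20, sd = 1)
y <- rnorm(n = 50, mean = 21, sd = 2)

dx <- density(x)                       # kernel density estimate of x
dy <- density(y)                       # kernel density estimate of y

# set the axis limits so that both curves fit inside the plotting region
plot(dx, col = "green", main = "Density of x and y",
     xlim = range(dx$x, dy$x), ylim = range(dx$y, dy$y))
lines(dy, col = "blue")                # add the curve for y to the existing plot
legend("topright", legend = c("x", "y"), col = c("green", "blue"), lty = 1)
```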
##
##          black white
##   female  1886  1860
##   male     549   575
t2[1,1] # female & black (the first row and the first column)
## [1] 1886
t2[2,1] # male & black (the second row and the first column)
## [1] 549
sum(t2[1,]) # total number of females (the sum of the first row)
## [1] 3746
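The object t2 above appears to be a two-way table of sex by race. The sketch below assumes such a table is built with table() from two vectors; the vectors and the name `tab` are hypothetical. It also shows that cells can be indexed by row and column name.

```r
# hypothetical values, not the actual resume data set
sex  <- c("female", "female", "male", "female", "male")
race <- c("black", "white", "black", "black", "white")

tab <- table(sex, race)  # rows: sex; columns: race
tab[1, 1]                # female & black (first row, first column)
## [1] 2
tab["male", "white"]     # the same kind of indexing, by name
## [1] 1
sum(tab[1, ])            # total number of females (sum of the first row)
## [1] 3
```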
5. t.test(x,y) function: to determine if there is a significant difference between the means of the two
groups x and y
• e.g., suppose we want to test if there is any difference between callback rate for blacks and callback rate
for whites.
• We can extract information from the t-test results.
t <- t.test(resume$call[resume$race=="black"],
            resume$call[resume$race=="white"])
t
##
## Welch Two Sample t-test
##
## data: resume$call[resume$race == "black"] and resume$call[resume$race == "white"]
## t = -4.1147, df = 4711.6, p-value = 3.943e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04729503 -0.01677067
## sample estimates:
##  mean of x  mean of y
## 0.06447639 0.09650924
names(t) # names of the components stored in the test result
t$estimate[1] # callback rate for blacks
## mean of x
## 0.06447639
t$estimate[2] # callback rate for whites
## mean of y
## 0.09650924
t$p.value # p-value: whether the difference is significant
## [1] 3.942942e-05
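Other components of the test result can be extracted in the same way. The sketch below uses simulated data rather than the resume data; the seed, the group vectors `g1` and `g2`, and the name `res` are hypothetical. Storing the result under a name other than t also avoids masking the base R function t(), which transposes a matrix.

```r
set.seed(1)                        # hypothetical seed, for reproducibility only
g1 <- rnorm(30, mean = 0)          # simulated group 1
g2 <- rnorm(30, mean = 1)          # simulated group 2

res <- t.test(g1, g2)              # Welch two-sample t-test (the default)
res$statistic                      # the t statistic
res$conf.int                       # 95 percent confidence interval for the difference in means
res$estimate[1] - res$estimate[2]  # difference between the two sample means
res$p.value < 0.05                 # TRUE if the difference is significant at the 5 percent level
```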