R - A Practical Course
Basics of R
Variable Assignment
A basic concept in R programming is the variable. It allows you to store a value or an object in R. You can then
later use this variable's name to easily access the value or the object that is stored within this variable. You use
<- to assign a variable:
my_variable <- 4
Variables are great for performing arithmetic operations. In this assignment, we have defined a
variable my_apples. You want to define another variable called my_oranges and add these two together:
my_apples + my_oranges
Common sense tells you not to add apples and oranges. The my_apples and my_oranges variables both
contained a number in the previous exercise. The + operator works with numeric variables in R. If you really
tried to add "apples" and "oranges", and assigned a text value to the variable my_oranges but not
to my_apples (see the editor), you would be trying to assign the sum of a numeric and a character
variable to the variable my_fruit. This is not possible.
You can convert between data types with coercion functions such as as.numeric(). For example, var_num <- as.numeric(var)
converts the character string "3" in var to the numeric 3 and assigns it to var_num. However, keep in mind that
it is not always possible to convert the types without losing information or getting errors.
as.integer("4.5")
as.numeric("three")
The first line will convert the character string "4.5" to the integer 4. The second one will convert the character
string "three" to an NA.
Create a Vector I
On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional
arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to
store data. For example, you can store your daily gains and losses in the casinos. In R, you create a vector with
the combine function c(). You place the vector elements separated by commas between the parentheses. For
example:
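A minimal sketch (the vector names and values are illustrative):
# gains and losses for poker and roulette from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)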
Once you have created these vectors in R, you can use them to do calculations.
Sometimes you only want to select a specific element from one of those vectors instead of using the entire
vector. R makes this very easy using indexing. Indexing entails the use of square brackets [] to select elements
from a vector.
For instance, numeric_vector[1] will select the first element of the
vector numeric_vector. numeric_vector[c(1,3)] will select the first and the third element of the
vector numeric_vector.
Selection by Comparison I
Sometimes you want to select elements from a vector in a more advanced fashion. This is where the use of
logical operators may come in handy.
The (logical) comparison operators known to R are: - < for less than - > for greater than - <= for less than or
equal to - >= for greater than or equal to - == for equal to each other - != not equal to each other
The nice thing about R is that you can use these comparison operators on vectors. For example, the statement
c(4,5,6) > 5 returns: FALSE FALSE TRUE. In other words, you test for every element of the vector if the
condition stated by the comparison operator is TRUE or FALSE.
Behind the scenes, R does an element-wise comparison of each element in the vector c(4,5,6) with the element
5. However, 5 is not a vector of length three. To solve this, R automatically replicates the value 5 to generate a
vector of three elements, c(5, 5, 5) and then carries out the element-wise comparison.
Script.R
# A numeric vector containing 3 elements
numeric_vector <- c(1, 10, 49)
larger_than_ten <- numeric_vector > 10
In the last exercise we saw that larger_than_ten is a vector of TRUE and FALSE values. We can make use of
such a logical vector to select elements from another vector. For instance, numeric_vector[c(TRUE,
FALSE, TRUE)] will select the first and the third element from the vector numeric_vector.
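Putting the two together, a short sketch of selecting with the logical vector created above:
# keep only the elements larger than ten
numeric_vector <- c(1, 10, 49)
larger_than_ten <- numeric_vector > 10
numeric_vector[larger_than_ten]
[1] 49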
Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a
fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-
dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example:
matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)
The first argument is the collection of elements that R will arrange into the rows and columns of the
matrix. Here, we use 1:9 which constructs the vector c(1, 2, 3, 4, 5, 6, 7, 8, 9).
The argument byrow indicates that the matrix is filled by the rows. This means that the matrix is filled
from left to right and when the first row is completed, the filling continues on the second row. If we
want the matrix to be filled by the columns, we just place byrow = FALSE.
The third argument nrow indicates that the matrix should have three rows.
The fourth argument ncol indicates the number of columns that the matrix should have.
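For reference, the call above produces the following matrix:
matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9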
Factors
The term factor refers to a statistical data type used to store categorical variables. The difference between a
categorical variable and a continuous variable is that a categorical variable can belong to a limited number of
categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical
models you will develop in the future treat both types differently.
A good example of a categorical variable is the variable student_status. An individual can either be
"student" or "not student". This means that "student" and "not student" are two values of the categorical variable
student_status and every observation can be assigned one of these values. We can do this using the
factor() function.
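A minimal sketch (the observations are made up for illustration):
# a character vector of observations
student_status <- c("student", "not student", "student", "student")
# convert it to a factor
student_status_factor <- factor(student_status)
levels(student_status_factor)
[1] "not student" "student"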
Dataframes: What's a Data Frame?
You may remember the matrix, a multi-dimensional object that we discussed earlier. All the elements that you
put in a matrix should be of the same type. However, when performing a market research survey, you often
have questions such as:
'Are you married?' and other 'yes/no' questions (= boolean data type)
'How old are you?' (= numeric data type)
'What is your opinion on this product?' or other 'open-ended' questions (= character data type)
The output, namely the respondents' answers to the questions formulated above, is a data set of different data
types. You will often find yourself working with data sets that contain different data types instead of only one.
A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar
concept for those coming from different statistical software packages such as SAS or SPSS.
Inspecting dataframes
There are several functions you can use to inspect your dataframe. To name a few:
head: this by default prints the first 6 rows of the dataframe
tail: this by default prints the last 6 rows to the console
str: this prints the structure of your dataframe
dim: this by default prints the dimensions, that is, the number of rows and columns of your dataframe
colnames: this prints the names of the columns of your dataframe.
You construct a data frame with the data.frame() function. As arguments, you provide the vectors that
should become the different columns of the data frame. Therefore, it is important that each vector used to
construct a data frame has an equal length. But do not forget that it is possible (and likely) that they contain
different types of data.
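A minimal sketch of constructing and inspecting a small data frame (the survey variables are made up for illustration):
# three vectors of equal length, each with a different data type
married <- c(TRUE, FALSE, TRUE)
age <- c(34, 28, 51)
opinion <- c("good", "okay", "great")
# combine them into a data frame and inspect its structure
survey <- data.frame(married, age, opinion)
str(survey)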
Lists
A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in
length, characteristics, and type of activity that has to be done.
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered
way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these
objects are related to each other.
You can easily construct a list using the list() function. In this function you can wrap the different elements
like so: list(item1, item2, item3).
You can select a component from a list with double square brackets; for instance, my_list[[1]] selects the
first component. Another way to grab a component is using the $ sign. The following code would
select my_df from my_list: my_list$my_df.
Besides selecting components, you often need to select specific elements out of these components. For example,
with my_list[[1]][1] you select from the first component of my_list the first element. This would
select the number 1.
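A minimal sketch tying this together (the component names and contents are illustrative):
# gather a vector, a matrix and a data frame under one name
my_vector <- 1:10
my_matrix <- matrix(1:9, nrow = 3)
my_df <- mtcars[1:10, ]
my_list <- list(my_vector, my_matrix, my_df)
# select the first element of the first component: the number 1
my_list[[1]][1]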
Introduction to R
Making a Start with Functions: Getting Help
So far we have seen many datatypes in R. The next thing to learn about is functions. We have already seen many
functions when working with vectors, dataframes and lists. For instance, we used the function list() to
make a list.
In programming, functions are used to incorporate sets of instructions that we want to use repeatedly. A function
is actually a piece of code written to carry out a specified task; it may accept arguments or parameters (or not)
and it may return one or more values (or not!).
Let's look at a pre-programmed function in R: mean. To consult the R documentation on this function, you can
use the following commands:
help(mean)
?mean
Try these commands out in the DataCamp console. If you do so, you'll be redirected to
www.RDocumentation.org. If you typed this function into your RStudio console, a help tab would
automatically open in RStudio.
There is another way of getting help on a function. For instance, if you want to know which parameters need to
be provided, you can use the R function args on the specified function. An example of using args on a
function is the following: args(mean)
Functions Continued
In the last exercise we made a start with functions. Also, we looked at how we could get help on using functions.
When getting help on the mean function, you saw that it takes an argument x. x here is just an arbitrary name
for the object that you want to find the mean of. Usually this object will be an R vector. We also saw the ....
This is called an ellipsis and is used to provide a number of optional arguments to the function.
Remember that R can match arguments both by position and by name. Let's say we want to find the mean of a
vector called temperature. An example of matching by name is mean(x = temperature); matching by
position would simply be mean(temperature).
Script.R
# a grades vector
grades <- c(8.5, 7, 9, 5.5, 6)
When we looked at the documentation of mean, it showed us the following method:
mean(x, trim = 0, na.rm = FALSE, ...)
As you can see, both trim and na.rm have default values. However, x doesn't. This makes x a required
argument. That means that the function mean will throw an error if x hasn't been specified. trim and na.rm are,
however, optional arguments with default values that can be changed or specified by the user.
na.rm can be changed by the user if a given vector contains missing values. For instance, if the aforementioned
vector called temperature had missing values, calling mean on it would return NA. If you
want the mean function to exclude the NA values when calculating the mean, you can specify na.rm = TRUE.
Let's bring this into practice:
Script.R
# a grades vector
grades <- c(8.5, 7, 9, NA, 6)
# mean without removing NA
mean(grades)
# mean with removing NA
mean(x = grades, trim = 0, na.rm = TRUE)
You could call this function and assign its result to the variable result, using the following code:
result <- sum_a_b(4, 5)
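The definition of sum_a_b does not appear in this excerpt; a minimal sketch consistent with the call above:
# hypothetical reconstruction: a function returning the sum of its two arguments
sum_a_b <- function(a, b) {
  a + b
}
result <- sum_a_b(4, 5)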
Script.R
# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ';')
Working Directories in R
In the previous assignments, we practised reading in files in R. So far, all of these files were on the internet.
However, if you would work with R studio on your own computer, you would probably like to read in local
files.
When reading in local files, it's good to have an idea what your working directory is. Your working directory
is basically the part of your file system where R will look for files. Usually this is something along the lines of
C:/Users/Username/documents. Of course this working directory is not static and can be changed by the user.
In R there are two important functions:
getwd(): This function retrieves the current working directory for the user.
setwd(): This function allows the user to set her own working directory.
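A quick sketch (the path is just an example):
# print the current working directory
getwd()
# change it to a folder of your choice
setwd("C:/Users/Username/documents")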
Importing R Packages
Although base R comes with a lot of useful functions, you will not be able to fully leverage the full power of R
without being able to import R modules developed by others. Imagine we want to do some great plotting and
we want to use ggplot2 for it. If we want to do so, we need to take 2 steps:
1. Install the package ggplot2 using install.packages("ggplot2")
2. Load the package ggplot2 using library(ggplot2) or require(ggplot2)
On DataCamp, however, most packages are already installed and readily available. As such, you won't need to
run install.packages().
Explore Data
Checking The Dimensions of Your Data
We are going to start out using the mtcars data set, which contains measures of design, performance and
consumption for different cars. If we want to know how many cases and variables there are in the data set we
could count them manually, but this could take a very long time. A faster way is to use the function dim().
The first value returned by dim() is the number of cases (rows) and the second value is the number of variables
(columns).
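For instance, running dim() on mtcars:
dim(mtcars)
[1] 32 11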
The variables in mtcars are as follows:
[, 1] mpg Miles/(US) gallon.
[, 2] cyl Number of cylinders.
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburettors
Data Structure
Using the str() function we can look at the structure of a dataset. str() takes the name of the data set as its
first argument. The output shows the variable names, their type, and the values of the first observations.
The am variable indicates whether a car has an automatic or manual transmission. Perform
the str() function on mtcars in your console and look at the output. According to R, what type of variable
is am?
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Levels
In the last exercise you saw that the am variable of the mtcars data set was labelled by R as a factor. You can
see the levels of a factor variable by using the function levels(). Let's try this out. Remember, you can select
a specific variable using either $ or [,]. If you need to check the variables in the data set, remember that you
can always use the str() function in your console.
# Look at the levels of the variable am
> levels(mtcars$am)
[1] "0" "1"
Recoding Variables
Currently the mpg (miles per gallon) variable of mtcars is a continuous numeric variable, but it may be more
useful if mpg was a categorical variable that immediately told you if the car had low or high miles per gallon.
We can make categories through indexing variables that meet certain criteria.
For instance, if we want to make a new variable that categorises people over age 18 as "adult"", we might enter:
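The code example itself is missing here; given the description that follows, it was presumably along these lines (newvariable and age are the names used in the text):
newvariable[age > 18] <- "adult"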
This assigns the value "adult" to the variable newvariable, for all cases where age is greater than 18.
Remember, you can select a specific variable using either $ or [,]. If you need to look at your data you can
simply enter mtcars into your console, or if you just want to check the variables you can always
enter str(mtcars) in your console.
Script.R
#Assign the value of mtcars to the new variable mtcars2
mtcars2 <- mtcars
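The remainder of the solution is not shown; a possible continuation that recodes mpg into categories (the variable name mpgcategory and the cut-off of 20 are illustrative):
# cars with mpg above 20 are "high", the rest "low"
mtcars2$mpgcategory <- "low"
mtcars2$mpgcategory[mtcars2$mpg > 20] <- "high"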
Examining Frequencies
Frequency tables show you how often a given value occurs. To look at a frequency table of the data in R, use
the function table(). The top row of the output is the value, and the bottom row is the frequency of the value.
Let's use table() on the am variable of the mtcars data set. Remember that am is a categorical variable that
shows a 0 when a car has an automatic transmission and a 1 when a car has a manual transmission.
> table(mtcars$am)
0 1
19 13
Cumulative Frequency
In your console, look at how many cars have 3, 4 or 5 gears using table() on the
variable gear from mtcars.
In your console, calculate how many of the cars have 3 or 5 gears as a percentage of the total number
of cars.
In your script, report this percentage
Rconsole
> table(mtcars$gear)
 3  4  5
15 12  5
> # Percentage of cars that have 3 or 5 gears
> (15/32)*100 + (5/32)*100
[1] 62.5
Script.R
# What percentage of cars have 3 or 5 gears?
62.5
Remember, you can select a specific variable using either $ or [,]. If you need to look at your data you can
simply enter mtcars into your console, or if you just want to check the variables you can always
enter str(mtcars) in your console.
Script.R
# Assign the frequency of the mtcars variable "am" to a variable called "height"
height <- table(mtcars$am)
# Label the y axis "number of cars" and label the bars using barnames
barplot(height, ylab = "number of cars", names.arg = barnames)
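The variable barnames is not defined in this excerpt; given that am has the levels "0" (automatic) and "1" (manual), it was presumably something like:
barnames <- c("automatic", "manual")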
Histograms
It can be useful to plot frequencies as histograms to visualize the spread of our data.
Let's make a histogram of the number of carburetors in our mtcars dataset using the function hist().
The first argument of hist() is the vector of values for which the histogram is desired. Following this, we can
add arguments to format the graph as necessary, for instance hist(variable, argument1,
argument2).
Script.R
# Make a histogram of the carb variable from the mtcars data set. Set the title to "Carburetors"
hist(mtcars$carb, main = "Carburetors")
Script.R
# Calculate the mean miles per gallon
mean(mtcars$mpg)
[1] 20.09062
Mode
Sometimes it is useful to look at the most frequent value in a data set, known as the 'mode'. R doesn't have
a standard function for the mode, but we can calculate it easily using the table() function, which you
might be familiar with by now.
When you have a large data set, the output of table() might be too long to manually identify which value is
the mode. In this case it can be useful to use the sort() function, which arranges a vector or factor into
ascending order. (You can add the argument decreasing = TRUE to sort() if you want to arrange it in
descending order.)
Script.R
# Produce a sorted frequency table of `carb` from `mtcars`
sort(table(mtcars$carb), decreasing = TRUE)
Range
The range of a variable is the difference between the highest and lowest value. We can find these values
using max() and min() on the variables of our choice. If instead you want to know which row (or case)
contains the requested value, use which.max() and which.min(); you can then index this case to find the
desired values. Remember, you can index using [].
Script.R
# Minimum value
x <- min(mtcars$mpg)
# Maximum value
y <- max(mtcars$mpg)
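Presumably the exercise then computes the range as the difference of the two; a minimal completion:
# Calculate the range
y - x
[1] 23.5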
Quartiles
You can calculate the quartiles in your data set using the function quantile(). The output
of quantile() gives you the lowest value, first quartile, second quartile, third quartile and highest value.
25% of your data lies below the first quartile value, 50% lies below the second quartile, and 75% lies below the
third quartile value.
> quantile(mtcars$qsec)
0% 25% 50% 75% 100%
14.5000 16.8925 17.7100 18.9000 22.9000
IQR Outliers
In the boxplot you created you can see a circle above the boxplot. This indicates an outlier. We can calculate an
outlier as a value 1.5 * IQR above the third quartile, or 1.5 * IQR below the first quartile.
> IQR(mtcars$qsec)
[1] 2.0075
> quantile(mtcars$qsec)
0% 25% 50% 75% 100%
14.5000 16.8925 17.7100 18.9000 22.9000
> #upper threshold
> 18.9000+1.5*2.0075
[1] 21.91125
> #lower threshold
> 16.8925-1.5*2.0075
[1] 13.88125
Standard Deviation
We can also measure the spread of data through the standard deviation. You can calculate these using the
function sd(), which takes a vector of the variable in question as its first argument.
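For example, applied to the 1/4 mile time in mtcars:
# standard deviation of qsec
sd(mtcars$qsec)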
Calculating Z-scores
We can calculate the z-score for a given value (X) as (X - mean) / standard deviation. In R you can do this with
a whole variable at once by putting the variable name in the place of X
Script.R
# Calculate the z-scores of mpg
(mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
Correlation and Regression
Scatterplots
Saved in your console is a dataset called women, which contains the heights and weights of 15 women (try
typing it into your console and pressing enter to have a look).
Let's have a look at the relationship between height and weight through a scatterplot, using the R
function plot(). The first argument of plot() is the x-axis coordinates, and the second argument is the y-
axis coordinates.
Script.R
plot(women$weight, women$height, main = "Heights and Weights")
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Script.R
table(smoking$tobacco, smoking$student)
Calculating Percentage from Your Contingency Table
Have a look at the contingency table of tobacco consumption and education you made in the last exercise. It's
saved in your console as st. Let's use it to calculate some percentages!
In this exercise you need to report your answers to one decimal place. You are free to do this manually, but if
you want a quick way to do this through R you can use the round() function. The first argument
of round() is the value that you want to round (this can be in the form of a raw number, or an equation), and
the second argument is digits =, where you specify the number of decimal places you want the number
rounded to. For instance, round(12.6734, digits = 2) would return the value 12.67.
Script
# What percentage of high school students smoke 0-9g of tobacco?
38.6
# Of the students who smoke the most, what percentage are in university?
57.7
Console
> (17/(17+16+11))*100
[1] 38.63636
> round(38.63636, digits = 1)
[1] 38.6
> (15/(11+15))*100
[1] 57.69231
> round(57.69231, digits = 1)
[1] 57.7
Script.R
# Calculate the correlation between var1 and var2
cor(var2, var1)
Script.R
# predicted values of y according to line 1
y1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# predicted values of y according to line 2
y2 <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
# actual values of y
y <- c(3, 2, 1, 4, 5, 10, 8, 7, 6, 9)
# calculate the squared error of line 1
sum((y - y1)^2)
[1] 36
# calculate the squared error of line 2
sum((y - y2)^2)
[1] 46
Script.R
# Our data
money <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
prosocial <- c(3, 2, 1, 4, 5, 10, 8, 7, 6, 9)
# Find the regression coefficients
lm(prosocial ~ money)
Console
> lm(prosocial ~ money)

Call:
lm(formula = prosocial ~ money)

Coefficients:
(Intercept)        money
     1.2000       0.7818
Script.R
# Your plot
plot(money, prosocial, xlab = "Money", ylab = "Prosocial Behavior")
# Store your regression coefficients in a variable called "line"
line <- lm(prosocial~money)
# Use "line" to tell abline() to make a line on your graph
abline(line)
Script.R
# Your plot
plot(money, prosocial, xlab = "Money", ylab = "Prosocial Behavior")
R Squared I
These are the two lines you plotted in the last assignment. One line shows the mean, and one shows the
regression line. Clearly, there is less error when we use the regression line compared to the mean line. This
reduction in error from using the regression line compared to the mean line tells us how well the independent
variable (money) predicts the dependent variable (prosocial behaviour).
Conveniently, the R squared is equivalent to squaring the Pearson R correlation coefficient. We're going to
calculate the R squared for prosocial and money.
Script.R
# Calculate the R squared of prosocial and money
r <- cor(money, prosocial)
r*r
> r*r
[1] 0.6112397
Probability Distribution
Probability mass and density functions
From the lectures you may recall the concepts of probability mass and density functions. Probability mass
functions relate to the probability distributions of discrete variables, while probability density functions relate to
probability distributions of continuous variables. Suppose we have the following probability mass function:
Script.R
# the data frame
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))
Usage
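The usage block for the normal distribution functions reads:
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)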
Arguments
x, q vector of quantiles.
p vector of probabilities.
n number of observations. If length(n) > 1, the length is taken to be the number
required.
mean vector of means.
sd vector of standard deviations.
log, log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are \(P[X \le x]\) otherwise, \(P[X > x]\).
Details
If mean or sd are not specified they assume the default values of 0 and 1, respectively. The normal distribution
has density $$ f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/2\sigma^2}$$ where \(\mu\) is the mean of the
distribution and \(\sigma\) the standard deviation.
Value
dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function,
and rnorm generates random deviates.
The length of the result is determined by n for rnorm, and is the maximum of the lengths of the numerical
arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the
logical arguments are used.
For sd = 0 this gives the limit as sd decreases to 0, a point mass at mu. sd < 0 is an error and returns NaN.
Script.R
# simulating data
set.seed(11225)
data <- rnorm(10000)
All the probabilities in the table are included in the dataframe probability_distribution which
contains the variables outcome and probs. We could sum individual probabilities in order to get a cumulative
probability of a given value. However, in some cases, the function cumsum() may come in handy.
What cumsum() does is return a vector whose elements are the cumulative sums of the elements of the
argument. For instance, for a vector containing the elements c(1, 2, 3), cumsum() would
return c(1, 3, 6).
Script.R
# probability that x is smaller or equal to two
prob <- (0.1 + 0.2 + 0.3)
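Equivalently, cumsum() gives all cumulative probabilities at once; the third element is the probability that x is smaller than or equal to two:
cumsum(data$probs)
[1] 0.1 0.3 0.6 0.8 0.9 1.0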
Summary statistics: The mean
One of the first things that you would like to know about a probability distribution are some summary statistics
that capture the essence of the distribution. One example of such a summary statistic is the mean. The mean of
a probability distribution is calculated by taking the weighted average of all possible values that a random
variable can take. In the case of a discrete variable, you calculate the sum of each possible value times its
probability. Let's go back to our probability mass function of the first exercise.
Script.R
# calculate the expected value and assign it to the variable expected_score
expected_score <- sum(data$outcome * data$probs)
# print expected_score to the console
expected_score
[1] 2.3
Script.R
# the mean of the probability mass function
expected_score <- sum(data$outcome * data$probs)
Hair length is considered to be normally distributed with a mean of 25 centimeters and a standard deviation of
5. Imagine we wanted to know the probability that a woman's hair length is less than 30. We can do this in R
using the pnorm() function. This function calculates the cumulative probability. We can use it the following
way: pnorm(30, mean = 25, sd = 5). If you wanted to calculate the probability of a woman having a hair
length larger than or equal to 30 centimeters, you can set the lower.tail argument to FALSE. For
instance, pnorm(30, mean = 25, sd = 5, lower.tail = FALSE). Let's visualize this. Note that the
first example is visualized on the left, while the second example is visualized on the right.
Calculate the probability of a woman having a hair length less than 20 centimeters using a mean of 25 and a
standard deviation of 5. Use the pnorm() function and round the value to two decimals.
Script.R
# probability of a woman having a hair length of less than 20 centimeters
round(pnorm(20, mean= 25,sd=5), 2)
[1] 0.16
Script.R
# 85th percentile of female hair length
round(qnorm(0.85, mean=25, sd=5),2)
[1] 30.18
The normal distribution and Z scores
A special form of the normal probability distribution is the standard normal distribution, also known as the z -
distribution. A z distribution has a mean of 0 and a standard deviation of 1. Often you can transform variables
to z values. You can transform the values of a variable to z-scores by subtracting the mean, and dividing this by
the standard deviation. If you perform this transformation on the values of a data set, your transformed data set
will ave a mean of 0 and a standard deviation of 1. The formula to transform a value to a z score is the following:
$$Z_i = \frac{x_i - \bar{x}}{s_x}$$
The Z-score represents how many standard deviations from the mean a value lies.
Script.R
# calculate the z value and store it in the variable z_value
z_value <- round((38 - 25) / 5)
z_value
[1] 3
The standard deviation of a binomial distribution is given by:
$$\sqrt{n \times p \times (1 - p)}$$
An exam consists of 25 multiple choice questions. Each question has 5 possible answers. This means that
the probability of answering a question correctly by chance is 0.2. Calculate the mean of this distribution and
store it in a variable called mean_chance.
Calculate the standard deviation of this distribution and store it in the variable std_chance.
Script.R
# calculate the mean and store it in the variable mean_chance
mean_chance <- c(25*0.2)
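The second part of the solution is not shown; applying the formula above, it would presumably be:
# calculate the standard deviation and store it in the variable std_chance
std_chance <- sqrt(25 * 0.2 * (1 - 0.2))
std_chance
[1] 2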
Recall our exam example, where 0.2 was the probability of guessing a question correctly. In contrast to the
normal distribution, when we have to deal with a binomial distribution we can calculate the probability of
answering, say, exactly 5 questions correctly. This is because a binomial distribution is a discrete distribution.
When we want to calculate the probability of answering 5 questions correctly, we can use the dbinom function.
This function calculates an exact probability. If we would like to calculate an interval of probabilities, say the
probability of answering 5 or more questions correctly, we can use the pbinom function. We have already seen a
similar function when we were dealing with the normal distribution: the pnorm() function.
Calculate the exact probability of answering 5 questions correctly and store this in the variable five_correct.
Calculate the cumulative probability of answering at least 5 questions correctly and store this in the
variable atleast_five_correct.
Script.R
# probability of answering 5 questions correctly
five_correct <- dbinom(5, size = 25, prob = 0.2)
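The second part of the solution is not shown; with pbinom() it would presumably be (note that P(X >= 5) is the upper tail starting above 4):
# cumulative probability of answering at least 5 questions correctly
atleast_five_correct <- pbinom(4, size = 25, prob = 0.2, lower.tail = FALSE)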
Calculate the 60th percentile of the binomial distribution of exam questions. Note that the number of questions
is 25 and the probability of guessing a question correctly is 0.2.
Script.R
# calculate the 60th percentile
qbinom(0.60, size = 25, prob = 0.2)
[1] 5
Sampling Distribution
Sampling from the population
In this lab we have access to the entire population. In real life, this is rarely the case. Often we gather information
by taking a sample from a population. In the lectures, you've become familiar with the male beard length (in
millimetres) of hipsters in Scandinavia. In this lab, we will be working with this example.
If we were interested in estimating the average male beard length of hipsters in Scandinavia, in R we can use
the sample() function to sample from the population. For instance, to sample 50 inhabitants from our
Scandinavian male hipster population which is included in the variable scandinavia_data, we could do
the following: sample(scandinavia_data, size = 50). This command collects a simple random
sample of size 50. If we didn't have access to the entire male hipster Scandinavian population, working with
these 50 inhabitants would be considerably simpler than having to go through the entire Scandinavian male
hipster population.
Make sure not to remove the set.seed(11225) code. This makes sure that you will get the same results as the solution code.
Sample 100 values from the dataset scandinavia_data and store this in a variable first_sample.
Calculate the mean of first_sample and print the result.
Script.R
# variable scandinavia_data contains the beard lengths of the scandinavian male population
set.seed(11225)
first_sample <- sample(scandinavia_data, size = 100)
mean(first_sample)
[1] 25.42916
That is, the mean of the means of all the samples that we take from the population will never be far away from the
population mean. Let's observe this in practice.
Calculate the mean of the population and print it. Note that the
population is included in the variable scandinavia_data.
Calculate the mean of the sample means and print it. Note that the
sample means are included in the variable sample_means.
Note how close the two are
Script.R
# set the seed such that you will get the same sample as in the solution code
set.seed(11225)
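The rest of the solution is not shown; following the instructions above, it would presumably read:
# calculate the population mean and print it
mean(scandinavia_data)
# calculate the mean of the sample means and print it
mean(sample_means)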
Script.R
# standard deviation of the population
population_sd <- sd(scandinavia_data)
population_sd
[1] 3.466054
# standard deviation of the sampling distribution for samples of size 200 (this line of code is missing from the excerpt; reconstructed to match the output)
sampling_sd <- population_sd / sqrt(200)
sampling_sd
[1] 0.245087
Z-scores
Recall the concept of Z scores from the lectures. Z scores are standardized scores that express how far an
observation is removed from its mean. A Z score of 2 means that an observation is 2 standard deviations away
from its population mean. Also recall that the formula for the Z score is the following:
$$Z_i = \frac{x_i - \mu}{\sigma}$$
In this formula, x_i refers to the observation for which you want to calculate the z score, while μ refers to the
population mean of the phenomenon and σ refers to the population standard deviation.
To illustrate the concept of the Z score, let's go back to our scandinavia_data dataset. In this population
of male hipsters from Scandinavia, the average beard length is 25. The standard deviation in the population is
3.47. Suppose we had a hipster with a beard length of 32mm; this would be unusual for this population and thus
would have a rather high Z score.
Calculate the Z score of this hipster with a beard of 32 millimetres and store it in a variable called z_score.
Print the variable z_score to the console.
Script.R
# z score of hipster with a beard of 32 millimeter
z_score <- (32 - 25) / 3.47
# print the variable z_score to the console
z_score
[1] 2.017291
Let's look at how to use pnorm() and let's play around with the lower.tail option.
Recall that the z-score for the Scandinavian hipster in the previous exercise was 2.02.
Calculate the area left of this observation by specifying lower.tail = TRUE in pnorm and print this probability.
Now calculate the area under the curve right of this observation by specifying lower.tail = FALSE and print this probability.
Script.R
# calculate the area under the curve left of the observation
pnorm(2.02, lower.tail = TRUE)
# calculate the area under the curve right of the observation
pnorm(2.02, lower.tail = FALSE)
Script.R
#calculate the population mean
population_mean <- mean(scandinavia_data)
# calculate the standard deviation of the sampling distribution and put it in a variable sampling_sd
sampling_sd <- population_sd/sqrt(50)
# calculate the probability of a sample mean of 26 millimetres or larger (this line is missing from the excerpt; reconstructed to match the output)
pnorm(26, mean = population_mean, sd = sampling_sd, lower.tail = FALSE)
[1] 0.01810623
You may notice that this distribution of sample means is much narrower than the distribution of observations
of individual hipsters. How would you interpret the red area?
Answer: The red area is the probability of obtaining a sample with a mean beard length equal to or larger than
26 millimetres, based on a sample of 50 hipsters.
Let's break this problem into steps. Firstly, we can calculate the standard deviation of the sampling distribution.
The second step is using a function that may look familiar: pnorm(). Although we do not have a mean, we can
use our sample and population proportions. Our sample proportion will constitute the q argument here, while
our population proportion will constitute the mean argument. Now let's get going: what is the probability of
finding a sample of 200 with a proportion of 0.13 or more hipsters?
Calculate the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters using the pnorm() function.
Script.R
# calculate the standard deviation of the sampling distribution and put it in a variable called sample_sd
sample_sd <- sqrt((0.10 * (1 - 0.10)) / 200)
# calculate the probability
pnorm(0.13, mean = 0.10, sd = sample_sd, lower.tail = FALSE)
[1] 0.0786496
Confidence Intervals
Confidence Interval with Known SD I
We know that in a normally distributed phenomenon, 95% of cases will fall within 1.96 standard deviations
above and below the mean. Let's see what that would look like. Imagine we magically know that the world
population mean for happiness has a value of 36.5, with a standard deviation of 7. Let's find out where 95% of
the people in the world lie.
In your script, calculate the amount above and below 36.5 that 95% of the world falls.
Remember to format your calculations correctly and use () where appropriate.
Script.R
# above
36.5 + (1.96 * 7)
[1] 50.22
# below
36.5 - (1.96 * 7)
[1] 22.78
Script.R
# Assign the sample mean to object "m"
m <- mean(samp)
Script.R
# Assign the sample mean to object "m"
m <- mean(rrage)
In your console, check the frequency of "yes" (indicating the presence of road rage) and "no" (indicating no road rage).
Using your script, use this information to calculate what proportion of your sample have road rage and assign this to the object p.
Script.R
# Make p the proportion of the sample with road rage
p <- 70 / 200
In your console, copy the steps from your script, and use this to calculate the range of the 95% confidence
interval
In your console, copy and adapt the code from your script for finding the 99% confidence interval, and
use this to calculate the range of the 99% confidence interval
In your script, report the ranges of each interval
In your script, report which confidence interval is wider as a single number (95 or 99)
Script.R
# Find the standard error of p
se <- sqrt( (p * (1 - p)) / 200)
# Calculate the upper level of the 95% confidence interval
p + 1.96 * se
# Calculate the lower level of the 95% confidence interval
p - 1.96 * se
# Report the range of the 95% confidence interval
0.134192 {(p + 1.96 * se) - (p - 1.96 * se)}
# Report the range of the 99% confidence interval
0.1766405 {(p + 2.58 * se) - (p - 2.58 * se)}
# Which has the widest range?
99
Script.R
# Assign the sample mean to object "m"
m <- mean(rrage)
# Assign the sample standard error to object "s"
s <- sd(rrage) / sqrt(200)
# Calculate the upper level of the 95% confidence interval
m + ( 1.9720 * s )
# Calculate the lower level of the 95% confidence interval
m - ( 1.9720 * s)
# Calculate the range of the 95% confidence interval
3.88326 {(m + 1.9720 * s) - (m - 1.9720 * s)}
# Calculate the range of the 90% confidence interval
3.2541 {(m + 1.6525 * s) - (m - 1.6525 * s)}
# Which has the widest range?
95
Sample Size I
Which of the following does a large sample size reduce?
The standard deviation
The margin of error
The level of confidence
Answer: The margin of error (m = CL × standard deviation)
Sample Size II
You're interested in looking at how many days in the week students drink alcohol, and need to know what kind
of sample size to use. You know that to find this out, you need a Z-score, a margin of error and a standard
deviation. Let's try to establish the standard deviation first. You expect that about 95% of people will consume
an alcoholic drink between 1 and 6 days in the week.
Assuming that your data is normally distributed, and that 95% of people will report consuming alcohol between 1 and 6 days in the week, report the expected mean number of drinking days.
Based on the expected distribution and mean, assign the expected standard deviation to a new object called "s".
Script.R
# What is the expected mean number of drinking days?
3.5 {(1+6)/2}
# Assign standard deviation to object "s"
s <- 1.25
Script.R
# Assign the standard deviation squared to new object "ss"
ss <- 1.25^2
# Assign the value of the Z-score squared to new object "zs"
zs <- 1.96^2
# Assign the value of the margin of error squared to the new object "ms"
ms <- 0.2^2
# Calculate the necessary sample size
(ss*zs)/ms
[1] 150.0625
Sample Size IV
Now you're conducting a study on what proportion of student drink alcohol and want to know what sample size
to use for a confidence interval of 95%, with a margin of error of 0.05.
The sample size will be p multiplied by 1-p, multiplied by the Z-score squared, divided by the margin of error
squared. Let's try to find this using the 'safe approach' for p: the value of p at which p*(1-p) is at its largest, so
that the sample size cannot come out too small (you can always go back to the lecture notes or click 'hint' if you
don't remember what this is).
In your script, calculate the separate components of the sample size equation and assign to stated object
In your script, use the separate components to calculate sample size
Remember to format your calculations correctly and use () where appropriate.
Script.R
# Assign the value of p(1-p) to object "p"
p <- 0.5*0.5
# Assign the value of the Z-score to new object "z"
z <- 1.96
# Assign the value of the margin of error squared to the new object "ms"
ms <- 0.05^2
# Calculate the necessary sample size
(p*z^2)/ms
(p*z^2)/ms
[1] 384.16
Hypothesis Testing
Significance testing: one-sided versus two-sided
An important consideration when doing hypothesis testing is whether to do a one-sided or a two-sided test.
Consider the example that we are working with a significance level (α) of 0.05. In the case we are doing a one-
sided hypothesis test, we would only focus on one side of the distribution (either the right or the left side). Our
entire cumulative probability of 0.05 would then be allocated to only this side. So what does this mean in
practice? In practice this means that our rejection region starts at a probability of 0.95 when our alternative
hypothesis tests whether a given value is greater than a population value.
Alternatively, our rejection region starts at a probability of 0.05 when our
alternative hypothesis tests whether a given value is smaller than a
population value. Let's consider what this means visually:
In the visualization, we have taken as an example the sampling
distribution of the beard length of samples of 40 Scandinavian hipsters.
The mean here is 25 and the standard error is 0.55 (round(3.5 / sqrt(40), 2)). The red area is
considered the rejection region when
we are doing a one-sided hypothesis where the alternative hypothesis
checks whether the population mean of the beard length of Scandinavian
hipsters is larger than 25 millimetres.
The visualization mentions the value 25.90. This is the starting value of the rejection region. Consider our
example mentioned above with a mean beard length of 25 and a standard error of 0.55.
Reproduce the value of 25.90 using the qnorm() function and assign it to the variable cut_off. Make sure to round every value in this exercise to two digits.
Print the value of cut_off to the console.
Script.R
# calculate the value of cut_off
cut_off <- round(qnorm(0.95, mean = 25, sd = round(3.5/sqrt(40), 2)), 2)
# print the value of cut_off to the console
cut_off
[1] 25.9
Script.R
# calculate the value of the variable lower_cut_off
lower_cut_off <- round(qnorm(0.025,mean=25,sd=0.55),2)
Script.R
# calculate the z score and assign it to a variable called z_value
z_value <- round((25.95 - 25) / round(3.5/sqrt(40), 2), 2)
Significance testing: one-sided versus two-sided (4)
In the last exercises we calculated a p value corresponding to a one-sided test. Given the fact that we were
testing against a significance level of 0.05, we have actually found a significant result. But what if we would
have done a two-sided hypothesis test?
In the instructions of the last exercise, we found a sample mean of exactly 26. When doing a one-sided
hypothesis test, we find a corresponding p value of 0.04 for our z score of 1.81. If we would however do a two-
sided hypothesis test, we should not only look for P(Z > 1.81). In this case we should test for
both P(Z > 1.81) and P(Z < -1.81). As such, to get the p value that corresponds to a z score
of 1.81 we have to sum both P(Z > 1.81) and P(Z < -1.81). As the Z distribution we are
working with is symmetric, we can multiply the outcome of round(pnorm(1.81, lower.tail =
FALSE), 2) by 2. This would yield a p-value of 0.07, in which case we would fail to reject the null hypothesis
as 0.07 is larger than 0.05.
Imagine that we found a sample mean of 25.95 with a sample size of 40. Calculate the corresponding test
statistic, a z score in this case, and assign it to the variable z_value. Assume that the population mean
and standard deviation are the same as described above. Round all values to two decimals.
Assume that we are doing a two-sided hypothesis test. Use the function pnorm() to find the
corresponding p value and print this to the console. Round the obtained p value to two decimals.
Script.R
# calculate the z score and assign it to a variable called z_value
z_value <- round((25.95-25)/round(3.5/sqrt(40),2),2)
Script.R
#' calculate the probability of answering 12 or more questions correctly given
#' that the student is merely guessing and store this in the variable p_value
p_value <- round(pbinom(11, size = 25, prob = 0.20, lower.tail = FALSE), 2)
Calculate the mean (expected probability) and standard error and store them in the
variables average and se. Remember that we worked with an exam of 25 questions and the probability
of guessing the correct answer on a question was 0.20. Round these values to 2 digits.
Assume that a student answered 12 questions correctly. Now calculate the z value and store this in the
variable z_value. Round this value to 2 digits.
Lastly, calculate the associated p value, round this value to two digits and store it in the variable p_value.
Remember that we are doing a one-sided hypothesis test.
print p_value to the console
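The solution script itself is not shown; following the instructions, it would presumably read (the output below matches this reconstruction):
# calculate the mean and standard error
average <- round(25 * 0.20, 2)
se <- round(sqrt(25 * 0.20 * (1 - 0.20)), 2)
# calculate the z value for 12 correct answers
z_value <- round((12 - average) / se, 2)
# calculate the one-sided p value, round it and print it
p_value <- round(pnorm(z_value, lower.tail = FALSE), 2)
p_value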
[1] 0
The t distribution
Often when comparing means of continuous variables, we use a t distribution instead of the normal distribution.
The main reason to use the t distribution here is because we often have to deal with small samples.
Now imagine the following example of height: they say that Dutch people are among the tallest in the world, with
an average male height of 185 centimetres and a standard deviation of 5 centimetres. We take a sample of 50
males from this population and find an average height of 186.5 centimetres, which is above the population mean.
Imagine we want to do a one-sided hypothesis test where we check whether the population mean of Dutch male
height is larger than 185, and we use a significance level of 0.05. There are several things we can do now and one
thing that we must do.
Firstly, we need to calculate the degrees of freedom, which refers to the amount of independent samples in the
set of data and is equal to the sample size - 1. Thus, the degrees of freedom here is 50 - 1 = 49. Secondly, we
could either calculate the associated p value or, alternatively, we could calculate the critical cut-off value. The
critical cut-off value in this case is the 95th percentile as we are doing a one-sided hypothesis test.
Calculate the critical cut-off value using the qt() function given the fact that we perform a one-sided
hypothesis test with a significance level of 0.05. Round this value to two digits and store it in a variable
called cut_off. You can look up the help documentation of this function by typing help(qt) in the
console.
Print the value of cut_off to the console.
Script.R
# calculate the critical cut off value and store it in a variable called cut_off
cut_off <- round(qt(0.95, df = 49), 2)
# print the value of cut_off to the console
cut_off
[1] 1.68
The formula for the t value is the same as the formula for the Z value:
$$t = \frac{\bar{x} - \mu_0}{se}$$
Using our example where we had a sample of 50 males with a mean height of 186.5 and a population
standard deviation of 5 and population mean of 185, calculate the associated standard error, round this
value to two digits and store it in the variable se.
Calculate the associated t value, round it to two digits and store it in the variable t_value. Remember
to use the same formula as when calculating a z score.
Using the pt() function with lower.tail = FALSE, calculate the associated p value, round it to
two digits and store it in a variable called p_value. Remember that we are doing a one-sided test.
Script.R
# calculate the standard error and store it in the variable se
se <- round(5 / sqrt(50), 2)
# calculate the t value and store it in a variable called t_value
t_value <- round((186.5 - 185) / se, 2)
# calculate the p value and store it in a variable called p_value
p_value <- round(pt(t_value, df = 49, lower.tail = FALSE), 2)
# print p_value to the console
p_value
[1] 0.02
Script.R
# calculate the t value and store it in the variable t_value
t_value <- round(qt(0.975, df = 49), 2)
#' calculate a 95% confidence interval as a vector with two values and store it in
#' a variable called conf_interval
conf_interval <- round(186.5 + c(-1, 1) * t_value * 0.71, 2)
# print conf_interval to the console
conf_interval
[1] 185.07 187.93