Lab Manual Ch4
Probability Distributions
Contents
Calculating probabilities
  Binomial distribution
  Poisson distribution
  Normal distribution
  Empirical distributions
for loop
if/else statement
Sampling distributions
  Central limit theorem
Exercises
This week, we start by learning to use probability distribution functions to calculate the probabilities of
events. Many distributions model specific kinds of random events; we will get to know three important ones.
Then we proceed to sampling distributions. They are probability distributions themselves, but they describe
a sample of elements drawn from a population rather than a single element. Because any kind of statistical
analysis deals with samples, sampling distributions play a key role in the whole field of statistical
inference.
Calculating probabilities
Binomial distribution
Suppose a population consists of elements that each belong to one of two possible categories, and a certain
number of elements are drawn independently from it. The number of drawn elements that belong to a given
category is a random variable that follows the binomial distribution. One of the categories is
arbitrarily called “success”, and each independent draw is called a “trial”. The probability of success is the
proportion of elements in the population that belong to the “success” category. The binomial distribution
gives the probability of each possible outcome (number of successes) for a given number of trials and a
specific probability of success.
The probability mass function dbinom() gives the probability of observing an outcome exactly equal to x.
# probability of observing 5 successes out of 12
# when success probability for a single trial is 0.6
dbinom(x = 5, size = 12, prob = 0.6)
[Figure: Binomial p=0.6 n=12. x-axis: # of success; y-axis: probability]
Example 1
In a family with 5 children, what is the probability of having exactly 2 boys? (Assume a 0.5:0.5 sex
ratio.)
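One possible way to work through Example 1 is sketched below. It treats each child as an independent trial with success probability 0.5 (the assumption stated in the example) and cross-checks dbinom() against the binomial formula; the variable names are only illustrative.

```r
# Example 1: exactly 2 boys among 5 children, P(boy) = 0.5
p_two_boys = dbinom(x = 2, size = 5, prob = 0.5)
p_two_boys
## [1] 0.3125

# the same value from the binomial formula:
# choose(n, x) * p^x * (1 - p)^(n - x)
p_by_formula = choose(5, 2) * 0.5^2 * (1 - 0.5)^(5 - 2)
all.equal(p_two_boys, p_by_formula)
```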
Poisson distribution
A random variable representing the number of events that occur independently in a fixed interval of time or
space, at a constant rate of occurrence, follows the Poisson distribution.
The probability mass function dpois() gives the probability of observing an outcome exactly equal to x.
# probability of observing 4 events
# when rate of occurrence is 2.4
dpois(4, lambda = 2.4)
# probabilities of outcomes from 0 to 10
dpois(0:10, lambda = 2.4)
[Figure: Poisson lambda=2.4. x-axis: # of events; y-axis: probability]
Example 2
In a cell culture, on average 1.3 cell divisions occur per minute. You observe the culture for 5
minutes. What is the probability of observing exactly 4 divisions?
# the average is given per minute, but the question asks about a 5-minute interval,
# so we convert the rate to divisions per 5 minutes
1.3*5 # average number of divisions per 5 minutes
## [1] 6.5
dpois(4, lambda = 1.3*5)
## [1] 0.1118222
Normal distribution
The normal distribution is a continuous probability distribution, in contrast to the previous two distributions,
which are discrete. Many quantities in nature follow a normal distribution; in the next section we will see
why this is the case and why this distribution is so important.
Each specific normal distribution is characterized by its mean and standard deviation.
When a random variable is continuous, the probability of a single exact outcome is not meaningful: it is
zero for any exact value. We will always talk about the probability of observing an outcome within a range of values.
The cumulative distribution function pnorm() gives the probability of observing an outcome X ≤ x (the default)
or X > x (with lower.tail = FALSE) for a given value of x.
[Figure: two panels, both titled Normal (177.6, 9.7). x-axis: x (140 to 220); y-axis: Probability density of X=x]
Example 3
Male astronauts at NASA should be between 157.5 and 190.5 cm tall. The height of men in the US is
normally distributed with mean = 177.6 cm and standard deviation = 9.7 cm. What is the probability
that a random US male is
• too short for NASA?
• too tall for NASA?
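A minimal sketch of how Example 3 could be answered with pnorm(), using the mean and standard deviation given in the example. The variable names are illustrative; the two tail probabilities come out at roughly 0.02 and 0.09.

```r
mu = 177.6     # mean height in cm (from the example)
sigma = 9.7    # standard deviation in cm (from the example)

# too short: P(X <= 157.5), the lower tail (default)
p_too_short = pnorm(157.5, mean = mu, sd = sigma)

# too tall: P(X > 190.5), the upper tail
p_too_tall = pnorm(190.5, mean = mu, sd = sigma, lower.tail = FALSE)

# what remains is the probability of being within the allowed range
p_fits = 1 - p_too_short - p_too_tall

p_too_short
p_too_tall
p_fits
```

The upper tail can equivalently be written as 1 - pnorm(190.5, mean = mu, sd = sigma).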
Empirical distributions
When we have data from a whole population, we can calculate the probability of an outcome by simply dividing
the number of elements that satisfy the condition for the outcome by the number of all elements in the population.
When we have only partial data from the population, we may use it as an approximation to the population.
Example 4
• Import the file “KenyaFinches.csv”
• Check distribution of the beaklengths
• What is the probability of observing a random finch with beaklength less than 8 cm? Greater
than 11 cm?
finch = read.csv('KenyaFinches.csv')
head(finch)
## species mass beaklength
## 1 WB.SPARW 40 10.6
## 2 WB.SPARW 43 10.8
## 3 WB.SPARW 37 10.9
## 4 WB.SPARW 38 11.3
## 5 WB.SPARW 43 10.9
## 6 WB.SPARW 33 10.1
dim(finch)
## [1] 45 3
hist(finch$beaklength, main='Beak Length Distribution',
xlab = 'Length', ylim = c(0,14), xlim = c(5,12))
[Figure: histogram titled Beak Length Distribution. x-axis: Length (5 to 12); y-axis: Frequency (0 to 14)]
# probability of a finch having a beak length less than 8 cm:
sum(finch$beaklength<8)/nrow(finch)
## [1] 0.4888889
# probability of a finch having a beak length greater than 11 cm:
sum(finch$beaklength>11)/nrow(finch)
## [1] 0.06666667
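The same kind of empirical probability can be tried without the data file. The sketch below uses a small made-up vector (hypothetical values, not the KenyaFinches data) and also shows that taking the mean() of a logical vector gives the same proportion as dividing the sum by the number of elements.

```r
# hypothetical beak lengths, made up for illustration only
beak = c(6.5, 7.2, 7.9, 8.4, 9.1, 10.3, 10.8, 11.2, 7.5, 11.6)

# proportion of elements satisfying the condition
p_less_8 = sum(beak < 8) / length(beak)
p_less_8
## [1] 0.4

# mean() of TRUE/FALSE values gives the same proportion
p_less_8_mean = mean(beak < 8)

p_greater_11 = mean(beak > 11)
p_greater_11
## [1] 0.2
```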
for loop
Looping is used to execute a set of commands repeatedly. R has several built-in loop commands, and
the easiest one is the for loop. A for loop iterates through a vector (most often a vector of integers), each
time running the commands written inside it. The syntax of a for loop is:
for(i in 1:n){
  # do something to be repeated n times
  # i can be used within this code to
  # do the repeated operation on different values
}
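As a concrete illustration of the syntax above, the following small loop stores the square of each integer from 1 to 5. The vector name squares is only an example.

```r
# create a vector of the right length to hold the results
squares = numeric(5)
for(i in 1:5){
  # i takes the values 1, 2, 3, 4, 5 in turn
  squares[i] = i^2
}
squares
## [1]  1  4  9 16 25
```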
if/else statement
When we want to execute a piece of code depending on a condition, we can use the if(){}else{} statement.
The template below shows the structure of an if/else statement.
# the condition expression can be anything that produces a single TRUE/FALSE value
if(conditionExpression){
  # do something if condition is true
} else{
  # do another thing if condition is false
}
If we want to execute a piece of code when the given condition is true, but don’t want to do anything if the
condition is false, we can use the if(){} statement without the else{}.
Sampling distributions
When we take a sample from a population and calculate a characteristic of that sample, the probabilities
of the possible values of that sample characteristic are given by a sampling distribution.
Perhaps the simplest example of a sampling distribution is the binomial distribution. It is the probability
distribution of the number of successes in samples taken from a population with a Bernoulli distribution. A
Bernoulli-distributed population consists of elements that belong to one of two possible categories.
[Figure: two panels. Left: Bernoulli Distributed Population (x = 0, 1). Right: Sampling Distribution for # of Successes out of 10 (x = 0 to 10). y-axis: Prob. of x]
The sample() function can be used to generate random samples. For sampling with replacement, set
replace = TRUE; when the elements have different probabilities of being drawn, supply a vector of
probabilities in place of NULL in the example code below.
sample(x = vector_to_sample_from, size = sample_size,
replace = FALSE, prob = NULL)
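To connect sample() with the binomial sampling distribution described above, here is a rough simulation sketch. It assumes a Bernoulli population with success probability 0.6 and samples of size 10; the seed, the number of repetitions (5000) and the variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

population = c(0, 1)   # 0 = failure, 1 = success
n_success = numeric(5000)
for(i in 1:5000){
  # one sample of 10 trials with P(success) = 0.6
  one_sample = sample(x = population, size = 10,
                      replace = TRUE, prob = c(0.4, 0.6))
  n_success[i] = sum(one_sample)
}

# simulated proportion of each possible outcome 0..10
simulated = as.vector(table(factor(n_success, levels = 0:10))) / length(n_success)
# probabilities from the binomial distribution
theoretical = dbinom(0:10, size = 10, prob = 0.6)

round(cbind(outcome = 0:10, simulated, theoretical), 3)
```

With enough repetitions, the simulated proportions should lie close to the dbinom() values.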
Central limit theorem
[Figure: two panels. Left: An Arbitrary Population. Right: Sampling Distribution of Sample Means. x-axis: x (−3 to 3); y-axis: Prob. density of x]
We will not attempt to prove the central limit theorem, but we will demonstrate it using a simulation. From an
arbitrary population, we can take many samples and calculate their mean values. With a large number of
samples, we can approximate the probability distribution of sample means.
Example 5
• Use the beaklength column of the KenyaFinches data from the previous example. Recall its distribution.
• Take a sample of size 20 from that population and record its mean.
• Repeat the above step 1000 times, recording each mean as a new element of a vector.
• Plot the distribution of sample means.
• In a 2*2 plotting area, draw 4 distributions of sample means with sample sizes 10, 20, 50,
100.
• For each of the 4 vectors that store sample means, calculate the mean and standard
deviation. Compare them with the mean and standard deviation of the original population.
# now, we can repeat this process many times using for loop:
# first create an empty vector to store sample means
sample_means = c()
for(i in 1:1000){
one_sample = sample(finch$beaklength, 20)
sample_means[i] = mean(one_sample)
}
head(sample_means)
## [1] 8.495 8.740 8.865 9.100 8.565 8.575
hist(sample_means, col='thistle4', main='Distribution of sample means',
xlab = 'sample mean', ylim = c(0,300))
[Figure: histogram titled Distribution of sample means. x-axis: sample mean; y-axis: Frequency]
# now repeat the process for sample sizes: 10, 20, 50, 100
# (sampling with replacement, since 50 and 100 exceed the 45 rows of data):
sample_means10 = c()
for(i in 1:1000){
  one_sample = sample(finch$beaklength, size = 10, replace = TRUE)
  sample_means10[i] = mean(one_sample)
}
sample_means20 = c()
for(i in 1:1000){
  one_sample = sample(finch$beaklength, size = 20, replace = TRUE)
  sample_means20[i] = mean(one_sample)
}
sample_means50 = c()
for(i in 1:1000){
  one_sample = sample(finch$beaklength, size = 50, replace = TRUE)
  sample_means50[i] = mean(one_sample)
}
sample_means100 = c()
for(i in 1:1000){
  one_sample = sample(finch$beaklength, size = 100, replace = TRUE)
  sample_means100[i] = mean(one_sample)
}
# draw all 4 sample mean distributions:
# to make distributions comparable, set same x and y limits:
par(mfrow=c(2,2))
hist(sample_means10, col='thistle4', main='n=10',
     xlab = 'sample mean', xlim = c(7.5,10), ylim=c(0,350))
# use sample_means20 so that all 4 panels come from sampling with replacement
hist(sample_means20, col='thistle4', main='n=20',
     xlab = 'sample mean', xlim = c(7.5,10), ylim=c(0,350))
hist(sample_means50, col='thistle4', main='n=50',
     xlab = 'sample mean', xlim = c(7.5,10), ylim=c(0,350))
hist(sample_means100, col='thistle4', main='n=100',
     xlab = 'sample mean', xlim = c(7.5,10), ylim=c(0,350))
[Figure: 2*2 grid of histograms of sample means, titled n=10, n=20, n=50 and n=100. x-axis: sample mean (7.5 to 10.0); y-axis: Frequency]
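The last step of Example 5 (comparing the means and standard deviations of the sample means with those of the population) could be sketched as follows. Because the data file is not available here, a made-up skewed population stands in for finch$beaklength; that stand-in, the seed and all variable names are assumptions for illustration only.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# hypothetical skewed population (made up; stands in for finch$beaklength)
population = rexp(500, rate = 0.5)
pop_mean = mean(population)
pop_sd = sd(population)

sizes = c(10, 20, 50, 100)
result = data.frame(n = sizes, mean_of_means = NA_real_,
                    sd_of_means = NA_real_, expected_sd = NA_real_)
for(j in 1:length(sizes)){
  means = numeric(1000)
  for(i in 1:1000){
    means[i] = mean(sample(population, size = sizes[j], replace = TRUE))
  }
  result$mean_of_means[j] = mean(means)
  result$sd_of_means[j] = sd(means)
  # population sd divided by sqrt(n), for comparison
  result$expected_sd[j] = pop_sd / sqrt(sizes[j])
}
pop_mean
result
```

In a run like this, the mean of the sample means should stay close to the population mean for every sample size, while their standard deviation shrinks as n grows, approximately following the population standard deviation divided by sqrt(n).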
Exercises
1. Obesity rate A group of researchers surveys obesity rates and reports that 461 out of 2428 randomly
selected people are obese. What is the probability of obtaining this result if in reality 15% of the
population is obese (that is, if each random person has a 0.15 probability of being obese)?
A more meaningful question is the probability of observing 461 or more obese people out of
2428. Calculate that probability with the same probability of success as above (0.15).
2. Seedlings In a field experiment, the abundance of seedlings is observed on a large number of rectangular
plots of the same size and with the same properties. On average there were 1.32 seedlings per plot. What is
the probability that a randomly selected plot has no seedlings?
3. Fish From a fish population with normally distributed (µ = 82, σ = 6.7) weight values (in grams), what
is the probability that you catch a fish heavier than 85 grams?
4. Birth weight Human birth weights are normally distributed with µ = 3339 grams and σ = 612 grams. What
is the probability that a random newborn baby weighs less than 3000 grams?
5. Beans From a population of yellow and green beans with a 7:3 ratio, you want to sample 20 beans and
count the number of yellow ones. Generate a sampling distribution from 500 such samples. Compute how many of
these samples have fewer than 10 yellow beans.
6. Birth weight factors Let us simulate a hypothetical situation where birth weights are determined by
the sum of 5 genetic and 2 environmental factors.
Each genetic factor is the presence/absence of an independent allele. The presence of each allele adds
700 grams to the baby’s weight, and its absence contributes 0. The presence probabilities of the 5 independent
alleles are given as 0.2, 0.3, 0.5, 0.6 and 0.9.
One of the environmental factors is uniformly distributed and adds between 200 and 900 grams
to the baby’s weight. The second environmental factor follows a Poisson distribution with a mean
contribution of 500 grams.
Add up the weight contribution of each factor to obtain a random baby weight. Write a for loop to obtain 1000
baby weights. Draw a histogram of the generated distribution. Does it resemble a normal distribution? Does
this simulation give an intuition about why the normal distribution is frequently encountered in nature?