Data Science - Probability
Section 1 Overview
Section 1 introduces you to Discrete Probability and is divided into three parts.
There are 3 assignments that use the DataCamp platform for you to practice your
coding skills. There are also some quick probability calculations for you to perform
directly on the edX platform, and a longer set of problems at the end of Section 1.
We encourage you to use R to interactively test out your answers and further your
learning.
1.1 Introduction to Discrete Probability
Discrete Probability
Textbook link
This video corresponds to the textbook section on discrete probability.
Key points
The probability of an event is the proportion of times the event occurs when we repeat the experiment independently under the same conditions.
The set.seed() function
Before we continue, we will briefly explain the following important line of code:
set.seed(1986)
Throughout this book, we use random number generators. This implies that
many of the results presented can actually change by chance, which then
suggests that a frozen version of the book may show a different result than
what you obtain when you try to code as shown in the book. This is actually
fine since the results are random and change from time to time. However, if
you want to ensure that results are exactly the same every time you run
them, you can set R’s random number generation seed to a specific number.
Above we set it to 1986. We want to avoid using the same seed every time. A
popular way to pick the seed is the year - month - day. For example, we
picked 1986 on December 20, 2018: 2018 − 12 − 20 = 1986.
You can learn more about setting the seed by looking at the documentation:
?set.seed
In the exercises, we may ask you to set the seed to assure that the results
you obtain are exactly what we expect them to be.
If you are running R 3.6 or later, you can revert to the original seed-setting
behavior by adding the argument sample.kind = "Rounding". For example:
set.seed(1, sample.kind = "Rounding")
beads <- rep(c("red", "blue"), times = c(2, 3))   # an urn with 2 red and 3 blue beads
To find the probability of drawing a blue bead at random, you can run:
mean(beads == "blue")
[1] 0.6
This code is broken down into steps inside R. First, R evaluates the logical
statement beads == "blue", which generates the vector:
FALSE FALSE TRUE TRUE TRUE
When mean() is applied, these logical values are coerced to the numeric vector 0 0 1 1 1.
The mean of the zeros and ones thus gives the proportion of TRUE values. As
we have learned and will continue to see, probabilities are directly related to
the proportion of events that satisfy a requirement.
Probability Distributions
Textbook link
This video corresponds to the textbook section on probability distributions.
Key points
If two events A and B are independent, Pr(A|B) = Pr(A).
To determine the probability of multiple events occurring, we use
the multiplication rule.
Equations
The multiplication rule for independent events is:
Pr(A and B and C) = Pr(A) × Pr(B) × Pr(C)
We can expand the multiplication rule for dependent events to more than two
events:
Pr(A and B and C) = Pr(A) × Pr(B | A) × Pr(C | A and B)
number <- "Three"
suit <- "Hearts"
paste(number, suit)   # joins strings, separated by a space
paste(letters[1:5], as.character(1:5))   # operates element-wise on vectors
library(gtools)
all_phone_numbers <- permutations(10, 7, v = 0:9)   # all 7-digit phone numbers
n <- nrow(all_phone_numbers)
index <- sample(n, 5)   # pick 5 rows at random to inspect
all_phone_numbers[index,]
# probability of a natural 21, checking for both ace-first and ace-second orders
# (assumes deck, aces, and facecard are defined as in the textbook's card game code)
B <- 10000
results <- replicate(B, {
  hand <- sample(deck, 2)
  (hand[1] %in% aces & hand[2] %in% facecard) |
    (hand[2] %in% aces & hand[1] %in% facecard)
})
mean(results)
n <- 50
B <- 10000
results <- replicate(B, {
  bdays <- sample(1:365, n, replace = TRUE)   # generate n random birthdays
  any(duplicated(bdays))
})
mean(results) # calculates proportion of groups with duplicated bdays
sapply
Textbook links
The textbook discussion of the basics of sapply() can be found in this textbook section.
The textbook discussion of sapply() for the birthday problem can be found within the birthday problem section.
Key points
Some functions in R, such as sqrt() and *, automatically apply element-wise to vectors; others, including functions we define ourselves, do not.
The function sapply(x, f) applies any function f element-wise to each entry of the vector x.
Code: Function for birthday problem Monte Carlo simulations
Note that the function body of compute_prob() is the code that we wrote in
the previous video. If we write this code as a function, we can
use sapply() to apply this function to several values of n.
compute_prob <- function(n, B = 10000) {
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}
x <- 1:10
sqrt(x)   # sqrt() is vectorized
y <- 1:10
x*y       # arithmetic operates element-wise
x <- 1:10
sapply(x, sqrt)   # equivalent to sqrt(x)
n <- seq(1, 60)
prob <- sapply(n, compute_prob)   # apply compute_prob to each group size
plot(n, prob)
The larger the number of Monte Carlo replicates B, the more accurate the
estimate.
Determining the appropriate size for B can require advanced statistics.
One practical approach is to try many sizes for B and look for sizes that
provide stable estimates.
Code: Estimating a practical value of B
This code runs Monte Carlo simulations to estimate the probability of shared
birthdays using several B values and plots the results. When B is large enough
that the estimated probability stays stable, then we have selected a useful
value of B.
B <- 10^seq(1, 5, len = 100)   # a range of candidate values of B
compute_prob <- function(B, n = 22){
  same_day <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)
    any(duplicated(bdays))
  })
  mean(same_day)
}
prob <- sapply(B, compute_prob)
plot(log10(B), prob, type = "l")   # estimates stabilize as B grows
Key points
The addition rule states that the probability of event A or
event B happening is the probability of event A plus the probability of
event B minus the probability of both events A and B happening
together.
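Equation
Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B)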
The Monty Hall Problem
Textbook section
Here is the textbook section on the Monty Hall Problem.
Key points
Monte Carlo simulations can be used to simulate random outcomes, which makes them useful when exploring less intuitive problems like the Monty Hall problem.
Sticking with your original door gives a probability of winning of 1/3, while switching doors gives a probability of winning of 2/3.
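The course's stick and switch simulations are not fully reproduced above. Here is a minimal sketch of the same idea; the monty_hall() function is our own illustration, not the lecture's exact code:
B <- 10000
monty_hall <- function(strategy){
  doors <- as.character(1:3)
  prize_door <- sample(doors, 1)    # door hiding the car
  my_pick <- sample(doors, 1)       # contestant's first choice
  shown <- sample(doors[!doors %in% c(my_pick, prize_door)], 1)   # host opens a goat door
  final_pick <- if (strategy == "stick") my_pick else doors[!doors %in% c(my_pick, shown)]
  final_pick == prize_door          # TRUE if the contestant wins
}
mean(replicate(B, monty_hall("stick")))    # about 1/3
mean(replicate(B, monty_hall("switch")))   # about 2/3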
Section 2 Overview
There is 1 assignment that uses the DataCamp platform for you to practice your coding
skills as well as a set of questions on the edX platform at the end of Section 2.
This section corresponds to the continuous probability section of the course textbook.
We encourage you to use R to interactively test out your answers and further your
learning.
Continuous Probability
Textbook links
This video corresponds to the textbook section on continuous probability.
The previous discussion of the CDF is from the Data Visualization course. Here is
the textbook section on the CDF.
Key points
The cumulative distribution function (CDF) is a distribution function for
continuous data that reports the proportion of the data below a for all
values of a:
F(a) = Pr(x ≤ a)
library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex == "Male") %>% pull(height)   # define x as male heights
Given a vector x, we can define a function for computing the CDF of x using:
F <- function(a) mean(x <= a)   # proportion of values in x less than or equal to a
We can estimate the probability that a male is taller than 70.5 inches using:
1 - pnorm(70.5, mean(x), sd(x))
# probabilities in actual data over other ranges don't match normal approx as well
Probability Density
Textbook link
This video corresponds to the textbook section on probability density.
Key points
The probability of a single value is not defined for a continuous
distribution.
The quantity with the most similar interpretation to the probability of a
single value is the probability density function f(x).
The probability density f(x) is defined such that the integral of f(x) over a range
gives the CDF of that range:
F(b) − F(a) = Pr(a < X ≤ b) = integral of f(x) dx from a to b
library(tidyverse)
x <- seq(-4, 4, length = 100)
data.frame(x, f = dnorm(x)) %>%
ggplot(aes(x, f)) +
geom_line()
library(tidyverse)
library(dslabs)
data(heights)
x <- heights %>% filter(sex == "Male") %>% pull(height)
# generate simulated height data using normal distribution - both datasets should have n observations
n <- length(x)
avg <- mean(x)
s <- sd(x)
simulated_heights <- rnorm(n, avg, s)
# plot distribution of simulated_heights
data.frame(simulated_heights = simulated_heights) %>%
  ggplot(aes(simulated_heights)) +
  geom_histogram(color = "black", binwidth = 2)
# simulate B groups of 800 randomly selected adult males, recording the tallest in each
B <- 10000
tallest <- replicate(B, {
  simulated_data <- rnorm(800, avg, s)
  max(simulated_data)
})
mean(tallest >= 7*12) # proportion of times that tallest person exceeded 7 feet (84 inches)
First we'll simulate an ACT test score dataset and answer some questions
about it.
(IMPORTANT NOTE! If you use R 3.6 or later, you will need to use the
command format set.seed(x, sample.kind = "Rounding") instead
of set.seed(x). Your R version will be printed at the top of the Console
window when you start RStudio.)
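The simulation setup itself is not reproduced above. Based on the mean (20.9) and standard deviation (5.7) used throughout the questions below, the setup is approximately as follows; the seed value of 16 is taken from the original problem set and should be treated as an assumption:
set.seed(16, sample.kind = "Rounding")   # assumed setup per the course problem set
act_scores <- rnorm(10000, 20.9, 5.7)    # 10,000 simulated ACT scores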
Question 1a
What is the mean of act_scores? / A: 20.8
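One way to compute this:
mean(act_scores)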
Question 1b
What is the standard deviation of act_scores? / A: 5.68
sd(act_scores)
Question 1c
A perfect score is 36 or greater (the maximum reported score is 36).
In act_scores, how many perfect scores are there out of 10,000 simulated
tests? / A: 41
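One way to count them:
sum(act_scores >= 36)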
Question 1d
In act_scores, what is the probability of an ACT score greater than 30? /
A: 0.0527
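One way to compute this:
mean(act_scores > 30)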
Question 1e
In act_scores, what is the probability of an ACT score less than or equal to 10?
/ A: 0.0282
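One way to compute this:
mean(act_scores <= 10)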
Question 2
Set x equal to the sequence of integers 1 to 36. Use dnorm() to determine the
value of the probability density function over x given a mean of 20.9 and
standard deviation of 5.7; save the result as f_x. Plot x against f_x.
x <- 1:36
f_x <- dnorm(x, 20.9, 5.7)
data.frame(x, f_x) %>%
ggplot(aes(x, f_x)) +
geom_line()
Question 3a
What is the probability of a Z-score greater than 2 (2 standard deviations
above the mean)? / A: 0.0233
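One way to compute this, converting act_scores to Z-scores:
z_scores <- (act_scores - mean(act_scores))/sd(act_scores)
mean(z_scores > 2)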
Question 3b
What ACT score value corresponds to 2 standard deviations above the mean
(Z = 2)? / A: 32.2
2*sd(act_scores) + mean(act_scores)
Question 3c
A Z-score of 2 corresponds roughly to the 97.5th percentile.
Use qnorm() to determine the 97.5th percentile of normally distributed data
with the mean and standard deviation observed in act_scores.
What is the 97.5th percentile of act_scores? / A: 32.0
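For example:
qnorm(0.975, mean(act_scores), sd(act_scores))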
In this 4-part question, you will write a function to create a CDF for ACT
scores.
Write a function that takes a value and produces the probability of an ACT
score less than or equal to that value (the CDF). Apply this function to the
range 1 to 36.
Question 4a
What is the minimum integer score such that the probability of that score or
lower is at least .95? / A: 31
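One way to compute this, applying the CDF function over the range 1 to 36:
cdf <- sapply(1:36, function(a) mean(act_scores <= a))   # empirical CDF at each integer score
min(which(cdf >= 0.95))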
Question 4b
Use qnorm() to determine the expected 95th percentile, the value for which the
probability of receiving that score or lower is 0.95, given a mean score of 20.9
and standard deviation of 5.7.
What is the expected 95th percentile of ACT scores? / A: 30.3
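For example:
qnorm(0.95, 20.9, 5.7)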
Question 4c
Make a vector of sample quantiles of act_scores over the probabilities p <- seq(0.01, 0.99, 0.01) using quantile(); save the result as sample_quantiles (these are referenced in Question 4d below).
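A sketch of that step:
p <- seq(0.01, 0.99, 0.01)
sample_quantiles <- quantile(act_scores, p)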
Question 4d
Make a corresponding set of theoretical quantiles using qnorm() over the
interval p <- seq(0.01, 0.99, 0.01) with mean 20.9 and standard deviation 5.7.
Save these as theoretical_quantiles. Make a QQ-plot
graphing sample_quantiles on the y-axis versus theoretical_quantiles on the x-axis.
Which of the following graphs is correct?
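One way to produce the QQ-plot, assuming p and sample_quantiles from Question 4c:
theoretical_quantiles <- qnorm(p, 20.9, 5.7)
plot(theoretical_quantiles, sample_quantiles)   # points near the identity line indicate a good normal fit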
Section 3 Overview
There are 2 assignments that use the DataCamp platform for you to practice
your coding skills as well as a set of questions on the edX platform at the end
of Section 3.
We encourage you to use R to interactively test out your answers and further your
learning.
Random Variables
Textbook link
This video corresponds to the textbook section on random variables.
Key points
Random variables are numeric outcomes resulting from random processes.
Code: Modeling a random variable
beads <- rep(c("red", "blue"), times = c(2, 3))   # the urn from the discrete probability section
X <- ifelse(sample(beads, 1) == "blue", 1, 0)     # X is 1 if a blue bead is drawn, 0 otherwise
# X is a random variable: each of these lines can give a different value
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
ifelse(sample(beads, 1) == "blue", 1, 0)
Sampling Models
Textbook link and additional information
This video corresponds to the textbook section on sampling models.
You can read more about the binomial distribution here.
Key points
A sampling model models the random behavior of a process as the sampling of draws from an urn.
color <- rep(c("Black", "Red", "Green"), c(18, 18, 2)) # define the urn for the sampling model
n <- 1000
X <- sample(ifelse(color == "Red", -1, 1), n, replace = TRUE)   # 1000 draws: +$1 if not red, -$1 if red
X[1:10]
We use the sampling model to run a Monte Carlo simulation and use the
results to estimate the probability of the casino losing money.
B <- 10000
S <- replicate(B, {
  X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))   # equivalent sampling model
  sum(X)   # casino winnings over n bets
})
mean(S < 0)   # estimated probability of the casino losing money
library(tidyverse)
data.frame(S = S) %>%
  ggplot(aes(S, ..density..)) +
  geom_histogram(color = "black", binwidth = 10) +
  ylab("Probability")
Key points
Capital letters denote random variables (X) and lowercase letters denote
observed values (x).
In the notation Pr(X = x), we are asking how frequently the random variable X is
equal to the value x. For example, if x = 6, this statement becomes Pr(X = 6).
Key points
The Central Limit Theorem (CLT) says that the distribution of the sum of
many independent draws of a random variable is approximated by a normal distribution.
The expected value of a random variable, E[X] = μ, is the average of the
values in the urn. This represents the expectation of one draw.
The standard error of one draw of a random variable is the standard
deviation of the values in the urn.
The expected value of the sum of draws is the number of draws times
the expected value of the random variable.
The standard error of the sum of independent draws of a random
variable is the square root of the number of draws times the standard
deviation of the urn.
Equations
These equations apply to the case where there are only two
outcomes, a and b with proportions p and 1-p respectively. The general
principles above also apply to random variables with more than two outcomes.
Expected value of one draw: ap + b(1 − p)
Standard error of one draw: |b − a| √(p(1 − p))
Expected value of the sum of n draws: n × (ap + b(1 − p))
Standard error of the sum of n draws: √n × |b − a| √(p(1 − p))
Textbook link
This video corresponds to the textbook section on the statistical properties of
averages.
Key points
Random variable times a constant:
E[aX] = aμ
SE[aX] = |a|σ
Sum of n independent draws of a random variable:
E[X1 + X2 + … + Xn] = nμ
SE[X1 + X2 + … + Xn] = √n σ
Average of n independent draws of a random variable:
E[(X1 + X2 + … + Xn)/n] = μ
SE[(X1 + X2 + … + Xn)/n] = σ/√n
The expected value of the sum of different random variables is the sum of the
individual expected values for each random variable:
E [X1 + X2 + …+ Xn ] = μ1 + μ2 +…+ μ n
The standard error of the sum of different random variables is the square root
of the sum of squares of the individual standard errors:
SE[X1 + X2 + … + Xn] = √(σ1² + σ2² + … + σn²)
Transformation of random variables
If X is a normally distributed random variable and a and b are non-random constants, then aX + b is also a normally distributed random variable.
Key points
The law of large numbers states that as n increases, the standard error
of the average of a random variable decreases. In other words,
when n is large, the average of the draws converges to the average of
the urn.
The law of large numbers is also known as the law of averages.
The law of averages only applies when n is very large and events are
independent. It is often misused to make predictions about an event
being "due" because it has happened less frequently than expected in
a small sample size.
Key points
The sample size required for the Central Limit Theorem and Law of
Large Numbers to apply differs based on the probability of success.
If the probability of success is high, then relatively few
observations are needed.
As the probability of success decreases, more
observations are needed.
If the probability of success is extremely low, such as winning a lottery,
then the Central Limit Theorem may not apply even with extremely
large sample sizes. The normal distribution is not a good
approximation in these cases, and other distributions such as the
Poisson distribution (not discussed in these courses) may be more
appropriate.
An old version of the SAT college entrance exam had a -0.25 point penalty for
every incorrect answer and awarded 1 point for a correct answer. The
quantitative test consisted of 44 multiple-choice questions each with 5 answer
choices. Suppose a student chooses answers by guessing for all questions on
the test.
Question 1a
What is the probability of guessing correctly for one question? A: 0.2
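One way to compute this:
p <- 1/5   # one correct answer out of 5 choices
p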
Question 1b
What is the expected value of points for guessing on one question? A: 0
a <- 1
b <- -0.25
mu <- a*p + b*(1-p)
mu
Question 1c
What is the expected score of guessing on all 44 questions? A: 0
n <- 44
n*mu
Question 1d
What is the standard error of guessing on all 44 questions? A: 3.32
sigma <- sqrt(n) * abs(b-a) * sqrt(p*(1-p))
sigma
Question 1e
Use the Central Limit Theorem to determine the probability that a guessing
student scores 8 points or higher on the test. / A: 0.00793
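Using the CLT with the mean and standard error of the sum from Questions 1c and 1d:
1 - pnorm(8, n*mu, sigma)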
Question 1f
Set the seed to 21, then run a Monte Carlo simulation of 10,000 students
guessing on the test.
(IMPORTANT! If you use R 3.6 or later, you will need to use the
command set.seed(x, sample.kind = "Rounding") instead of set.seed(x) . Your
R version will be printed at the top of the Console window when you start
RStudio.)
What is the probability that a guessing student scores 8 points or higher? A:
0.008
set.seed(21, sample.kind = "Rounding")
B <- 10000
n <- 44
p <- 0.2
tests <- replicate(B, {
X <- sample(c(1, -0.25), n, replace = TRUE, prob = c(p, 1-p))
sum(X)
})
mean(tests >= 8)
Question 2
The SAT was recently changed to reduce the number of multiple choice
options from 5 to 4 and also to eliminate the penalty for guessing.
In this two-part question, you'll explore how that affected the expected values
for the test.
Question 2a
Suppose that the number of multiple choice options is 4 and that there is no
penalty for guessing - that is, an incorrect question gives a score of 0.
What is the expected value of the score when guessing on this new test? A:
11
p <- 1/4
a <- 1
b <- 0
n <- 44
mu <- n * (a*p + b*(1-p))
mu
Question 2b
Consider a range of correct answer probabilities p <- seq(0.25, 0.95,
0.05) representing a range of student skills.
What is the lowest p such that the probability of scoring over 35 exceeds
80%? A: 0.85
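A sketch of one approach, reusing a, b, and n from Question 2a:
p <- seq(0.25, 0.95, 0.05)
exp_val <- sapply(p, function(x){
  mu <- n * (a*x + b*(1-x))
  sigma <- sqrt(n) * abs(b-a) * sqrt(x*(1-x))
  1 - pnorm(35, mu, sigma)   # probability of scoring over 35 at skill level x
})
min(p[which(exp_val > 0.8)])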
The following 7-part question asks you to do some calculations for a roulette
bet on five pockets (00, 0, 1, 2, 3) out of 38 total pockets, which pays 6 to 1: a
win pays $6 and a loss costs $1.
Question 3a
What is the expected value of the payout for one bet? A: -0.0789
p <- 5/38
a <- 6
b <- -1
mu <- a*p + b*(1-p)
mu
Question 3b
What is the standard error of the payout for one bet? A: 2.37
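One way to compute this, reusing a, b, and p from Question 3a:
sigma <- abs(b-a) * sqrt(p*(1-p))
sigma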
Question 3c
What is the expected value of the average payout over 500 bets? A: -0.0789
Remember there is a difference between expected value of the average and
expected value of the sum.
mu
Question 3d
What is the standard error of the average payout over 500 bets? A: 0.106
Remember there is a difference between the standard error of the average
and standard error of the sum.
n <- 500
sigma/sqrt(n)
Question 3e
What is the expected value of the sum of 500 bets? A: -39.5
n*mu
Question 3f
What is the standard error of the sum of 500 bets? A: 52.9
sqrt(n) * sigma
Question 3g
Use pnorm() with the expected value of the sum and standard error of the sum
to calculate the probability of losing money over 500 bets, Pr(X ≤ 0). / A: 0.772
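For example:
pnorm(0, n*mu, sqrt(n)*sigma)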
Section 4 Overview
We encourage you to use R to interactively test out your answers and further your
learning.
Textbook link
This video corresponds to the textbook section on interest rates.
Correction
At 2:35, the displayed results of the code are incorrect. Here are the correct
values:
n*(p*loss_per_foreclosure + (1-p)*0)
[1] -4e+06
sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p))
[1] 885438
Key points
Interest rates for loans are set using the probability of loan defaults to
calculate a rate that minimizes the probability of losing money.
We can define the outcome of loans as a random variable. We can also
define the sum of outcomes of many loans as a random variable.
The Central Limit Theorem can be applied to fit a normal distribution to
the sum of profits over many loans. We can use properties of the
normal distribution to calculate the interest rate needed to ensure a
certain probability of losing money for a given probability of default.
Code: Interest rate sampling model
n <- 1000
loss_per_foreclosure <- -200000
p <- 0.02
defaults <- sample(c(0, 1), n, prob = c(1-p, p), replace = TRUE)
sum(defaults * loss_per_foreclosure)   # total loss for one simulated set of loans
B <- 10000
losses <- replicate(B, {
  defaults <- sample(c(0, 1), n, prob = c(1-p, p), replace = TRUE)
  sum(defaults * loss_per_foreclosure)
})
library(tidyverse)
data.frame(losses_in_millions = losses/10^6) %>%
  ggplot(aes(losses_in_millions)) +
  geom_histogram(binwidth = 0.6, col = "black")
x <- -loss_per_foreclosure*p/(1-p)   # interest charge that makes the expected value zero
x/180000   # corresponding interest rate on a $180,000 loan
z <- qnorm(0.01)   # for a 1% probability of losing money
x <- -loss_per_foreclosure*(n*p - z*sqrt(n*p*(1-p))) / (n*(1-p) + z*sqrt(n*p*(1-p)))
B <- 100000
profit <- replicate(B, {
  draws <- sample(c(x, loss_per_foreclosure), n, prob = c(1-p, p), replace = TRUE)
  sum(draws)
})
mean(profit < 0)   # estimated probability of losing money
Textbook link
This video corresponds to the textbook section on The Big Short.
Key points
The Central Limit Theorem states that the sum of independent draws of
a random variable follows a normal distribution. However, when the
draws are not independent, this assumption does not hold.
If an event changes the probability of default for all borrowers, then the
probability of the bank losing money changes.
Monte Carlo simulations can be used to model the effects of unknown
changes in the probability of default.
Code: Expected value with higher default rate and interest rate
p <- .04
r <- 0.05
x <- r*180000
loss_per_foreclosure*p + x*(1-p)
It follows that:
Pr(S < 0) = Pr(Z < −nμ/(√n σ)), where μ and σ are the expected value and standard error of the profit on a single loan.
To find the value of n for which Pr(S < 0) is less than or equal to our desired value, we
take z = qnorm(0.01), set −nμ/(√n σ) = z, and solve for n:
n ≥ z²σ²/μ²
Code: Calculating number of loans for desired probability of losing money
The number of loans required is:
z <- qnorm(0.01)
l <- loss_per_foreclosure
p <- 0.04
n <- ceiling((z^2*(x-l)^2*p*(1-p))/(l*p + x*(1-p))^2)   # minimum number of loans
n
B <- 10000
profit <- replicate(B, {
  draws <- sample(c(x, l), n, prob = c(1-p, p), replace = TRUE)
  sum(draws)
})
mean(profit)
p <- 0.04
x <- 0.05*180000
profit <- replicate(B, {
  new_p <- 0.04 + sample(seq(-0.01, 0.01, length = 100), 1)   # same shift for all borrowers
  draws <- sample(c(x, loss_per_foreclosure), n, prob = c(1-new_p, new_p), replace = TRUE)
  sum(draws)
})
mean(profit < 0)   # probability of losing money
library(tidyverse)
library(dslabs)
Background
In the motivating example The Big Short, we discussed how discrete and
continuous probability concepts relate to bank loans and interest rates. Similar
business problems are faced by the insurance industry.
Just as banks must decide how much to charge as interest on loans based on
estimates of loan defaults, insurance companies must decide how much to
charge as premiums for policies given estimates of the probability that an
individual will collect on that policy.
We will use data from 2015 US Period Life Tables. Here is the code you
will need to load and examine the data from dslabs:
data(death_prob)
head(death_prob)
There are six multi-part questions for you to answer that follow.
An insurance company offers a one-year term life insurance policy that pays
$150,000 in the event of death within one year. The premium (annual cost) for
this policy for a 50-year-old female is $1,150. Suppose that in the event of a
claim, the company forfeits the premium and loses a total of $150,000, and if
there is no claim the company gains the premium amount of $1,150. The
company plans to sell 1,000 policies to this demographic.
50-year-old males have a different probability of death than 50-year-old
females. We will calculate a profitable premium for 50-year-old males in the
following four-part question.
Suppose that there is a massive demand for life insurance due to the
pandemic, and the company wants to find a premium cost for which the
probability of losing money is under 5%, assuming the death rate stays stable
at p = 0.015.
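A sketch of the premium calculation, following the interest-rate approach from lecture; the variable names and the use of n = 1,000 policies, a $150,000 loss per claim, and p = 0.015 are our reading of the scenario, not code from the course:
p <- 0.015
n <- 1000
l <- -150000                  # loss per claim
z <- qnorm(0.05)              # 5% probability of losing money
x <- -l*(n*p - z*sqrt(n*p*(1-p))) / (n*(1-p) + z*sqrt(n*p*(1-p)))
x                             # premium giving a 5% chance of negative profit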
The company cannot predict whether the pandemic death rate will stay stable.
Set the seed to 29, then write a Monte Carlo simulation that for each of B =
10000 iterations:
(IMPORTANT! If you use R 3.6 or later, you will need to use the
command set.seed(x, sample.kind = "Rounding") instead
of set.seed(x). Your R version will be printed at the top of the Console
window when you start RStudio.)
The outcome should be a vector of B total profits. Use the results of the Monte
Carlo simulation to answer the following three questions.
(Hint: Use the process from lecture for modeling a situation for loans that
changes the probability of default for all borrowers simultaneously.)
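The exact iteration steps are elided above; a sketch of the pattern, modeled on the lecture's loan simulation, is below. The perturbation range is an assumption carried over from that example, and p, x, l, and n are taken from the premium sketch earlier:
set.seed(29, sample.kind = "Rounding")
B <- 10000
profit <- replicate(B, {
  new_p <- p + sample(seq(-0.01, 0.01, length = 100), 1)   # assumed perturbation of the death rate
  draws <- sample(c(x, l), n, prob = c(1 - new_p, new_p), replace = TRUE)
  sum(draws)   # total profit over n policies
})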