3141b86-6fd4-7726-D8ad-20a1516bcd Statistics Interview Cheat Sheet - Emmading - Com. All Rights Reserved.
Statistics Interview Cheat Sheet
This summary walks you through some of the most common statistics interview problems and gives you
steps for addressing them. Examples in this summary will help you understand what to expect in a statistics
interview and give you some practice addressing each question type.
By working through the questions in this cheat sheet, you will be able to solve over 40% of statistics
interview questions!
1. Start with some context. For example, when or where is this terminology used?
2. Provide a definition of the concept. Even when explaining a concept to a technical person, you
want to keep the definition easy to understand. Try NOT to sound like Wikipedia or an
advanced textbook. Your ability to explain things in simple terms actually shows a higher level
of understanding.
3. Next, for concepts that can be represented by numbers, you might want to explain what
changes in a particular value mean. For example, what does a higher p-value mean?
4. The final step is optional. You can finish your response by talking about how this concept is
applied in practice. Think about questions such as “Why is this concept widely used?” or “Why
is this concept important to data science?”
1. Use examples and analogies, which are a great way to explain terminology to a non-technical
audience. Try to make connections to things that a layman would be more familiar with to
explain what is unfamiliar.
2. Avoid using technical terms when explaining things to a non-technical audience. For example, if
you use terms like ‘hypothesis testing’, ‘null hypothesis’, or ‘alternative hypothesis’ when
explaining the concept of the power of a test, you will only confuse your audience.
3. As with all conceptual questions, the goal should be to keep your explanations clear and
structured.
Below is a list of conceptual questions ranked by frequency. We’ll provide sample answers to
explain those concepts in a technical way, followed by an explanation to a non-technical audience.
P-value
It’s commonly used in hypothesis testing to connect the dots between observation and conclusion. It is a
conditional probability that measures the probability of obtaining results at least as extreme as those
observed in the given sample, given that the null hypothesis is true. When we say “at least as extreme,” we
mean “containing at least as much evidence in favor of the alternative hypothesis.”
The p-value is commonly used in A/B testing when we have a treatment and a control group and we want
to test whether a metric is different between those groups. Now suppose we run the experiment and
observe different metric values in each group. The smaller the p-value is, the more we are convinced that
there is a difference between the two.
Your friend claims that the average height of adults in your town is 175 cm and you decide to gather
some data to see if he’s right. You randomly select 30 adults from the town, record their height, and
average the measurements. The average probably won’t be exactly 175 cm, and if it’s at all close, your
friend will probably claim that the difference is just due to random chance.
The p-value allows you to quantify how likely your friend’s counter-argument is. Could the difference
really be due to chance, or is his explanation unlikely based on the size of the difference and the
number of individuals you analyzed?
Let’s imagine that the average height in your sample is 172 cm. Then the p-value has the following
interpretation: given the average height in the population really is 175cm, the p-value is the probability
of sampling 30 individuals with an average height that differs from 175 cm by 3 cm or more. A very
small p-value means that your data is very unlikely if the average height in the population is 175 cm;
therefore, it’s more plausible that the average height in the population is NOT, in fact, 175 cm.
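The height example can be turned into a number with a short sketch. The text does not give a standard deviation, so the value of 7 cm below is an assumption made purely for illustration, and the calculation uses a normal (z) approximation rather than an exact t-test.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sd, n):
    """Two-sided p-value for a z-test of a mean (normal approximation)."""
    standard_error = sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Probability of a sample mean at least this far from the null value,
    # in either direction, if the null hypothesis is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# sd=7 is an assumed value; the 30 adults, 172 cm and 175 cm come from the example.
p = two_sided_p_value(sample_mean=172, null_mean=175, sd=7, n=30)
print(f"p-value = {p:.4f}")
```

Under these assumptions the p-value comes out small, so a 3 cm gap would be hard to explain by chance alone. A larger assumed standard deviation or a smaller sample would make the same gap less surprising.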
A type II error occurs when we fail to reject a false null hypothesis; i.e., we conclude that the
observed differences are not statistically meaningful when, in fact, there is a real systematic difference
between groups.
Holding all else constant, we would prefer a test with a lower type II error rate because that would mean
we are more likely to find real differences.
Tip to remember these two concepts: as explained in a Cross Validated post, since “false”
and “negative” have similar meanings, a type II error is a “false negative” or “false false,”
because it contains two falses. By comparison, a type I error is a “false positive” that has only
one “false” in it.
The higher the statistical power, the better the test is. Power is commonly used in experimental design to
calculate the minimum sample size required so that one can reasonably detect an effect.
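As a rough illustration of how power drives sample size, here is a minimal sketch for a one-sided z-test with a known standard deviation. The function names, the effect size of 1, the standard deviation of 5, and the 80% power target are all assumptions chosen for the example, not values from this cheat sheet.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)     # rejection threshold under H0
    shift = effect / (sd / sqrt(n))                # true shift in standard-error units
    return 1 - std_normal.cdf(critical_z - shift)  # P(reject H0 | effect is real)

def min_sample_size(effect, sd, power=0.8, alpha=0.05):
    """Smallest n whose power reaches the target (simple linear search)."""
    n = 2
    while z_test_power(effect, sd, n, alpha) < power:
        n += 1
    return n

n_needed = min_sample_size(effect=1.0, sd=5.0)
type_2_error_rate = 1 - z_test_power(effect=1.0, sd=5.0, n=n_needed)
print(n_needed, round(type_2_error_rate, 3))
```

The type II error rate is one minus the power, so raising the sample size lowers it. When the true effect is zero, the "power" collapses to alpha, which is the type I error rate.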
We’ve just explained Type I Error, Type II Error, and power in a technical way; let’s now describe
them to a non-technical audience. The key is to use an intuitive example to explain them. Below is
an example; feel free to come up with your own!
A person wants to test whether he is infected with COVID. We can break the problem into two scenarios.
In the first scenario, he really does have COVID. In this case, the power is the probability that his
test comes back positive. In contrast, the type II error rate is the probability his test comes back
negative. A type II error is problematic because he should quarantine and seek out medical
treatment, but he doesn’t know that he needs to.
In the second scenario, he does not have COVID. In this scenario, the type I error rate is the
probability that his test comes back positive even though he doesn’t have COVID. A type I error is
problematic because he needlessly quarantines and seeks out medical treatment when he doesn’t
need to.
N: The residuals are normally distributed; this one is less important in large samples
$$\text{sample mean} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
So basically, under some independence and variance assumptions, the sampling distribution of the means
follows a normal distribution no matter what the underlying distribution of the population is.
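A quick simulation can make this concrete. The sketch below draws sample means from an exponential distribution, which is strongly skewed, and checks that they center on μ with variance close to σ²/n. The choice of distribution, the sample size of 50, and the 5000 replicates are assumptions for illustration.

```python
import random
from statistics import fmean, pvariance

random.seed(0)

n = 50             # size of each sample
replicates = 5000  # number of sample means to collect

# Exponential(rate=1) has mean 1 and variance 1, and is far from normal.
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replicates)
]

mean_of_means = fmean(sample_means)
variance_of_means = pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.3f}  (theory: 1.000)")
print(f"variance of sample means: {variance_of_means:.4f} (theory: {1 / n:.4f})")
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the individual draws are skewed.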
Confidence Interval
A confidence interval is used when we want to quantify how confident we are about a given estimate. The
confidence interval is for the true value but we never know what the true value is. That’s why we gather
data: to create a reasonable estimate of the true—but unknown—value.
The confidence interval is a range of numbers with an accompanying confidence level. The confidence
level is the probability that a confidence interval generated from a new (but identically distributed) data set
will contain the true value. Most data scientists like to use a confidence level of 95%, but other values are
common too.
Higher confidence levels require wider confidence intervals. Gathering more data will typically make your
confidence intervals narrower because the additional samples give you more information about the true
value.
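The meaning of the confidence level can be checked by simulation. The sketch below repeatedly draws new data sets, builds a 95% interval from each, and counts how often the interval contains the true mean. It assumes normally distributed data with a known standard deviation (a z-interval); the true mean of 10, standard deviation of 2, and sample size of 40 are made-up values.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_interval(data, sd, confidence=0.95):
    """Confidence interval for the mean when the population sd is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(len(data))
    center = fmean(data)
    return center - margin, center + margin

random.seed(1)
true_mean, sd, n, trials = 10.0, 2.0, 40, 2000

hits = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    low, high = z_interval(sample, sd)
    hits += low <= true_mean <= high  # True counts as 1

coverage = hits / trials
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land near 0.95. Calling `z_interval` with `confidence=0.99` on the same data gives a wider interval, and increasing `n` gives a narrower one.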
Covariance
Measure: the direction of the linear relationship between two variables.
Equation: $$\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
Unit: product of the units of the two variables.

Correlation
Measure: the strength of the linear relationship between two variables.
Equation: $$r_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}$$
Unit: none.
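The difference in units is easy to see in code. The sketch below computes both quantities for the same data measured in meters and in centimeters; the helper functions and the height and weight values are made up for illustration.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_m = [1.60, 1.70, 1.75, 1.80, 1.90]   # made-up data
weights_kg = [55, 65, 70, 72, 90]            # made-up data
heights_cm = [100 * h for h in heights_m]    # same heights, different unit

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

Switching from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is easier to compare across data sets.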
Over 80% of calculation questions involve combinatorics and probability. The equations below are
helpful to review before interviews because they can be used to answer most calculation questions.
Rule of Sum
If there are m ways to do a, n ways to do b, and a and b cannot happen simultaneously,
then the number of ways to do a or b is m + n:
Ways to do a or b = m + n
Rule of Product
If a and b can both happen, with n ways to do a and m ways to do b, then there are n × m
ways to do both:
Ways to do a and b = n × m
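Both counting rules can be verified by brute-force enumeration. The menu items in this sketch are hypothetical.

```python
from itertools import product

starters = ["soup", "salad", "bread"]   # 3 ways to pick a starter
desserts = ["cake", "fruit"]            # 2 ways to pick a dessert

# Rule of sum: order exactly ONE dish, either a starter or a dessert.
single_dish_options = starters + desserts

# Rule of product: order a starter AND a dessert.
two_course_options = list(product(starters, desserts))

print(len(single_dish_options))  # 3 + 2 = 5
print(len(two_course_options))   # 3 * 2 = 6
```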
Probability
Bayes’ Theorem (Bayes' rule)
Bayes’ rule is one of the most important rules in probability. It deals with conditional probabilities and
provides a rule for changing existing beliefs based on new data.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
A, B are events
P(A∣B): the probability of A occurring given that B is true
P(B∣A): the probability of B occurring given that A is true
P(A), P(B): the marginal probabilities of A and B, that is, the probability of each event on its own
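A classic use of Bayes' rule is interpreting a diagnostic test. In the sketch below, A is "has the disease" and B is "tests positive"; P(B) is expanded with the law of total probability. The prevalence, sensitivity, and false positive rate are made-up numbers for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Assumed values: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_disease_given_positive = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")
```

Even with a fairly accurate test, the posterior stays well below 50% here because the disease is rare, which is the kind of belief update the rule formalizes.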
Expectation
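The expected value of a discrete random variable is a weighted average of all possible outcomes,
with each outcome weighted by its probability.
$$E[X] = \sum_{n=1}^{\infty} x_n \, P(X = x_n)$$
With a finite number of outcomes, the sum stops at the last outcome.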
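For a finite distribution the formula is a one-line sum. This sketch uses a fair six-sided die as an assumed example and exact fractions to avoid rounding.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of outcome * probability over a {outcome: probability} mapping."""
    return sum(outcome * prob for outcome, prob in distribution.items())

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(fair_die)
print(die_mean)  # 7/2
```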
Binomial Distribution
Number of successes among n independent and identically distributed Bernoulli trials. Each trial has the
same probability of getting a success p. It is commonly used to model the number of successes in a
sample of size n drawn with replacement from a population of size N.
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k} = \frac{n!}{k!\,(n - k)!}\, p^k (1 - p)^{n-k}$$
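The probability mass function translates directly into code. The coin-flip numbers in this sketch are an assumed example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 5 fair coin flips: C(5, 2) / 2**5 = 10/32.
print(binomial_pmf(2, 5, 0.5))  # 0.3125

# The probabilities over k = 0..n add up to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
```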
Most implementation questions are about t-tests, so it’s essential to review one-sample and two-
sample t-tests as well as Welch’s t-test.
One-Sample T-test
Let $X_1, \ldots, X_n$ be a small random sample (n ≤ 30) from a normal population with mean μ.
If the population is approximately normal, then we can construct our t-statistic. Under H0
$$T = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}$$
where $\bar{X}$ is the sample mean, $\mu_0$ is a constant, and $s$ is the sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Confidence interval
$$\bar{X} \pm t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}$$
If the sample size is large, we could instead use: $\bar{X} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$
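Interviewers often ask for the one-sample t-test to be coded from scratch. The sketch below uses only the standard library, which has no t-distribution, so the critical value is passed in by hand: 2.262 is the standard table value for 9 degrees of freedom at α = 0.05 (two-sided). The data and function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def one_sample_t(data, mu0):
    """Return the t-statistic and its degrees of freedom (n - 1)."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)  # stdev divides by n - 1
    return (fmean(data) - mu0) / standard_error, n - 1

def mean_interval(data, critical_value):
    """Sample mean +/- critical_value * s / sqrt(n)."""
    margin = critical_value * stdev(data) / sqrt(len(data))
    center = fmean(data)
    return center - margin, center + margin

heights = [172, 168, 175, 171, 169, 174, 170, 173, 166, 172]  # made-up sample
t_stat, df = one_sample_t(heights, mu0=175)

T_CRITICAL_9DF = 2.262  # t-table value for df = 9, two-sided alpha = 0.05
t_interval = mean_interval(heights, T_CRITICAL_9DF)
# Large-sample version, shown only to compare widths (n = 10 is not large).
z_interval = mean_interval(heights, NormalDist().inv_cdf(0.975))

print(t_stat, df, t_interval, z_interval)
```

The t-based interval is wider than the z-based one because the t critical value accounts for the extra uncertainty in estimating s from a small sample. In practice a library routine would also return the p-value.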
t-statistic
$$T = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$$
Confidence interval
$$\bar{X}_1 - \bar{X}_2 \pm t_{n_1 + n_2 - 2,\alpha/2} \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
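Here is a minimal sketch of the pooled two-sample t-statistic. The text above does not spell out s_p, so the code assumes the standard pooled definition, in which each sample variance is weighted by its degrees of freedom. The function name and data are made up.

```python
from math import sqrt
from statistics import fmean, variance

def pooled_t(sample1, sample2):
    """Two-sample t-statistic assuming equal variances, with its df."""
    n1, n2 = len(sample1), len(sample2)
    # Assumed (standard) pooled variance: weight each sample variance by n - 1.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    s_p = sqrt(pooled_var)
    t = (fmean(sample1) - fmean(sample2)) / (s_p * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

control = [1, 2, 3, 4, 5]     # made-up data
treatment = [2, 3, 4, 5, 6]   # made-up data
t_pooled, df_pooled = pooled_t(control, treatment)
print(t_pooled, df_pooled)
```

The pooled version is only appropriate when the two groups plausibly share the same variance; otherwise Welch's version is the safer default.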
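t-statistic
Unpooled standard error: $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim t_{df}$$
where $s_1 = \sqrt{\frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1}}$, $s_2 = \sqrt{\frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2 - 1}}$,
and $df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1 - 1) + \left(s_2^2/n_2\right)^2/(n_2 - 1)}$.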
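Confidence interval
$$\bar{X}_1 - \bar{X}_2 \pm t_{df,\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where $s_1 = \sqrt{\frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1}}$, $s_2 = \sqrt{\frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2 - 1}}$,
and $df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1 - 1) + \left(s_2^2/n_2\right)^2/(n_2 - 1)}$.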
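The unpooled statistic and its degrees of freedom can be sketched the same way. The function name and data below are made up, and the code stops at the statistic and df, since the standard library cannot look up t critical values.

```python
from math import sqrt
from statistics import fmean, variance

def welch_t(sample1, sample2):
    """Welch's t-statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1  # s1^2 / n1
    v2 = variance(sample2) / n2  # s2^2 / n2
    t = (fmean(sample1) - fmean(sample2)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

group_a = [1, 2, 3, 4, 5]               # made-up data
group_b = [2, 3, 4, 5, 6]               # same spread and size as group_a
group_c = [0, 5, 10, 15, 20, 25, 30]    # much larger spread, different size

print(welch_t(group_a, group_b))
print(welch_t(group_a, group_c))
```

With equal sample sizes and equal sample variances, as in the first call, the df equals n1 + n2 - 2 and the statistic matches the pooled test. With unequal variances the df is typically not an integer and falls between min(n1, n2) - 1 and n1 + n2 - 2.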