Business Statistics Midterm Exam: Fall 2019: BUS41000
Useful formulas
where \(s_a^2\) denotes the sample variance of group a and \(n_a\) the number of observations in group a.
• The standard error for a proportion is defined by: \(s_{\hat{p}} = \sqrt{\hat{p}(1 - \hat{p})/n}\)
where p̂1 and p̂2 denote two independent proportions, and n1 and n2 are the number of trials.
• Bayes’s formula:
\[P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}\]
where A, B are two events.
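• For \(Z \sim N(0, 1)\), \(P(-1 \le Z \le 1) = 68\%\), \(P(-2 \le Z \le 2) = 95\%\), \(P(-3 \le Z \le 3) = 99\%\).
• Similarly, for \(X \sim N(\mu, \sigma^2)\), \(P(\mu - 2\sigma \le X \le \mu + 2\sigma) = 95\%\).
• Standardization to standard normal: assume \(X \sim N(\mu, \sigma^2)\) and \(Z \sim N(0, 1)\); then
\[P(a \le X \le b) = P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right).\]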
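The standardization rule can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `normal_interval_prob` is made up for illustration. It also shows that the percentages on this sheet are rounded rules of thumb (the exact value for three standard deviations is closer to 99.7%).

```python
from statistics import NormalDist


def normal_interval_prob(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), computed by standardizing."""
    z = NormalDist()  # standard normal N(0, 1)
    z_low = (a - mu) / sigma
    z_high = (b - mu) / sigma
    return z.cdf(z_high) - z.cdf(z_low)


# Probability of falling within k standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(normal_interval_prob(-k, k), 4))

# The same probability for a non-standard normal, e.g. N(11, 19^2).
print(round(normal_interval_prob(11 - 2 * 19, 11 + 2 * 19, mu=11, sigma=19), 4))
```

Because only the standardized endpoints matter, the last line gives the same value as the k = 2 row.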
Grading Sheet for TA:
Problem Score
P1
P2
P3
P4
P5
P6
P7
P8
P9
Total
Problem 1: B-17 flying fortress and Wald. [15 points]
During a period in World War II, the U.S. Army Air Forces (AAF) would send over 300 B-17 bombers daily
to raid factories in Germany. These missions, originating in the U.K., were dangerous. In the peak of the
campaign, the return probability for a B-17 crew was only 80%.
In trying to reduce the probability of a failed mission, a Navy statistician, Abraham Wald, was put in charge
of studying the damage patterns in the B-17s that successfully made it back from a mission. His ultimate goal
was to decide where to add extra armor to the planes (you could not just add heavy armor everywhere, as
the planes would be too heavy to fly!). Wald was able to learn that if a plane made it back from a mission,
there was a 67% probability that it was shot in the fuselage, 15% in the fuel systems, 10% in the cockpit area
and 8% in the engines.
From experiments, Wald was also able to deduce that during combat, a B-17 would be shot in the fuselage
with 56% probability, in the fuel systems with 14%, in the cockpit area 14% and engine 16%.
1. Based on this information, what was Wald’s recommendation to the AAF, i.e., if they had to choose one
area of the plane, where should they add extra armor to the B-17s? (Hint: Wald suggested reinforcing
the weakest area: the area with the smallest probability of returning given that it is shot.) [10 points]
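The hint points to Bayes's formula: P(return | shot in area) = P(shot in area | return) × P(return) / P(shot in area). A minimal sketch of that calculation follows. It assumes the combat percentages are the unconditional probabilities of being hit in each area and that each plane is hit in exactly one area; the variable names are illustrative.

```python
# P(return), from the problem statement.
p_return = 0.80

# P(shot in area | plane returned).
p_area_given_return = {
    "fuselage": 0.67,
    "fuel systems": 0.15,
    "cockpit": 0.10,
    "engines": 0.08,
}

# P(shot in area) during combat (assumed unconditional, one hit area per plane).
p_area = {
    "fuselage": 0.56,
    "fuel systems": 0.14,
    "cockpit": 0.14,
    "engines": 0.16,
}

# Bayes's formula applied area by area.
p_return_given_area = {
    area: p_area_given_return[area] * p_return / p_area[area] for area in p_area
}

# The weakest area has the smallest return probability given a hit there.
weakest = min(p_return_given_area, key=p_return_given_area.get)

for area, prob in p_return_given_area.items():
    print(f"{area}: {prob:.3f}")
print("weakest area:", weakest)
```

Under these assumptions the return probabilities are about 0.96 (fuselage), 0.86 (fuel systems), 0.57 (cockpit) and 0.40 (engines), so the engines are the area to reinforce.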
Problem 2: Choosing an agent. [10 points]
You are considering purchasing a house. On a rating site, you have collected data on the two potential
real estate agents in Chicago. For each rating, there are only two categories, YES (recommend) or NO (do not
recommend).
Is Agent SMALL better? Justify your answer using either hypothesis testing or a confidence interval (at the
95% confidence level). [10 points]
\[\frac{1644}{1644 + 548} = 0.75, \qquad \frac{192}{192 + 48} = 0.8\]
Difference is
\[0.75 - 0.8 = -0.05\]
Standard deviation is
\[s = \sqrt{\frac{0.75(1 - 0.75)}{1644 + 548} + \frac{0.8(1 - 0.8)}{192 + 48}} = 0.0274\]
Confidence interval at 95% level
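The interval itself can be computed from the counts used in the solution. This is a sketch only: the function name is hypothetical, the multiplier 1.96 is an assumed choice (the formula sheet's rule of thumb would use 2), and since the rating table is not reproduced here the code does not say which set of counts belongs to Agent SMALL.

```python
from math import sqrt


def diff_proportion_ci(yes1, n1, yes2, n2, z=1.96):
    """Difference p1 - p2, its standard error, and a z-based confidence interval."""
    p1, p2 = yes1 / n1, yes2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z * se, diff + z * se)


# Counts taken from the solution: 1644 YES out of 2192, and 192 YES out of 240.
diff, se, (low, high) = diff_proportion_ci(1644, 1644 + 548, 192, 192 + 48)

print(f"difference = {diff:.3f}, standard error = {se:.4f}")
print(f"95% CI = [{low:.4f}, {high:.4f}]")
print("zero inside interval:", low < 0 < high)
```

With these inputs the interval runs from roughly −0.10 to just above 0. Because it contains 0, the difference between the two recommendation rates is not significant at the 95% level.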
Problem 3: Which insurance to purchase? [10 points]
The next step is to choose a house insurance policy. Suppose there are three options available: standard
policy, premium policy, and no policy (not insured). If you decide on a policy, you will have to buy it for the
whole year.
Policy     Cost per month   Deductible if you file a claim for house damage
Standard   $50              $5000
Premium    $55              $500
Suppose in one year, there is a 1% chance of house damage, and you estimate that the damage will cost you
$200,000.
1. For one year, which of the three options would you choose in expectation? Which option
has the smallest variability? [5 points]
If the house is damaged, you pay the deductible if you are insured and $200,000 otherwise.
Standard
50 ∗ 12 + 5000 ∗ 0.01 = 650
Premium
55 ∗ 12 + 500 ∗ 0.01 = 665
No insurance
200000 ∗ 0.01 = 2000
Variance
Standard
\[0.99 \times (600 - 650)^2 + 0.01 \times (5600 - 650)^2 = 247500\]
Premium
\[0.99 \times (660 - 665)^2 + 0.01 \times (1160 - 665)^2 = 2475\]
No insurance
\[0.99 \times (0 - 2000)^2 + 0.01 \times (200000 - 2000)^2 = 3.96 \times 10^8\]
2. Now suppose you want to stick to a policy for two years. The insurance company is currently running
a promotion: if you do not file a claim in the first year, your monthly cost in the second year will be zero;
otherwise, your monthly fee will stay the same. Suppose the probability of house damage is 1% each year and is
independent across years. Now, which policy do you prefer in expectation? [5 points]
Four possible cases: no damage, damage only in the second year, damage only in the first year,
and damage in both years, with probabilities 0.9801, 0.0099, 0.0099 and 0.0001
respectively.
Standard
0.9801 ∗ 50 ∗ 12 + 0.0099 ∗ (50 ∗ 12 + 5000) + 0.0099 ∗ (50 ∗ 24 + 5000) + 0.0001 ∗ (50 ∗ 24 + 5000 ∗ 2) = 706
Premium
0.9801 ∗ 55 ∗ 12 + 0.0099 ∗ (55 ∗ 12 + 500) + 0.0099 ∗ (55 ∗ 24 + 500) + 0.0001 ∗ (55 ∗ 24 + 500 ∗ 2) = 676.6
No insurance
0.01 ∗ 200000 ∗ 2 = 4000
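Both parts of this problem are expectations over a small number of outcomes, so they can be enumerated directly. The sketch below reproduces the figures above; the names `OPTIONS`, `one_year` and `two_year_mean` are illustrative, and "no insurance" is modeled as a zero premium with the full $200,000 loss.

```python
P_DAMAGE = 0.01
DAMAGE_COST = 200_000

# option -> (monthly premium, out-of-pocket cost if the house is damaged)
OPTIONS = {
    "standard": (50, 5_000),
    "premium": (55, 500),
    "none": (0, DAMAGE_COST),
}


def one_year(monthly, loss):
    """Mean and variance of the one-year cost."""
    base = 12 * monthly
    outcomes = [(1 - P_DAMAGE, base), (P_DAMAGE, base + loss)]
    mean = sum(p * c for p, c in outcomes)
    var = sum(p * (c - mean) ** 2 for p, c in outcomes)
    return mean, var


def two_year_mean(monthly, loss):
    """Expected two-year cost; year two is free after a claim-free year one."""
    total = 0.0
    for d1 in (0, 1):  # damage in year one?
        for d2 in (0, 1):  # damage in year two?
            prob = (P_DAMAGE if d1 else 1 - P_DAMAGE) * (P_DAMAGE if d2 else 1 - P_DAMAGE)
            year2_premium = 12 * monthly if d1 else 0
            total += prob * (12 * monthly + year2_premium + loss * (d1 + d2))
    return total


for name, (monthly, loss) in OPTIONS.items():
    mean, var = one_year(monthly, loss)
    print(f"{name}: 1-year mean {mean:.0f}, variance {var:.0f}, "
          f"2-year mean {two_year_mean(monthly, loss):.1f}")

best_one_year = min(OPTIONS, key=lambda k: one_year(*OPTIONS[k])[0])
least_variable = min(OPTIONS, key=lambda k: one_year(*OPTIONS[k])[1])
best_two_year = min(OPTIONS, key=lambda k: two_year_mean(*OPTIONS[k]))
print(best_one_year, least_variable, best_two_year)
```

Read off the output: for one year the standard policy has the lowest expected cost (650) while the premium policy has the smallest variance (2475); over two years with the promotion, the premium policy has the lowest expected cost (676.6).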
Problem 4: Portfolio. [10 points]
I am building a portfolio composed of SP500 and Bonds. Assume that \(SP500 \sim N(11, 19^2)\) and
\(Bonds \sim N(4, 6^2)\). Here we measure the annual return in percentage (i.e., the Bond has an expected annual return of
4%, with a standard deviation of 6%).
1. Consider a 50-50 split between SP500 and Bonds, and assume the standard deviation of this 50-50 portfolio
is
\[sd(0.5\,SP500 + 0.5\,Bonds) = 11.000\]
Can you figure out the covariance between SP500 and Bonds, as well as the correlation? [3 points]
Because
3. Suppose that you decide to invest $50,000 in a 50-50 split portfolio based on SP500 and Bonds, at the
beginning of 2020. By the end of 2020, you would need to pay the property tax, which follows a
normal distribution with a mean of $9,750 and a standard deviation of $2,398. What is the probability that
the return of your portfolio would be enough to cover your 2020 property tax? [5 points]
My return will be \(X = 50000 \times (r/100) = 500r\), where \(r \sim N(7.5, 11^2)\). So \(X \sim N(3750, 5500^2)\). The tax follows
\(T \sim N(9750, 2398^2)\). Suppose X and T are independent; then
\[X - T \sim N(-6000,\ 5500^2 + 2398^2)\]
\[P(X - T > 0) = P\left(Z > \frac{6000}{\sqrt{5500^2 + 2398^2}}\right) \approx P(Z > 1) = 0.16\]
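Parts 1 and 3 both rest on the variance of a weighted sum, Var(wS + wB) = w²Var(S) + w²Var(B) + 2w²Cov(S, B). The sketch below solves that identity for the covariance and then recomputes the probability in part 3. The covariance and correlation values are derived here from the stated standard deviations and are not quoted from the solution; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean_sp, sd_sp = 11.0, 19.0      # SP500, in percent
mean_bond, sd_bond = 4.0, 6.0    # Bonds, in percent
w = 0.5                          # weight on each asset
sd_port = 11.0                   # given sd of the 50-50 portfolio

# Part 1: solve the variance identity for Cov(S, B), then normalize.
cov = (sd_port**2 - w**2 * sd_sp**2 - w**2 * sd_bond**2) / (2 * w * w)
corr = cov / (sd_sp * sd_bond)
print(f"covariance = {cov:.2f}, correlation = {corr:.3f}")

# Part 3: dollar return X on $50,000 and the tax bill T, assumed independent.
invested = 50_000
mean_x = invested * (w * mean_sp + w * mean_bond) / 100   # 3750
sd_x = invested * sd_port / 100                           # 5500
mean_t, sd_t = 9_750, 2_398

surplus = NormalDist(mean_x - mean_t, sqrt(sd_x**2 + sd_t**2))  # X - T
p_cover = 1 - surplus.cdf(0)
print(f"P(X - T > 0) = {p_cover:.3f}")
```

Under these inputs the covariance comes out to 43.5 and the correlation to about 0.38, and the probability of covering the tax is about 0.16, matching the normal approximation P(Z > 1).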
Problem 5: Confidence interval and hypothesis testing. [15 points]
The following table summarizes the annual returns on the SP500 from 1900 until the end of 2015, a total of
116 years (in percentage terms):
1. Based on these results, what is the probability of the SP500 returning less than 20% next year? In
addition, give a 95% prediction interval for next year’s SP500 return. [3 points]
\[P(X < 0.2) = P\left(Z < \frac{0.2 - 0.072}{0.13}\right) = P(Z < 0.98) \approx 0.84\]
The predictive interval is
[7.2 − 2 ∗ 13, 7.2 + 2 ∗ 13] = [−18.8, 33.2]
2. Use a 99% confidence interval to test the hypothesis that the expected return (true mean) of the SP500 is
equal to 4% a year. [2 points]
The standard error of the sample mean is \(\sqrt{13^2/116} = 1.21\), so the confidence interval is \(7.2 \pm 3 \times 1.21 = [3.58, 10.82]\). We see that
4% is inside the interval, so we cannot reject the null hypothesis.
3. In addition, suppose the 95% confidence interval (constructed based on our dataset) for the population
mean of SP500 return µ is [4.7, 9.6]. Which one below best describes the statistical meaning? [5 points]
• (a) P (µ lies in [4.7, 9.6]) = 95%, in other words, the probability that true mean of SP500 µ lies in the
interval [4.7, 9.6] is 95%.
• (b) If we recollect datasets and build confidence intervals many times, 95% of the time these intervals
will cover the true µ. Correct
• (c) We are 95% sure that the true mean is in the interval [4.7, 9.6].
4. We want to test the null hypothesis H0 : µ = 11 vs. H1 : µ ≠ 11. We calculate the t-statistic, which is
t = −3.48. Which one below describes the statistical meaning? [5 points]
• (a) We reject the null hypothesis with 95% confidence. Here the 95% confidence means that when the
null is wrong, the probability of correctly rejecting the null is 95%.
• (b) We reject the null hypothesis with 95% confidence. Here the 95% confidence means that when the
null is correct, the probability of wrongfully rejecting the null is 5%. Correct
• (c) Both (a) and (b).
A 95% confidence level means that, when the null hypothesis is true, there is a 5% chance of rejecting it,
i.e., of committing a type I error (false positive).
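The numbers in parts 1 and 2 of this problem can be rechecked with the sample mean 7.2, standard deviation 13 and n = 116 used in the solutions. The sketch applies the formula sheet's rule-of-thumb multipliers (2 for 95%, 3 for 99%) and, for comparison, the exact normal quantile; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean, sd, n = 7.2, 13.0, 116   # percent, percent, number of years

# Part 1: probability of a return below 20% and a 95% prediction interval.
p_below_20 = NormalDist(mean, sd).cdf(20.0)
pred_95 = (mean - 2 * sd, mean + 2 * sd)

# Part 2: 99% confidence interval for the true mean.
se = sd / sqrt(n)                                   # standard error of the mean
ci_99_rule = (mean - 3 * se, mean + 3 * se)         # rule of thumb: 3 standard errors
z_exact = NormalDist().inv_cdf(0.995)               # about 2.576
ci_99_exact = (mean - z_exact * se, mean + z_exact * se)

print(f"P(X < 20) = {p_below_20:.3f}")
print(f"95% prediction interval = [{pred_95[0]:.1f}, {pred_95[1]:.1f}]")
print(f"99% CI, multiplier 3     = [{ci_99_rule[0]:.2f}, {ci_99_rule[1]:.2f}]")
print(f"99% CI, exact quantile   = [{ci_99_exact[0]:.2f}, {ci_99_exact[1]:.2f}]")
```

One thing worth noticing: with the multiplier 3 the interval contains 4, as in the solution, but with the exact quantile the lower end sits slightly above 4. The conclusion in part 2 therefore depends on using the rule of thumb from the formula sheet.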
Problem 6: Regression. [15 points]
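[Figure: three scatterplots of Y1, Y2 and Y3 against X, each with its least squares regression line. X runs from −3 to 3 in every panel; the vertical axes run from about −5 to 7 (Y1), −15 to 10 (Y2) and −5 to 5 (Y3).]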
In the above scatterplots, three different variables Y1, Y2, Y3 are regressed onto the same X (in all three
scatterplots we have the exact same n = 200 values for X). The line is the least squares regression line. In this
question, we can think of the residual standard error (s) for each regression as the standard deviation of the error term,
\[Y = b_0 + b_1 X + \epsilon, \quad \epsilon \sim N(0, s^2).\]
Carefully examine the plots and answer the questions below:
1. Which of the following are the least squares estimates of the slope (b1) and intercept (b0) for the regression
of Y3 on X? [3 points]
• (a) b1 = 0.34, b0 = −0.24 Correct
• (b) b1 = 0.78, b0 = 0.02
• (c) b1 = 2.53, b0 = −0.05
2. Which of the following are the least squares estimates of the slope (b1) and residual standard error (s)
for the regression of Y2 on X? [2 points]
• (a) b1 = −0.9, s = 6.3
• (b) b1 = −2.0, s = 3.2 Correct
• (c) b1 = −4.3, s = 3.1
3. Which of the following are the correlation (R) and residual standard error (s) for the regression of Y1 on X?
[3 points]
• (a) R = 0.988, s = 0.5 Correct
• (b) R = 0.707, s = 0.97
• (c) R = 0.261, s = 0.1
5. Using all the information provided so far, give a rough approximation for the 99% prediction interval
for Y1 given X = 0. [2 points]
\[Y_1 \mid X = 0 \sim N(1, 0.5^2)\]
\[[1 \pm 3 \times 0.5] = [-0.5, 2.5]\]
Problem 7: Boston housing data. [10 points]
In this question, we take a look at a dataset that contains information collected by the U.S. Census Service
concerning housing in the area of Boston, Massachusetts. In total, there are 506 areas in the dataset.
Variable   Description
medv       median value of owner-occupied homes in $1000s
rm         average number of rooms per dwelling
dis        weighted mean of distances to five Boston employment centers
lstat      lower status of the population (percent)
We run three simple linear regressions, aiming to figure out the quality of rm, dis, lstat in explaining medv.
[Figure: three scatterplots of medv (vertical axis, 0 to 50) against rm, dis and lstat, each with its least squares regression line.]
1. Based on the least squares regression plots, which one of the variables is the worst in terms of explaining
medv? [2 points]
dis
2. For the simple linear regression using dis to predict medv (the top-right plot in sub-problem 1), let us
look at the residual
\[\text{resid}_i = \text{medv}_i - \text{fitted}_i\]
against \(\text{fitted}_i\) for each data point i.
[Figure: scatterplot of .resid (vertical axis, about −10 to 30) against .fitted (horizontal axis, about 20 to 32).]
Suppose we run a new linear regression, based on the plot above, using resid and fitted
\[\text{resid} = b_0 + b_1 \cdot \text{fitted}.\]
3. We want to understand the weak positive correlation between dis and medv, namely, why the further
the distance is to the employment center, the more expensive the house is. To simplify the discussion,
we look at the following subset of the data according to the variable log(crim), which quantifies the
per capita crime rate. In the following plots, ‘△’ denotes data with a low crime rate, and ‘+’ indicates
data with a high crime rate.
[Figure: plots of dis and medv by crime-rate group; the labels A, B, C, D appear in the figure.]
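For sub-problem 2, a standard property of least squares with an intercept is that the residuals average to zero and are uncorrelated with the fitted values, so regressing resid on fitted returns b0 = 0 and b1 = 0 up to rounding. The sketch below illustrates this on made-up numbers (not the Boston housing data); the helper `ols` is written here for illustration.

```python
def ols(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data, only to demonstrate the property.
x = [1.0, 2.0, 3.5, 4.0, 6.0, 7.5, 9.0]
y = [14.0, 22.0, 19.0, 30.0, 21.0, 35.0, 27.0]

# First regression: y on x, then fitted values and residuals.
a, b = ols(x, y)
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Second regression: residuals on fitted values.
b0, b1 = ols(fitted, resid)
print(f"b0 = {b0:.2e}, b1 = {b1:.2e}")
```

Both printed coefficients are zero apart from floating-point error, whatever data are used, as long as the first regression has a nonzero slope so that the fitted values vary.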
Problem 8: Envelope game. [10 points]
At the end of BUS41000 class, Professor Liang decides to reward Tom for his hard work, and how much
reward he can get depends on his probability skills. Professor Liang places two checks (one check is $30, the
other is $70) into two envelopes. Note Tom has no idea about the value of the checks.
1. First, Tom decides to pick one envelope randomly. How much is his reward, in expectation? [2 points]
30 × 0.5 + 70 × 0.5 = 50
2. Suppose that the rule is changed slightly: Tom is allowed to choose one envelope, open it, and review
the value of the check. Then he can decide whether to stick with the opened envelope or to swap to the
other one. Tom recalls that one could use a randomized strategy to win more money. Here is Tom’s
new strategy: he draws a random number X using R/Excel from a normal distribution \(X \sim N(50, 10^2)\),
then compares this X with the value of the check he just opened. He will only keep the check if its
value is larger than X. Otherwise, he will swap to the other envelope. Using this strategy, how much is
Tom’s reward, in expectation? How much more money will he get compared to sub-problem
1? [8 points]
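Since P(X < 70) = P(X > 30) = 0.975, whichever envelope Tom opens he ends up with the $70 check with
probability 0.975 and with the $30 check with probability 0.025.
So the expectation is
\[(70 \times 0.5 \times 0.975 + 30 \times 0.5 \times 0.025) \times 2 = 69\]
and 69 − 50 = 19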
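The expectation can be written out case by case, which also makes it easy to swap the rounded 0.975 for the exact normal probability. A minimal sketch, with an illustrative function name:

```python
from statistics import NormalDist

LOW, HIGH = 30, 70
threshold = NormalDist(50, 10)   # distribution of Tom's random number X


def expected_reward(p_x_below_high, p_x_below_low):
    """Expected reward when Tom keeps the opened check only if it exceeds X."""
    # Opened the $70 check: keep it if X < 70, otherwise swap to $30.
    after_high = HIGH * p_x_below_high + LOW * (1 - p_x_below_high)
    # Opened the $30 check: keep it if X < 30, otherwise swap to $70.
    after_low = LOW * p_x_below_low + HIGH * (1 - p_x_below_low)
    # Each envelope is opened first with probability 0.5.
    return 0.5 * after_high + 0.5 * after_low


rule_of_thumb = expected_reward(0.975, 0.025)
exact = expected_reward(threshold.cdf(HIGH), threshold.cdf(LOW))

print(f"rule of thumb: {rule_of_thumb:.2f}, gain over random pick: {rule_of_thumb - 50:.2f}")
print(f"exact normal:  {exact:.2f}, gain over random pick: {exact - 50:.2f}")
```

The rounded probabilities give 69, and the exact normal probabilities give a value only slightly higher, so the gain over picking at random is about 19 either way.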
Problem 9: Your intuition about correlation. [5 points]
1. Consider three stocks: A, B and C. Suppose Corr(A, B) = 0, Corr(A, C) = 0, what is the possible
range of Corr(B, C)? [1 point]
• (a) [-1, 1] Correct
• (b) 0
• (c) None of the above
2. Suppose Corr(A, B) = 1, Corr(A, C) = −1, what is the possible range of Corr(B, C)? [1 point]
• (a) -1 Correct
• (b) 1
• (c) None of the above
3. Suppose Corr(A, B) = 0.5, Corr(A, C) = 0.5, what is the possible range of Corr(B, C)? [3 points]
• (a) [0.5, 1]
• (b) [-0.5, 1] Correct
• (c) [-1, 0.5]
• (d) None of the above
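All three answers follow from one fact: a 3×3 correlation matrix must have a non-negative determinant, which restricts Corr(B, C) to the interval Corr(A, B)·Corr(A, C) ± sqrt((1 − Corr(A, B)²)(1 − Corr(A, C)²)). A short sketch of that bound, with a hypothetical function name:

```python
from math import sqrt


def corr_range(r_ab, r_ac):
    """Feasible (low, high) range of Corr(B, C) given Corr(A, B) and Corr(A, C)."""
    center = r_ab * r_ac
    half_width = sqrt((1 - r_ab**2) * (1 - r_ac**2))
    return center - half_width, center + half_width


# The three cases from this problem.
for r_ab, r_ac in [(0.0, 0.0), (1.0, -1.0), (0.5, 0.5)]:
    low, high = corr_range(r_ab, r_ac)
    print(f"Corr(A,B)={r_ab}, Corr(A,C)={r_ac}: Corr(B,C) in [{low:.2f}, {high:.2f}]")
```

The three cases give [−1, 1], the single value −1, and [−0.5, 1], in line with the answers marked correct.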