Data8 sp22 Midterm Solution
Spring 2022
Data 8: Foundations of Data Science Midterm Solution
INSTRUCTIONS
• You have 1 hour and 50 minutes to complete the exam.
• The exam is closed book, closed notes, closed computer, closed calculator, except the provided midterm study
guide.
• Mark your answers on the exam itself in the spaces provided. We will not grade answers written on scratch
paper or outside the designated answer spaces.
• If you need to use the restroom, bring your phone and exam to the front of the room.
For questions with circular bubbles, you should select exactly one choice.
○ You must choose either this option
○ Or this one, but not both!
For questions with square checkboxes, you may select multiple choices.
□ You could select this choice.
□ You could select this one too!
Preliminaries
You can complete and submit these questions before the exam starts.
(a) What is your full name?
(c) Who is your lab GSI? You may write self-service if you have no lab GSI.
(d) Sign here to confirm that all work on this exam is your own (or type your name if online).
The players table contains a row for each of the 528 players in the 2020 NBA season. Columns are the player’s
name, 2019 salary (2019), 2020 salary (2020), 2019 team name (19team), and 2020 team name (20team). For
players who joined in 2020, their 2019 value is 0 and their 19team value is No Team. The first three rows are:
sort('arena')
column('name')
division   count
Atlantic   5
Central    5
Southwest  5

teams._________.group(_________)
         (a)             (b)
Reminders:
• The teams table has columns name, division, conference, and arena.
• The players table has columns name, 2019, 2020, 19team, and 20team.
i. (2.0 pt) Fill in blank (a).
where('conference', 'Eastern')
group('division', sum)
max(players.column('2020') - players.column('2019'))
ii. (3.0 pt) The number of players in 2020 (an integer) who played for the same team in 2019 and
2020.
sum(players.column('19team') == players.column('20team'))
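The expression above works because comparing two arrays with == produces an array of booleans, and summing it counts the True values. A minimal sketch with NumPy, using made-up team names as stand-ins for the two columns (hypothetical data, not the exam's table):

```python
import numpy as np

# Hypothetical stand-ins for players.column('19team') and players.column('20team').
team_2019 = np.array(['Bulls', 'Nets', 'No Team', 'Heat'])
team_2020 = np.array(['Bulls', 'Suns', 'Nets', 'Heat'])

# Element-wise comparison gives a boolean array; each True counts as 1 when summed.
same_team = team_2019 == team_2020
num_same = int(sum(same_team))
```

Players who joined in 2020 have a 19team value of No Team, so they are not counted as long as no 2020 team carries that name.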
iii. (2.0 pt) The number of teams (an integer) that have an arena size that is above average.
iv. (4.0 pt) Select all of the quantities below that can be computed from these two tables.
■ The number of divisions that had at least 5 players paid more than $20,000,000 in 2020
■ The name of the team that paid the most player salary per seat in its arena in 2020 (Note: The
number of seats in an arena is its capacity.)
□ The number of players who retired after the 2019 season.
□ The name of the player that made the most additional salary by changing teams in 2020 compared
to the amount they would have made staying at their 2019 team
□ None of these
i. (2.0 pt) About what percentage of the players who had the same 19team and 20team had a salary
between $10 million and $20 million in 2020?
● 15%
○ 20%
○ 25%
○ 30%
○ 35%
ii. (2.0 pt) About what percentage of the players who had the same 19team and 20team had a salary
of $10 million or more in 2020?
○ 10%
● 30%
○ 50%
○ 70%
○ 90%
iii. (4.0 pt) About how many players played on different teams in 2019 and 2020 and made between $5
million and $10 million in 2020?
Please express your answer as a Python expression (e.g., 0.1 * 0.2 + 0.3) rather than simplifying it
to a single number.
iv. (4.0 pt) Select all of the quantities below that can be determined from only these two histograms
and the additional information that appears just above the histograms.
Reminder: The additional information was that among the 440 players who played in both 2019 and
2020, 60% played on the same team and 40% played on different teams.
□ The total number of players who played in 2019 and had a 2020 salary below $2 million
■ The total number of players who played in 2019 and had a 2020 salary below $20 million
■ Among all players who played in both 2019 and 2020, the proportion who had a salary of $20
million or more
■ Among all players who played in 2019 and had a 2020 salary of $20 million or more, the proportion
who played on the same team in 2019 and 2020
□ None of these.
v. (2.0 pt) How would you use these histograms to determine whether the 2020 salary distribution was
different for players with a different team than for players with the same team?
● Compare the two histograms visually and look for differences.
○ Use the two histograms to perform an A/B test.
○ Use the histograms to compute the average salary for both groups and compare those averages.
○ Use the histograms to compute the total salary for both groups and compare those totals.
vi. (2.0 pt) The $30-$40 million bin is slightly taller for players with a different team (right histogram)
than for players with the same team (left histogram). What can we conclude from this difference?
○ Players who switch teams are paid more.
○ Players who switch teams are more likely to end up with a salary of $30-$40 million.
● Within that bin, the density among players with a different team is higher than the density among
players with the same team.
○ Within that bin, the number of players with a different team is higher than the number of players
with the same team.
ii. (1.0 pt) The association between game length and number of tasks completed.
○ Bar Chart
○ Histogram
○ Line Plot
● Scatter Plot
iii. (1.0 pt) The average game length for each outcome.
● Bar Chart
○ Histogram
○ Line Plot
○ Scatter Plot
ii. (4.0 pt) The result of which of the following expressions contains in one of its cells the total number
of tasks completed in all games for which Matty was a Crewmate and lost? Select all that apply.
□ games.pivot('completed', 'team', 'outcome', collect=sum)
■ games.pivot('team', 'outcome', 'completed', collect=sum)
□ games.group('team').group('outcome').group('completed', collect=sum)
■ games.group(['team', 'outcome'], collect=sum)
□ None of these
ii. (2.0 pt) When two pets are chosen at random with replacement, the probability that they are both
dogs.
● (9 / 20) ** 2
○ (10 / 20) * (1 / 20)
○ (10 / 20) + (1 / 20)
○ 1 - (9 / 20) ** 2
○ 1 - (10 / 20) * (1 / 20)
○ 1 - (10 / 20) + (1 / 20)
iii. (2.0 pt) When two pets are chosen at random with replacement, the probability that the first is a
cat and the second is not.
○ 10 / 20 + 10 / 20
● (10 / 20) * (10 / 20)
○ (10 / 20) * (9 / 20) * (1 / 20)
○ 1 - (10 / 20) * (10 / 20)
○ 1 - (10 / 20 + 10 / 20)
○ 1 - (10 / 20) * (9 / 20) * (1 / 20)
iv. (2.0 pt) When two pets are chosen at random with replacement, the probability that the first chases
the second. Assume dogs only chase cats, cats only chase birds, and birds don’t chase.
● (10 / 20) * (10 / 20)
○ (19 / 20) * (10 / 20)
● (10 / 20) * (1 / 20) + (9 / 20) * (10 / 20)
○ 1 - ((9 / 20) * (1 / 20) + (10 / 20) * (9 / 20))
○ 1 - ((10 / 20) ** 2 + (9 / 20) ** 2 + (1 / 20) ** 2)
○ 1 - ((10 / 20) ** 2 + (9 / 20) ** 2 + (1 / 20))
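The answers to parts ii through iv can be checked by brute force. The sketch below assumes a population of 20 pets made up of 10 cats, 9 dogs, and 1 bird; these counts are inferred from the answer choices and are not stated in this excerpt. It enumerates every ordered pair, which is valid because sampling with replacement makes all pairs equally likely.

```python
from fractions import Fraction
from itertools import product

# Assumed population, inferred from the answer choices: 10 cats, 9 dogs, 1 bird.
pets = ['cat'] * 10 + ['dog'] * 9 + ['bird']

# Sampling with replacement: every ordered pair of pets is equally likely.
pairs = list(product(pets, repeat=2))

def prob(event):
    """Exact probability that event(first, second) holds for a random ordered pair."""
    return Fraction(sum(event(a, b) for a, b in pairs), len(pairs))

# Dogs only chase cats, cats only chase birds, birds chase nothing.
chases = {('dog', 'cat'), ('cat', 'bird')}

both_dogs = prob(lambda a, b: a == 'dog' and b == 'dog')
cat_then_not_cat = prob(lambda a, b: a == 'cat' and b != 'cat')
first_chases_second = prob(lambda a, b: (a, b) in chases)
```

Under these assumed counts, (10 / 20) * (10 / 20) and (10 / 20) * (1 / 20) + (9 / 20) * (10 / 20) both simplify to 1/4, so either expression gives the correct value for part iv.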
iii. (2.0 pt) Which Python expression evaluates to the probability that B is not 0 and not 1, but instead
a proportion between 0 and 1?
○ 0
○ 1
○ 0.5 ** 40
○ 1 - (0.5 ** 40)
○ 0.5 ** 40 + 0.5 ** 40
● 1 - (0.5 ** 40 + 0.5 ** 40)
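The setup for this question is not reproduced here. Assuming B is the proportion of successes in 40 independent draws that each succeed with probability 0.5 (an assumption based on the answer choices), the complement-rule answer can be checked against a direct sum over the binomial probabilities:

```python
from fractions import Fraction
from math import comb

n = 40               # assumed number of independent 50/50 draws
p = Fraction(1, 2)   # assumed chance of success on each draw

def binom_pmf(k):
    """P(exactly k successes in n draws); the proportion B equals k / n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# B is strictly between 0 and 1 exactly when k is between 1 and n - 1.
direct = sum(binom_pmf(k) for k in range(1, n))

# Complement rule: remove the two extreme outcomes (all failures, all successes).
shortcut = 1 - (p**n + p**n)
```

Both quantities are exact fractions, so they can be compared with == rather than with a tolerance.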
iii. (3.0 pt) Fill in blank (c). You may include one or more commas.
np.arange(1, 7), 5
Guesses     1     2     3     4     5     6     X
Proportion  0.0   0.17  0.33  0.27  0.20  0.02  0.01

Guesses     1     2     3     4     5     6     X
Proportion  0.0   0.09  0.25  0.32  0.28  0.03  0.03
ii. (2.0 pt) Complete the alternative hypothesis: The distribution of guess counts for UC Berkeley
students is . . .
○ the same as the distribution of guess counts for all Wordle players.
● different from the distribution of guess counts for all Wordle players.
○ the same as the uniform distribution.
○ different from the uniform distribution.
iii. (2.0 pt) Which test statistic is best for choosing between the null and alternative hypotheses?
○ total guess count
○ most common guess count
○ guess count
● total variation distance
○ observed average
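Total variation distance is half the sum of the absolute differences between two categorical distributions. A minimal sketch, applied to the two guess-count distributions tabulated earlier in this question. The names dist_a and dist_b are placeholders, since this excerpt does not say which table belongs to which group; the distance is symmetric, so the labeling does not affect the result.

```python
def total_variation_distance(dist_1, dist_2):
    """Half the sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(dist_1, dist_2)) / 2

# Proportions for guess counts 1, 2, 3, 4, 5, 6, X, copied from the two tables.
dist_a = [0.0, 0.17, 0.33, 0.27, 0.20, 0.02, 0.01]
dist_b = [0.0, 0.09, 0.25, 0.32, 0.28, 0.03, 0.03]

tvd = total_variation_distance(dist_a, dist_b)  # about 0.16
```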
iv. (2.0 pt) Which line of code simulates a distribution of proportions for 1000 Berkeley students under
the null hypothesis?
○ sample_proportions(1000, berkeley)
● sample_proportions(1000, everyone)
○ sample_proportions(1000, make_array('1', '2', '3', '4', '5', '6', 'X'))
○ sample_proportions(1000, make_array(1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7))
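To make the simulation step concrete, here is a plain-Python stand-in for what sample_proportions(1000, everyone) does: draw 1000 categories at random according to the null distribution and report the proportion in each category. This is a sketch, not the library's implementation. The helper name sample_proportions_sketch is hypothetical, and the values in everyone are an assumed stand-in, since this excerpt does not label the tables.

```python
import random
from collections import Counter

def sample_proportions_sketch(sample_size, distribution, rng=random):
    """Draw sample_size categories at random according to distribution and
    return the proportion of draws that landed in each category."""
    categories = range(len(distribution))
    draws = rng.choices(categories, weights=distribution, k=sample_size)
    counts = Counter(draws)
    return [counts[c] / sample_size for c in categories]

# Assumed stand-in for the `everyone` array (guess counts 1 through 6, then X).
everyone = [0.0, 0.17, 0.33, 0.27, 0.20, 0.02, 0.01]

# One simulated sample of 1000 students under the null; seeded for repeatability.
simulated = sample_proportions_sketch(1000, everyone, random.Random(8))
```

Each call yields one simulated distribution; repeating it many times and computing the test statistic on each result builds the empirical null distribution.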
v. (2.0 pt) How does increasing the number of times a distribution is simulated under the null hypothesis
affect the outcome of the hypothesis test?
○ The probability that the null hypothesis is false will increase.
○ The probability that the null hypothesis is true will increase.
○ The observed distribution of guess counts for Berkeley students will be more similar to the
distribution for all players.
○ The observed test statistic for Berkeley students will be more similar to the test statistic for all
players.
● The empirical distribution of the test statistic under the null hypothesis will be more similar to its
theoretical distribution.
vi. (2.0 pt) If the null hypothesis is rejected because the p-value of this hypothesis test is very small,
what can we conclude? Select all that apply.
□ Attending Berkeley improves most people’s Wordle performance.
□ Attending Berkeley changes most people’s Wordle performance.
□ Attending Berkeley does not improve most people’s Wordle performance.
□ Attending Berkeley does not change most people’s Wordle performance.
■ None of these.
sum or np.count_nonzero
sim
obs
sim
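The code these blanks belong to is not reproduced in this excerpt. As a hedged illustration only, an empirical p-value is commonly computed by counting the simulated statistics that are at least as large as the observed one and dividing by the number of simulations. The names sim and obs mirror the answers above; the numbers are made up.

```python
# Hypothetical simulated test statistics and observed statistic (made-up values).
sim = [0.02, 0.05, 0.01, 0.16, 0.03, 0.04, 0.20, 0.02, 0.03, 0.06]
obs = 0.16

# Count simulations at least as large as the observation, then divide by the total.
p_value = sum(s >= obs for s in sim) / len(sim)
```

With NumPy arrays the same count can be written as np.count_nonzero(sim >= obs), which is why either function is accepted for the first blank.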
(c) Define reading more as spending an extra two hours a day reading The New York Times, and a good game
of Wordle as one in which the player guesses the word in 3 or fewer tries. We want to test if reading more
leads to a higher proportion of good games.
Among the 1000 Berkeley students who played Wordle yesterday, 500 were selected at random (without
replacement) one month ago and asked to read more. All 1000 played yesterday’s Wordle, and the number
of guesses each student took was recorded.
i. (2.0 pt) How would a permutation test be used to investigate whether reading more leads to a higher
proportion of good games?
● Repeatedly, all 1000 students would be partitioned at random without replacement into two groups
of 500, and the proportion of good games in those two groups would be compared for simulating a
null distribution.
○ Repeatedly, all 1000 students would be partitioned at random without replacement into two groups
of 500, and within each group the proportion of good games for students who read more would be
compared to that of the students who didn’t.
○ Repeatedly, the proportion of good games for students who read more would be compared with
the proportion of good games of a random permutation of those who didn’t.
○ Repeatedly, the proportion of good games for students who read more would be compared with
the proportion of good games of a random permutation of all 1000 students.
ii. (2.0 pt) Suppose we consider the following alternative hypothesis: Among the 1000 students, the
proportion of good games would be higher if they all read more than if none of them read more.
Complete this null hypothesis: Among the 1000 students, the proportion of good games . . .
○ would be lower for students who read more than for those who didn’t.
○ for the 500 students who were selected to read more is the same as for the other 500 students.
○ for students who read more would be 50%.
● would be the same whether they all read more or none of them read more.
iii. (2.0 pt) Which of the following test statistics is best for choosing between the null and alternative
hypotheses above?
● The difference between the proportion of good games in each group.
○ The absolute difference between the proportion of good games in each group.
○ The difference between the proportion of good games in the “read more” group and 0.5.
○ The difference between the proportion of good games in the “didn’t read more” group and 0.5.
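The permutation test described in parts i through iii can be sketched as follows. The data are made up for illustration (the exam does not give per-student outcomes), and the function name permutation_p_value is hypothetical. The statistic is the signed difference in proportions, read-more minus control, because the alternative is one-sided.

```python
import random

def permutation_p_value(read_more, control, repetitions=1000, seed=0):
    """One-sided permutation test for a higher proportion of good games
    (1 = good game, 0 = not) in the read_more group."""
    rng = random.Random(seed)
    observed = sum(read_more) / len(read_more) - sum(control) / len(control)

    pooled = list(read_more) + list(control)
    n = len(read_more)
    at_least_as_large = 0
    for _ in range(repetitions):
        rng.shuffle(pooled)  # random partition into two groups, without replacement
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / len(control)
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / repetitions

# Made-up outcomes for illustration: 500 students per group.
read_more = [1] * 300 + [0] * 200   # 60% good games
control = [1] * 250 + [0] * 250     # 50% good games
observed, p_value = permutation_p_value(read_more, control)
```

Shuffling the pooled outcomes and splitting them into two groups of 500 is the random partition from part i; under the null hypothesis the group labels carry no information, so every such split is equally likely.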
iv. (2.0 pt) When we conduct this permutation test, we compute a p-value of 0.002. Assume we had
chosen a p-value cut-off of 0.05. Which of the following can we conclude about the 1000 Berkeley
students based on this result? Select all that apply.
■ Reading more increases the proportion of good games.
■ There is an association between reading more and the proportion of good games.
□ Being a Berkeley student is a confounding factor for the association between reading more and the
proportion of good games.
□ None of these