Data8 Fa23 Final
Data 8, Fall 2023
Foundations of Data Science Final Exam
INSTRUCTIONS
You have 2 hours and 50 minutes to complete the exam.
• The exam is closed book, closed notes, closed computer, closed calculator, except the provided reference sheet.
• Mark your answers on the exam itself in the spaces provided. We will not grade answers written on scratch
paper or outside the designated answer spaces.
• If you need to use the restroom, bring your phone and exam to the front of the room.
For questions with circular bubbles, you should select exactly one choice.
# You must choose either this option
# Or this one, but not both!
For questions with square checkboxes, you may select multiple choices.
2 You could select this choice.
2 You could select this one too!
Preliminaries
You can complete and submit these questions before the exam starts.
The exam is worth 140 points.
The sections are as follows:
True or False - 30 points
Community - 12 points
Merch - 40 points
Spotify - 30 points
Bears - 28 points
There is also a Just For Fun section, worth 0 points, and a Last Words section, where you can state any
assumptions you made on the exam, also worth 0 points.
(a) What is your full name?
(c) Who is your lab GSI? You may write Unknown if you don’t know their name.
(d) Sign here to confirm that all work on this exam is your own (or type your name if online).
(b) (2.0 pt) When building a classifier, ensuring that you have a large and diverse training set is a good way
to mitigate overfitting.
# True
# False
(c) (2.0 pt) According to the Central Limit Theorem, if a sample is large, and drawn at random from the
population with replacement, then the probability distribution of the sample mean is roughly normal.
# True
# False
(d) (2.0 pt) If we use linear regression to predict y-values based on our x-values, where both x and y are
standardized, the estimate of the intercept could be negative.
# True intercept = 0 ???
# False
(e) (2.0 pt) If you are sampling a numerical attribute that can only take on values of 0 or 1, the SD of your
sample could have a value of 0.5.
# True
# False
(f ) (2.0 pt) If we use linear regression to predict y-values based on our x-values, the average of our residuals
will always be zero, regardless of whether x and y are standardized.
# True
# False
(g) (2.0 pt) If you use k-nearest neighbors on a data set that has only 2 possible categories for class (e.g. 0
or 1) and a k of 4, there is guaranteed to be a unique class that has the majority among the k nearest
neighbors in the training set.
# True
# False
(h) (2.0 pt) The total variation distance can only be applied to categorical distributions in which there are 3
or more unique categories.
# True
# False
(i) (2.0 pt) The recommended way to estimate a classifier’s accuracy on the population is to evaluate its
accuracy on the training set.
# True
# False
(j) (2.0 pt) For any distribution, the percent of data that lies within 3 SDs of the average is at least 80%.
# True
# False
(k) (2.0 pt) When conducting a randomized control experiment, random assignment of treatment and control
serves as a way to simulate data from the null hypothesis.
# True
# False
(l) (2.0 pt) You can have two individuals whose distance is zero if calculated using only 1 numerical attribute,
but whose distance is greater than zero if calculated using 2 numerical attributes.
# True
# False
(m) (2.0 pt) Chebychev’s Rule allows us to model subjective beliefs about events that involve randomness.
# True
# False
(n) (2.0 pt) Modern neural networks are powerful machine learning models for classifying images because
their features are learned (as opposed to being inputted as columns from the training set).
# True
# False
(o) (2.0 pt) If a scatterplot has a correlation coefficient of 0, there is no way that all of the points lie on a
straight line.
# True
# False
Donald Glover, an actor from the original Community TV show, hasn’t yet confirmed whether he will
return for the movie.
Suppose we know the following conditional probabilities:
(b) • If the third act has a paintball fight, there is a 20% chance Glover will return for the movie
• If the third act has a multiverse theme, there is a 50% chance Glover will return for the movie
i. (3.0 pt) What is the chance that the third act has a paintball fight and Glover does not return for
the movie?
# 0.4 × 0.8
0.6 * 0.8
# 0.8
# 1 − (0.2 + 0.5)
# 0.6 × 0.8 + 0.4 × 0.5
# None of the above.
ii. (3.0 pt) Suppose the script has now been finalized and Glover announces that he will be returning
for the movie.
What is the probability that the third act will be a paintball fight?
# (0.2 × 0.6) / (0.2 × 0.6 + 0.5 × 0.8)
# 0.6 × 0.2
# (0.6 × 0.2) / (0.6 × 0.2 + 0.4 × 0.2)
# (0.6 × 0.2) / (0.6 × 0.2 + 0.4 × 0.5)
iii. (3.0 pt) Suppose that before the script is finalized, there is a leak on social media that indicates the
chance of the third act being a paintball fight is 90%.
The script then gets finalized and Glover announces that he will be returning for the movie.
Given the new information in the leak, what is our updated probability that the third act will be a
paintball fight?
# (0.2 × 0.9) / (0.2 × 0.9 + 0.5 × 0.8)
# (0.9 × 0.2) / (0.9 × 0.2 + 0.1 × 0.2)
0.2 * 0.9 / (0.2 * 0.9 + 0.5 * 0.1)
# 0.9 × 0.2
# (0.9 × 0.2) / (0.9 × 0.2 + 0.4 × 0.5)
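The update asked for in part iii is an application of Bayes' rule. The following is a minimal sketch using only the numbers stated in the question (the 90% leaked prior and the 20% and 50% return chances). The function name `posterior` is illustrative and not part of the exam, and the sketch assumes paintball and multiverse are the only two possible third acts.

```python
def posterior(prior_a, like_a, like_b):
    """P(A | evidence) when A and B are the only two possibilities."""
    prior_b = 1 - prior_a
    numerator = prior_a * like_a
    return numerator / (numerator + prior_b * like_b)

# Leaked prior: 90% paintball, so 10% multiverse (assumption: only two options).
# Likelihoods from the question: P(return | paintball) = 0.2, P(return | multiverse) = 0.5.
p_paintball_given_return = posterior(0.9, 0.2, 0.5)
print(round(p_paintball_given_return, 3))
```

The same function covers part ii if the prior is swapped for the one given before the leak.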
(b) (2.0 pt) Suppose that Mollie uses bootstrapping to create a 95% confidence interval using a sample size
smaller than the one from part (a). Ernest states that the interval is guaranteed to be wider than $2
million.
Is Ernest’s statement true or false?
Note: Assume your answer in part (a) is correct.
# True
# False
(c) (3.0 pt) Suppose that Mollie uses the sample size from part (a) and constructs a 95% confidence interval
of [35.1, 78.9].
What is the probability that the true population mean of merchandise sales is outside of this interval?
# 2.5%
# 5%
# 10%
# 95%
# There is not enough information to answer because we don’t know the endpoints of the confidence
interval.
# None of the above. There is no chance involved in whether our confidence interval contains the true
parameter.
(d) (3.0 pt) Suppose that Mollie wants to create a 95% confidence interval for the population 75th percentile
of merchandise sales.
Which of the following methods could be used to create such an interval?
Select all that apply.
2 Chebychev’s Inequality
2 Bootstrapping
2 Central Limit Theorem
2 Randomized Control Experiment
2 None of the above
Mollie suspects that a movie’s Rotten Tomatoes score might have a relationship with the amount of
merchandise it sells within the first month of theatrical release.
She randomly samples movies released in 2023 from Rotten Tomatoes and collects them into a table called
movies. The first few rows are shown here:
correlation = _________
(a)
return _________
(c)
The intercept function returns the intercept of the regression line.
Note: Both functions take in arrays as input.
np.mean(su(x) * su(y))
np.std(y) / np.std(x)
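The two expressions above rely on `su`, which is presumably a standard-units helper defined in a part of the exam not reproduced here. A rough sketch of how the pieces could fit together, written with the Python standard library instead of NumPy, follows. The definitions of `su`, `slope`, and `intercept` are assumptions based on the usual regression formulas, and the data is made up.

```python
from statistics import mean, pstdev

def su(values):
    """Convert values to standard units (assumed meaning of `su`)."""
    mu, sd = mean(values), pstdev(values)  # pstdev is the population SD, like np.std
    return [(v - mu) / sd for v in values]

def correlation(x, y):
    """r is the mean of the products of x and y in standard units."""
    return mean(a * b for a, b in zip(su(x), su(y)))

def slope(x, y):
    """Slope of the regression line: r * SD(y) / SD(x)."""
    return correlation(x, y) * pstdev(y) / pstdev(x)

def intercept(x, y):
    """Intercept of the regression line: mean(y) - slope * mean(x)."""
    return mean(y) - slope(x, y) * mean(x)

# Made-up data lying exactly on y = 3x + 1, purely for illustration
x = [1, 2, 3, 4, 5]
y = [4, 7, 10, 13, 16]
print(correlation(x, y), slope(x, y), intercept(x, y))
```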
ii. (3.0 pt) Mollie fits a regression line to predict Sales from Critics and gets a slope of -2.1.
Which of the following would she expect to happen with the regression line’s predictions?
Select all that apply.
2 The regression line will tend to overestimate Sales for movies with a below average Critics score.
2 The regression line will tend to underestimate Sales for movies with a below average Critics score.
2 The regression line will tend to overestimate Sales for movies with an above average Critics score.
2 The regression line will tend to underestimate Sales for movies with an above average Critics score.
2 None of the above.
iii. (3.0 pt) Ernest thinks the true slope of the regression line in the population is 0 and that the value
observed in the sample above is due to chance. He bootstraps the data in movies to generate a
confidence interval for the true slope.
Which of the following statements are true?
Select all that apply.
2 Every bootstrapped estimate of the slope will be negative.
2 The size of the bootstrap resamples will all be exactly 60.
2 All 60 movies in the original sample will appear in every bootstrap resample.
2 The bootstrap process is equivalent to permuting the rows of the dataset repeatedly.
2 None of the above.
iv. (3.0 pt) Ernest constructs a 90% confidence interval for the true slope and finds it to be [-4.5, -1.1].
Assuming a p-value cutoff of 5%, which of the following can Ernest conclude based on his confidence
interval?
Select all that apply.
2 The true slope in the population is 0.
2 The true slope in the population is not 0.
2 The true slope in the population is less than 0.
2 None of the above.
v. (3.0 pt) Mollie’s sister, Anna, argues that Ernest should have made a confidence interval for the
correlation coefficient instead.
Which of the following statements are true?
Select all that apply.
2 The correlation coefficient should be used instead because it is unitless.
2 The correlation coefficient should be used instead because the magnitude of the slope could be
affected by the units of the x-axis and y-axis.
2 It doesn’t matter which value is used since the slope is equal to the correlation coefficient.
2 It doesn’t matter which value is used since a slope of 0 implies the correlation coefficient is 0 as
well.
2 None of the above.
(f ) Rather than using the critics’ scores, Mollie thinks it’s a better idea to use the audience scores to predict
merchandise sales.
Suppose she knows the following:
• the Audience column has a mean of 70 and a standard deviation of 10
• the Sales column has a mean of 100 and a standard deviation of 50
• the correlation between the Audience and Sales columns is 0.4
i. (3.0 pt) If Mollie wants to predict Sales from Audience, what would be the intercept of her regression
line?
Please draw a box around your final answer.
slope = 0.4 * 50 / 10 = 2
intercept = 100 - 2 * 70 = -40
ii. (3.0 pt) For a movie that has an audience score of 80, what would the regression line above predict
as the merchandise sales?
# 200
# 150
# 140
# 120
# 110
# None of the above
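The arithmetic for parts i and ii can be checked with a short sketch. It uses only the summary statistics listed above; the variable names are illustrative.

```python
audience_mean, audience_sd = 70, 10
sales_mean, sales_sd = 100, 50
r = 0.4

slope = r * sales_sd / audience_sd              # 0.4 * 50 / 10
intercept = sales_mean - slope * audience_mean  # 100 - slope * 70
prediction = slope * 80 + intercept             # predicted Sales at Audience = 80
print(slope, intercept, prediction)
```

An audience score of 80 is one SD above the mean, so the prediction sits r = 0.4 SDs above the mean of Sales, which is 100 + 0.4 × 50.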
iii. (2.0 pt) What are the units for the slope in the above regression?
# Dollars per Percent
# Millions of Dollars
# Millions of Dollars per Tomato
# Dollars per Ounce of Ketchup
# None of the above
millions dollars per Percent
(b) (3.0 pt) Write a Python expression that returns a table with more than 3 columns that displays the
average play duration for each unique combination of artist and song.
(c) (3.0 pt) Write a Python expression that returns the name of the artist that has the largest number of
unique songs in the table.
Identifier DisplayName
jmarsdenofficial James Marsden
margarita23 Inez De Leon
ken_the_og Ken Hyun
(e) Jeanine notices that average play durations for ’Pop’ songs are typically lower than those for ’Hip-Hop’
songs across all 15 friends.
Barbara argues that any differences observed in the sample are only due to chance.
Recall : The spotify table has columns Username, Artist, Song, Genre and Duration.
i. (3.0 pt) Which of the following is an alternative hypothesis that Jeanine could use to assess her
claims?
Select all that apply.
2 ’Pop’ song plays have a lower Duration on average than ’Hip-Hop’ song plays.
2 ’Pop’ song plays have the same Duration distribution as ’Hip-Hop’ song plays.
2 All ’Pop’ song plays have a lower Duration than all ’Hip-Hop’ song plays.
2 ’Hip-Hop’ song plays have a higher Duration on average than ’Pop’ song plays.
2 None of the above.
ii. (3.0 pt) Which of the following test statistics could Jeanine use to assess her claims?
Select all that apply.
2 The total variation distance between the Duration distribution of ’Pop’ song plays and the Duration
distribution of ’Hip-Hop’ song plays.
2 The mean Duration among ’Hip-Hop’ song plays minus the mean Duration among ’Pop’ song
plays.
2 The mean Duration among ’Pop’ song plays minus the mean Duration among ’Hip-Hop’ song
plays.
2 The mean Duration among ’Pop’ song plays.
2 The mean Duration among ’Hip-Hop’ song plays plus the mean Duration among ’Pop’ song
plays.
2 None of the above.
iii. (3.0 pt) Jeanine chooses a test statistic such that large values favor the alternative.
She simulates the test statistic many times and stores these in an array called test_stats. Suppose
the observed value of the test statistic is 12.1.
Write a Python expression that returns the p-value for this hypothesis test.
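When large values of the statistic favor the alternative, the empirical p-value is the proportion of simulated statistics that are at least as large as the observed one. A minimal sketch follows. The values in `test_stats` are made up to stand in for the exam's simulated array, and a plain list is used so the example runs without NumPy.

```python
observed = 12.1
# Hypothetical simulated statistics, standing in for the exam's test_stats array
test_stats = [3.2, 15.0, 7.7, 12.1, 9.4, 1.0, 13.6, 5.5, 10.2, 8.8]

# Proportion of simulated values at least as large as the observed statistic
p_value = sum(1 for s in test_stats if s >= observed) / len(test_stats)
print(p_value)
```

If `test_stats` is a NumPy array, as in the exam, the same quantity can be written as `np.count_nonzero(test_stats >= 12.1) / len(test_stats)`.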
iv. (3.0 pt) Jeanine uses a p-value cutoff of 5% and finds that this corresponds to a simulated test statistic
of 10.2.
Given the information in part (iii), which of the following can she conclude?
Select all that apply.
2 The data are consistent with the null hypothesis.
2 The data are consistent with the alternative hypothesis.
2 There is a 5% chance that the null hypothesis is true.
2 There is a 5% chance that the alternative hypothesis is true.
2 ’Pop’ song plays had a lower duration on average than ’Hip-Hop’ songs.
2 There is not enough information to make a conclusion of any kind.
(b) (2.0 pt) Prof. Lawrence wants to see the 50th percentile of Score for every combination of Conference
and Public.
Which of the following functions could be used to visualize this information?
Select all that apply.
2 scatter
2 pivot
2 hist
2 group
2 barh
2 None of the Above
Prof. Lawrence creates the following chart showing Score against Ranking only for schools in the dataset
that are part of the 'ACC' or 'Big Ten' conferences.
(c) i. (2.0 pt) Prof. Lawrence uses the chart above to build a k-nearest-neighbor classifier with k = 4 to
predict the conference of schools outside the training set.
Skip this question
UCLA has a Score of 1000.25 and Ranking of 15.
What would this nearest neighbor classifier predict as UCLA’s Conference?
# ’ACC’
# ’Big Ten’
# There is no majority class.
iv. (3.0 pt) Prof. Lawrence’s student, Atticus, thinks that the data should be standardized before building
the k-nearest neighbors classifier.
Which of the following statements are true?
# It doesn’t matter if the data is standardized, since the set of nearest neighbors will be unchanged.
# It is important to standardize, since the magnitude of the features affects how distance is
calculated.
# None of the above.
(d) Prof. Strauss now wants to use k-nearest-neighbors to predict a school’s Ranking based on its Score and
Ratio (i.e., he wants to predict a numerical value instead of a category).
i. To use the k-nearest-neighbors to perform this prediction, Prof. Strauss needs to first find the k nearest
neighbors of the school with respect to Score and Ratio.
He writes a neighbors() function, which takes in the following arguments:
• train: A three-column table in which the first column is labeled Score, the second column is
labeled Ratio, and the third column is labeled Ranking. Each row of the table represents a school
in the training set.
• new_school: An array of length two containing a school’s score and ratio. For example,
array([1200, 50]) corresponds to a school with a Director’s Cup score of 1200 and a student-to-
faculty ratio of 50.
• k: The value of k to use for k-nearest-neighbors.
The function returns a table containing the k neighbors in train that are closest to new_school. It
is shown, partially completed, here:
def neighbors(train, new_school, k):
    score_diffs = ________(a)_________
    ratio_diffs = _________(b)_________
    distances = (________(c)_________) ** 0.5
    train_dist = train.with_column('Distance', distances)
    return ________(d)_________
A. (3.0 pt) Write a Python expression to fill in blank (a).
train.column("Score") - new_school[0]
train.column("Ratio") - new_school[1]
B. (3.0 pt) Write a Python expression to fill in blank (c).
Note: We did not have you fill out blank (b) to save you some time!
score_diffs ** 2 + ratio_diffs ** 2
train_dist.sort("Distance").take(np.arange(k))
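To see the whole function working end to end, here is a rough sketch of the same nearest-neighbor logic in plain Python. It is not the exam's table-based solution: the training set is a list of `(score, ratio, ranking)` tuples instead of a table, and the rows are made up for illustration.

```python
def neighbors(train, new_school, k):
    """Return the k rows of train closest to new_school.

    train: list of (score, ratio, ranking) tuples, a stand-in for the exam's table
    new_school: (score, ratio) pair
    """
    def distance(row):
        score_diff = row[0] - new_school[0]
        ratio_diff = row[1] - new_school[1]
        # Euclidean distance over the two features
        return (score_diff ** 2 + ratio_diff ** 2) ** 0.5

    # Sort rows by distance and keep the first k
    return sorted(train, key=distance)[:k]

# Made-up training rows: (Score, Ratio, Ranking)
train = [(1200, 50, 10), (1000, 20, 40), (1190, 48, 15), (600, 30, 90)]
nearest = neighbors(train, (1200, 50), 2)
print(nearest)
```

Sorting by distance and slicing the first `k` rows plays the same role as sorting the table by its distance column and taking rows `0` through `k - 1`.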
ii. Now that he has a way to determine the k nearest neighbors, the last step is to create a prediction for
a new school’s US News Ranking.
To do this, Prof. Strauss will use the geometric mean of the k-nearest-neighbors’ Ranking values. The
geometric mean is calculated by multiplying all k of the Ranking values and then taking the k-th root
of the product.
For example, if k = 3 and the 3 nearest neighbors have Ranking values of 40, 50 and 60, then the
geometric mean is:
(40 × 50 × 60)^(1/3) ≈ 49.32
Prof. Strauss writes the following partially completed code to generate the k nearest neighbors:
def prod(array):
    result = 1
    for value in array:
        result = result * value
    return result
prod(neighbors_rankings) ** (1 / k)
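The geometric-mean prediction can be checked against the worked example with k = 3. In this sketch, `geometric_mean` is an illustrative wrapper that is not part of the exam, and `prod` is repeated so the block runs on its own.

```python
def prod(array):
    """Multiply all values in the array together."""
    result = 1
    for value in array:
        result = result * value
    return result

def geometric_mean(rankings):
    """k-th root of the product of the k ranking values."""
    k = len(rankings)
    return prod(rankings) ** (1 / k)

# Rankings from the worked example: 40, 50 and 60
print(round(geometric_mean([40, 50, 60]), 2))
```

In Python 3.8 and later, the standard library's `math.prod` could replace the hand-written loop.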