Data8 Fa21 Midterm
INSTRUCTIONS
This is your exam. Complete it either at exam.cs61a.org or, if that doesn’t work, by emailing course staff with your
solutions before the exam deadline.
This exam is intended for the student with email address <EMAILADDRESS>. If this is not your email address, notify
course staff immediately, as each exam is different. Do not distribute this exam PDF even after the exam ends, as
some students may be taking the exam in a different time zone.
For questions with circular bubbles, you should select exactly one choice.
○ You must choose either this option
○ Or this one, but not both!
For questions with square checkboxes, you may select multiple choices.
□ You could select this choice.
□ You could select this one too!
You may start your exam now. Your exam is due at <DEADLINE> Pacific Time. Go to the next page
to begin.
Exam generated for <EMAILADDRESS> 2
Preliminaries
You can complete and submit these questions before the exam starts. Note that ‘...’ can stand for any code after the
given variable.
(a) What is your full name?
(. . . 76 more rows)
(a) (3 pt) Help Will count how many restaurants there are for each cuisine. Write a line of code that outputs
a table with two columns: one column with the type of cuisine, and one column containing a count of how
many restaurants there are with that cuisine.
Reminder: the columns of RESTS_TBL are “REST_NAME”, “CUISTYP”, “Rating”, and “Distance From
Sproul”.
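The grouping idea behind part (a) can be sketched in plain Python. This is only an illustration, not the expected exam answer: the rows below are made up, and `rests` is a hypothetical stand-in for RESTS_TBL. In the `datascience` library the same grouping is typically done with the table's `group` method.

```python
from collections import Counter

# Hypothetical rows standing in for RESTS_TBL; every name and value is made up.
rests = [
    {"REST_NAME": "A", "CUISTYP": "Thai", "Rating": 4.5, "Distance From Sproul": 0.4},
    {"REST_NAME": "B", "CUISTYP": "Mexican", "Rating": 4.1, "Distance From Sproul": 1.2},
    {"REST_NAME": "C", "CUISTYP": "Thai", "Rating": 3.9, "Distance From Sproul": 0.8},
    {"REST_NAME": "D", "CUISTYP": "Italian", "Rating": 4.7, "Distance From Sproul": 0.3},
]

# Count how many rows share each value in the "CUISTYP" column.
cuisine_counts = Counter(row["CUISTYP"] for row in rests)

# Print one line per cuisine: the cuisine type and its count.
for cuisine, count in sorted(cuisine_counts.items()):
    print(cuisine, count)
```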
(b) (3 pt) Will wants to eat at the highest-rated restaurant. Write a line of code that evaluates to the name
of the restaurant with the highest rating. (You can assume there is only one restaurant with the highest
rating; there are no ties.)
Reminder: the columns of RESTS_TBL are “REST_NAME”, “CUISTYP”, “Rating”, and “Distance From
Sproul”.
(c) (3 pt) Melissa only wants to eat at a Thai restaurant. Write a line of code that evaluates to a table
containing all four columns but only the rows for restaurants whose cuisine is “Thai”.
Reminder: the columns of RESTS_TBL are “REST_NAME”, “CUISTYP”, “Rating”, and “Distance From
Sproul”.
(d) (3 pt) Eddie didn’t want to walk to any restaurants that were further than one mile away from Sproul.
Fill in the code below to assign the variable EDDIE_CHOICE to a table containing only restaurants that are
less than one mile from Sproul.
EDDIE_CHOICE = ...
Reminder: the columns of RESTS_TBL are “REST_NAME”, “CUISTYP”, “Rating”, and “Distance From
Sproul”.
(e) (3 pt) Will decides to randomly pick a restaurant from the restaurants that are less than one mile from
Sproul. Write code that randomly picks a restaurant from the EDDIE_CHOICE table and assigns the variable
WILL_CHOICE to the name of that restaurant.
WILL_CHOICE = ...
Reminder: the columns of RESTS_TBL are “REST_NAME”, “CUISTYP”, “Rating”, and “Distance From
Sproul”.
(f) (3 pt) Write a line of code that evaluates to the number of different cuisines that appear in the “CUISTYP”
column of the RESTS_TBL table.
(. . . 47 more rows)
(a) (3 pt) Write a line of code that evaluates to the total capacity if we reserved every room in the rooms
table.
Reminder: rooms’s columns are “Room”, “Capacity”, and “Region”. RESEVS’s columns are “STNAME”,
“Room”, “DAYCOL”, and “Time”.
(b) (3 pt) Write a line of code that evaluates to the number of reservations that TARGETPERSON has made.
Reminder: rooms’s columns are “Room”, “Capacity”, and “Region”. RESEVS’s columns are “STNAME”,
“Room”, “DAYCOL”, and “Time”.
(c) (4 pt) Write code that assigns the variable TOP_REGION to the region of campus that has the greatest number
of reservations. Note that the “Region” column of the rooms table shows the campus region for each room.
TOP_REGION = ...
Reminder: rooms’s columns are “Room”, “Capacity”, and “Region”. RESEVS’s columns are “STNAME”,
“Room”, “DAYCOL”, and “Time”.
(b) (4 pt) Noor brings her mug to each Data 8 lecture. Next week, Data 8 lectures will be on Monday,
Wednesday, and Friday. What is the chance that Noor brings a black mug to at least one of the three
lectures?
○ 3/9 + 3/9 + 3/9
○ 3/9 × 3/9 × 3/9
○ 1 − 6/9 × 6/9 × 6/9
○ 1 − 3/9 × 3/9 × 3/9
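The arithmetic behind an "at least one" chance can be checked with the complement rule. The sketch below assumes, since the setup is not part of this excerpt, that each lecture's mug is chosen independently and is black with chance 3/9. The names `p_black` and `n_lectures` are illustrative.

```python
from fractions import Fraction

# Assumption: each day's mug is black with chance 3/9, independently of other days.
p_black = Fraction(3, 9)
n_lectures = 3

# Complement rule: P(at least one black) = 1 - P(no black mug on any day).
p_none = (1 - p_black) ** n_lectures
p_at_least_one = 1 - p_none

print(p_none)          # 8/27
print(p_at_least_one)  # 19/27
```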
(c) (4 pt) One of Noor’s classes has online office hours in the morning. She will attend the office hours on
Tuesday and Thursday next week, bringing her mug with her. What is the chance that the mugs she has
on those two days are the same color?
○ 3/9
○ 3/9 × 3/9
○ 1 − 3/9 × 6/9
(b) (2 pt) For which sample size below is there a higher chance that the percent of LinkedIn users in the
sample will be at least 50%?
○ 200
○ 400
(c) (2 pt) For which sample size below is there a higher chance that the percent of LinkedIn users in the
sample will be at least 25% but less than 50%?
○ 200
○ 400
Temperature Sex
30.8 M
31.5 F
32.4 F
(. . . 97 more rows)
(a) (6 pt) Ellen decides to visualize her data before doing any inference. She creates the following histograms,
using the same bins for female and male turtles. All bars of the histograms are clearly visible.
(b) (4 pt) Ellen performs an A/B test to see whether females in the population in general have higher
incubation temperatures than the males, or if the observed difference in distributions is due to chance.
Ellen’s test statistic is the difference between average incubation temperatures, defined as “female average
minus male average”. She simulates the statistic 1000 times under the null hypothesis. The histogram
below shows the 1000 simulated differences. The red dot shows the observed difference.
(b) (3 pt) In order to decide between their two hypotheses, the data scientists have picked an appropriate
test statistic and simulated it 10,000 times under appropriate conditions. One of the graphs below is the
histogram of their simulated values. Which one is it, and why? [Note that in each graph, some relevant
values are labeled on the horizontal axis.]
○ Testing Option A
○ Testing Option B
○ Testing Option C
(c) (2 pt) Explain your choice above. One or two sentences should suffice.
(d) (3 pt) The 10,000 simulated values of the data scientists’ test statistic are in an array called SIM_STAT_ARR.
Write an expression that evaluates to the p-value of the test.
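An empirical p-value is the proportion of simulated statistics that are at least as extreme as the observed one. The sketch below is one possible shape for that computation, not the exam's answer key. It assumes that large values of the statistic favor the alternative, and both `sim_stats` (a stand-in for SIM_STAT_ARR) and `observed_stat` are hypothetical values invented for illustration.

```python
import random

random.seed(0)

# Hypothetical stand-in for SIM_STAT_ARR: 10,000 made-up simulated statistics.
sim_stats = [random.gauss(0, 1) for _ in range(10_000)]

# Hypothetical observed value of the test statistic.
observed_stat = 2.0


def empirical_p_value(simulated, observed):
    """Proportion of simulated statistics that are >= the observed one.

    Assumes large values of the statistic point toward the alternative.
    """
    return sum(1 for s in simulated if s >= observed) / len(simulated)


p_value = empirical_p_value(sim_stats, observed_stat)
print(p_value)
```

If small values favored the alternative instead, the comparison would flip to `<=`.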
(b) (5 pt) The data scientists are using a 1% cutoff for the p-value of the test. They run the test and the
p-value comes out to be 0.5%, that is, 1 in 200.
Select all of the true statements below. One or more may be true; make sure you select all that are
true.
□ The data scientists will conclude that the data are consistent with the null hypothesis.
□ There is only a 1 in 200 chance that the null hypothesis is true.
□ There is a 199 in 200 chance that the alternative hypothesis is true.
□ The data scientists will reject the null hypothesis.
□ The assumptions made in the null hypothesis are used in the calculation of the p-value.
□ None of the above statements is true.
Year    Extraversion  Number of Textees  Hours of Sleep  Handedness    First Pant Leg  Sleep Position
Second  8             5                  6               Right-handed  Right           Left
Second  7             8                  7.5             Right-handed  Right           Left
(b) (5 pt) Complete the code below so that the last line evaluates to an array of 10,000 simulated values of
the maximum extraversion score in a random sample of size 25 drawn without replacement from all the
students. Your code should use the function FUN_NAME that you defined above.
repetitions = ...
SIM_VALS = ...
SIM_VALS
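The simulation pattern described in part (b), repeating one random draw many times and collecting the statistic, can be sketched in plain Python. This is an illustration under stated assumptions rather than the expected answer: `all_scores` is made-up data, and `one_sample_max` is a hypothetical helper playing the role of the function from the earlier part, whose definition is not shown in this excerpt.

```python
import random

random.seed(8)

# Hypothetical extraversion scores for all students (made-up integers from 1 to 10).
all_scores = [random.randint(1, 10) for _ in range(200)]


def one_sample_max(scores, sample_size):
    """Maximum score in one random sample drawn without replacement."""
    # random.sample draws distinct positions, i.e. without replacement.
    return max(random.sample(scores, sample_size))


repetitions = 10_000

# One simulated maximum per repetition, each from a fresh sample of size 25.
sim_vals = [one_sample_max(all_scores, 25) for _ in range(repetitions)]

print(len(sim_vals))
```

In course-style code the same loop would usually append each simulated value to a NumPy array instead of building a list.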
(c) (3 pt) A student mistypes the sample size in the previous question to be 55 instead of 25. One of the
histograms below shows the distribution of the maximum values simulated by this student. The other
shows the distribution of the maximum values that you simulated using a sample size of 25. Which is
which?
A:
B:
○ A is sample of 25, B is sample of 55
○ A is sample of 55, B is sample of 25
(. . . 86 more rows)
(a) (3 pt) The histogram below shows the distribution of the triple hop distances, drawn using the following
code.
jump.hist('Triple Hop', bins=np.arange(300, 900, 50))
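When reading this histogram it helps to know which bin edges the `bins` argument produces. `np.arange(300, 900, 50)` excludes its stop value, just like Python's built-in `range`, so the sketch below uses `range` to list the same integer edges without needing NumPy. The name `bin_edges` is illustrative.

```python
# Same integer values as np.arange(300, 900, 50): the stop value 900 is excluded.
bin_edges = list(range(300, 900, 50))

print(bin_edges)           # [300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
print(len(bin_edges) - 1)  # 11 bins, each 50 units wide
```

The last edge is 850, not 900, so the final bin starts at 800.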
(b) (5 pt) The scatter plot below has a point for each of the athletes. Pick all the conclusions that can be
drawn from the scatter plot. Make sure you pick all that apply.
□ More than half the athletes jumped less than 60 centimeters vertically.
□ Most of the athletes whose triple hop distances were longer than average also jumped higher than
average.
□ If athletes were to increase their triple hop distances then they would be able to jump higher.
□ If athletes were to increase the heights of their vertical jumps, they would be able to triple hop longer
distances.
□ None of the above conclusions can be drawn from the scatter plot.
No more questions.