Stat 1010 Guiding Answers
Fall 2024
1 Chapter 2
Data don’t always arrive in the most convenient form. You may have the data, but they’re not
in the same place or the same format. Unless you have data organized in a common data table,
you’ll find it hard or impossible to get the answers you need.
For this exercise, you must prepare the data that are needed to explore the relationship between
the sales of a company, retailer Best Buy, and the health of the economy. Is there a relationship
between the amount of money available to be spent, called the disposable income, and the net
sales of retailer Best Buy?
The economic data come from the online repository hosted by the Federal Reserve Bank of St.
Louis. The data are collected monthly, are reported in the number of billions of dollars (at an
annual rate), and date back to 1959. The data for 2010 are as shown in the following table:
The data for Best Buy come from another source. Compustat maintains a database of company
information gleaned from reports that are required of all publicly traded firms. For a company to
have its stock bought and sold, it must report data such as these quarterly gross profits, given in
millions of dollars. The company data extend back to 2005 (only one year is shown).
Year  Quarter  Gross Profits ($ millions)
2010  1        $3,036.00
2010  2        $3,156.00
2010  3        $3,233.00
2010  4        $4,214.00
(a) Explain why it would be useful to merge these two data tables. What questions do you
think would be interesting to answer based on the merged information?
Answer: It would be useful to merge the two datasets to explore the relationship between
the amount of money available to be spent, called the disposable income, and the net
sales of Best Buy, captured by its quarterly gross profits. An interesting question is
whether there is an association between these two variables, in the sense that one of
them influences the other.
(b) Describe the difference in interpretation of a row in the two tables. Do the tables have a
common frequency?
Answer: In the first table, each row gives the disposable income (in billions of
dollars) for one month of 2010. In the second table, each row gives Best Buy's gross
profits (in millions of dollars) for one quarter of 2010. Since the first table
records monthly data and the second table records quarterly data, the two tables do
not have a common frequency.
(c) The separate data tables each have a numerical column of gross profits or disposable income.
Are the units of these comparable, or should they be expressed with common scales?
Answer: In the first table, monthly disposable income is expressed in billions of
dollars, whereas in the second table, quarterly gross profits are expressed in millions
of dollars. The units are therefore not directly comparable, and the two columns should
be put on a common scale before the tables are merged.
(d) What should you do if you want to arrange the data in a table that has a quarterly time
frequency? Can you copy the data columns directly, or do you have to perform some aggregation
or recoding first?
Answer: In order to arrange the data in a table having a quarterly frequency, we have
to aggregate the monthly disposable income data into 4 quarters of 2010. For the gross
profit data, since it is already in quarterly frequency, we can directly copy the numerical
column from the second table.
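The aggregation described above can be sketched in Python. This is a minimal sketch rather than a prescribed method: the monthly disposable income figures are hypothetical placeholders (the source table is not reproduced here), and averaging the three months in each quarter is an assumption that suits figures reported at an annual rate. The gross profits are those in the table above, converted from millions to billions of dollars.

```python
# Hypothetical monthly disposable income for 2010 (billions of dollars, annual rate).
# These values are placeholders, NOT the actual Federal Reserve figures.
monthly_income = [
    10900, 10920, 10950,  # Q1
    11000, 11040, 11060,  # Q2
    11080, 11120, 11130,  # Q3
    11170, 11190, 11230,  # Q4
]

# Best Buy quarterly gross profits from the table, in millions of dollars.
gross_profit_millions = [3036.0, 3156.0, 3233.0, 4214.0]


def quarterly_means(monthly):
    """Average each block of three consecutive months into one quarterly value."""
    return [sum(monthly[i:i + 3]) / 3 for i in range(0, len(monthly), 3)]


# Build the merged table: one row per quarter, both columns in billions of dollars.
merged = []
quarterly_income = quarterly_means(monthly_income)
for quarter, (income, profit) in enumerate(
    zip(quarterly_income, gross_profit_millions), start=1
):
    merged.append(
        {
            "year": 2010,
            "quarter": quarter,
            "disposable_income_bn": round(income, 1),
            "gross_profit_bn": profit / 1000,  # millions -> billions
        }
    )

for row in merged:
    print(row)
```

Averaging, rather than summing, keeps the quarterly value on the same annual-rate scale as the monthly values.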
(e) Suggest improved names for the columns in the merged data table. How do you want to
represent the information about the date?
Answer: In the merged data table, we can use the column names Year, Quarter,
Disposable Income (in billions of dollars), and Best Buy’s Gross Profit (in billions of dollars).
Answer:
(g) With the data merged, what annual shopping ritual becomes apparent?
Answer: Although disposable income rises progressively each quarter, Best Buy’s
gross profits jumped by roughly a billion dollars in the fourth quarter compared to
the preceding quarters (from $3,233 million in the third quarter to $4,214 million in
the fourth). Black Friday and holiday shopping are the most likely reasons for this.
0.5 marks for signing the work, 0.5 marks for submitting a pdf, 0.5 marks for each question but
part (f), 1 mark for part (f).
2 Chapter 3
• 2 marks for completing a submission.
• 2 marks for using customized colours (not using the default colours).
3 Chapter 4
The following table summarizes sales by day of the week at a convenience store operated by a
major domestic gasoline retailer. The data cover about one year, with 52 values for each day of
the week.
(a) Which consecutive two-day period produces the highest total level of sales during the year?
Answer: The sales totals for the consecutive two-day periods are approximately
5592, 5949, 6267, 6186, 7299, and 8006. The two-day period with the largest total is
Saturday and Sunday, that is, the weekend, as expected.
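The comparison can be sketched in a few lines of Python. The totals are those quoted in the answer; the pairing of each total with a two-day period (Monday-Tuesday first, Saturday-Sunday last) is an assumption, chosen because it is consistent with the answer's conclusion.

```python
# Two-day sales totals quoted in the answer.
totals = [5592, 5949, 6267, 6186, 7299, 8006]

# Assumed labels: the source does not state which total belongs to which pair.
pairs = ["Mon-Tue", "Tue-Wed", "Wed-Thu", "Thu-Fri", "Fri-Sat", "Sat-Sun"]

# Pick the pair with the largest total.
best_total, best_pair = max(zip(totals, pairs))
print(best_pair, best_total)
```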
(b) Do the distributions of the sales data grouped by day (as summarized here) overlap, or are
the seven groups relatively distinct?
Answer: The distributions overlap. Sorting the seven means from smallest to largest
and taking the differences between adjacent means gives, in increasing order,
14, 56, 81, 262, 392, 626. Meanwhile, the standard deviations sorted from smallest to
largest are 314, 393, 415, 575, 632, 712, 865. Most of the differences in means (four of
six) are less than the smallest standard deviation, and all are less than the largest
standard deviation. We would therefore expect the distributions to overlap if the
empirical rule holds.
We can do a quick sanity check by plotting seven normal distributions with the means
and standard deviations of the sales data on Desmos. This confirms the overlap.
Figure 1: Overlapping normal distributions with the means and standard
deviations given by the sales data.
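The comparison of mean differences with standard deviations in this answer can be checked with a short Python sketch. It uses only the two lists of numbers quoted in the answer; the variable names are illustrative.

```python
# Differences between adjacent sorted means, as quoted in the answer.
mean_gaps = [14, 56, 81, 262, 392, 626]

# Standard deviations of daily sales, sorted, as quoted in the answer.
sds = [314, 393, 415, 575, 632, 712, 865]

# Gaps smaller than the smallest standard deviation.
below_smallest = [gap for gap in mean_gaps if gap < min(sds)]

# Whether every gap is smaller than the largest standard deviation.
all_below_largest = all(gap < max(sds) for gap in mean_gaps)

print(len(below_smallest), "of", len(mean_gaps), "gaps are below the smallest SD")
print("All gaps below the largest SD:", all_below_largest)
```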
(c) These data summarize sales over 52 × 7 = 364 consecutive days. With that in mind, what
important aspect of these data is hidden by this table?
Answer: These aggregated data hide trends and other seasonal patterns. The sales take
place over time, and as such the sales on one day are not necessarily independent of the
sales on another. For instance, sales by the major domestic gasoline retailer may be on
an uptrend or downtrend, and aggregating the data by day of week obscures this. There
may also be additional important seasonal variation in the data – perhaps sales pick up
during the summer and wane during the winter.
(a) Find the standard deviation and interquartile range of the sizes of the songs (in megabytes).
(b) Which of these summary statistics is most affected by the presence of the outlier? How do
you know?
Answer: The standard deviation is the most affected; the IQR is not as affected by the
presence of the outlier. The IQR is based on quartiles, which are more robust to
outliers than the standard deviation, which is based on deviations from the mean. We can
confirm this by excluding the most extreme outlier from the data and recomputing the
summary statistics without it, as in the following problem. The standard deviation
without the song is 0.5777002, while the IQR without it is 0.7606113. Only the standard
deviation changes by much.
(c) Exclude the most extreme outlier from the data and find the mean and SD without this song.
Which summary changes more when the outlier is excluded: the mean or SD?
Answer: The mean including the outlier is 2.734379, while the mean without it is
2.587528. The standard deviation without the outlier is 0.5777002. The mean changes
by 0.146851 while the standard deviation changes by 0.3726481. The standard deviation
changes by more than the mean.
This question would have made more sense if the question had asked to compare the
standard deviation and IQR. We will accept answers for this as well.
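The robustness of the IQR relative to the standard deviation can be illustrated with a small Python sketch. The song sizes below are hypothetical (they are not the Beatles data), and `statistics.quantiles` uses one of several quartile conventions, so other software may report slightly different IQR values.

```python
import statistics

# Hypothetical song sizes in MB, with one large outlier at the end.
sizes = [2.0, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0, 3.2, 6.5]
sizes_without_outlier = sizes[:-1]


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# How much each summary changes when the outlier is removed.
sd_change = statistics.stdev(sizes) - statistics.stdev(sizes_without_outlier)
iqr_change = iqr(sizes) - iqr(sizes_without_outlier)

print("Change in SD: ", round(sd_change, 3))
print("Change in IQR:", round(iqr_change, 3))
```

With these placeholder values the standard deviation moves far more than the IQR, which matches the pattern reported in the answers above.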
(d) Create a histogram of the data and describe its shape.
Answer:
Figure 2: Histogram of Beatles data. The size and time columns are right-skewed with
a single outlier, and are highly correlated. Most Beatles songs are around 2-4 MB and
between 2 and 4 minutes long, but Hey Jude is the exception at around 7 minutes. The year
column is relatively uniformly distributed and bounded between 1962 and 1970, apart
from two peaks in 1964 and 1969.
From the introduction, one should have used the size of the songs for the histogram in
this exercise. Given that this was not clear and there were many questions about it in
office hours, we will also give credit for a histogram of at least one column that
contains numerical data. The point breakdown should be as follows:
• 1 point for noting the outlier in either size or time, or for noting that the year
is bounded in the interval from 1962 to 1970.
• 1 point for noting the right-skew in the size and time histograms, or for noting the
lack of right-skew in the year histogram.
5 Chapter 5
To gauge the reactions of possible customers, the manufacturer of a new type of cellular telephone
displayed the product at a kiosk in a busy shopping mall. The following table summarizes the
results for the customers who stopped to look at the phone:
(a) Is the reaction to the new phone associated with the sex of the customer? How strong is the
association?
Answer: In the mosaic plot on the right below, we see that the proportions of the
reactions differ between men and women. Thus, according to the mosaic plot of the data,
the reaction seems to be associated with the sex of the customer. However,
the association does not seem to be very strong.
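One way to put a number on "how strong" is Cramér's V, a measure the answer above does not compute. The sketch below is illustrative only: the counts are hypothetical, chosen to mimic the pattern described in this exercise (men mostly ambivalent, more than half of women favorable), and are not the survey data.

```python
import math

# Hypothetical counts (NOT the survey data): rows are sex, columns are reaction.
reactions = ["Favorable", "Ambivalent", "Unfavorable"]
counts = {
    "Male": [30, 50, 20],
    "Female": [60, 25, 15],
}

rows = list(counts.values())
n = sum(sum(row) for row in rows)
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(rows):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

# Cramer's V rescales chi-squared to lie between 0 (none) and 1 (perfect association).
cramers_v = math.sqrt(chi2 / (n * min(len(rows) - 1, len(reactions) - 1)))
print("Cramer's V:", round(cramers_v, 3))
```

A value near 0 indicates little association and a value near 1 a very strong one.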
(b) How should the company use the information from this study when marketing its new
product?
Answer: The study shows that among the men who responded to the survey, most of the
reactions were ambivalent, and among the women who responded, more than half of the
reactions were favorable. It would be better to use different marketing strategies for
men and women. For men, the company should focus on turning the ambivalent reactions
into favorable ones, as "ambivalent male" customers have the highest frequency. For
women, the company should focus on meeting the expectations of the "favorable female"
customers.
(c) Can you think of an underlying lurking variable that might complicate the relationship shown
here? Justify your answer.
Answer: One possible lurking variable in this study is the age of the customers.
It is possible that the males who responded to the survey were mostly elderly people,
which may explain the many ambivalent reactions from the males, whereas the females
surveyed may have been younger people with a favorable attitude towards new technology.
Another possible lurking variable is the time of day at which the survey was conducted.
Stratifying the data by these lurking variables may reveal phenomena such as
Simpson’s paradox.
8 marks for the mosaic plot. A correct mosaic plot of the reactions of the customers versus
their gender is shown below.
6 Chapter 6
Drive Preferences. These data give the percentage of new vehicles bought with four-wheel drive,
state by state in the continental United States in 2014. The data include the average temperature
in that state in January.
(a) Do you expect the correlation between these variables to be positive or negative? Explain
your choice.
Answer: We expect the correlation between the two variables to be negative. Four-wheel
drive vehicles are especially helpful in winter weather, when the temperature
is low and conditions are poor. As such, as the temperature in January decreases, we
would expect the percentage of vehicles bought with four-wheel drive to increase.
(b) Draw the associated scatterplot. Does the direction of the association match what you
expected to find?
Answer:
Figure 4: Scatterplot of percentage of new vehicles with four-wheel drive against the
average temperature in January, where each datapoint represents a state within the
continental United States. The direction of the association matches what we expect.
(c) In which states is the percentage choosing four-wheel drive the highest? Lowest?
Answer: It is lowest in region 11, where 11% of new vehicles have four-wheel drive and
the average temperature in January is 59 degrees Fahrenheit. It is highest in regions 36
and 43, where 76% of new vehicles have four-wheel drive and the average temperatures
in January are 20.8 and 7.58 degrees Fahrenheit, respectively.
(d) Find the correlation between the variables. Is it weaker or stronger than you expected?
Answer: The correlation between the two variables is -0.8315649. This is stronger than
what one would expect, as the temperature in winter is only one of many factors (price,
availability, resale value, etc.) that one would consider when buying a car.
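The correlation calculation can be sketched in Python from the definition of Pearson's r. The data below are illustrative only: three points echo the values quoted in part (c), the other two are hypothetical, and the result will not match the reported -0.8315649, which comes from the full state-level data set.

```python
import math

# Illustrative data (mostly hypothetical, NOT the full state-level data set).
jan_temp_f = [7.6, 20.8, 30.0, 45.0, 59.0]   # average January temperature
pct_4wd = [76.0, 76.0, 50.0, 30.0, 11.0]      # % of new vehicles with four-wheel drive


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


r = pearson_r(jan_temp_f, pct_4wd)
print("Correlation:", round(r, 3))
```

A negative value of r indicates that the percentage of four-wheel drive purchases falls as the January temperature rises.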