BIOS 521 HW3 Solutions
Exercise 2.22
(a) Conservatives% = 372/910 * 100 = 40.88%
(b) Favor citizenship% = 278/910 * 100 = 30.55%
(c) Conservatives & favor citizenship% = 57/910 * 100 = 6.26%
(e) Political ideology and immigration views are not independent. The proportion of those who
favor citizenship differs among different categories for political ideology based on part (d). The
data implies that there could be a negative relation between the conservativeness of ideology and
openness to illegal immigrants.
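The arithmetic in parts (a)-(c) and the independence claim in part (e) can be checked with a
short Python sketch. It uses only the counts quoted above (372 conservatives, 278 in favor of
citizenship, 57 in both groups, 910 respondents); the variable names are illustrative. If
ideology and immigration views were independent, the joint proportion would be close to the
product of the two marginal proportions.

```python
# Counts taken from parts (a)-(c); variable names are illustrative.
n_total = 910
n_conservative = 372
n_favor = 278
n_both = 57

p_conservative = n_conservative / n_total  # part (a)
p_favor = n_favor / n_total                # part (b)
p_both = n_both / n_total                  # part (c)

# Under independence, P(A and B) would equal P(A) * P(B).
p_both_if_independent = p_conservative * p_favor

# Conditional proportion: share of conservatives who favor citizenship.
p_favor_given_conservative = n_both / n_conservative

print(f"Conservative:                   {p_conservative:.2%}")
print(f"Favor citizenship:              {p_favor:.2%}")
print(f"Both (observed):                {p_both:.2%}")
print(f"Both (if independent):          {p_both_if_independent:.2%}")
print(f"Favor citizenship | conservative: {p_favor_given_conservative:.2%}")
```

The observed joint proportion (6.26%) is roughly half of what independence would predict
(12.49%), and only 15.32% of conservatives favor citizenship compared with 30.55% overall.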
Exercise 2.32
We would expect this distribution to be left-skewed.
The scores have a mean of 85, a standard deviation (SD) of 15 and a maximum value of 100.
Recall that SD is approximately the average distance of a test score from the mean value of 85.
The maximum distance that a score can be above the mean is 15 points (i.e. a perfect 100 score).
There must be scores more than 15 points away from the mean in order to obtain a SD of 15, and
these scores must be smaller than the mean. Thus there is likely to be a longer tail of data to the
left of the mean, making the data left-skewed. Note that this skewing effect is a fairly common
phenomenon when you have measurements with a natural upper or lower bound.
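The argument above can be illustrated with a small Python sketch. The scores below are
hypothetical (they are not the exercise data); they were chosen so that the maximum is 100, the
mean is 85, and the SD is close to 15. The skewness measure used here (the average cubed
z-score) is one common choice and is an assumption of this sketch.

```python
import statistics

# Hypothetical scores (NOT the exercise data), capped at 100 and chosen so the
# mean is 85 and the SD is close to 15.
scores = [100, 98, 97, 95, 95, 93, 92, 90, 90, 88, 85, 80, 70, 55, 47]

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)  # population SD

# Largest deviation above the mean vs. largest deviation below it.
max_above = max(scores) - mean
max_below = mean - min(scores)

# Moment-based skewness: the average cubed z-score.
# A negative value indicates a longer left tail.
skewness = sum(((x - mean) / sd) ** 3 for x in scores) / len(scores)

print("mean:", mean)
print("SD:", round(sd, 2))
print("max distance above mean:", max_above)
print("max distance below mean:", max_below)
print("skewness:", round(skewness, 2))
```

No score can sit more than 15 points above the mean, so the large deviations needed to reach an
SD near 15 all come from the low end, and the skewness comes out negative.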
Exercise 2.34
(a) From the histogram, we can see that the distribution is bimodal, which is not apparent in the
boxplot. In contrast, the boxplot does a better job of displaying the location of extreme values on
the right end of the distribution.
(b) The data contains finishing times for both male and female marathon winners. It is possible
that the two peaks correspond to times for males and females since there is likely a sex difference
in marathon times.
NOTE: When you see multimodality in a histogram, it usually means you have a mixture of
groups that have distinct shapes (peaks). It is good practice to figure out what groups the peaks
correspond to and ensure these groupings make sense and are not the result of some type of
unexpected or unintended data artifacts.
(c) Based on the separate boxplots, marathon finishing times for men are, on average, faster
than those for women. In both groups we see larger “outlier” values on the right sides of the
boxplots.
(d) Here we can see how marathon times have changed over time. For both men and women,
marathon times dropped dramatically between 1970 and 1975 and remained fairly steady
thereafter to 2000. At any point in time, the marathon time for men is shorter than that of
women. Neither of these trends was evident in the histogram or boxplots.
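The NOTE in part (b) about mixtures can be sketched in Python. The times below are made up for
illustration (they are not the marathon data): each group has a single peak on its own, but the
combined histogram shows two peaks separated by empty bins. The `histogram` helper and the
5-minute bin width are assumptions of this sketch.

```python
from collections import Counter

# Made-up finishing times in minutes (NOT the exercise data).
# Each group is unimodal on its own.
group_a = [128, 129, 130, 130, 131, 131, 132, 133]
group_b = [145, 146, 147, 147, 148, 148, 149, 150]

def histogram(times, width=5):
    """Count values per bin; each key is the left edge of a bin."""
    return Counter((t // width) * width for t in times)

combined = histogram(group_a + group_b)

# Text histogram of the pooled data: two separate peaks appear.
for left_edge in range(125, 155, 5):
    count = combined.get(left_edge, 0)
    print(f"{left_edge}-{left_edge + 5}: {'#' * count}")
```

Plotting each group separately, as in part (c), is the natural follow-up once the peaks have
been traced back to their groups.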
Exercise 8.4 (specify the direction, strength, and shape of the association)
(a) Strong relationship, the direction of the relationship changes (non-monotonic, cubic),
non-linear
(b) Strong relationship, the direction of the relationship changes (non-monotonic, quadratic),
non-linear
(c) Strong relationship, positive association, linear
(d) Weak relationship, very slight positive association, linear
(e) Weak relationship, negative association, linear
(f) Moderate relationship, negative association, linear
Exercise 8.6
(a) The relationship between husbands’ and wives’ ages is strong, positive and linear. A couple
of data points do not fit the overall pattern, however; for example, a man in his early 40s
married to a woman in her early-to-mid 50s.
(b) The relationship between husbands’ and wives’ heights is weak but still positive.
(c) The plot for age shows a stronger correlation because the points are more tightly clustered
together in a linear pattern.
(d) Correlation would be unaffected by rescaling of units. That is, shorter people are still more
likely to marry shorter people and taller people are still more likely to marry taller people
regardless of whether you measure height in inches or centimeters.
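The claim in part (d) can be checked numerically. The sketch below uses hypothetical heights
(not the exercise data) and a hand-written `pearson_r` helper, then converts inches to
centimeters and recomputes the correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights in inches (NOT the exercise data).
husband_in = [66, 68, 70, 71, 72, 74]
wife_in = [63, 62, 66, 64, 67, 65]

# Same heights converted to centimeters (1 inch = 2.54 cm).
husband_cm = [2.54 * h for h in husband_in]
wife_cm = [2.54 * w for w in wife_in]

r_inches = pearson_r(husband_in, wife_in)
r_cm = pearson_r(husband_cm, wife_cm)

print(round(r_inches, 4), round(r_cm, 4))
```

The two values agree up to floating-point rounding, because multiplying a variable by a
positive constant scales the numerator and denominator of r by the same factor.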
Exercise 8.7
Match the following correlations:
Exercise 8.9
(a) There seems to be a weak, positive relationship between the two variables.
(b) There are several plausible explanations for this association. As indicated in (c), gender
could be associated with both height and fastest driving speed: men tend to be taller and are
more likely to drive fast than women.
(c) Gender acts like a confounding variable (defined on p. 25 of the textbook) since it is
correlated with both the height variable and the fastest-speed variable.
Exercise 8.25
(a) Murder = −29.901 + 2.559 × poverty%
(b) The expected rate of murders in metropolitan areas with 0% poverty is −29.901 per million.
This is not a meaningful value. A murder rate cannot be less than zero, and (unfortunately)
no metropolitan area has 0% poverty.
(c) For each additional percentage point of poverty, the expected number of murders per million
increases by 2.559.
(d) R squared is the proportion of the variability in the response that is explained by the
linear model. An R squared of 0.7052 means that poverty percentage accounts for about 70.52%
of the variability in murder rates across these metropolitan areas, which suggests that the
linear model fits the data reasonably well.
(e) In simple linear regression, R squared equals the square of the correlation coefficient.
Since the slope is positive, the correlation is the positive square root: square root of
0.7052 ≈ 0.8398.
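Parts (a)-(c) and (e) can be reproduced with a short Python sketch. The coefficients and R
squared are the values quoted above; the function name and the 20% poverty level used to
illustrate the slope are assumptions chosen for the example.

```python
import math

# Values quoted in the solution above.
INTERCEPT = -29.901  # murders per million at 0% poverty (part b)
SLOPE = 2.559        # change per percentage point of poverty (part c)
R_SQUARED = 0.7052   # part (d)

def predicted_murders(poverty_pct):
    """Predicted murders per million from the fitted line in part (a)."""
    return INTERCEPT + SLOPE * poverty_pct

# Part (b): the intercept is the prediction at 0% poverty.
at_zero = predicted_murders(0)

# Part (c): one extra percentage point changes the prediction by the slope.
# The 20% starting point is arbitrary.
slope_effect = predicted_murders(21) - predicted_murders(20)

# Part (e): r has the same sign as the slope and magnitude sqrt(R squared).
r = math.copysign(math.sqrt(R_SQUARED), SLOPE)

print(at_zero, round(slope_effect, 3), round(r, 4))
```

Using `math.copysign` makes the sign rule explicit: with a negative slope the same R squared
would give r ≈ −0.8398.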
Computing question 1
Since both var1 and var2 are quantitative, a scatterplot is the most sensible way to display
the bivariate relationship between the variables.
Based on the scatterplot, the variables do not appear to be associated. Alternatively, we might
say they appear independent, since changes in one variable are not accompanied by any
systematic change in the other.
Computing question 2
Since both variables are quantitative, we can use correlation as the descriptive statistic to
quantify their relationship: the computed value for correlation is very close to zero, consistent
with the independence we observed in the scatterplot.
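As a rough illustration of what a near-zero correlation looks like, the Python sketch below
simulates two independent variables and computes their correlation. The simulated `var1` and
`var2` are stand-ins and not the homework data; the seed, the sample size, and the `pearson_r`
helper are assumptions of this sketch.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated stand-ins for var1 and var2 (NOT the homework data):
# two independent standard normal samples.
rng = random.Random(0)
var1 = [rng.gauss(0, 1) for _ in range(5000)]
var2 = [rng.gauss(0, 1) for _ in range(5000)]

r = pearson_r(var1, var2)
print(round(r, 3))
```

With independent variables and a sample this large, the computed correlation should land very
close to zero, matching the pattern described above.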