Chapter 3 Solutions

1. Outliers are observations that fall well above or below the overall bulk of the data.

Consider a set of 50 (univariate) data points with a single outlier. Suppose the
outlier is removed from the data set. Which of the following is/are always true?
Select all that apply.
(A) The removal will cause the mean to decrease.
(B) The removal will cause the interquartile range to decrease.
(C) The removal will cause the standard deviation to decrease.
(D) The removal will cause the range to change.
(C) and (D) are true. (A) is not always true: the mean will increase if the outlier
falls below the bulk of the data. For (B), the interquartile range depends on the
values of Q1 and Q3. Suppose the outlier falls above the bulk of the data. When it is
removed, the values of Q1 and Q3 can either remain the same or become smaller.
Depending on which happens, and on the magnitude of the changes in Q1 and Q3, the
interquartile range can increase, remain the same or decrease. The same argument
applies if the outlier falls below the bulk of the data. For example,
• the IQR decreases when 60 (the outlier) is removed from 4, 6, 6, 7, 11, 60.
• the IQR increases when 60 (the outlier) is removed from 1, 5, 6, 7, 7, 60.
• the IQR remains unchanged when 60 (the outlier) is removed from 2, 2, 2, 4,
4, 60.
The standard deviation will decrease if the outlier is removed. The range is the
difference between the maximum and minimum values. Since the single outlier is
either the maximum or the minimum, its removal will cause a decrease in the range.
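The three IQR examples above can be checked with a short Python sketch. It assumes that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd); other quartile conventions may give different numbers. The helper name `iqr` is illustrative only.

```python
from statistics import median

def iqr(values):
    """IQR under the median-of-halves convention (an assumption)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values below the median position
    upper = data[len(data) - half:]  # values above the median position
    return median(upper) - median(lower)

examples = {
    "decreases": [4, 6, 6, 7, 11, 60],
    "increases": [1, 5, 6, 7, 7, 60],
    "unchanged": [2, 2, 2, 4, 4, 60],
}

for label, data in examples.items():
    without_outlier = [v for v in data if v != 60]
    print(label, iqr(data), "->", iqr(without_outlier))
```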
2. The GEA1000 midterm results for the year 2050 Semester 1 are shown in the boxplot
below. There were 50 students who took the test, and the test scores are out of 100.
No outliers were removed.

Which of the following can be derived from the boxplot? Select all that apply.
(A) There is at least one outlier.
(B) The range is 40.
(C) The interquartile range is 40.
(D) The standard deviation is 14.
Only (A) is correct. There is one outlier shown at 100. The range is the difference
between the maximum and minimum, which is 50. The interquartile range is the
difference between the 3rd and 1st quartiles, which is 14. The standard deviation
cannot be derived from the boxplot.
3. Suppose that there are 76 pairs of siblings living in a particular block in Ang Sua,
where the older sibling is always heavier than the younger sibling. Consider a scatter
plot using the younger sibling’s weight to predict the older sibling’s weight, where each
point in the scatter plot represents the weights of a pair of two siblings in the block.
Which of the following statements must be true?
(I) There is a positive association between the older and younger siblings’ weights.
(II) All the points lie above the line y = x in the scatter plot.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (B). Since each older sibling is heavier than his or her younger sibling, for each
point (x, y), we must have y > x. Thus, all the points will lie above the line y = x in
the scatter plot. With only this condition that all the points lie above the line y = x,
it is not possible to determine the direction of association between the two variables.
For example, y can be negatively associated with x as shown in the plot below.

4. Consider data sets A, B and C, each consisting of 10,000 numbers with mean 5. The
histograms for A, B and C are shown below.

Order the data sets according to the values of their standard deviations, from the
smallest to the largest.
(A) A, B, C.
(B) A, C, B.
(C) B, A, C.
(D) B, C, A.
Answer is (C). Note that B has a smaller standard deviation compared to A, since more
of the values are closer to the mean compared to A. C has a larger standard deviation
compared to A, since more of the values are further from the mean compared to A.
Thus the required order is B, A, C.
5. The five-number summary for a numerical variable X with 77 values is given as
57, 68, 70, 72, 77. Define Y = 10 − 2X. What is the IQR of Y ?
(A) −8.
(B) −2.
(C) 4.
(D) 8.
Answer is (D). The IQR of X is Q3 − Q1 = 72 − 68 = 4. The effect of transforming
X into Y is that the data points will be reordered in ‘reverse’ way, scaled by a factor
of 2, and finally translated by a magnitude of 10. It means that, for example, the
maximum value of X will be mapped to a minimum value for Y , Q1 for X will
become Q3 for Y , etc. Thus the five-number summary of Y = 10 − 2X will be
−144, −134, −130, −126, −104 and the IQR of Y is −126 − (−134) = 8.
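As a quick check of the transformation argument above, the following Python sketch applies Y = 10 − 2X to the five-number summary of X and re-sorts the results. It relies on the fact used in the solution that a decreasing linear map reverses the order of the summary values; the variable names are illustrative.

```python
x_summary = [57, 68, 70, 72, 77]  # min, Q1, median, Q3, max of X

# Y = 10 - 2X reverses the ordering, so sort after transforming.
y_summary = sorted(10 - 2 * x for x in x_summary)

iqr_x = x_summary[3] - x_summary[1]
iqr_y = y_summary[3] - y_summary[1]

print(y_summary)
print(iqr_x, iqr_y)
```

The IQR of Y is |−2| times the IQR of X, and it is never negative, which rules out options (A) and (B) immediately.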
6. The boxplot below shows the distribution of the marks of 30 students.

Which of the following statements must be true?
(I) There is only one student who scored higher than 23.5 marks.
(II) The range of the marks of the 30 students is 17.5.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (D). As there can be more than one student who scored 40, statement
(I) may not be true. Statement (II) is not true since the range is 40 − 6 = 34, not
17.5 (which is 23.5 − 6).
7. Professor X conducted a test for his class of 16 students, and tabulated the following
five-number summary for the test scores:
Minimum Q1 Median Q3 Maximum
41.20 45.00 50.75 54.12 58.90
Two days later, he discovered, to his horror, that he had made a mistake in the
computation of the test scores, and everyone should get 10 marks more.
The new (and correct) median score is (1) and the IQR is (2) .
Fill in the blanks for the statement above, giving your answers correct to 2 decimal
places.
Since all the scores are shifted by the same amount of 10, the new median score
is shifted by 10 as well to 50.75 + 10 = 60.75. The IQR remained unchanged at
54.12 − 45.00 = 9.12, as Q1 and Q3 are shifted by the same amount.
8. Consider the following data set, which we will refer to as set A:

{15, 23, 13, 17, 8, 42, 4, 37, 12, 16}.

A student decided to do a check for outliers, after which such value(s) was/were
removed. Let us designate the set of remaining data points as set B. Which of the
following statements is/are true? Select all that apply.
(A) The range of B is 19.
(B) The median of B is lower than the median of A.

(C) The median of B is greater than the mean of B.
(D) The median of B is lower than the mean of A.
Only (B) and (D) are true. Re-arranging set A in ascending order, we get

{4, 8, 12, 13, 15, 16, 17, 23, 37, 42}.

As such, Q3 is 23, Q1 is 12, and IQR is 11 (which makes 1.5 × IQR = 16.5). Only the
value 42 qualifies as an outlier, since 42 > 23 + 16.5, which gives us the set B as

{4, 8, 12, 13, 15, 16, 17, 23, 37}.

Hence the range of B is 37 − 4 = 33 and thus (A) is incorrect. The median of A is
15.5 while the median of B is 15; hence (B) is correct. The mean of A is 18.7 while
the mean of B is 16.11 (to 2 d.p.); hence (C) is incorrect and (D) is correct.
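The outlier check and the summary statistics above can be reproduced with a minimal Python sketch. It assumes the same quartile convention as the solution (Q1 and Q3 are the medians of the lower and upper halves of the sorted data) and the 1.5 × IQR rule; the helper name `quartiles` is illustrative.

```python
from statistics import mean, median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (an assumption)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])

set_a = [15, 23, 13, 17, 8, 42, 4, 37, 12, 16]
q1, q3 = quartiles(set_a)
fence = 1.5 * (q3 - q1)

# Keep only the values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
set_b = [v for v in set_a if q1 - fence <= v <= q3 + fence]

print("Q1, Q3:", q1, q3)
print("range of B:", max(set_b) - min(set_b))
print("medians:", median(set_a), median(set_b))
print("means:", round(mean(set_a), 2), round(mean(set_b), 2))
```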
9. The following histogram is constructed using 100 observations of a discrete numerical
variable X. For the first bin, [0, 1], both 0 and 1 are included in the bin. For every
other bin, the left endpoint is excluded while the right endpoint is included.

Based on the histogram, which of the following statements is/are definitely true?
Select all that apply.
(A) The distribution is right-skewed.
(B) The maximum value is 8.
(C) The value that occurs the most often in this data set is in the bin (1, 2].
(D) Only a quarter of the observations is larger than 3.
(A) and (D) are correct. The distribution is right-skewed, hence (A) is true. (B) is
false as the maximum value could be any value between 7 and 8 (excluding 7). (C) is
false. Although the histogram shows that the range (1, 2] has the highest frequency,
it does not mean that the most frequent value in the data set must be in the range
(1, 2]. As an example, there could be 16 values at 1.3, 16 values at 1.4 and 24 values
at 2.5, making 2.5 the most frequent value, which is not between 1 and 2. (D) is true
as there are exactly 25 (14 + 8 + 2 + 1 = 25) values greater than 3.
10. The following two diagrams are adapted from a paper published in Nature titled
“Irregular sleep/wake patterns are associated with poorer academic performance and
delayed circadian and sleep/wake timing”, which studied a group of students. The
diagrams describe the association between the variables Grade Point Average (GPA),
Sleep Regularity Index (SRI) and Actual Dim Light Melatonin Onset (Actual DLMO).
Actual DLMO was not recorded for participants that have neither regular nor irregular
sleep/wake patterns.

Based only on the two diagrams above, which of the following is necessarily true?
(A) If the researchers collected information about the average household income
amongst the participants and found a positive association between average household
income and Grade Point Average, then they may conclude that average
household income and SRI are positively associated amongst the participants.
(B) The predicted Actual DLMO value for a student who has neither regular nor
irregular sleep/wake patterns is less than 24.
(C) A higher SRI value is associated with a lower Actual DLMO value for students
who have regular sleep/wake patterns.
(D) Given a student’s GPA, the researchers should use the equation of the regression
line of GPA against SRI to predict SRI.
Answer is (C). (A) may not be true since associations are not necessarily transitive.
Since Actual DLMO was not recorded for participants who have neither regular nor
irregular sleep/wake patterns, (B) is incorrect. (C) is correct since we can see from
the 2nd diagram that the best fit line for that group will have a negative slope. (D)
is incorrect because we would need the regression line of SRI against Grade Point
Average to predict SRI.
11. You’ve been helping a friend generate some nice-looking figures in Radiant. Unfortunately,
you lost track of which data sets were being used for which histograms and
boxplots. You don’t want to make them all over again (though it would be trivial if
you saved your script). Which boxplot (1-4) goes with which histogram (A-D)?

(A) 1C, 2A, 3D, 4B
(B) 1C, 2B, 3D, 4A
(C) 1D, 2A, 3C, 4B
(D) 1B, 2C, 3D, 4A
Answer is (B).
Based on the boxplot, distribution 1 clearly has the highest median, equal to about
the 3rd quartile for the other three. This corresponds pretty clearly to histogram C.
This leaves only answers (A) and (B). Histograms A and B show roughly symmetric
distributions, but with a wider IQR for B. Among the two remaining boxplots, 2 has
the wider IQR compared to 4 so that the remaining pairs must be 2B and 4A.
12. Suppose that the following are 10 data points for a numerical variable X:

16, 82, 72, 100, r, 22, 83, 62, −2, 99,

where r is an unknown integer less than 72. It is known that there is only one
outlier in this data set. An outlier is defined as a data point having a value greater
than Q3 + 1.5*IQR or less than Q1 – 1.5*IQR. What is the maximum possible value
of r?
Answer is -85.
Since r is less than 72, the largest 5 numbers in the data set can be arranged as 72,
82, 83, 99, 100. Q3 is therefore 83. Excluding r, the other 4 numbers are arranged as
-2, 16, 22 and 62. The Q1 value will change depending on the value of r. If r is less
than 16, the Q1 will be 16. If r is greater than 22, the Q1 will be 22. If r is between
16 and 22, then r will be the Q1.
Case 1: Suppose r is less than 16. IQR = 83 – 16 = 67. Values greater than 83 +
1.5*67 = 183.5 or less than 16 – 1.5*67 = -84.5 will be considered as outliers. Since
the greatest value is 100 which is not an outlier, r has to be less than -84.5 for the
data set to have an outlier. Therefore, the maximum possible value of r is -85 in this
case.
Case 2: Suppose r is greater than 22 but less than 72. IQR = 83 – 22 = 61. Values
greater than 83 + 1.5*61 = 174.5 or less than 22 – 1.5*61 = -69.5 will be considered
as outliers. However, this would mean there is no outlier in the data set, since the
minimum value is -2 and the maximum value is 100. Hence this case is not possible.
Case 3: Suppose r is between 16 and 22 (inclusive). Note that Q3 + 1.5*IQR will
not be less than the value in case 2, i.e. 174.5, since the IQR is longer or of the same
length. On the other hand, the Q1 - 1.5*IQR will not be greater than the value in
case 2, i.e. -69.5, because of the same reason. This also implies that there is no outlier
in the data set, since the minimum value is -2 and the maximum value is 100. Hence
this case is also not possible.
Therefore, r has to be less than 16, and the maximum value of r is -85.
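The case analysis above can be cross-checked by brute force. The Python sketch below assumes the same quartile convention as the solution (medians of the lower and upper halves of the sorted data) and searches downward from 71 for the first integer r that produces exactly one outlier. The lower search limit of −1000 and the helper name `outliers` are arbitrary choices made for illustration.

```python
from statistics import median

def outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[len(data) - half:])
    fence = 1.5 * (q3 - q1)
    return [v for v in data if v < q1 - fence or v > q3 + fence]

known = [16, 82, 72, 100, 22, 83, 62, -2, 99]

# r must be an integer below 72; take the largest one giving exactly one outlier.
max_r = next(r for r in range(71, -1001, -1) if len(outliers(known + [r])) == 1)
print(max_r)
```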
13. 3000 students took a multiple-choice quiz in school. The quiz consisted of 10 questions.
Each student answered all 10 questions. For each question answered correctly, 1 mark
was awarded, and for each question answered wrongly, no marks were awarded. There
was no partial credit awarded. The average score was 5, and the standard deviation
of the scores was 2.
The number of correct answers and wrong answers for each student was plotted in a
scatter plot, with the number of correct answers represented on the horizontal axis
and the number of wrong answers on the vertical axis.
The correlation coefficient between the number of correct answers and number of
wrong answers is:
(A) 1.
(B) −1.
(C) 0.
(D) Unable to tell from the information provided.
Answer is (B). Sketch a scatter plot for some values of correct and wrong answers,
with the number of correct answers represented on the horizontal axis and the number
of wrong answers on the vertical axis. Since the standard deviation of the scores is 2,
not all the students will have exactly the same number of right and wrong answers, so
the scatter plot will not have a congregation of all points at one spot.

You can see from the scatter plot sketch that all points will lie on a straight line with
negative gradient so the value of r is −1. Alternatively, observe that the number
of correct and wrong answers follow a deterministic linear relationship given by the
equation

number of wrong answers = 10 − number of correct answers.

This is because if a question is not answered correctly, then it is answered wrongly
and vice versa.
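The deterministic relationship above can be illustrated numerically. The Python sketch below uses randomly generated scores, which are hypothetical and do not reproduce the stated mean of 5 and standard deviation of 2; the point is only that any set of scores with some spread gives a correlation of −1. The function `pearson_r` is a plain implementation of the usual correlation formula, written out for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the hypothetical scores are reproducible
correct = [random.randint(0, 10) for _ in range(3000)]
wrong = [10 - c for c in correct]  # wrong answers = 10 - correct answers

r = pearson_r(correct, wrong)
print(round(r, 6))
```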
14. Of the four values below, which would be that of a correlation coefficient with the
strongest correlation?
(A) −1.4.
(B) −0.9.
(C) 0.3.
(D) 0.7.
Answer is (B). A correlation coefficient always lies between −1 and 1 (inclusive), so
−1.4 is not a valid value. The higher the magnitude of the correlation coefficient, the
stronger the correlation.
15. What will happen to the correlation coefficient between X and Y if a point with
coordinates (80, 110) is added to the scatter plot shown below?

(A) It will increase.
(B) It will decrease.
(C) It will remain the same.
Answer is (A). The correlation coefficient between X and Y from the scatter plot
above is very close to −1. Adding a point with coordinates (80, 110) will decrease the
strength of the correlation. Thus, the correlation coefficient will increase.
16. A system for marking students’ R computer programs, called markeR, has been used
successfully at a university. markeR takes into account both program correctness and
program style when marking students’ assignments.
To evaluate its effectiveness, markeR was used to grade the R assignments of a class
of 40 students. These scores, which range from 10.5 to 19, were then compared to the
scores given by the instructor of the class. The results are summarised below.

Variable               Sample mean   Sample standard deviation
markeR score (x)       16.5          1.5
Instructor score (y)   14.5          2.25
The sample correlation between y and x is 0.85. A least squares regression line is used
to predict the average instructor score from the markeR score. We are given that the
regression line passes through the point (16.5, 14.5).
(Fill in the blank.) When the markeR score is 15, the predicted average instructor
score is ____ (rounded to 2 decimal places).
The answer is 12.59. Suppose the regression line is given by y = a + bx. Then
$$b = r \frac{s_y}{s_x} = 0.85 \times \frac{2.25}{1.5} = 1.275.$$
Since the line passes through (16.5, 14.5), we have

14.5 = a + 1.275 × 16.5,

which gives a = −6.5375. When x = 15, the predicted (average) value for y is
−6.5375 + 1.275 × 15 = 12.5875. Rounded to 2 decimal places, the answer is 12.59.
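The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation in the solution; the variable names are illustrative.

```python
r = 0.85                      # sample correlation between y and x
s_x, s_y = 1.5, 2.25          # sample standard deviations
mean_x, mean_y = 16.5, 14.5   # the regression line passes through this point

slope = r * s_y / s_x                 # b = r * s_y / s_x
intercept = mean_y - slope * mean_x   # a = mean_y - b * mean_x
predicted = intercept + slope * 15    # predicted average instructor score at x = 15

print(round(slope, 3), round(intercept, 4), round(predicted, 2))
```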
17. Below is a scatter plot showing preliminary exam and final exam scores for students
in a secondary school along with the linear regression line.

The average scores for the preliminary exam and final exam were both 60, with standard
deviations of 5.1 and 6.6 respectively. What does the slope of 0.98 of the linear
regression line predict?
(A) The increase in average final exam scores, corresponding to an increase of 1 mark
in the preliminary exam.
(B) The correlation between the final and preliminary exam scores.
(C) The average final exam score of students who scored 0 on the preliminary exam.
(D) None of the other options.

Answer is (A). The gradient in a linear regression equation gives the difference in
average Y values for two groups who differ by one unit in the X value. It is not
the correlation coefficient between the final and preliminary exam scores as visually,
the correlation coefficient should not be so close to 1. In fact, since the slope equals
r × (s_y / s_x), we have r = 0.98 × 5.1 / 6.6, which is about 0.76. In
general, the correlation coefficient is not equal to the slope of the regression line. 0.98
is also not the average final exam score for students who scored 0 in the preliminary
exam because that is the Y-intercept of the regression line, which is theoretically the
predicted average Y value when the X value is equal to 0.
18. The scatter plot below shows the relationship between height and shoulder girth
(circumference of shoulders measured over deltoid muscles).

The equation of the regression line for height vs shoulder girth is given by y = 0.6x +
106, where y refers to the height and x refers to shoulder girth. Which of the following
statements below is/are correct? Select all that apply.
(A) If we were to predict average shoulder girth from height using simple linear
regression, the gradient of the regression line is also positive.
(B) Using simple linear regression, when the shoulder girth is equal to 141cm, the
predicted average height is 190.6cm.
(C) Using simple linear regression, when the height of the individual is 170cm, the
predicted average shoulder girth is 106.67cm.
(D) If the shoulder girth of all individuals above are 2cm shorter, then the gradient
of the regression line for height vs shoulder girth is 0.6.
(A) and (D) are correct. Interchanging x and y does not change the correlation
coefficient, and so the gradient is also positive. As 141cm is outside the range of the
x variable, in general, we cannot use the equation above to predict average height.
Similarly, we have remarked in Chapter 3 that a regression line for using x to predict
y cannot be used to predict x using y. So we cannot use the equation above to predict
average shoulder girth from height. Recall that adding/subtracting a constant to/from
x does not change the standard deviation for x, nor the correlation coefficient. As the
gradient and correlation coefficient are related by
$$m = r \frac{s_y}{s_x},$$
we see that m does not change.

19. A researcher examined the relationship between variables X and Y among 250 male
and female subjects. He graphed the relationship in the scatter plot shown below. Let
r be the correlation coefficient for all 250 subjects, r1 be the correlation coefficient
among male subjects only and r2 be the correlation coefficient among female subjects
only.

Which of the following correctly describes the relationship between r, r1 and r2?
(A) r1 < r < r2.
(B) r1 > r > r2.
(C) r > r1 > r2.
(D) r < r1 < r2.
Answer is (A). The correlation coefficient for all subjects is closer to zero when
compared to either r1 or r2. The correlation coefficient for males only is negative, while
the correlation coefficient for females only is positive.
20. A researcher is interested in the correlation between the amount of time an individual
spends on social media and the individual’s level of happiness. Suppose that she
observed that the correlation coefficient r1 for males only is 0.8, and that the correlation
coefficient r2 for females only is also 0.8. Which of the following statements
must be true for r, the correlation coefficient when the data for males and females are
combined?
(A) 0 ≤ r ≤ 0.8.
(B) r = 0.8.
(C) 0.8 < r ≤ 1.
(D) None of the other given options is correct.
Answer is (D). It is possible that the correlation coefficient in the combined data set
is negative (see example below), so none of the other three options is correct.

21. Based on the scatter plot shown below, which of the following is closest to the equation
for the regression line? Here, W is the weight of the car and C is the consumption.

(A) W = 3 − 0.1C.
(B) W = 5 − 0.1C.
(C) W = 3 + 0.8C.
(D) W = 5 + 0.8C.
Answer is (B). The regression line should pass through the cloud of points in the
scatter diagram. So its slope should be negative. Also, from the scatter plot, its
y-intercept is more likely to be 5 than 3. Hence W = 5 − 0.1C is the correct answer.
22. Which of the following is/are true about a non-zero correlation coefficient? Select all that apply.
(A) The correlation coefficient does not change when we add 5 to all the values of
one variable.
(B) The correlation coefficient is positive when the slope of the regression line is
positive.
(C) The correlation coefficient does not change when we multiply all the values of
one variable by 2.
(D) A correlation of −0.3 is stronger than a correlation of −0.8.
(A), (B) and (C) are correct. The correlation coefficient r does not change when we
add a constant to all the values of one variable, or multiply all the values of one
variable by a positive number. Since $m = r \frac{s_y}{s_x}$ and standard deviations
are positive, r and m will have the same sign. Only (D) is incorrect, as a correlation
of −0.8 is stronger (since it is closer to −1) than a correlation of −0.3.
23. The relationship between the number of glasses of beer consumed daily (x) and blood
alcohol content in percentage (y) was studied in young adults. The equation of the
regression line is y = −0.015 + 0.02x for 1 ≤ x ≤ 10. The legal limit to drive in
Singapore is having a blood alcohol content below 0.08%. Des, a young adult, had
just finished 5 glasses of beer. After that, he wanted to take his car out for a drive. Is
it legal for him to drive in Singapore?
(A) Yes.
(B) No.
(C) Unable to determine.
Answer is (C). The regression line only provides the predicted average blood alcohol
content for someone who drank 5 glasses of beer, which is 0.085%. Although the value
is in the illegal range, Des’ blood alcohol content may have been below average, and
not have hit 0.08%.
24. Three father-son pairs had their heights measured. The following table shows their
heights:
Pair Father (inches) Son (inches)
A 68 72
B 70 71
C 66 70
Using these three data points, the standard deviation for the fathers would be 2 and
for the sons it would be 1. From the table, what is the standard unit for the son from
pair A?
(A) −1.
(B) 0.
(C) 1.
(D) 1.88.
Answer is (C). Since the sons’ average height is (72 + 71 + 70) / 3 = 71, the standard
unit for the son from pair A is
$$SU = \frac{72 - 71}{1} = 1.$$
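The standard deviations quoted in the question and the standard unit above can be checked with Python's `statistics` module. The sketch assumes the sample standard deviation (dividing by n − 1), which is the convention that reproduces the stated values of 2 and 1; the variable names are illustrative.

```python
from statistics import mean, stdev

fathers = [68, 70, 66]  # pairs A, B, C
sons = [72, 71, 70]

# stdev() is the sample standard deviation (divides by n - 1).
sd_fathers = stdev(fathers)
sd_sons = stdev(sons)

# Standard unit for the son from pair A (first entry in the list).
su_son_a = (sons[0] - mean(sons)) / sd_sons

print(sd_fathers, sd_sons, su_son_a)
```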

25. Suppose that there are 40 male students in a class and each student scored 5 marks
fewer for his maths test than what he scored for his science test. What can we say
about their maths and science test marks? Select all that apply.
(A) The interquartile range of science test marks is higher than that for maths test
marks.
(B) If student A scored a higher mark for the maths test than student B, then he
must have scored a higher mark than student B for the science test.
(C) The science test marks and maths test marks are perfectly negatively correlated.
(D) The standard deviation of maths test marks is equal to that of science test marks.
(B) and (D) are correct. Since quartile 1 and quartile 3 of the maths test marks
decrease by the same amount (5 marks) as compared to quartile 1 and quartile 3 of
the science test marks, there is no difference in the interquartile ranges of the maths
and science test marks.
As standard deviation does not change when we subtract or add a number to every
data point in a data set, the standard deviations of the maths and science test marks
are equal.
Suppose that we let x and y denote the maths and science test marks of the students,
respectively. Then we see that y = x + 5, that is, there is a perfect positive correlation
between the science and maths test marks. In addition, y increases as x increases.
Thus, if student A scored a higher mark for maths than student B, then he must have
scored a higher mark than student B for the science test as well.
26. The regression line for Y vs X is given by Y = 0.82X + 59.1. The standard deviations
for X and Y are 1.5 and 2.2 respectively. Suppose now we construct a regression line
that uses Y to predict X.
The predicted average increase of X when Y is increased by 1 unit is .
(Give your answer correct to 2 decimal places.)
Answer is 0.38. Let m = 0.82 be the gradient of the regression line for Y vs X. The
correlation coefficient can be determined by
$$r = \frac{m s_X}{s_Y} = \frac{0.82 \times 1.5}{2.2}.$$
We find that r = 0.559. The gradient of the regression line for X vs Y is given by
$$m' = \frac{r s_X}{s_Y} = \frac{0.559 \times 1.5}{2.2} = 0.38.$$
It should be noted that if we simply rearrange Y = 0.82X + 59.1 to
$X = \frac{Y - 59.1}{0.82}$, we obtain $\frac{1}{0.82} = 1.22$ as the gradient, which is
different from 0.38. This shows that in general, we cannot use the regression line for
Y vs X to predict X as a function of Y.
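The two gradients compared above can be computed side by side in Python. This is a sketch of the same arithmetic under the relation gradient = r × (SD of the response) / (SD of the predictor); the variable names are illustrative.

```python
slope_y_on_x = 0.82   # gradient of the regression line for Y vs X
s_x, s_y = 1.5, 2.2   # standard deviations of X and Y

r = slope_y_on_x * s_x / s_y       # recover the correlation coefficient
slope_x_on_y = r * s_x / s_y       # gradient when Y is used to predict X
naive_inverse = 1 / slope_y_on_x   # gradient from simply rearranging the equation

print(round(r, 3), round(slope_x_on_y, 2), round(naive_inverse, 2))
```

The two gradients would agree only if r were exactly 1 or −1, since their product is r squared.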

27. A professor wants to know the percentage of right-handed students in NUS. Since he is
teaching a course in NUS this semester, he decides to do a survey in his class. From the
single survey, he concluded that eighty percent of students in NUS are right-handed.
Which one of the following fallacies was committed by the professor?
(A) Atomistic fallacy.
(B) Ecological fallacy.
(C) None of the other options.

Answer is (C). Atomistic fallacy occurs when a person generalises the correlation
about individuals towards ecological correlation. Ecological fallacy occurs when a
person deduces the inferences on correlation about individuals based on ecological
correlation. In this question, the professor merely generalises the result of the class to
the entire NUS. Note that neither a correlation nor an ecological correlation was
computed by the professor. Hence, neither fallacy was committed.
28. The total number of people who are infected by a disease (denoted by y) can be
predicted using the regression model $y = 2^{x+1} - 1$, where x is the number of days from
the first infection, up till the 30th day. Based on the information above, which of the
following is true?
(A) After 3 days from the first infection, there will be exactly 15 people infected.
(B) If there were 7 people infected, it means that exactly 2 days have passed from
the first infection.
(C) After exactly 20 days, there will be approximately less than 2 million people
infected.
(D) The relationship can be modelled as a simple linear regression Y = mX + c,
where Y = y, $X = 2^x$, m = 2, and c = −1.
Answer is (D).
(A) is incorrect as the regression line only predicts and does not guarantee 15 are
infected. For (B), like the reasoning given for (A), the y values in a regression equation
are only predicted numbers. In addition, the regression line was modelled using x to
predict y and so the equation may not be the same when we model a line using y
to predict x. (C) is incorrect as the predicted number of people infected after exactly
20 days is $2^{21} - 1 = 2097151$, which is more than 2 million. (D) is correct as
$y = 2^{x+1} - 1 = 2(2^x) - 1$.
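The predicted values quoted above, and the linear form in option (D), can be verified with a short Python sketch. The function name `predicted_infected` is illustrative, and the model is read as y = 2^(x+1) − 1, which matches the figures 15, 7 and 2097151 in the question and solution.

```python
def predicted_infected(x):
    """Predicted total infections x days after the first infection."""
    return 2 ** (x + 1) - 1

for day in (2, 3, 20):
    print(day, predicted_infected(day))

# Option (D): with X = 2**x, the model is the straight line Y = 2*X - 1.
linear_form_matches = all(
    predicted_infected(x) == 2 * (2 ** x) - 1 for x in range(0, 31)
)
print(linear_form_matches)
```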
29. Bivariate numerical data can be represented in the form (x, y). Which of these 4 data
sets, after having added an additional data point (2, 8), would have the magnitude of
their correlation coefficient decrease as a result? Select all that apply.
(A) (2, 2), (8, 2), (8, 8)
(B) (2, 2), (4, 5), (6, 2)
(C) (2, 2), (5, 5), (8, 8)
(D) (2, 8), (5, 5), (8, 2)
(A) and (C) are correct.
(A) is correct because the addition of (2, 8) cancels the existing positive trend in
the data set: the four points then form the corners of a square, so the correlation
becomes 0. (B) is incorrect because the original three points are symmetrical about
the vertical line x = 4 and display zero correlation, and the addition of (2, 8) breaks
the symmetry and moves the magnitude of the correlation away from 0. (C) is correct
because (2, 8) does not lie on the line through the original points, so its addition
will cause the magnitude of the existing perfect correlation to decrease. (D) is
incorrect because (2, 8) already lies on the line through the original points, so its
addition will not cause the existing perfect correlation to change.
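The four cases above can be checked numerically. The Python sketch below computes the correlation coefficient before and after adding (2, 8) to each data set; `pearson_r` is a plain implementation of the usual formula written for illustration, and the small tolerance guards against floating-point noise when the magnitude is unchanged.

```python
from math import sqrt

def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

data_sets = {
    "A": [(2, 2), (8, 2), (8, 8)],
    "B": [(2, 2), (4, 5), (6, 2)],
    "C": [(2, 2), (5, 5), (8, 8)],
    "D": [(2, 8), (5, 5), (8, 2)],
}

decreased = []
for label, points in data_sets.items():
    before = pearson_r(points)
    after = pearson_r(points + [(2, 8)])
    print(label, round(before, 3), "->", round(after, 3))
    if abs(after) < abs(before) - 1e-9:
        decreased.append(label)

print(decreased)
```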
30. "The relation between anxiety and BMI - is it all in our curves?" was published in
the journal Psychiatry Research in 2016. As stated in the abstract of that research
paper, "The relation between anxiety and excessive weight is unclear. The aims of the
present study were three-fold: First, we examined the association between anxiety and
Body Mass Index (BMI). Second, we examined this association separately for female
and male participants..."

The first result reported was: No linear correlation between anxiety scores and BMI
among all the participants was observed. If the researchers had not proceeded to
investigate the association between anxiety scores and BMI separately for female and
male participants, but concluded straightaway from their first result that "there is no
linear correlation between anxiety scores and BMI among the females and among the
males separately", what mistake would they have committed?
(A) Ecological fallacy
(B) Atomistic fallacy
(C) Confusing correlation and causation
(D) None of the other options is correct
Answer is (D).
Atomistic fallacy occurs when a person claims that the ecological correlation
(correlation between two sets of averages across certain subgroups) will be the same as the
correlation obtained for individuals. Ecological fallacy occurs when a person deduces
the inferences on correlation for individuals based on ecological correlation. We can
see that ecological correlation was neither mentioned nor relevant, so neither of these
fallacies was committed. In addition, the researchers did not indicate causation but
only correlation in the results. Hence, none of the other options is correct.
