Assignment 5 - Fall 2024
Purposes
This assignment has two parts. The first part assesses your knowledge of the properties of
the distribution of the difference between two sample means, X̄1 − X̄2, and your ability to
conduct a two-sample t-test, a paired t-test, and a one-proportion z test (optional). The first
part also assesses your understanding of the chi-square goodness-of-fit test and the chi-
square independence test. The second part assesses your ability to use R commander to
conduct a two-sample t-test, a paired t-test, a chi-square goodness-of-fit test and a chi-
square independence test.
Instructions
For every assignment in this course, you are required to complete the questions or tasks in
Part A by hand. This means that to do any calculation or drawing, you will NOT use R
commander or any computer application. That is, you are meant to do the calculations
manually with a non-programmable scientific calculator and use a pen or pencil to draw
figures or build a distribution table on paper (or on an iPad/tablet). Then you will submit a
photo of your written solution using the appropriate submission box on the corresponding
Crowdmark submission page.
Before you complete Part B using R commander, you should read and practice the R
commander steps by following the related examples in the Demos and the Lab Manual,
which you can download via a link in the Course Content folder on mêskanâs.
Note: For all questions in this assignment that require you to use a two-sample t-test,
always use the non-pooled procedure (that is, the procedure that assumes population
standard deviations are not equal). Further, when using the non-pooled procedure, use the
following formula to calculate the degrees of freedom.
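\[
df = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}
          {\dfrac{1}{n_1 - 1}\left( \dfrac{s_1^2}{n_1} \right)^2 + \dfrac{1}{n_2 - 1}\left( \dfrac{s_2^2}{n_2} \right)^2}.
\]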
Note: if your t-table does not include the correct df value, please round down to the next
available df value.
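The degrees-of-freedom formula above can be checked numerically with a short Python sketch. The function name welch_df and the summary statistics passed to it are hypothetical and chosen only for illustration; they do not come from any question in this assignment.

```python
import math


def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the non-pooled two-sample t procedure."""
    v1 = s1 ** 2 / n1  # s1^2 / n1
    v2 = s2 ** 2 / n2  # s2^2 / n2
    numerator = (v1 + v2) ** 2
    denominator = v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1)
    return numerator / denominator


# Hypothetical sample standard deviations and sample sizes (illustration only).
df = welch_df(s1=2.0, n1=10, s2=3.0, n2=15)

# Dropping the fractional part rounds down; if the t-table still lacks that
# row, move to the next smaller df that the table does list.
df_for_table = math.floor(df)

print(round(df, 4), df_for_table)
```

Rounding down rather than up gives a slightly larger critical value, so the resulting test or interval errs on the conservative side.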
In both Part A and Part B, show all your calculations and include units. For each hypothesis
test, provide a full write-up that includes your hypotheses, level of significance, statement
and discussion of the assumptions, test statistic, p-value, decision, and conclusion. For each
confidence interval, provide a full write-up that includes assumptions, the interval (lower
endpoint, upper endpoint) with units, interpretation of the interval, and, when indicated, a
statement explaining why the interval does or does not support the significance of an
alternative hypothesis.
Part A
1. Suppose X1 is normally distributed with a mean of 4 and a standard deviation of 1, and
suppose X2 is normally distributed with a mean of 5 and a standard deviation of 2.
(a) For independent samples of sizes 9 and 16, respectively, find the mean and
standard deviation of X̄1 − X̄2. (3 marks)
(b) Would your answers in (a) change if X1 and/or X2 were not normally distributed?
Explain your answer. (2 marks)
(c) Explain why X̄1 − X̄2 is normally distributed, even though both n1 and n2 are small.
(2 marks)
(d) Suppose two independent samples are randomly obtained from populations 1 and
2, with sample sizes n1 = 9 and n2 = 16. Compute the probability that either sample
average is at least 2 greater than the other. (6 marks)
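Part A is to be completed by hand, but the arithmetic in Question 1 can be cross-checked afterwards. The sketch below is one possible check in Python, assuming the standard results that the mean of X̄1 − X̄2 is μ1 − μ2 and its variance is σ1²/n1 + σ2²/n2 for independent samples. The helper normal_cdf is a hypothetical name introduced here, built on the standard library's error function.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Population values and sample sizes stated in Question 1.
mu1, sigma1, n1 = 4.0, 1.0, 9
mu2, sigma2, n2 = 5.0, 2.0, 16

# Mean and standard deviation of the difference between the sample means.
mean_diff = mu1 - mu2
sd_diff = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)

# "Either sample average is at least 2 greater than the other" means
# the difference is >= 2 or <= -2.
p_upper = 1.0 - normal_cdf((2.0 - mean_diff) / sd_diff)
p_lower = normal_cdf((-2.0 - mean_diff) / sd_diff)
prob = p_upper + p_lower

print(mean_diff, round(sd_diff, 4), round(prob, 4))
```

A hand calculation with a z-table rounds the z-scores to two decimals, so it may differ from this output in the third or fourth decimal place.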
No 30 $227666.70 $164696.80
(a) Test if the mean assessed value is lower among the houses without garages.
Perform a full six-step hypothesis testing procedure, including stating and discussing
assumptions. Use the 5% significance level. Let μ1 be the mean assessed value of
the houses without garages and μ2 be the mean assessed value of the houses with
garages. (8 marks)
(b) Obtain and interpret a two-sided 95% confidence interval for the mean difference
between assessed values of the two populations (no garage – yes garage) of
households. Note necessary assumptions. Is there evidence at the 5% significance
level that there is a difference in the mean assessed values of the “no garage” and
the “yes garage” groups? Why or why not? (8 marks)
LCMT LCFE LCMT-LCFE
81.58 69.74 11.84
68.42 68.42 0
71.05 65.79 5.26
60.53 60.53 0
65.79 68.42 -2.63
55.26 61.84 -6.58
63.16 69.74 -6.58
84.21 50.00 34.21
89.47 88.16 1.31
81.58 73.68 7.9
92.11 78.95 13.16
86.84 84.21 2.63
97.37 81.58 15.79
68.42 63.16 5.26
68.42 68.42 0
52.63 65.79 -13.16
63.16 60.53 2.63
97.37 98.68 -1.31
68.42 60.53 7.89
78.95 84.21 -5.26
73.68 64.47 9.21
86.84 80.26 6.58
52.63 56.58 -3.95
50.00 71.05 -21.05
78.95 76.32 2.63
44.74 68.42 -23.68
73.68 76.32 -2.64
71.05 65.79 5.26
71.05 57.89 13.16
81.58 67.11 14.47
68.42 59.21 9.21
Note: the sample mean x̄d = 2.630968 and the sample standard deviation sd = 11.05528.
The student may verify this with R or Excel or an online calculator, if desired, but it is not
necessary.
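For anyone who does want to verify the note, the following Python sketch recomputes the summary statistics from the LCMT-LCFE column of the table above and forms the paired t statistic, t = x̄d / (sd / √n). The variable names are illustrative only.

```python
import math
import statistics

# Paired differences LCMT - LCFE, copied from the third column of the table.
differences = [
    11.84, 0, 5.26, 0, -2.63, -6.58, -6.58, 34.21, 1.31, 7.9,
    13.16, 2.63, 15.79, 5.26, 0, -13.16, 2.63, -1.31, 7.89, -5.26,
    9.21, 6.58, -3.95, -21.05, 2.63, -23.68, -2.64, 5.26, 13.16, 14.47,
    9.21,
]

n = len(differences)
mean_d = statistics.mean(differences)   # sample mean of the differences
sd_d = statistics.stdev(differences)    # sample standard deviation (n - 1 divisor)

# Paired t statistic for H0: mu_d = 0, with n - 1 degrees of freedom.
standard_error = sd_d / math.sqrt(n)
t_stat = mean_d / standard_error
df = n - 1

print(n, round(mean_d, 6), round(sd_d, 5), round(t_stat, 3), df)
```

The p-value still has to be read from a t-table with df = 30, since the standard library does not provide the t distribution.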
(a) Why are these data best regarded as a paired sample? (2 marks)
(b) Do the data provide evidence that on average, the two deliverables of lecture
midterm (LCMT) and lecture final exam (LCFE) give different results? Test at the
10% level of significance. Include all steps of your hypothesis test, and make sure
to justify your assumptions. (8 marks)
(c) Construct and interpret a 95% confidence interval for the mean difference in the
results of the two measurement methods. Include assumption discussion. Indicate
if we have significant evidence to support the hypothesis that the mean
difference in the two grades differs from 0, and state why. (8 marks)
(d) The following plots display the histogram, normal probability plot and boxplot of
the sample distribution of paired differences (LCMT - LCFE). Do the plots provide
any clear indication against the assumption of a normally distributed population
of paired differences? Explain. (3 marks)
Does the above give evidence to suggest the proportions of right-handed, left-handed,
and/or ambidextrous Albertans differ from the proportions stated in the online source?
Include all steps of your hypothesis test, and make sure to justify your assumptions.
Test at the 1% significance level. (8 marks)
5. The data set HEARTFAILUREPREDICTION looks at several variables that play a role in
heart failure prediction for 918 people in the United States. (For the curious, this set of
open data can be found at https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction .)
Columns found in the dataset include Sex: sex of the patient [M: Male, F:
Female] and RestingECG: resting electrocardiogram results [Normal: Normal, ST:
having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of
> 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes'
criteria]. We wish to investigate the relationship between gender and resting ECG. The
following summarizes the joint frequency distribution.
Resting ECG
F 47 118 28 193
Total 188 552 178 918
(a) At the 10% significance level, do the data provide sufficient evidence to conclude
that an association exists between gender and resting ECG? Include all steps of your
hypothesis test, and make sure to justify your assumptions. (8 marks)
(b) What do you conclude about how gender and resting ECG are associated by examining
the cell contributions to the test statistic? (2 marks)
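The expected counts and cell contributions for a chi-square independence test follow the pattern expected = (row total × column total) / n and contribution = (observed − expected)² / expected. The Python sketch below applies that pattern to the counts shown in the summary above. It rests on one assumption: the row for males is not listed, so it is derived here by subtracting the F row from the Total row. Columns are kept in the order they appear, without labels.

```python
# Counts from the summary table: the F row and the Total row (three
# resting-ECG categories, in the order shown).
female = [47, 118, 28]
column_totals = [188, 552, 178]

# ASSUMPTION: the remaining row (males) equals Total minus F.
male = [t - f for t, f in zip(column_totals, female)]

observed = [female, male]
row_totals = [sum(row) for row in observed]
n = sum(row_totals)

# Expected count for each cell: row total * column total / n.
expected = [[r * c / n for c in column_totals] for r in row_totals]

# Contribution of each cell to the chi-square statistic.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_sq = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(column_totals) - 1)

for row in contributions:
    print([round(c, 4) for c in row])
print(round(chi_sq, 4), df)
```

The cells with the largest contributions are the ones where the observed count departs most from what independence would predict, which is the information part (b) asks about.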
Part B
Finish the following questions using R and R commander. Make sure that you copy and
paste the computer outputs into the space below each question and write down your
answers in statements. Graphs should include titles and axis labels, as appropriate.
1.
(a) The dataset GRADESINTROSTATS is a file that contains information from a
random sample of 31 students who took Statistics at MacEwan University in a
recent term. Your variables of interest are A2 (grade on Assignment 2) and A1
(grade on Assignment 1). Use the proper analytical and graphical tools in R to test
whether there is significant evidence that, on average, the grade a student attains
on A2 is less than the grade a student attains on A1. Use the paired difference A2 –
A1 to solve this problem. Use the 1% significance level. Include all steps of your
hypothesis test, and make sure to justify your assumptions. (8 marks)
(b) The dataset GRADESINTROSTATS is a file that contains information from a
random sample of 31 students who took Statistics at MacEwan University in a
recent term. Your variables of interest are LBFE (lab final exam grade) and LCFE
(lecture final exam grade). Use the proper analytical and graphical tools in R to
test whether there is significant evidence that, on average, the grade a student
attains on the lab final exam is greater than the grade a student attains on the
lecture final exam. Use the paired difference LBFE – LCFE to solve this problem.
Use the 1% significance level. Include all steps of your hypothesis test, and make
sure to justify your assumptions. (8 marks)
Hypotheses: Ho: μd = 0 vs Ha: μd ≠ 0
Significance level: α = 0.10
T-test: t = 1.325, df = 30
P-value: p-value = 2P(t > 1.325) = 0.1952
Decision: Since 0.1952 is greater than 0.10, the p-value is greater than α, so we do not
reject Ho.
Conclusion: At the 10% significance level, there is not sufficient evidence that the grade a
student attains on the lecture final exam differs from the grade attained on the lecture
midterm.
(d) Construct and interpret a 90% confidence interval for the mean difference in the
grades of LCMT-LCFE using the data found in GRADESINTROSTATS. Discuss if
your assumptions hold. Make a decision about whether you have a significant
result and explain why. Your decision should match the answer to Part c. (8
marks)
We are 90% confident that the true mean difference LCMT − LCFE falls between
−0.7390912 and 6.0010267. Since 0 falls inside the interval, we do not have significant
evidence that the true mean difference LCMT − LCFE differs from 0. This matches part (c):
with a test statistic of 1.325 and a p-value of 0.1952, we do not reject Ho.
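The interval reported above can be reproduced from summary statistics with the usual formula x̄d ± t* · sd / √n. The Python sketch below assumes the Part A summary values (x̄d = 2.630968, sd = 11.05528, n = 31) describe the same data, which is consistent with the reported t = 1.325. The critical value t* = 1.697 is the standard t-table entry for df = 30 and an upper-tail area of 0.05, so the endpoints agree with software output only to about two decimals.

```python
import math

# Summary statistics for the paired differences LCMT - LCFE (from Part A).
n = 31
mean_d = 2.630968
sd_d = 11.05528

# ASSUMPTION: t-table critical value for df = 30, upper-tail area 0.05.
t_crit = 1.697

standard_error = sd_d / math.sqrt(n)
margin = t_crit * standard_error
lower, upper = mean_d - margin, mean_d + margin

# A two-sided test at the 10% level rejects H0: mu_d = 0 exactly when the
# 90% interval excludes 0.
excludes_zero = not (lower <= 0.0 <= upper)

print(round(lower, 3), round(upper, 3), excludes_zero)
```

Because the confidence level (90%) and the significance level (10%) add to 100%, the interval and the two-sided test must lead to the same decision.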
pooled approach since the increase in statistical power is generally quite small, and it
requires an additional assumption that should be made only if we are very familiar
with the background population. Your lecture instructor will let you know what approach
you should use in the lecture. In the lab, we will solve all two independent sample
problems using the non-pooled approach (i.e. without the assumption of equal variances).
R allows for either approach; the default is “unpooled”.)
2.
(a) The data set HEARTFAILUREPREDICTION contains randomly collected data that
looks at several variables that play a role in heart failure prediction for 918 people
in the United States. (For the curious, this set of open data can be found at
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction .) The
variables of interest to us in this question are “cholesterol” (the serum cholesterol
of study participants, in mg/dl) and “sex” (the reported gender of study
participants). You may assume the males and females are independent samples.
Use the proper analytical tools in R to determine if there is significant evidence
that the average serum cholesterol of females differs from that of males. Use the
1% significance level. Include all steps of your hypothesis test, and make sure to
justify your assumptions. (8 marks)
(b) Use the proper analytical tools in R to obtain a 99% confidence interval for the
difference in average serum cholesterol of female and male study participants. Are
the assumptions met for this confidence interval? (3 marks)
A 99% confidence interval for the difference in average serum cholesterol of female and
male study participants is (34.48351, 72.88407).
(c) Explain and interpret the confidence interval obtained in part (b). Does the
interval provide evidence to indicate that the average serum cholesterol differs
between females and males? Does the interval support the conclusion of the
hypothesis test in part (a)? Justify your answer. (5 marks)
We are 99% confident that the difference in average serum cholesterol of female
and male study participants is between 34.48351 and 72.88407. Because this
99% confidence interval does not contain zero (0), there is evidence that the
average serum cholesterol differs between females and males. The hypothesis
test in part (a) rejects Ho, so this interval supports the conclusion of that
test.
Choose (indicate) the most correct (closest) answer. HINT. Be careful here. Recall
that when doing a 2 independent samples t problem, R will calculate the
numerator of the test statistic by subtracting the “Yes” sample mean from the “No”
sample mean. NOTE: (You will do a full write-up here on the assignment, but this
would not be necessary on a lab quiz) (8 marks)
Answers:
(i) Your test statistic is 1.7883, your p-value is 0.04073, and you reject your null
hypothesis.
(ii) Your test statistic is 1.7883, your p-value is 0.04073, and you fail to reject your
null hypothesis.
(iii) Your test statistic is 1.7883, your p-value is 0.08146, and you reject your null
hypothesis.
(iv) Your test statistic is 1.7883, your p-value is 0.08146, and you fail to reject your
null hypothesis.
Assumptions: 1. Independent simple random samples. 2. Large sample sizes. YES
S1: Ho: μno − μyes = 0 vs Ha: μno − μyes < 0
S2: a=0.10
S3: t=-3.6411, df= 12.446
S4: pvalue = 0.001597
S5: reject Ho since p-value < a
S6: At the 1% significance level, there is sufficient evidence that the mean weekly
hours of political news consumption are lower for students in the No group
than for students in the Yes group.
NOTE: (You will do a full write-up here on the assignment, but this would not be
necessary on a lab quiz)
Answers:
(i) Your test statistic is -3.6411, your p-value is 0.001597 and you reject your null
hypothesis.
(ii) Your test statistic is -3.6411, your p-value is 0.003197 and you reject your null
hypothesis.
(iii) Your test statistic is -3.6411, your p-value is 0.001597, and you fail to reject your null
hypothesis.
(iv) Your test statistic is -3.6411, your p-value is 0.003197 and you fail to reject your
null hypothesis.
(a) Use R to create a frequency table to find the counts of owners and renters in the
Edmonton dataset (out of the 3578 who answered this question). (2 marks).
(b) Use R to conduct a goodness-of-fit test to test if the distribution of owners/renters
among Edmonton households in 2021 differs from the overall Alberta percentages
of owners/renters provided. Use the 1% significance level. Include all steps in
your write-up, making sure to discuss the assumptions (you will have to calculate
the expected frequencies to check them by hand). (8 marks)
Assumptions:
1. simple random sample? YES
2. all expected frequencies are at least 1? YES
3. at most 20% of the expected frequencies are less than 5? YES
S1: Ho: p_own = 0.709, p_rent = 0.291 vs Ha: at least one p differs from its specified value
S2: a= 0.01
S6: At the 1% significance level, there is evidence that at least one p differs from its
specified value.
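The expected frequencies needed for the assumption check come from multiplying the sample size by each hypothesized proportion. The Python sketch below does this for the 3578 respondents and the proportions stated in S1, then applies the two expected-frequency conditions listed above. The dictionary keys and variable names are illustrative only.

```python
# Sample size and hypothesized proportions taken from the question and S1.
n = 3578
null_proportions = {"own": 0.709, "rent": 0.291}

# Expected frequency for each category under Ho: n * p.
expected = {category: n * p for category, p in null_proportions.items()}

# Assumption 2: all expected frequencies are at least 1.
all_at_least_1 = all(e >= 1 for e in expected.values())

# Assumption 3: at most 20% of the expected frequencies are less than 5.
share_below_5 = sum(e < 5 for e in expected.values()) / len(expected)

assumptions_ok = all_at_least_1 and share_below_5 <= 0.20

for category, e in expected.items():
    print(category, round(e, 3))
print(assumptions_ok)
```

The observed counts from part (a) are then compared with these expected values through the sum of (observed − expected)² / expected, with df equal to the number of categories minus one.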
4.
(a) Table 3 in the 2016 research paper “Footedness Is Associated with Self-reported
Sporting Performance and Motor Abilities in the General Population”, by Ulrich S.
Tran and Martin Voracek summarizes a study of handedness and footedness in
12720 people.
https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01199/full . The
datafile TRANVORACEKHANDFOOT contains the variables of interest HAND (with
three possible outcomes RH, MH, LH for handedness) and FOOT (with three
possible outcomes RF, MF, LF for footedness). Do the data provide sufficient
evidence of an association between handedness and footedness? Test at the 1%
significance level. Include all six components of a hypothesis test, including a
discussion of the assumptions. (8 marks)
(b) You will have obtained a very large test statistic in part (a). What do you conclude
about how handedness and footedness are associated by examining the cell
contributions to the test statistic? (3 marks)
The test statistic is large; therefore, handedness and footedness are strongly
associated, meaning that those who are right-handed are more likely to be
right-footed and those who are left-handed are more likely to be left-footed.
Submission
Submit your work by accessing the Crowdmark email (or Crowdmark link on mêskanâs) to
submit Assignment 5. Please ensure that each picture is properly oriented and easy to read
(not fuzzy, not too small, and not taken in a dark room so that it is difficult to read).
Avoiding Plagiarism: If you submit an assignment, you are claiming it is your work. Do not
allow any part of your work to be copied by anyone else. Where two or more assignments
are found to be unreasonably similar, either in whole or in part, and no assistance has been
acknowledged, all parties involved are liable to a score of zero on the assignment. MacEwan
University’s academic policies are available at:
https://www.macewan.ca/contribute/groups/public/documents/policy/academic_integrity.pdf