Assignment5 - Fall 2024

Uploaded by

joy kawino

STAT 151 Assignment 5

Due date: refer to the Course Outline

Purposes
This assignment has two parts. The first part assesses your knowledge of the properties of
the distribution of the difference between two sample means, $\bar{X}_1 - \bar{X}_2$, and your ability to
conduct a two-sample t-test, a paired t-test, and a one-proportion z test (optional). The first
part also assesses your understanding of the chi-square goodness-of-fit test and the chi-
square independence test. The second part assesses your ability to use R commander to
conduct a two-sample t-test, a paired t-test, a chi-square goodness-of-fit test and a chi-
square independence test.

Instructions
For every assignment in this course, you are required to complete the questions or tasks in
Part A by hand. This means that to do any calculation or drawing, you will NOT use R
commander or any computer application. That is, you are meant to do the calculations
manually with a non-programmable scientific calculator and use a pen or pencil to draw
figures or build a distribution table on paper (or on an iPad/tablet). Then you will submit a
photo of your written solution using the appropriate submission box on the corresponding
Crowdmark submission page.

Before you complete Part B using R commander, you should read and practice the R
commander steps by following the related examples in the Demos and the Lab Manual,
which you can download via a link in the Course Content folder on mêskanâs.

Note: For all questions in this assignment that require you to use a two-sample t-test,
always use the non-pooled procedure (that is, the procedure that assumes population
standard deviations are not equal). Further, when using the non-pooled procedure, use the
following formula to calculate the degrees of freedom.

$$
df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}.
$$

Note: if your t-table does not include the correct df value, please round down to the next
available df value.
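
As a way to check hand arithmetic for the degrees-of-freedom formula above, here is a minimal Python sketch. The function name `welch_df` and the summary statistics are hypothetical and chosen only for illustration; Part A must still be completed by hand.

```python
import math


def welch_df(s1, n1, s2, n2):
    """Degrees of freedom for the non-pooled two-sample t procedure."""
    v1 = s1 ** 2 / n1  # s1^2 / n1
    v2 = s2 ** 2 / n2  # s2^2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Hypothetical summary statistics (not taken from the assignment).
df = welch_df(s1=3.0, n1=12, s2=5.0, n2=15)

# Round down to a whole number; a printed t-table may require rounding
# down further to the next df value that the table actually lists.
df_for_table = math.floor(df)
print(df, df_for_table)
```

When the two samples have equal sizes and equal standard deviations, the formula reduces to 2(n - 1), which is a quick sanity check on any implementation.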

In both Part A and Part B, show all your calculations and include units. For each hypothesis
test, provide a full write-up that includes your hypotheses, level of significance, statement
and discussion of the assumptions, test statistic, p-value, decision, and conclusion. For each
confidence interval, provide a full write-up that includes assumptions, the interval (lower
endpoint, upper endpoint) with units, interpretation of the interval, and, when indicated, a
statement explaining why the interval does or does not support the significance of an
alternative hypothesis.

Part A
1. Suppose $X_1$ is normally distributed with a mean of 4 and a standard deviation of 1 and
suppose $X_2$ is normally distributed with a mean of 5 and a standard deviation of 2.

(a) For independent samples of sizes 9 and 16, respectively, find the mean and
standard deviation of $\bar{X}_1 - \bar{X}_2$. (3 marks)

(b) Would your answers in (a) change if $X_1$ and/or $X_2$ were not normally distributed?
Explain your answer. (2 marks)
(c) Explain why $\bar{X}_1 - \bar{X}_2$ is normally distributed, even though both $n_1$ and $n_2$ are small.
(2 marks)
(d) Suppose two independent samples are randomly obtained from populations 1 and
2, with sample sizes $n_1 = 9$ and $n_2 = 16$. Compute the probability that either sample
average is at least 2 greater than the other. (6 marks)
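
The properties used in Question 1 can be sketched in Python as a check on hand work. The sketch assumes independent samples from normal populations, so that the difference of sample means is normal with mean μ1 - μ2 and standard deviation sqrt(σ1²/n1 + σ2²/n2). The function name is hypothetical; `NormalDist` comes from Python's standard `statistics` module.

```python
from math import sqrt
from statistics import NormalDist


def diff_of_means_distribution(mu1, sigma1, n1, mu2, sigma2, n2):
    """Mean and standard deviation of (sample mean 1 - sample mean 2)."""
    mean = mu1 - mu2
    sd = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return mean, sd


# Values from Question 1: X1 ~ N(4, 1) with n1 = 9, X2 ~ N(5, 2) with n2 = 16.
mean_diff, sd_diff = diff_of_means_distribution(4, 1, 9, 5, 2, 16)

# "Either sample average is at least 2 greater than the other" means
# the difference is >= 2 or <= -2, so both tails are added.
dist = NormalDist(mean_diff, sd_diff)
p_either = (1 - dist.cdf(2)) + dist.cdf(-2)
print(mean_diff, sd_diff, p_either)
```

A hand solution would standardize each endpoint to a z-score and read the two tail areas from the normal table instead of calling `cdf`.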

2. 60 households in the Highlands neighbourhood of Edmonton were randomly sampled,
30 with garages and 30 without. Here are the summary results of assessed values of the
houses.

Garage   Sample size   Sample mean   Sample standard deviation
No       30            $227666.70    $164696.80
Yes      30            $437450.00    $118107.50

(a) Test if the mean assessed value is lower among the houses without garages.
Perform a full 6-step hypothesis testing procedure, including stating and discussing
assumptions. Use the 5% significance level. Let μ1 be the mean assessed value of
the houses without garages and μ2 be the mean assessed value of the houses with
garages. (8 marks)
(b) Obtain and interpret a two-sided 95% confidence interval for the mean difference
between assessed values of the two populations (no garage – yes garage) of
households. Note necessary assumptions. Is there evidence at the 5% significance
level that there is a difference in the mean assessed values of the “no garage” and
the “yes garage” groups? Why or why not? (8 marks)
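
The non-pooled test statistic and confidence interval for Question 2 can be sketched as follows, using the summary table above. This is only a check on hand arithmetic. The critical value `T_STAR` is an assumption: 2.009 is the usual table value for a two-sided 95% interval with df = 50, which is where the computed df would land after rounding down in a table that lists df = 50. Use the value from your own table.

```python
from math import sqrt


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Non-pooled two-sample t statistic and its standard error."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / se, se


# Summary statistics from the table: group 1 = no garage, group 2 = garage.
mean_no, s_no, n_no = 227666.70, 164696.80, 30
mean_yes, s_yes, n_yes = 437450.00, 118107.50, 30

t_stat, se = welch_t(mean_no, s_no, n_no, mean_yes, s_yes, n_yes)

# Assumed critical value (t-table, df = 50, two-sided 95%).
T_STAR = 2.009
diff = mean_no - mean_yes
ci = (diff - T_STAR * se, diff + T_STAR * se)
print(t_stat, ci)
```

The sign of the statistic follows the order of subtraction (no garage minus garage), which matches the definitions of μ1 and μ2 in part (a).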

3. The dataset GRADESINTROSTATS is a file that contains information from a random
sample of 31 students who took Statistics at MacEwan University in a recent term. Your
variables of interest are LCMT (lecture midterm grade) and LCFE (lecture final exam
grade). The data are given below; your instructor has calculated a column of differences
for you.

LCMT LCFE LCMT-LCFE
81.58 69.74 11.84
68.42 68.42 0
71.05 65.79 5.26
60.53 60.53 0
65.79 68.42 -2.63
55.26 61.84 -6.58
63.16 69.74 -6.58
84.21 50.00 34.21
89.47 88.16 1.31
81.58 73.68 7.9
92.11 78.95 13.16
86.84 84.21 2.63
97.37 81.58 15.79
68.42 63.16 5.26
68.42 68.42 0
52.63 65.79 -13.16
63.16 60.53 2.63
97.37 98.68 -1.31
68.42 60.53 7.89
78.95 84.21 -5.26
73.68 64.47 9.21
86.84 80.26 6.58
52.63 56.58 -3.95
50.00 71.05 -21.05
78.95 76.32 2.63
44.74 68.42 -23.68
73.68 76.32 -2.64
71.05 65.79 5.26
71.05 57.89 13.16
81.58 67.11 14.47
68.42 59.21 9.21

Note: the sample mean $\bar{x}_d = 2.630968$ and the sample standard deviation $s_d = 11.05528$.
The student may verify this with R or Excel or an online calculator, if desired, but it is not
necessary.
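
For readers who want to verify the two summary values quoted in the note, here is a minimal Python sketch using only the standard `statistics` module. The list is the LCMT-LCFE column copied from the table above; the variable names are illustrative.

```python
from statistics import mean, stdev

# LCMT - LCFE differences, in the order they appear in the table.
differences = [
    11.84, 0, 5.26, 0, -2.63, -6.58, -6.58, 34.21, 1.31, 7.9,
    13.16, 2.63, 15.79, 5.26, 0, -13.16, 2.63, -1.31, 7.89, -5.26,
    9.21, 6.58, -3.95, -21.05, 2.63, -23.68, -2.64, 5.26, 13.16, 14.47,
    9.21,
]

n = len(differences)
x_bar_d = mean(differences)  # sample mean of the differences
s_d = stdev(differences)     # sample standard deviation (n - 1 divisor)
print(n, x_bar_d, s_d)
```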

(a) Why are these data best regarded as a paired sample? (2 marks)
(b) Do the data provide evidence that on average, the two deliverables of lecture
midterm (LCMT) and lecture final exam (LCFE) give different results? Test at the
10% level of significance. Include all steps of your hypothesis test, and make sure
to justify your assumptions. (8 marks)
(c) Construct and interpret a 95% confidence interval for the mean difference in the
results of the two measurement methods. Include a discussion of the assumptions.
Indicate whether we have significant evidence to support the hypothesis that the mean
difference in the two grades differs from 0, and state why. (8 marks)

(d) The following plots display the histogram, normal probability plot and boxplot of
the sample distribution of paired differences (LCMT - LCFE). Do the plots provide
any clear indication against the assumption of a normally distributed population
of paired differences? Explain. (3 marks)
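
The paired t statistic and confidence interval asked for in parts (b) and (c) can be sketched in Python from the summary values given in the note. The function name is hypothetical, and the critical value 2.042 is an assumption taken from a standard t-table (df = 30, two-sided 95%); the p-value itself still has to be read from the table.

```python
from math import sqrt


def paired_t(x_bar_d, s_d, n, mu0=0.0):
    """Paired t statistic and standard error of the mean difference."""
    se = s_d / sqrt(n)
    return (x_bar_d - mu0) / se, se


# Summary values quoted in the question for LCMT - LCFE.
x_bar_d, s_d, n = 2.630968, 11.05528, 31

t_stat, se = paired_t(x_bar_d, s_d, n)
df = n - 1

# Assumed critical value (t-table, df = 30, two-sided 95%).
T_STAR_95 = 2.042
ci_95 = (x_bar_d - T_STAR_95 * se, x_bar_d + T_STAR_95 * se)
print(t_stat, df, ci_95)
```

Whether 0 lies inside the interval gives the same decision as the two-sided test only when the confidence level and significance level match (95% with 5%, or 90% with 10%).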

4. According to an online source, in 2021, 84.3% of students in Alberta elementary schools
are right-handed, 9.5% are left-handed, and 6.2% are ambidextrous (source: Census at
School Canada, 2020-2021). A researcher wishes to test if these percentages seem
reasonable among the general population of Albertans, so she obtains a random sample
of 500 Albertans, from which she obtains the following frequency distribution:

Handedness   Right-Handed   Left-Handed   Ambidextrous
Frequency    405            55            40

Does the above give evidence to suggest the proportions of right-handed, left-handed,
and/or ambidextrous Albertans differ from the proportions stated in the online source?
Include all steps of your hypothesis test, and make sure to justify your assumptions.
Test at the 1% significance level. (8 marks)
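
The goodness-of-fit arithmetic for Question 4 can be sketched as follows, as a check on hand work. The function name is hypothetical. The p-value line relies on a fact that holds only for 2 degrees of freedom: the upper-tail chi-square probability equals exp(-x/2). For other df values a table or software is needed.

```python
from math import exp


def chi_square_gof(observed, proportions):
    """Expected counts, cell contributions and chi-square statistic."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return expected, contributions, sum(contributions)


# Right-handed, left-handed, ambidextrous (from the question).
observed = [405, 55, 40]
proportions = [0.843, 0.095, 0.062]

expected, contributions, chi2 = chi_square_gof(observed, proportions)
df = len(observed) - 1

# Closed form valid for df = 2 only.
p_value = exp(-chi2 / 2)
print(expected, contributions, chi2, df, p_value)
```

Printing the expected counts also makes it easy to check the usual assumptions (all expected counts at least 1, at most 20% below 5).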

5. The data set HEARTFAILUREPREDICTION looks at several variables that play a role in
heart failure prediction for 918 people in the United States. (For the curious, this set of
open data can be found at https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction .)
Columns found in the dataset include Sex: sex of the patient [M: Male, F:
Female] and RestingECG: resting electrocardiogram results [Normal: Normal, ST:
having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of
> 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes'
criteria]. We wish to investigate the relationship between gender and resting ECG. The
following summarizes the joint frequency distribution.

                Resting ECG
Gender   LVH   Normal   ST    Total
F        47    118      28    193
M        141   434      150   725
Total    188   552      178   918

(a) At the 10% significance level, does the data provide sufficient evidence to conclude
that an association exists between gender and resting ECG? Include all steps of your
hypothesis test, and make sure to justify your assumptions. (8 marks)
(b) What do you conclude about how gender and resting ECG are associated by examining
the cell contributions to the test statistic? (2 marks)
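
The expected counts and cell contributions for Question 5 can be sketched in Python as a check on hand work. The function name is hypothetical; the table is the joint frequency distribution above without its totals.

```python
def chi_square_independence(table):
    """Expected counts, cell contributions, statistic and df for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = (row total * column total) / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    chi2 = sum(sum(row) for row in contributions)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, contributions, chi2, df


# Rows: F, M. Columns: LVH, Normal, ST.
observed = [
    [47, 118, 28],
    [141, 434, 150],
]

expected, contributions, chi2, df = chi_square_independence(observed)
largest = max(max(row) for row in contributions)
print(chi2, df, contributions)
```

For part (b), the cells with the largest contributions show where the observed counts depart most from what independence would predict.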

Part B
Finish the following questions using R and R commander. Make sure that you copy and
paste the computer outputs into the space below each question and write down your
answers in statements. Graphs should include titles and axis labels, as appropriate.

1.
(a) The dataset GRADESINTROSTATS is a file that contains information from a
random sample of 31 students who took Statistics at MacEwan University in a
recent term. Your variables of interest are A2 (grade on Assignment 2) and A1
(grade on Assignment 1). Use the proper analytical and graphical tools in R to test
whether there is significant evidence that, on average, the grade a student attains
on A2 is less than the grade a student attains on A1. Use the paired difference A2 –
A1 to solve this problem. Use the 1% significance level. Include all steps of your
hypothesis test, and make sure to justify your assumptions. (8 marks)

s1: simple random sample
s2: assume the paired differences are normally distributed
s3: since the sample size (31) is greater than 30, by the CLT the sampling distribution
of the mean difference is approximately normal.

Mean difference (A2 – A1) = -11.98742

Hypotheses: H0: μd = 0 vs Ha: μd < 0
Significance level: α = 0.01
Test statistic and df: t = -3.0165, df = 30
P-value: P(t < -3.0165) = 0.002586
Decision: since the p-value is less than α (0.002586 < 0.01), we reject H0.
Conclusion: at the 1% significance level, there is significant evidence that, on average,
the grade a student obtains on A2 is less than the grade obtained on A1.

(b) The dataset GRADESINTROSTATS is a file that contains information from a
random sample of 31 students who took Statistics at MacEwan University in a
recent term. Your variables of interest are LBFE (lab final exam grade) and LCFE
(lecture final exam grade). Use the proper analytical and graphical tools in R to
test whether there is significant evidence that, on average, the grade a student
attains on the lab final exam is greater than the grade a student attains on the
lecture final exam. Use the paired difference LBFE – LCFE to solve this problem.
Use the 1% significance level. Include all steps of your hypothesis test, and make
sure to justify your assumptions. (8 marks)

s1: simple random sample
s2: assume the paired differences are normally distributed
s3: since the sample size (31) is greater than 30, by the CLT the sampling distribution
of the mean difference is approximately normal.

Mean difference (LBFE – LCFE) = 11.02903

Hypotheses: H0: μd = 0 vs Ha: μd > 0
Significance level: α = 0.01
Test statistic and df: t = 4.2918, df = 30
P-value: P(t > 4.2918) = 0.00008515
Decision: since the p-value is less than α (0.00008515 < 0.01), we reject H0.
Conclusion: at the 1% significance level, there is sufficient evidence that, on average,
the grade a student obtains on the lab final is greater than the grade obtained on the
lecture final.

(c) The dataset GRADESINTROSTATS is a file that contains information from a
random sample of 31 students who took Statistics at MacEwan University in a
recent term. Your variables of interest are LCMT (lecture midterm exam grade)
and LCFE (lecture final exam grade). Use the proper analytical and graphical tools
in R to test whether there is significant evidence that, on average, the grade a
student attains on the lecture final exam differs from the grade a student attains
on the lecture midterm. Use the paired difference LCMT – LCFE to solve this
problem. Use the 10% significance level. Include all steps of your hypothesis test,
and make sure to justify your assumptions. (8 marks)

s1: simple random sample
s2: assume the paired differences are normally distributed
s3: since the sample size (31) is greater than 30, by the CLT the sampling distribution
of the mean difference is approximately normal.

Mean difference (LCMT – LCFE) = 2.630968

Hypotheses: H0: μd = 0 vs Ha: μd ≠ 0
Significance level: α = 0.10
Test statistic and df: t = 1.325, df = 30
P-value: 2P(t > 1.325) = 0.1952
Decision: since the p-value is greater than α (0.1952 > 0.10), we do not reject H0.
Conclusion: at the 10% significance level, there is not sufficient evidence that the grade
a student attains on the lecture final exam differs, on average, from the grade attained
on the lecture midterm.

(d) Construct and interpret a 90% confidence interval for the mean difference in the
grades of LCMT-LCFE using the data found in GRADESINTROSTATS. Discuss if
your assumptions hold. Make a decision about whether you have a significant
result and explain why. Your decision should match the answer to Part c. (8
marks)

s1: simple random sample
s2: assume the paired differences are normally distributed
s3: since the sample size (31) is greater than 30, by the CLT the sampling distribution
of the mean difference is approximately normal.

We are 90% confident that the true mean difference of LCMT – LCFE falls between
-0.7390912 and 6.0010267. Since 0 falls inside the interval, we do not have significant
evidence that the true mean difference of LCMT – LCFE differs from 0. This matches
part (c): the test statistic is 1.325 and the p-value is 0.1952, so we do not reject H0.

Question 2 pertains to two-mean inference based on independent samples. In practice,
there are two possible versions of this test: in one case, the “pooled” approach is used (if
the sample variances are “close” in value, we assume equal population variances); in the
other case the “unpooled” approach is used (if sample variances are not “close” in value,
we do not assume equal population variances). Some instructors teach the pooled
approach because the assumption of equal variances is necessary in other tests that
compare the means of several populations and, when the test is valid, the pooled
approach is slightly more powerful. However, other instructors choose not to teach the
pooled approach since the increase in statistical power is generally quite small, and it
requires an additional assumption that should only be made if we are very familiar
with the background population. Your lecture instructor will let you know what approach
you should use in the lecture. In the lab, we will solve all two independent sample
problems using the non-pooled approach (i.e. without the assumption of equal variances).
(R allows for either approach; the default is “unpooled”.)

Important Note: For two mean problems, R specifies the differences in
means in alphabetical order.

2.

(a) The data set HEARTFAILUREPREDICTION contains randomly collected data that
looks at several variables that play a role in heart failure prediction for 918 people
in the United States. (For the curious, this set of open data can be found at
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction .) The
variables of interest to us in this question are “cholesterol” (the serum cholesterol
of study participants, in mm/dl) and “sex” (the reported gender of study
participants). You may assume the males and females are independent samples.
Use the proper analytical tools in R to determine if there is significant evidence
that the average serum cholesterol of females differs from that of males. Use the
1% significance level. Include all steps of your hypothesis test, and make sure to
justify your assumptions. (8 marks)

assumptions: 1. simple random samples (independent) 2. assume normal distribution

s1: H0: μF – μM = 0 vs Ha: μF – μM ≠ 0
s2: α = 0.01
s3: test statistic: t = 7.237, df = 388.82
s4: p-value = 2.459e-12
s5: reject H0 since the p-value is less than α.
s6: at the 1% significance level, there is significant evidence that the average serum
cholesterol of females differs from that of males.

(b) Use the proper analytical tools in R to obtain a 99% confidence interval for the
difference in average serum cholesterol of female and male study participants. Are
the assumptions met for this confidence interval? (3 marks)

The 99% confidence interval for the difference in average serum cholesterol of female
and male study participants is (34.48351, 72.88407).

Interpretation: we are 99% confident that the difference in average serum cholesterol
between female and male study participants is between 34.48351 and 72.88407.

Yes. We assume simple random samples, and since the sample sizes are large there is no
need to assume that the populations are normally distributed.

(c) Explain and interpret the confidence interval obtained in part (b). Does the
interval provide evidence to indicate that the average serum cholesterol differs
between females and males? Does the interval support the conclusion of the
hypothesis test in part (a)? Justify your answer. (5 marks)

We are 99% confident that the difference in average serum cholesterol of female
and male study participants is between 34.48351 and 72.88407. Because this
99% confidence interval does not contain zero (0), there is evidence that the
average serum cholesterol differs between females and males. The hypothesis
test in part (a) rejects H0 at the 1% significance level, so the interval supports
the conclusion of that test.

(d) (Question is similar to lab quiz question). The dataset SUMMERSTUDENTS
contains information from a random sample of 44 students who took Statistics at
MacEwan in the summer term. Your columns (variable) of interest are
MUSICSTUDY (a column that records whether students listen to music while
studying for an exam (no, yes)) and YAGE (a column that records student age). For
education purposes assume that the two groups are two independent samples
from a much larger normal population of statistics students and that the two
populations have unequal variances. Use R to determine if, for a significance level
of 1%, there is significant evidence that the mean age of students in the yes group
is lower than the mean age of students in the no group.

Choose (indicate) the most correct (closest) answer. HINT. Be careful here. Recall
that when doing a 2 independent samples t problem, R will calculate the
numerator of the test statistic by subtracting the “Yes” sample mean from the “No”
sample mean. NOTE: (You will do a full write-up here on the assignment, but this
would not be necessary on a lab quiz) (8 marks)

assumptions: 1. simple random samples, independent: YES 2. normal populations
(assumed, as stated in the question): YES

S1: H0: μno – μyes = 0 vs Ha: μno – μyes > 0 (R computes the difference as No – Yes)
S2: α = 0.01
S3: t = 1.7883, df = 39.173
S4: p-value = 0.04073
S5: do not reject H0 since p-value > α
S6: at the 1% significance level, there is not sufficient evidence that the mean age of
students in the yes group is lower than the mean age of students in the no group.

Answers:
(i) Your test statistic is 1.7883, your pvalue is 0.04073, and you reject your null
hypothesis.
(ii) Your test statistic is 1.7883, your pvalue is 0.04073, and you fail to reject your
null hypothesis.
(iii) Your test statistic is 1.7883, your pvalue is 0.08146, and you reject your null
hypothesis.
(iv) Your test statistic is 1.7883, your pvalue is 0.08146, and you fail to reject your
null hypothesis.

(e) (Question is similar to lab quiz question). The dataset SUMMERSTUDENTS
contains information from a random sample of 44 students who took Statistics at
MacEwan in the summer term. Your columns (variable) of interest are YPOFF (a
column that records student willingness to serve in political office (no, yes)) and
YWKRPNEWS (a column that records weekly hours of student election news
consumption). For education purposes assume that the two groups are two
independent samples from a much larger normal population of statistics students
and that the two populations have unequal variances. Use R to determine if there
is significant evidence that the average weekly news hours of political
consumption are less for a student in the no group than for a student in the yes
group. Use a level of significance of 10%.

Choose the most correct (closest) answer. (8 marks)

assumptions: 1. simple random samples, independent: YES 2. normal populations
(assumed, as stated in the question): YES
S1: H0: μno – μyes = 0 vs Ha: μno – μyes < 0
S2: α = 0.10
S3: t = -3.6411, df = 12.446
S4: p-value = 0.001597
S5: reject H0 since p-value < α
S6: at the 10% significance level, there is sufficient evidence that the weekly
news hours of political consumption are less for a student in the no group
than for a student in the yes group.

NOTE: (You will do a full write-up here on the assignment, but this would not be
necessary on a lab quiz)

Answers:
(i) Your test statistic is -3.6411, your p-value is 0.001597 and you reject your null
hypothesis.
(ii) Your test statistic is -3.6411, your p-value is 0.003197 and you reject your null
hypothesis.
(iii) Your test statistic is -3.6411, your p-value is 0.001597, and you fail to reject your null
hypothesis.
(iv) Your test statistic is -3.6411, your p-value is 0.003197 and you fail to reject your
null hypothesis.

3. The dataset MARCH2021PROPERTYASSESSMENTEDMONTON contains information
from a sample of 3651 Edmonton Insight Community households in 2021. The variable
of interest is Own_Rent…Study..Profiling… (whether a respondent owns or rents their
home). 3578 participants answered the question, and we will consider only those 3578
participants in this question. In 2021, according to
https://www150.statcan.gc.ca/n1/daily-quotidien/220921/mc-b001-eng.htm , the
percent of Albertans that owned their own home was 70.9%, and the percent who
rented was 29.1%.

(a) Use R to create a frequency table to find the counts of owners and renters in the
Edmonton dataset (out of the 3578 who answered this question). (2 marks).

(b) Use R to conduct a goodness-of-fit test to test if the distribution of owners/renters
among Edmonton households in 2021 differs from the overall Alberta percentages
of owners/renters provided. Use the 1% significance level. Include all steps in
your write-up, making sure to discuss the assumptions (you will have to calculate
the expected frequencies to check them by hand). (8 marks)

Assumptions:
1. simple random sample? YES
2. all expected frequencies are at least 1? YES
3. at most 20% of the expected frequencies are less than 5? YES

S1: Ho: Pown=0.709 , Prent =0.291 vs Ha: at least one p differs from specified value

S2: a= 0.01

S3: χ² = 2429.6, df = 1

S4: p-value = P(χ² > 2429.6) < 2.2e-16

S5: reject Ho since p-value < a

S6: at the 1% significance level, there is evidence that at least one p differs from
specified value.

(c) In part (b), you performed statistical inference about owning/renting in
Edmonton households using a sample of 3578 participants in an Edmonton
Insight Community study. For the test to be valid, we needed to assume that this
was a random sample of Edmontonians. Do you think this assumption is valid?
Briefly explain why, or why not. You will find the website of the study
https://data.edmonton.ca/Surveys/March-2021-Mixed-Topic-Property-Assessment-Custome/cbwx-icpx
and the website of the Edmonton Insight Community at
https://www.edmonton.ca/programs_services/public_engagement/edmonton-insight-community
useful in answering this question. (3 marks)

The assumption is invalid because the sample is not randomly selected from the
population. The sample is self-selected, which introduces bias that can affect
the results.

4.

(a) Table 3 in the 2016 research paper “Footedness Is Associated with Self-reported
Sporting Performance and Motor Abilities in the General Population”, by Ulrich S.
Tran and Martin Voracek summarizes a study of handedness and footedness in
12720 people.
https://www.frontiersin.org/articles/10.3389/fpsyg.2016.01199/full . The
datafile TRANVORACEKHANDFOOT contains the variables of interest HAND (with
three possible outcomes RH, MH, LH for handedness) and FOOT (with three
possible outcomes RF, MF, LF for footedness). Do the data provide sufficient
evidence of an association between handedness and footedness? Test at the 1%
significance level. Include all six components of a hypothesis test, including a
discussion of the assumptions. (8 marks)

assumptions: 1. Simple random sample: YES 2. Large sample size (large expected cell counts): YES

S1: H0: the two variables are not associated vs. Ha: the two variables are
associated.
S2: Significance level: α = 0.01
S3: Test statistic value: 4219.20 with df = 4 (from R output)
S4: P-value < 2.2e-16 (from R output)
S5: Reject H0 since p-value < α.
S6: At the 1% significance level, there is evidence that the two variables are
associated.

(b) You will have obtained a very large test statistic in part (a). What do you conclude
about how handedness and footedness are associated by examining the cell
contributions to the test statistic? (3 marks)

The test statistic is large; therefore handedness and footedness are strongly
associated, meaning that those who are right-handed are more likely to be
right-footed and those who are left-handed are more likely to be left-footed.

Submission
Submit your work by accessing the Crowdmark email (or Crowdmark link on mêskanâs) to
submit Assignment 5. Please ensure that each picture is properly oriented and easy to read
(not fuzzy, not too small, and not taken in a dark room so that it is difficult to read).

All work must be submitted to Crowdmark by 6:00 PM on the due date.

Avoiding Plagiarism: If you submit an assignment, you are claiming it is your work. Do not
allow any part of your work to be copied by anyone else. Where two or more assignments
are found to be unreasonably similar, either in whole or in part, and no assistance has been
acknowledged, all parties involved are liable to a score of zero on the assignment. MacEwan
University’s academic policies are available at:
https://www.macewan.ca/contribute/groups/public/documents/policy/academic_integrity.pdf
