MATH 1281 Written Assignment Unit 6

- The document compares two samples and performs linear regression analysis using data from a transfusion dataset.
- Boxplots and t-tests show that the distributions and means of the frequency and log-transformed frequency variables differ between the two samples.
- Linear regression finds a significant increasing linear relationship between the log-transformed frequency and the time variable. The confidence interval for the slope does not include zero.
- A scatterplot with regression line shows the points for frequency against the monetary variable exhibit a linear trend but are not all on the same line.

Uploaded by

Daniel Gay

Running Head: COMPARING TWO SAMPLES AND LINEAR REGRESSION

Comparing two samples and Linear Regression.

Written Assignment Unit 6

University of the People

MATH 1281

March 7, 2021


Comparing two samples

1. Apply the function "plot" to the formula that relates the response "frequency" to the
explanatory variable "march2007" in order to produce the two box-plots of the
response. Redo the plotting with "frequency" replaced by "log(frequency)". The
distribution of the variable "log(frequency)" is:
__ More symmetric, __ Less symmetric compared to the distribution of the variable
"frequency".
Mark the most appropriate option and attach the R code that produces the two
plots:

> transfusion <- read.csv("transfusion.csv")
> plot(frequency~march2007, data=transfusion)
> plot(log(frequency)~march2007, data=transfusion)

o The boxplots of frequency are skewed to the right, whereas those of log(frequency) are roughly symmetric. The distribution of log(frequency) is therefore more symmetric than that of frequency.
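
The effect of a log transform on symmetry can also be illustrated numerically. The sketch below does not use transfusion.csv: it relies on simulated right-skewed data, and both the data and the helper skewness() are assumptions introduced only for illustration.

```r
# Illustration only: simulated right-skewed data, NOT the transfusion data.
set.seed(1)
x <- rlnorm(1000, meanlog = 1, sdlog = 0.8)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)       # positive: long right tail
skew_log <- skewness(log(x))  # near zero: roughly symmetric

print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero indicates a roughly symmetric distribution, which is the pattern the boxplots of log(frequency) suggest.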

2. Mark the null hypotheses that you reject with a significance level of 5% and those that you do not reject:
(Reject/Don't Reject) H0: The expectation of "frequency" is the same in the two subsets,
(Reject/Don't Reject) H0: The expectation of "log(frequency)" is the same in the two subsets.
Explain your answer:



o Both hypotheses are rejected because both p-values are less than 0.05: the tests for the responses frequency and log(frequency) give p-values of 4.174e-06 and 2.472e-09, respectively.

> t.test(frequency~march2007,data=transfusion)

        Welch Two Sample t-test

data:  frequency by march2007
t = -4.7229, df = 216.86, p-value = 4.174e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.246285 -1.745712
sample estimates:
 mean in group no mean in group yes
         4.801754          7.797753

> t.test(log(frequency)~march2007,data=transfusion)

        Welch Two Sample t-test

data:  log(frequency) by march2007
t = -6.1571, df = 289.04, p-value = 2.472e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6316889 -0.3256582
sample estimates:
 mean in group no mean in group yes
         1.178089          1.656762
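
The reported p-value and confidence interval for frequency can be cross-checked from the printed t statistic, degrees of freedom and group means alone. The following is a minimal sketch that assumes only the rounded numbers shown in the output, so small rounding differences are expected; the variable names are chosen here for illustration.

```r
# Values copied from the t.test() output for frequency ~ march2007.
t_stat   <- -4.7229
df       <- 216.86
mean_no  <- 4.801754
mean_yes <- 7.797753

diff_means <- mean_no - mean_yes    # estimated difference in means
se_diff    <- diff_means / t_stat   # standard error implied by the t statistic

# Two-sided p-value from the t distribution with Welch's degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval: estimate +/- quantile * standard error.
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se_diff

print(p_value)
print(ci)
```

Because the whole interval lies below zero, it agrees with rejecting the null hypothesis of equal expectations at the 5% level.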

3. Mark the null hypotheses that you reject with a significance level of 5% and those
that you do not reject:
(Reject/Don't Reject) H0: The variance of "frequency" is the same in the two subsets,
(Reject/Don't Reject) H0: The variance of "log(frequency)" is the same in the two
subsets.
Explain your answer:

o As the output below shows, the p-value of the test for the response frequency is less than 2.2e-16, which is below 0.05, so the null hypothesis is rejected.

o The p-value of the test for the response log(frequency) is 0.6249, which is greater than 0.05, so the null hypothesis is not rejected.



> var.test(frequency~march2007,data=transfusion)

        F test to compare two variances

data:  frequency by march2007
F = 0.34883, num df = 569, denom df = 177, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2725525 0.4397267
sample estimates:
ratio of variances
         0.3488348

> var.test(log(frequency)~march2007,data=transfusion)

        F test to compare two variances

data:  log(frequency) by march2007
F = 0.9449, num df = 569, denom df = 177, p-value = 0.6249
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.738272 1.191102
sample estimates:
ratio of variances
         0.9449005
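
The F-test results can be reproduced from the printed variance ratios and degrees of freedom (569 and 177, i.e. groups of 570 and 178 observations). The sketch below assumes only those printed values; the helper functions p_two_sided() and ci_ratio() are hypothetical names introduced for illustration.

```r
# Values copied from the two var.test() outputs.
df_num <- 569
df_den <- 177
f_freq <- 0.3488348   # ratio of variances, frequency
f_log  <- 0.9449005   # ratio of variances, log(frequency)

# Two-sided p-value: twice the smaller tail probability of the F distribution.
p_two_sided <- function(f, df1, df2) {
  p_lower <- pf(f, df1, df2)
  2 * min(p_lower, 1 - p_lower)
}

# 95% confidence interval for the true ratio of variances.
ci_ratio <- function(f, df1, df2) {
  c(f / qf(0.975, df1, df2), f / qf(0.025, df1, df2))
}

p_freq  <- p_two_sided(f_freq, df_num, df_den)
p_log   <- p_two_sided(f_log, df_num, df_den)
ci_freq <- ci_ratio(f_freq, df_num, df_den)
ci_log  <- ci_ratio(f_log, df_num, df_den)

print(c(p_freq = p_freq, p_log = p_log))
print(rbind(frequency = ci_freq, log_frequency = ci_log))
```

The interval for log(frequency) contains 1, which matches the decision not to reject equality of variances, while the interval for frequency lies entirely below 1.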

Linear Regression:

Q4: Apply the function "plot" to the formula that relates the response "frequency" to the
explanatory variable "time" in order to produce the scatter plot. Add the regression line
to the plot. The variability of the variable "frequency", for larger values of the explanatory
variable, is:
__ Smaller, __ Larger, __ Constant.
Mark the most appropriate option and attach the R code that produces the two
plots:


> transfusion <- read.csv("transfusion.csv")
> plot(frequency~time, data=transfusion)
> abline(lm(frequency~time, data=transfusion))


[Figure: scatter plot of frequency (axis 0 to 50) against time (axis 0 to 100), produced by the code above] (The R Foundation, n.d.).



Q5: Mark the null hypotheses that you reject with a significance level of 5% and those
that you do not reject:
(Reject/Don't Reject) H0: The slope of "time" in the regression line of the response
"frequency" is equal to zero,
(Reject/Don't Reject) H0: The slope of "time" in the regression line of the response
"log(frequency)" is equal to zero.
Explain your answer:

o Both hypotheses are rejected because both p-values are less than 0.05: in the regressions of frequency and of log(frequency) on time, the p-value for the slope of time is less than 2.2e-16.

> summary(lm(frequency~time, data=transfusion))

Call:
lm(formula = frequency ~ time, data = transfusion)

Residuals:
    Min      1Q  Median      3Q     Max
-11.533  -2.255  -0.255   1.266  34.794

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.300523   0.284955   1.055    0.292
time        0.152096   0.006776  22.448   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.514 on 746 degrees of freedom
Multiple R-squared:  0.4031, Adjusted R-squared:  0.4023
F-statistic: 503.9 on 1 and 746 DF,  p-value: < 2.2e-16

> summary(lm(log(frequency)~time, data=transfusion))

Call:
lm(formula = log(frequency) ~ time, data = transfusion)

Residuals:
     Min       1Q   Median       3Q      Max
-2.33165 -0.48991  0.01058  0.48188  1.96348

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.394636   0.041382   9.536   <2e-16 ***
time        0.026176   0.000984  26.602   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6556 on 746 degrees of freedom
Multiple R-squared:  0.4868, Adjusted R-squared:  0.4861
F-statistic: 707.7 on 1 and 746 DF,  p-value: < 2.2e-16
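
In a regression with a single explanatory variable, the F statistic equals the square of the slope's t value, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below checks both relations using only the rounded values printed in the two summaries; the variable names are assumptions made for illustration.

```r
# Values copied from the two summary(lm(...)) outputs.
t_freq <- 22.448; f_freq <- 503.9; r2_freq <- 0.4031
t_log  <- 26.602; f_log  <- 707.7; r2_log  <- 0.4868
df_res <- 746   # residual degrees of freedom

# With one explanatory variable, F equals the squared t value of the slope.
f_from_t <- c(frequency = t_freq^2, log_frequency = t_log^2)

# R-squared recovered from F: R^2 = F / (F + residual df).
r2_from_f <- c(frequency     = f_freq / (f_freq + df_res),
               log_frequency = f_log / (f_log + df_res))

print(round(f_from_t, 1))
print(round(r2_from_f, 4))
```

This is why the p-value of the F statistic and the p-value of the slope lead to the same decision in each model.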

An Unpaired Design:

Q6: The 95%-confidence interval of slope of "time" in the regression line of the
response "log(frequency)" is:

Lower end = __0.02424411__, Upper end = __0.02810751__.

Attach the R code that produces the confidence interval:

> transfusion <- read.csv("transfusion.csv")
> confint(lm(log(frequency)~time, data=transfusion))
                 2.5 %     97.5 %
(Intercept) 0.31339717 0.47587556
time        0.02424411 0.02810751
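
The interval for the slope is the estimate plus or minus a t quantile times the standard error. As a rough check, the sketch below rebuilds it from the rounded estimate and standard error printed by summary() for the log(frequency) model, with 746 residual degrees of freedom; the variable names are chosen here for illustration, and agreement is expected only up to rounding.

```r
# Values copied from summary(lm(log(frequency) ~ time)).
estimate <- 0.026176   # slope of time
std_err  <- 0.000984   # standard error of the slope
df_res   <- 746        # residual degrees of freedom

# 97.5% quantile of the t distribution gives a two-sided 95% interval.
t_crit   <- qt(0.975, df_res)
ci_slope <- estimate + c(-1, 1) * t_crit * std_err

print(ci_slope)
```

Both ends of the interval are positive, so zero is not a plausible value for the slope at the 95% confidence level.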

Q7: The regression line between "time" as an explanatory variable and "log(frequency)"
as a response is:

__ Increasing, __ Decreasing, __ Constant.

Mark the most appropriate option and explain your answer:

o Running summary(lm(log(frequency)~time, data=transfusion)), as shown in question 5, gives a slope of 0.026176 for the regression line between the two variables. The regression line is increasing because the slope is positive.
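
Because the response is on the log scale, the slope can also be read as a multiplicative change in the fitted frequency. The sketch below assumes the fitted model log(frequency) = intercept + slope * time and uses the rounded slope from the output; the variable names are introduced for illustration.

```r
# Slope of time copied from summary(lm(log(frequency) ~ time)).
slope <- 0.026176

# Under the fitted model, one extra unit of time multiplies
# the fitted frequency by exp(slope).
growth_factor  <- exp(slope)
percent_change <- 100 * (growth_factor - 1)

print(growth_factor)
print(percent_change)
```

A growth factor above 1 corresponds to an increasing regression line, in agreement with the positive slope.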

The Relation Between Two Variables:

Q8: Apply the function "plot" to the formula that relates the response "frequency" to the
explanatory variable "monetary" in order to produce the scatter plot. Add the regression
line to the plot. The points in the scatter plot are:
__ All on the same line, __ Show a linear trend but are not on the same line, __ Don't
show a linear trend.
Mark the most appropriate option and attach the R code that produces the plot:

o A p-value below the significance level (alpha = 0.05) is considered statistically significant, indicating that a difference exists (Yakir, 2011). Since the p-value of 0.07721 is greater than 0.05, the null hypothesis is not rejected.

> transfusion <- read.csv("transfusion.csv")
> plot(frequency~monetary, data=transfusion)
> abline(lm(frequency~monetary, data=transfusion))



[Figure: scatter plot of frequency (axis 0 to 50) against monetary (axis 0 to 12000), produced by the code above] (The R Foundation, n.d.).



References

The R Foundation. (n.d.). The R project for statistical computing. http://www.r-project.org/

Yakir, B. (2011). Introduction to statistical thinking (with R, without Calculus): Testing hypothesis. The Hebrew University of Jerusalem, Department of Statistics, 203-226. https://my.uopeople.edu/pluginfile.php/1188709/mod_page/content/31/IntroStat.pdf
