Programming for Data Science Assignment-4

The document outlines a series of statistical analyses using R, including simple linear regression, Chi-square tests, and ANOVA. It details the processes for loading data, running models, interpreting results, and drawing conclusions based on statistical significance. Key findings include strong relationships between height and weight, girth and volume, and the significance of group means in the PlantGrowth dataset.

Uploaded by

Faisal Mohammed

Question 1.

Simple Linear Regression


a. With a screenshot, show the commands to load the tidyverse library
and show the dataframes in the tidyverse.
b. In which units are weight and height measured?
Using ?women, we see in the R documentation that:
[,1] height numeric Height (in)
[,2] weight numeric Weight (lbs)

Therefore, height is measured in inches and weight is measured in pounds.

c. Which command would you use to determine the data types of the
columns in the women dataframe?
I would use the str() command, which lists each column together with its data type.

d. What are the columns and their data types (use a screenshot, and then
state your answer)?

Using the str() command, we see that the columns are height and weight, and
both are numeric (shown as num in the output).
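
As a quick check of this answer, the column types can also be listed programmatically. This is a minimal sketch using base R and the built-in women dataset; the variable name column_types is illustrative and not part of the assignment.

```r
# Compact overview: one line per column with its type and first values
str(women)

# Class of each column, returned as a named character vector
column_types <- sapply(women, class)
print(column_types)
```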

e. If you were to run a simple linear regression model, which would be the
X and which would be the Y variable. Explain your answer.
Let us assume that the weight of the person is influenced by height.
In a simple linear regression model where we predict person’s weight from
person’s height, our variables would be
X variable (Predictor): height (independent variable)
Y variable (Response): weight (dependent variable)
f. What would be the hypotheses for the simple linear regression?
The hypotheses are as follows:
Null Hypothesis (H0): There is no linear relationship between height and weight
(the slope of height is zero).
Alternative Hypothesis (H1): There is a linear relationship between height and weight
(the slope of height is not zero).

g. Run the simple linear regression and output the results to a dataframe
called women_model. Then use the summary() function to view the
output. Place a screenshot of the output here.

h. State the regression equation.


The general form of the model is:
Weight = β0 + β1 × Height
From the above summary, we get the equation as follows:
Weight = −87.52 + 3.45 × Height
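
The commands behind this equation can be sketched as follows. This is one plausible way to fit the model, assuming the built-in women dataset and the object name women_model requested in the question; model_coefs is an illustrative name.

```r
# Fit weight as a linear function of height
women_model <- lm(weight ~ height, data = women)

# Full regression output (coefficients, R-squared, p-values)
print(summary(women_model))

# Intercept and slope, rounded to two decimals as in the stated equation
model_coefs <- round(coef(women_model), 2)
print(model_coefs)
```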
i. Interpret your results for the coefficients of the equation.

In this equation:
Intercept (-87.52): When the height is zero, the estimated weight would be
-87.52 pounds. This value has no practical meaning (a height of zero is not
possible and lies far outside the observed data), but it is mathematically part
of the model.

Slope (3.45): For each additional inch in height, the weight is expected to
increase by 3.45 pounds.
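
The slope interpretation can be checked by predicting the weight at two heights one inch apart. This is a small sketch under the same assumptions as the model above; the heights 64 and 65 are arbitrary illustrative values.

```r
# Refit the model so this example is self-contained
women_model <- lm(weight ~ height, data = women)

# Predicted weights (lbs) at 64 and 65 inches
predicted <- predict(women_model, newdata = data.frame(height = c(64, 65)))
print(predicted)

# The difference between the two predictions equals the slope
one_inch_change <- unname(diff(predicted))
print(one_inch_change)
```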

j. What is the R-squared value, and what does it tell you?

An R-squared value of 0.991 indicates that 99.1% of the variance in weight is
explained by height in this dataset. This is a very high value, suggesting that
height is an excellent linear predictor of weight for the women in this
dataset.

k. What is the p-value, and what does it tell you about the regression
model?
The p-value for the height variable in this model is 1.09e-14 (that is, 1.09 × 10^-14).

=> The p-value tests the null hypothesis that the coefficient of the predictor
variable (height) is zero.

=> The p-value (1.09e-14) is extremely small, far below a typical significance
level such as 0.05.

=> We can therefore reject the null hypothesis that the slope of height is zero.
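
Both quantities can be read directly from the summary object rather than from the printed output. The sketch below assumes the same women model; the names model_summary, r_squared, and height_p are illustrative.

```r
women_model <- lm(weight ~ height, data = women)
model_summary <- summary(women_model)

# Proportion of variance in weight explained by height
r_squared <- model_summary$r.squared
print(round(r_squared, 3))

# p-value of the t-test on the height coefficient
height_p <- model_summary$coefficients["height", "Pr(>|t|)"]
print(signif(height_p, 3))

# Decision at the 0.05 significance level
print(height_p < 0.05)
```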

l. What is your conclusion?

Based on the results of the simple linear regression analysis:


1. The very low p-value (1.09e-14) indicates that there is a statistically
significant relationship between height and weight in the women dataset.
This suggests that height is a meaningful predictor of weight.
2. The R-squared value of 0.991 shows that 99.1% of the variation in weight can
be explained by height, indicating an excellent fit for the linear model.
3. The positive slope (3.45) implies that as height increases, weight also tends
to increase at a rate of approximately 3.45 pounds per inch.
Question 2. Simple Linear Regression
a. Load the tidyverse library and show the partial contents of the trees
dataframe by typing in the word trees in RStudio and pressing the run
icon. A screenshot is required.
b. Use the help() command to obtain information about the trees
dataframe. Provide a screenshot of the help() command and of the
Format and Source output.

c. In which units are the variables in the trees dataframe recorded?


Girth is recorded in inches
Height is recorded in feet
Volume is recorded in cubic feet

d. What would be the hypotheses for a linear regression model using the
girth and volume variables?
Hypotheses for Linear Regression Using Girth and Volume:

- Null Hypothesis (H₀): There is no relationship between girth and volume of trees
- Alternative Hypothesis (H₁): There is a relationship between girth and volume

e. Which would be the X, and which would be the Y variables? Explain
your answer.
In a simple linear regression, we use the predictor (Girth) to explain changes in
the response variable (Volume), therefore:
X (Independent Variable): Girth.
Y (Dependent Variable): Volume.
f. Run the simple linear regression model and store the output in a
dataframe called tree_model. Include a screenshot of your command to
create the regression and show the summary command to display the
output. Use a screenshot to display the commands and the output.
g. State the regression equation.
From the above output, the regression equation is:
Volume = −36.9435 + 5.0659 × Girth

h. Interpret your results for the coefficients of the equation.


Intercept (−36.9435): This is the expected volume when the girth is zero,
although it may not be meaningful in context since a girth of zero is unrealistic.

Slope (5.0659): For each additional inch of girth, the volume of the tree
increases by approximately 5.0659 cubic feet.

i. What is the R-squared value, and what does it tell you?


The R-squared value from the regression output is 0.9353, meaning that about
93.5% of the variance in volume is explained by girth. A value this close to 1
tells us that there is a strong linear relationship between girth and volume,
suggesting that girth is a good predictor of volume in this dataset.

j. What is the p-value, and what does it tell you about the regression
model?
The p-value for the Girth variable in the regression model is < 2e-16, which is
extremely small (almost zero).
Because this is far below 0.05, we reject the null hypothesis that the slope is
zero. Girth is a statistically significant predictor of Volume in this dataset,
and an association this strong is very unlikely to be due to random chance.
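
The equation, R-squared value, and p-value reported for the trees data can be reproduced with a few lines of base R. This is a sketch assuming the built-in trees dataset and the object name tree_model from the question; the other variable names are illustrative.

```r
# Fit Volume as a linear function of Girth
tree_model <- lm(Volume ~ Girth, data = trees)
tree_summary <- summary(tree_model)

# Intercept and slope of the regression equation
tree_coefs <- coef(tree_model)
print(round(tree_coefs, 4))

# Proportion of variance in Volume explained by Girth
tree_r_squared <- tree_summary$r.squared
print(round(tree_r_squared, 4))

# p-value of the t-test on the Girth coefficient
girth_p <- tree_summary$coefficients["Girth", "Pr(>|t|)"]
print(girth_p < 2e-16)
```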

k. What is your conclusion?

The simple linear regression model indicates a strong, statistically significant
relationship between tree girth and tree volume. With an R-squared value of
0.9353, we conclude that 93.53% of the variance in tree volume can be
explained by girth, suggesting that girth is an effective predictor of volume.

The model's p-value is < 2e-16, indicating that the relationship between girth
and volume is highly statistically significant. Therefore, we reject the null
hypothesis, concluding that there is a meaningful linear relationship between
these variables.
Question 3. Chi-square goodness of fit test

a. What are the hypotheses for a Chi-square goodness of fit test?


Null Hypothesis (H₀): The observed frequencies of the categories (Small,
Medium, and Large) are consistent with the expected frequencies (here, equal
counts in each category).
Alternative Hypothesis (H₁): The observed frequencies differ from the
expected frequencies.

b. Using a single command (you will need to use pipes) create a
dataframe called tree_volume which will contain a new variable called
Size which will split the Volume variable into three categories: Small,
Medium, and Large. Include the new column as well as the Height and
Girth variables. Use a screenshot to document your work.
c. Show the contents of the tree_volume dataframe by typing tree_volume
and pressing the Run icon.
d. Create a table containing the size variable from the tree_volume
dataframe. Provide a screenshot of your work and the output.
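
Steps b to d can be sketched as below. The exact command used in the assignment is not shown, so this is an assumption: Volume is split into three equal-width bins with cut(), which yields category counts consistent with the test statistic reported later in this question. It requires the dplyr package (part of the tidyverse); size_table is an illustrative name.

```r
library(dplyr)

# Single piped command: add Size, then keep Size, Height, and Girth
tree_volume <- trees %>%
  mutate(Size = cut(Volume, breaks = 3,
                    labels = c("Small", "Medium", "Large"))) %>%
  select(Size, Height, Girth)

# Show the first rows of the new dataframe
print(head(tree_volume))

# Frequency table of the Size variable
size_table <- table(tree_volume$Size)
print(size_table)
```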

e. Run the command to perform the Chi-square goodness of fit test.
Document with a screenshot.
f. Interpret the result and state your conclusion.
The results of the Chi-square goodness of fit test are as follows:
- X-squared: 14
- Degrees of freedom (df): 2
- p-value: 0.0009119
Since the p-value (0.0009119) is less than the standard significance level of
0.05, we can reject the null hypothesis. This indicates that there is a statistically
significant difference between the observed and expected frequencies in the
Size categories (Small, Medium, Large). Therefore, we can conclude that
the distribution of tree volumes is not evenly spread across the three size
categories.
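
The reported statistic can be reproduced both by hand and with chisq.test(). This sketch assumes the Size categories come from three equal-width bins of Volume (the assignment's own binning is not shown) and uses only base R; all variable names are illustrative.

```r
# Observed counts per size category (assumed equal-width binning)
sizes <- cut(trees$Volume, breaks = 3,
             labels = c("Small", "Medium", "Large"))
observed <- table(sizes)

# Under H0 every category is equally likely: 31 trees / 3 categories
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-square statistic computed from its definition
manual_statistic <- sum((observed - expected)^2 / expected)
print(manual_statistic)

# Same test with the built-in function (equal probabilities by default)
gof_result <- chisq.test(observed)
print(gof_result)
```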
Question 4. Chi-square test of independence
a. Using a screenshot, show that you have the tidyverse library installed
and display some of the contents of the starwars dataframe.

b. What are the hypotheses for the Chi-square test of independence?


Null Hypothesis (H₀): There is no association between the categorical variables
(they are independent).
Alternative Hypothesis (H₁): There is an association between the categorical
variables (they are dependent).
c. There is a problem using the starwars dataset and that has to do with
the last column which is a variable called “films”. It is a list, and we
need to remove it. It is easier to create a new dataframe which we will
call my_starwars. Use a command to create the my_starwars dataframe
using the first eleven variables in starwars. Show your work with a
screenshot.

d. Create a new dataframe called star_model which will contain only the
rows in the my_starwars dataframe which are complete (no values of
NA). The variables to include in your model are species, gender, and
height. Use the mutate command to create a new column called
“stature” which uses the height column to create three categories
(“Short”, “Medium”, and “Tall”). Create the star_model dataframe and
document your work with screenshots.
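
Steps c and d might look like the sketch below. Several details are assumptions rather than the assignment's actual commands: it relies on a dplyr version (1.0 or later) in which the first eleven columns of starwars are ordinary vectors, it drops incomplete rows before selecting the three variables, and the equal-width cut points for stature are hypothetical.

```r
library(dplyr)

# Keep the first eleven variables, leaving out the list columns
my_starwars <- starwars %>%
  select(1:11)

# Complete rows only, three model variables, plus a stature category
star_model <- my_starwars %>%
  na.omit() %>%
  select(species, gender, height) %>%
  mutate(stature = cut(height, breaks = 3,
                       labels = c("Short", "Medium", "Tall")))

print(head(star_model))
```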
e. Run a chi-square test using the table function. The argument for the
table function will be the star_model dataframe.

f. Interpret the results.

X-squared value (χ²) = 123.41
Degrees of freedom (df) = 72
p-value = 0.0001564

Since the p-value (0.0001564) is well below the typical significance
level (0.05), we can reject the null hypothesis. This result indicates a statistically
significant association between the variables. We can conclude that there is
strong evidence that these variables are not independent: the distribution of
one variable is related to the other.
Question 5. Analysis of Variance (ANOVA)
a. Load the tidyverse library and show the structure of the
PlantGrowth dataframe with a screenshot. What are the names
and datatypes of the variables?

The variables are weight, which is numeric, and group, which is a factor
with three levels.

b. Show partial contents of the PlantGrowth dataframe with a
screenshot.
c. Use the help function to find the treatment levels for the group
variable. Show with a screenshot.

The treatment levels for the group variable are ‘ctrl’, ‘trt1’, and ‘trt2’.

d. What are the hypotheses for an ANOVA?


The hypotheses are:
Null hypothesis (H₀): The means of all groups are equal.
Alternative hypothesis (H₁): At least one group mean is different.

e. Write a command which will show the mean weights of the
various treatment groups (ctrl, trt1, and trt2). Show the results
with a screenshot.
f. Does it look like the group means are the same or are they
significantly different?
It appears that the mean weight of the ‘trt2’ group is higher than both
‘ctrl’ and ‘trt1’, and the ‘ctrl’ and ‘trt1’ groups have relatively similar
means.
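
One way to obtain the group means behind this observation is sketched below. It assumes the dplyr package and the built-in PlantGrowth dataset; group_means and mean_weight are illustrative names.

```r
library(dplyr)

# Mean weight within each treatment group
group_means <- PlantGrowth %>%
  group_by(group) %>%
  summarise(mean_weight = mean(weight))

print(group_means)
```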

g. Conduct the ANOVA and place your output in a dataframe called
my_ANOVA_output. Document with a screenshot.

h. Interpret your results.


We observe that the p-value is 0.0159, which is less than the typical alpha
level of 0.05. Therefore, we can reject the null hypothesis. This means that
there are significant differences in the mean weights between at least two of
the treatment groups (ctrl, trt1, and trt2). However, the ANOVA result alone
does not tell us which specific groups differ from each other.
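
The ANOVA described here can be run with base R as sketched below, using the object name my_ANOVA_output from the question; anova_table and anova_p are illustrative names.

```r
# One-way ANOVA of plant weight by treatment group
my_ANOVA_output <- aov(weight ~ group, data = PlantGrowth)
print(summary(my_ANOVA_output))

# Extract the p-value for the group effect from the ANOVA table
anova_table <- summary(my_ANOVA_output)[[1]]
anova_p <- anova_table[["Pr(>F)"]][1]
print(round(anova_p, 4))
```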
i. Conduct a TukeyHSD test to determine which means are
significantly different from each other. Save your results to a
dataframe called my_Tukey. Show your output with a screenshot.

j. What are your conclusions?


By observing the differences in means, confidence intervals and adjusted
p-values, we can conclude that:
=> There is a significant difference between the trt2 and trt1 groups
since p = 0.0120.
=> There are no significant differences between the ctrl and trt1
groups since p = 0.3909, and between the ctrl and trt2 groups since p
= 0.1980.
=> Therefore, trt2 has a significantly higher mean weight compared
to trt1, but no significant differences were found between the other
group pairs.
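
The pairwise comparisons above can be reproduced as follows. This is a base R sketch using the object name my_Tukey from the question; tukey_table and adjusted_p are illustrative names.

```r
# Tukey's Honest Significant Difference test on the one-way ANOVA
my_Tukey <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
print(my_Tukey)

# Adjusted p-values for each pair of groups
tukey_table <- my_Tukey$group
adjusted_p <- tukey_table[, "p adj"]
print(round(adjusted_p, 4))

# Pairs that differ significantly at the 0.05 level
print(names(adjusted_p)[adjusted_p < 0.05])
```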
