Programming for Data Science Assignment-4
c. Which command would you use to determine the data types of the
columns in the women dataframe?
I would use the str() command, which lists each column together with its data
type.
d. What are the columns and their data types (use a screenshot, and then
state your answer)?
Using the str() command, we see that the columns are height and weight, and
both are numeric (num).
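As a minimal sketch of the check described above, using the women dataset that ships with base R (the variable name column_types is only illustrative):

```r
# Show the structure of the built-in women data frame:
# one line per column with its type and first few values.
str(women)

# The same information as a named character vector, one entry per column.
column_types <- sapply(women, class)
print(column_types)
```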
e. If you were to run a simple linear regression model, which would be the
X and which would be the Y variable. Explain your answer.
Let us assume that a person's weight is influenced by their height.
In a simple linear regression model that predicts a person's weight from
their height, the variables would be
X variable (Predictor): height (independent variable)
Y variable (Response): weight (dependent variable)
f. What would be the hypotheses for the simple linear regression?
The hypotheses are as follows:
Null Hypothesis (H0): There is no relationship between height and weight
(The slope of height is zero).
Alternative Hypothesis (H1): There is a relationship between height and weight
(The slope of height is not zero).
g. Run the simple linear regression and output the results to a dataframe
called women_model. Then use the summary() function to view the
output. Place a screenshot of the output here.
In the fitted equation, weight = -87.52 + 3.45 × height:
Intercept (-87.52): When the height is zero, the estimated weight would be
-87.52 pounds. While this value doesn't make practical sense (since zero height
is not possible), it's mathematically part of the model.
Slope (3.45): For each additional inch in height, the weight is expected to
increase by 3.45 pounds.
An R-squared value of 0.991 indicates that 99.1% of the variance in weight is
explained by height in this dataset. This is a very high value, suggesting that
height is an excellent linear predictor of weight for the women in this
dataset.
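The intercept, slope, and R-squared discussed above can be reproduced with a short sketch like the following. It assumes the built-in women dataset and the model name women_model from the question. The helper names intercept, slope, and r_squared are only illustrative.

```r
# Fit weight as a linear function of height on the built-in women data.
women_model <- lm(weight ~ height, data = women)

# Full regression output (coefficients, R-squared, p-values).
print(summary(women_model))

# Pull out the individual quantities discussed in the text.
intercept <- coef(women_model)[["(Intercept)"]]
slope     <- coef(women_model)[["height"]]
r_squared <- summary(women_model)$r.squared

cat("weight =", round(intercept, 2), "+", round(slope, 2), "* height\n")
cat("R-squared:", round(r_squared, 3), "\n")
```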
k. What is the p-value, and what does it tell you about the regression
model?
The p-value for the height variable in this model is 1.09e-14 (that is,
1.09 × 10^-14).
=> The p-value tests the null hypothesis that the coefficient of the predictor
variable (height) is zero.
=> This very low p-value strongly suggests that we can reject the null
hypothesis that the slope of height is zero.
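One way to read this p-value directly from the model, rather than from the printed summary, is sketched below. It refits the model on the built-in women dataset so the snippet stands alone. The name p_value is only illustrative.

```r
# Refit the model so this snippet is self-contained.
women_model <- lm(weight ~ height, data = women)

# The coefficient table has one row per term; the column "Pr(>|t|)"
# holds the p-value for the test that the coefficient is zero.
coef_table <- summary(women_model)$coefficients
p_value <- coef_table["height", "Pr(>|t|)"]

print(p_value)
```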
d. What would be the hypotheses for a linear regression model using the
girth and volume variables?
Hypotheses for Linear Regression Using Girth and Volume:
Slope (5.0659): For each additional inch of girth, the volume of the tree
increases by approximately 5.0659 cubic feet.
j. What is the p-value, and what does it tell you about the regression
model?
The p-value for the Girth variable in the regression model is < 2e-16, which is
extremely small (almost zero).
The low p-value tells us that Girth is a significant predictor of Volume in this
dataset: an association this strong would be very unlikely to arise by random
chance if the true slope were zero.
Because the relationship between girth and volume is highly statistically
significant, we reject the null hypothesis and conclude that there is a
statistically significant linear relationship between these variables.
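The slope and p-value above can be checked with a sketch like the one below. It assumes the data come from R's built-in trees dataset, which has Girth and Volume columns. The model name trees_model and the helper names are hypothetical, since the assignment's own name for this model is not shown here.

```r
# Assumption: the girth/volume data are R's built-in trees dataset.
# trees_model is a hypothetical name chosen for this sketch.
trees_model <- lm(Volume ~ Girth, data = trees)

print(summary(trees_model))

# Slope for Girth and the p-value for the test that it is zero.
girth_slope   <- coef(trees_model)[["Girth"]]
girth_p_value <- summary(trees_model)$coefficients["Girth", "Pr(>|t|)"]

cat("Slope for Girth:", round(girth_slope, 4), "\n")
```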
Question 3. Chi-square goodness of fit test
d. Create a new dataframe called star_model which will contain only the
rows in the my_starwars dataframe which are complete (no values of
NA). The variables to include in your model are species, gender, and
height. Use the mutate command to create a new column called
“stature” which uses the height column to create three categories
(“Short”, “Medium”, and “Tall”). Create the star_model dataframe and
document your work with screenshots.
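A possible sketch of this step is shown below. It makes two assumptions that are not given in the assignment text: that my_starwars is a copy of the starwars data from dplyr, and that the height cut-offs for the three categories (150 cm and 190 cm) are reasonable. Both are placeholders to adjust.

```r
library(dplyr)

# Assumption: my_starwars is a copy of dplyr's starwars data.
my_starwars <- dplyr::starwars

star_model <- my_starwars %>%
  select(species, gender, height) %>%   # keep only the model variables
  na.omit() %>%                         # keep only complete rows
  mutate(
    # Hypothetical cut-offs in cm; the assignment does not specify them.
    stature = case_when(
      height < 150 ~ "Short",
      height < 190 ~ "Medium",
      TRUE         ~ "Tall"
    )
  )

str(star_model)
```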
e. Run a chi-square test using the table function. The argument for the
table function will be the star_model dataframe.
Since the p-value (0.0001564) is well below the typical significance level
(0.05), we can reject the null hypothesis. This result indicates a statistically
significant association between the two variables. We can conclude that there is
strong evidence that these variables are not independent: the distribution of
one variable is related to the other.
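The decision rule used above can be illustrated with a small sketch. The data frame toy below contains made-up counts purely for illustration. It is not the star_model data and does not reproduce the p-value reported above.

```r
# Hypothetical data for illustration only (not the star_model data).
toy <- data.frame(
  gender  = rep(c("feminine", "masculine"), times = c(40, 40)),
  stature = c(rep(c("Short", "Tall"), times = c(30, 10)),
              rep(c("Short", "Tall"), times = c(15, 25)))
)

# Passing the data frame to table() cross-tabulates its columns.
toy_table <- table(toy)
print(toy_table)

# Chi-square test on the contingency table.
result <- chisq.test(toy_table)
print(result)

# Reject the null hypothesis when the p-value is below alpha.
alpha <- 0.05
reject_null <- result$p.value < alpha
```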
Question 5. Analysis of Variance (ANOVA)
a. Load the tidyverse library and show the structure of the
PlantGrowth dataframe with a screenshot. What are the names
and datatypes of the variables?
The variables are weight, which is numeric (num), and group, which is a
factor with 3 levels.
The treatment levels for the group variable are ‘ctrl’, ‘trt1’, and ‘trt2’.
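The structure and the treatment levels can be confirmed with a short sketch using the built-in PlantGrowth dataset. Only base R is needed for this check, so the tidyverse load is omitted here. The name group_levels is only illustrative.

```r
# Structure of the built-in PlantGrowth data frame.
str(PlantGrowth)

# The levels of the group factor are the treatment levels.
group_levels <- levels(PlantGrowth$group)
print(group_levels)
```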