Programming For Data Science Assignment-4

The document outlines a series of statistical analyses using R, including simple linear regression, Chi-square tests, and ANOVA. It details the processes for loading data, running models, interpreting results, and drawing conclusions based on statistical significance. Key findings include strong relationships between height and weight, girth and volume, and the significance of group means in the PlantGrowth dataset.

Uploaded by

Faisal Mohammed

Question 1. Simple Linear Regression

a. With a screenshot, show the commands to load the tidyverse library
and show the dataframes in the tidyverse.
b. In which units are weight and height measured?
Using ?women, we see in the R documentation that:
[,1] height numeric Height (in)
[,2] weight numeric Weight (lbs)

Therefore, height is measured in inches and weight is measured in pounds.

c. Which command would you use to determine the data types of the
columns in the women dataframe?
I would use the str() command, which lists each column together with its data type.

d. What are the columns and their data types (use a screenshot, and then
state your answer)?

Using the str() command, we see that the columns are height and weight, and
both are of type numeric (num).
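
The str() check described above can be sketched as follows. This is a minimal example that assumes only base R, since the women data set ships with the built-in datasets package; the name column_types is introduced here for illustration.

```r
# `women` is part of base R's datasets package, so no library() call is needed.
str(women)                             # compact view: column names and types

# Collect the class of every column in a named character vector.
column_types <- sapply(women, class)   # hypothetical helper name
print(column_types)
```

Both columns should be reported as numeric.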

e. If you were to run a simple linear regression model, which would be the
X and which would be the Y variable. Explain your answer.
Let us assume that a person's weight is influenced by their height.
In a simple linear regression model where we predict a person's weight from
their height, our variables would be
X variable (Predictor): height (independent variable)
Y variable (Response): weight (dependent variable)
f. What would be the hypotheses for the simple linear regression?
The hypotheses are as follows:
Null Hypothesis (H0 ): There is no relationship between height and weight
(The slope of height is zero).
Alternative Hypothesis (H1): There is a relationship between height and weight
(The slope of height is not zero).

g. Run the simple linear regression and output the results to a dataframe
called women_model. Then use the summary() function to view the
output. Place a screenshot of the output here.

h. State the regression equation.

Weight = β0 + β1 × Height
From the above summary, we get the equation as follows:
=> Weight = (−87.52) + (3.45) × Height
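
The coefficients in this equation can be reproduced with a short sketch like the one below. The original commands were in a screenshot, so this is a plausible reconstruction rather than the author's exact code; note that lm() returns a model object, not a dataframe. The name coefs is introduced for illustration.

```r
# Fit weight as a linear function of height on the built-in `women` data.
women_model <- lm(weight ~ height, data = women)

# Full regression output: coefficients, R-squared, p-values.
print(summary(women_model))

# Pull out the two coefficients and print the fitted equation.
coefs <- coef(women_model)
cat(sprintf("Weight = %.2f + %.2f * Height\n",
            coefs[["(Intercept)"]], coefs[["height"]]))
```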
i. Interpret your results for the coefficients of the equation.

In this equation:
Intercept (-87.52): When the height is zero, the estimated weight would be
-87.52 pounds. This value has no practical meaning (a height of zero is
impossible and lies far outside the observed data), but it is mathematically
part of the model.

Slope (3.45): For each additional inch in height, the weight is expected to
increase by 3.45 pounds.

j. What is the R-squared value, and what does it tell you?

An R-squared value of 0.991 indicates that 99.1% of the variance in weight is
explained by height in this dataset. This is a very high value, suggesting that
height is an excellent predictor of weight for the women in this dataset.

k. What is the p-value, and what does it tell you about the regression
model?
The p-value for the height variable in this model is 1.09e-14 (that is, 1.09 × 10^-14).

=> The p-value tests the null hypothesis that the coefficient of the predictor
variable (height) is zero. In our example:

=> The p-value (1.09e-14) is extremely small, much smaller than a typical
significance level like 0.05.

=> This low p-value means that we can reject the null hypothesis that the
slope of height is zero.
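
The R-squared and p-value quoted above can also be read directly from the summary object instead of the printed output. A minimal sketch, assuming the same model formula as in the question; the names model_summary, r_squared and slope_p are illustrative.

```r
# Refit the model so this block runs on its own.
women_model <- lm(weight ~ height, data = women)
model_summary <- summary(women_model)

# Proportion of variance in weight explained by height.
r_squared <- model_summary$r.squared

# p-value of the t-test on the height coefficient (H0: slope = 0).
slope_p <- coef(model_summary)["height", "Pr(>|t|)"]

cat("R-squared:", round(r_squared, 3), "\n")
cat("p-value for height:", signif(slope_p, 3), "\n")
```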

l. What is your conclusion?

Based on the results of the simple linear regression analysis:

1. The very low p-value (1.09e-14) indicates that there is a statistically
significant relationship between height and weight in the women dataset.
This suggests that height is a meaningful predictor of weight.
2. The R-squared value of 0.991 shows that 99.1% of the variation in weight can
be explained by height, indicating an excellent fit for the linear model.
3. The positive slope (3.45) implies that as height increases, weight also tends
to increase at a rate of approximately 3.45 pounds per inch.
Question 2. Simple Linear Regression
a. Load the tidyverse library and show the partial contents of the trees
dataframe by typing in the word trees in RStudio and pressing the run
icon. A screenshot is required.
b. Use the help() command to obtain information about the trees
dataframe. Provide a screenshot of the help() command and of the
Format and Source output.

c. In which units are the variables in the trees dataframe recorded?

Girth is recorded in inches
Height is recorded in feet
Volume is recorded in cubic feet

d. What would be the hypotheses for a linear regression model using the
girth and volume variables?
Hypotheses for Linear Regression Using Girth and Volume:

Null Hypothesis (H₀): There is no relationship between girth and volume of trees.
Alternative Hypothesis (H₁): There is a relationship between girth and volume.

e. Which would be the X, and which would be the Y variables? Explain
your answer.
In a simple linear regression, we use the predictor (Girth) to explain changes in
the response variable (Volume), therefore:
X (Independent Variable): Girth.
Y (Dependent Variable): Volume.
f. Run the simple linear regression model and store the output in a
dataframe called tree_model. Include a screenshot of your command to
create the regression and show the summary command to display the
output. Use a screenshot to display the commands and the output.
g. State the regression equation.
From the above output, the regression equation is:
Volume= (−36.9435) + (5.0659) × Girth
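
As a hedged reconstruction of the screenshot, the model behind this equation might be fitted as follows. The formula Volume ~ Girth follows from the X and Y choice stated above; the names tree_coefs and tree_r_squared are introduced for illustration.

```r
# Fit Volume as a linear function of Girth on the built-in `trees` data.
tree_model <- lm(Volume ~ Girth, data = trees)
print(summary(tree_model))

# Coefficients of the fitted line and the share of variance explained.
tree_coefs <- coef(tree_model)
tree_r_squared <- summary(tree_model)$r.squared

cat(sprintf("Volume = %.4f + %.4f * Girth\n",
            tree_coefs[["(Intercept)"]], tree_coefs[["Girth"]]))
```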

h. Interpret your results for the coefficients of the equation.

Intercept (−36.9435): This is the expected volume when the girth is zero,
although it may not be meaningful in context since a girth of zero is unrealistic.

Slope (5.0659): For each additional inch of girth, the volume of the tree
increases by approximately 5.0659 cubic feet.

i. What is the R-squared value, and what does it tell you?

The R-squared value from the regression output is 0.9353, so about 93.5% of the
variance in volume is explained by girth.
Such a high R-squared value (close to 1) tells us that there is a strong linear
relationship between girth and volume, suggesting that girth is a good predictor
of volume in this dataset.

j. What is the p-value, and what does it tell you about the regression
model?
The p-value for the Girth variable in the regression model is < 2e-16, which is
extremely small (almost zero).
If the true slope were zero, an association this strong would almost never arise
by chance alone. The low p-value therefore tells us that Girth is a statistically
significant predictor of Volume in this dataset.

k. What is your conclusion?

The simple linear regression model indicates a strong, statistically significant
relationship between tree girth and tree volume. With an R-squared value of
0.9353, we conclude that 93.53% of the variance in tree volume can be
explained by girth, suggesting that girth is an effective predictor of volume.

The model's p-value is < 2e-16, indicating that the relationship between girth
and volume is highly statistically significant. Therefore, we reject the null
hypothesis, concluding that there is a meaningful linear relationship between
these variables.
Question 3. Chi-square goodness of fit test

a. What are the hypotheses for a Chi-square goodness of fit test?

Null Hypothesis (H₀): The observed frequencies of the categories (Small,
Medium, and Large) are consistent with the expected frequencies (here, equal
counts in each category).
Alternative Hypothesis (H₁): The observed frequencies differ from the
expected frequencies.

b. Using a single command (you will need to use pipes) create a
dataframe called tree_volume which will contain a new variable called
Size which will split the Volume variable into three categories: Small,
Medium, and Large. Include the new column as well as the Height and
Girth variables. Use a screenshot to document your work.
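
One way the single piped command could look is sketched below. The original command was in a screenshot, so the binning rule is an assumption: cut() with breaks = 3 splits Volume into three equal-width intervals. The sketch also assumes the dplyr package (part of the tidyverse) is installed.

```r
library(dplyr)

# Assumption: three equal-width bins over the range of Volume.
tree_volume <- trees %>%
  mutate(Size = cut(Volume, breaks = 3,
                    labels = c("Small", "Medium", "Large"))) %>%
  select(Size, Height, Girth)

head(tree_volume)
```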
c. Show the contents of the tree_volume dataframe by typing tree_volume
and pressing the Run icon.
d. Create a table containing the size variable from the tree_volume
dataframe. Provide a screenshot of your work and the output.

e. Run the command to perform the Chi-square goodness of fit test.
Document with a screenshot.
f. Interpret the result and state your conclusion.
The results of the Chi-square goodness of fit test are as follows:
X-squared: 14
Degrees of freedom (df): 2
p-value: 0.0009119
Since the p-value (0.0009119) is less than the standard significance level of
0.05, we can reject the null hypothesis. This indicates that there is a statistically
significant difference between the observed and expected frequencies in the
Size categories (Small, Medium, Large). Therefore, we can conclude that
the distribution of tree volumes is not evenly spread across the three size
categories.
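
The test itself can be sketched in base R as shown here. The equal-width binning via cut() is an assumption, not taken from the original screenshot; under it the counts come out as 20 Small, 7 Medium and 4 Large, which reproduces the statistic reported above. The names size, size_counts and gof are illustrative.

```r
# Assumption: Size is defined by three equal-width bins of Volume.
size <- cut(trees$Volume, breaks = 3,
            labels = c("Small", "Medium", "Large"))
size_counts <- table(size)
print(size_counts)

# With no `p` argument, chisq.test() tests against equal expected
# proportions (31 / 3 trees per category).
gof <- chisq.test(size_counts)
print(gof)
```

With 31 trees the expected count per category is 31/3, so the statistic is (3/31) × (20² + 7² + 4²) − 31 = 14.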
Question 4. Chi-square test of independence
a. Using a screenshot, show that you have the tidyverse library installed
and display some of the contents of the starwars dataframe.

b. What are the hypotheses for the Chi-square test of independence?

Null Hypothesis (H₀): There is no association between the categorical variables
(they are independent).
Alternative Hypothesis (H₁): There is an association between the categorical
variables (they are dependent).
c. There is a problem using the starwars dataset and that has to do with
the last column which is a variable called “films”. It is a list, and we
need to remove it. It is easier to create a new dataframe which we will
call my_starwars. Use a command to create the my_starwars dataframe
using the first eleven variables in starwars. Show your work with a
screenshot.

d. Create a new dataframe called star_model which will contain only the
rows in the my_starwars dataframe which are complete (no values of
NA). The variables to include in your model are species, gender, and
height. Use the mutate command to create a new column called
“stature” which uses the height column to create three categories
(“Short”, “Medium”, and “Tall”). Create the star_model dataframe and
document your work with screenshots.
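
Steps c and d could be sketched as below. Several details are assumptions, because the original commands were in screenshots: the order of select() and na.omit(), the equal-width cut points for stature, and a dplyr version in which species is among the first eleven columns of starwars.

```r
library(dplyr)

# Step c: keep the first eleven columns, dropping the list columns.
my_starwars <- starwars %>% select(1:11)

# Step d: keep the three model variables, drop incomplete rows, then
# bin height into three categories (assumed equal-width bins).
star_model <- my_starwars %>%
  select(species, gender, height) %>%
  na.omit() %>%
  mutate(stature = cut(height, breaks = 3,
                       labels = c("Short", "Medium", "Tall")))

head(star_model)
```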
e. Run a chi-square test using the table function. The argument for the
table function will be the star_model dataframe.

f. Interpret the results.

X-squared value (χ²) = 123.41
Degrees of freedom (df) = 72
p-value = 0.0001564

Since the p-value (0.0001564) is well below the typical significance level
(0.05), we can reject the null hypothesis. This result indicates a statistically
significant association between the two variables. We can conclude that there is
strong evidence that these variables are not independent, that is, the
distribution of one variable depends on the level of the other.
Question 5. Analysis of Variance (ANOVA)
a. Load the tidyverse library and show the structure of the
PlantGrowth dataframe with a screenshot. What are the names
and datatypes of the variables?

The variables are weight, which is numeric (num), and group, which is a
factor with three levels.

b. Show partial contents of the PlantGrowth dataframe with a
screenshot.
c. Use the help function to find the treatment levels for the group
variable. Show with a screenshot.

The treatment levels for the group variable are ‘ctrl’, ‘trt1’, and ‘trt2’.

d. What are the hypotheses for an ANOVA?

The hypotheses are:
Null hypothesis (H₀): The means of all groups are equal.
Alternative hypothesis (H₁): At least one group mean is different.

e. Write a command which will show the mean weights of the
various treatment groups (ctrl, trt1, and trt2). Show the results
with a screenshot.
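
A minimal base-R sketch of such a command is given here; the original screenshot may well have used the tidyverse equivalent (group_by() followed by summarise()), so treat this as one possible form. The name group_means is illustrative.

```r
# Mean plant weight per treatment group in the built-in PlantGrowth data.
group_means <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
print(group_means)
```

The means come out as 5.032 for ctrl, 4.661 for trt1 and 5.526 for trt2.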
f. Does it look like the group means are the same or are they
significantly different?
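It appears that the mean weight of the ‘trt2’ group is higher than both
‘ctrl’ and ‘trt1’, and the ‘ctrl’ and ‘trt1’ groups have relatively similar
means. Whether these differences are statistically significant cannot be
judged from the means alone; that requires the ANOVA.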

g. Conduct the ANOVA and place your output in a dataframe called
my_ANOVA_output. Document with a screenshot.

h. Interpret your results.

We observe that the p-value is 0.0159, which is less than the typical alpha
level of 0.05. Therefore, we can reject the null hypothesis. This means that
there are significant differences in the mean weights between at least two of
the treatment groups (ctrl, trt1, and trt2). However, this ANOVA result does
not tell us which specific groups differ from each other.
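
The ANOVA behind this p-value can be sketched as follows. It is a reconstruction of the lost screenshot under the assumption that a one-way model weight ~ group was used; note that aov() returns a model object rather than a dataframe. The names anova_table and anova_p are illustrative.

```r
# One-way ANOVA of plant weight across the three treatment groups.
my_ANOVA_output <- aov(weight ~ group, data = PlantGrowth)

anova_table <- summary(my_ANOVA_output)
print(anova_table)

# p-value of the F-test for the group effect (first row of the table).
anova_p <- anova_table[[1]][["Pr(>F)"]][1]
cat("p-value:", round(anova_p, 4), "\n")
```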
i. Conduct a TukeyHSD test to determine which means are
significantly different from each other. Save your results to a
dataframe called my_Tukey. Show your output with a screenshot.

j. What are your conclusions?

By observing the differences in means, confidence intervals and adjusted
p-values, we can conclude that:
=> There is a significant difference between the trt2 and trt1 groups
since p = 0.0120.
=> There are no significant differences between the ctrl and trt1
groups since p = 0.3909, and between the ctrl and trt2 groups since p
= 0.1980.
=> Therefore, trt2 has a significantly higher mean weight compared
to trt1, but no significant differences were found between the other
group pairs.
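
The pairwise comparisons summarised above can be reproduced with a sketch like this one, again assuming the one-way model weight ~ group. The name tukey_p is introduced for illustration.

```r
# Refit the ANOVA so this block runs on its own.
my_ANOVA_output <- aov(weight ~ group, data = PlantGrowth)

# Tukey's Honest Significant Difference test for all pairs of groups.
my_Tukey <- TukeyHSD(my_ANOVA_output)
print(my_Tukey)

# Adjusted p-values, named by comparison (for example "trt2-trt1").
tukey_p <- my_Tukey$group[, "p adj"]
print(round(tukey_p, 4))
```

Only the trt2-trt1 comparison falls below the 0.05 level.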

