Section 10.1 - 2 - Shared Lab
¹ https://www.lock5stat.com/datapage.html (3rd edition)
Exploratory Data Analysis
Open the dataset.
1. How many of these variables are quantitative? Which one is the response variable?
2. Make simple dotplots of the four predictor variables. Look at shape and the range of values for
each predictor.
Analysis
3. Using intuition, are there any of the four predictors that you tentatively think would do a good job
of predicting domestic gross income?
It is good practice to favor model simplicity: build a regression model that has the fewest
predictors needed to explain the variability found in the response variable. Because of this, we
will not immediately use the k = 4 model (four predictors).
4. Obtain a correlation matrix plot (an array of scatterplots) and pairwise correlations for all
quantitative variables, including the response variable. Use the variable order found in the data
dictionary above.
Reminder: Stat > Basic Stats > Correlation: Under Graphs select: Correlations
When you look at the scatterplots of the response variable versus each predictor variable:
A. Is there one predictor that shows a very strong linear relationship with the response variable?
B. Is there one predictor that shows a nonlinear relationship with the response variable which
would suggest that it should not be included in the multiple regression model?
C. Are there two predictor variables that show a rather strong linear relationship which could
suggest that both predictors may not be needed in the model?
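For readers who want to see what the pairwise correlations in step 4 measure, here is a minimal Python sketch of the sample Pearson correlation and a correlation matrix built from it. It is not part of the Minitab workflow. The column names and values in `toy` are made-up placeholders, not the lab dataset, and `pearson_r` and `correlation_matrix` are helper names invented for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Map every pair of column names (a, b) to their correlation."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}


# Made-up placeholder values, NOT the lab dataset.
toy = {
    "response": [2.0, 4.1, 5.9, 8.2, 9.8],
    "pred_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pred_b": [5.0, 3.0, 4.0, 1.0, 2.0],
}

matrix = correlation_matrix(toy)
for (a, b), r in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: r = {r:.3f}")
```

A correlation near +1 or -1 between the response and a predictor points to a strong linear relationship (question A). A correlation near +1 or -1 between two predictors is the situation described in question C.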
5. Fit the k = 3 multiple regression model where you include predictors: AudienceScore,
G_OpenWkend, and Budget.
Reminder: Stat > Regression > Regression > Fit Regression Model: Under Results select: Basic
Tables
6. Which predictors are effective at the 0.05 level? Which is the most effective predictor (Note: you
will need to go beyond just looking at the p-value)?
10. Predict the Domestic Gross Income (in millions $) for a movie with an Audience Score of 50%,
where Opening Weekend Gross Income is 10 million ($), from a Budget of 40 million ($).
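The prediction in step 10 can be written out as follows. This is a sketch that assumes the fitted k = 3 model has the usual form, with $b_0, b_1, b_2, b_3$ standing for the coefficients reported in your own Minitab output (no numeric values are assumed here), and that the predictors are recorded in the same units as the question (percent and millions of dollars).

```latex
\hat{y} = b_0 + b_1\,(\text{AudienceScore}) + b_2\,(\text{G\_OpenWkend}) + b_3\,(\text{Budget})
        = b_0 + b_1\,(50) + b_2\,(10) + b_3\,(40)
```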
11. On page 617 of your textbook, the authors explain how the variability in the response variable
can be partitioned into two parts.
Total variability in the response variable
= Regression (variability explained by the model)
+ Error (variability not explained by the model)
Look at the Analysis of Variance (ANOVA) Table on the Minitab output and find the values for the
three variabilities.
SSModel (Regression) =
SSE (Error) =
SSTotal (Total) =
$$R^2 = \frac{\text{SSModel}}{\text{SSTotal}}$$
With the k = 3 model: Verify the calculation of R². Does it match the number found on the Minitab
output?
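The partition of variability and the R² ratio can be checked numerically. The sketch below is illustrative only: it uses a one-predictor least-squares line as a stand-in for the k = 3 model (the same identity holds for any least-squares fit that includes an intercept), and the data values are made-up placeholders, not the lab dataset. The helper names `fit_line` and `sums_of_squares` are invented for this example.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sums_of_squares(y, fitted):
    """Return (SSModel, SSE, SSTotal) for observed and fitted values."""
    mean_y = sum(y) / len(y)
    ss_model = sum((f - mean_y) ** 2 for f in fitted)
    ss_error = sum((v - f) ** 2 for v, f in zip(y, fitted))
    ss_total = sum((v - mean_y) ** 2 for v in y)
    return ss_model, ss_error, ss_total


# Made-up placeholder values, NOT the lab dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * a for a in x]
ss_model, ss_error, ss_total = sums_of_squares(y, fitted)
r_squared = ss_model / ss_total

print(f"SSModel = {ss_model:.3f}, SSE = {ss_error:.3f}, SSTotal = {ss_total:.3f}")
print(f"R^2 = {r_squared:.4f}")
```

With the Minitab output, the same check is done by hand: add the Regression and Error sums of squares from the ANOVA table, confirm they give the Total, then divide SSModel by SSTotal.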
*Note: When you fit a regression model to a data set, you should always first check to see if the
model conditions are met. This is covered in Section 10.2 of your textbook. This topic is beyond
what is typically covered in Stat 200. However, when working with real data, this is a crucial step. A
second applied course in Statistics would include this essential topic.
Wildlife scientists have used crocodile skeletons to measure the lengths of both the heads and the
complete bodies (in centimeters) for two crocodile species.
Data Dictionary: The dataset Crocodile (available on Canvas) includes three variables²:
Body Length (centimeters)
Head Length (centimeters)
Species: Indian, Australian, where 17 are Indian and 15 are Australian
² De Veaux, R., Velleman, P., and Bock, D., 2020. Data and Models, 5th edition, Pearson Education.
Exploratory Data Analysis
1. Categorize the three variables (quantitative or categorical).
2. We’re going to explore if head length can be a predictor of body length. What is the response
variable in that situation?
3. Obtain a scatterplot of the two variables, where y = Body Length and x = Head Length. Is it
appropriate to fit a simple linear regression model to this data?
Analysis
4. Fit a simple regression model where Head Length is used to predict Body Length. Write out the
regression equation.
5. Is Head Length an effective predictor at the 0.05 level? What is the value for R²?
Next, we will investigate this question: Could the categorical variable Species explain more of the
variability in Body Length?
6. First make a scatterplot, where y = Body Length and x = Head Length. Include Species as a grouping
variable.
Reminder: Graph > Scatterplot > Groups Overlaid
7. What do you notice when looking at this scatterplot? Does it appear that Species is important
when examining the relationship between the two quantitative variables? If you fit a simple
regression line for each species, what would you find?
8. Fit a regression model that also includes Species as a categorical variable.
Reminder: Stat > Regression > Regression > Fit Regression Model (now include the categorical
variable of Species)
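To see what including a categorical predictor does to a regression model, here is a minimal Python sketch that codes a two-level group as a 0/1 indicator and fits the model by least squares. Several things here are assumptions for illustration: the group labels "A" and "B", the head lengths, and the rule used to generate body lengths are all made up (this is not the Crocodile dataset), and regression software may choose a different reference level or coding scheme. The helper names `solve` and `least_squares` are invented for this example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(rows, y):
    """Fit y = b0 + b1*x1 + ... via the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * v for d, v in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up placeholder data, NOT the Crocodile dataset.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
head = [20.0, 25.0, 30.0, 35.0, 22.0, 27.0, 32.0, 37.0]
indicator = [1.0 if g == "B" else 0.0 for g in groups]  # "A" is the reference level

# Synthetic response: same slope for both groups, intercept shifted by 20 for "B".
body = [10.0 + 7.0 * h + 20.0 * d for h, d in zip(head, indicator)]

b0, b_head, b_group = least_squares(list(zip(head, indicator)), body)
print(f"intercept = {b0:.2f}, head slope = {b_head:.2f}, group shift = {b_group:.2f}")
```

In a model of this form, the indicator coefficient shifts the intercept for one group while both groups share the same slope. That is the sense in which a categorical variable such as Species can explain variability that the quantitative predictor alone does not.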
9. Is Head Length still an effective predictor at the 0.05 level? What is the value for R²? Does it appear
that Species helps to explain more of the variability in Body Length?