Notebook 3 - Multiple Regression
Reference Guide for R (student resource) - Check out our reference guide for a full listing
of useful R commands for this project.
0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 3: Multiple Regression
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question.
In this notebook (the 3rd of 4 total notebooks), you’ll use R to create a more advanced type of
model: the multiple regression model. In doing so, you’ll be able to isolate which factors (controlling
for other variables) make certain colleges worth the price of admission.
[1]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command loads a useful package of R commands
library(coursekata)
This data is a subset of the US Department of Education’s College Scorecard Database. The data
is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here
[2]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command downloads data from the file 'colleges.csv' and stores it in an object called `dat`
1.1 (Review Question) - Is the correlation between tuition costs and student loan default rates
positive or negative? Does the direction of the relationship surprise you? Why or why not?
Double-click this cell to type your answer here:
Below is the same graphic except, this time, we color the colleges by their graduation rates. Take
a look:
[4]: ## Run this code but do not edit it
# show scatter for default_rate ~ net_tuition, color by grad_rate
gf_point(default_rate ~ net_tuition, color = ~grad_rate, data = dat)
Note: As stated in the dataset description above, default_rate describes the percent of all of a
school’s borrowers that are in default on their student loans. This includes students who have
graduated, transferred, or did not complete their programs.
There’s a lot going on in this graph. For help, we recommend watching this video, which
discusses how to interpret graphs that visualize multiple variables at once.
1.2 - Look at the bottom-right corner of the graph. These are colleges that charge their students
a lot of money (high tuition) yet, somehow, they have low student loan default rates. Describe the
graduation rates of these schools.
Double-click this cell to type your answer here: The graduation rates of high tuition schools
are close to 100 because the dots on the right are generally yellow/yellow-green.
1.3 - Look at the top-left corner of the graph. These are colleges that don’t charge a lot (low
tuitions) yet, somehow, their students have high default rates. Describe the graduation rates of
these schools.
Double-click this cell to type your answer here: The graduation rates of lower tuition schools
tend to be lower, from 0 to 50%, as the dots are blue/blue-green on the left side of the graph.
1.4 - Based on your answers to the previous two questions, give a possible reason why students
at lower-cost schools (who, presumably, have less initial debt than their peers) somehow have
higher loan default rates.
Double-click this cell to type your answer here: Lower cost schools might be more accessible
for students in poorer financial scenarios, which means they may not be able to pay off loans or
finish schooling, which increases default rates and decreases graduation rates.
In data science, we say that graduation rates and tuition are confounded. Since they both rise
and fall together, it can be hard to tell which is really “making the difference” in default rates. Is
it possible to “tease out” which factor is more directly associated with students being able to pay
off their loans? The next section will introduce you to a new type of modeling - multiple regression
- that can help us answer this question.
2.1 (Review Question) - Use the lm command to fit and store the linear regression model that’s
visualized above, using net_tuition (predictor) in order to predict default_rate (outcome). Save
the model in an object called tuition_model and print out the model.
[8]: # Your code goes here
tuition_model <- lm(default_rate ~ net_tuition, data=dat)
tuition_model
Call:
lm(formula = default_rate ~ net_tuition, data = dat)
Coefficients:
(Intercept) net_tuition
8.0029 -0.2077
Check yourself: If you print out tuition_model, you should see two numbers: 8.0029 and -0.2077.
Recall that simple linear regressions follow this formula:
$$\hat{y} = \beta_0 + \beta_1 x$$
Where:
- $\hat{y}$ is the predicted y-value (predicted outcome value)
- $\beta_0$ is the y-intercept --> the predicted y-value (outcome value) when x = 0 (the predictor’s value is 0)
- $\beta_1$ is the slope --> the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)
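To make the formula concrete, here is a minimal sketch that plugs the two printed coefficients into it by hand. The college charging $10,000 in net tuition (x = 10, since net_tuition is measured in thousands of dollars) is a hypothetical example, not a row from the dataset.

```r
# Coefficients copied from the printed tuition_model output
b0 <- 8.0029   # intercept: predicted default_rate when net_tuition is 0
b1 <- -0.2077  # slope: change in default_rate per extra $1,000 of net tuition

# Hypothetical college charging $10,000 in net tuition (x = 10)
x <- 10
predicted_default_rate <- b0 + b1 * x
predicted_default_rate  # 5.9259
```

If the fitted model is in memory, `predict(tuition_model, newdata = data.frame(net_tuition = 10))` should give nearly the same number; small differences come from the rounding in the printed coefficients.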
2.2 (Review Question) - What is the slope value from our tuition_model? Interpret the
meaning of this value (in context).
Double-click this cell to type your answer here: The slope value is -0.2077, meaning that for
every $1,000 increase in net tuition, the default rate is expected to decrease by 0.2077 percentage points.
2.3 (Review Question) - Use the summary command on tuition_model to see summary information
about the linear model. What is the $R^2$ value from our tuition_model? What does this
value indicate about the strength of the model?
[9]: # Your code goes here
summary(tuition_model)
Call:
lm(formula = default_rate ~ net_tuition, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.4480 -1.9912 -0.5984 1.2492 25.4189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.00294 0.21329 37.52 <2e-16 ***
net_tuition -0.20772 0.01331 -15.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Check yourself: You should find an $R^2$ value of 0.1882
Double-click this cell to type your answer here: The R^2 value is 0.1882, meaning net_tuition
explains only about 19% of the variation in default_rate, so the model is fairly weak.
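For a simple linear regression (one predictor), $R^2$ is the square of the correlation between the predictor and the outcome, and the correlation takes the sign of the slope. The sketch below recovers the correlation from the two reported values; it uses only numbers printed in this notebook, so the result is approximate because those values are rounded.

```r
# Values reported for tuition_model in this notebook
r_squared <- 0.1882
slope <- -0.2077

# In simple regression, r = sign(slope) * sqrt(R^2)
correlation <- sign(slope) * sqrt(r_squared)
round(correlation, 3)  # -0.434
```

A correlation of about -0.43 is consistent with the negative relationship asked about in question 1.1.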
Student Note: There’s a lot going on in the following section. We recommend taking a break
to watch this video, which provides an overview of multiple regression models and walks through
interpreting the values from this model. Once you’re done with the video, continue reading below.
So far, we have only been working with simple linear regressions: models that use one predictor
variable (net_tuition) to predict the outcome variable (default_rate). If we’d like to use
multiple predictor variables at once in order to model our outcome, we can use a technique called
multiple regression.
For example, imagine we want to use both net_tuition ($x_1$) and grad_rate ($x_2$) to predict
default_rate ($y$). We can write a new model with multiple predictors, like this:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
Where:
- $\hat{y}$ is the predicted default_rate
- $x_1$ is the net_tuition
- $x_2$ is the grad_rate
This means that…
- $\beta_1$ is the slope for net_tuition --> the slope between default_rate and net_tuition, controlling for all other predictors
- $\beta_2$ is the slope for grad_rate --> the slope between default_rate and grad_rate, controlling for all other predictors
- $\beta_0$ is the y-intercept --> the predicted y-value when $x_1 = 0$ and $x_2 = 0$ (when net_tuition and grad_rate are both 0)
Let’s go ahead and fit this model, so we can understand what this all really means. In R, if we
want to use multiple predictors within our model (such as net_tuition and grad_rate), we simply
include both of them in our lm command. See below:
[10]: ## Run this code but do not edit it
# fit multiple regression: default_rate ~ net_tuition + grad_rate
tuition_grad_model <- lm(default_rate ~ net_tuition + grad_rate, data = dat)
tuition_grad_model
Call:
lm(formula = default_rate ~ net_tuition + grad_rate, data = dat)
Coefficients:
(Intercept) net_tuition grad_rate
14.478742 0.006692 -0.160296
As you can see, the values have changed a bit, and an extra slope term has now appeared in our
model. We can plug these values (rounded to three decimal places) into our model like so:
$$\hat{y} = 14.479 + 0.007 x_1 - 0.160 x_2$$
Here’s how we can interpret the slopes in our model:
- $\beta_1 = 0.007$ --> For every 1,000 dollar increase in net_tuition, we expect a 0.007 percentage point increase in default_rate, controlling for grad_rate
- $\beta_2 = -0.160$ --> For every 1 percentage point increase in grad_rate, we expect a 0.160 percentage point decrease in default_rate, controlling for net_tuition
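One way to see what "controlling for" means is to compare predictions for colleges that differ in only one predictor. The sketch below uses the coefficients printed for tuition_grad_model; the three colleges and the helper `predict_default` are hypothetical and exist only for illustration.

```r
# Coefficients copied from the printed tuition_grad_model output
b0 <- 14.478742  # intercept
b1 <- 0.006692   # slope for net_tuition (thousands of dollars)
b2 <- -0.160296  # slope for grad_rate (percentage points)

# Hypothetical helper that applies the fitted equation by hand
predict_default <- function(net_tuition, grad_rate) {
  b0 + b1 * net_tuition + b2 * grad_rate
}

college_a <- predict_default(net_tuition = 10, grad_rate = 60)
college_b <- predict_default(net_tuition = 10, grad_rate = 61)  # same tuition, +1 grad_rate
college_c <- predict_default(net_tuition = 11, grad_rate = 60)  # same grad_rate, +$1,000 tuition

college_a              # 4.927902
college_b - college_a  # equals b2: tuition held fixed
college_c - college_a  # equals b1: grad_rate held fixed
```

Because the other predictor is held at the same value in each comparison, the difference between the two predictions is exactly the slope for the predictor that changed.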
The key is that multiple regression allows you to control for other predictors, which helps us
reduce confounding. When we control for graduation rates - i.e. when we compare colleges
with similar graduation rates - we see that tuition is now positively (though only slightly) related
to default rates. In other words, if two colleges have similar graduation rates, we’d expect the one
that charges more in tuition to have a slightly higher rate of default.
So, charging students more for school is, in fact, associated with higher rates of default - as long
as we’re comparing among schools with similar graduation rates.
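The sign flip described above can be reproduced on made-up data. The sketch below simulates a toy dataset in which default rates depend only on graduation rates, while tuition merely rises along with graduation rates. All numbers here are invented for illustration and are not taken from the College Scorecard data; only the variable names mirror the notebook.

```r
set.seed(1)
n <- 500

# Made-up data: grad_rate drives default_rate; net_tuition only tracks grad_rate
grad_rate    <- runif(n, min = 20, max = 100)
net_tuition  <- 5 + 0.2 * grad_rate + rnorm(n, sd = 3)
default_rate <- 15 - 0.16 * grad_rate + rnorm(n, sd = 2)
toy <- data.frame(net_tuition, grad_rate, default_rate)

# Simple regression: tuition looks strongly "protective" because of confounding
simple_fit <- lm(default_rate ~ net_tuition, data = toy)
simple_slope <- coef(simple_fit)[["net_tuition"]]

# Multiple regression: controlling for grad_rate removes most of that slope
multiple_fit <- lm(default_rate ~ net_tuition + grad_rate, data = toy)
adjusted_slope <- coef(multiple_fit)[["net_tuition"]]

c(simple = simple_slope, adjusted = adjusted_slope)
```

In this toy setup the true tuition effect is zero, so the adjusted slope lands near zero, while the unadjusted slope is clearly negative. The real data behave similarly, except that the adjusted slope there comes out slightly positive.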
Just as we can use the summary command to find the $R^2$ value of a simple linear regression, we can
use summary to find the $R^2$ of our multiple regression model:
[11]: ## Run this code but do not edit it
# summary of tuition_grad_model
summary(tuition_grad_model)
Call:
lm(formula = default_rate ~ net_tuition + grad_rate, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.9530 -1.4051 -0.1913 0.9162 20.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.478742 0.293922 49.261 <2e-16 ***
net_tuition 0.006692 0.013066 0.512 0.609
grad_rate -0.160296 0.006023 -26.616 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
2.4 - How does the $R^2$ value of our multiple regression model (tuition_grad_model) compare
to the $R^2$ value of our simple linear regression model (tuition_model)? Did adding grad_rate
alongside net_tuition help make the model’s predictions stronger? Explain.
Double-click this cell to type your answer here: The R^2 for the multiple regression model
is higher, so adding grad_rate to net_tuition made the model’s predictions stronger.
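When comparing $R^2$ values, it helps to know that adding a predictor can never lower $R^2$, even when the new predictor is pure noise. What matters is how large the increase is. The sketch below illustrates this on made-up data; the `random_noise` column is a hypothetical predictor with no relationship to the outcome.

```r
set.seed(42)
n <- 200

# Made-up data: default_rate depends on grad_rate only
grad_rate    <- runif(n, min = 20, max = 100)
default_rate <- 15 - 0.16 * grad_rate + rnorm(n, sd = 2)
random_noise <- rnorm(n)  # unrelated to default_rate by construction
toy <- data.frame(grad_rate, default_rate, random_noise)

base_fit  <- lm(default_rate ~ grad_rate, data = toy)
noise_fit <- lm(default_rate ~ grad_rate + random_noise, data = toy)

# summary() stores both versions of R-squared
r2_base      <- summary(base_fit)$r.squared
r2_noise     <- summary(noise_fit)$r.squared
adj_r2_base  <- summary(base_fit)$adj.r.squared
adj_r2_noise <- summary(noise_fit)$adj.r.squared

c(r2_base = r2_base, r2_noise = r2_noise)
c(adj_r2_base = adj_r2_base, adj_r2_noise = adj_r2_noise)
```

The plain $R^2$ rises by a tiny amount when the noise column is added. The adjusted $R^2$ penalizes extra predictors, so it is the fairer number to compare across models with different numbers of predictors.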
3.1 - Fit a multiple regression model that predicts default_rate using three predictor
variables: net_tuition, grad_rate, and pct_PELL. Store the model in an object called
tuition_grad_pell_model and then print out the model.
tuition_grad_pell_model <- lm(default_rate ~ net_tuition + grad_rate + pct_PELL, data = dat)
tuition_grad_pell_model
Call:
lm(formula = default_rate ~ net_tuition + grad_rate + pct_PELL,
data = dat)
Coefficients:
(Intercept) net_tuition grad_rate pct_PELL
8.51264 0.03059 -0.11731 0.09045
Check yourself: When you print out the model, you should see four numbers: 8.513, 0.031, -0.117,
0.090
3.2 - Interpret (in context) the slope value for pct_PELL from your model.
Double-click this cell to type your answer here: For every 1 percentage point increase in the
share of students receiving Pell grants at the institution, the expected default rate increases
by 0.090 percentage points, controlling for net_tuition and grad_rate.
3.3 - Use the summary command to find the $R^2$ value of the tuition_grad_pell_model.
summary(tuition_grad_pell_model)
Call:
lm(formula = default_rate ~ net_tuition + grad_rate + pct_PELL,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-8.1491 -1.2449 -0.0454 1.0199 18.8640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.512643 0.552863 15.397 <2e-16 ***
net_tuition 0.030587 0.012354 2.476 0.0134 *
grad_rate -0.117307 0.006603 -17.766 <2e-16 ***
pct_PELL 0.090451 0.007275 12.432 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.437 on 1049 degrees of freedom
Multiple R-squared: 0.5775, Adjusted R-squared: 0.5763
F-statistic: 478 on 3 and 1049 DF, p-value: < 2.2e-16
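The last three lines of this output are tied together by standard formulas. As a rough check, the sketch below recomputes the adjusted $R^2$ and the F-statistic from the reported $R^2$ and degrees of freedom. The number of colleges used in the fit is inferred as residual degrees of freedom plus the number of estimated coefficients, and results are approximate because the reported $R^2$ is rounded.

```r
# Values reported in the summary output for tuition_grad_pell_model
r_squared <- 0.5775
df_resid  <- 1049  # residual degrees of freedom
p         <- 3     # number of predictors

# Colleges used in the fit: residual df + predictors + intercept
n <- df_resid + p + 1

# Adjusted R-squared penalizes each extra predictor
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F-statistic: explained variation per predictor vs. unexplained per residual df
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)

round(adj_r_squared, 4)  # 0.5763
round(f_stat)            # 478
```

Both values match the printed output, which suggests the model was fit on 1,053 colleges with complete data for all four variables.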
summary(fam_income_tuition_model)
Call:
lm(formula = default_rate ~ med_alum_earnings + grad_rate + pct_PELL,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-8.124 -1.339 -0.184 1.097 18.563
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.812792 0.593973 16.521 < 2e-16 ***
med_alum_earnings -0.042631 0.008303 -5.135 3.37e-07 ***
grad_rate -0.089102 0.007170 -12.427 < 2e-16 ***
pct_PELL 0.081407 0.007222 11.272 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
0.1.6 Feedback (Required)
Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!