
Notebook 3 - Multiple Regression

May 22, 2024

Reference Guide for R (student resource) - Check out our reference guide for a full listing
of useful R commands for this project.

0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 3: Multiple Regression
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question.
In this notebook (the 3rd of 4 total notebooks), you'll use R to create a more advanced type of
model: the multiple regression model. In doing so, you'll be able to isolate which factors (controlling
for other variables) make certain colleges worth the price of admission.

[1]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command downloads a useful package of R commands
library(coursekata)

── CourseKata packages ───────────────────────────── coursekata 0.15.0 ──
✔ dslabs              0.8.0    ✔ Metrics     0.1.4
✔ Lock5withR          1.2.2    ✔ lsr         0.5.2
✔ fivethirtyeightdata 0.1.0    ✔ mosaic      1.9.1
✔ fivethirtyeight     0.6.2    ✔ supernova   3.0.0

0.1.2 The Dataset (four_year_colleges.csv)


General description - In this notebook, we'll be using the four_year_colleges.csv file, which
only includes schools that offer four-year bachelor's degrees and/or higher graduate degrees.
Community colleges and trade schools often have different goals (e.g. facilitating transfers, direct
career education) than institutions that offer four-year bachelor's degrees. By comparing four-year
colleges only to other four-year colleges, we'll have clearer analyses and conclusions.

This data is a subset of the US Department of Education’s College Scorecard Database. The data
is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here

0.1.3 1.0 - Motivating multiple regression


To begin, let’s download our data. We’ll download the four_year_colleges.csv file from the
skewthescript.org website and store it in an R dataframe called dat.

[2]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command downloads data from the file 'four_year_colleges.csv' and stores it in an object called `dat`

dat <- read.csv('https://skewthescript.org/s/four_year_colleges.csv')


head(dat)

A data.frame: 6 × 26 (only the first few columns fit in this printout)

  OPEID  name                                 city        state  region
1 100200 Alabama A & M University             Normal      AL     South
2 105200 University of Alabama at Birmingham  Birmingham  AL     South
3 105500 University of Alabama in Huntsville  Huntsville  AL     South
4 100500 Alabama State University             Montgomery  AL     South
5 105100 The University of Alabama            Tuscaloosa  AL     South
6 831000 Auburn University at Montgomery      Montgomery  AL     South
… (the remaining columns are cut off in this printout)
As before, we’re going to use student loan default rates as our key outcome variable in
determining whether college “pays off.”
In the previous notebook, we looked at the following predictors of student loan default rates:
- pct_PELL - percent of the student body that receives Pell grants. Note: Pell grants are government scholarships given to students from low-income families
- grad_rate - percent of students who successfully graduate
- net_tuition - net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars
In the last notebook, we fit a simple linear regression model to predict default_rate (outcome)
using net_tuition (predictor). Below is the scatterplot we produced (along with a visual of our
linear model):

[3]: ## Run this code but do not edit it


# create scatterplot: default_rate ~ net_tuition, with linear model overlayed
gf_point(default_rate ~ net_tuition, data = dat) %>% gf_lm(color = "orange")

1.1 (Review Question) - Is the correlation between tuition costs and student loan default rates
positive or negative? Does the direction of the relationship surprise you? Why or why not?
Double-click this cell to type your answer here:
Below is the same graphic except, this time, we color the colleges by their graduation rates. Take
a look:
[4]: ## Run this code but do not edit it
# show scatter for default_rate ~ net_tuition, color by grad_rate
gf_point(default_rate ~ net_tuition, color = ~grad_rate, data = dat)

Note: As stated in the dataset description above, default_rate describes the percent of all of a
school’s borrowers that are in default on their student loans. This includes students who have
graduated, transferred, or did not complete their programs.
There’s a lot going on in this graph. For help, we recommend watching this video, which
discusses how to interpret graphs that visualize multiple variables at once.
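If the continuous color scale is hard to read, here is an optional sketch that bins grad_rate into four bands and facets the scatterplot, so each panel becomes an ordinary two-variable plot. The grad_band variable and its cut points are our own choices for illustration; the plotting helpers are the ggformula functions loaded by coursekata.

## Optional sketch: bin graduation rates and facet the scatterplot
# Create a new variable with four graduation-rate bands (cut points chosen for illustration)
dat$grad_band <- cut(dat$grad_rate, breaks = c(0, 25, 50, 75, 100),
                     labels = c("0-25%", "25-50%", "50-75%", "75-100%"),
                     include.lowest = TRUE)

# Each panel now shows default_rate ~ net_tuition for one band of graduation rates
gf_point(default_rate ~ net_tuition, data = dat) %>%
  gf_facet_wrap(~ grad_band)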
1.2 - Look at the bottom-right corner of the graph. These are colleges that charge their students
a lot of money (high tuition) yet, somehow, they have low student loan default rates. Describe the
graduation rates of these schools.
Double-click this cell to type your answer here: The graduation rates of these high-tuition schools
are close to 100%, since the dots on the right side of the graph are generally yellow/yellow-green.
1.3 - Look at the top-left corner of the graph. These are colleges that don’t charge a lot (low
tuitions) yet, somehow, their students have high default rates. Describe the graduation rates of
these schools.
Double-click this cell to type your answer here: The graduation rates of lower tuition schools
tend to be lower, from 0 to 50%, as the dots are blue/blue-green on the left side of the graph.
1.4 - Based on your answers to the previous two questions, give a possible reason why students
at lower-cost schools (who, presumably, have less initial debt than their peers) somehow have
higher loan default rates.
Double-click this cell to type your answer here: Lower-cost schools might be more accessible
to students in tougher financial situations, who may be less able to finish their programs or pay
off their loans, which increases default rates and decreases graduation rates.
In data science, we say that graduation rates and tuition are confounded. Since they both rise
and fall together, it can be hard to tell which is really “making the difference” in default rates. Is
it possible to “tease out” which factor is more directly associated with students being able to pay
off their loans? The next section will introduce you to a new type of modeling - multiple regression
- that can help us answer this question.
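One quick, optional check is to compute the correlation between the two predictors themselves; this is a minimal sketch using base R on the dat dataframe loaded above.

## Optional check: are the two predictors related to each other?
# A clearly positive value means higher-tuition schools also tend to have
# higher graduation rates - i.e., the two predictors are confounded.
cor(dat$net_tuition, dat$grad_rate)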

0.1.4 2.0 - Fitting and interpreting a multiple regression model


Again, let’s show the scatterplot between net_tuition (predictor) and default_rate (outcome),
along with the linear model:
[5]: ## Run this code but do not edit it
# create scatterplot: default_rate ~ net_tuition, with linear model overlayed
gf_point(default_rate ~ net_tuition, data = dat) %>% gf_lm(color = "orange")

2.1 (Review Question) - Use the lm command to fit and store the linear regression model that’s
visualized above, using net_tuition (predictor) in order to predict default_rate (outcome). Save
the model in an object called tuition_model and print out the model.
[8]: # Your code goes here
tuition_model <- lm(default_rate ~ net_tuition, data=dat)
tuition_model

Call:
lm(formula = default_rate ~ net_tuition, data = dat)

Coefficients:
(Intercept) net_tuition
8.0029 -0.2077

Check yourself: If you print out tuition_model, you should see two numbers: 8.0029 and -0.2077.
Recall that simple linear regressions follow this formula:

\hat{y} = \beta_0 + \beta_1 x

Where:
- \hat{y} is the predicted y-value (predicted outcome value)
- \beta_0 is the y-intercept --> the predicted y-value (outcome value) when x = 0 (the predictor's value is 0)
- \beta_1 is the slope --> the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- x is the x-value (predictor value)
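As an optional check, we can plug a value into this formula by hand and compare the result to R's predict function. The input net_tuition = 10 (i.e., $10,000) is a hypothetical value chosen just for illustration.

## Optional check: predicted default_rate when net_tuition = 10 (i.e., $10,000)
# By hand: 8.0029 + (-0.2077 * 10) = roughly 5.93
predict(tuition_model, newdata = data.frame(net_tuition = 10))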
2.2 (Review Question) - What is the slope value from our tuition_model? Interpret the
meaning of this value (in context).
Double-click this cell to type your answer here: The slope value is -0.2077, meaning that for
every $1,000 increase in net tuition, the default rate is expected to decrease by about 0.21 percentage points.
2.3 (Review Question) - Use the summary command on tuition_model to see summary infor-
mation about the linear model. What is the 𝑅2 value from our tuition_model? What does this
value indicate about the strength of the model?
[9]: # Your code goes here
summary(tuition_model)

Call:
lm(formula = default_rate ~ net_tuition, data = dat)

Residuals:
Min 1Q Median 3Q Max
-6.4480 -1.9912 -0.5984 1.2492 25.4189

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.00294 0.21329 37.52 <2e-16 ***
net_tuition -0.20772 0.01331 -15.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.375 on 1051 degrees of freedom


Multiple R-squared: 0.1882, Adjusted R-squared: 0.1875
F-statistic: 243.7 on 1 and 1051 DF, p-value: < 2.2e-16

Check yourself: You should find an 𝑅2 value of 0.1882
Double-click this cell to type your answer here: The R^2 value is 0.1882, meaning net_tuition
alone explains only about 19% of the variation in default_rate, so this is a fairly weak model.
Student Note: There’s a lot going on in the following section. We recommend taking a break
to watch this video, which provides an overview of multiple regression models and walks through
interpreting the values from this model. Once you’re done with the video, continue reading below.
So far, we have only been working with simple linear regressions: models that use one predictor
variable (net_tuition) to predict the outcome variable (default_rate). If we'd like to use
multiple predictor variables at once in order to model our outcome, we can use a technique called
multiple regression.
For example, imagine we want to use both net_tuition (𝑥1 ) and grad_rate (𝑥2 ) to predict
default_rate (𝑦). We can write a new model with multiple predictors, like this:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2

Where:
- \hat{y} is the predicted default_rate
- x_1 is the net_tuition
- x_2 is the grad_rate

This means that…
- \beta_1 is the slope for net_tuition --> the slope between default_rate and net_tuition, controlling for all other predictors
- \beta_2 is the slope for grad_rate --> the slope between default_rate and grad_rate, controlling for all other predictors
- \beta_0 is the y-intercept --> the predicted y-value when x_1 = 0 and x_2 = 0 (when net_tuition and grad_rate are both 0)
Let’s go ahead and fit this model, so we can understand what this all really means. In R, if we
want to use multiple predictors within our model (such as net_tuition and grad_rate), we simply
include both of them in our lm command. See below:
[10]: ## Run this code but do not edit it
# fit multiple regression: default_rate ~ net_tuition + grad_rate
tuition_grad_model <- lm(default_rate ~ net_tuition + grad_rate, data = dat)
tuition_grad_model

Call:
lm(formula = default_rate ~ net_tuition + grad_rate, data = dat)

Coefficients:
(Intercept) net_tuition grad_rate
14.478742 0.006692 -0.160296

As you can see, the values have changed a bit, and an extra slope term has now appeared in our
model. We can plug these values into our model like so:

\hat{y} = 14.479 + (0.007)x_1 + (-0.160)x_2

Here's how we can interpret the slopes in our model:
- \beta_1 = 0.007 --> For every 1,000 dollar increase in net_tuition, we expect a 0.007 percentage point increase in default_rate, controlling for grad_rate
- \beta_2 = -0.160 --> For every 1 percentage point increase in grad_rate, we expect a 0.160 percentage point decrease in default_rate, controlling for net_tuition
The key is that multiple regression allows you to control for other predictors, which helps us
eliminate confounding. When we can control for graduation rates - i.e. when comparing colleges
with similar graduation rates - we see that tuition is now positively related to default rates. In
other words, if students attend colleges with similar graduation rates, we’d expect the one that
charges more in tuition to have higher rates of default.
So, charging students more for school is, in fact, associated with higher rates of default - as long
as we’re comparing among schools with similar graduation rates.
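To make that concrete, here is an optional sketch that compares two hypothetical colleges with the same graduation rate (60%) but different net tuitions ($10,000 vs. $30,000); these inputs are made up purely for illustration.

## Optional check: same grad_rate, different net_tuition
two_schools <- data.frame(net_tuition = c(10, 30), grad_rate = c(60, 60))
predict(tuition_grad_model, newdata = two_schools)
# Holding grad_rate fixed at 60%, the higher-tuition school gets the (slightly)
# higher predicted default rate, matching the positive net_tuition slope above.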
Just as we can use the summary command to find the 𝑅2 value of a simple linear regression, we can
use summary to find the 𝑅2 of our multiple regression model:
[11]: ## Run this code but do not edit it
# summary of tuition_grad_model
summary(tuition_grad_model)

Call:
lm(formula = default_rate ~ net_tuition + grad_rate, data = dat)

Residuals:
Min 1Q Median 3Q Max
-6.9530 -1.4051 -0.1913 0.9162 20.4882

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.478742 0.293922 49.261 <2e-16 ***
net_tuition 0.006692 0.013066 0.512 0.609
grad_rate -0.160296 0.006023 -26.616 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.609 on 1050 degrees of freedom


Multiple R-squared: 0.5153, Adjusted R-squared: 0.5143
F-statistic: 558.1 on 2 and 1050 DF, p-value: < 2.2e-16

2.4 - How does the 𝑅2 value of our multiple regression model (tuition_grad_model) compare
to the 𝑅2 value of our simple linear regression model (tuition_model)? Did adding grad_rate
alongside net_tuition help make the model's predictions stronger? Explain.
Double-click this cell to type your answer here: The R^2 for the multiple regression model
is higher, so adding grad_rate to net_tuition made the model’s predictions stronger.
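If you'd like to go beyond eyeballing the R-squared values, one optional approach is an F-test comparing the two nested models; this is a minimal sketch using base R's anova function on the models fit above.

## Optional: F-test comparing the nested models
# The output includes an F statistic and p-value for the improvement gained
# by adding grad_rate on top of net_tuition.
anova(tuition_model, tuition_grad_model)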

0.1.5 3.0 - Making your own multiple regression models


3.1 - There's no reason that you have to stop at 2 predictors. Your model could have many
predictors! Use the lm command to create a model that predicts default_rate using three
predictor variables: net_tuition, grad_rate, and pct_PELL. Store the model in an object called
tuition_grad_pell_model and then print out the model.

[12]: # Your code goes here


tuition_grad_pell_model <- lm(default_rate ~ net_tuition + grad_rate + pct_PELL, data = dat)

tuition_grad_pell_model

Call:
lm(formula = default_rate ~ net_tuition + grad_rate + pct_PELL,
data = dat)

Coefficients:
(Intercept) net_tuition grad_rate pct_PELL
8.51264 0.03059 -0.11731 0.09045

Check yourself: When you print out the model, you should see four numbers: 8.513, 0.031, -0.117,
0.090
3.2 - Interpret (in context) the slope value for pct_PELL from your model.
Double-click this cell to type your answer here: For every 1 percentage point increase in the
proportion of students at the institution who receive Pell grants, the expected default rate increases
by 0.090 percentage points, controlling for net_tuition and grad_rate.
3.3 - Use the summary command to find the 𝑅2 value of the tuition_grad_pell_model.

[13]: # Your code goes here


summary(tuition_grad_pell_model)

Call:
lm(formula = default_rate ~ net_tuition + grad_rate + pct_PELL,
data = dat)

Residuals:
Min 1Q Median 3Q Max
-8.1491 -1.2449 -0.0454 1.0199 18.8640

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.512643 0.552863 15.397 <2e-16 ***
net_tuition 0.030587 0.012354 2.476 0.0134 *
grad_rate -0.117307 0.006603 -17.766 <2e-16 ***
pct_PELL 0.090451 0.007275 12.432 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.437 on 1049 degrees of freedom
Multiple R-squared: 0.5775, Adjusted R-squared: 0.5763
F-statistic: 478 on 3 and 1049 DF, p-value: < 2.2e-16

Check yourself: The output should show an 𝑅2 value of 0.5775


3.4 - Compare the 𝑅2 values from tuition_grad_model and tuition_grad_pell_model. Did
adding pct_PELL strengthen the model’s predictions? If so, did it strengthen the model’s predictions
by a large amount? Explain.
Double-click this cell to type your answer here: Yes, the model's coefficient of determination (R^2)
increased from roughly 0.52 to 0.58, a modest improvement in its predictive capability on our data.
3.5 - Create your own multiple regression model, using variables of your own choosing. Analyze
the slope values from at least two separate predictors and try to maximize the 𝑅2 value.
Hints:
- Look at the dataset description here to identify good potential predictor variables for your model (the short sketch after these hints shows one way to list the available columns).
- You may be tempted to use all the variables in the dataset as predictors. This may not be the best idea. The next notebook will explore why.
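If the dataset description link isn't handy, here is a quick optional sketch for listing the columns you can choose from, using base R on the dat dataframe.

## Optional: list the available columns (and their types) in the dataset
names(dat)
str(dat)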
[21]: # Your code goes here
fam_income_tuition_model <- lm(default_rate ~ med_alum_earnings + grad_rate + pct_PELL, data = dat)

summary(fam_income_tuition_model)

Call:
lm(formula = default_rate ~ med_alum_earnings + grad_rate + pct_PELL,
data = dat)

Residuals:
Min 1Q Median 3Q Max
-8.124 -1.339 -0.184 1.097 18.563

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.812792 0.593973 16.521 < 2e-16 ***
med_alum_earnings -0.042631 0.008303 -5.135 3.37e-07 ***
grad_rate -0.089102 0.007170 -12.427 < 2e-16 ***
pct_PELL 0.081407 0.007222 11.272 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.414 on 1049 degrees of freedom


Multiple R-squared: 0.5855, Adjusted R-squared: 0.5843
F-statistic: 493.8 on 3 and 1049 DF, p-value: < 2.2e-16

0.1.6 Feedback (Required)
Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!
