Notebook 2 - Linear Regression
Reference Guide for R (student resource) - Check out our reference guide for a full listing
of useful R commands for this project.
0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 2: Simple Linear Regression
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question.
In this notebook (the 2nd of 4 total notebooks), you’ll use R to create scatterplots, fit simple linear
regression models, and compare the strength of your models. By the end of this notebook, you’ll
see what factors make certain colleges better investments than others.
[1]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads the coursekata package, a useful collection of R commands
library(coursekata)
Community colleges and trade schools often have different goals (e.g. facilitating transfers, direct career
education) than institutions that offer four-year bachelor’s degrees. By comparing four-year colleges
only to other four-year colleges, we’ll have clearer analyses and conclusions.
This data is a subset of the US Department of Education’s College Scorecard Database. The data
is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here
[2]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads the data
dat <- read.csv('https://skewthescript.org/s/four_year_colleges.csv')
1.1 - Use the head command to print out the first several rows of the dataset.
[3]: # Your code goes here
head(dat)
1. 1053 2. 26
Check yourself: Your code should have printed out two numbers: 1053 and 26.
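The two numbers in the check above read like a row count followed by a column count. A minimal sketch of how such a pair can be produced, assuming the `dim` command is what generated it; the `toy` data frame and its values are made-up stand-ins, not the College Scorecard data:

```r
# Hypothetical stand-in for dat: 4 made-up colleges, 3 variables
toy <- data.frame(
  name         = c("A", "B", "C", "D"),
  pct_PELL     = c(20, 35, 50, 65),
  default_rate = c(2.5, 5.0, 8.1, 10.4)
)

# head prints the first rows of a data frame (up to 6 by default)
head(toy)

# dim returns the number of rows, then the number of columns
toy_shape <- dim(toy)
toy_shape
```

Under that assumption, the same call on `dat` would report its number of colleges (rows) and variables (columns).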
A good measure of whether attending a certain college “pays off” is its student loan default
rate. If a college is low-cost and prepares students for high-paying jobs, few students will default
on their loans. If a college is high-cost and does not prepare students for high-paying jobs, many
students will have trouble paying off their loans (high default rate).
So, our main outcome variable in this analysis will be default_rate. We’re going to use scatterplots
to see how strongly different predictor variables correlate with default rates. In particular,
we’re going to explore how well each of the following variables predicts colleges’ default rates:
- pct_PELL - percent of student body that receives PELL grants. Note: PELL grants are government scholarships given to students from low-income families
- grad_rate - percent of students who successfully graduate
- net_tuition - net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars
To begin, let’s create a scatterplot of colleges’ default rates and the percent of their student body
that receive PELL grants. We can use the gf_point command to make the graph:
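As a rough sketch of what such a call can look like: `gf_point` is provided by the ggformula package (which, as far as we know, `library(coursekata)` attaches) and takes a formula of the form outcome ~ predictor. The `toy` data frame and the name `pell_plot` are hypothetical, with made-up values for illustration only.

```r
# gf_point is provided by the ggformula package
library(ggformula)

# Hypothetical stand-in for dat (made-up values, for illustration only)
toy <- data.frame(
  pct_PELL     = c(15, 25, 35, 45, 55, 65),
  default_rate = c(2.1, 3.4, 5.2, 6.8, 9.0, 10.5)
)

# Formula syntax is outcome ~ predictor:
# default_rate goes on the y-axis and pct_PELL on the x-axis
pell_plot <- gf_point(default_rate ~ pct_PELL, data = toy)
pell_plot
```

In the notebook itself, the same formula with `data = dat` would plot the real colleges.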
We see that there’s a positive relationship between pct_PELL and default_rate. The colleges with
the highest rates of PELL grant recipients (low-income students) also tend to have higher student
loan default rates. In other words, if you were to fit a model to this data, it would predict higher
default rates at schools that serve more PELL recipients.
We must keep in mind: correlation is not causation. The scatterplot shows us that default
rates and PELL recipient rates are positively correlated. However, the graph doesn’t show us a
clear causal explanation behind the correlation. For example, here are several causal explanations
that this graph can’t clarify:
- PELL recipients may only be able to afford to attend low-quality colleges. These colleges have higher default rates because they fail to prepare students for the workforce.
- PELL recipients may have fewer family resources to weather financial emergencies in the first few years after college. So, the schools that serve PELL recipients at high rates will also have more of their students defaulting on loans (regardless of the school’s quality).
- PELL recipients may have attended lower-quality high schools, which don’t properly prepare them for college. So, these students may drop out of college at higher rates, which raises their chances of defaulting on student loans.
Or, it could be a combination of all those explanations! We can’t tell from this analysis alone.
1.3 - In the next question, you will create a scatterplot that visualizes the relationship between
grad_rate and default_rate. Before doing so, make a prediction: Do you expect student loan
default rates to positively or negatively correlate with graduation rates? Why?
Double-click this cell to type your answer here: Negatively correlate because non-graduates
will have a harder time in the workforce and have an increased rate of defaulting on their loans, so
lower grad rate would lead to higher default rates, therefore a negative correlation.
1.4 - Create a scatterplot that visualizes the relationship between grad_rate (predictor) and
default_rate (outcome).
Check yourself: Your code should have generated a scatterplot with the x-axis labeled with
grad_rate and the y-axis labeled with default_rate.
1.5 - Using your scatterplot, describe the relationship between graduation rates and student loan
default rates. For instance, are these variables positively or negatively related? How can you tell?
Does this corroborate your prediction from Question 1.3? Explain.
Double-click this cell to type your answer here: The variables have a negative linear
correlation because the points closely follow a downward-sloping line: colleges with higher graduation
rates tend to have lower default rates. This matches my prediction from Question 1.3.
2.2 - Is the slope value of this model positive or negative? How can you tell?
Double-click this cell to type your answer here: positive because as pct_PELL increases,
default_rate increases as well.
R can help us find the equation that models this linear regression line. As shown in the video,
we can model a linear trend between a predictor (x) and outcome (y) using this linear regression
formula:
$$\hat{y} = \beta_0 + \beta_1 x$$
Where:
- $\hat{y}$ (pronounced “y hat”) is the predicted y-value (predicted outcome value)
- $\beta_0$ (pronounced “beta zero”) is the y-intercept: the predicted y-value (outcome value) when x = 0 (the predictor’s value is 0)
- $\beta_1$ (pronounced “beta one”) is the slope: the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)
To fit a linear regression model to a set of data in R, we use the lm command. lm stands for
“linear model.” Here, we use lm to find the linear regression model relating pct_PELL (x) and
default_rate (y).
Call:
lm(formula = default_rate ~ pct_PELL, data = dat)
Coefficients:
(Intercept) pct_PELL
-0.9327 0.1765
The output of the lm command is a bit clunky, but here’s what it means:
- The (Intercept) value is the y-intercept ($\beta_0$)
- The pct_PELL value is the coefficient for the predictor. In other words, it’s the slope ($\beta_1$)
So, our regression equation can be written as:
$$\hat{y} = -0.9327 + 0.1765x$$
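To make the equation concrete, here is a small sketch that plugs the two coefficients from the printout into the regression formula. The helper name `predict_default` and the input value of 40 are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the lm printout above
b0 <- -0.9327   # intercept
b1 <- 0.1765    # slope for pct_PELL

# Hypothetical helper: predicted default rate for a given pct_PELL value
predict_default <- function(pct_pell) {
  b0 + b1 * pct_pell
}

# Made-up input: a college where 40% of students receive PELL grants
predict_default(40)

# Predictions one unit apart differ by exactly the slope
predict_default(41) - predict_default(40)
```

With these coefficients, a college at 40% PELL recipients gets a predicted default rate of about 6.13, and each additional percentage point of PELL recipients adds 0.1765 to the prediction.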
2.3 - Identify the slope value and interpret what it means (in context).
Double-click this cell to type your answer here: The slope is 0.1765. For every 1 percentage
point increase in the share of students receiving PELL grants, the model predicts a 0.1765 percentage
point increase in the default rate.
2.4 - Use the gf_point and gf_lm commands to visualize a linear regression model for predicting
default_rate (outcome) using grad_rate (predictor).
Check yourself: Your scatterplot should have a line on it with a negative slope.
2.5 - Use the lm command to find the linear regression model you visualized above. Store the
model in an object called grad_model and print it to see its values.
Call:
lm(formula = default_rate ~ grad_rate, data = dat)
Coefficients:
(Intercept) grad_rate
14.4600 -0.1584
Check yourself: If you print out grad_model, you should see two numbers: 14.46 and -0.1584.
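The fit, store, and print pattern can be sketched on a tiny made-up dataset. Here `toy` and `toy_model` are hypothetical names, and the values are constructed so that the true intercept (14) and slope (-0.15) are known in advance; they are not the notebook's results.

```r
# Hypothetical stand-in for dat, built so that
# default_rate = 14 - 0.15 * grad_rate holds exactly
toy <- data.frame(grad_rate = c(20, 40, 60, 80))
toy$default_rate <- 14 - 0.15 * toy$grad_rate

# Fit the model (outcome ~ predictor) and store it in an object
toy_model <- lm(default_rate ~ grad_rate, data = toy)

# Printing the object shows the Call and Coefficients sections
toy_model

# coef pulls out the intercept and slope as a named vector
coef(toy_model)
```

Because the toy data lie exactly on a line, `lm` recovers the intercept and slope that were used to build them, which is a quick way to confirm how to read the printout.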
2.6 - Identify the slope value and interpret what it means (in context).
Double-click this cell to type your answer here: The slope is -0.1584. For every 1 percentage
point increase in graduation rate, the model predicts a 0.1584 percentage point decrease in default rate.
0.1.5 3.0 - Analyzing strength ($R^2$)
In addition to the direction of a relationship (positive or negative), we can also look at the strength
of a relationship. The strength is a measure of the quality of our model’s predictions. A key
metric for analyzing the strength of a model is $R^2$. The following diagram (from Skew The Script)
shows the $R^2$ values of various linear models:
In the “weak” correlations, we see that our predictions (the linear model) tend to be far away from
the actual data values (the points). If we used a model with weak correlation to predict new data
values, our predictions would have high error. If we used a model with strong correlation to predict
new data values, our predictions would have low error.
$R^2$ takes values between 0 and 1 (alternatively: 0% and 100%). The stronger the model, the closer $R^2$
gets to 1 (or 100%). The weaker the model, the closer $R^2$ gets to 0 (or 0%). An intuitive way to
think about it: for the perfectly strong correlations, the model gives 100% perfect predictions. The
models explain 100% of the variation in the data, so $R^2$ = 100%. As the correlations get weaker,
they start leaving room for error, since the models capture less of the variation in the data. So, the
$R^2$ value declines from 100%, approaching 0% if there’s no correlation (the model adds no predictive
power compared to naive guessing).
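One way to see the “compared to naive guessing” idea is to compute $R^2$ by hand as 1 minus the ratio of the model's squared error to the squared error from always guessing the mean. The sketch below uses a made-up data frame (`toy`, with hypothetical values) and checks the result against the value R stores in the model summary.

```r
# Hypothetical stand-in for dat (made-up values, for illustration only)
toy <- data.frame(
  grad_rate    = c(30, 40, 50, 60, 70, 80),
  default_rate = c(10.2, 7.9, 7.1, 4.8, 4.0, 1.6)
)
toy_model <- lm(default_rate ~ grad_rate, data = toy)

# Squared error left over after using the model's predictions
sse <- sum(residuals(toy_model)^2)

# Squared error from naive guessing:
# predicting the mean default rate for every college
sst <- sum((toy$default_rate - mean(toy$default_rate))^2)

# R^2 is the share of the naive error that the model removes
r2_by_hand <- 1 - sse / sst
r2_by_hand

# summary stores the same value under r.squared
summary(toy_model)$r.squared
```

The two printed values agree, which shows that the “Multiple R-squared” reported by `summary` is this same ratio.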
Optional Resource: If you’d like a more thorough explanation of the math behind $R^2$, check out
this video.
To see the $R^2$ values of our linear regression models, we can use the summary command. For
example, here we get the summary printout of grad_model.
Call:
lm(formula = default_rate ~ grad_rate, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.9199 -1.4038 -0.2248 0.9011 20.5450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.45997 0.29152 49.60 <2e-16 ***
grad_rate -0.15839 0.00474 -33.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There’s a lot going on in this printout. For now, focus on the bottom of the printed information.
The Multiple R-squared value is the $R^2$ value for the model. In this case, $R^2$ = 51.5%. So, we
can say that the correlation between graduation rates and student loan default rates is moderately
strong. This model would yield moderately strong predictions for default rates if used to predict
on new colleges.
3.1 - Let’s consider a new variable: net_tuition (tuition minus average discounts and allowances
per student, in thousands of dollars). How well does a school’s tuition predict its student loan
default rate? Let’s start exploring. Go ahead and create a scatterplot that visualizes the relationship
between net_tuition (predictor) and default_rate (outcome). Overlay a linear regression
model on the graph using the %>% gf_lm(color = "orange") command.
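A minimal sketch of overlaying a regression line on a scatterplot, using a made-up data frame (`toy` and `tuition_plot` are hypothetical names and values). It uses R's built-in `|>` pipe (R 4.1 or later), which behaves like `%>%` in this call and avoids assuming which package supplies `%>%`.

```r
# gf_point and gf_lm are provided by the ggformula package
library(ggformula)

# Hypothetical stand-in for dat (made-up values, for illustration only)
toy <- data.frame(
  net_tuition  = c(5, 10, 15, 20, 25, 30),
  default_rate = c(7.5, 6.0, 5.2, 3.9, 3.1, 1.8)
)

# Draw the points, then pipe the plot into gf_lm to add the fitted line
tuition_plot <- gf_point(default_rate ~ net_tuition, data = toy) |>
  gf_lm(color = "orange")
tuition_plot
```

In the notebook, the same pattern with `data = dat` and the `%>%` pipe named in the question should give the requested graph.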
3.2 - Use the lm command to find the linear regression model you visualized above. Store the
model in an object called tuition_model and print out the model’s values.
[18]: # Your code goes here
tuition_model <- lm(default_rate ~ net_tuition, data=dat)
tuition_model
Call:
lm(formula = default_rate ~ net_tuition, data = dat)
Coefficients:
(Intercept) net_tuition
8.0029 -0.2077
Check yourself: If you print out tuition_model, you should see two numbers: 8.0029 and -0.2077.
3.3 - Use the summary command to find the $R^2$ value of your linear model.
[19]: # Your code goes here
summary(tuition_model)
Call:
lm(formula = default_rate ~ net_tuition, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.4480 -1.9912 -0.5984 1.2492 25.4189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.00294 0.21329 37.52 <2e-16 ***
net_tuition -0.20772 0.01331 -15.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
0.1.6 Feedback (Required)
Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!