Notebook 2 - Linear Regression
Reference Guide for R (student resource) - Check out our reference guide for a full listing
of useful R commands for this project.
0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 2: Simple Linear Regression
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question.
In this notebook (the 2nd of 4 total notebooks), you’ll use R to create scatterplots, fit simple linear
regression models, and compare the strength of your models. By the end of this notebook, you’ll
see what factors make certain colleges better investments than others.
[1]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads a useful package of R commands
library(coursekata)
This data is a subset of the US Department of Education’s College Scorecard Database. The data
is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here
[2]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads the data
dat <- read.csv('https://skewthescript.org/s/four_year_colleges.csv')
1.1 - Use the head command to print out the first several rows of the dataset.
[3]: # Your code goes here
head(dat)
1. 1053 2. 26
Check yourself: Your code should have printed out two numbers: 1053 and 26.
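The two numbers in this check look like the dataset's dimensions (rows, then columns) rather than the output of `head`. A minimal sketch of the difference between the two commands, using a small made-up data frame (`toy` and its values are hypothetical, not the College Scorecard data):

```r
# Hypothetical toy data frame standing in for `dat` (values are made up)
toy <- data.frame(
  name         = c("College A", "College B", "College C", "College D"),
  pct_PELL     = c(20, 35, 50, 65),
  default_rate = c(2.5, 5.1, 8.0, 10.4)
)

# head() prints the first rows (up to 6 by default) so you can eyeball the columns
head(toy)

# dim() returns two numbers: the number of rows, then the number of columns
toy_dims <- dim(toy)
toy_dims
```

On the notebook's data, `dim(dat)` would be the call expected to report 1053 and 26.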
A good measure of whether attending a certain college “pays off” is its student loan default
rate. If a college is low-cost and prepares students for high-paying jobs, few students will default
on their loans. If a college is high-cost and does not prepare students for high-paying jobs, many
students will have trouble paying off their loans (high default rate).
So, our main outcome variable in this analysis will be default_rate. We’re going to use scatterplots
to see how strongly different predictor variables correlate with default rates. In particular,
we’re going to explore how well each of the following variables predicts colleges’ default rates:
- pct_PELL - percent of student body that receives PELL grants. Note: PELL grants are government scholarships given to students from low-income families
- grad_rate - percent of students who successfully graduate
- net_tuition - net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars
To begin, let’s create a scatterplot of colleges’ default rates and the percent of their student body
that receives PELL grants. We can use the gf_point command to make the graph:
We see that there’s a positive relationship between pct_PELL and default_rate. The colleges with
the highest rates of PELL grant recipients (low-income students) also tend to have higher student
loan default rates. In other words, if you were to fit a model to this data, it would predict higher
default rates at schools that serve more PELL recipients.
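The `gf_point` call for this graph follows the same formula pattern the notebook uses later (`outcome ~ predictor`). A minimal sketch, with that call shown as a comment and a numeric check of direction run on a made-up data frame (`toy` is hypothetical, not the College Scorecard data):

```r
# Hypothetical toy data: made-up values, NOT the College Scorecard data
toy <- data.frame(
  pct_PELL     = c(15, 25, 35, 45, 55, 65),
  default_rate = c(2.0, 3.5, 4.1, 6.8, 7.9, 10.2)
)

# In the notebook (with coursekata loaded and dat read in), the plot call would be:
#   gf_point(default_rate ~ pct_PELL, data = dat)

# Pearson correlation: a value above 0 means the points trend upward
r <- cor(toy$pct_PELL, toy$default_rate)
r > 0
```

A positive `r` is the numeric counterpart of a point cloud that rises from left to right.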
We must keep in mind: correlation is not causation. The scatterplot shows us that default
rates and PELL recipient rates are positively correlated. However, the graph doesn’t show us a
clear causal explanation behind the correlation. For example, here are several causal explanations
that this graph can’t clarify:
- PELL recipients may only be able to afford to attend low-quality colleges. These colleges have higher default rates because they fail to prepare students for the workforce.
- PELL recipients may have fewer family resources to weather financial emergencies in the first few years after college. So, the schools that serve PELL recipients at high rates will also have more of their students defaulting on loans (regardless of the school’s quality).
- PELL recipients may have attended lower-quality high schools, which don’t properly prepare them for college. So, these students may drop out of college at higher rates, which raises their chances of defaulting on student loans.
Or, it could be a combination of all those explanations! We can’t tell from this analysis alone.
1.3 - In the next question, you will create a scatterplot that visualizes the relationship between
grad_rate and default_rate. Before doing so, make a prediction: Do you expect student loan
default rates to positively or negatively correlate with graduation rates? Why?
Double-click this cell to type your answer here: Negatively. Students who do not graduate are
less likely to be able to pay off their loans, so colleges with low graduation rates should have
higher default rates.
1.4 - Create a scatterplot that visualizes the relationship between grad_rate (predictor) and
default_rate (outcome).
Check yourself: Your code should have generated a scatterplot with the x-axis labeled with
grad_rate and the y-axis labeled with default_rate.
1.5 - Using your scatterplot, describe the relationship between graduation rates and student loan
default rates. For instance, are these variables positively or negatively related? How can you tell?
Does this corroborate your prediction from Question 1.3? Explain.
Double-click this cell to type your answer here: Negatively related. The points trend downward
from left to right: as grad_rate increases, default_rate tends to decrease. This corroborates the
prediction from Question 1.3.
0.1.4 2.0 - Simple linear regression (one predictor)
2.1 - If you haven’t taken AP Stats, watch this video, which provides an introduction to linear
regression.
Note: This video is adapted from other materials and covers data from a separate context. However,
the video provides a good intro to the concepts and models we’ll be using in this section of
the project.
Let’s create a linear regression model relating pct_PELL (x) and default_rate (y). To visualize our
model, we can graph the line modeled by our equation on top of the scatterplot relating pct_PELL
to default_rate. We use the gf_point command to produce the scatterplot, the gf_lm command
to graph our linear model, and the %>% symbol to put the elements together on the same graph:
[7]: ## Run this code but do not edit it
# Overlay linear model of default_rate ~ pct_PELL on top of scatterplot
gf_point(default_rate ~ pct_PELL, data = dat) %>% gf_lm(color = "orange")
2.2 - Is the slope value of this model positive or negative? How can you tell?
Double-click this cell to type your answer here: Positive. The line slopes upward from left to right.
R can help us find the equation that models this linear regression line. As shown in the video,
we can model a linear trend between a predictor (x) and outcome (y) using this linear regression
formula:
$$\hat{y} = \beta_0 + \beta_1 x$$
Where:
- $\hat{y}$ (pronounced “y hat”) is the predicted y-value (predicted outcome value)
- $\beta_0$ (pronounced “beta zero”) is the y-intercept: the predicted y-value (outcome value) when x = 0 (the predictor’s value is 0)
- $\beta_1$ (pronounced “beta 1”) is the slope: the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)
To fit a linear regression model to a set of data in R, we use the lm command. lm stands for
“linear model.” Here, we use lm to find the linear regression model relating pct_PELL (x) and
default_rate (y).
Call:
lm(formula = default_rate ~ pct_PELL, data = dat)
Coefficients:
(Intercept) pct_PELL
-0.9327 0.1765
The output of the lm command is a bit clunky, but here’s what it means:
- The (Intercept) value is the y-intercept ($\beta_0$)
- The pct_PELL value is the coefficient for the predictor. In other words, it’s the slope ($\beta_1$)
So, our regression equation can be written as:
$$\hat{y} = -0.9327 + 0.1765x$$
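To see the equation in action, here is a small worked prediction. The coefficients are the ones from the lm printout; the input value of 40 is a hypothetical college where 40% of students receive PELL grants.

```r
# Coefficients copied from the lm printout
b0 <- -0.9327  # intercept
b1 <- 0.1765   # slope for pct_PELL

# Hypothetical input: a college where 40% of students receive PELL grants
pct_pell_new <- 40

# Predicted default rate: y-hat = b0 + b1 * x
predicted_default <- b0 + b1 * pct_pell_new
predicted_default  # about 6.13
```

Note that plugging in x = 0 gives a slightly negative default rate, which is not possible in practice. The line is best read as a summary of the trend, not as an exact rule at the extremes.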
2.3 - Identify the slope value and interpret what it means (in context).
Double-click this cell to type your answer here: The slope is 0.1765. For every one-unit (one
percentage point) increase in pct_PELL, the model predicts that default_rate increases by 0.1765.
2.4 - Use the gf_point and gf_lm commands to visualize a linear regression model for predicting
default_rate (outcome) using grad_rate (predictor).
Check yourself: Your scatterplot should have a line on it with a negative slope.
2.5 - Use the lm command to find the linear regression model you visualized above. Store the
model in an object called grad_model and print it to see its values.
Call:
lm(formula = default_rate ~ grad_rate, data = dat)
Coefficients:
(Intercept) grad_rate
14.4600 -0.1584
Check yourself: If you print out grad_model, you should see two numbers: 14.46 and -0.1584.
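A minimal sketch of the store-then-print pattern from 2.5, run on a made-up data frame so it is self-contained. The names `toy` and `toy_model` are hypothetical, and the toy points lie exactly on the line y = 20 - 0.2x, so the fitted values are known in advance:

```r
# Hypothetical toy data lying exactly on default_rate = 20 - 0.2 * grad_rate
toy <- data.frame(grad_rate = c(30, 40, 50, 60, 70, 80))
toy$default_rate <- 20 - 0.2 * toy$grad_rate

# Fit the model and store it in an object, then print it
toy_model <- lm(default_rate ~ grad_rate, data = toy)
toy_model

# coef() returns the intercept and slope as a named numeric vector
toy_coefs <- coef(toy_model)
toy_coefs["grad_rate"]   # the slope

# predict() applies the fitted equation to new predictor values
toy_pred <- predict(toy_model, newdata = data.frame(grad_rate = 55))
toy_pred
```

With the notebook's data, the same `coef` call on `grad_model` should return the two numbers from the check above.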
2.6 - Identify the slope value and interpret what it means (in context).
Double-click this cell to type your answer here: The slope is -0.1584. For every one-unit increase
in grad_rate, the model predicts a 0.1584 decrease in the loan default rate.
0.1.5 3.0 - Analyzing strength ($R^2$)
In addition to the direction of a relationship (positive or negative), we can also look at the strength
of a relationship. The strength is a measure of the quality of our model’s predictions. A key
metric for analyzing the strength of a model is $R^2$. The following diagram (from Skew The Script)
shows the $R^2$ values of various linear models:
In the “weak” correlations, we see that our predictions (the linear model) tend to be far away from
the actual data values (the points). If we used a model with weak correlation to predict new data
values, our predictions would have high error. If we used a model with strong correlation to predict
new data values, our predictions would have low error.
$R^2$ takes values between 0 and 1 (alternatively: 0% to 100%). The stronger the model, the closer $R^2$
gets to 1 (or 100%). The weaker the model, the closer $R^2$ gets to 0 (or 0%). An intuitive way to
think about it: for the perfectly strong correlations, the model gives 100% perfect predictions. The
models explain 100% of the variation in the data, so $R^2$ = 100%. As the correlations get weaker,
the models leave room for error, since they capture less of the variation in the data. So, the
$R^2$ value declines from 100%, approaching 0% if there’s no correlation (the model adds no predictive
power compared to naive guessing).
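The "variation explained" idea can be written as $R^2 = 1 - \text{SSE}/\text{SST}$, where SSE is the model's total squared prediction error and SST is the total squared error of the naive guess (always predicting the mean). A sketch on made-up data (`toy` is hypothetical), checked against the value R itself reports:

```r
# Hypothetical toy data (made-up values)
toy <- data.frame(
  grad_rate    = c(30, 40, 50, 60, 70, 80),
  default_rate = c(11.0, 7.5, 8.2, 5.0, 4.4, 1.9)
)
toy_model <- lm(default_rate ~ grad_rate, data = toy)

# SSE: squared error of the model's predictions
sse <- sum(residuals(toy_model)^2)

# SST: squared error of the naive guess (always predicting the mean)
sst <- sum((toy$default_rate - mean(toy$default_rate))^2)

# R-squared: share of the naive error that the model removes
r2_by_hand <- 1 - sse / sst

# The same value as reported by summary()
r2_reported <- summary(toy_model)$r.squared
```

If the model predicted no better than the mean, SSE would equal SST and the result would be 0.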
Optional Resource: If you’d like a more thorough explanation of the math behind $R^2$, check out
this video.
To see the $R^2$ values of our linear regression models, we can use the summary command. For
example, here we get the summary printout of grad_model.
Call:
lm(formula = default_rate ~ grad_rate, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.9199 -1.4038 -0.2248 0.9011 20.5450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.45997 0.29152 49.60 <2e-16 ***
grad_rate -0.15839 0.00474 -33.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There’s a lot going on in this printout. For now, focus on the bottom of the printed information.
The Multiple R-squared value is the $R^2$ value for the model. In this case, $R^2$ = 51.5%. So, we
can say that the correlation between graduation rates and student loan default rates is moderately
strong. This model would yield moderately accurate predictions of default rates if used on new
colleges.
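For a model with a single predictor, $R^2$ is the square of the correlation coefficient $r$, so the two measures of strength always agree. A short sketch, using the 51.5% quoted for grad_model and a made-up data frame (`toy` is hypothetical) to check the identity:

```r
# R-squared quoted for grad_model
r2_grad <- 0.515

# The slope is negative, so r takes the negative square root
r_grad <- -sqrt(r2_grad)
r_grad  # about -0.72

# Check the identity R^2 = r^2 on hypothetical toy data (made-up values)
toy <- data.frame(
  grad_rate    = c(30, 40, 50, 60, 70, 80),
  default_rate = c(11.0, 7.5, 8.2, 5.0, 4.4, 1.9)
)
toy_r  <- cor(toy$grad_rate, toy$default_rate)
toy_r2 <- summary(lm(default_rate ~ grad_rate, data = toy))$r.squared
```

The identity only holds for one-predictor models like the ones in this notebook.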
3.1 - Let’s consider a new variable: net_tuition (tuition minus average discounts and allowances
per student, in thousands of dollars). How well does a school’s tuition predict its student loan
default rate? Let’s start exploring. Go ahead and create a scatterplot that visualizes the relationship
between net_tuition (predictor) and default_rate (outcome). Overlay a linear regression
model on the graph using the %>% gf_lm(color = "orange") command.
3.2 - Use the lm command to find the linear regression model you visualized above. Store the
model in an object called tuition_model and print out the model’s values.
[16]: # Your code goes here
tuition_model <- lm(default_rate ~ net_tuition, data = dat)
tuition_model
Call:
lm(formula = default_rate ~ net_tuition, data = dat)
Coefficients:
(Intercept) net_tuition
8.0029 -0.2077
Check yourself: If you print out tuition_model, you should see two numbers: 8.0029 and -0.2077.
3.3 - Use the summary command to find the $R^2$ value of your linear model.
[17]: # Your code goes here
summary(tuition_model)
Call:
lm(formula = default_rate ~ net_tuition, data = dat)
Residuals:
Min 1Q Median 3Q Max
-6.4480 -1.9912 -0.5984 1.2492 25.4189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.00294 0.21329 37.52 <2e-16 ***
net_tuition -0.20772 0.01331 -15.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
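The $R^2$ line sits at the bottom of the full summary printout. As a cross-check, for a model with one predictor $R^2$ can also be recovered from the slope's t value via $R^2 = t^2 / (t^2 + df)$. The sketch below assumes $df = 1053 - 2 = 1051$, meaning no colleges were dropped for missing values. That is an assumption, not something the printout confirms, so treat the result as approximate. The helper name `r2_from_t` is hypothetical.

```r
# Assumed residual degrees of freedom: n - 2, with n = 1053 rows and no missing data
df_resid <- 1053 - 2

# Hypothetical helper: R-squared from a slope's t value in a one-predictor model
r2_from_t <- function(t_value, df) t_value^2 / (t_value^2 + df)

# Sanity check on grad_model (t = -33.42): should land near the 51.5% quoted earlier
r2_grad <- r2_from_t(-33.42, df_resid)

# Same calculation for tuition_model (t = -15.61)
r2_tuition <- r2_from_t(-15.61, df_resid)

round(c(grad = r2_grad, tuition = r2_tuition), 3)
```

Under that assumption the tuition model comes out near 19%, well below the graduation-rate model, which would make net tuition the weaker predictor of the two.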
0.1.6 Feedback (Required)
Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!