Notebook 2 - Linear Regression

Uploaded by

Blobby Hatchner

May 23, 2024

Reference Guide for R (student resource) - Check out our reference guide for a full listing
of useful R commands for this project.

0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 2: Simple Linear Regression
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question.
In this notebook (the 2nd of 4 total notebooks), you’ll use R to create scatterplots, fit simple linear
regression models, and compare the strength of your models. By the end of this notebook, you’ll
see what factors make certain colleges better investments than others.
[1]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads a useful package of R commands
library(coursekata)

CourseKata packages (coursekata 0.15.0)
dslabs 0.8.0               Metrics 0.1.4
Lock5withR 1.2.2           lsr 0.5.2
fivethirtyeightdata 0.1.0  mosaic 1.9.1
fivethirtyeight 0.6.2      supernova 3.0.0

0.1.2 The Dataset (four_year_colleges.csv)


General description - In this notebook, we’ll be using the four_year_colleges.csv file, which
only includes schools that offer four-year bachelor’s degrees and/or higher graduate degrees.
Community colleges and trade schools often have different goals (e.g. facilitating transfers, direct
career education) than institutions that offer four-year bachelor’s degrees. By comparing four-year
colleges only to other four-year colleges, we’ll have clearer analyses and conclusions.

This data is a subset of the US Department of Education’s College Scorecard Database. The data
is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here

0.1.3 1.0 - Creating scatterplots


To begin, let’s download our data. We’ll download the four_year_colleges.csv file from the
skewthescript.org website and store it in an R dataframe called dat.

[2]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads the data
dat <- read.csv('https://skewthescript.org/s/four_year_colleges.csv')

1.1 - Use the head command to print out the first several rows of the dataset.
[3]: # Your code goes here
head(dat)

A data.frame: 6 × 26 (remaining columns truncated)
  OPEID  name                                 city        state region me
  <int>  <chr>                                <chr>       <chr> <chr>  <d
1 100200 Alabama A & M University             Normal      AL    South  15.
2 105200 University of Alabama at Birmingham  Birmingham  AL    South  15.
3 105500 University of Alabama in Huntsville  Huntsville  AL    South  14.
4 100500 Alabama State University             Montgomery  AL    South  17.
5 105100 The University of Alabama            Tuscaloosa  AL    South  17.
6 831000 Auburn University at Montgomery      Montgomery  AL    South  12.
1.2 - Use the dim command to find the number of colleges (rows) and number of variables (columns)
in our dataset.
[4]: # Your code goes here
dim(dat)

1. 1053 2. 26
Check yourself: Your code should have printed out two numbers: 1053 and 26.
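As a quick illustration of what head and dim report, here is a minimal sketch on a small made-up data frame. The toy_colleges name and all of its values are hypothetical and are not taken from the College Scorecard data. It shows that dim returns the row count first and the column count second, matching nrow and ncol.

```r
# Hypothetical toy data frame (values are invented for illustration only)
toy_colleges <- data.frame(
  name         = c("College A", "College B", "College C", "College D"),
  default_rate = c(3.1, 7.4, 5.0, 9.2),
  grad_rate    = c(72, 45, 60, 38)
)

# head() prints the first rows; n controls how many
head(toy_colleges, n = 2)

# dim() returns c(number of rows, number of columns)
toy_dims <- dim(toy_colleges)
toy_dims

# nrow() and ncol() return the same two numbers separately
nrow(toy_colleges)
ncol(toy_colleges)
```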
A good measure of whether attending a certain college “pays off” is its student loan default
rate. If a college is low-cost and prepares students for high-paying jobs, few students will default
on their loans. If a college is high-cost and does not prepare students for high-paying jobs, many
students will have trouble paying off their loans (high default rate).
So, our main outcome variable in this analysis will be default_rate. We’re going to use scatterplots
to see how strongly different predictor variables correlate with default rates. In particular,
we’re going to explore how well each of the following variables predicts colleges’ default rates:
- pct_PELL - percent of student body that receives PELL grants. Note: PELL grants are government scholarships given to students from low-income families
- grad_rate - percent of students who successfully graduate
- net_tuition - net tuition (tuition minus average discounts and allowances) per student, in thousands of dollars

To begin, let’s create a scatterplot of colleges’ default rates and the percent of their student body
that receive PELL grants. We can use the gf_point command to make the graph:

[5]: ## Run this code but do not edit it


# Create scatterplot: default_rate ~ pct_PELL
gf_point(default_rate ~ pct_PELL, data = dat)

We see that there’s a positive relationship between pct_PELL and default_rate. The colleges with
the highest rates of PELL grant recipients (low-income students) also tend to have higher student
loan default rates. In other words, if you were to fit a model to this data, it would predict higher
default rates at schools that serve more PELL recipients.
We must keep in mind: correlation is not causation. The scatterplot shows us that default
rates and PELL recipient rates are positively correlated. However, the graph doesn’t show us a
clear causal explanation behind the correlation. For example, here are several causal explanations
that this graph can’t distinguish between:
- PELL recipients may only be able to afford to attend low-quality colleges. These colleges have higher default rates because they fail to prepare students for the workforce.
- PELL recipients may have fewer family resources to weather financial emergencies in the first few years after college. So, the schools that serve PELL recipients at high rates will also have more of their students defaulting on loans (regardless of the school’s quality).
- PELL recipients may have attended lower-quality high schools, which don’t properly prepare them for college. So, these students may drop out of college at higher rates, which raises their chances of defaulting on student loans.
Or, it could be a combination of all those explanations! We can’t tell from this analysis alone.

1.3 - In the next question, you will create a scatterplot that visualizes the relationship between
grad_rate and default_rate. Before doing so, make a prediction: Do you expect student loan
default rates to positively or negatively correlate with graduation rates? Why?
Double-click this cell to type your answer here: Negatively. Students who do not graduate are
likely to have more trouble paying off their loans, so colleges with lower graduation rates should
have higher default rates.
1.4 - Create a scatterplot that visualizes the relationship between grad_rate (predictor) and
default_rate (outcome).

[6]: # Your code goes here


gf_point(default_rate ~ grad_rate, data = dat)

Check yourself: Your code should have generated a scatterplot with the x-axis labeled with
grad_rate and the y-axis labeled with default_rate.
1.5 - Using your scatterplot, describe the relationship between graduation rates and student loan
default rates. For instance, are these variables positively or negatively related? How can you tell?
Does this corroborate your prediction from Question 1.3? Explain.
Double-click this cell to type your answer here: Negatively related. The points trend downward
from left to right, so colleges with higher grad_rate tend to have lower default_rate. This
corroborates the prediction from Question 1.3.

0.1.4 2.0 - Simple linear regression (one predictor)
2.1 - If you haven’t taken AP Stats, watch this video, which provides an introduction to linear
regression.
Note: This video is adapted from other materials and covers data from a separate context.
However, the video provides a good intro to the concepts and models we’ll be using in this section
of the project.
Let’s create a linear regression model relating pct_PELL (x) and default_rate (y). To visualize our
model, we can graph the line modeled by our equation on top of the scatterplot relating pct_PELL
to default_rate. We use the gf_point command to produce the scatterplot, the gf_lm command
to graph our linear model, and the %>% symbol to put the elements together on the same graph:
[7]: ## Run this code but do not edit it
# Overlay linear model of default_rate ~ pct_PELL on top of scatterplot
gf_point(default_rate ~ pct_PELL, data = dat) %>% gf_lm(color = "orange")
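For readers curious about what the gf_point and gf_lm pipeline produces, a roughly equivalent graph can be sketched with base R's plot, lm, and abline functions. This is an alternative illustration rather than the notebook's own method, and the toy_x and toy_y values are made up.

```r
# Hypothetical toy data (invented values, not the colleges dataset)
toy_x <- c(10, 20, 30, 40, 50, 60)
toy_y <- c(2.1, 3.9, 4.2, 6.8, 7.5, 9.9)

# Scatterplot: outcome on the y-axis, predictor on the x-axis
plot(toy_y ~ toy_x, xlab = "predictor (x)", ylab = "outcome (y)")

# Fit the least-squares line and draw it on top of the points
toy_fit <- lm(toy_y ~ toy_x)
abline(toy_fit, col = "orange")
```

The formula reads the same way in both approaches: the outcome goes to the left of the ~ and the predictor goes to the right.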

2.2 - Is the slope value of this model positive or negative? How can you tell?
Double-click this cell to type your answer here: Positive. The line slopes upward from left to right.
R can help us find the equation that models this linear regression line. As shown in the video,
we can model a linear trend between a predictor (x) and outcome (y) using this linear regression
formula:

$\hat{y} = \beta_0 + \beta_1 x$
Where:
- $\hat{y}$ (pronounced “y hat”) is the predicted y-value (predicted outcome value)
- $\beta_0$ (pronounced “beta zero”) is the y-intercept -> the predicted y-value (outcome value) when x = 0 (the predictor’s value is 0)
- $\beta_1$ (pronounced “beta 1”) is the slope -> the predicted change in y (outcome) for a 1-unit increase in x (predictor)
- $x$ is the x-value (predictor value)
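The roles of the intercept and slope can be checked with a few lines of R. This is a minimal sketch: the coefficient values and the predict_y helper are hypothetical, chosen only to illustrate the formula.

```r
# Hypothetical coefficients, chosen only to illustrate the formula
b0 <- 2.0   # y-intercept (beta zero)
b1 <- 0.5   # slope (beta one)

# Predicted outcome for a given predictor value: y-hat = b0 + b1 * x
predict_y <- function(x) {
  b0 + b1 * x
}

# At x = 0 the prediction equals the intercept
predict_y(0)

# A 1-unit increase in x changes the prediction by the slope
predict_y(11) - predict_y(10)
```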
To fit a linear regression model to a set of data in R, we use the lm command. lm stands for
“linear model.” Here, we use lm to find the linear regression model relating pct_PELL (x) and
default_rate (y).

[8]: ## Run this code but do not edit it


# Create and display linear model: default_rate ~ pct_PELL
PELL_model <- lm(default_rate ~ pct_PELL, data = dat)
PELL_model

Call:
lm(formula = default_rate ~ pct_PELL, data = dat)

Coefficients:
(Intercept) pct_PELL
-0.9327 0.1765

The output of the lm command is a bit clunky, but here’s what it means:
- The (Intercept) value is the y-intercept ($\beta_0$)
- The pct_PELL value is the coefficient for the predictor. In other words, it’s the slope ($\beta_1$)
So, our regression equation can be written as:

$\hat{y} = -0.9327 + (0.1765)x$
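To see how this equation produces a prediction, here is a small worked sketch. The coefficients come from the PELL_model printout; the pct_PELL value of 40 is a hypothetical input chosen for illustration.

```r
# Coefficients from the PELL_model printout
b0 <- -0.9327   # (Intercept)
b1 <- 0.1765    # pct_PELL slope

# Hypothetical college where 40% of students receive PELL grants
pct_pell <- 40

# y-hat = -0.9327 + (0.1765)(40) = -0.9327 + 7.06
predicted_default <- b0 + b1 * pct_pell
predicted_default   # 6.1273
```

Because the intercept is negative, plugging in a pct_PELL value close to 0 gives a negative predicted default rate, which is not a possible rate. Predictions near the edges of the data should be read with that in mind.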

2.3 - Identify the slope value and interpret what it means (in context).
Double-click this cell to type your answer here: The slope is 0.1765. For every one-unit increase
in pct_PELL, the model predicts that default_rate increases by 0.1765.
2.4 - Use the gf_point and gf_lm commands to visualize a linear regression model for predicting
default_rate (outcome) using grad_rate (predictor).

[10]: # Your code goes here


gf_point(default_rate ~ grad_rate, data = dat) %>% gf_lm(color = "orange")

Check yourself: Your scatterplot should have a line on it with a negative slope.
2.5 - Use the lm command to find the linear regression model you visualized above. Store the
model in an object called grad_model and print it to see its values.

[13]: # Your code goes here


grad_model <- lm(default_rate ~ grad_rate, data = dat)
grad_model

Call:
lm(formula = default_rate ~ grad_rate, data = dat)

Coefficients:
(Intercept) grad_rate
14.4600 -0.1584

Check yourself: If you print out grad_model, you should see two numbers: 14.46 and -0.1584.
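The two numbers that lm prints are the least-squares intercept and slope. As a minimal sketch on made-up data (toy_df and its values are hypothetical), the standard least-squares formulas reproduce them, and the base R functions coef and predict pull values out of a fitted model:

```r
# Hypothetical toy data (invented values, not the colleges dataset)
toy_df <- data.frame(
  x = c(30, 45, 50, 62, 75, 88),
  y = c(10.2, 7.9, 6.1, 5.0, 3.3, 1.2)
)

# Fit the model with lm, as in the notebook
toy_model <- lm(y ~ x, data = toy_df)

# Least-squares formulas: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
slope_by_hand     <- cov(toy_df$x, toy_df$y) / var(toy_df$x)
intercept_by_hand <- mean(toy_df$y) - slope_by_hand * mean(toy_df$x)

# coef() returns the fitted (Intercept) and slope as a named vector
coef(toy_model)
c(intercept_by_hand, slope_by_hand)

# predict() applies the fitted equation to new predictor values
toy_prediction <- predict(toy_model, newdata = data.frame(x = 70))
toy_prediction
```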
2.6 - Identify the slope value and interpret what it means (in context).
Double-click this cell to type your answer here: The slope is -0.1584. For every one-unit increase
in grad_rate, the model predicts that default_rate decreases by 0.1584.

0.1.5 3.0 - Analyzing strength ($R^2$)
In addition to the direction of a relationship (positive or negative), we can also look at the strength
of a relationship. The strength is a measure of the quality of our model’s predictions. A key
metric for analyzing the strength of a model is $R^2$. The following diagram (from Skew The Script)
shows the $R^2$ values of various linear models:
In the “weak” correlations, we see that our predictions (the linear model) tend to be far away from
the actual data values (the points). If we used a model with weak correlation to predict new data
values, our predictions would have high error. If we used a model with strong correlation to predict
new data values, our predictions would have low error.
$R^2$ takes values between 0 and 1 (alternatively: 0% and 100%). The stronger the model, the closer
$R^2$ gets to 1 (or 100%). The weaker the model, the closer $R^2$ gets to 0 (or 0%). An intuitive way
to think about it: for the perfectly strong correlations, the model gives 100% perfect predictions.
The models explain 100% of the variation in the data, so $R^2$ = 100%. As the correlations get weaker,
they start leaving room for error, since the models capture less of the variation in the data. So, the
$R^2$ value declines from 100%, approaching 0% if there’s no correlation (the model adds no predictive
power compared to naive guessing).
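One common way to make "percent of variation explained" concrete is the formula $R^2 = 1 - SSE/SST$, where SSE is the sum of squared errors around the regression line and SST is the sum of squared errors from naively guessing the mean of y. The sketch below applies that formula to made-up data; the r_squared helper and all values are hypothetical.

```r
# R-squared by hand: 1 - (error around the line) / (error around the mean)
r_squared <- function(x, y) {
  fit <- lm(y ~ x)
  sse <- sum(residuals(fit)^2)   # squared error left after using the line
  sst <- sum((y - mean(y))^2)    # squared error from guessing mean(y)
  1 - sse / sst
}

# Hypothetical toy data (invented values for illustration only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_perfect <- 3 + 2 * x                                  # every point on a line
y_noisy   <- c(2.3, 2.9, 4.8, 4.1, 6.6, 6.0, 8.2, 8.9)  # points scatter around a line

r_squared(x, y_perfect)   # essentially 1: the line predicts every point
r_squared(x, y_noisy)     # below 1: some variation is left unexplained
```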
Optional Resource: If you’d like a more thorough explanation of the math behind $R^2$, check out
this video.
To see the $R^2$ values of our linear regression models, we can use the summary command. For
example, here we get the summary printout of grad_model.

[14]: ## Run this code but do not edit it


# Summarize default_rate ~ grad_rate model
summary(grad_model)

Call:
lm(formula = default_rate ~ grad_rate, data = dat)

Residuals:
Min 1Q Median 3Q Max
-6.9199 -1.4038 -0.2248 0.9011 20.5450

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.45997 0.29152 49.60 <2e-16 ***
grad_rate -0.15839 0.00474 -33.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.608 on 1051 degrees of freedom


Multiple R-squared: 0.5151, Adjusted R-squared: 0.5147
F-statistic: 1117 on 1 and 1051 DF, p-value: < 2.2e-16

There’s a lot going on in this printout. For now, focus on the bottom of the printed information.
The Multiple R-squared value is the $R^2$ value for the model. In this case, $R^2$ = 51.5%. So, we
can say that the correlation between graduation rates and student loan default rates is moderately
strong. This model would yield moderately strong predictions for default rates if used to predict
on new colleges.
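The $R^2$ value does not have to be read off the printout by eye. As a minimal sketch on made-up data (toy_df and its values are hypothetical), it can be pulled out of the summary object directly, and with a single predictor it equals the square of the correlation coefficient returned by cor:

```r
# Hypothetical toy data (invented values, not the colleges dataset)
toy_df <- data.frame(
  x = c(35, 48, 52, 60, 71, 80, 90),
  y = c(9.5, 8.8, 6.0, 6.4, 3.9, 4.2, 1.5)
)
toy_model <- lm(y ~ x, data = toy_df)

# Pull R-squared straight out of the summary object
toy_r2 <- summary(toy_model)$r.squared
toy_r2

# With one predictor, R-squared equals the squared correlation coefficient
toy_r <- cor(toy_df$x, toy_df$y)
toy_r^2

# Applying the same idea to the grad_model printout: R-squared is 0.5151
# and the slope is negative, so r is the negative square root (roughly -0.72)
-sqrt(0.5151)
```

This is why the sign of the relationship has to be read from the slope: $R^2$ is never negative, so it reports strength but not direction.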
3.1 - Let’s consider a new variable: net_tuition (tuition minus average discounts and allowances
per student, in thousands of dollars). How well does a school’s tuition predict its student loan
default rate? Let’s start exploring. Go ahead and create a scatterplot that visualizes the relationship
between net_tuition (predictor) and default_rate (outcome). Overlay a linear regression
model on the graph using the %>% gf_lm(color = "orange") command.

[15]: # Your code goes here


gf_point(default_rate ~ net_tuition, data = dat) %>% gf_lm(color="orange")

3.2 - Use the lm command to find the linear regression model you visualized above. Store the
model in an object called tuition_model and print out the model’s values.
[16]: # Your code goes here
tuition_model <- lm(default_rate ~ net_tuition, data = dat)
tuition_model

Call:
lm(formula = default_rate ~ net_tuition, data = dat)

Coefficients:
(Intercept) net_tuition
8.0029 -0.2077

Check yourself: If you print out tuition_model, you should see two numbers: 8.0029 and -0.2077.
3.3 - Use the summary command to find the $R^2$ value of your linear model.
[17]: # Your code goes here
summary(tuition_model)

Call:
lm(formula = default_rate ~ net_tuition, data = dat)

Residuals:
Min 1Q Median 3Q Max
-6.4480 -1.9912 -0.5984 1.2492 25.4189

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.00294 0.21329 37.52 <2e-16 ***
net_tuition -0.20772 0.01331 -15.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.375 on 1051 degrees of freedom


Multiple R-squared: 0.1882, Adjusted R-squared: 0.1875
F-statistic: 243.7 on 1 and 1051 DF, p-value: < 2.2e-16

Check yourself: The $R^2$ value for tuition_model should be 0.1882.


3.4 - When evaluating different college options to predict if attending them would “pay off,” many
students look very closely at the tuition and costs of attending. Very few students look at colleges’
graduation rates. Is this reasonable or a mistake? Justify your answer using the $R^2$ values for the
grad_model and tuition_model.
Double-click this cell to type your answer here: It is a mistake. The $R^2$ value for grad_model
(0.5151) is much higher than the value for tuition_model (0.1882), so graduation rates predict
default rates better than tuition costs do.
3.5 - The correlation between tuition costs and student loan default rates is negative. This means
that as tuition costs get higher, fewer students tend to default on their student loans. Is that
possible? What might be going on here?
Double-click this cell to type your answer here: It is possible. We might expect the relationship
to be positive, but higher-tuition schools may be higher quality and may better prepare students
for jobs that let them pay off their loans.

0.1.6 Feedback (Required)
Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!
