LinearRegressionLab
Katelyn Smith
2023-11-01
Purpose
To demonstrate the application of prediction using linear regression, assessing linear model fit, and interpreting
the output of a linear model.
The Data
This data is from the 1840 census, which provides counts of various demographic groups at the county level.
The data set was downloaded from the Integrated Public Use Microdata Series out of the University of
Minnesota. The documentation and data download can be found here
library(tidyverse) # provides read_csv(), drop_na(), and %>%
census1840 <- read_csv("C_1840.csv") %>%
  drop_na() # delete all rows with any missing values
Exploratory Analysis
# selecting only the numeric variables
# since those are the only variables
# eligible for a linear regression
census1840_num <- select_if(census1840, is.numeric)
summary(census1840_lit$lit_rate)
# creating proportions for the groups
# of people to better standardize for
# population differences
# visualization
[Figure: plot with a female_prop axis]
[Figure: Proportion of Slaves vs Literacy Rate at a County Level in 1840 (axis: slave_prop)]
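The comments above describe converting group counts into proportions so that counties of different sizes can be compared. A minimal sketch of that step follows, using a small made-up data frame. The column names `total_pop`, `slaves`, and `females` are assumptions for illustration and are not taken from the census file.

```r
library(dplyr)

# Hypothetical toy data: these column names and counts are invented
# for illustration and do not come from C_1840.csv
toy_counties <- tibble(
  county    = c("A", "B", "C"),
  total_pop = c(1000, 5000, 20000),
  slaves    = c(100, 2500, 1000),
  females   = c(480, 2400, 9900)
)

# Divide each group count by the county total so that large and
# small counties are on the same 0-to-1 scale
toy_props <- toy_counties %>%
  mutate(
    slave_prop  = slaves / total_pop,
    female_prop = females / total_pop
  )

toy_props
```

Dividing by the county total keeps a populous county from dominating the analysis simply because its raw counts are larger.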
Linear Model
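# Running the linear model
# with lit_rate as the predicted
# variable and slave_prop as the
# predictor variable
Lit_model <- lm(data = census1840_prop4,
                lit_rate ~ slave_prop)
# storing the summary so its components
# can be extracted below
Lit_model_Summary <- summary(Lit_model)
Lit_model_Summary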
##
## Call:
## lm(formula = lit_rate ~ slave_prop, data = census1840_prop4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26772 -0.02617 0.01301 0.03859 0.05329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.946710 0.001777 532.777 < 2e-16 ***
## slave_prop 0.031826 0.006337 5.023 5.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04864 on 1274 degrees of freedom
## Multiple R-squared: 0.01942, Adjusted R-squared: 0.01865
## F-statistic: 25.23 on 1 and 1274 DF, p-value: 5.821e-07
(f_val <- Lit_model_Summary$fstatistic[1])
## value
## 25.22695
(f_df1 <- Lit_model_Summary$fstatistic[2])
## numdf
## 1
(f_df2 <- Lit_model_Summary$fstatistic[3])
## dendf
## 1274
(f_pval <- pf(f_val, f_df1, f_df2, lower.tail = FALSE))
## value
## 5.820978e-07
(rsqr <- Lit_model_Summary$r.squared)
## [1] 0.01941689
(RSE <- sigma(Lit_model))
## [1] 0.04864105
From the summary of the linear model of the literacy rate and proportion of slaves, the equation of the line
is: ŷ = 0.947 + 0.032x. The slope indicates that as the proportion of slaves in a county increases by 1 (that is,
from 0 to 1), the literacy rate is predicted to increase by 0.0318, or 3.18 percentage points. The intercept
indicates that a county with no slaves is predicted to have a literacy rate of 94.7%.
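As a worked illustration of the fitted line, the sketch below plugs a few slave proportions into the equation using the coefficients printed in the summary. The helper name `predict_lit_rate` is hypothetical. With the fitted model object available, `predict(Lit_model, newdata = ...)` would do the same job.

```r
# Coefficients copied from the model summary above
intercept <- 0.946710
slope     <- 0.031826

# Hypothetical helper: predicted literacy rate for a given slave proportion
predict_lit_rate <- function(slave_prop) {
  intercept + slope * slave_prop
}

# Predictions for counties with 0%, 25%, and 50% slaves
predict_lit_rate(c(0, 0.25, 0.50))
```

At a proportion of 0 the prediction equals the intercept, and each additional 0.25 in the proportion adds about 0.008 to the predicted literacy rate.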
The slope is small, and while the literacy rate and proportion of slaves have a positive and statistically
significant relationship, it is not a very strong one. The adjusted R² is only 0.019, which means that about
1.9% of the variation in literacy rate can be explained by the proportion of slaves in a county. This confirms
that the proportion of slaves in a county is not a good predictor of literacy rate.
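The adjusted R² can be recomputed from the values printed in the summary, which shows how it relates to the multiple R². The sketch below assumes the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with n recovered from the residual degrees of freedom.

```r
r_squared <- 0.01941689  # multiple R-squared from the summary
df_resid  <- 1274        # residual degrees of freedom
p         <- 1           # number of predictors (slave_prop)
n         <- df_resid + p + 1  # number of counties in the model

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 5)
```

With one predictor and over a thousand counties, the penalty is tiny, so the adjusted value stays very close to the multiple R².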
The residual standard error is 0.0486, meaning that observed literacy rates typically deviate from the fitted
line by about 4.9 percentage points. This is quite low, which is promising, as is the very low p-value. However,
the low R² shows that, despite these two statistics, this model is not a very good fit for predicting literacy rate.
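One way to see why the low p-value says little about fit: with a single predictor, the overall F-test is the same test as the t-test on the slope, so it only shows that the slope differs from zero. The sketch below checks this using the rounded t value from the summary, so the F value matches the reported 25.23 only to rounding.

```r
t_slope  <- 5.023  # t value for slave_prop from the summary (rounded)
df_resid <- 1274   # residual degrees of freedom

# In simple linear regression, F equals the squared slope t statistic
f_from_t <- t_slope^2

# Two-sided p-value from the t distribution
p_from_t <- 2 * pt(abs(t_slope), df_resid, lower.tail = FALSE)

# Upper-tail p-value from the F distribution with 1 and df_resid df
p_from_f <- pf(f_from_t, 1, df_resid, lower.tail = FALSE)

c(F = f_from_t, p_t = p_from_t, p_F = p_from_f)
```

Both p-values agree, which is why a significant F-test here can coexist with an R² under 2%.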
color = "blue") + geom_point(aes(y = .fitted),
color = "blue") + geom_point(aes(y = lit_rate),
color = "red") + geom_segment(aes(x = slave_prop,
y = .fitted, xend = slave_prop, yend = lit_rate),
alpha = 0.2)
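[Figure: fitted values (.fitted, blue) and observed lit_rate (red) against slave_prop, with segments joining each pair]
[Figure: density of the residuals (.resid)]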