LinearRegressionLab

Lab about how to run a linear regression algorithm

Uploaded by

2021katelynsmith
Copyright
© All Rights Reserved
Linear Regression Lab Resubmission

Katelyn Smith

2023-11-01

Purpose
To demonstrate the application of prediction using linear regression, assessing linear model fit, and interpreting
the output of a linear model.

The Data
This data is from the 1840 census, which provides counts of various demographics at the county level.
The data set was downloaded from the Integrated Public Use Microdata Series at the University of
Minnesota. The documentation and data download can be found here
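library(tidyverse) # read_csv(), drop_na(), mutate(), ggplot()
library(broom) # augment(), used in the model assessment
census1840 <- read_csv("C_1840.csv") %>%
  drop_na() # delete all rows with any missing values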

## Rows: 1276 Columns: 127


## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): rectype
## dbl (126): year, region, statefip, stateicp, county, cntypopf, cntypops, qco...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploratory Analysis
# selecting only the numeric variables
# since those are the only variables
# eligible for a linear regression
census1840_num <- select_if(census1840, is.numeric)

# creating a literacy rate variable to be
# predicted: 1 minus the number of people
# who are illiterate divided by the total
# number of people
census1840_lit <- census1840_num %>%
  mutate(lit_rate = 1 - (nlit_c / ntotal_c))

summary(census1840_lit$lit_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.6790  0.9267  0.9678  0.9524  0.9919  1.0000

# creating proportions for the groups
# of people to better standardize for
# population differences
census1840_prop1 <- census1840_lit %>%
  mutate(female_prop = (nfemale_c / ntotal_c))

census1840_prop2 <- census1840_prop1 %>%
  mutate(male_prop = (nmale_c / ntotal_c))

census1840_prop3 <- census1840_prop2 %>%
  mutate(free_prop = (numperhh_c / ntotal_c))

census1840_prop4 <- census1840_prop3 %>%
  mutate(slave_prop = (nslave_c / ntotal_c))

# visualization
ggplot(census1840_prop4, aes(x = lit_rate, y = female_prop)) +
  geom_point() +
  ggtitle("Proportion of Females vs Literacy Rate at a County Level in 1840")

[Figure: scatter plot titled "Proportion of Females vs Literacy Rate at a County Level in 1840"; x-axis: lit_rate, y-axis: female_prop]

ggplot(census1840_prop4, aes(x = lit_rate, y = slave_prop)) +
  geom_point() +
  ggtitle("Proportion of Slaves vs Literacy Rate at a County Level in 1840")

[Figure: scatter plot titled "Proportion of Slaves vs Literacy Rate at a County Level in 1840"; x-axis: lit_rate, y-axis: slave_prop]

I chose to compare the literacy rate against the proportions of minority groups because historically those
groups have had less access to education. Surprisingly, there seems to be a positive relationship between the
literacy rate and the proportion of slaves in a county. I have chosen the proportion of slaves as my predictor
variable to further explore this surprising result.

Linear Model
# Running the linear model
# with lit_rate as the predicted
# variable and slave_prop as the
# predictor variable
Lit_model <- lm(data = census1840_prop4,
                lit_rate ~ slave_prop)

# summarizing and printing the summary
# information for the linear model
(Lit_model_Summary <- summary(Lit_model))

##
## Call:
## lm(formula = lit_rate ~ slave_prop, data = census1840_prop4)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.26772 -0.02617  0.01301  0.03859  0.05329
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.946710   0.001777 532.777  < 2e-16 ***
## slave_prop  0.031826   0.006337   5.023 5.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04864 on 1274 degrees of freedom
## Multiple R-squared:  0.01942, Adjusted R-squared:  0.01865
## F-statistic: 25.23 on 1 and 1274 DF,  p-value: 5.821e-07
(f_val <- Lit_model_Summary$fstatistic[1])

## value
## 25.22695
(f_df1 <- Lit_model_Summary$fstatistic[2])

## numdf
## 1
(f_df2 <- Lit_model_Summary$fstatistic[3])

## dendf
## 1274
(f_pval <- pf(f_val, f_df1, f_df2, lower.tail = FALSE))

## value
## 5.820978e-07
(rsqr <- Lit_model_Summary$r.squared)

## [1] 0.01941689
(RSE <- sigma(Lit_model))

## [1] 0.04864105
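The statistics extracted above can be checked against each other by hand. In a regression with a single predictor, the F-statistic can be recovered from the R-squared and the degrees of freedom, it equals the square of the slope's t value, and the correlation between the two variables is the square root of the R-squared (positive here because the slope is positive). The worked arithmetic below uses only the values printed above:

```latex
\[
F = \frac{R^2}{1 - R^2}\cdot\frac{df_2}{df_1}
  = \frac{0.01941689}{1 - 0.01941689}\cdot\frac{1274}{1}
  \approx 25.23,
\qquad
t^2 = 5.023^2 \approx 25.23,
\qquad
r = \sqrt{R^2} = \sqrt{0.01941689} \approx 0.139
\]
```

All three agree with the printed summary, and a correlation of roughly 0.14 is weak even though it is statistically significant with 1276 counties.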
From the summary of the linear model of the literacy rate and proportion of slaves, the equation of the line
is $\hat{y} = 0.947 + 0.032x$, where $x$ is the proportion of slaves in a county and $\hat{y}$ is the predicted
literacy rate. The slope indicates that as the proportion of slaves in a county increases by 1 (from no slaves
to all slaves), the literacy rate is predicted to increase by 0.0318, or 3.18 percentage points. The intercept
indicates that when the proportion of slaves in a county is 0, the predicted literacy rate is 0.947, or about 94.7%.
The slope is small, and while the literacy rate and proportion of slaves have a positive and statistically
significant relationship, it is not a very strong relationship. The adjusted $R^2$ is only 0.019, which means that
about 1.9% of the variation in literacy rate can be explained by the proportion of slaves in a county and confirms
that the proportion of slaves in a county is not a good predictor of literacy rate.
The residual standard error is 0.0486, or about 4.86 percentage points of literacy rate, meaning that the typical
prediction error is quite low, which is promising, as is the very low p-value. However, the literacy rate itself
varies little across counties (the middle half of counties fall between 0.927 and 0.992), so a low residual
standard error would be expected from almost any model. Despite these two statistics, the low $R^2$ shows that
this model is not a very good fit for predicting literacy rate.
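To make the interpretation of the slope and intercept concrete, here is a minimal sketch in base R that evaluates the fitted line at a few slave proportions. The coefficients are copied from the printed summary rather than re-estimated, and `predict_lit_rate` and `example_props` are hypothetical names introduced only for illustration.

```r
# Coefficients copied from the printed model summary (not re-estimated here)
intercept <- 0.946710
slope <- 0.031826

# Hypothetical helper: predicted literacy rate for a given slave proportion
predict_lit_rate <- function(slave_prop) {
  intercept + slope * slave_prop
}

# Illustrative proportions only; these are not counties from the data set
example_props <- c(0, 0.25, 0.50)
predicted <- predict_lit_rate(example_props)

# Show each proportion next to its predicted literacy rate
print(data.frame(slave_prop = example_props, predicted_lit_rate = predicted))
```

The predictions differ by less than two percentage points across this range, which is another way of seeing how little the predictor moves the response. With the fitted model object available, `predict(Lit_model, newdata = data.frame(slave_prop = example_props))` should give the same values up to rounding of the coefficients.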

Further model assessment


predictions <- augment(Lit_model)
# Plot data, model, residuals
predictions %>%
  ggplot(aes(x = slave_prop)) +
  geom_line(aes(y = .fitted), color = "blue") +
  geom_point(aes(y = .fitted), color = "blue") +
  geom_point(aes(y = lit_rate), color = "red") +
  geom_segment(aes(x = slave_prop, y = .fitted,
                   xend = slave_prop, yend = lit_rate),
               alpha = 0.2)

[Figure: fitted line and fitted values (blue), observed literacy rates (red), and residual segments; x-axis: slave_prop, y-axis: .fitted]

This graph shows visually what the model statistics reported. Many points lie far from the line of best fit,
confirming that the proportion of slaves in a county does not predict the literacy rate very well.
# Look at the residuals with a density
# plot
ggplot(predictions, aes(x = .resid)) + geom_density()

[Figure: density plot of the residuals; x-axis: .resid, y-axis: density]

# Look at the residuals with a scatter
# plot
predictions %>%
ggplot(aes(x = slave_prop)) + geom_point(aes(y = .resid))

[Figure: scatter plot of the residuals against the predictor; x-axis: slave_prop, y-axis: .resid]

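The density plot is strongly left-skewed, and the scatter plot of the residuals shows the same asymmetry: the
largest negative residuals are several times larger in magnitude than the largest positive ones. This suggests
that the residuals are not normally distributed and is further evidence that the proportion of slaves in a
county is not a very strong predictor of literacy rate.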
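As a rough numeric check on the left skew, a quartile-based (Bowley) skewness can be computed from the residual quartiles printed in the model summary. This is a minimal sketch in base R; the variable names are illustrative, and the values are copied from the summary output rather than recomputed from the data.

```r
# Residual quartiles copied from the printed model summary
q1 <- -0.02617
med <- 0.01301
q3 <- 0.03859

# Bowley (quartile) skewness: negative values indicate a longer left tail
bowley_skew <- (q3 + q1 - 2 * med) / (q3 - q1)
print(bowley_skew)

# Ratio of the largest negative to the largest positive residual (in magnitude)
tail_ratio <- abs(-0.26772) / 0.05329
print(tail_ratio)
```

A negative Bowley skewness and a tail ratio well above 1 both point to a long left tail. With the fitted model available, the same quartiles could be taken from `quantile(predictions$.resid, c(0.25, 0.5, 0.75))`.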
Overall, despite the promising positive relationship in the scatter plot of proportion of slaves vs literacy rate
at a county level in 1840, there is not a strong predictive relationship.
