
REGRESSION FOR EVERYONE
A Simple Guide To Simple Linear Regression

Jared Schultz

Volume One
Table of Contents

01 What is Regression?
This section covers an introduction to simple linear regression,
explaining why we might use regression and how our model is
created.

02 Analysis Of Variance
This section covers how we can break down the sources of variation
within our regression model.

03 Model Diagnostics
This section explains our assumptions when working with linear
regression and showcases how to check whether those assumptions
are broken. Residual analysis is showcased.

04 Fixing Model Departures


This section covers fixing the simple linear regression model when
our assumptions are broken.

05 Confidence Intervals
This section covers the creation and understanding of different
confidence intervals that can be constructed from estimates obtained
with our simple linear regression model.

06 Evaluation and Validation


This section covers final evaluation of our model and validation of
our results.
REGRESSION FOR EVERYONE #1
What is Regression? *
Regression is a form of machine learning that examines the
relationship between variables and gives us insight into
patterns in data.
*This is a very casual definition.

Housing Market Example Guide


In this example we are interested in finding out
which variables affect housing prices in a
specific region.
We also want our model to be a good fit.

Example
In this example the only data we have about housing in our region
is the house value and the age of the house.

Simple Linear Regression

The simple linear regression model:

Y_i = β_0 + β_1 X_i + e_i,   i = 1, ..., n

* error shorthand will be e

Simple Regression Model Legend
Y_i : Value of the house for the ith house (Dependent or Response variable)
X_i : Age of the house for the ith house (Independent or Explanatory variable)
β_0 : Regression Intercept (unknown parameter)
β_1 : Regression Slope (unknown parameter)
e_i : Random variables that have a mean of 0. All have equal variance and are uncorrelated.

[Figure: Graphical Interpretation — the fitted line, showing the intercept and the slope as the change in Y per 1 unit of x.]

How do we pick our unknown parameters to get the best line that
fits our data?

We can use the Ordinary Least Squares (OLS) method. The goal of
OLS is to fit a line that minimizes the sum of squared differences
between the observed and fitted values.
OLS Formula and Graphical Interpretation

Q(b_0, b_1) = Σ (Y_i − b_0 − b_1 X_i)^2, summed over i = 1, ..., n

This formula looks complicated, but it boils down to saying that the
parameters we pick will be the values that minimize the right-hand
side of the equation.

Let's take a graphical approach to understanding this:

Suppose on the left I create a line with an intercept of 3 and a
slope of 0.5. Suppose on the right I create a line with an intercept
of 2 and a slope of 0.7.

The OLS formula in a graphical context is the sum of the squares
that are shown in blue. In this picture Q(3, 0.5) > Q(2, 0.7), thus the
line with Q(2, 0.7) best fits the data since it contains a smaller sum
of squared error.
OLS Estimators
Obviously, it would be very time-consuming to find the pair of
parameters by hand. Luckily, using calculus we can find the normal
equations that allow us to solve for our estimators:

b_1 = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)^2
b_0 = Ȳ − b_1 X̄

*Typically a hat notation is used instead of ~.

What does our fitted line look like using our data?

Fitted values, Ŷ_i = b_0 + b_1 X_i, are the predictions made by the
OLS line; they are the respective points on the fitted line.

Residuals are the differences between the observed values and the
respective fitted values: e_i = Y_i − Ŷ_i.
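In R, we can let lm() solve the normal equations for us. Below is a minimal sketch; the housing data here is simulated and every number is made up purely for illustration.

# Simulated housing data (hypothetical numbers, for illustration only)
set.seed(1)
age   <- runif(100, min = 0, max = 60)                 # house age in years
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)  # house value in dollars

fit <- lm(value ~ age)   # OLS fit of value on age
coef(fit)                # b_0 (intercept) and b_1 (slope)
head(fitted(fit))        # fitted values on the OLS line
head(resid(fit))         # residuals: observed minus fitted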

REGRESSION FOR EVERYONE #2
RECAP
Simple linear regression uses OLS to determine the
coefficients of our regression relation.

But how can we quantify how much error is in the model?

WHAT IS ANALYSIS OF VARIANCE (ANOVA)?

Basic Idea: Attributing variation in the data to different sources


through decomposition of total variation.

[Figure: graphical representation of the partition of total deviation, showing total deviations and residual deviations.]

Decomposition of Total Variation
Now let's take a look at the sum of squares of the total variation:

SSTO = SSE + SSR

Total Sum of Squares (SSTO)
Variation of the observations around the sample mean:
SSTO = Σ (Y_i − Ȳ)^2

Error Sum of Squares (SSE)
Variation of the observations around the fitted regression line:
SSE = Σ (Y_i − Ŷ_i)^2

Regression Sum of Squares (SSR)
Variation of the fitted values around the sample mean:
SSR = Σ (Ŷ_i − Ȳ)^2
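As a quick sketch, we can verify the decomposition numerically in R (continuing the simulated housing example from earlier; all numbers are made up):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

ssto <- sum((value - mean(value))^2)        # total variation
sse  <- sum((value - fitted(fit))^2)        # variation around the fitted line
ssr  <- sum((fitted(fit) - mean(value))^2)  # variation explained by the regression
all.equal(ssto, sse + ssr)                  # TRUE: SSTO = SSE + SSR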

Should I use SSE to measure the error of the model?

Explanation and Solution
SSE is not a good representation of error in the model. Let's
take this example to see why:
Suppose that we have a model with 100 points of data and an
SSE of 250. We then collect more data, which should give us a
better model, but our SSE is higher!? This is because SSE
increases as more data points are added.

Mean Squared Error (MSE)

Error sum of squares divided by its degrees of freedom (df)
gives an unbiased estimate of the true error variance:
MSE = SSE / (n − 2)

Degrees of Freedom (df)

The number of components that are allowed to vary. For
simple linear regression, df(SSE) is n − 2.

Regression Mean Square (MSR)

Regression sum of squares divided by its degrees of freedom
(df). In simple linear regression, df(SSR) is 1, so MSR = SSR / 1 = SSR.
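A short sketch of these quantities in R (same simulated housing fit as before; illustrative only):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

mse <- sum(resid(fit)^2) / df.residual(fit)  # SSE / (n - 2)
mse
summary(fit)$sigma^2                         # same estimate, via lm's residual standard error
anova(fit)                                   # ANOVA table with df, Sum Sq, and Mean Sq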

REGRESSION FOR EVERYONE #3

MODEL DIAGNOSTICS
Basic Idea: Our model has assumptions. We have to make sure
that our assumptions of the normal error model in simple linear
regression are correct. Otherwise we do not have a valid model.

Simple Regression Model Assumptions


1) Linearity
The regression relation between Y and X is linear; the model is
linear in its coefficients.

How to Check Our Assumptions
Unless it is clearly obvious in the scatterplot, we will often
need to do a residual analysis to check our model
assumptions.
Residuals contain the leftover variation in the data after
accounting for our model fit.

Detection of Non-Linearity
Residuals vs. fitted values plot.
Residuals vs. X variable plot.

If either of these plots shows a clear non-linear pattern, then
there is a possible indication of non-linearity.
Non-linearity unaccounted for by the model will be left in the
residuals.
Residual Analysis

Assumption Violation:
There is a clear quadratic
pattern indicating that
our regression relation is
non-linear.
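A minimal sketch of these residual plots in R (simulated data with a deliberate quadratic pattern; purely illustrative):

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + x^2 + rnorm(100)     # true relation is quadratic
fit <- lm(y ~ x)              # but we fit a straight line

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)        # a curved band here suggests non-linearity
plot(x, resid(fit), xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)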

2) Normality
We assume that our error distribution is normally distributed.

[Figure: two error distributions — one normal, one non-normal.]

Detection of Non-Normality
Normal Q-Q plot of the residuals.
If the residuals are normally distributed, then the points of
the Q-Q plot should fall nearly on a straight line.
Residual Analysis

Assumption Violation:
This plot shows more
probability mass on both
tails. Distribution has
heavy tails and is not
normal.
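A quick sketch of a normal Q-Q plot in R (simulated heavy-tailed errors, for illustration only):

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rt(100, df = 2)   # t-distributed errors: heavy tails
fit <- lm(y ~ x)

qqnorm(resid(fit))   # points should hug a straight line if residuals are normal
qqline(resid(fit))   # reference line; departures in the tails suggest non-normality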

3) Constant Variance
We assume that all of the errors have equal variance.

Detection of Non-Constant Variance


Residuals vs. fitted values plot.
If this plot shows a clear increasing or decreasing spread,
then there is a possible indication of non-constant variance.
Residual Analysis

Assumption Violation:
There is a clear increasing
pattern indicating that
our variance is not
constant.

4) Independence
We assume that all of the errors are independent.

This assumption is often overlooked since it can be handled
during the data collection stage, by collecting observations
in a way that makes them independent.

The assumption might be broken when working with
longitudinal data, time series data, or cluster-collected
data.

Detection of Dependence
Residuals vs. Time (X)
If this plot shows residuals deviating outside of a 95% CI
around 0, then there is a possible indication of dependence.

Residual Analysis
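A sketch of this check in R (simulated data collected over time, with deliberately autocorrelated errors; all numbers invented). The acf() plot is an extra standard check, not described above, that draws 95% bounds for you:

set.seed(1)
time <- 1:100
e    <- as.numeric(arima.sim(list(ar = 0.7), n = 100))  # autocorrelated errors
y    <- 5 + 0.3 * time + e
fit  <- lm(y ~ time)

plot(time, resid(fit), type = "b", ylab = "Residuals")  # look for runs/tracking over time
abline(h = 0, lty = 2)
acf(resid(fit))   # spikes outside the dashed bounds suggest dependence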

REGRESSION FOR EVERYONE #4
FIXING MODEL DEPARTURES
Basic Idea: Our model might have departed from the
assumptions. Thus, we need to fix our model in such a way that
our assumptions still hold true.

WHAT SHOULD I DO?
First, mild departures of our model do not need to be fixed.
Serious departures in our model include:

Fix Regression Relation (Linearity Assumption):
Transformation of the Y and/or X variable may be needed.

Fix Error Distribution (Normality and Equal Variance
Assumption): Transformation of the Y variable.

Fix Outliers (Influential Cases): Exclusion or robust
regression.

NOTE: Fixing departures can take some time and exploration.
Below are common methods of fixing them. Remember that
applying transformations changes the interpretation of the
model results and can affect model interpretability.

Transformations of X
We may want to linearize a non-linear relationship:

Data is increasing and concave downward: try √X or log(X).

Data is increasing and concave upward: try X^2 or exp(X).

Data is decreasing and concave upward: try 1/X or exp(−X).

Add constants to the transformation to avoid negatives or
zeros.

Example Application
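A minimal sketch of such a fix in R (simulated concave-downward data; the log choice and all numbers are illustrative):

set.seed(1)
x <- runif(100, 1, 50)
y <- 10 * log(x) + rnorm(100)   # increasing, concave downward

fit_raw <- lm(y ~ x)            # straight-line fit leaves curvature in residuals
fit_log <- lm(y ~ log(x))       # transforming X linearizes the relationship

plot(fitted(fit_log), resid(fit_log))  # residuals should now look patternless
abline(h = 0, lty = 2)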

Transformations of Y
Fixing error distribution such as unequal variance and/or non-
normality:

Box-Cox Procedure
A method for picking a power transformation on the Y variable
to make the distribution normal. (Use R library: MASS)

The Box-Cox family of transformations is indexed by a parameter λ:
Y^(λ) = (Y^λ − 1) / λ for λ ≠ 0, and Y^(λ) = log(Y) for λ = 0.

The procedure is as follows:

For each λ, fit a regression model on the transformed data
and record the SSE for each choice of λ.
Find the λ that minimizes SSE and apply the corresponding
power transformation to Y.

Rather than using the entire transformation above, a simpler one
you can try after getting λ is:

Y' = Y^λ (or Y' = log(Y) when λ = 0)
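A sketch of the procedure with the MASS library (continuing the simulated housing fit; the λ grid and data are illustrative choices):

library(MASS)   # provides boxcox()

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

bc     <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # profiles the log-likelihood
lambda <- bc$x[which.max(bc$y)]                       # best λ (max likelihood = min SSE)
value_new <- if (abs(lambda) < 1e-8) log(value) else value^lambda
fit_new   <- lm(value_new ~ age)                      # refit on transformed Y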

REGRESSION FOR EVERYONE #5
CONFIDENCE INTERVALS
Basic Idea: We want to have a measure of confidence regarding
our estimates that come from our simple linear regression
model.

Confidence Intervals
Recall: Our regression coefficients that we find are b_0 and b_1.

Under the normal error model for simple regression these
are the maximum likelihood estimators (MLE) of β_0 and β_1.

To find out how confident we are in our estimate we can
look at a (1 − α) confidence interval.

Let's take a look at what could happen if our estimate for the
slope changed:

Confidence Interval for β_1
The (1 − α) confidence interval takes the form:

b_1 ± t(1 − α/2; n − 2) · s(b_1)

(the confidence level gives the accuracy; the half-width gives the precision)

KEY
α : Amount of type one error allowed.
t(1 − α/2; n − 2) : Critical t-value related to confidence level and
degrees of freedom.
s(b_1) : Standard error of the estimated coefficient.
n : Sample size.

Accuracy and Precision

(1 − α) is called the confidence level and represents the
accuracy of the confidence interval.

The higher the confidence level, the more
accurate the confidence interval.

t(1 − α/2; n − 2) · s(b_1) is called the half-width and represents the
precision of the confidence interval.

The larger the sample size n, the narrower the
confidence interval.
The larger the standard error, the wider the
confidence interval.
Tradeoff: To add more accuracy to a confidence interval, it must
become less precise, all other things being equal.
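A minimal sketch of this interval in R (same simulated housing fit; the numbers are illustrative):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

confint(fit, level = 0.95)   # 95% CIs for b_0 and b_1

# The same slope interval by hand:
b1  <- coef(summary(fit))["age", "Estimate"]
se1 <- coef(summary(fit))["age", "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se1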
Visual Representation

[Figure: simulated 90% and 98% confidence intervals for β_1.]

Confidence Interval Interpretation


A (1 − α)100% confidence interval can be interpreted in the
following way:

If we repeated the process many times, (1 − α)100% of the
confidence intervals constructed would capture the true parameter.

To get a better understanding, look at the visual example again


and notice how not all of the confidence intervals capture the
true parameter and that is reflected in the confidence level.

Confidence Interval of the Mean Response
Suppose that we want to create a confidence interval for the mean
response at a specific point X_h in our data. The formula:

Ŷ_h ± t(1 − α/2; n − 2) · s(Ŷ_h),
where s^2(Ŷ_h) = MSE · [1/n + (X_h − X̄)^2 / Σ (X_i − X̄)^2]

Example (Not real data)


Suppose we go back to the housing example where we
regressed housing price on the age of the house. Then we want
a 95% confidence interval of the mean housing price for houses
aged 30 years in the dataset.

We could say: We are 95% confident that the average home
value of homes 30 years old is between [320,000, 350,000].
NOTE:
This confidence interval is for making inferences within the
scope of our dataset.
The farther our choice of X_h is from the mean X̄, the larger
the standard error for our confidence interval.

Visual Representation
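A sketch of this interval in R via predict() (simulated housing data; the 30-year-old house and all numbers are illustrative):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

# 95% CI for the mean value of houses aged 30 years
predict(fit, newdata = data.frame(age = 30),
        interval = "confidence", level = 0.95)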

REGRESSION FOR EVERYONE #6
Model Evaluation & Validation
Basic Idea: We want to understand metrics that can indicate if
we have a good model with significant predictors and validate
our model to make sure we are not over or underfitting.

Coefficient of Determination (R^2)
Coefficient of Determination: A descriptive measure of the
linear association between X and Y:

R^2 = SSR / SSTO = 1 − SSE / SSTO

Interpretation: Tells us the proportion of the variation in Y that is
explained by X.

Visual Representation

Warnings:
When the relationship between X and Y is non-linear, R^2 is
not a meaningful measure.
A large R^2 does not necessarily mean the estimated
regression line is a good fit.
An R^2 near zero does not necessarily mean that X and Y are
not related.

Pearson's Correlation Coefficient (r)

A statistical measure that evaluates the strength and direction
of the linear relationship between two variables. In simple
linear regression, r^2 = R^2.

Visual Representation
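A quick sketch of both measures in R (same simulated housing data; illustrative):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

summary(fit)$r.squared   # R^2: share of variation in value explained by age
cor(age, value)          # Pearson's r; note cor(age, value)^2 equals R^2 here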

Mean Squared Error
Recall: MSE is an unbiased estimate of the true error variance.

MSE is the average squared distance from the observed to the
predicted values.
Since MSE deals with squared distances, it is often harder
to interpret.
A model with a good fit should have a low MSE.

Root Mean Squared Error (RMSE)


RMSE: The square root of MSE. Measures the average distance
between the predicted and actual values.

It represents the standard deviation of the residuals, which is a
good quantifier of how dispersed the residuals are in the
model.
A low RMSE is an indication that the model is a good fit and
has precise predictions.
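A one-liner sketch in R (same simulated fit; note the slight difference from summary(fit)$sigma, which divides by n − 2 rather than n):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

sqrt(mean(resid(fit)^2))   # RMSE: root of the average squared residual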

Tests for Linear Association


These tests are testing the following hypotheses:

H_0: β_1 = 0  vs.  H_a: β_1 ≠ 0

Testing Methods:
T-test
F-test
Note:
For simple linear regression these tests are equivalent.

Interpretation: In these tests we are testing to see if the slope


is significantly different from zero.
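Both tests come straight out of standard lm output; a minimal sketch (simulated data as before):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

summary(fit)   # t-test for the slope in the coefficients table
anova(fit)     # F-test; for simple regression, F equals the slope's t^2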

Model Validation
Model validation is a form of quality checking your model to
make sure that it performs as expected.

Internal validation: Checking the validity of the model
using the same data on which it was fitted.
External validation: Checking the validity of the model
using new or holdout data.

For a data set with a sufficiently large sample size, one option
for internal validation uses training and validation (testing)
data to check the model validity.

Training data: Must be large enough so that a reliable


model can be built. Model is trained on this data.
Validation data: Often smaller in size. Use the fitted model
and the new data to see how the model performs.
Note:
The distribution of both datasets should be the same when
comparing variables.
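A sketch of a simple training/validation split in R (simulated data; the 80/20 split ratio is a common but arbitrary choice):

set.seed(1)
d <- data.frame(age = runif(100, 0, 60))
d$value <- 400000 - 2500 * d$age + rnorm(100, sd = 30000)

train_idx <- sample(seq_len(nrow(d)), size = 0.8 * nrow(d))  # 80% for training
train <- d[train_idx, ]
valid <- d[-train_idx, ]

fit  <- lm(value ~ age, data = train)   # train on the training data
pred <- predict(fit, newdata = valid)   # predict on the held-out data
sqrt(mean((valid$value - pred)^2))      # validation RMSE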

Common Methods
Leave-one-out cross-validation (LOOCV)

K-fold cross-validation (a sketch follows below)
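A compact k-fold cross-validation sketch in R (k = 5 and all data are illustrative choices):

set.seed(1)
d <- data.frame(age = runif(100, 0, 60))
d$value <- 400000 - 2500 * d$age + rnorm(100, sd = 30000)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(d)))  # random fold assignment
rmse  <- numeric(k)

for (i in 1:k) {
  test  <- folds == i
  fit_i <- lm(value ~ age, data = d[!test, ])    # fit on k-1 folds
  pred  <- predict(fit_i, newdata = d[test, ])   # predict the held-out fold
  rmse[i] <- sqrt(mean((d$value[test] - pred)^2))
}
mean(rmse)   # average out-of-fold RMSE across folds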

Author's Note: Thanks for reading this far! This volume covered a
simple overview of simple linear regression. The next volume will cover
multiple regression and will build off of what has already been covered.

