
REGRESSION FOR EVERYONE
A Simple Guide To Simple Linear Regression

Jared Schultz

Volume One
Table of Contents

01 What is Regression?
This section covers an introduction to simple linear regression,
explaining why we might use regression and how our model is
created.

02 Analysis Of Variance
This section covers how we can break down the sources of variation
within our regression model.

03 Model Diagnostics
This section explains our assumptions when working with linear
regression and showcases how to check whether those assumptions
are broken. Residual analysis is showcased.

04 Fixing Model Departures


This section covers fixing the simple linear regression model when
our assumptions are broken.

05 Confidence Intervals
This section covers the creation and understanding of different
confidence intervals that can be constructed from estimates obtained
with our simple linear regression model.

06 Evaluation and Validation


This section covers final evaluation of our model and validation of
our results.
REGRESSION FOR EVERYONE #1
What is Regression? *
Regression is a form of machine learning that examines the
relationship between variables and gives us insight into
patterns in data.
*This is a very casual definition.

Housing Market Example Guide


In this example we are interested in finding out
which variables affect housing prices in a
specific region.
We also want our model to be a good fit.

Example
In this example the only data we have about housing in our region
is the house value and the age of the house.

Simple Linear Regression

The simple linear regression model:

Y_i = β_0 + β_1 X_i + e_i,   i = 1, ..., n

* error shorthand will be e

Simple Regression Model Legend
Y_i : Value of the house for the ith house (Dependent or Response variable)
X_i : Age of the house for the ith house (Independent or Explanatory variable)
β_0 : Regression Intercept (unknown parameter)
β_1 : Regression Slope (unknown parameter)
e_i : Random variables that have a mean of 0. All have equal variance and are uncorrelated.

[Figure: Graphical Interpretation — the fitted line, showing the intercept and the slope as the change in Y per 1 unit of x.]

How do we pick our unknown parameters to get the best line that
fits our data?

We can use the Ordinary Least Squares (OLS) method. The goal of
OLS is to fit a line that minimizes the sum of squared differences
between the observed and fitted values.
OLS Formula and Graphical Interpretation

Q(b_0, b_1) = Σ (Y_i − b_0 − b_1 X_i)^2, summed over i = 1, ..., n

This formula looks complicated, but it boils down to saying that the
parameters we pick will be the values that minimize the right-hand
side of the equation.

Let's take a graphical approach to understanding this:

Suppose on the left I create a line with an intercept of 3 and a
slope of 0.5. Suppose on the right I create a line with an intercept
of 2 and a slope of 0.7.

The OLS formula in a graphical context is the sum of the squares
that are shown in blue. In this picture Q(3, 0.5) > Q(2, 0.7), thus the
line with Q(2, 0.7) best fits the data since it contains a smaller sum
of squared error.
OLS Estimators
Obviously, it would be very time-consuming to find the pair of
parameters by hand. Luckily, using calculus we can find the normal
equations that allow us to solve for our estimators:

b_1 = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)^2
b_0 = Ȳ − b_1 X̄

*Typically a hat notation is used instead of ~.

What does our fitted line look like using our data?

Fitted values, Ŷ_i = b_0 + b_1 X_i, are the predictions made by the
OLS line; they are the respective points on the fitted line.

Residuals are the differences between the observed values and the
respective fitted values: e_i = Y_i − Ŷ_i.
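In R, we can let lm() solve the normal equations for us. Below is a minimal sketch; the housing data here is simulated and every number is made up purely for illustration.

# Simulated housing data (hypothetical numbers, for illustration only)
set.seed(1)
age   <- runif(100, min = 0, max = 60)                 # house age in years
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)  # house value in dollars

fit <- lm(value ~ age)   # OLS fit of value on age
coef(fit)                # b_0 (intercept) and b_1 (slope)
head(fitted(fit))        # fitted values on the OLS line
head(resid(fit))         # residuals: observed minus fitted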

REGRESSION FOR EVERYONE #2
RECAP
Simple linear regression uses OLS to determine the
coefficients of our regression relation.

But how can we quantify how much error is in the model?

WHAT IS ANALYSIS OF VARIANCE (ANOVA)?

Basic Idea: Attributing variation in the data to different sources


through decomposition of total variation.

[Figure: graphical representation of the partition of total deviation, showing total deviations and residual deviations.]

Decomposition of Total Variation
Now let's take a look at the sum of squares of the total variation:

SSTO = SSE + SSR

Total Sum of Squares (SSTO)
Variation of the observations around the sample mean:
SSTO = Σ (Y_i − Ȳ)^2

Error Sum of Squares (SSE)
Variation of the observations around the fitted regression line:
SSE = Σ (Y_i − Ŷ_i)^2

Regression Sum of Squares (SSR)
Variation of the fitted values around the sample mean:
SSR = Σ (Ŷ_i − Ȳ)^2
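As a quick sketch, we can verify the decomposition numerically in R (continuing the simulated housing example from earlier; all numbers are made up):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

ssto <- sum((value - mean(value))^2)        # total variation
sse  <- sum((value - fitted(fit))^2)        # variation around the fitted line
ssr  <- sum((fitted(fit) - mean(value))^2)  # variation explained by the regression
all.equal(ssto, sse + ssr)                  # TRUE: SSTO = SSE + SSR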

Should I use SSE to measure the error of the model?

Explanation and Solution
SSE is not a good representation of error in the model. Let's
take this example to see why:
Suppose that we have a model with 100 points of data and an
SSE of 250. We then collect more data, which should give us a
better model, but our SSE is higher!? This is because SSE
increases as more data points are added.

Mean Squared Error (MSE)

Error sum of squares divided by its degrees of freedom (df)
gives an unbiased estimate of the true error variance:
MSE = SSE / (n − 2)

Degrees of Freedom (df)

The number of components that are allowed to vary. For
simple linear regression, df(SSE) is n − 2.

Regression Mean Square (MSR)

Regression sum of squares divided by its degrees of freedom
(df). In simple linear regression, df(SSR) is 1, so MSR = SSR / 1 = SSR.
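A short sketch of these quantities in R (same simulated housing fit as before; illustrative only):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

mse <- sum(resid(fit)^2) / df.residual(fit)  # SSE / (n - 2)
mse
summary(fit)$sigma^2                         # same estimate, via lm's residual standard error
anova(fit)                                   # ANOVA table with df, Sum Sq, and Mean Sq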

REGRESSION FOR EVERYONE #3

MODEL DIAGNOSTICS
Basic Idea: Our model has assumptions. We have to make sure
that our assumptions of the normal error model in simple linear
regression are correct. Otherwise we do not have a valid model.

Simple Regression Model Assumptions


1) Linearity
The regression relation between Y and X is linear; the model is
linear in its coefficients.

How to Check Our Assumptions
Unless it is clearly obvious in the scatterplot, we will often
need to do a residual analysis to check our model
assumptions.
Residuals contain the leftover variation in the data after
accounting for our model fit.

Detection of Non-Linearity
Residuals vs. fitted values plot.
Residuals vs. X variable plot.

If either of these plots shows a clear non-linear pattern, then
there is a possible indication of non-linearity.
Non-linearity unaccounted for by the model will be left in the
residuals.
Residual Analysis

Assumption Violation:
There is a clear quadratic
pattern indicating that
our regression relation is
non-linear.
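A minimal sketch of these residual plots in R (simulated data with a deliberate quadratic pattern; purely illustrative):

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + x^2 + rnorm(100)     # true relation is quadratic
fit <- lm(y ~ x)              # but we fit a straight line

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)        # a curved band here suggests non-linearity
plot(x, resid(fit), xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)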

2) Normality
We assume that our error distribution is normally distributed.

[Figure: two error distributions — one normal, one non-normal.]

Detection of Non-Normality
Normal Q-Q plot of the residuals.
If the residuals are normally distributed, then the points of
the Q-Q plot should fall nearly on a straight line.
Residual Analysis

Assumption Violation:
This plot shows more
probability mass on both
tails. Distribution has
heavy tails and is not
normal.
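A quick sketch of a normal Q-Q plot in R (simulated heavy-tailed errors, for illustration only):

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rt(100, df = 2)   # t-distributed errors: heavy tails
fit <- lm(y ~ x)

qqnorm(resid(fit))   # points should hug a straight line if residuals are normal
qqline(resid(fit))   # reference line; departures in the tails suggest non-normality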

3) Constant Variance
We assume that all of the errors have equal variance.

Detection of Non-Constant Variance


Residuals vs. fitted values plot.
If this plot shows a clear increasing or decreasing spread,
then there is a possible indication of non-constant variance.
Residual Analysis

Assumption Violation:
There is a clear increasing
pattern indicating that
our variance is not
constant.

4) Independence
We assume that all of the errors are independent.

This assumption is often overlooked since it can be handled
during the data collection stage, by collecting observations
in a way that makes them independent.

The assumption might be broken when working with
longitudinal data, time series data, or cluster-collected
data.

Detection of Dependence
Residuals vs. Time (X)
If this plot shows residuals deviating outside of a 95% CI
around 0, then there is a possible indication of dependence.

Residual Analysis
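A sketch of this check in R (simulated data collected over time, with deliberately autocorrelated errors; all numbers invented). The acf() plot is an extra standard check, not described above, that draws 95% bounds for you:

set.seed(1)
time <- 1:100
e    <- as.numeric(arima.sim(list(ar = 0.7), n = 100))  # autocorrelated errors
y    <- 5 + 0.3 * time + e
fit  <- lm(y ~ time)

plot(time, resid(fit), type = "b", ylab = "Residuals")  # look for runs/tracking over time
abline(h = 0, lty = 2)
acf(resid(fit))   # spikes outside the dashed bounds suggest dependence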

REGRESSION FOR EVERYONE #4
FIXING MODEL DEPARTURES
Basic Idea: Our model might have departed from the
assumptions. Thus, we need to fix our model in such a way that
our assumptions still hold true.

WHAT SHOULD I DO?
First, mild departures of our model do not need to be fixed.
Serious departures in our model include:

Fix Regression Relation (Linearity Assumption):
Transformation of the Y and/or X variable may be needed.

Fix Error Distribution (Normality and Equal Variance
Assumption): Transformation of the Y variable.

Fix Outliers (Influential Cases): Exclusion or robust
regression.

NOTE: Fixing departures can take some time and exploration.
Below are common methods of fixing them. Remember that
applying transformations changes the interpretation of the
model results and can affect model interpretability.

Transformations of X
We may want to linearize a non-linear relationship:

Data is increasing and concave downward: try √X or log(X).

Data is increasing and concave upward: try X^2 or exp(X).

Data is decreasing and concave upward: try 1/X or exp(−X).

Add constants to the transformation to avoid negatives or
zeros.

Example Application
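A minimal sketch of such a fix in R (simulated concave-downward data; the log choice and all numbers are illustrative):

set.seed(1)
x <- runif(100, 1, 50)
y <- 10 * log(x) + rnorm(100)   # increasing, concave downward

fit_raw <- lm(y ~ x)            # straight-line fit leaves curvature in residuals
fit_log <- lm(y ~ log(x))       # transforming X linearizes the relationship

plot(fitted(fit_log), resid(fit_log))  # residuals should now look patternless
abline(h = 0, lty = 2)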

Transformations of Y
Fixing error distribution such as unequal variance and/or non-
normality:

Box-Cox Procedure
A method for picking a power transformation on the Y variable
to make the distribution normal. (Use R library: MASS)

The Box-Cox family of transformations is indexed by a parameter λ:
Y^(λ) = (Y^λ − 1) / λ for λ ≠ 0, and Y^(λ) = log(Y) for λ = 0.

The procedure is as follows:

For each λ, fit a regression model on the transformed data
and record the SSE for each choice of λ.
Find the λ that minimizes SSE and apply the corresponding
power transformation to Y.

Rather than using the entire transformation above, a simpler one
you can try after getting λ is:

Y' = Y^λ (or Y' = log(Y) when λ = 0)
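A sketch of the procedure with the MASS library (continuing the simulated housing fit; the λ grid and data are illustrative choices):

library(MASS)   # provides boxcox()

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

bc     <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # profiles the log-likelihood
lambda <- bc$x[which.max(bc$y)]                       # best λ (max likelihood = min SSE)
value_new <- if (abs(lambda) < 1e-8) log(value) else value^lambda
fit_new   <- lm(value_new ~ age)                      # refit on transformed Y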

REGRESSION FOR EVERYONE #5
CONFIDENCE INTERVALS
Basic Idea: We want to have a measure of confidence regarding
our estimates that come from our simple linear regression
model.

Confidence Intervals
Recall: Our regression coefficients that we find are b_0 and b_1.

Under the normal error model for simple regression these
are the maximum likelihood estimators (MLE) of β_0 and β_1.

To find out how confident we are in our estimate we can
look at a (1 − α) confidence interval.

Let's take a look at what could happen if our estimate for the
slope changed:

Confidence Interval for β_1
The (1 − α) confidence interval takes the form:

b_1 ± t(1 − α/2; n − 2) · s(b_1)

(the confidence level gives the accuracy; the half-width gives the precision)

KEY
α : Amount of type one error allowed.
t(1 − α/2; n − 2) : Critical t-value related to confidence level and
degrees of freedom.
s(b_1) : Standard error of the estimated coefficient.
n : Sample size.

Accuracy and Precision

(1 − α) is called the confidence level and represents the
accuracy of the confidence interval.

The higher the confidence level, the more
accurate the confidence interval.

t(1 − α/2; n − 2) · s(b_1) is called the half-width and represents the
precision of the confidence interval.

The larger the sample size n, the narrower the
confidence interval.
The larger the standard error, the wider the
confidence interval.
Tradeoff: To add more accuracy to a confidence interval, it must
become less precise, all other things being equal.
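A minimal sketch of this interval in R (same simulated housing fit; the numbers are illustrative):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

confint(fit, level = 0.95)   # 95% CIs for b_0 and b_1

# The same slope interval by hand:
b1  <- coef(summary(fit))["age", "Estimate"]
se1 <- coef(summary(fit))["age", "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se1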
Visual Representation

[Figure: simulated 90% and 98% confidence intervals for β_1.]

Confidence Interval Interpretation


A (1 − α)100% confidence interval can be interpreted in the
following way:

If we repeated the process many times, (1 − α)100% of the
confidence intervals constructed would capture the true parameter.

To get a better understanding, look at the visual example again


and notice how not all of the confidence intervals capture the
true parameter and that is reflected in the confidence level.

Confidence Interval of the Mean Response
Suppose that we want to create a confidence interval for the mean
response at a specific point X_h in our data. The formula:

Ŷ_h ± t(1 − α/2; n − 2) · s(Ŷ_h),
where s^2(Ŷ_h) = MSE · [1/n + (X_h − X̄)^2 / Σ (X_i − X̄)^2]

Example (Not real data)


Suppose we go back to the housing example where we
regressed housing price on the age of the house. Then we want
a 95% confidence interval of the mean housing price for houses
aged 30 years in the dataset.

We could say: We are 95% confident that the average home
value of homes 30 years old is between [320,000, 350,000].
NOTE:
This confidence interval is for making inferences within the
scope of our dataset.
The farther our choice of X_h is from the mean X̄, the larger
the standard error for our confidence interval.

Visual Representation
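A sketch of this interval in R via predict() (simulated housing data; the 30-year-old house and all numbers are illustrative):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

# 95% CI for the mean value of houses aged 30 years
predict(fit, newdata = data.frame(age = 30),
        interval = "confidence", level = 0.95)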

REGRESSION FOR EVERYONE #6
Model Evaluation & Validation
Basic Idea: We want to understand metrics that can indicate if
we have a good model with significant predictors and validate
our model to make sure we are not over or underfitting.

Coefficient of Determination (R^2)
Coefficient of Determination: A descriptive measure of the
linear association between X and Y:

R^2 = SSR / SSTO = 1 − SSE / SSTO

Interpretation: Tells us the proportion of the variation in Y that is
explained by X.

Visual Representation

Warnings:
When the relationship between X and Y is non-linear, R^2 is
not a meaningful measure.
A large R^2 does not necessarily mean the estimated
regression line is a good fit.
An R^2 near zero does not necessarily mean that X and Y are
not related.

Pearson's Correlation Coefficient (r)

A statistical measure that evaluates the strength and direction
of the linear relationship between two variables. In simple
linear regression, r^2 = R^2.

Visual Representation
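A quick sketch of both measures in R (same simulated housing data; illustrative):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

summary(fit)$r.squared   # R^2: share of variation in value explained by age
cor(age, value)          # Pearson's r; note cor(age, value)^2 equals R^2 here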

Mean Squared Error
Recall: MSE is an unbiased estimate of the true error variance.

MSE is the average squared distance from the observed to the
predicted values.
Since MSE deals with squared distances, it is often harder
to interpret.
A model with a good fit should have a low MSE.

Root Mean Squared Error (RMSE)


RMSE: The square root of MSE. Measures the average distance
between the predicted and actual values.

It represents the standard deviation of the residuals, which is a
good quantifier of how dispersed the residuals are in the
model.
A low RMSE is an indication that the model is a good fit and
has precise predictions.
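A one-liner sketch in R (same simulated fit; note the slight difference from summary(fit)$sigma, which divides by n − 2 rather than n):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

sqrt(mean(resid(fit)^2))   # RMSE: root of the average squared residual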

Tests for Linear Association


These tests are testing the following hypotheses:

H_0: β_1 = 0  vs.  H_a: β_1 ≠ 0

Testing Methods:
T-test
F-test
Note:
For simple linear regression these tests are equivalent.

Interpretation: In these tests we are testing to see if the slope


is significantly different from zero.
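Both tests come straight out of standard lm output; a minimal sketch (simulated data as before):

set.seed(1)
age   <- runif(100, 0, 60)
value <- 400000 - 2500 * age + rnorm(100, sd = 30000)
fit   <- lm(value ~ age)

summary(fit)   # t-test for the slope in the coefficients table
anova(fit)     # F-test; for simple regression, F equals the slope's t^2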

Model Validation
Model validation is a form of quality checking your model to
make sure that it performs as expected.

Internal validation: Checking the validity of the model
using the same data on which it was fitted.
External validation: Checking the validity of the model
using new or holdout data.

For a data set with a sufficiently large sample size, one option
for internal validation uses training and validation (testing)
data to check the model validity.

Training data: Must be large enough so that a reliable


model can be built. Model is trained on this data.
Validation data: Often smaller in size. Use the fitted model
and the new data to see how the model performs.
Note:
The distribution of both datasets should be the same when
comparing variables.
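A sketch of a simple training/validation split in R (simulated data; the 80/20 split ratio is a common but arbitrary choice):

set.seed(1)
d <- data.frame(age = runif(100, 0, 60))
d$value <- 400000 - 2500 * d$age + rnorm(100, sd = 30000)

train_idx <- sample(seq_len(nrow(d)), size = 0.8 * nrow(d))  # 80% for training
train <- d[train_idx, ]
valid <- d[-train_idx, ]

fit  <- lm(value ~ age, data = train)   # train on the training data
pred <- predict(fit, newdata = valid)   # predict on the held-out data
sqrt(mean((valid$value - pred)^2))      # validation RMSE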

Common Methods
Leave-one-out cross-validation (LOOCV)

K-fold cross-validation (a sketch follows below)
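A compact k-fold cross-validation sketch in R (k = 5 and all data are illustrative choices):

set.seed(1)
d <- data.frame(age = runif(100, 0, 60))
d$value <- 400000 - 2500 * d$age + rnorm(100, sd = 30000)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(d)))  # random fold assignment
rmse  <- numeric(k)

for (i in 1:k) {
  test  <- folds == i
  fit_i <- lm(value ~ age, data = d[!test, ])    # fit on k-1 folds
  pred  <- predict(fit_i, newdata = d[test, ])   # predict the held-out fold
  rmse[i] <- sqrt(mean((d$value[test] - pred)^2))
}
mean(rmse)   # average out-of-fold RMSE across folds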

Author's Note: Thanks for reading this far! This volume covered a
simple overview of simple linear regression. The next volume will cover
multiple regression and will build off of what has already been covered.

