Introduction to Regression
Instructor: Weikang Kao, Ph.D.
Correlation to Regression
Recall: Correlation
With correlation, our aim is to see whether there is any relationship
between x and y.
How about taking a step further?
Can we use one variable to predict another?
The Aim of Regression
In regression, the aim is to use one or more predictors to predict the
outcome variable.
Simple regression vs multiple regression.
For example, as fat or sugar intake increases does weight increase?
Advantages of Regression
Why regression?
ANOVA, z- and t-tests, and chi-square tests allow us to compare means
across multiple groups (levels of a categorical predictor).
Limitations:
The biggest limitation is that we cannot readily include continuous predictors (Age, Weight,
Income, Time, etc.)
ANOVA does not build a predictive model; it tells us where differences exist but does
not quantify those differences.
Regression Line
From correlation to regression.
First, the correlation coefficient tells us the strength of relationship between two
variables.
In bivariate regression, we have a single independent variable (predictor) and a
dependent variable (criterion).
Regression Line
The shape of the relationship being MODELED by the correlation coefficient is
linear.
So, r describes the degree to which a straight line describes the values of the Y
variable across the range of X values.
Regression line helps us to create the best-fitting line to predict what the values of
the Y variable will be for any given value of X.
Regression
Line example
Regression Line
In mathematics, how do we define a straight line?
Y = a + bX
In regression, the idea is the same.
Y = b0 + b1X
Intercept (b0) and gradient (b1)
Model of Regression
Idea of prediction: outcome = (model) + error
Like correlation, regression can be positive or negative.
Two important factors in terms of defining a straight line.
Intercept (b0) and gradient (b1)
Parameters of Regression
Those parameters (b0 and b1) are the regression coefficients.
Regression model:
Outcome (Y) = [b0 + b1(X1)] + error
b1 tells us what the model looks like and b0 tells us where the model is.
These weights are known as coefficients (typically Betas: 𝜷) and coefficients tell
us the relative impact of a predictor (independent variable) on our dependent
variable
Parameters of Regression
(Un)Standardized Betas
𝜷 is reserved for standardized predictors while b is used for unstandardized.
b values are what we get by default when we use our variables in their raw state. They tell us the unit
change in our dependent variable (in its raw units) for a unit change in our predictor variable
(in its raw units).
Reference: https://www3.nd.edu/~rwilliam/stats1/x92.pdf
With 𝜷 we see the impact of a 1 SD change in the predictor on the dependent.
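A minimal sketch of this relationship, using made-up data (the variable names are illustrative): the standardized beta from a regression on z-scored variables equals the raw b rescaled by the ratio of standard deviations.

```r
# Sketch: unstandardized b vs standardized beta (made-up data)
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100, sd = 2)

b    <- coef(lm(y ~ x))["x"]               # raw-unit slope
beta <- coef(lm(scale(y) ~ scale(x)))[2]   # slope after z-scoring both variables

# beta equals b rescaled by the SD ratio: beta = b * sd(x) / sd(y)
all.equal(unname(beta), unname(b * sd(x) / sd(y)))
```

In simple regression this standardized beta is also just the correlation between x and y.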
Variance in
Regression
Variance in Regression
Recall: the correlation graph
The best fitting line is the regression model
The difference between the regression line and the mean of Y
The method of least squares
It helps us find the best-fitting line; we try to minimize the residuals.
Least Squares Criterion
We try to identify the values of b0 and b1 that produce the best-fitting linear
function.
So, we use the observed data to find the values of b0 and b1 that
minimize the distances between the observed values (Y) and the predicted values (Ŷ).
In general, we choose the regression coefficient (b1) and the regression intercept
(b0) that define the regression line that minimizes the sum of squared residuals.
min Σ[Y − (b0 + b1X)]² = min Σ(Y − Ŷ)²
Regression Coefficients
The equations below give the least-squares solution:
b0 = ȳ − b1x̄
b1 = byx = rxy (sy / sx)
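As a sketch (with made-up numbers), we can compute b0 and b1 from these formulas by hand and confirm they match what lm() returns:

```r
# Sketch: hand-computed least-squares solution vs lm() (made-up data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r_xy * (s_y / s_x)
b0 <- mean(y) - b1 * mean(x)      # intercept: y-bar minus b1 * x-bar

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # the two solutions agree
```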
Predicting Y From X
Predicted Score (Y-hat):
By using the best-fitting model, we can take any value of X, substitute it into the
regression equation, and compute the value of Y.
Observed Score (Y):
The actual score we find in our data.
Residual Score (Error):
The difference between the observed score and the predicted score for the same
value of X.
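These three quantities can be pulled out of a fitted model directly; a minimal sketch with made-up data:

```r
# Sketch: observed, predicted, and residual scores (made-up data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
fit <- lm(y ~ x)

y_hat <- predict(fit)   # predicted scores (Y-hat) for the observed X values
e     <- y - y_hat      # residual = observed minus predicted

all.equal(unname(e), unname(residuals(fit)))   # same as residuals(fit)
sum(e)                                         # residuals sum to (numerically) zero
```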
Variance in Regression
How to determine whether it is a good regression model?
The best fitting model vs the basic model
What is the basic model?
The mean of y
Sum of squares total: SST
The overall model is:
SST = SSR + SSE
Sum of squares regression (model): SSR
Sum of squares residual (error): SSE
Variance in Regression
SST = SSR + SSE
1. (Observed Y – Ymean) = (Predicted Y –Ymean) + (Observed Y – Predicted Y)
2. Variation in Y = Explained by X + Unexplained by X
3. SSTotal = SSRegression + SSError
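The decomposition above can be verified numerically on any fitted model; a sketch with made-up data:

```r
# Sketch: verifying SST = SSR + SSE (made-up data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3, 4, 4, 6, 7, 6, 9, 10)
fit <- lm(y ~ x)
y_hat <- fitted(fit)

SST <- sum((y - mean(y))^2)       # total variation in Y
SSR <- sum((y_hat - mean(y))^2)   # variation explained by the model
SSE <- sum((y - y_hat)^2)         # residual (unexplained) variation

all.equal(SST, SSR + SSE)         # the decomposition holds
```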
What’s the importance of regression?
Without regression, what’s the best fitting model?
Total variance in regression: Observed Y − Ymean = Y − ȳ
Variance that can be explained: Predicted Y − Ymean = Ŷ − ȳ
Variance that cannot be explained: Observed Y − Predicted Y = Y − Ŷ
Overall equation: Ŷ = b0 + b1X
Things to Check
The regression line ALWAYS passes through the point (x̄, ȳ).
The mean of the predicted scores is ALWAYS equal to the mean of the observed
Y scores.
Regression: Tests
Test the Overall Regression Model
First, we want to know if the model is significant.
F-Ratio
For the overall model
F = (SSR / dfregression) / (SSE / dferror) = MSregression / MSerror
What is this number?
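A minimal sketch of what this number is, using made-up data: the F-ratio is the model mean square over the residual mean square, and we can check our hand computation against the F statistic that summary() reports.

```r
# Sketch: computing the overall F-ratio by hand (made-up data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 3, 5, 4, 6, 7, 6, 9, 8, 11)
fit <- lm(y ~ x)

SS_model <- sum((fitted(fit) - mean(y))^2)   # regression (model) sum of squares
SS_error <- sum(residuals(fit)^2)            # residual (error) sum of squares
MS_model <- SS_model / 1                     # df_model = number of predictors
MS_error <- SS_error / (length(y) - 2)       # df_error = n - 2 in simple regression
F_by_hand <- MS_model / MS_error

all.equal(unname(summary(fit)$fstatistic["value"]), F_by_hand)
```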
Test the Individual Parameter
Second, we want to know if our parameters (b0, b1, b2, etc.) are significant.
A bad predictor means that a unit change in the predictor results in no change in the
predicted value of the outcome.
T-test
The null hypothesis of the t-test is that b is zero.
For parameters:
R-square in Regression
How do we know if the model is good in terms of how much variability we can
explain?
R-square
The value of R-square will be between 0 and 1. Why?
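A sketch with made-up data: R-square is the explained share of the total sum of squares, which in simple regression also equals the squared correlation between X and Y. Since SSR is never negative and never exceeds SST, the ratio stays between 0 and 1.

```r
# Sketch: R-squared as SSR/SST (made-up data)
x <- c(2, 4, 6, 8, 10, 12)
y <- c(1, 3, 2, 5, 4, 6)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(fit) - mean(y))^2)
r2  <- SSR / SST                    # share of total variation explained

all.equal(r2, cor(x, y)^2)          # squared correlation in simple regression
all.equal(r2, summary(fit)$r.squared)
```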
Regression: Hypothesis
The null hypothesis:
There is no relationship between the X variables and the Y variable
Alternative hypothesis:
There is a significant relationship between the X variables and the Y variable
Regression: Assumptions
1. The type of variable
All predictor variables must be quantitative or categorical
The outcome variable must be quantitative and continuous.
2. Non-zero variance
The predictors should have some variation.
3. No perfect multicollinearity
There should be no perfect linear relationship between two or more predictors.
4. Predictors are not correlated with extraneous variables
Such correlations undermine the reliability of our model; think of the third-variable problem.
Regression: Assumptions
5. Homoscedasticity (no heteroscedasticity)
Equal residual variances along the regression line
6. Independent errors
Errors are not correlated with one another
7. Normally distributed errors
The residuals in the model are normally distributed with a mean of zero.
8. Independence
All values of the outcome variable come from independent observations.
9. Linearity
The mean value of the outcome variable for each increment of the predictor lies along a straight line
Regression
Before data analysis, some issues must be taken care of.
Outliers.
Extreme values influence coefficient estimates and model fit.
Shrinkage.
R² tends to shrink when the model is applied to new data; adjusted R² accounts for this.
FYI:
https://www.investopedia.com/ask/answers/012615/whats-difference-between-rsquared-and-adjusted-rsquared.asp
Sample size?
A common rule of thumb is at least 15 cases per predictor.
Pick the Right Tests:
Scenario Practice
Scenario 1
A current study is interested in determining the effect of a new drug on the
number of pictures recalled. Prior research established a strong correlation
between painting skill and picture recall. Therefore, individual differences in
painting skill were controlled to produce a more sensitive test of the treatment
effect.
What test should we use?
What are the IV and DV?
Scenario 2
Suppose we have data on blood pressure levels for males and females and we
wish to see if one sex has higher/lower blood pressure than the other.
What test should we use?
What are the IV and DV?
Scenario 3
A current study is interested in whether individuals’ weight change can be
predicted by how many sodas they drink per day.
What test should we use?
What are the IV and DV?
Scenario 4
In data we collected, we notice a strong correlation between
consumers’ attitudes toward our brand and how many reward bonuses they
receive every year. We would like to do some further analysis to see if we can use
reward bonuses received to predict consumers’ attitudes.
What test should we use?
What are the IV and DV?
Regression: Application
Regression: Application
Use the “simplerelationships.xlsx” data.
Income – how much a person earns per year
Percent Income Saved – what percent of income is carried over to the next year; it can
be negative if more money was spent than came in.
Score on a savings motivation metric – an ordinal scale.
Regression: Assumptions
Check assumptions:
The type of variable:
Is it good?
Non-zero variance
Use a graph or look at the variance.
plot(density(data$Income))
plot(density(data$PI))
plot(density(data$MS))
Regression: Assumptions
Multicollinearity:
We have only one predictor here.
Predictors are not correlated with extraneous variable.
Homoscedasticity:
The variance around the regression line should be equal across values of X.
scatterplot(data$Income, data$PI)
Regression: Assumptions
Independent errors:
We have no issue because we assume a random sample with no repeated
measurement.
Normally distributed errors:
qqnorm(model$residuals)
shapiro.test(model$residuals)
Independence:
Linearity:
scatterplot(data$Income, data$PI)
Regression: Application
Build the regression model:
model <- lm(PI ~ Income, data = data)
#PI is the outcome variable and Income is the predictor variable
F(1, 97) = 18.37, p < .001, R² = .16.

            Estimate    Std. error  t-value  p-value
Intercept   -8.378e-03  2.570e-03   -3.26    0.00154
Income       1.097e-07  2.560e-08    4.29    < .001
Regression Model
Based on the results, we can now construct the regression model.
The expected regression model is: Y = b0 + b1X
We can replace the values of intercept (b0) and slope (b1) with our own values.
Our model is:
Y = (-0.0083) + (0.0000001097)X
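A minimal sketch of using this equation for prediction. The coefficients are taken from the slide output; the variable names and the income value are illustrative.

```r
# Sketch: predicting percent income saved from the fitted equation
# (coefficients from the slide output; income value is hypothetical)
b0 <- -0.0083
b1 <- 0.0000001097

income <- 80000                 # a hypothetical annual income
pi_hat <- b0 + b1 * income      # predicted percent income saved
pi_hat                          # about 0.00048, i.e., 0.048 percent
```

With a fitted lm object, the same prediction comes from predict(model, newdata = data.frame(Income = 80000)).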
Regression: Summary Write Up
A simple regression model was conducted to predict the percent of income
participants saved, based on their annual income. All the regression assumptions were
met, and no further adjustments were made. A significant regression equation was found (F
(1, 97) = 18.37, p < .001), with an R2 of .16. Both the intercept (p = .002) and the
predictor (p < .001) were statistically significant. The results suggest that income
predicts savings: for each dollar increase in income there is a .0000001097
percentage-point increase in the percent of income saved.
In Class Practice
In Class Practice
Example: Can we use either height or weight to predict heart rate?
Make sure you test the assumption and do a summary write up.
Height: 175, 170, 180, 178, 168, 181, 190, 185, 177, 162
Weight: 60, 70, 75, 80, 69, 78, 82, 84, 72, 53
Heart Rate: 60, 70, 75, 73, 71, 73, 76, 80, 68, 64
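A starting sketch for this practice: put the data in a frame and fit the two candidate simple regressions (the assumption checks and the write-up are left to you).

```r
# Starting sketch for the practice (assumption checks and write-up omitted)
height <- c(175, 170, 180, 178, 168, 181, 190, 185, 177, 162)
weight <- c(60, 70, 75, 80, 69, 78, 82, 84, 72, 53)
hr     <- c(60, 70, 75, 73, 71, 73, 76, 80, 68, 64)
dat    <- data.frame(height, weight, hr)

m_height <- lm(hr ~ height, data = dat)   # heart rate predicted from height
m_weight <- lm(hr ~ weight, data = dat)   # heart rate predicted from weight

summary(m_height)$r.squared               # compare explained variance
summary(m_weight)$r.squared
```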
Non-Linear Trends
Non-Linear Trends
Many relationships are not
best captured as straight
lines.
For example, the effect of
stress on performance is
known to follow a quadratic
trend (Yerkes-Dodson curve).
What should we do?
Non-Linear Trends
Always plot our data.
Using the scatterplot function in R (the car package is needed), two lines are drawn:
1. The red line (by default) is a “smoothed” fit (useful only for thinking about the
trend).
2. The green line (by default) is the best-fitting linear trend (regression line).
Non-Linear Trends
It looks like we have a
non-linear trend – higher
values of the predictor
are related with a
quicker increase (steeper
slope) in our dependent
than lower values.
We also see the linear
model is going to do a
poor job fitting the data.
Capturing Non-Linear Trends
In many cases we can use linear regression to capture non-linear trends.
However, you will want to pay close attention to your residuals as frequently you
will get good model fit, but that fit may be deceptive (e.g., predicting before and
after a curve poorly but other points well).
In the example graph on the preceding slide it appears the predicted slope gets
steeper as the predictor increases (linearly) – a quadratic trend.
Such trends can be captured by adding an x² term for the predictor to the model.
If we had data with two visible shifts, we would then consider adding higher-order
terms (e.g., x³).
Capturing Non-
Linear Trends
It should be noted that a
heuristic for determining your
polynomial effect(s) is to count
the shifts (going from up to
down, up to flat, and so on).
Conducting a Non-Linear Regression Model
Quadratic:
Ŷ = b0 + b1X + b2X²
Cubic:
Ŷ = b0 + b1X + b2X² + b3X³
Lower Order Trends
When we include higher-order trends it is important that we do not remove their lower-order
components.
For example, if we want to add a quadratic trend to the model, we need to
keep the linear trend in as well, even if it’s not significant.
What does that mean?
To include a cubic trend we must include the quadratic and linear trends as
well.
The reason for this is that if we do not include the lower-order trends, the higher-order
trends will partly capture those effects, which would bias our estimates.
A potential issue that will pop up as you include trends beyond linear is multicollinearity
(predictors sharing some of the same predictive variance, e.g., Age and Income).
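A sketch of this multicollinearity issue: raw powers of a predictor are highly correlated with each other, while R's poly() produces orthogonal polynomial terms that sidestep the problem.

```r
# Sketch: raw powers of x are highly correlated; poly() gives
# orthogonal linear and quadratic terms instead
x <- 1:50
cor(x, x^2)            # close to 1 with raw powers

p <- poly(x, 2)        # orthogonal polynomial terms, degree 1 and 2
cor(p[, 1], p[, 2])    # essentially zero
```

The raw-power form (x + I(x^2)) and the poly(x, 2) form give the same fitted values; only the individual coefficients differ in interpretation.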
Non-Linear Trends: Example
We simulate a non-linear dataset (Lionel, 2016):
set.seed(20191007)
x<-seq(0,50,1)
y<-((runif(1,10,20)*x)/(runif(1,0,10)+x))+rnorm(51,0,1)
Plot our data:
plot(x,y)
Non-Linear Trends: Example
From this graph, set approximate starting values. The simulated data follow
y = a*x/(b + x), so a is roughly the asymptote seen in the plot and b is the x value
at about half the asymptote:
a_start<-8
b_start<-5
Build the model (matching the form of the simulated data):
model<-nls(y~a*x/(b+x),start=list(a=a_start,b=b_start))
Find the best fitting line:
lines(x,predict(model),lty=2,col="red",lwd=3)
In Class Practice II
1. When should we use regression instead of ANOVA?
2. Please explain the relationship between SStotal, SSregression and SSerror.
3. Please use the following data to build a regression model and write a summary. IV is
sugar and DV is calories.
Sugar: 5, 8, 9, 10, 15, 18, 14, 17, 20, 22, 24, 26, 30, 30, 32
Calories: 20, 30, 60, 70, 100, 95, 70, 83, 103, 112, 130, 80, 95, 130, 112