13 Predictive Analysis - Tests of Association - Regression
Business Research
Predictive Analysis: Tests of Association: Regression
Testing for Association
1. Presence of relationship? Statistical significance
• P-value (sig.); confidence interval
2. Direction of relationship?
• Positive or negative
3. Strength of association?
• Nonexistent, weak, moderate, strong
4. Type of relationship?
• Linear (or close approximation), curvilinear
Regression Analysis
• A statistical procedure for analyzing associative relationships between a metric (interval- or ratio-scaled) dependent variable and one or more independent variables.
Y = a + bX + Ɛ
where a is the intercept, b is the slope, and Ɛ is the error (residual) term
Example
• A restaurant owner wants to know the relationship between meals
bought and associated tips.
• To begin with, she has tip data for only 6 meals
Meal #   Tip amount ($)
1        5.00
2        17.00
3        11.00
4        8.00
5        14.00
6        5.00
[Chart: scatter plot of tip amount ($) against meal #]
Exercise Contd.
• Calculate the mean and draw a “best-fit” line at ȳ = $10
• With only one variable and no other information, the best prediction for the next tip is the mean of the tip values themselves
[Chart: tip amounts with a horizontal “best-fit” line at ȳ = $10]
Exercise Contd. – Goodness of Fit
• Calculate the mean and draw a “best-fit” line at the mean: ȳ = $10
• Calculate yᵢ − ȳ for each point
• yᵢ − ȳ is the deviation of a point from the “best-fit” line: −5, +7, +1, −2, +4, −5
• These deviations are called residuals (also called errors)
• Squaring each residual gives you an area
• Add them up: you have the Sum of Squared Errors
[Chart: residuals of each tip amount around the “best-fit” line at ȳ = $10]
Exercise Contd.
• Sum of Squared Errors (SSE) = 25 + 49 + 1 + 4 + 16 + 25 = 120
• If our regression model is significant, it will “eat up” much of the raw
SSE we had when we assumed that the independent variable did not
exist.
• The regression line will/should literally “fit” the data better. It will
minimize the residuals
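To make the baseline concrete, here is a minimal Python sketch (using only the six tip values from the example) that computes the mean-only SSE:

```python
# Baseline "model 1": predict every tip with the mean tip.
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]

mean_tip = sum(tips) / len(tips)          # ybar = 10.0
residuals = [y - mean_tip for y in tips]  # [-5.0, 7.0, 1.0, -2.0, 4.0, -5.0]
sse = sum(r ** 2 for r in residuals)      # sum of squared errors

print(sse)  # 120.0
```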
Regression Model with the Independent Variable
• Simple linear regression is a bivariate test that is really a
comparison of two models:
• One is where the independent variable does not exist
• And the other uses the best fit regression line
• Regression is a statistical test that attempts to determine the
strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other
variables (known as independent variables).
• With one independent variable this is called simple regression; the standard estimation method is ordinary least squares (OLS).
• Linear regression establishes the linear relationship between two variables based on a linear ‘line of best fit’: y = f(x)
• Remember the equation of a line: y = mx + b (imagine y = b + ax), where
x = the variable
m = slope of the line: rise/run
b = y-intercept (where the line crosses the y-axis, i.e. where x = 0, at the point (0, b))
• We use the same concept for the regression model:
y = β₀ + β₁x + Ɛ, where
β₀ = y-intercept (population parameter)
β₁ = slope (population parameter)
Ɛ = error term, the unexplained variation in y
• β₀ + β₁x explains the part of the variation in y that is due to x; what is left over is the error Ɛ
• Simple linear regression: the expected value of y equals E(y) = β₀ + β₁x (the error term drops out of the expectation because its mean is 0)
• The expected value of y is the mean of the values of y for a given value of x
• So y is not a ‘point’ but the mean expected value on a curve; any expected value of y we calculate is at best an approximation: the mean of a distribution around y
• For sample data, we use the sample estimator ŷ of the mean value of y
• The goal is to minimize the sum of squared residuals: min Σ(yᵢ − ŷᵢ)² (for the mean-only model in the example this was 120)
[Chart: scatter plot of tip amount against bill amount ($), showing a distribution of y values around each expected value]
Steps: Simple Linear Regression
• Estimate E(y) = β₀ + β₁x from the sample: ŷ = b₀ + b₁x
• Intercept b₀ = −0.8188; slope b₁ = 0.1462, computed from the table below

Total Bill x ($)   Tip y ($)   xᵢ − x̄   yᵢ − ȳ   (xᵢ − x̄)(yᵢ − ȳ)   (xᵢ − x̄)²
34                 5           −40      −5       200                1600
108                17          34       7        238                1156
64                 11          −10      1        −10                100
88                 8           14       −2       −28                196
99                 14          25       4        100                625
51                 5           −23      −5       115                529
x̄ = 74             ȳ = 10                        Σ = 615            Σ = 4206

Slope of the regression line (remember: the numerator divided by n − 1 is the covariance of x and y, and the denominator divided by n − 1 is the variance of x, so b₁ = cov(x, y) / var(x)):
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
b₁ = 615 / 4206 = 0.1462

Intercept:
b₀ = ȳ − b₁x̄
b₀ = 10 − 0.1462(74) = 10 − 10.8188 = −0.8188
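The same computation as a short Python sketch, using the six bill/tip pairs from the table:

```python
# Least-squares slope and intercept for the bill/tip example.
bills = [34, 108, 64, 88, 99, 51]   # x
tips  = [5, 17, 11, 8, 14, 5]       # y

x_bar = sum(bills) / len(bills)     # 74.0
y_bar = sum(tips) / len(tips)       # 10.0

# b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2) = 615 / 4206
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
sxx = sum((x - x_bar) ** 2 for x in bills)

b1 = sxy / sxx            # 0.1462...
b0 = y_bar - b1 * x_bar   # -0.8203 at full precision; the slides round b1
                          # to 0.1462 first, which gives b0 = -0.8188

print(round(b1, 4), round(b0, 4))
```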
Interpretation
• ŷᵢ = 0.1462x − 0.8188
• If the bill amount (x) is zero, the expected/predicted tip amount is −$0.8188; the intercept does not have to make sense in every situation
• For every $1 the bill amount (x) increases, we would expect the tip amount to increase by $0.1462, or about 15 cents
• Predict ŷᵢ using ŷᵢ = 0.1462x − 0.8188 for each bill/tip combination, then square each residual:

Bill x ($)   Tip y ($)   (yᵢ − ŷᵢ)²
34           5           0.7217
108          17          4.1237
64           11          6.0688
88           8           16.3645
99           14          0.1201
51           5           2.6762

• Σ(yᵢ − ŷᵢ)² = SSE = 30.075
Sum of Squares due to Regression: SSR
• For model 1 we had only the dependent variable; therefore SSE = SST
• For model 2, both the IV and the DV are included
• SST will remain the same, but SSE should fall substantially
• The difference between SST and SSE is SSR, the sum of squares due to regression: SST − SSE = SSR
• Dots on the line are the predicted values of y for every x value; the other dots are the observed values
• SSE = Σ(yᵢ − ŷᵢ)²
• From the example: SST − SSE = 120 − 30.075 = 89.925, the sum of squares due to regression
Coefficient of Determination
• How well does the estimated equation fit the data?
• If SSR is large then it uses up more of SST; then SSE is smaller
• Ratio = SSR / SST
• The coefficient of determination quantifies this ratio as a percentage:
r² = SSR / SST
• In the example, r² = SSR / SST = 89.925 / 120 = 0.7493, or 74.93%
• 74.93% of the total sum of squares can be explained by using the estimated regression equation to predict the tip amount. The remainder, 25.07%, is error.
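Continuing the sketch: SSE, SSR, and r² for the fitted line, using the rounded coefficients from the slides:

```python
# Goodness of fit for the bill/tip example.
bills = [34, 108, 64, 88, 99, 51]
tips  = [5, 17, 11, 8, 14, 5]
b0, b1 = -0.8188, 0.1462  # rounded coefficients from the slides

y_bar = sum(tips) / len(tips)
y_hat = [b0 + b1 * x for x in bills]                    # predicted tips

sst = sum((y - y_bar) ** 2 for y in tips)               # 120.0
sse = sum((y - yh) ** 2 for y, yh in zip(tips, y_hat))  # ~30.075
ssr = sst - sse                                         # ~89.925

print(round(ssr / sst, 4))  # ~0.7494, the 74.9% from the slides
```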
Thresholds for R-squared: Coefficient of
Determination
Social sciences, such as business subjects, accept lower values of R² as indicating good predictive power:
• Falk and Miller (1992): R² should be equal to or greater than 0.10 for the variance explained by an endogenous construct to be deemed adequate.
• Cohen (1988): R² of endogenous latent variables: 0.26 – substantial; 0.13 – moderate; 0.02 – weak.
• Chin (1998): R² of endogenous latent variables: 0.67 – substantial; 0.33 – moderate; 0.19 – weak.
• Hair et al. (2011): R² of endogenous latent variables: 0.75 – substantial; 0.50 – moderate; 0.25 – weak.
Slope – b or β₁
• The expected increase or decrease in the dependent variable Y for a unit increase or decrease in the independent variable X
β₁ or b = r × (SDy / SDx)
Intercept – a or 𝛽0
• The expected value of Y when X is zero.
• Forms the foundation of the regression equation. We start off from
the intercept and build from there.
β₀ or a = ȳ − b x̄ (the sample means of Y and X)
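A quick numeric check of this identity on the bill/tip data (sample standard deviations; numpy's corrcoef gives r):

```python
import numpy as np

x = np.array([34, 108, 64, 88, 99, 51], dtype=float)  # bills
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)      # tips

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation, ~0.8657
b = r * y.std(ddof=1) / x.std(ddof=1)  # slope via r * (SDy / SDx), ~0.1462
a = y.mean() - b * x.mean()            # intercept via ybar - b*xbar, ~-0.8203

print(round(r, 4), round(b, 4), round(a, 4))
```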
Regression: Assumptions
1. The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal.
2. The means of all these normal distributions of Y, given X, lie on a straight
line with slope b. [E(Y) = a + bx where E(Y) is the mean value of Y]
3. The mean of the error term is 0.
4. The variance of the error term is constant. This variance does not depend
on the values assumed by X.
5. The error terms are uncorrelated. In other words, the observations have
been drawn independently.
In sum, the random error term Ɛ is normally and independently distributed
with a mean of 0 and a variance of σ2
Regression: Assumptions
1. Linearity: between X and Y; the change in the dependent variable should be
proportional to the change in the independent variables. Scatter plots.
2. Normality of errors: The residuals should follow a normal distribution with a
mean of zero. Histogram or a Q-Q plot, or through statistical tests, like the
Shapiro-Wilk test or the Kolmogorov-Smirnov test.
3. Homoscedasticity: the residuals’ variance should be constant across all
independent variable levels. This variance does not depend on the values assumed
by X. In other words, the residuals’ spread should be similar for all values of X,
large or small. Heteroscedasticity, violating this assumption, can be identified
using scatterplots of the residuals or formal tests like the Breusch-Pagan test.
4. Independence of errors: This assumption states that the dataset observations
should be independent of each other.
5. Absence of multicollinearity (Multiple Linear Regression): independent variables
in the linear regression model should not be highly correlated. VIF
In sum, the random error term Ɛ is normally and independently distributed with a
mean of 0 and a variance of σ2
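These checks can be scripted. A minimal sketch, assuming a model fitted with statsmodels (here refit on the bill/tip example) and using scipy's Shapiro-Wilk test and statsmodels' Breusch-Pagan test:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit a simple OLS model (bill/tip data from the earlier example).
x = np.array([34, 108, 64, 88, 99, 51], dtype=float)
y = np.array([5, 17, 11, 8, 14, 5], dtype=float)
X = sm.add_constant(x)  # adds the intercept column
model = sm.OLS(y, X).fit()

# Normality of errors: Shapiro-Wilk on the residuals (H0: normal).
sw_stat, sw_p = stats.shapiro(model.resid)

# Homoscedasticity: Breusch-Pagan (H0: constant residual variance).
bp_stat, bp_p, _, _ = het_breuschpagan(model.resid, model.model.exog)

print(f"Shapiro-Wilk p = {sw_p:.3f}, Breusch-Pagan p = {bp_p:.3f}")
# p-values above 0.05: fail to reject normality / homoscedasticity.
```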
Homoscedasticity
• The horizontal line in both plots represents the
zero residual line—that is, the line where the
predicted values exactly equal the actual values.
• In simpler terms:
• Residuals are the errors:
Residual = Actual value − Predicted value
• So, when a data point lies on the horizontal line,
it means the model predicted it perfectly.
• Points above the line = underpredicted (actual
was higher)
• Points below the line = overpredicted (actual was
lower)
• This line is the baseline to visualize whether the
model's errors are random (good) or patterned
(bad).
Example: Simple Regression
• Attitude towards the city based on duration of stay in the city
• One independent variable: IV
• One dependent variable: DV
Example: Att_City.sav
• As an example, suppose that a researcher wants to explain attitudes towards a respondent’s city of residence in terms of duration of residence in the city. The attitude is measured on an 11-point scale (1 = do not like the city, 11 = very much like the city), and the duration of residence is measured as the number of years the respondent has lived in the city. In a pretest of 12 respondents, the data shown in the table are obtained.

Res ID   Attitude towards city   Duration of residence   Importance attached to weather
1        6                       10                      3
2        9                       12                      11
3        8                       12                      4
4        3                       4                       1
5        10                      12                      11
6        4                       6                       1
7        5                       8                       7
8        2                       2                       4
9        11                      18                      8
10       9                       9                       10
11       10                      17                      8
12       2                       2                       5
Hypothesis
Research hypothesis:
• H1: A person’s duration of stay in a city positively impacts his/her attitude
towards the city
OR
• H1: The longer a person has stayed in a city, the more positive his/her attitude is
towards that city
OR
• H1: Duration of stay in the city and attitude towards the city are positively related
Statistical/Test hypotheses:
• H0: there is no association between duration of stay in the city and attitude
towards the city
• H1: there is an association between duration of stay in the city and attitude
towards the city
Linear Regression: Procedure
• Data: Att_City.sav
• To test if attitude towards the city is dependent on duration of stay:
• Analyze → Regression → Linear
• Add ‘attitude’ to ‘dependent’ and ‘duration’ to ‘independent’ variable.
• In ‘Statistics’ check ‘Estimates’, ‘Confidence intervals’, ‘Model fit’ and
‘Descriptives’.
• Heteroscedasticity: the variance of the residuals is not consistent or constant across the predicted variable. The predictive power of your regression analysis should be roughly equal from low levels of the X value to high levels of the X value.
• To check for heteroscedasticity, in ‘Plots’, under ‘Standardized residual plots’, check ‘Histogram’ and ‘Normal probability plot’. Add the standardized predicted value, ‘*ZPRED’, to ‘X’ and the standardized residual, ‘*ZRESID’, to ‘Y’.
• Click OK.
Annotated highlights of the SPSS output:
• The Variables Entered/Removed table is important mainly for multiple regression.
• R = 0.936 is a high correlation. Is a correlation of 0.936 statistically significant?
• R Square: 87.6% of the variability in attitude towards the city can be accounted for by duration of stay: a very meaningful predictor.
• Adjusted R Square adjusts for sample size, to 86.4% of the variability; the difference between R Square and Adjusted R Square becomes smaller as sample size increases.
• The standard error of the estimate is the amount of error associated with the regression model when predicting a particular value; it is associated only with the mean of X (9.33).
• A simple linear regression was conducted to predict attitude towards the city based on duration of stay in the city. A significant regression equation was found [F(1, 10) = 70.803, p < 0.05], with an R² of 0.876 (adjusted R² = 0.864). Predicted attitude towards the city is equal to 1.079 + 0.590 × (duration of stay), where duration is measured as the number of years a person has lived in the city and attitude is measured on an 11-point scale. Attitude towards the city increased by 0.590 units for each additional year of residence.
https://www.youtube.com/watch?v=6xcQYmPDqXs
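The same analysis outside SPSS: a minimal statsmodels sketch with the Att_City data typed in from the table (variable names are mine):

```python
import numpy as np
import statsmodels.api as sm

duration = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2], dtype=float)
attitude = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2], dtype=float)

X = sm.add_constant(duration)              # intercept + duration
model = sm.OLS(attitude, X).fit()

print(model.params)                        # ~[1.079, 0.590]
print(model.fvalue, model.f_pvalue)        # F(1, 10) ~ 70.80, p < .001
print(model.rsquared, model.rsquared_adj)  # ~0.876, ~0.864
```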
Multiple Regression
• A statistical technique that simultaneously develops a mathematical relationship
between two or more independent variables and a single interval-scaled
(continuous scaled) dependent variable.
• Examples:
• Can variation in sales be explained in terms of variation in advertising expenditures, prices
and level of distribution?
• Can variation in market shares be accounted for by the size of the sales force, advertising
expenditures and sales promotion budgets?
• Are consumers’ perceptions of quality determined by their perceptions of prices, brand
image and brand attributes?
Some independent variables, or sets of independent variables, are better at predicting the DV than others; some contribute nothing.
Multiple Regression Preparation
• Conducting multiple regression analysis requires a fair amount of pre-
work before actually running the regression:
1. Generate a list of potential variables
2. Collect data on the variables
3. Check the relationships between each IV and the DV using scatterplots and correlations
4. Check the relationships among the IVs using scatterplots and correlations (see the correlation sketch after this list)
5. (Optional) conduct simple linear regressions for each IV/DV pair
6. Use the non-redundant IVs in the analysis to find the best-fitting model
7. Use the best-fitting model to make predictions about the DV
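Steps 3 and 4 in one pass: a minimal pandas sketch of the IV/DV and IV/IV correlations for the Att_City data (column names are mine; the scatterplot matrix needs matplotlib installed):

```python
import pandas as pd

# DV = attitude; IVs = duration of residence, importance of weather.
df = pd.DataFrame({
    "attitude": [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2],
    "duration": [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2],
    "weather":  [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5],
})

print(df.corr())                # Pearson correlation matrix
# attitude vs duration / weather: IV-DV checks (step 3);
# duration vs weather: IV-IV multicollinearity check (step 4).

pd.plotting.scatter_matrix(df)  # quick visual check of linearity
```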
• The Att_City.sav example continues: in addition to duration of residence, the researcher now uses the importance attached to the weather as a second predictor of attitude towards the city (the data are in the earlier table; importance is the third column).
[Diagram: Duration of residence (X1) and Importance attached to weather (X2) each point to Attitude towards city (Y). Multiple regression: many-to-one, 3 relationships to analyze.]
Scatterplots: Individual IVs and DV
• Is there a visible linear relationship?
Scatterplot Summary
• Dependent variable versus independent variables
• Attitude towards the city appears highly correlated with
duration of residence in the city (X1)
• Attitude towards the city appears highly correlated with
the importance attached to weather (X2)
• If an IV does not show a strong correlation with the DV, there is no point including it in the equation. Neither IV falls into that category here.
Sketching out Relationships: Multicollinearity Check
[Diagram repeated: X1 (duration of residence) and X2 (importance attached to weather) each predict Y (attitude towards city); for the multicollinearity check the focus is the X1–X2 relationship.]
Scatterplot: IV vs IV: Multicollinearity Check
• Is there a visible linear correlation?
Correlations
• What does the table tell us?
• Check for multicollinearity
Simple Linear Regressions
First calculate a simple regression for each IV separately:
• Duration and attitude: ŷ = 1.079 + 0.590 (duration of residence)
• Weather and attitude: ŷ ≈ 2.48 + 0.67 (importance attached to weather), computed from the same data
Before interpreting the multiple regression, check its assumptions:
• Normality of Residuals: Assess using the histogram and normal P-P plot.
• Reporting: The residuals' empirical cumulative distribution against the theoretical normal cumulative distribution shows an
approximately straight line, indicating normality.
• Homoscedasticity: Plot the standardized residuals (the difference between observed and predicted values)
against the fitted (predicted) values or each independent variable.
• Reporting: A plot of standardized residuals and predicted values showed that the points were scattered randomly around zero,
without any discernible pattern. Therefore, the variance of residuals was constant and homoscedasticity was confirmed.
• Multicollinearity: Check VIF and tolerance.
• The Variance Inflation Factor (VIF) for each predictor was well below the threshold of 5, and the tolerance levels were well above
0.1, dispelling multicollinearity concerns.
• Collectively, these diagnostic tests validated the key assumptions underpinning our multiple linear regression model, providing a
solid groundwork for the subsequent analysis.
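A minimal sketch of the VIF and tolerance check with statsmodels, for the two Att_City predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
weather  = [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5]

# Design matrix with a constant, as statsmodels expects.
X = sm.add_constant(np.column_stack([duration, weather]))

# VIF for each predictor (column 0 is the constant, so start at 1).
for i, name in enumerate(["duration", "weather"], start=1):
    vif = variance_inflation_factor(X, i)
    print(name, "VIF:", round(vif, 2), "tolerance:", round(1 / vif, 2))
# VIF below 5 and tolerance above 0.1 match the thresholds quoted above.
```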
Reporting Multiple Regression Results in APA Style
• A multiple regression was run to predict attitude towards the city from duration of residence in the city (in years) and the level of importance attached to the weather of the city. This resulted in a significant model, F(2, 9) = 77.29, p < .01, adjusted R² = .933; the model explains 93.3% of the variance in attitude towards the city. The individual predictors were examined further and indicated that duration of residence (b = 0.481, t = 8.16, p < .001) and the level of importance attached to weather (b = 0.289, t = 3.353, p < .01) were both significant predictors of attitude towards the city.
• The regression equation was y = 0.337 + 0.481Xduration + 0.289Xweather, where each year of residence improves attitude towards the city by 0.481 units and each one-unit increase in the importance attached to weather improves attitude towards the city by 0.289 units.
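A sketch reproducing the multiple regression with statsmodels (same hand-entered data as before):

```python
import numpy as np
import statsmodels.api as sm

attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]
duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]
weather  = [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5]

X = sm.add_constant(np.column_stack([duration, weather]))
model = sm.OLS(attitude, X).fit()

print(model.params)        # ~[0.337, 0.481, 0.289]
print(model.fvalue)        # ~77.3 on (2, 9) degrees of freedom
print(model.rsquared_adj)  # ~0.933
print(model.tvalues[1:])   # ~[8.16, 3.35] for duration and weather
```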
Multiple Regression: Examples
• How much of the variation in sales can be explained by advertising
expenditures, prices and level of distribution?
• What is the contribution of advertising expenditures in explaining the
variation in sales when the levels of prices and distribution are
controlled?
• What levels of sales may be expected given the levels of advertising
expenditures, prices and level of distribution?
Multiple Regression: Assumptions
1. Adequate sample size: Different guidelines suggested.
1. Stevens (1996): 15 participants per predictor
2. Tabachnick and Fidell (2007) give a formula for calculating sample size requirements, taking into account the number of independent variables: N > 50 + 8m (where m = number of independent variables); e.g. for 5 independent variables, 90 cases (see the sketch after this list). More cases are needed if the dependent variable is skewed.
3. For stepwise regression, there should be a ratio of 40 cases for every independent variable.
2. Multicollinearity and singularity: both undesirable
3. No outliers
4. Normality: the residuals should be normally distributed about the predicted DV scores
5. Linearity: the residuals should have a straight-line relationship with predicted DV
scores
6. Homoscedasticity: the variance of the residuals about predicted DV scores should be the same for all predicted scores.
7. The residuals should be independent
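The Tabachnick and Fidell rule of thumb as a one-line check (a trivial sketch):

```python
def min_cases(m: int) -> int:
    """Minimum N for multiple regression per Tabachnick & Fidell (2007): N > 50 + 8m."""
    return 50 + 8 * m

print(min_cases(5))  # 90 cases for 5 independent variables
```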