
Advanced and Applied

Business Research
Predictive Analysis: Tests of Association: Regression
Testing for Association
1. Presence of relationship? Statistical significance
• P-value (sig.); confidence interval
2. Direction of relationship?
• Positive or negative
3. Strength of association?
• Nonexistent, weak, moderate, strong
4. Type of relationship?
• Linear (or close approximation), curvilinear
Regression Analysis
• A statistical procedure for analyzing associative relationships between
a metric-dependent variable and one or more independent variables.

• A family of techniques that can be used to explore the relationship between one continuous dependent variable and a number of independent variables or predictors (usually continuous).
• Based on correlation, but allows a more sophisticated exploration of the interrelationships among a set of variables.
Test of Association: Regression
Why Use It?
• To predict outcomes.
• To see what matters most.
• To quantify relationships (“how much effect?”).
• Example: what is the effect of study hours on student scores?
• Y = a + bX, where
• Y = the dependent variable (student scores), dependent on study hours
• a = the intercept, the start of the regression line: e.g. a student's score without studying
• X = the independent variable, the number of hours the student studies
• b = the slope (how much Y changes when X increases by 1); see the sketch below
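As a quick illustration (an addition, not part of the original slides), here is a minimal Python sketch of the equation above; the values a = 40 and b = 5 are invented for the example, not estimated from any data.

# Minimal sketch of Y = a + bX with illustrative, made-up values:
# a = 40 (score with zero study hours), b = 5 (points gained per hour).
def predict_score(hours, a=40.0, b=5.0):
    """Predicted score Y for X study hours."""
    return a + b * hours

print(predict_score(0))  # 40.0 -> the intercept a: score without studying
print(predict_score(6))  # 70.0 -> 40 + 5 * 6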
Regression Analysis: Why Perform?
1. To determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
2. To determine how much of the variation in the dependent variable can be explained by the independent variables: the strength of the relationship.
3. To determine the structure or form of the relationship: the mathematical equation relating the independent and dependent variables.
4. To predict the values of the dependent variable.
5. To control for other independent variables when evaluating the contributions of a specific variable or set of variables.
Linear Regression
• Model-based technique; an extension of Pearson correlation
• A procedure for deriving a mathematical relationship, in the form of an equation, between a single metric dependent variable and a single metric independent variable
• Has three terms: an intercept, a slope, and an error term
• Helps us make predictions (build models)
• The regression equation helps us predict a value of Y given a value of X:
Y = a + bX + ε
where a = the intercept, b = the slope, and ε = the error (residual) term
Example
• A restaurant owner wants to know the relationship between meals
bought and associated tips.
• To begin with, she has data for only tips for 6 meals
Meal #  Tip amount ($)
1       5.00
2       17.00
3       11.00
4       8.00
5       14.00
6       5.00

[Scatter plot: Tip amount ($) vs. Meal #]
Exercise Contd.
• Calculate the mean and draw a "best-fit" line at ȳ = $10
• With only one variable and no other information, the best prediction for the next measurement is the mean of the sample itself
• The variability in the tip amounts can only be explained by the tips themselves

[Scatter plot: tip amounts with a horizontal "best-fit" line at ȳ = $10]
Exercise Contd. – Goodness of Fit
• Calculate the mean and draw a "best-fit" line at the mean: ȳ = $10
• Calculate yᵢ − ȳ
• yᵢ − ȳ is the deviation of a point from the "best-fit" line
• These deviations are called residuals (also called errors)
• Squaring a residual gives you an area
• Add them up: you have the Sum of Squared Errors

[Scatter plot: residuals (−5, +7, +1, −2, +4, −5) measured from the "best-fit" line at ȳ = $10]
Exercise Contd.
• Sum of Squared Errors (SSE) = 25 + 49 + 1 + 4 + 16 + 25 = 120 (a quick numeric check follows below)

Meal #  Residual  Residual²
1       −5        25
2       +7        49
3       +1        1
4       −2        4
5       +4        16
6       −5        25
Sum of Squares Error: 120

Σ(yᵢ − ȳ)² = 120

[Scatter plot: squared residuals (error) around the "best-fit" line at ȳ = $10]
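The numeric check mentioned above: a small Python sketch (an addition) that reproduces the mean-only model and its SSE of 120.

# Mean-only ("model 1") fit for the six tips: the best prediction is the
# sample mean, and SSE is the sum of squared deviations from that mean.
tips = [5.0, 17.0, 11.0, 8.0, 14.0, 5.0]

mean_tip = sum(tips) / len(tips)          # 10.0
residuals = [y - mean_tip for y in tips]  # [-5.0, 7.0, 1.0, -2.0, 4.0, -5.0]
sse = sum(r ** 2 for r in residuals)      # 25 + 49 + 1 + 4 + 16 + 25 = 120.0

print(mean_tip, residuals, sse)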
Regression Model with the Independent Variable
• The goal of simple linear regression is to create a linear model that minimizes the sum of squares of the residuals/errors (SSE)
• When we introduce the independent variable, the regression model should improve the prediction by reducing the SSE
• If our regression model is significant, it will "eat up" much of the raw SSE we had when we assumed that the independent variable did not exist
• The regression line will/should literally "fit" the data better: it will minimize the residuals
Regression Model with the Independent Variable
• Simple linear regression is a bivariate test that is really a
comparison of two models:
• One is where the independent variable does not exist
• And the other uses the best fit regression line
• Regression is a statistical test that attempts to determine the
strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other
variables (known as independent variables).
• Also called simple regression or ordinary least squares (OLS).
• Linear regression establishes the linear relationship between two variables based on a linear 'line of best fit': y = f(x)
• Remember the formula for a line: y = mx + b (imagine y = b + ax)
• Where x = random variable
  m = slope of the line: rise/run
  b = y-intercept (crosses the y-axis)
  the y-intercept is where x = 0: coordinate (0, y)
• We use the same concept for the regression model:
y = β₀ + β₁x + ε, where
β₀ = y-intercept population parameter
β₁ = slope population parameter
ε = error term, unexplained variation in y
β₀ + β₁x explains the part of the variation in y due to x; what is left is the error ε
• Simple Linear Regression: The expected value of y, ŷ, equals
E(y) = β₀ + β₁x
(the error term ε has a mean of 0, so it drops out of the expectation)
• The expected value of y is the mean of the values of y for a given value of x
• So y is not a 'point' but the mean expected value: any expected value of y we calculate is at best going to be an approximation, the mean of a distribution around y
• For sample data, we use sample statistics:
ŷ = b₀ + b₁x, where ŷ is the point estimator of E(y), the mean value of y for a given value of x

[Plot: distribution of observed y values around the expected value E(y)]
Back to Exercise: Tips in a Restaurant
• ŷ = b₀ + b₁x (regression formula)
• ŷ = 10 + 0x — remember, the slope for this line is β₁ = 0
• Therefore ŷ = 10: the value of ŷ is 10 for every value of x

Meal #  Residual  Residual²
1       −5        25
2       +7        49
3       +1        1
4       −2        4
5       +4        16
6       −5        25
Sum of Squares Error: 120

Σ(yᵢ − ŷᵢ)² = 120
Observed value − Expected value = Error

[Scatter plot: squared residuals around the "best-fit" line ŷ = 10]
Ordinary Least Squares Criterion
• min Σ(yᵢ − ŷᵢ)²
• The goal is to minimize the sum of the squared differences between the observed value of the dependent variable (yᵢ) and the estimated or predicted value of the dependent variable (ŷᵢ) that is provided by the regression line: the sum of the squared residuals
• In addition, the sum of the squared residuals should be much smaller than in the model with only the dependent variable and β₁ = 0 (in the example it was 120)

Bill ($)  Tip ($)
34        5
108       17
64        11
88        8
99        14
51        5

[Scatter plot: Tip ($) vs. Bill ($)]
Steps: Simple Linear Regression
1. Draw a scatter plot
2. Look for a rough visual line: does the data seem to fall along a line? We are looking for a linear pattern
• Yes: proceed. No (it is a blob): regression is not an appropriate technique to use
3. Correlation coefficient
4. Descriptive statistics and the centroid
5. Calculations
Scatter Plot
• Graph → Legacy Dialogs → Scatter/Dot
• Simple Scatter → Define
• Add the DV to the y-axis and the IV to the x-axis → OK
• The data appear to fall in a pattern: increasing tip amounts with increasing bill amounts
• This suggests regression is a good test of association for these data
• Proceed with regression (a plotting sketch follows below)
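For readers working outside SPSS, a matplotlib sketch of the same scatter plot (an addition; the bill and tip values are the six meals from the running example):

import matplotlib.pyplot as plt

bills = [34, 108, 64, 88, 99, 51]  # IV on the x-axis
tips = [5, 17, 11, 8, 14, 5]       # DV on the y-axis

plt.scatter(bills, tips)
plt.xlabel("Bill ($)")
plt.ylabel("Tip ($)")
plt.title("Tip amount vs. bill amount")
plt.show()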
Scatter Plots and Lines of Best Fit
• Three basic models:
E(y) = β₀ + 0x (flat line: no relationship)
E(y) = β₀ + β₁x (positive slope)
E(y) = β₀ − β₁x (negative slope)

[Three illustrative charts: a flat, an increasing, and a decreasing line of best fit across Meals 1–5]
Ordinary Least Squares Criterion
• min Σ(yᵢ − ŷᵢ)²
• Plot the means of x and y: (74, 10)
• This point is the centroid, and any regression line must pass through the centroid: x̄ = 74, ȳ = 10
• b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄
• The regression line can now be calculated:
ŷᵢ = b₀ + b₁xᵢ
Intercept b₀ = −0.8188
Slope b₁ = 0.1462
ŷᵢ = 0.1462x − 0.8188

Bill ($)  Tip ($)
34        5
108       17
64        11
88        8
99        14
51        5

[Scatter plot: Tip ($) vs. Bill ($) with the fitted regression line]
Meal  Total Bill ($) x  Tip Amount ($) y  xᵢ − x̄  yᵢ − ȳ  (xᵢ − x̄)(yᵢ − ȳ)  (xᵢ − x̄)²
1     34                5                 −40      −5      200                1600
2     108               17                34       7       238                1156
3     64                11                −10      1       −10                100
4     88                8                 14       −2      −28                196
5     99                14                25       4       100                625
6     51                5                 −23      −5      115                529
      x̄ = 74            ȳ = 10                             Σ = 615            Σ = 4206
Slope of the regression line:
b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
(remember: the numerator is the covariance term between x and y; the denominator is the variance term for x)
b₁ = 615 / 4206
b₁ = 0.1462

Intercept:
b₀ = ȳ − b₁x̄, where b₁ = 0.1462
b₀ = 10 − 0.1462(74)
b₀ = 10 − 10.8188
b₀ = −0.8188
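The same calculation in Python (a sketch added for illustration, mirroring the deviation sums above):

# Least-squares slope and intercept from the deviation sums:
# b1 = 615 / 4206, b0 = y-bar - b1 * x-bar.
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

x_bar = sum(bills) / len(bills)  # 74.0
y_bar = sum(tips) / len(tips)    # 10.0

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))  # 615.0
sxx = sum((x - x_bar) ** 2 for x in bills)                         # 4206.0

b1 = sxy / sxx           # ~= 0.1462
b0 = y_bar - b1 * x_bar  # ~= -0.8188
print(b1, b0)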
Interpretation
• ŷᵢ = 0.1462x − 0.8188
• If the bill amount (x) is zero, the expected/predicted tip amount is −$0.8188. The intercept does not have to make sense in every situation.
• For every $1 the bill amount (x) increases, we would expect the tip amount to increase by $0.1462, or about 15 cents
Predict ŷᵢ using the formula ŷᵢ = 0.1462x − 0.8188 for each bill/tip combination, then square the residuals:

(yᵢ − ŷᵢ)²
0.7217
4.1237
6.0688
16.3645
0.1201
2.6762
Σ(yᵢ − ŷᵢ)² = 30.075
SSE = 30.075
Sum of Squares due to Regression: SSR
• For model 1 we had only the dependent variable
• Therefore, SSE = SST
• For model 2, both the IV and the DV are included
• SST remains the same, but SSE should reduce significantly
• The difference between SST and SSE is SSR, the sum of squares due to regression: SST − SSE = SSR
• Dots on the line are predicted values of y for every x value; the other dots are observed values
• SSE = Σ(yᵢ − ŷᵢ)²
• From the example: SST − SSE = 120 − 30.075 = 89.925, the sum of squares due to regression
Coefficient of Determination
• How well does the estimated equation fit the data?
• If SSR is large, then it uses up more of SST, and SSE is smaller
• Ratio = SSR / SST
• The coefficient of determination quantifies this ratio as a percentage:
r² = SSR / SST
• In the example, r² = SSR / SST = 89.925 / 120 = 0.7493, or 74.93%
• 74.93% of the total sum of squares can be explained by using the estimated regression equation to predict the tip amount. The remainder, 25.07%, is error. (A quick numeric check follows below.)
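The numeric check mentioned above, continuing the Python sketch (an addition): it reproduces SST = 120, SSE ≈ 30.075, SSR ≈ 89.925 and r² ≈ 0.7493.

# Goodness of fit for the fitted line y-hat = 0.1462x - 0.8188.
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]
b0, b1 = -0.8188, 0.1462

y_bar = sum(tips) / len(tips)
y_hat = [b0 + b1 * x for x in bills]

sst = sum((y - y_bar) ** 2 for y in tips)               # 120.0
sse = sum((y - yh) ** 2 for y, yh in zip(tips, y_hat))  # ~= 30.075
ssr = sst - sse                                         # ~= 89.925
print(sst, sse, ssr, ssr / sst)                         # r^2 ~= 0.7493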
Thresholds for R-squared: Coefficient of Determination
Social sciences, such as business subjects, allow lower values of R² as good predictive power
• Falk and Miller (1992): R² should be ≥ 0.10 for the variance explained by an endogenous construct to be deemed adequate.
• Cohen (1988): R² of endogenous latent variables: 0.26 – substantial; 0.13 – moderate; 0.02 – weak.
• Chin (1998): R² of endogenous latent variables: 0.67 – substantial; 0.33 – moderate; 0.19 – weak.
• Hair et al. (2011): R² of endogenous latent variables: 0.75 – substantial; 0.50 – moderate; 0.25 – weak.
Slope – b or β₁
• The expected increase or decrease in the dependent variable Y as a function of an increase or decrease in the independent variable X
β₁ or b = r × (SDy / SDx)
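A quick check against the tip example (an added verification, using sample SDs with n − 1): SDx = √(4206/5) ≈ 29.00, SDy = √(120/5) ≈ 4.90, and r = √0.7493 ≈ 0.866, so b = 0.866 × 4.90 / 29.00 ≈ 0.146, matching the slope b₁ = 0.1462 computed earlier.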
Intercept – a or β₀
• The expected value of Y when X is zero
• Forms the foundation of the regression equation: we start off from the intercept and build from there
β₀ or a = Ȳ − bX̄
Regression: Assumptions
1. The error term is normally distributed. For each fixed value of X, the
distribution of Y is normal.
2. The means of all these normal distributions of Y, given X, lie on a straight
line with slope b. [E(Y) = a + bx where E(Y) is the mean value of Y]
3. The mean of the error term is 0.
4. The variance of the error term is constant. This variance does not depend
on the values assumed by X.
5. The error terms are uncorrelated. In other words, the observations have
been drawn independently.
In sum, the random error term ε is normally and independently distributed with a mean of 0 and a variance of σ²
Regression: Assumptions
1. Linearity: between X and Y; the change in the dependent variable should be
proportional to the change in the independent variables. Scatter plots.
2. Normality of errors: The residuals should follow a normal distribution with a
mean of zero. Histogram or a Q-Q plot, or through statistical tests, like the
Shapiro-Wilk test or the Kolmogorov-Smirnov test.
3. Homoscedasticity: the residuals’ variance should be constant across all
independent variable levels. This variance does not depend on the values assumed
by X. In other words, the residuals’ spread should be similar for all values of X,
large or small. Heteroscedasticity, violating this assumption, can be identified
using scatterplots of the residuals or formal tests like the Breusch-Pagan test.
4. Independence of errors: This assumption states that the dataset observations
should be independent of each other.
5. Absence of multicollinearity (multiple linear regression): independent variables in the linear regression model should not be highly correlated. Check using the Variance Inflation Factor (VIF).
In sum, the random error term ε is normally and independently distributed with a mean of 0 and a variance of σ²
Homoscedasticity
• The horizontal line in both plots represents the
zero residual line—that is, the line where the
predicted values exactly equal the actual values.
• In simpler terms:
• Residuals are the errors:
Residual = Actual value − Predicted value
• So, when a data point lies on the horizontal line,
it means the model predicted it perfectly.
• Points above the line = underpredicted (actual
was higher)
• Points below the line = overpredicted (actual was
lower)
• This line is the baseline to visualize whether the
model's errors are random (good) or patterned
(bad).
Example: Simple Regression
• Attitude towards the city based on duration of stay in the city
• One independent variable: IV
• One dependent variable: DV

Example: Att_City.sav
• As an example, suppose that a researcher wants to explain attitudes towards a respondent's city of residence in terms of duration of residence in the city. The attitude is measured on an 11-point scale (1 = do not like the city, 11 = very much like the city), and the duration of residence is measured in terms of the number of years the respondent has lived in the city. In a pretest of 12 respondents, the data shown in the table are obtained.

Res ID  Attitude towards city  Duration of residence  Importance attached to weather
1       6                      10                     3
2       9                      12                     11
3       8                      12                     4
4       3                      4                      1
5       10                     12                     11
6       4                      6                      1
7       5                      8                      7
8       2                      2                      4
9       11                     18                     8
10      9                      9                      10
11      10                     17                     8
12      2                      2                      5
Hypothesis
Research hypothesis:
• H1: A person’s duration of stay in a city positively impacts his/her attitude
towards the city
OR
• H1: The longer a person has stayed in a city, the more positive his/her attitude is
towards that city
OR
• H1: Duration of stay in the city and attitude towards the city are positively related
Statistical/Test hypotheses:
• H0: there is no association between duration of stay in the city and attitude
towards the city
• H1: there is an association between duration of stay in the city and attitude
towards the city
Linear Regression: Procedure
• Data: Att_City.sav
• To test if attitude towards the city is dependent on duration of stay:
• Analyze → Regression → Linear
• Add 'attitude' to 'Dependent' and 'duration' to 'Independent(s)'.
• In 'Statistics' check 'Estimates', 'Confidence intervals', 'Model fit' and 'Descriptives'.
• Heteroscedasticity: the variance of the residuals in your analysis is not consistent or constant across your predicted variable. The predictive power of your regression analysis should be roughly equal from low levels of the X value to high levels of the X value.
• To check for heteroscedasticity, in 'Plots', under 'Standardized residual plots', check 'Histogram' and 'Normal probability plot'. Add the Z predicted value, '*ZPRED', to 'X' and the Z residual value, '*ZRESID', to 'Y'.
• Click OK. (A Python equivalent of this analysis follows below.)
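For reference, a Python sketch of the same simple regression (an addition, assuming the statsmodels package is available; the data are keyed in from the Att_City table):

import statsmodels.api as sm

duration = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]  # IV (years)
attitude = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]     # DV (11-point scale)

X = sm.add_constant(duration)      # adds the intercept column for b0
model = sm.OLS(attitude, X).fit()  # ordinary least squares
print(model.summary())             # coefficients, R-squared, F test, p-values
# Expect roughly: intercept ~= 1.079, slope ~= 0.590, as on the slides.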
[SPSS output: Variables Entered/Removed, Model Summary, ANOVA and Coefficients tables, annotated as follows]
• High correlation. The Variables Entered/Removed table is important for multiple regression.
• 87.8% of the variability in attitude towards the city can be accounted for by duration of stay: a very meaningful predictor.
• Adjusted R Square (87.1% of the variability) adjusts for sample size; the difference between R Square and Adjusted R Square becomes smaller as sample size increases.
• The standard error of the estimate is the amount of error associated with the regression model when predicting a particular value; it is associated only with the mean of X (9.33).
• Is the correlation of 0.936 statistically significant?
• The Coefficients table tells us where the intercept lies: the intercept β₀ tells us the level of attitude towards the city when duration of stay is 0.
• So, how much would attitude towards the city improve if we increase duration of stay by 1 unit (year)? β₁ (the slope of the linear relationship).
Homoscedasticity Check
• The mean of the residuals should be 0
• Homoscedasticity is achieved when the residuals are scattered around the zero residual line in no particular pattern and within ±3.3 standard deviations

[Scatterplot: regression standardized residuals vs. standardized predicted values]
Normality Check
• The residual distribution should be approximately normal: established through the normal P-P plot
• Expected and observed probabilities are approximately equal

[Normal P-P plot of regression standardized residuals]
Reporting Simple Linear Regression: APA
(F and p from the ANOVA table; R² from the Model Summary table; constant β₀ and slope β₁ from the Coefficients table)
• A simple linear regression was conducted to predict attitude towards the city based on duration of stay in the city. A significant regression equation was found [F(1, 10) = 70.803, p < 0.05], with an R² of 0.864. Predicted attitude towards the city is equal to 1.079 + 0.590 (duration of stay), where the IV is measured as the number of years a person has lived in the city and the DV is measured on an 11-point scale. Attitude towards the city increased 0.590 units for each unit increase in duration of stay.
https://www.youtube.com/watch?v=6xcQYmPDqXs
Multiple Regression
• A statistical technique that simultaneously develops a mathematical relationship between two or more independent variables and a single interval-scaled (continuous) dependent variable.
• Independent variables can be continuous or categorical
• Examples:
• Can variation in sales be explained in terms of variation in advertising expenditures, prices and level of distribution?
• Can variation in market shares be accounted for by the size of the sales force, advertising expenditures and sales promotion budgets?
• Are consumers' perceptions of quality determined by their perceptions of prices, brand image and brand attributes?
• Some independent variables, or sets of independent variables, are better at predicting the DV than others; some contribute nothing.
Multiple Regression Preparation
• Conducting multiple regression analysis requires a fair amount of pre-
work before actually running the regression:
1. Generate a list of potential variables
2. Collect data on the variables
3. Check the relationships between each IV and the DV using
scatterplots and correlations
4. Check the relationships among the IVs using scatterplots and
correlations
5. (Optional) conduct simple linear regressions for each IV/DV pair
6. Use the non-redundant IVs in the analysis to find the best fitting
model
7. Use the best fitting model to make predictions about DV
• As an example, suppose that a researcher wants to explain attitudes towards a respondent's city of residence in terms of duration of residence in the city. The attitude is measured on an 11-point scale (1 = do not like the city, 11 = very much like the city), and the duration of residence is measured in terms of the number of years the respondent has lived in the city. In a pretest of 12 respondents, the data shown in the table are obtained.

Att_City.sav
Res ID  Attitude towards city  Duration of residence  Importance attached to weather
1       6                      10                     3
2       9                      12                     11
3       8                      12                     4
4       3                      4                      1
5       10                     12                     11
6       4                      6                      1
7       5                      8                      7
8       2                      2                      4
9       11                     18                     8
10      9                      9                      10
11      10                     17                     8
12      2                      2                      5
Sketching out Relationships
Independent variables → Dependent variable
• Duration of residence (X1) → Attitude towards city (y)
• Importance attached to weather (X2) → Attitude towards city (y)
Multiple regression: many-to-one
3 relationships to analyze
Scatterplots: Individual IVs and DV
• Is there a visible linear relationship?
Scatterplot Summary
• Dependent variable versus independent variables
• Attitude towards the city appears highly correlated with duration of residence in the city (X1)
• Attitude towards the city appears highly correlated with the importance attached to weather (X2)
• If an IV does not show a strong correlation with the DV, there is no point including it in the equation. We do not have any such IVs here.
Sketching out Relationships: Multicollinearity Check
Independent variables → Dependent variable
• Duration of residence (X1) → Attitude towards city (y)
• Importance attached to weather (X2) → Attitude towards city (y)
• Here the relationship of interest is between the two IVs themselves
Multiple regression: many-to-one
3 relationships to analyze
Scatterplot: IV vs IV: Multicollinearity Check
• Is there a visible linear correlation?
Correlations
• What does the table tell us?
• Check for multicollinearity
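A Python sketch of this multicollinearity check (an addition, assuming pandas and statsmodels are available; the IV values come from the Att_City table):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

ivs = pd.DataFrame({
    "duration": [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2],
    "weather":  [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5],
})

print(ivs.corr())  # correlation between the two IVs

X = sm.add_constant(ivs)  # VIF is computed on the full design matrix
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))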
Simple Linear Regressions
First calculate a regression for each IV separately:
• Duration and Attitude:
ŷ(x1) = 1.079 + 0.59 (Duration of residence)
• Weather and Attitude:
ŷ(x2) = 2.479 + 0.675 (Importance attached to weather)
Hypothesis
Research hypothesis:
• H1: Duration of stay in a city positively impacts one’s attitude towards the city.
• H2: The weather of the city positively impacts one’s attitude towards the city.
Statistical/Test hypotheses:
For H1:
• H0: there is no association between duration of stay in the city and attitude
towards the city
• H1: there is an association between duration of stay in the city and attitude
towards the city
For H2
• H0: there is no association between the weather of the city and attitude towards
the city
• H1: there is an association between the weather of the city and attitude towards
the city
Multiple Regression
[SPSS output: Model Summary, ANOVA and Coefficients tables]
• The Durbin-Watson statistic shows independence of errors: it should be between 1 and 3
• SSR / SST = 114.264 / 120.917 = 0.945; this is the coefficient of determination, R²
• ŷ(x1, x2) = 0.337 + 0.481x₁ + 0.289x₂
• Explain these! (A code sketch of this model follows below.)
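A Python sketch of the multiple regression (an addition, assuming statsmodels; expect approximately ŷ = 0.337 + 0.481·duration + 0.289·weather, matching the slides):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "attitude": [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2],
    "duration": [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2],
    "weather":  [3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5],
})

X = sm.add_constant(df[["duration", "weather"]])
model = sm.OLS(df["attitude"], X).fit()
print(model.summary())  # R-squared, F(2, 9), coefficients, Durbin-Watson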
Reporting Multiple Regression Results in APA Style
Checking and Reporting Model Assumptions
• Linearity: Verify each independent variable’s relationship with the dependent variable is linear.
• Reporting: Scatterplots of attitude towards the city, duration of residence, and importance attached to weather revealed linear
trends.
• Independence of Errors: Use the Durbin-Watson statistic.
• Reporting: The Durbin-Watson statistic of 1.956 suggests no autocorrelation, indicating independent errors.

• Normality of Residuals: Assess using the histogram and normal P-P plot.
• Reporting: The residuals' empirical cumulative distribution against the theoretical normal cumulative distribution shows an
approximately straight line, indicating normality.

• Homoscedasticity: Plot the standardized residuals (the difference between observed and predicted values)
against the fitted (predicted) values or each independent variable.
• Reporting: A plot of standardized residuals and predicted values showed that the points were scattered randomly around zero,
without any discernible pattern. Therefore, the variance of residuals was constant and homoscedasticity was confirmed.
• Multicollinearity: Check VIF and tolerance.
• The Variance Inflation Factor (VIF) for each predictor was well below the threshold of 5, and the tolerance levels were well above
0.1, dispelling multicollinearity concerns.
• Collectively, these diagnostic tests validated the key assumptions underpinning our multiple linear regression model, providing a
solid groundwork for the subsequent analysis.
Reporting Multiple Regression Results in APA Style
• A multiple regression was run to predict attitude towards the city from duration of residence in the city (in years) and the level of importance attached to the weather of the city. This resulted in a significant model, F(2, 9) = 57.132, p < .01, adjusted R² = .933. The model explains 93.30% of the variance in attitude towards the city. The individual predictors were examined further and indicated that duration of residence (beta = 0.481, t = 9.21, p < .001) and the level of importance attached to weather (beta = 0.289, t = 3.353, p < .01) were both significant predictors of attitude towards the city.
• The regression equation was y = 0.337 + 0.481X(duration) + 0.289X(weather), where each year of residence improves attitude towards the city by 0.481 units and each 1-unit increase in importance attached to weather improves attitude towards the city by 0.289 units.
Multiple Regression: Examples
• How much of the variation in sales can be explained by advertising
expenditures, prices and level of distribution?
• What is the contribution of advertising expenditures in explaining the
variation in sales when the levels of prices and distribution are
controlled?
• What levels of sales may be expected given the levels of advertising
expenditures, prices and level of distribution?
Multiple Regression: Assumptions
1. Adequate sample size: Different guidelines suggested.
1. Stevens (1996): 15 participants per predictor
2. Tabachnick and Fidell (2007) formula for calculating sample size requirements, taking into account
the number of independent variables: N > 50 + 8m (where m = number of independent variables).
E.g. for 5 independent variables, 90 cases. More cases are needed if the dependent variable is
skewed.
3. For stepwise regression, there should be a ratio of 40 cases for every independent variable.
2. Multicollinearity and singularity: both undesirable
3. No outliers
4. Normality: the residuals should be normally distributed about the predicted DV scores
5. Linearity: the residuals should have a straight-line relationship with predicted DV
scores
6. Homoscedasticity: the variance of the residuals about predicted DV scores should be the same for all predicted scores.
7. The residuals should be independent
