Multiple Linear Regression Test - 2025

The document provides a comprehensive overview of Multiple Linear Regression (MLR) and its application in JAMOVI, detailing key assumptions such as linearity, multicollinearity, homoscedasticity, normality of residuals, and autocorrelation. It emphasizes the importance of testing these assumptions to ensure valid regression results and includes examples related to employee performance and calcium intake. The resource is aimed at educating students and researchers on hypothesis formulation and statistical analysis techniques.


MULTIPLE LINEAR

REGRESSION TEST IN
JAMOVI

Resource Person
Dr. Ambati Nageswara Rao,
Assistant Professor of Social Work,
Former Dean, Research and Publication Division,
Gujarat National Law University, Gandhinagar,
Mail ID: [email protected] M. No: 9898332217
Acknowledgement

I want to express my sincere thanks to all the scholars, academicians, and practitioners who have contributed to this topic and circulated material online; it helped me greatly in preparing this presentation. I have also used various images from Google to hold the students' attention. The presentation is mainly intended to create awareness among students and researchers about hypothesis testing.
References

• Laerd Statistics. Linear Regression Using SPSS Statistics. Retrieved from https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php
Multiple Linear Regression (MLR)

Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between a dependent variable (outcome) and two or more independent variables (predictors). The goal of MLR is to understand how changes in the independent variables affect the dependent variable, and it lets us predict the value of the dependent variable from known values of the independent variables.
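The fitting itself is ordinary least squares. As a minimal illustration (the data and variable names here are invented for the example, not taken from the slides), the coefficients can be recovered with NumPy:

```python
import numpy as np

# Illustrative data: two predictors and an outcome built from a known
# linear rule, y = 2 + 3*x1 + 0.5*x2 (no noise, so the fit is exact).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([10.0, 8.0, 6.0, 5.0, 1.0])
y = 2 + 3 * x1 + 0.5 * x2

# Design matrix: a column of 1s for the intercept, then the predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: solve for b minimizing ||y - X b||^2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # ≈ [2. 3. 0.5] (intercept, b1, b2)
```

Statistical packages such as JAMOVI do exactly this under the hood, and additionally report standard errors, t-values, and p-values for each coefficient.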
Results
Assumption No: 01 & 02
• Assumption No 1: Your dependent variable should be
measured at the continuous level (i.e., it is either
an interval or ratio variable).
• Examples of continuous variables include revision time
(measured in hours), intelligence (measured using IQ score),
exam performance (measured from 0 to 100), weight (measured
in kg), and so forth.
• Assumption No 2: Your independent variable should also be
measured at the continuous level (i.e., it is either
an interval or ratio variable). See the bullet above for examples
of continuous variables.
Linearity: The Straight Path
• Imagine you are walking down a straight road, and every time you take a step, you
move forward at a consistent pace. This road represents the linear relationship
between your independent variables (like years of experience or hours worked) and
your dependent variable (like employee performance).

• In multiple linear regression, we expect the effect of the predictors (independent variables) on the outcome (dependent variable) to follow a straight line, just like your walk down a straight road.

• If your path is curvy or zigzagging, it might mean something is wrong with the
relationship between the variables, and the line is not the best way to predict the
outcome.

• Test: Check the scatterplot or partial plots to see if the data points form a straight line.
If they do, the assumption holds.
Assumption No 03:
Linear relationship

• Linearity: This assumption suggests that the relationship between the independent variables (predictors) and the dependent variable (employee performance) should be linear.

• Test in JAMOVI: You can create a scatterplot in JAMOVI to check the relationship between each independent variable and the dependent variable. A linear pattern in these plots will confirm the assumption of linearity.
Linearity Assumption test

H₀: There is a linear relationship between the independent variables and the dependent
variable.

H₁: There is a non-linear relationship between the independent variables and the
dependent variable
Linearity Assumption test
How to Check:
•Use a scatter plot of residuals vs. predicted values.

Conclusion:
▪ If the plot shows a random scatter of points (no discernible
pattern), then fail to reject H₀, meaning linearity holds.
▪ If there’s a pattern (e.g., a curve), then reject H₀, meaning the
relationship is non-linear.
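The check above can be sketched numerically. In this illustrative example (data and pattern score invented for the demonstration), a straight line is fitted to a linear and a curved data set, and the correlation between the residuals and a curvature term distinguishes the two:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_linear = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)  # truly linear
y_curved = x ** 2                                          # clearly curved

def residual_pattern(x, y):
    # Fit a straight line by least squares and compute residuals.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Crude pattern score: |correlation| of residuals with (x - mean)^2.
    # Near 0 -> random scatter (linearity plausible); near 1 -> a curve.
    return abs(np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1])

print(residual_pattern(x, y_linear))  # small
print(residual_pattern(x, y_curved))  # ≈ 1.0 (clear curved pattern)
```

In practice you would eyeball the residuals-vs-predicted plot rather than compute a score, but the logic is the same: structure in the residuals means the straight-line model is missing something.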
Scatterplot
The scatterplot gives a general
idea of the relationship between
the two variables and we can
look by eye to see if a linear
relationship is suitable. In this
example there appears to be a
positive relationship as there are
more points in the bottom-left
and top-right quarters of the
plot than in the top-left and
bottom-right corners.
Multicollinearity: The Friends Who Talk
the Same
• Imagine you have two friends, Sarah and Lisa, who always talk about the same
things. They like the same movies, have the same hobbies, and even agree on
everything. If you ask them a question, they both give you almost identical
answers. In statistical terms, this is called multicollinearity.
• In the world of regression, multicollinearity happens when two or more
independent variables are so similar to each other that they end up giving you
the same information. Just like Sarah and Lisa, they are talking the same talk!
• This is a problem because when we try to figure out which variable is causing
the change in the outcome (like employee performance), it becomes hard to
distinguish the influence of one from the other.
• Test: Use the Variance Inflation Factor (VIF). If VIF values are too high
(greater than 10), it means the predictors are too similar to each other.
Assumption No 04:
No Multicollinearity

• Multicollinearity: Multicollinearity occurs when two or more independent variables are highly correlated with each other, which can distort the regression results.

• Test in JAMOVI: You can check the Variance Inflation Factor (VIF) for each predictor. If the VIF is greater than 10, multicollinearity might be an issue.
Multicollinearity Assumption

• H₀: There is no multicollinearity among the independent variables.
• H₁: There is multicollinearity among the independent variables.
Multicollinearity Assumption

How to Check:
▪ Check the Variance Inflation Factor (VIF) values for each
independent variable.

Conclusion:
▪ If VIF values are less than 10 for all predictors, fail to reject H₀
(no multicollinearity).
▪ If VIF is greater than 10 for any predictor, reject H₀
(multicollinearity is present). You may need to remove or
combine highly correlated predictors.
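The VIF rule above can be computed by hand: regress each predictor on the remaining predictors and take 1 / (1 − R²). A NumPy sketch with invented data, where one predictor is nearly a copy of another:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)                   # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(scale=0.1, size=100)   # nearly a copy of x1 -> huge VIF
X = np.column_stack([x1, x2, x3])
print(vif(X))  # x1 and x3 well above 10; x2 near 1
```

JAMOVI reports these same values under Collinearity Statistics; the hand computation just makes the "1 / (1 − R²)" definition concrete.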
Homoscedasticity: The Equal Spread
• Imagine you are at a carnival, and you are throwing a ball at a target. At first,
you throw the ball, and it lands close to the target—great accuracy! But as you
throw more and more, the ball starts landing farther and farther from the
target. This means the spread of your throws is not consistent.
• In MLR, homoscedasticity means that the spread (or variance) of the
residuals (errors) should be the same across all levels of the independent
variables. If the spread is getting wider or narrower, it’s like your throws at the
carnival starting to miss the mark.
• Test: Check the Residual vs. Predicted plot. If the spread of the residuals
stays constant across all predictions, the assumption holds. If it looks like a
funnel (narrowing or widening), then the assumption is violated.
Assumption No 05:
Homoscedasticity
• Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move
along the line.
Homoscedasticity
• Homoscedasticity can be referred to as the condition of
homogeneity of variance. This is because the variance between
the predicted and observed values will be a constant for any
independent variable.

Source: https://www.wallstreetmojo.com/homoscedasticity/
Homoscedasticity
• Homoscedasticity: This assumption suggests that the variance
of the residuals (errors) should be constant across all levels of
the independent variables.

• Test in JAMOVI: You can create a Residuals vs. Predicted plot. If the plot shows a random scatter, it confirms homoscedasticity. If there’s a pattern (like a funnel shape), the assumption may be violated.
Homoscedasticity Assumption:

H₀: The residuals have constant variance (homoscedasticity).

H₁: The residuals do not have constant variance (heteroscedasticity).
Homoscedasticity Assumption
How to Check:
▪ Use a scatter plot of residuals vs. predicted values.

Conclusion:
▪ If the residuals are randomly scattered with no specific pattern
(i.e., uniform spread), fail to reject H₀ (homoscedasticity holds).
▪ If the plot shows a funnel shape (either narrowing or widening),
reject H₀ (heteroscedasticity is present). You may need to use
transformations or weighted regression to correct this.
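One crude numerical version of this funnel check (similar in spirit to the Glejser test; the data and threshold here are invented for illustration) is to correlate the absolute residuals with the fitted values:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(1, 10, 200)

# Constant error spread (homoscedastic) vs. spread growing with x (funnel).
y_homo = 3 * x + rng.normal(scale=1.0, size=x.size)
y_funnel = 3 * x + rng.normal(scale=1.0, size=x.size) * x

def spread_vs_fitted(x, y):
    # Fit a straight line and compute residuals.
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    # Correlation between |residual| and fitted value:
    # near 0 -> even spread; clearly positive -> widening funnel.
    return np.corrcoef(np.abs(resid), fitted)[0, 1]

print(spread_vs_fitted(x, y_homo))    # small
print(spread_vs_fitted(x, y_funnel))  # clearly positive
```

Visual inspection of the Residuals vs. Predicted plot remains the standard classroom check; this score just quantifies what the funnel shape looks like numerically.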
Normality of Residuals: The Bell Curve
• Let’s imagine you are a baker, and every time you bake a batch of cookies, you
weigh them to ensure they are evenly sized. Most of the cookies turn out to
be similar in size, with a few that are much bigger or smaller than the others.
If you were to graph the sizes of the cookies, you'd get a shape that looks like
a bell—this is the normal distribution.
• In regression, the residuals (errors between the predicted and actual values)
should follow a similar bell-shaped curve. If they do, it suggests that the
errors are evenly distributed, and the model is working well. If the
distribution is lopsided, it means something is off.
• Test: Check the histogram or P-P plot for the residuals. If the plot looks
like a bell curve (symmetrical), the assumption holds.
Normality of Residuals Assumption

• H₀: The residuals (errors) are normally distributed.
• H₁: The residuals (errors) are not normally distributed.
Normality Assumption test
How to Check:
▪ Use the Q-Q plot or Shapiro-Wilk test.

Conclusion:
▪ Q-Q Plot: If the residuals form a straight line, fail to reject H₀
(normality is assumed).

▪ Shapiro-Wilk Test: If the p-value > 0.05, fail to reject H₀ (residuals are normally distributed). If the p-value ≤ 0.05, reject H₀ (residuals are not normally distributed).
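The Shapiro-Wilk decision rule can be tried on simulated residuals, assuming SciPy is available (the data sets here are invented: one bell-shaped, one deliberately lopsided):

```python
import numpy as np
from scipy import stats  # assumes SciPy is installed

rng = np.random.default_rng(1)
normal_resid = rng.normal(size=200)       # bell-shaped "residuals"
skewed_resid = rng.exponential(size=200)  # lopsided "residuals"

w_norm, p_norm = stats.shapiro(normal_resid)
w_skew, p_skew = stats.shapiro(skewed_resid)

# p > 0.05 -> fail to reject H0 (normality plausible);
# p <= 0.05 -> reject H0 (residuals not normal).
print(p_norm > 0.05)
print(p_skew > 0.05)  # False: exponential data is clearly non-normal
```

JAMOVI offers the same Shapiro-Wilk test (and the Q-Q plot) directly in the regression assumption checks.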
Assumption No 06:
Normality of Residuals (errors)
Normality of Residuals: The residuals should be normally
distributed for valid significance tests.

Interpretation:
• If histogram and P-P plot show a normal
distribution, the assumption is met.
• Shapiro-Wilk test (p > 0.05) indicates normality.
Assumption No 08:
No outliers
• There should be no significant outliers.
• An outlier is an observed data point that has a dependent variable
value that is very different to the value predicted by the regression
equation.
• An outlier will be a point on a scatterplot that is (vertically) far away from the regression line, indicating that it has a large residual.
Absence of Extreme Outliers
• Regression analysis is sensitive to outliers, so we want to ensure that there are no extreme
outliers in our data set.
• We can do this by reviewing the Minimum and Maximum columns of the Std.
Residual row in the Residuals Statistics table.
• A data point with a standardized residual that is more extreme than +/-3 is usually
considered to be an outlier.
• In other words, if the value in the Minimum column of the Std. Residual row is less than
-3, we should investigate it.
• Similarly, if the value in the Maximum column of the Std. Residual row is greater than 3,
we should investigate it.
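The ±3 rule above can be sketched as follows. This is a simplified version of what JAMOVI and SPSS report in their output tables, using an invented data set with one planted outlier:

```python
import numpy as np

x = np.arange(1.0, 21.0)   # 20 observations
y = 2 * x + 5              # a perfect linear relationship...
y[10] += 40                # ...with one gross outlier planted at index 10

# Fit the regression line and compute residuals.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Standardized residuals: residual divided by the residual std. deviation
# (ddof=2 because two parameters, intercept and slope, were estimated).
std_resid = resid / resid.std(ddof=2)
outliers = np.where(np.abs(std_resid) > 3)[0]
print(outliers)  # -> [10]
```

Only the planted observation exceeds the ±3 band; all other standardized residuals stay close to zero.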
Autocorrelation: The Too-Connected Teammates
• Let’s use a basketball team analogy to explain autocorrelation more clearly.
• Imagine you’re watching a basketball game. Every time one player scores, they pass the ball
to another player, and that player is likely to score next. Now, if you were trying to evaluate
how well each player performs, it would be difficult because the success of one player
(scoring) depends on the previous player's action (passing the ball). The players are
"connected" by their actions, making it hard to evaluate their individual skills.
• In a regression model, autocorrelation is similar. When errors (residuals) from one
observation (data point) influence or correlate with errors from another observation, it's
like the basketball players relying on each other to score. If residuals from one point in
time are correlated with residuals from another (especially in time-series data), we have
autocorrelation.
• Test: Use the Durbin-Watson Test. A value close to 2 indicates that the errors are
independent, and autocorrelation isn’t a problem.
Autocorrelation of residual Assumption

• H₀: The residuals are independent of each other.
• H₁: The residuals are not independent of each other.
Autocorrelation of residual Assumption
How to Check:
▪ Use the Durbin-Watson test.

Conclusion:
▪ If the Durbin-Watson statistic is close to 2 (range: 1.5–2.5), fail to
reject H₀ (residuals are independent).
▪ If the value falls well outside this range (values below 1 suggest strong positive autocorrelation; values above 3 suggest strong negative autocorrelation), reject H₀ (autocorrelation exists, meaning the residuals are not independent).
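The Durbin-Watson statistic is simple to compute from the residuals: d = Σ(eₜ − eₜ₋₁)² / Σ eₜ². A sketch on simulated residuals, comparing an independent series with a strongly autocorrelated one (data invented for illustration):

```python
import numpy as np

def durbin_watson(resid):
    # d = sum of squared successive differences / sum of squared residuals.
    diff = np.diff(resid)
    return (diff @ diff) / (resid @ resid)

rng = np.random.default_rng(3)
independent = rng.normal(size=500)

# Positively autocorrelated residuals: AR(1) process with rho = 0.9.
ar = np.empty(500)
ar[0] = independent[0]
for t in range(1, 500):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

print(durbin_watson(independent))  # close to 2 -> residuals independent
print(durbin_watson(ar))           # well below 1.5 -> positive autocorrelation
```

A useful rule of thumb: d ≈ 2(1 − ρ), where ρ is the lag-1 correlation of the residuals, so ρ = 0 gives d ≈ 2 and strong positive autocorrelation pushes d toward 0.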
Assumption No 07:
No Autocorrelation
• Check the Durbin-Watson statistic in the Model Summary table to determine whether your data satisfies the independence of observations assumption.
• Values between 1.5 and 2.5 are normally considered to satisfy this assumption.

Source: https://slideplayer.com/slide/4283449/
Multiple Linear Regression (Knowledge Score,
Diet, Exercise → Calcium Intake)
• A multiple linear regression tests whether two or more independent
variables together significantly predict a dependent variable.

• Null Hypothesis (H₀): There is no significant relationship between the independent variables (Knowledge Score, Diet, Exercise) and Calcium Intake.

• Alternative Hypothesis (H₁): At least one of the independent variables (Knowledge Score, Diet, or Exercise) significantly predicts Calcium Intake.
HR Analytics
Employee Performance/Engagement
• You're an HR analyst at a company, and management wants to understand
what factors influence employee performance. You've identified the following
independent variables (predictors):

• Years of Experience (X1)
• Education Level (X2, coded as: 1 = High School, 2 = Bachelor's, 3 = Master's)
• Hours Worked per Week (X3)

• The dependent variable (outcome) is Employee Performance Score (Y), which is a continuous variable based on performance reviews.

• You collected data from 100 employees, and now you want to use Multiple Linear Regression to analyze the relationship between the independent variables and employee performance.
Step 1: Formulating Hypotheses

A. Hypothesis for the Overall Model

You first test whether the independent variables as a whole significantly predict employee performance.

• H₀: There is no significant relationship between the independent variables (years of experience, education level, and hours worked) and the dependent variable (employee performance score).
• H₁: At least one of the independent variables significantly predicts employee performance.
Step 2: Running the Multiple Linear Regression

• Using JAMOVI (or any other statistical software), you run the Multiple Linear Regression with the employee performance score (Y) as the dependent variable and years of experience (X1), education level (X2), and hours worked per week (X3) as the independent variables.
Descriptive Statistics

1. Calcium Intake (mg/day)
o Mean: 873.75 mg/day → Average calcium intake.
o Standard Deviation: 127.907 → Moderate variation in intake levels.
o Implication: Calcium intake varies considerably, likely influenced by knowledge, diet, and exercise.

2. Knowledge Score (out of 50)
o Mean: 38.50 → Participants have relatively high knowledge of calcium intake.
o Standard Deviation: 6.653 → Low variability, indicating similar levels of knowledge.
o Implication: Knowledge may have a strong positive relationship with calcium intake.

3. Diet (Scale 1-5)
o Mean: 3.40 → Moderate adherence to a balanced diet.
o Standard Deviation: 0.995 → Low variability, indicating consistent dietary habits.
o Implication: Diet is expected to significantly influence calcium intake.

4. Exercise (Minutes per Day)
o Mean: 39.60 min/day → Participants exercise moderately.
o Standard Deviation: 13.028 → Some individuals exercise significantly more or less.
o Implication: Exercise may impact calcium metabolism, influencing intake.
Pearson Correlation Matrix

1. Key Findings:
o Calcium Intake & Knowledge Score (r = .994, p < .001) → Higher knowledge about calcium intake is strongly associated with increased calcium consumption.
o Calcium Intake & Diet (r = .927, p < .001) → A well-balanced diet significantly influences calcium intake.
o Calcium Intake & Exercise (r = .978, p < .001) → Exercise plays a crucial role in calcium metabolism and intake.

2. Intercorrelations Among Independent Variables:
o Knowledge Score & Diet (r = .915, p < .001) → Higher nutrition awareness goes together with a better diet.
o Knowledge Score & Exercise (r = .977, p < .001) → Greater knowledge is linked with higher exercise levels.
o Diet & Exercise (r = .919, p < .001) → People with better diet habits also tend to exercise more.
• R = .995: This is the multiple correlation coefficient, indicating a very strong positive relationship between the predictors (Exercise, Diet, Knowledge Score) and Calcium Intake.
• R Square (R²) = .991: 99.1% of the variance in Calcium Intake is explained by the three independent variables. This suggests an extremely high model fit, meaning the predictors explain almost all the variation in the dependent variable.
• Adjusted R Square = .989: R² adjusted for the number of predictors. 98.9% explanatory power remains after the adjustment, indicating a robust model.
• Standard Error of the Estimate = 13.267: This represents the average deviation of the predicted values from the actual values. A lower value indicates better prediction accuracy.
• Durbin-Watson = 1.505: This tests for autocorrelation (whether residuals are correlated). Values between 1.5 and 2.5 indicate no serious autocorrelation; 1.505 is acceptable, meaning the residuals are independent and the assumption is met.
Degrees of Freedom (df): Regression df = 3 → with three predictors, the degrees of freedom for regression is 3. Residual df = 16 → total observations (N = 20) minus the number of parameters estimated (3 predictors + 1 intercept): df = N - (k + 1) = 20 - 4 = 16.

Sum of Squares (SS): The Regression Sum of Squares (SS Regression) represents the variation in Calcium Intake explained by the independent variables (Exercise, Diet, and Knowledge Score); a high value indicates that the model captures a significant amount of variance. The Residual Sum of Squares (SS Residual) represents the variation in Calcium Intake that is not explained by the predictors; a low value here is desirable, as it indicates less unexplained variance. The Total Sum of Squares (SS Total) is the total variability in Calcium Intake before considering the impact of the independent variables.

Mean Square (MS): MS Regression = 102,675.880 (calculated as SS Regression / df Regression). MS Residual = 176.007 (calculated as SS Residual / df Residual). A high MS Regression relative to MS Residual suggests a strong model.

F-Statistic = 583.363: calculated as MS Regression / MS Residual. A high F-value indicates that the regression model is statistically significant.

Significance Level (p-value): Since p < 0.001, the model is highly significant, meaning that Exercise, Diet, and Knowledge Score together have a significant effect on Calcium Intake.
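The df and F arithmetic above can be verified directly from the reported figures:

```python
# Reproduce the ANOVA arithmetic from the reported output (N = 20, k = 3).
n, k = 20, 3
df_regression = k
df_residual = n - (k + 1)      # 20 - 4 = 16

ms_regression = 102_675.880    # reported SS Regression / df Regression
ms_residual = 176.007          # reported SS Residual / df Residual
f_statistic = ms_regression / ms_residual

print(df_residual)             # 16
print(round(f_statistic, 2))   # 583.36, matching the reported F-statistic
```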
1. Overall Model Summary: The multiple linear regression analysis predicts calcium intake
based on three independent variables: Knowledge Score, Diet, and Exercise. We will focus on
the coefficients, significance (p-value), and confidence intervals to interpret the findings.

2. Constant (Intercept): B = 190.524, t = 4.588, p < 0.001: The constant term (190.524) is the estimated calcium intake when all predictors (Knowledge Score, Diet, and Exercise) are zero. Since the p-value is less than 0.001, this constant is statistically significant, meaning it contributes meaningfully to the model.
3. Knowledge Score: B = 15.778, t = 7.199, p < 0.001: The coefficient for Knowledge Score is 15.778,
meaning that for every unit increase in Knowledge Score, calcium intake is expected to increase by 15.778
units, holding other variables constant. The t-value is large, and the p-value is significant, indicating that
Knowledge Score is a strong predictor of calcium intake.
4. Diet: B = 11.677, t = 1.476, p = 0.159: The coefficient for Diet is 11.677, suggesting that for every unit
increase in Diet, calcium intake is expected to increase by 11.677 units. However, the p-value (0.159) is
greater than the significance threshold of 0.05, indicating that Diet is not a statistically significant predictor in
this model. This suggests that the relationship between Diet and Calcium Intake may not be strong enough
to be reliable.
5. Exercise: B = 0.911, t = 0.795, p = 0.438: The coefficient for Exercise is 0.911, meaning that for each
unit increase in Exercise, calcium intake is expected to increase by 0.911 units. However, with a p-value of
0.438, which is above the threshold of 0.05, Exercise does not appear to significantly contribute to predicting
calcium intake.
Collinearity:
• The VIF (Variance Inflation Factor) values for Knowledge Score (22.955),
Diet (6.682), and Exercise (24.060) are relatively high, which may indicate
multicollinearity in the model. This suggests that the predictors might be
correlated with each other, which can affect the stability of the regression
coefficients. However, VIF values above 10 typically suggest a serious
multicollinearity problem, which in this case may require further investigation.

Confidence Intervals:
• The 95% Confidence Intervals for the coefficients (e.g., for Knowledge
Score: 11.132 to 20.425) provide the range within which the true population
parameter is likely to fall. For example, we are 95% confident that the true
effect of Knowledge Score on calcium intake lies between 11.132 and 20.425.
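The reported interval for Knowledge Score can be reconstructed from B and t alone: the implied standard error is SE = B / t, and the 95% CI is B ± t_crit × SE, where the two-tailed 5% critical value is about 2.120 for df = 16:

```python
# Reconstruct the 95% CI for Knowledge Score from the reported B and t.
b = 15.778       # reported coefficient for Knowledge Score
t = 7.199        # reported t-value
se = b / t       # standard error implied by B and t
t_crit = 2.120   # two-tailed 5% critical value of the t-distribution, df = 16

lower = b - t_crit * se
upper = b + t_crit * se
print(round(lower, 3), round(upper, 3))  # ≈ 11.132 20.424, matching the output
```

The tiny discrepancy in the last decimal of the upper bound comes from rounding in the reported B and t values.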
Interpretation of the Coefficients
Regression Equation
From the Unstandardized Coefficients (B column) reported above, we can write the regression equation:
Calcium Intake = 190.524 + 15.778 × Knowledge Score + 11.677 × Diet + 0.911 × Exercise

This means:
•When all predictors are 0, the estimated Calcium Intake is 190.524 mg/day (the intercept).
•For each 1-point increase in Knowledge Score, Calcium Intake increases by 15.778 mg/day, holding Diet and Exercise constant.
Conclusion
Knowledge Score is a statistically significant and strong predictor of calcium intake, with a high positive relationship. Diet and Exercise are not statistically significant predictors in this model; given their very high correlations with Knowledge Score and the large VIF values, their unique contributions are difficult to separate from that of Knowledge Score (multicollinearity), so their non-significance should be interpreted with caution.

Overall, the results suggest that Knowledge Score is the most reliable predictor of calcium intake among the variables considered here.
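As a final sanity check, an OLS fit always passes through the means of the variables, so plugging the reported predictor means into an equation built from the reported unstandardized coefficients should reproduce the mean calcium intake:

```python
# Unstandardized coefficients reported in the regression output.
intercept = 190.524
b_knowledge, b_diet, b_exercise = 15.778, 11.677, 0.911

# Predictor means from the descriptive statistics.
mean_knowledge, mean_diet, mean_exercise = 38.50, 3.40, 39.60

predicted = (intercept
             + b_knowledge * mean_knowledge
             + b_diet * mean_diet
             + b_exercise * mean_exercise)
print(round(predicted, 2))  # 873.75, the reported mean calcium intake
```

The agreement with the reported mean (873.75 mg/day) confirms that the coefficients and descriptive statistics are internally consistent.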
