Regression Analysis: Chapters 1 and 2
Yabebal Ayalew
Introduction
Outline
y = β0 + β1 x
where y is the value on the y-axis, x is the value on the x-axis, β1 is the slope of the line, which shows
how steep the line is, and β0 is the y-intercept, the point where the line crosses the y-axis.
• Regression Analysis is a statistical method used to examine the relationship between one dependent
variable (often referred to as the outcome or response variable) and one or more independent
variables (also called predictors or explanatory variables).
• Regression methods are concerned with two types of variables: the explanatory (or independent)
variables x and the dependent variables y. The collection of methods referred to as
regression methods has several objectives:
1 Modeling of the response y given x such that the underlying structure of the influence of x on
y is found.
2 Quantification of the influence of x on y.
3 Prediction of y given an observation x.
µ = β0 + β1 x1 + β2 x2 + · · · + βp xp = β0 + xᵀβ
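To make the linear predictor concrete, here is a minimal numpy sketch that evaluates µ = β0 + xᵀβ for a single observation; the coefficient and predictor values are purely illustrative and are not taken from any example in these notes.

```python
import numpy as np

# Hypothetical coefficients for p = 3 predictors (illustrative values only)
beta0 = 1.5
beta = np.array([0.8, -0.3, 2.0])

# One observation x = (x1, x2, x3)
x = np.array([2.0, 1.0, 0.5])

# mu = beta0 + x^T beta
mu = beta0 + x @ beta
print(mu)  # 1.5 + 1.6 - 0.3 + 1.0 = 3.8
```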
Regression analysis is a versatile tool that finds applications across numerous fields, allowing
for data-driven decision-making, forecasting, and understanding complex relationships
1 Economics and Finance
— Forecasting Economic Indicators: Economists use regression analysis to predict key indicators
like GDP, inflation rates, unemployment, and stock prices based on historical data and
influencing factors.
— Risk Management: In finance, regression models help assess the relationship between various
risk factors (e.g., interest rates, market indices) and the returns of financial assets, aiding in
portfolio management and risk assessment.
— Pricing Models: Regression is used in developing pricing models for options, bonds, and other
financial instruments, often involving multiple variables like interest rates and asset prices.
4 Social Sciences
— Sociology and Psychology: Researchers use regression to explore the relationship between
social factors (e.g., education, income, family background) and outcomes like crime rates,
educational attainment, or mental health.
— Education: Regression models help assess the impact of different teaching methods, class sizes,
and other factors on student performance, guiding educational policy and practice.
5 Engineering and Manufacturing
— Quality Control: In manufacturing, regression analysis is used to identify the factors that most
influence product quality, enabling companies to optimize production processes and reduce
defects.
— Reliability Engineering: Engineers use regression models to predict the lifespan of products or
components based on various stress factors, guiding maintenance and warranty decisions.
— Process Optimization: Regression helps in modeling and optimizing manufacturing processes
by analyzing the relationship between input variables (like temperature, pressure) and output
quality.
6 Environmental Science
— Climate Change Studies: Scientists use regression analysis to model the relationship between
greenhouse gas emissions and temperature changes, helping to understand and predict the
impacts of climate change.
— Pollution Control: Regression models are applied to assess the relationship between pollutant
levels and health outcomes, guiding environmental regulations and public health interventions.
— Resource Management: In agriculture and forestry, regression is used to model crop yields,
forest growth, and resource depletion, informing sustainable management practices.
7 Agriculture
— Crop Yield Prediction: Farmers and agricultural scientists use regression to predict crop yields
based on factors such as soil quality, weather conditions, and agricultural practices.
— Livestock Management: Regression models help optimize feeding strategies and improve
livestock health by analyzing the impact of diet, environment, and genetics on animal growth
and productivity.
11 Make Predictions
— Use the regression equation to make predictions
about the dependent variable for new or unseen data.
— Calculate confidence intervals for the predictions to
understand the uncertainty around them (a code sketch follows after this list).
12 Report and Communicate Findings
— Prepare a detailed report or presentation that
includes the methodology, regression model, results,
and interpretations.
— Use graphs (e.g., scatter plots with a regression line)
to represent the relationship and findings visually.
— Tailor the communication of your results to your
audience, whether they are statisticians,
stakeholders, or the general public.
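As one way to carry out the prediction step, the sketch below fits a simple linear regression with statsmodels and produces point predictions together with 95% confidence and prediction intervals for new X values. The training data reuse the advertising-budget example given later in these notes; the new X values are illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Advertising-budget example data (also used later in these notes)
x = np.array([10, 15, 20, 25, 30, 35], dtype=float)
y = np.array([25, 30, 45, 50, 55, 60], dtype=float)

# Fit Y = b0 + b1*X by ordinary least squares
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Predict for new, unseen X values with 95% intervals
x_new = np.array([12, 28, 40], dtype=float)       # illustrative new budgets
pred = fit.get_prediction(sm.add_constant(x_new))
print(pred.summary_frame(alpha=0.05))  # mean, mean CI, and prediction (obs) CI
```

The prediction (observation) interval is wider than the confidence interval for the mean, which matches the distinction between estimating E(Y) and predicting a new observation discussed later in these notes.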
5 Which of the following is NOT typically a part of data preparation in regression analysis?
A. Data cleaning
B. Model validation
C. Data transformation
D. Descriptive statistics calculation
E. Outlier correction
6 Descriptive statistics in regression analysis help to:
A. Predict future outcomes
B. Understand the basic features of the data
C. Test the significance of coefficients
D. Split the data into training and testing sets
E. Determine the goodness of fit
7 When should you use multiple linear regression instead of simple linear regression?
A. When there is one independent variable
B. When the dependent variable is categorical
C. When there are two or more independent variables
D. When the relationship between variables is non-linear
E. When the model requires polynomial terms
8 Which type of regression would be most appropriate if your dependent variable is a binary outcome
(e.g., pass/fail)?
A. Simple linear regression
B. Multiple linear regression
C. Logistic regression
D. Polynomial regression
E. Ridge regression
9 Which type of regression analysis is suitable for modeling non-linear relationships between variables?
A. Simple linear regression
B. Multiple linear regression
C. Logistic regression
D. Polynomial regression
E. Stepwise regression
10 Which of the following is NOT a recommended component of a regression analysis report?
A. Visualization of data and model results
B. Discussion of model assumptions and their validation
C. Detailed methodology description
D. Instructions for future data collection
E. Interpretation of the coefficients and findings
13 In an agricultural study, a researcher uses regression analysis to predict crop yield based on factors
such as rainfall, soil quality, and fertilizer usage. What is the dependent variable in this scenario?
A. Rainfall
B. Soil quality
C. Fertilizer usage
D. Crop yield
14 In a marketing campaign, a company uses regression analysis to predict customer spending based on
demographic data and past purchase behavior. Which of the following could be a dependent
variable?
A. Age of the customer
B. Customer spending
C. Past purchase behavior
D. Gender of the customer
• Simple linear regression assumes a linear relationship between the independent variable (X) and
the dependent variable (Y ). This means that as X changes, Y changes in a way that a straight line
can represent.
• The relationship between X and Y is described by the regression equation
Yi = β0 + β1 Xi + ϵi i = 1, 2, . . . , n (1)
where Yi is the value of the response variable in the ith trial, β0 and β1 are parameters, Xi is a
known constant, namely, the value of the predictor variable in the ith trial, ϵi is the random error
term with mean E(ϵi ) = 0 and variance σ 2 (ϵi ) = σ 2 ; ϵi and ϵj are uncorrelated so that their
covariance is zero (i.e., Cov(ϵi , ϵj ) = 0 for all i, j: i ̸= j)
E(Yi ) = E(β0 + β1 Xi + ϵi )
= E(β0 + β1 Xi ) + E(ϵi )
= β0 + β1 Xi
3 The error terms ϵi are assumed to have constant variance σ 2 . It therefore follows that the responses
Yi have the same constant variance:
σ²(Yi) = Var(β0 + β1 Xi + ϵi)
       = Var(β0 + β1 Xi) + Var(ϵi)
       = 0 + σ²
       = σ²
Thus, regression model (1) assumes that the probability distributions of Y have the same variance
σ 2 , regardless of the level of the predictor variable X
4 The error terms are assumed to be uncorrelated. Since the error terms ϵi and ϵj are uncorrelated, so
are the responses Yi and Yj
• The parameters β0 and β1 in regression model (1) are called regression coefficients. β1 is the slope of the
regression line. It indicates the change in the mean of the probability distribution of Y per unit increase in X.
• The parameter β0 is the Y intercept of the regression line. When the scope of the model includes X = 0, β0
gives the mean of the probability distribution of Y at X = 0. When the scope of the model does not cover
X = 0, β0 does not have any particular meaning as a separate term in the regression model.
Data structure for simple linear regression: the observed pairs (Y1, X1), (Y2, X2), . . . , (Yn, Xn).
The least squares principle for the simple linear regression model is to find the estimates β̂0
and β̂1 such that the sum of the squared distance from actual response Yi and predicted
response Ŷi = β̂0 + β̂1 Xi reaches the minimum among all possible choices of regression
coefficients, β0 and β1 . i.e.,
(β̂0, β̂1) = arg min over (β̂0, β̂1) of ∑ᵢ₌₁ⁿ (Yi − Ŷi)²
          = arg min over (β̂0, β̂1) of ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi)²
The motivation behind the least squares method is to find parameter estimates by choosing
the regression line that is closest to all the data points (Xi, Yi).
Setting the partial derivatives of the least squares criterion to zero gives the normal equations. For β̂0:

∂/∂β̂0 ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi)² = −2 ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi) = 0
⟹ ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi) = 0
⟹ ∑ᵢ₌₁ⁿ Yi − nβ̂0 − β̂1 ∑ᵢ₌₁ⁿ Xi = 0

For β̂1:

∂/∂β̂1 ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi)² = −2 ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi)Xi = 0
⟹ ∑ᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi)Xi = 0
⟹ ∑ᵢ₌₁ⁿ Yi Xi − β̂0 ∑ᵢ₌₁ⁿ Xi − β̂1 ∑ᵢ₌₁ⁿ Xi² = 0
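The two normal equations are linear in β̂0 and β̂1, so they can be solved directly. The sketch below does this with numpy and compares the result with the closed-form expressions; the data are the advertising-budget figures used in the example later in these notes.

```python
import numpy as np

# Advertising-budget example data
X = np.array([10, 15, 20, 25, 30, 35], dtype=float)
Y = np.array([25, 30, 45, 50, 55, 60], dtype=float)
n = len(X)

# Normal equations in matrix form:
#   [ n       sum(X)   ] [b0]   [ sum(Y)   ]
#   [ sum(X)  sum(X^2) ] [b1] = [ sum(X*Y) ]
A = np.array([[n, X.sum()], [X.sum(), (X**2).sum()]])
b = np.array([Y.sum(), (X * Y).sum()])
b0_hat, b1_hat = np.linalg.solve(A, b)

# Closed-form solution for comparison
b1_closed = ((X - X.mean()) * Y).sum() / ((X - X.mean()) * X).sum()
b0_closed = Y.mean() - b1_closed * X.mean()
print(b0_hat, b1_hat)       # same as the closed form
print(b0_closed, b1_closed)
```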
Attention!
• The random disturbance term, often denoted by ϵ, represents the true error in the relationship
between the dependent variable Y and the independent variable X.
ϵi = Yi − E(Yi )
• It captures the effects of all the factors that influence Y other than X. This includes unobserved
variables, measurement errors, and any inherent randomness in the relationship.
• The residual, often denoted by e, is the difference between the observed value Yi and the value
predicted by the estimated regression model Ŷi
ei = Yi − Ŷi
• The residual is an observable quantity calculated from the data and the fitted regression model.
Unlike the random disturbance term, which is theoretical and unobservable, the residual is what we
use to assess the model’s fit.
2 The sum of the squared residuals, ∑ᵢ ei², is a minimum. This was the requirement to be satisfied in
deriving the least squares estimators of the regression parameters, since the sum of squared errors to be
minimized equals ∑ᵢ ei² when the least squares estimators β̂0 and β̂1 are used for estimating β0 and β1.
3 The sum of the observed values Yi equals the sum of the fitted values Ŷi :
∑ᵢ₌₁ⁿ ei = ∑ᵢ₌₁ⁿ (Yi − Ŷi) = ∑ᵢ₌₁ⁿ Yi − ∑ᵢ₌₁ⁿ Ŷi = 0, hence ∑ᵢ₌₁ⁿ Yi = ∑ᵢ₌₁ⁿ Ŷi
4 The sum of the residuals weighted by Xi is zero, i.e., ∑ᵢ Xi ei = 0.
Proof. We know that β̂0 = Ȳ − β̂1 X̄, ei = Yi − Ŷi = (Yi − Ȳ) − β̂1(Xi − X̄), and

β̂1 = ∑ᵢ₌₁ⁿ (Xi − X̄)Yi / ∑ᵢ₌₁ⁿ (Xi − X̄)² = ∑ᵢ₌₁ⁿ (Yi − Ȳ)Xi / ∑ᵢ₌₁ⁿ (Xi − X̄)Xi

Then

∑ᵢ₌₁ⁿ Xi ei = ∑ᵢ₌₁ⁿ Xi (Yi − Ŷi) = ∑ᵢ₌₁ⁿ Xi [(Yi − Ȳ) − β̂1(Xi − X̄)]
            = ∑ᵢ₌₁ⁿ Xi (Yi − Ȳ) − β̂1 ∑ᵢ₌₁ⁿ Xi (Xi − X̄)
            = ∑ᵢ₌₁ⁿ Xi (Yi − Ȳ) − ∑ᵢ₌₁ⁿ Xi (Yi − Ȳ) = 0,

where the last step uses β̂1 ∑ᵢ Xi (Xi − X̄) = ∑ᵢ Xi (Yi − Ȳ), which follows from the second expression for β̂1 above.
5 A consequence of the properties ∑ᵢ ei = 0 and ∑ᵢ Xi ei = 0 is that the sum of the weighted residuals is
zero when the residual in the ith trial is weighted by the fitted value of the response variable for the
ith trial, i.e., ∑ᵢ Ŷi ei = 0.
Proof.

∑ᵢ₌₁ⁿ Ŷi ei = ∑ᵢ₌₁ⁿ (β̂0 + β̂1 Xi) ei = β̂0 ∑ᵢ₌₁ⁿ ei + β̂1 ∑ᵢ₌₁ⁿ Xi ei = β̂0 × 0 + β̂1 × 0 = 0
Example: A company wants to understand the relationship between its advertising budget (in
thousands of dollars) and the resulting sales (in thousands of units). They have collected data
over six months. The data is as follows:
Month 1 2 3 4 5 6
Advertising Budget (X) 10 15 20 25 30 35
Sales (Y ) 25 30 45 50 55 60
Using the data provided, fit a simple linear regression line to predict sales based on the
advertising budget.
Solution: The means of X and Y are

X̄ = (1/n) ∑ᵢ₌₁ⁿ Xi = (10 + 15 + 20 + 25 + 30 + 35)/6 = 135/6 = 22.5
Ȳ = (1/n) ∑ᵢ₌₁ⁿ Yi = (25 + 30 + 45 + 50 + 55 + 60)/6 = 265/6 ≈ 44.17

The slope of the regression line is calculated using the following formula:

β̂1 = ∑ᵢ₌₁⁶ (Xi − X̄)Yi / ∑ᵢ₌₁⁶ (Xi − X̄)Xi
    = [(10 − 22.5)25 + (15 − 22.5)30 + · · · + (35 − 22.5)60] / [(10 − 22.5)10 + (15 − 22.5)15 + · · · + (35 − 22.5)35]
    = 637.5/437.5 ≈ 1.46

The intercept is then β̂0 = Ȳ − β̂1 X̄ ≈ 44.17 − 1.457(22.5) ≈ 11.38, so the fitted line is Ŷ = 11.38 + 1.46X.
Exercise: Based on the above example, show that the properties of the fitted line hold, i.e.,
∑ᵢ ei = ∑ᵢ Xi ei = ∑ᵢ Ŷi ei = 0, and that the regression line passes through the point (X̄, Ȳ).
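A quick numerical check of this exercise: the sketch below fits the advertising data with numpy and verifies that the residual properties hold (up to floating-point rounding).

```python
import numpy as np

# Advertising example data from the notes
X = np.array([10, 15, 20, 25, 30, 35], dtype=float)
Y = np.array([25, 30, 45, 50, 55, 60], dtype=float)

b1 = ((X - X.mean()) * Y).sum() / ((X - X.mean()) * X).sum()   # ≈ 1.457
b0 = Y.mean() - b1 * X.mean()                                  # ≈ 11.38
Y_hat = b0 + b1 * X
e = Y - Y_hat

print(e.sum())            # ≈ 0
print((X * e).sum())      # ≈ 0
print((Y_hat * e).sum())  # ≈ 0
print(b0 + b1 * X.mean(), Y.mean())  # line passes through (X̄, Ȳ)
```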
Important Results
1
∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) = ∑ᵢ₌₁ⁿ (Xi − X̄)Yi = ∑ᵢ₌₁ⁿ (Yi − Ȳ)Xi = ∑ᵢ₌₁ⁿ Xi Yi − nX̄Ȳ

then,

∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) = (n − 1)Cov(X, Y)
4
∑ᵢ₌₁ⁿ (Xi − X̄)² = ∑ᵢ₌₁ⁿ (Xi − X̄)Xi = ∑ᵢ₌₁ⁿ Xi² − nX̄²
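These identities are easy to verify numerically. The sketch below checks results 1 and 4 on arbitrary illustrative data with numpy.

```python
import numpy as np

# Any numeric data will do; these values are illustrative only
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
Y = np.array([1.0, 3.0, 4.0, 6.0, 8.0])
n = len(X)

# Result 1: three equivalent forms of the cross-product sum
lhs1 = ((X - X.mean()) * (Y - Y.mean())).sum()
print(lhs1, ((X - X.mean()) * Y).sum(), (X * Y).sum() - n * X.mean() * Y.mean())
print(lhs1, (n - 1) * np.cov(X, Y, ddof=1)[0, 1])   # equals (n-1)·Cov(X, Y)

# Result 4: equivalent forms of the sum of squared deviations of X
lhs4 = ((X - X.mean()) ** 2).sum()
print(lhs4, ((X - X.mean()) * X).sum(), (X ** 2).sum() - n * X.mean() ** 2)
```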
Desirable Properties of the Least-Squares Estimator
Assume E(ϵ|X) = 0 and Var(ϵ|X) = σ².
Theorem: The least squares estimator β̂1 is an unbiased estimator of β1.
Proof.

E(β̂1) = E[ ∑ᵢ₌₁ⁿ (Xi − X̄)Yi / ∑ᵢ₌₁ⁿ (Xi − X̄)² ] = [1 / ∑ᵢ₌₁ⁿ (Xi − X̄)²] ∑ᵢ₌₁ⁿ (Xi − X̄)E(Yi)
      = ∑ᵢ₌₁ⁿ (Xi − X̄)(β0 + β1 Xi) / ∑ᵢ₌₁ⁿ (Xi − X̄)²
      = [β0 ∑ᵢ₌₁ⁿ (Xi − X̄) + β1 ∑ᵢ₌₁ⁿ (Xi − X̄)Xi] / ∑ᵢ₌₁ⁿ (Xi − X̄)²
      = [β0 · 0 + β1 ∑ᵢ₌₁ⁿ (Xi − X̄)²] / ∑ᵢ₌₁ⁿ (Xi − X̄)² = β1

since ∑ᵢ (Xi − X̄) = 0 and ∑ᵢ (Xi − X̄)Xi = ∑ᵢ (Xi − X̄)².
Theorem: The least squares estimator β̂0 is an unbiased estimator of β0
Proof.

E(β̂0) = E(Ȳ − β̂1 X̄) = E(Ȳ) − X̄E(β̂1) = (1/n) ∑ᵢ₌₁ⁿ E(Yi) − β1 X̄
      = (1/n) ∑ᵢ₌₁ⁿ (β0 + β1 Xi) − β1 X̄
      = β0 + β1 X̄ − β1 X̄ = β0
Theorem: The variance of β̂1 is Var(β̂1) = σ² / ∑ᵢ₌₁ⁿ (Xi − X̄)².
Proof. Writing β̂1 = ∑ᵢ₌₁ⁿ ki Yi with ki = (Xi − X̄)/∑ⱼ₌₁ⁿ (Xj − X̄)², and using the fact that the Yi are uncorrelated,

Var(β̂1) = Var(∑ᵢ₌₁ⁿ ki Yi) = ∑ᵢ₌₁ⁿ Var(ki Yi) = ∑ᵢ₌₁ⁿ ki² Var(Yi) = σ² ∑ᵢ₌₁ⁿ ki²,

and since ∑ᵢ ki² = ∑ᵢ (Xi − X̄)² / [∑ᵢ (Xi − X̄)²]² = 1 / ∑ᵢ (Xi − X̄)²,

Var(β̂1) = σ² / ∑ᵢ₌₁ⁿ (Xi − X̄)²
Theorem: The variance of β̂0 is Var(β̂0) = σ² [1/n + X̄² / ∑ᵢ₌₁ⁿ (Xi − X̄)²].
Proof.

Var(β̂0) = Var(Ȳ − β̂1 X̄) = Var(Ȳ) + X̄² Var(β̂1) − 2X̄ Cov(Ȳ, β̂1)
        = σ²/n + σ² X̄² / ∑ᵢ₌₁ⁿ (Xi − X̄)² − 2X̄ · 0
        = σ² [1/n + X̄² / ∑ᵢ₌₁ⁿ (Xi − X̄)²],

since Cov(Ȳ, β̂1) = 0.
• The population variance σ² is not known. As a result, we have to estimate it from the same data that are
used to estimate the regression coefficients.
• The unbiased estimator of σ² is

σ̂² = [1/(n − 2)] ∑ᵢ₌₁ⁿ (Yi − Ŷi)² = SSE/(n − 2) = MSE

• The estimated variances of the coefficients are therefore

V̂ar(β̂0) = MSE [1/n + X̄² / ∑ᵢ₌₁ⁿ (Xi − X̄)²],  and  V̂ar(β̂1) = MSE / ∑ᵢ₌₁ⁿ (Xi − X̄)²
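A small Monte Carlo sketch can illustrate these results: simulating many samples from the model with known β0, β1, and σ, the average of β̂1 should be close to β1 (unbiasedness) and its empirical variance close to σ²/∑(Xi − X̄)². The true parameter values and design points below are illustrative, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# True parameters and a fixed design (illustrative values only)
beta0, beta1, sigma = 2.0, 0.5, 1.0
X = np.linspace(0, 10, 20)
Sxx = ((X - X.mean()) ** 2).sum()

b1_draws = []
for _ in range(20000):
    Y = beta0 + beta1 * X + rng.normal(0, sigma, size=X.size)
    b1 = ((X - X.mean()) * Y).sum() / Sxx
    b1_draws.append(b1)

b1_draws = np.array(b1_draws)
print(b1_draws.mean())                  # ≈ beta1 = 0.5 (unbiasedness)
print(b1_draws.var(), sigma**2 / Sxx)   # ≈ sigma² / Σ(Xi − X̄)²
```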
For the temperature and reaction-time example, the estimate of the slope is β̂1 = −0.812 and the fitted
regression line is Ŷi = 64.1 − 0.812Xi. Thus, the fitted values are Ŷi = {51.9, 47.9, 43.8, 39.7, 35.7}.
The mean square error is

MSE = σ̂² = [1/(n − 2)] ∑ᵢ₌₁⁵ (Yi − Ŷi)² = (1/3)[(52.3 − 51.9)² + · · · + (36.1 − 35.7)²] = 0.235
Yi = β0 + β1 Xi + ϵi,  i = 1, 2, . . . , n

where ϵi is assumed to follow a normal probability distribution with mean zero and a constant variance σ²,
i.e., ϵi ∼ N(0, σ²).
• The normal error term greatly simplifies the theory of regression analysis. As the error is assumed to follow
a normal distribution, the outcome variable Yi is also expected to follow a normal probability distribution,
because of the property of the normal distribution stated in the review theorem below. i.e.,

Yi ∼ N(β0 + β1 Xi, σ²)

Review—Theorem: If X1, X2, . . . , Xn are independent random variables that are normally distributed,
with Xi ∼ N(µi, σi²), then any linear combination of these variables,

Y = a1 X1 + a2 X2 + · · · + an Xn,

is also normally distributed. Specifically, Y ∼ N(∑ᵢ₌₁ⁿ ai µi, ∑ᵢ₌₁ⁿ ai² σi²).
Since β̂0 and β̂1 are linear combinations of the Yi, their sampling distributions are also normal, with variances Var(β̂1) and Var(β̂0) that depend on the variance of the errors σ² and the sample data.
The CLT enables us to use normal approximation methods for inference on regression coefficients,
allowing us to:
• Construct confidence intervals for β0 and β1 .
• Perform hypothesis tests to assess if coefficients are significantly different from zero.
t = β̂1 / s.e.(β̂1) ∼ t(n − 2)

where s.e.(β̂1) is the standard error of β̂1, the positive square root of the estimated variance of β̂1.
• Decision Rule:
1 Significance Level: Set a significance level α (commonly 0.05).
2 Critical Value: Use the t-distribution with n − 2 degrees of freedom to find the critical value
tα/2 (n − 2).
3 Compare Test Statistic: If |t| > tα/2 (n − 2), reject the null hypothesis H0 ; otherwise, do not
reject H0 .
• Conclusion and Interpretation
— Reject H0 : Conclude that there is a statistically significant linear relationship between X and Y .
— Do Not Reject H0 : Conclude that there is insufficient evidence to claim a linear relationship
between X and Y .
• Hypothesis testing for β1 is essential for determining the relevance of predictors in regression models.
It helps establish whether observed relationships are likely to be due to chance or represent real
associations.
• In many cases, we test whether β1 is equal to zero (no effect), but we may also be interested in
testing if β1 equals a specific non-zero value, such as β10 .
• To determine if the effect of X on Y differs significantly from a specified value β10 , we set up the
following hypotheses:
H0 : β1 = β10
H1 : β1 ̸= β10
• To test the hypothesis, we calculate a t-statistic that compares the observed estimate β̂1 to β10:

t = (β̂1 − β10) / s.e.(β̂1) ∼ t(n − 2)
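A minimal sketch of this test in Python, using scipy for the t distribution. The slope estimate, standard error, and sample size below are placeholder values for illustration, not results from the notes.

```python
import numpy as np
from scipy import stats

# Hypothetical fitted quantities (illustrative values only)
b1_hat = 1.46      # estimated slope
se_b1 = 0.22       # standard error of the slope
n = 6              # sample size
beta10 = 0.0       # hypothesized slope value under H0

t_stat = (b1_hat - beta10) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
t_crit = stats.t.ppf(0.975, df=n - 2)             # critical value at α = 0.05

print(t_stat, p_value)
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```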
• To determine whether β0 significantly differs from a hypothesized value β00 , we set up the following
hypotheses:
H0 : β0 = β00
H1 : β0 ̸= β00
The null hypothesis specifies the expected value of Y when X = 0. In many applications, we
test whether β0 is equal to zero, meaning there is no baseline effect when X = 0.
• Under H0 : β0 = β00, we use a t-statistic to test the intercept:

t = (β̂0 − β00) / s.e.(β̂0) ∼ t(n − 2)
The hypotheses to test whether the temperature has a significant impact on the reaction time
of a chemical reaction are:
H0 : β1 = 0
H1 : β1 ̸= 0
The test statistic is

t = β̂1 / s.e.(β̂1) = −0.812/0.0306 = −26.5

The tabulated value is t0.05/2(5 − 2) = t0.025(3) = 3.18. Since |t| = 26.5 > 3.18, we reject H0 and conclude
that temperature has a statistically significant effect on the reaction time.
Observations
— The variability of the sampling distribution of Ŷh is affected by how far Xh is from X̄: the farther Xh is
from X̄, the greater the quantity (Xh − X̄)² and the larger the variance of Ŷh.
— When Xh = 0, the variance of Ŷh reduces to Var(β̂0).
A 100(1 − α)% confidence interval for E(Yh) is Ŷh ± tα/2(n − 2) · s.e.(Ŷh), where s.e.(Ŷh) is the positive
square root of the estimated variance of Ŷh.
H0 : E(Yh ) = Yh0
H1 : E(Yh ) ̸= Yh0
H0 : E(Yh ) = 40
H1 : E(Yh ) ̸= 40
• We now consider the prediction of a new observation Y corresponding to a given level X of the
independent variable.
• When we estimate Y based on values of X that were not used in fitting the model parameters and
fall outside the observed range of X, this process is known as extrapolation.
• Let X0 be the new value of the independent variable, and let Y0 be the corresponding future observation.
Y0 is unknown and is independent of the prediction Ŷ0 = β̂0 + β̂1 X0. Therefore, the confidence interval for
E(Yh) is inappropriate here: it is an interval estimate of the mean of Y, not a probability statement about a
future observation drawn from a normal probability distribution.
• Now, let’s develop the prediction interval for the future observation Y0 . Let ϕ = Yo − Ŷ0 be a
normal random variable with mean zero and variance
(X0 − X̄)2
2 1
V ar(ϕ) = V ar(Y0 − Ŷ0 ) = σ 1 + + Pn
n i=1 (Xi − X̄)
2
Example: Consider the temperature and reaction time data. Construct a 95% prediction
interval for Y0 when the temperature is X0 = 40.
Solution: The new value of the independent variable, X0 = 40, is outside the range of X.
Thus, we are predicting a new reaction time Y0. The point prediction is Ŷ0 = 64.1 − 0.812(40) = 31.62,
and the standard error of Ŷ0 is given as 0.552. The 95% prediction interval is then obtained as
Ŷ0 ± t0.025(3) × standard error, as sketched below.
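A short numerical sketch of the interval, assuming the stated 0.552 is used directly as the standard error for the new observation (the original data for this example are not reproduced here):

```python
import numpy as np
from scipy import stats

# Quantities given in the notes' temperature example (X0 = 40)
b0, b1 = 64.1, -0.812
se_pred = 0.552          # stated standard error, assumed to apply to Y0 - Ŷ0
n = 5

y0_hat = b0 + b1 * 40                        # ≈ 31.62
t_crit = stats.t.ppf(0.975, df=n - 2)        # ≈ 3.18
lower, upper = y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred
print(y0_hat, (lower, upper))                # ≈ 31.62, roughly (29.9, 33.4)
```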
• Covariance measures the degree to which two variables change together. Specifically, it quantifies
whether an increase in one variable generally corresponds to an increase (or decrease) in the other
variable.
Cov(X, Y) = [1/(n − 1)] ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ)
• Covariance is in units obtained by multiplying the units of X and Y , which makes it scale-dependent.
This lack of standardization can make it hard to compare covariances between different variable pairs.
• Limitations
— Covariance alone doesn’t provide a measure of the strength of the relationship between variables.
— It is affected by the scale of the variables, so it’s not useful for comparing relationships across
datasets with different units or ranges.
• The correlation coefficient (often denoted r for sample data or ρ for population data) standardizes
the covariance by dividing by the product of the standard deviations of the variables. This gives a
measure of both the direction and strength of a linear relationship, with values standardized between
−1 and 1.

r = Cov(X, Y) / √[Var(X)Var(Y)] = ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / [√(∑ᵢ₌₁ⁿ (Xi − X̄)²) √(∑ᵢ₌₁ⁿ (Yi − Ȳ)²)]
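The sketch below computes the sample covariance and the Pearson correlation both from the formulas above and with numpy's built-in functions; the data are illustrative (the advertising example values from earlier in these notes).

```python
import numpy as np

# Illustrative data
X = np.array([10, 15, 20, 25, 30, 35], dtype=float)
Y = np.array([25, 30, 45, 50, 55, 60], dtype=float)
n = len(X)

cov_xy = ((X - X.mean()) * (Y - Y.mean())).sum() / (n - 1)
r = cov_xy / (X.std(ddof=1) * Y.std(ddof=1))

print(cov_xy, np.cov(X, Y)[0, 1])    # manual vs numpy covariance
print(r, np.corrcoef(X, Y)[0, 1])    # manual vs numpy correlation
```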
• Interpretation
— r = 1: Perfect positive correlation, meaning X and Y move together in a perfectly linear way.
— r = −1: Perfect negative correlation, meaning X and Y move in exactly opposite directions in a
perfectly linear way.
— r = 0: No linear correlation between X and Y . (Again, this doesn’t rule out a non-linear
relationship.)
• Correlation is unitless because it standardizes the covariance by dividing by the standard deviations.
This makes it possible to compare correlations across different datasets or variable pairs.
Limitations
• Correlation does not imply causation; it only indicates an association.
• Outliers can significantly affect the correlation coefficient, especially with Pearson’s correlation.
• It’s sensitive to linear relationships but may miss strong non-linear associations between variables.
• To test the hypothesis that there is no linear relationship between two variables (denoted as X and
Y ) in a population, we can conduct a hypothesis test on the population correlation coefficient, ρ
H0 : ρ = 0
H1 : ρ ̸= 0
z ± Zα/2 s.e(z)
• Convert the confidence interval limits for z back to the correlation coefficient scale using the inverse
Fisher transformation, i.e., tanh(z ± Zα/2 · s.e.(z)).
• Thus, a 100(1 − α)% confidence interval for ρ is

tanh(arctanh(r) − Zα/2/√(n − 3)) ≤ ρ ≤ tanh(arctanh(r) + Zα/2/√(n − 3))
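A minimal sketch of this confidence interval using numpy's arctanh/tanh and scipy's normal quantile; the inputs r = 0.713 and n = 100 are taken from the BMI and SBP example discussed later in these notes.

```python
import numpy as np
from scipy import stats

# Sample correlation and sample size (from the BMI/SBP example)
r, n, alpha = 0.713, 100, 0.05

z = np.arctanh(r)                       # Fisher transformation of r
se_z = 1 / np.sqrt(n - 3)               # standard error on the z scale
z_crit = stats.norm.ppf(1 - alpha / 2)  # ≈ 1.96

lower, upper = np.tanh(z - z_crit * se_z), np.tanh(z + z_crit * se_z)
print(lower, upper)                     # 95% CI for the population correlation ρ
```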
The slope estimate can also be rewritten in terms of the sample standard deviations of X and Y and
the sample correlation coefficient r:

β̂1 = r √(∑ᵢ₌₁ⁿ (Yi − Ȳ)²) / √(∑ᵢ₌₁ⁿ (Xi − X̄)²) = r (Sy/Sx)
Decomposing the SST into variations explained by the regression model and unexplained by the
regression model gives birth to the analysis of variance (ANOVA)
SST = ∑ᵢ₌₁ⁿ (Yi − Ȳ)² = ∑ᵢ₌₁ⁿ [(Yi − Ŷi) + (Ŷi − Ȳ)]²
    = ∑ᵢ₌₁ⁿ [(Yi − Ŷi)² + 2(Yi − Ŷi)(Ŷi − Ȳ) + (Ŷi − Ȳ)²]
    = ∑ᵢ₌₁ⁿ (Yi − Ŷi)² + 2 ∑ᵢ₌₁ⁿ (Yi − Ŷi)(Ŷi − Ȳ) + ∑ᵢ₌₁ⁿ (Ŷi − Ȳ)²

The cross-product term is zero (by the properties of the fitted line), so

SST = ∑ᵢ₌₁ⁿ (Yi − Ŷi)² + ∑ᵢ₌₁ⁿ (Ŷi − Ȳ)²
From our experience of OLS estimation, ∑ᵢ₌₁ⁿ (Yi − Ŷi)² is called the sum of squared errors or residual
sum of squares (SSE). This is the variation of Y that cannot be explained by the regression
model. Whereas ∑ᵢ₌₁ⁿ (Ŷi − Ȳ)² is the variation of Y that is explained by the regression line (SSR).
H0 : β 1 = 0
H1 : β1 ̸= 0
1 Test whether there is a significant linear relationship between BMI and SBP.
2 Construct ANOVA and test the overall goodness of fit of the model
3 Compute the coefficient of determination and interpret the result
Solution:
1 The appropriate hypothesis to test is
H0 : ρ = 0
H1 : ρ ̸= 0
The test statistic is

t = r√(n − 2) / √(1 − r²) = 0.713√(100 − 2) / √(1 − 0.713²) ≈ 10.1

where

r = ∑ᵢ (Xi − X̄)Yi / [√(∑ᵢ (Xi − X̄)²) √(∑ᵢ (Yi − Ȳ)²)] = 3746/√((1290)(21377)) = 0.713
The tabulated value is t0.025 (98) = 1.98. The test statistic is greater than this tabulated value.
Therefore, we have enough evidence to reject the null hypothesis and conclude that there is a
significant linear relationship between BMI and SBP. This gives us the green light that SBP can be
predicted using simple linear regression by taking BMI as the independent variable
2 The total variation which is captured by SST is 21377, and the sum square error (SSE) is given to
be 10498. Thus,
SSR = SST − SSE = 21377 − 10498 = 10879
Dividing the sums of squares by their corresponding degrees of freedom gives the mean squares, i.e.,

MSR = SSR/1 = 10879,  MSE = SSE/(n − 2) = 10498/98 ≈ 107

The F test statistic is then

F = MSR/MSE = 10879/107 ≈ 102
The tabulated value is F0.05 (1, 98) = 3.94. This value is less than the test statistic. So, we
have sufficient evidence to reject H0 : β1 = 0, and the model is good enough to capture
variation in the response variable (SBP)
The coefficient of determination is

R² = r² = 0.713² = SSR/SST = 10879/21377 ≈ 0.51

About 51% of the variation in SBP is explained by BMI alone.
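As a check on the arithmetic, the sketch below recomputes the ANOVA quantities, the F statistic, R², and the slope t statistic from the summary figures given in this example; small discrepancies are only rounding.

```python
import numpy as np
from scipy import stats

# Summary figures given in the SBP/BMI example
n = 100
SST, SSE = 21377.0, 10498.0
r = 0.713

SSR = SST - SSE                        # 10879
MSR, MSE = SSR / 1, SSE / (n - 2)      # 10879 and ≈ 107
F = MSR / MSE                          # ≈ 102
R2 = SSR / SST                         # ≈ 0.51
F_crit = stats.f.ppf(0.95, 1, n - 2)   # ≈ 3.94

t_slope = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # ≈ 10.1
print(F, F_crit, R2, t_slope**2)                  # note t² ≈ F
```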
• We have witnessed that the goodness-of-fit test of the simple linear regression has the same null and
alternative hypotheses as the test for the slope.
• This is because we have only one predictor. However, the goodness-of-fit test for multiple linear
regression is different from testing the significance of individual βs
Caution!
It is true that

t² = [β̂1 / s.e.(β̂1)]² = MSR/MSE = F

and (tα/2(n − 2))² = Fα(1, n − 2). So, for simple linear regression, we can conclude that testing the
significance of the slope means testing the overall goodness of fit of the model.
Keep in mind! This coincidence happens only in simple linear regression.
Yi = β0 + β1 Xi + ϵi , i = 1, 2, . . . , n