Multiple Linear Regression in Excel

This document discusses multiple linear regression analysis using Microsoft Excel. It begins by classifying regression models as either linear or nonlinear depending on whether the regression function is linear in its parameters. The main objectives of multiple linear regression are to determine the best parameter values to accurately predict dependent variable values and to assess the adequacy of the regression model. It then describes the regression input and output in Excel, including residuals, predicted values, and other statistical values used to evaluate the regression model.

Uploaded by

Thomas Santosa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

181 views19 pages

Multiple Linear Regression in Excel

Uploaded by

Thomas Santosa

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

MULTIPLE LINEAR REGRESSION ANALYSIS

USING MICROSOFT EXCEL

by Michael L. Orlov
Chemistry Department, Oregon State University (1996)

INTRODUCTION

In modern science, regression analysis is a necessary part of virtually almost any data
reduction process. Popular spreadsheet programs, such as Quattro Pro, Microsoft Excel,
and Lotus 1-2-3 provide comprehensive statistical program packages, which include a
regression tool among many others.

Usually, the regression module is explained clearly enough in on-line help and spreadsheet
documentation (i.e. items in the regression input dialog box). However, the description of
the output is minimal and is often a mystery for the user who is unfamiliar with certain
statistical concepts.

The objective of this short handout is to give a more detailed description of the regression
tool and to touch upon related statistical topics in a hopefully readable manner. It is
designed for science undergraduate and graduate students inexperienced in statistical
matters. The regression output in Microsoft Excel is pretty standard and is chosen as a
basis for illustrations and examples ( Quattro Pro and Lotus 1-2-3 use an almost identical
format).

CLASSIFICATION OF REGRESSION MODELS

In a regression analysis we study the relationship, called the regression function, between
one variable y, called the dependent variable, and several others x
i
, called the
independent variables. Regression function also involves a set of unknown parameters
b
i
. If a regression function is linear in the parameters (but not necessarily in the
independent variables ! ) we term it a linear regression model. Otherwise, the model is
called non-linear. Linear regression models with more than one independent variable are
referred to as multiple linear models, as opposed to simple linear models with one
independent variable.
2

The following notation is used in this work:

y - dependent variable (predicted by a regression model)
y* - dependent variable (experimental value)
p - number of independent variables (number of coefficients)
x
i
(i=1,2, p) - ith independent variable from total set of p variables
b
i
(i=1,2, p) - ith coefficient corresponding to x
i

b
0
- intercept (or constant)
k=p+1 - total number of parameters including intercept (constant)
n - number of observations ( experimental data points)
i =1,2 p - independent variables index
j=1,2, n - data points index

Now let us illustrate the classification of regression models with mathematical expressions:

Multiple linear model

General formula:

y = b
0
+ b
1
x
1
+ b
2
x
2
+ b
p
x
p
(1)
or
y = b
0
+
i
b
i
x
i
i=1,2, p (1a)

Polynomial (model is linear in parameters , but not in independent variables):

y = b
0
+ b
1
x + b
2
x
2
+ b
3
x
3
b
p
x
p
, which is just a specific case of (1)

with x
1
= x, x
2
= x
2
, x
3
= x
3
..x
p
= x
p

Simple linear model

y = b
0
+ b
1
x
1

It is obvious that simple linear model is just specific case of multiple one with k=2 (p=1)

Non-linear model

y = A(1-e
-Bx
),
where A, B are parameters

In further discussion we restrict ourselves to multiple linear regression analysis.
3
MAIN OBJECTIVES OF MULTIPLE LINEAR REGRESSION
ANALYSIS

Our primary goal is to determine the best set of parameters b
i
, such that the model predicts
experimental values of the dependent variable as accurately as possible (i.e. calculated
values y
j
should be close to experimental values y
j
* ).

We also wish to judge whether our model itself is adequate to fit the observed experimental
data (i.e. whether we chose the correct mathematical form of it).

We need to check whether all terms in our model are significant (i.e. is the improvement in
goodness of fit due to the addition of a certain term to the model bigger than the noise in
experimental data).

DESCRIPTION OF REGRESSION INPUT AND OUTPUT

The standard regression output of spreadsheet programs provides information to reach the
objectives raised in the previous section. Now we explain how to do that and touch upon
related statistical terms and definitions.

The following numerical example will be used throughout the handout to illustrate the
discussion:

Table 1. Original experimental data
Data point # y* z
j
1 20.6947 2.5
2 28.5623 3.1
3 157.0020 8.1
4 334.6340 12.2
5 406.5697 13.5
6 696.0331 17.9
7 945.1385 21.0

We choose y* to be the dependent experimental observable and z to be the independent
one. Suppose we have, say, theoretical reasons to believe that relationship between two is:

y* = b
0
+ b
1
*z + b
2
*z
2
+ b
3
*z
3
4
We can rewrite this expression in form (1):

y = b
0
+ b
1
x
1
+ b
2
x
2
+ b
3
x
3
, where (1b)
x
1
= z, x
2
= z
2
and x
2
= z
3

In the next step we prepare the spreadsheet input table for regression analysis:

Table 2. Regression input
Data point # Dependent var. Independent variables
j y* x
1
(=z) x
2
(=z
2
) x
3
(=z
3
)
1 20.6947 2.5 6.25 15.63
2 28.5623 3.1 9.61 29.79
3 157.0020 8.1 65.61 531.44
4 334.6340 12.2 148.84 1815.85
5 406.5697 13.5 182.25 2460.38
6 696.0331 17.9 320.41 5735.34
7 945.1385 21.0 441.00 9261.00

In order to perform a regression analysis we choose from the Microsoft Excel menu*:

Tools Data analysis Regression

Note that data analysis tool should have been previously added to Microsoft Excel during the program
setup (Tools Add-Ins Analysis ToolPak).

The pop-up input dialog box is shown on Fig.1. Elements of this box are described in on-
line help. Most of them become clear in the course of our discussion as well.

The Input Y range refers to the spreadsheet cells containing the independent variable y*
and the Input X range to those containing independent variables x ( in our example x =
x
1
, x
2
, x
3
)

(see Table 2). If we do not want to force our model through the origin we leave
the Constant is Zero box unchecked. The meaning of Confidence level entry will
become clear later. The block Output options allows one to choose the content and
locations of the regression output. The minimal output has two parts Regression
Statistics and ANOVA (ANalysis Of VAriance). Checking the appropriate boxes in
subblocks Residuals and Normal Probability will expand the default output
information. We omit from our discussion description of Normal Probability output.
Now we are ready to proceed with the discussion of the regression output.

* - In Quattro Pro the sequence is Tools - Numeric Tools - Analysis Tools - Advanced Regression. In the
last step instead of Advanced Regression, one can choose Regression from the menu. In this case
the simplified regression output will be obtained.
5
Fig. 1. Regression input dialog box

Residual output

Example

In Microsoft Excel the residual output has the following format:

Table3. Residual output*
Observation
(j)
Predicted Y
(y
j
)
Residuals
( r )
Standard
Residuals (r)
1 20.4424 0.2523 0.3351
2 28.9772 -0.4149 -0.5511
3 156.3982 0.6038 0.8020
4 335.5517 -0.9178 -1.2189
5 406.3355 0.2342 0.3111
6 695.6173 0.4159 0.5524
7 945.3121 -0.1736 -0.2305
* - Corresponding notation used in this handout is given in parenthesis

Residual (or error, or deviation) is the difference between the observed value y* of the
dependent variable for the jth experimental data point (x
1j
, x
2j
, , x
pj
, y
j
*) and the
6
corresponding value y
j
given by the regression function y
j
= b
0
+ b
1
x
1j
+ b
2
x
2j
+ b
p
x
pj
(y
j
= b
0
+ b
1
x
1j
+ b
2
x
2j
+ b
3
x
3j
in our example):

r
j
= y
j
* - y
j
(2)

Parameters b (b
0
, b
1
, b
2
, b
p
) are part of the ANOVA output (discussed later).

If there is an obvious correlation between the residuals and the independent variable x (say,
residuals systematically increase with increasing x), it means that the chosen model is not
adequate to fit the experiment (e.g. we may need to add an extra term x
4
=z
4
to our model
(1b)). A plot of residuals is very helpful in detecting such a correlation. This plot will be
included in the regression output if the box Residual Plots was checked in the regression
input dialog window (Fig. 1).

Example

X Variable 1 Residual Plot
-1
-0.5
0
0.5
1
0.0 5.0 10.0 15.0 20.0 25.0
X Variable 1 (x
1
)
R
e
s
i
d
u
a
l
s

(
r
)

However, the fact that the residuals look random and that there is no obvious correlation
with the variable x does not necessarily mean by itself that the model is adequate. More
tests are needed.

Standard ( or standardized ) residual is a residual scaled with respect to the standard
error (deviation) S
y
in a dependent variable:

r
j
= r
j
/ S
y
(2a)

The quantity S
y
is part of the Regression statistics output (discussed later). Standardized
residuals are used for some statistical tests, which are not usually needed for models in
physical sciences.
7
ANOVA output

There are two tables in ANOVA (Analysis of Variance).

Example

Table 4. ANOVA output (part I)*
df SS MS F Significance F
Regression 3 (df
R
) 723630.06 (SS
R
) 241210.02 (MS
R
) 425507.02 (F
R
) 6.12E-09 (P
R
)
Residual (error) 3 (df
E
) 1.70 (SS
E
) 0.57 (MS
E
)
Total 6 (df
T
) 723631.76 (SS
T
) N/A (MS
T
)

Table 4a. ANOVA output (part II)*
Coefficients
(b
i
)
Standard
Error (se (b
i
))
t Stat
(t
i
)
P-value
(P
i
)
Lower 95%
(b
L
,
(1-Pi)
)
Upper 95%
(b
U
,
(1-Pi)
)
Intercept (b
0
) 0.52292226 1.77984111 0.293802778 0.7881 -5.1413318 6.1871763
X Variable 1 (x
1
) 2.91437225 0.73039587 3.990126957 0.0282 0.58992443 5.2388201
X Variable 2 (x
2
) 2.02376459 0.07318737 27.65182747 0.0001 1.79084949 2.2566797
X Variable 3 (x
3
) -0.0009602 0.00206174 -0.46574477 0.6731 -0.0075216 0.0056011

* - Corresponding notation used in this handout is given in parenthesis; N/A means not available in
Microsoft Excel regression output.

Coefficients.
The regression program determines the best set of parameters b (b
0
, b
1
, b
2
, b
p
) in the
model y
j
=b
0
+b
1
x
1j
+b
2
x
2j
+ b
p
x
pj
by minimizing the error sum of squares SS
E
(discussed
later). Coefficients are listed in the second table of ANOVA (see Table 4a). These
coefficients allow the program to calculate predicted values of the dependent variable y (y
1
,
y
2
, y
n
), which were used above in formula (2) and are part of Residual output ( Table 3).

Sum of squares.
In general, the sum of squares of some arbitrary variable q is determined as:

SS
q
=
j
n
(q
j
- q
avg
)
2
, where (3)

q
j
- jth observation out of n total observations of quantity q

q
avg
- average value of q in n observations: q
avg
=
j
n
q
j
)/n

8
In the ANOVA regression output one will find three types of sum of squares (see Table 4):

1). Total sum of squares SS
T
:

SS
T
=
j
n
(y
j
*

- y*
avg
)
2
, where (3a)
y*
avg
=
j
n
y
j
*)/n

It is obvious that SS
T
is the sum of squares of deviations of the experimental values of
dependent variable y* from its average value. SS
T
could be interpreted as the sum of
deviations of y* from the simplest possible model (y is constant and does not depend on
any variable x):

y = b
0
, with b
0
= y*
avg
(4)

SS
T
has two contributors: residual (error) sum of squares (SS
E
) and regression sum of
squares(SS
R
):

SS
T
= SS
E
+ SS
R
(5)

2). Residual (or error) sum of squares SS
E
:

SS
E
=
j
n
(r
j
- r
avg
)
2
(6)

Since in the underlying theory the expected value of residuals r
avg
is assumed to be zero,
expression (6) simplifies to:

SS
E
=
j
n
(r
j
)
2
(6a)

The significance of this quantity is that by the minimization of SS
E
the spreadsheet
regression tool determines the best set of parameters b= b
0
, b
1
, b
2
, , b
p
for a given
regression model. SS
E
could be also viewed as the due-to-random-scattering-of -y*-about-
predicted-line contributor to the total sum of squares SS
T
. This is the reason for calling the
quantity due to error (residual) sum of squares.

3). Regression sum of squares SS
R
:

SS
R
=
j
n
(y
j
- y*
avg
)
2
(7)

SS
R
is the sum of squares of deviations of the predicted-by-regression-model values of
dependent variable y from its average experimental value y*
avg
. It accounts for addition of
p variables (x
1
, x
2
, , x
p
) to the simplest possible model (4) (variable y is just a constant
and does not depend on variables x), i.e. y = b
0
vs. y = b
0
+ b
1
x
1
+ b
2
x
2
+ + b
p
x
p
. Since
this is a transformation from the non-regression model (4) to the true regression
model (1), SS
R
is called the due to regression sum of squares.

9
The definition of SS
R
in the form (7) is not always given in the literature. One can find
different expressions in books on statistics

[1, 2] :

SS
R
=
i
p
b
i

j
n
(x
i j
- x
avg
) y
j
, where (7a)
x
avg
=
j
n
x
j
*)/n

or

SS
R
=
i
p
b
i

j
n
x
i j
y
j
* - (
j
n
y
j
*)
2
/n (7b)

Relationships (7a-b) give the same numerical result, however, it is difficult to see the
physical meaning of SS
R
from them.

Mean square (variance) and degrees of freedom
The general expression for the mean square of an arbitrary quantity q is:

MS
q
= SS
q
/ df (8)

SS
q
is defined by (3) and df is the number of degrees of freedom associated with quantity
SS
q
. MS is also often referred to as the variance. The number of degrees of freedom could
be viewed as the difference between the number of observations n and the number of
constraints (fixed parameters associated with the corresponding sum of squares SS
q
).

1). Total mean squareMS
T
(total variance):

MS
T
= SS
T
/(n - 1) (9)

SS
T
is associated with the model (4), which has only one constraint (parameter b
0
),
therefore the number of degrees of freedom in this case is:

df
T
= n - 1 (10)

2). Residual (error) mean squareMS
E
(error variance):

MS
E
= SS
E
/ (n - k) (11)

SS
E
is associated with the random error around the regression model (1), which has k=p+1
parameters (one per each variable out of p variables total plus intercept). It means there are
k constraints and the number of degrees of freedom is :

df
E
= n - k (12)
10

3). Regression mean squareMS
R
(regression variance):

MS
R
= SS
R
/(k - 1) (13)

The number of degrees of freedom in this case can be viewed as the difference between the
total number of degrees of freedom df
T
(10) and the number of degrees of freedom for
residuals df
E
(12) :

df
R
= df
T
- df
E
= (n - 1) - (n - k)

df
R
= k - 1 = p (14)

Tests of significance and F-numbers
The F-number is the quantity which can be used to test for the statistical difference between
two variances. For example, if we have two random variables q and v, the corresponding
F- number is:

F
qv
= MS
q
/ MS
v
(15)

The variances MS
q
and MS
v
are defined by an expression of type (8). In order to tell
whether two variances are statistically different, we determine the corresponding
probability P from F-distribution function:

P=P(F
qv
, df
q
, df
v
) (16)

The quantities df
q
, df
v
- degrees of freedom for numerator and denominator - are
parameters of this function. Tabulated numerical values of P for the F-distribution can be
found in various texts on statistics or simply determined in a spreadsheet directly by using
the corresponding statistical function (e.g. in Microsoft Excel one would use FDIST(F
qv
,
df
q
, df
v
) to return the numerical value of P). An interested reader can find the analytical
form of P=P(F
qv
, df
q
, df
v
) in the literature (e.g. [1, p.383]).

The probability P given by (16) is a probability that the variances MS
q
and MS
v
are
statistically indistinguishable. On the other hand, 1-P is the probability that they are
different and is often called confidence level. Conventionally, a reasonable confidence
level is 0.95 or higher. If it turns out that 1-P < 0.95, we say that MS
q
and MS
v
are
statistically the same. If 1-P > 0.95, we say that at least with the 0.95 (or 95%) confidence
MS
q
and MS
v
are different. The higher the confidence level, the more reliable our
conclusion. The procedure just described is called the F-test.

There are several F-tests related to regression analysis. We will discuss the three most
common ones. They deal with significance of parameters in the regression model . The first
11
and the last of them is performed by spreadsheet regression tool automatically, whereas the
second one is not.

1). Significance test of all coefficients in the regression model

In this case we ask ourselves: With what level of confidence can we state that AT LEAST
ONE of the coefficients b (b
1
, b
2
, b
p
) in the regression model is significantly different
from zero?. The first step is to calculate the F-number for the whole regression (part of
the regression output (see Table 4)):

F
R
= MS
R
/ MS
E
(17)

The second step is to determine the numerical value of the corresponding probability P
R

(also part of the regression output ( see Table 4)) :

P
R
= FDIST(F
R
, df
R
, df
E
) (18)

Taking into account expressions (12) and (14) we obtain:

P
R
= FDIST(F
R
, k - 1, n - k) (18a)

Finally we can determine the confidence level 1 - P
R
. At this level of confidence, the
variance due to regression MS
R
is statistically different from the variance due to error
MS
E
. In its turn it means that the addition of p variables (x
1
, x
2
, , x
p
) to the simplest
model (4) (dependent variable y is just a constant) is a statistically significant improvement
of the fit. Thus, at the confidence level not less than 1- P
R
we can say: At least ONE of
coefficients in the model is significant. F
R
could be also used to compare two models
describing the same experimental data: the higher F
R
the more adequate the corresponding
model.

Example

In our illustrative exercise we have P
R
= 6.12E-09 (Table 4), the corresponding level of
confidence 1 - P
R
= 0.9999. Therefore with the confidence close to 100% we can say that at
least one of coefficients b
1
, b
2
and b
3
is significant for the model y = b
0
+ b
1
x
1
+ b
2
x
2
+
b
3
x
3
, where x
1
=z, x
2
= z
2
and x
3
= z
3
.

NOTE: From this test, however, we can not be sure that ALL coefficients b
1
,b
2
and b
3

are non-zero .

If 1- P
R
is not big enough (usually less than 0.95), we conclude that ALL the coefficients
in the regression model are zero (in other words, the hypothesis that the variable y is just
a constant is better than it is function of variables x (x
1
, x
2
, , x
p
) ).

12
2). Significance test of subset of coefficients in the regression model

Now we want to decide With what level of confidence can we be sure that at least ONE
of the coefficients in a selected subset of all the coefficients is significant?. Let us test a
subset of the last m coefficients in the model with a total of p coefficients (b
1
, b
2
, b
p
).
Here we need to consider two models:

y = b
0
+ b
1
x
1
+ b
2
x
2
+ b
p
x
p
(unrestricted) (19)

and

y = b
0
+ b
1
x
1
+ b
2
x

+ b
p-m
x
p-m
(restricted) (20)

These models are called unrestricted (19) and restricted (20) respectively. We need to
perform two separate least square regression analyses for each model.

From the regression output (see Table 4) for each model we obtain the corresponding error
sum of squares SS
E
and SS
E
as well as variance MS
E
for the unrestricted model. The next
step is to calculate the F-number for testing a subset of m variables by hand (it is not
part of Microsoft Excel ANOVA for an obvious reason, i.e. you must decide how many
variables to include in the subset):

F
m
= {( SS
E
- SS
E
) / m} / MS
E
(21)

F
m
could be viewed as an indicator of whether the reduction in the error variance due to the
addition of the subset of m variables to the restricted model (20) (( SS
E
- SS
E
) / m ) is
statistically significant with respect to the overall error variance MS
E
for the unrestricted
model (19). It is equivalent to testing the hypothesis that at least one of coefficients in the
subset is not zero. In the final step, we determine probability P
m
(also by hand):

P
m
= FDIST(F
m
, m, n - k) (22)

At the confidence level 1- P
m
at least ONE of the coefficients in the subset of m is
significant. If 1- P
m
is not big enough (less than 0.95) we state that ALL m coefficients
in the subset are insignificant.

13
Example

The regression output for the unrestricted model (y = b
0
+ b
1
x
1
+ b
2
x
2
+ b
3
x
3
, where x
1
=z,
x
2
= z
2
and x
3
= z
3
) is presented in Table 4.

Say, we want to test whether the quadratic and
the cubic terms are significant. In this case the restricted model is:

y = b
0
+ b
1
x
1
, (restricted model) (23)
where x
1
= z

The subset of parameters consists of two parameter and m=2. By analogy with the input
table for the unrestricted model (Table 2) we prepare one for the restricted model:

Table 5. Regression input for restricted model
Data point # Dependent var. Independent var.
j y* x
1
(=z)
1 20.6947 2.5
2 28.5623 3.1
3 157.0020 8.1
4 334.6340 12.2
5 406.5697 13.5
6 696.0331 17.9
7 945.1385 21.0

We perform an additional regression using this input table and as part of ANOVA obtain:

Table 6. Regression ANOVA output for the restricted model
df SS MS F Significance
F
Regression 1 689216 689216 100 1.70-4
Residual (error) 5 34415.70 6883
Total 6 723632

From Table 4 and Table 6 we have:

SS
E
= 1.70 (error sum of squares; unrestricted model)
MS
E
= 0.57 (error mean square; unrestricted model)
df
E
=(n - k)= 3 (degrees of freedom; unrestricted model)
SS
E
= 34415.70 (error sum of squares; restricted model)

Now we are able to calculate F
m=2
:

F
m=2
= {(34415.70-1.70)/ 2} / 0.57
F
m=2
= 30187.72
14
Using the Microsoft Excel function for the F-distribution we determine the probability
P
m=2
:

P
m=2
= FDIST(30187.72, 2, 3)
P
m=2
= 3.50E-07

Finally we calculate the level of confidence 1- P
m=2
:

1-P
m=1
= 1 - 3.50E-07
1-P
m=1
= 0.99999

The confidence level is high (more than 99.99 %). We conclude that at least one of the
parameters (b
2
or b
3
) in the subset is non-zero. However, we can not be sure that both
quadratic and cubic terms are significant.

3). Significance test of an individual coefficient in the regression model

Here the question to answer is: With what confidence level can we state that the ith
coefficient b
i
in the model is significant?. The corresponding F-number is:

F
i
= b
i
2
/ [se(b
i
)]
2
(24)

se(b
i
) is the standard error in the individual coefficient b
i
and is part of the ANOVA output
(see Table 4a). The corresponding probability

P
i
= FDIST(F
i
, 1, n - k) (25)

leads us to the confidence level 1- P
i
at which we can state that coefficient b
i
is significant.
If this level is lower than desired one we say that coefficient b
i
is insignificant. F
i
is not
part of spreadsheet regression output, but might be calculated by hand if needed.

However, there is another statistics for testing individual parameters, which is part of
ANOVA (see Table 4a):

t
i
= b
i
/ se(b
i
) (26)

The t
i
- number is the square root of F
i
(expression (24)). It has a Students distribution
(see [1, p. 381] for the analytical form of the distribution). The corresponding probability is
numerically the same as that given by (25). There is a statistical function in Microsoft
Excel which allows one to determine P
i
( part of ANOVA (see Table 4a)):

P
i
= TDIST(t
i
, n-k, 2) (27)
15

Parameters of the function (27) are: the number of degrees of freedom df (df
E
=n - k) and form of test (TL=2).
If TL=1 a result for a one-tailed distribution is returned; if TL=2 two-tailed distribution result is returned.
An interested reader can find more information about the issue in ref. [1]

Example

In our illustration P
0
=0.7881 and P
3
= 0.6731 (see Table 4a) corresponds to fairly low
confidence levels, 1 - P
0
= 0.2119 and 1 - P
3
= 0.3269. This suggests that parameters b
0
and
b
3
are not significant. The confidence levels for b
1
and b
2
are high (1 - P
1
= 1 - 0.0282 =
0.9718 and 1- P
2
= 1 - 0.0001 = 0.9999), which means that they are significant.

In conclusion of this F-test discussion, it should be noted that in case we remove even one
insignificant variable from the model, we need to test the model once again, since
coefficients which were significant in certain cases might become insignificant after
removal and visa versa. It is a good practice to use a reasonable combination of all three
tests in order to achieve the most reliable conclusions.

Confidence interval
In the previous section we were obtaining confidence levels given F-numbers or t-
numbers. We can go in an opposite direction: given a desired minimal confidence level 1-P
(e.g. 0.95) calculate the related F- or t-number. Microsoft Excel provides two statistical
functions for that purpose:

F
(1-P)
=FINV (P, df
q
, df
v
) (28)

t
(1-P)
=TINV(P, df ) (29)

df
q
, df
v
- degrees of freedom of numerator and denominator, respectively (see (15))
df - degree of freedom associated with a given t-test (varies from test to test)

NOTE: in expression (29) P

is the probability associated with so called two-
tailed Students distribution. A one- tailed distribution has the different
probability The relationship between the two is:

=P/2 (30)

Values of F-numbers and t-numbers for various probabilities and degrees of freedom
are tabulated and can be found in any text on statistics [1,2,3,4] . Usually the one-tailed
Students distribution is presented.

16
Knowing the t-number for a coefficient b
i
we can calculate the numerical interval which
contains the coefficient b
i
with the desired probability 1-P
i
:

b
L, (1- Pi)
= b
i
- se(b
i
)*t
(1-Pi)
(lower limit) (31)

b
U, (1- Pi)
= b
i
+ se(b
i
)*t
(1-Pi)
(upper limit) (31a)

t
(1-Pi)
=TINV(P
i
, n-k) (32)

The standard errors for individual parameters se(b
i
) are part of ANOVA (Table 4a). The
interval [b
L, (1- Pi)
; b
U, (1- Pi)
] is called the confidence interval of parameter b
i
with the 1-P
i

confidence level. The upper and lower limits of this interval at a 95% confidence are listed
in the ANOVA output by default ( Table 4a; columns Lower 95% and Upper 95%). If
in addition to this default, the confidence interval at a confidence other than 95% is desired,
the box Confidence level should be checked and the value of the alternative confidence
entered in the corresponding window of the Regression input dialog box (see Fig. 1).

Example

For the unrestricted model (1b), the lower and upper 95% limits for intercept are -5.1413
and 6.1872 respectively (see Table 4a). The fact that with the 95% probability zero falls
in this interval is consistent with our conclusion of insignificance of b
0
made in the course
of F-testing of individual parameters (see Example at the end of previous section). The
confidence intervals at the 95% level for b
1
and b
2
do not include zero. This also agrees
with the F-test of individual parameters.

I n fact, analysis whether zero falls in a confidence interval could be viewed as a different
way to perform the F-test (t-test) of individual parameters and must not be used as an
additional proof of conclusions made in such a test.

Regression statistics output

The information contained in the Regression statistics output characterizes the
goodness of the model as a whole. Note that quantities listed in this output can be
expressed in terms of the regression F-number F
R
(Table 4) which we have already used
for the significance test of all coefficients.

17
Example
For our unrestricted model (1b) the output is:

Table 7. Regression statistics output*

Multiple R 0.99999882
R Square (R
2
) 0.99999765
Adjusted R Square (R
2
adj
) 0.9999953
Standard Error (S
y
) 0.75291216
Observations (n) 7
* - Corresponding notation used in this handout is given in parenthesis

Standard error (S
y
):

S
y
= (MS
E
)
0.5
(33)

MS
E
is an error variance discussed before (see expression (11)). Quantity S
y
is an estimate
of the standard error (deviation) of experimental values of the dependent variable y* with
respect to those predicted by the regression model. It is used in statistics for different
purposes. One of the applications we saw in the discussion of Residual output
(Standardized residuals; see expression (2a)).

Coefficient of determination R
2

(or R Square):

R
2
=SS
R
/ SS
T
= 1 - SS
E
/SS
T
(34)

SS
R
, SS
E
and SS
T
are regression, residual (error) and total sum of squares defined by (7),
(6a) and (3a) respectively. The coefficient of determination is a measure of the regression
model as whole. The closer R
2
is

to one, the better the model (1) describes the data. In the
case of a perfect fit R
2
=1.

Adjusted coefficient of determination R
2
(or Adjusted R Square):

R
2
adj
=1- {SS
E
/ (n-k)} / {SS
T
/ (n-1)} (35)

SS
E
and SS
T
are the residual (error) and the total sum of squares (see expressions (6a) and
(3a)). The significance of R
2
adj
is basically the same as that of R
2
(the closer to one the
better). Strictly speaking R
2
adj
should be used as an indicator of an adequacy of the model,
since it takes in to account not only deviations, but also numbers of degrees of freedom.

18
Multiple correlation coefficient R:

R = ( SS
R
/ SS
T
)
0.5
(36)

This quantity is just the square root of coefficient of determination.

Example

The fact that R
2
adj
= 0.9999953 in our illustration is fairly close to 1 (see Table 7) suggests
that overall model (1b) is adequate to fit the experimental data presented in Table 1.
However, it does not mean that there are no insignificant parameters in it.

REGRESSION OUTPUT FORMULA MAP

For references, the following tables present a summary of the formula numbers for
individual items in the Microsoft Excel Regression Output. Variables in parenthesis,
introduced and used in this handout, do not appear in the output.

Table 8. Formula map of Regression statistics output

Multiple R (36)
R Square (R
2
) (34)
Adjusted R Square (R
2
adj
) (35)
Standard Error (S
y
) (33)
Observations (n)

Table 9. Formula map of Residual output
Observation (j) Predicted Y (y
j
) Residuals (r
j
) Standard Residuals(r
j
)
1 (1) (2) (2a)
2 (1) (2) (2a)

Table 10. Formula map of ANOVA output (part I)
df SS MS F Significance F
Regression (df
R
) (14) (SS
R
) (7) (MS
R
) (13) (F
R
) (17) (P
R
) (18)
Residual (error) (df
E
) (12) (SS
E
) (6a) (MS
E
) (11)
Total (df
T
) (10) (SS
T
) (3a) (MS
T
)* (9)

*- not reported in Microsoft Excel Regression output
19
Table 10a. Formula map of ANOVA output (part II)
Coefficients
(b
i
)
Standard
Error (se(b
i
))
t Stat
(t
i
)
P-value
(P
i
)
Lower
95%
(b
L,(1-Pi)
)
Upper
95%
(b
U,(1-
Pi)
)
Intercept (b
0
) (26) (25) (27) (31) (31a)
X Variable 1 (x
1
) (26) (25) (27) (31) (31a)
X Variable 2 (x
2
) (26) (25) (27) (31) (31a)

LITERATURE

1. Afifi A.A., Azen S.P. Statistical analysis. Computer oriented approach, Academic
press, New York (1979)
2. Natrella M.G. Experimental Statistics, National Bureau of Standards, Washington
DC (1963)
3. Neter J., Wasserman W. Applied linear statistical models, R.D. Irwin Inc.,
Homewood, Illinois (1974)
4. Gunst R.F., Mason R.L. Regression analysis and its application, Marcel Dekker Inc.,
NewYork (1980)
5. Shoemaker D.P., Garland C.W., Nibler J.W. Experiments in Physical Chemistry,
The McGraw-Hill Companies Inc. (1996)

Regression Analysis Using Excel
100% (1)
Regression Analysis Using Excel
85 pages
Real Statistics Examples Regression 1
No ratings yet
Real Statistics Examples Regression 1
412 pages
Linear Regression Models
No ratings yet
Linear Regression Models
42 pages
Linear Regression
100% (2)
Linear Regression
28 pages
Simple Linear and Logistic Regression
No ratings yet
Simple Linear and Logistic Regression
81 pages
Simple Linear Regression Sample
No ratings yet
Simple Linear Regression Sample
55 pages
Simple Linear Regression
No ratings yet
Simple Linear Regression
20 pages
LGT2425 Lecture 3 Part II (Notes)
No ratings yet
LGT2425 Lecture 3 Part II (Notes)
55 pages
Estad Istica II Chapter 5. Regression Analysis (Second Part)
No ratings yet
Estad Istica II Chapter 5. Regression Analysis (Second Part)
39 pages
Regression Anslysis
No ratings yet
Regression Anslysis
23 pages
Lecture3 221109 035214
No ratings yet
Lecture3 221109 035214
87 pages
Day7-Linear Regression New
No ratings yet
Day7-Linear Regression New
26 pages
DMJAP LinearRegression 3
No ratings yet
DMJAP LinearRegression 3
28 pages
Least Square Method Using Excel
No ratings yet
Least Square Method Using Excel
21 pages
P4-FDA-B29-Monish Patle
No ratings yet
P4-FDA-B29-Monish Patle
14 pages
Data Analysis
100% (1)
Data Analysis
28 pages
Mult Regression
No ratings yet
Mult Regression
28 pages
Simple Linear Regression Analysis
No ratings yet
Simple Linear Regression Analysis
55 pages
Multiple Linear Regression Analysis Usin
No ratings yet
Multiple Linear Regression Analysis Usin
19 pages
Introduction To Management Science: Post Mid Sessions 2 & 3 November 4 and 6 2019
No ratings yet
Introduction To Management Science: Post Mid Sessions 2 & 3 November 4 and 6 2019
26 pages
Simple Linear Regression
No ratings yet
Simple Linear Regression
95 pages
Regression Model
No ratings yet
Regression Model
30 pages
Linear Regression Analysis in Excel Assingment
No ratings yet
Linear Regression Analysis in Excel Assingment
17 pages
Regrion
No ratings yet
Regrion
19 pages
Business Statistics II
100% (2)
Business Statistics II
100 pages
Simple Liner REgression
No ratings yet
Simple Liner REgression
27 pages
W6 - L6 - Multiple Linear Regression
No ratings yet
W6 - L6 - Multiple Linear Regression
3 pages
Applying Machine Learning Algorithms With Scikit-Learn (Sklearn) - Notes
No ratings yet
Applying Machine Learning Algorithms With Scikit-Learn (Sklearn) - Notes
19 pages
Linear Regression Analysis in Excel 2
No ratings yet
Linear Regression Analysis in Excel 2
15 pages
How To Do Linear Regression With Excel
No ratings yet
How To Do Linear Regression With Excel
8 pages
Regression Analysis
No ratings yet
Regression Analysis
49 pages
Module01.1 LinearRegression
No ratings yet
Module01.1 LinearRegression
32 pages
Lecture5 Mar22 2024
No ratings yet
Lecture5 Mar22 2024
44 pages
Lecture Note #8 - PEC-CS701E
No ratings yet
Lecture Note #8 - PEC-CS701E
20 pages
Evans Analytics2e PPT 08
No ratings yet
Evans Analytics2e PPT 08
65 pages
ML Exp1 C36
No ratings yet
ML Exp1 C36
13 pages
Group 1 Practical
No ratings yet
Group 1 Practical
16 pages
5 - Part II - Regression Analysis W-Notes
No ratings yet
5 - Part II - Regression Analysis W-Notes
10 pages
Lecture 12
No ratings yet
Lecture 12
5 pages
1.linear Regression PSP
No ratings yet
1.linear Regression PSP
92 pages
Regression Analysis
100% (2)
Regression Analysis
9 pages
Multiple Regression Analysis
No ratings yet
Multiple Regression Analysis
14 pages
Chapter 3 Multiple Linear Regression - We Use This One
No ratings yet
Chapter 3 Multiple Linear Regression - We Use This One
6 pages
Fba 1
No ratings yet
Fba 1
9 pages
Simple Linear Regression: Coefficient of Determination
No ratings yet
Simple Linear Regression: Coefficient of Determination
21 pages
01 SLR Final
No ratings yet
01 SLR Final
37 pages
Regression Analysis
No ratings yet
Regression Analysis
65 pages
What Is Linear Regression
No ratings yet
What Is Linear Regression
14 pages
SimpleMultipleLinearRegression FoundationalMathofAI S24
No ratings yet
SimpleMultipleLinearRegression FoundationalMathofAI S24
6 pages
Recent Developments in Polysaccharide-Based Materials Used As Adsorbents in Wastewater Treatment
No ratings yet
Recent Developments in Polysaccharide-Based Materials Used As Adsorbents in Wastewater Treatment
33 pages
Untitled 472
No ratings yet
Untitled 472
13 pages
Intro To Regresion: Codergirl Data Analysis
No ratings yet
Intro To Regresion: Codergirl Data Analysis
32 pages
Lab-3: Regression Analysis and Modeling Name: Uid No. Objective
No ratings yet
Lab-3: Regression Analysis and Modeling Name: Uid No. Objective
9 pages
Simple Linear Regression
No ratings yet
Simple Linear Regression
11 pages
Regression Checklist
No ratings yet
Regression Checklist
3 pages
Business Statistics, 5 Ed.: by Ken Black
No ratings yet
Business Statistics, 5 Ed.: by Ken Black
34 pages
Bivariate
No ratings yet
Bivariate
28 pages
Chapter 3 - Chemical Reactor Design For Single Reaction
No ratings yet
Chapter 3 - Chemical Reactor Design For Single Reaction
32 pages
Drawing 3 Data
No ratings yet
Drawing 3 Data
35 pages
Simple Regression 1
No ratings yet
Simple Regression 1
18 pages
Volume Molar Parsial
No ratings yet
Volume Molar Parsial
4 pages
Ice 204 - chp3 - Non Catalytic Reaction Kinetics
No ratings yet
Ice 204 - chp3 - Non Catalytic Reaction Kinetics
22 pages
Lec 2
No ratings yet
Lec 2
18 pages
Equilibrium Reactions of N-Butanol Dehydration
No ratings yet
Equilibrium Reactions of N-Butanol Dehydration
1 page
Voice 1
No ratings yet
Voice 1
12 pages
Unsteady Bernoulli Equation: Advanced Fluid M Ech Anics
No ratings yet
Unsteady Bernoulli Equation: Advanced Fluid M Ech Anics
13 pages
Material Safety Data Sheet: Product and Company Identification 1
No ratings yet
Material Safety Data Sheet: Product and Company Identification 1
11 pages
Jitv33 8
No ratings yet
Jitv33 8
5 pages
Complex Variables
From Everand
Complex Variables
Francis J. Flanigan
No ratings yet
Digital Signal Processing (DSP) with Python Programming
From Everand
Digital Signal Processing (DSP) with Python Programming
Maurice Charbit
No ratings yet
Top Numerical Methods With Matlab For Beginners!
From Everand
Top Numerical Methods With Matlab For Beginners!
Andrei Besedin
No ratings yet

Multiple Linear Regression in Excel

Uploaded by

Multiple Linear Regression in Excel

Uploaded by

MULTIPLE LINEAR REGRESSION ANALYSIS

USING MICROSOFT EXCEL

You might also like