Regression Using Excel


The practice of applying curve fitting techniques to
describe data is widespread in all
fields of biology. The purpose of curve fitting of
biological data is to describe data in the universally
recognized form y = f(x), where y is the dependent
variable and is measured in the experiment, and x is
controlled during the experiment and is called the in-
dependent variable as its value on the x-axis is fixed. f
is the function used to describe the relationship be-
tween x and y, and takes the form of an equation com-
posed of one or more parameters. In general, the better
the fit of the curve the more accurately it describes the
data. Applying linear fits to data is a comparatively
simple procedure and can be carried out easily with a
pocket calculator. Describing data with nonlinear
functions (nonlinear regression) is more problemati-
cal. Prior to the advent of personal computers, data
was linearly transformed, then fit with a linear func-
tion. However, in the age of personal computers there
is no shortage of specialist programs that will carry
out nonlinear regression. Many of the programs of
choice of biologists carrying out nonlinear regression
analysis tend to be expensive and contain an excess of
redundant features. These programs do not generally
handle data well and display data, graphs, curve fits,
and analysis in a multitude of separate windows that
may be confusing to the user. Thus, data relating to
one experiment may be contained in the acquisition
program, data handling program, curve fitting pro-
gram, and presentation program. The task of keeping
track of these data files from multiple experiments can
be a logistical nightmare, particularly for a laboratory
head. A solution is to reduce the number of programs
involved in data analysis by carrying out as much
analysis as possible in one program. Excel (Microsoft,
Redmond, WA) is part of the Microsoft Office suite that
is usually offered as part of the computer package upon
purchase. Excel offers a user-friendly interface with
good data handling capabilities, built-in mathematical
functions, and instantaneous graphing. In addition, the
program contains the SOLVER function, which is well
suited to fitting data with nonlinear functions via an
iterative algorithm. In
this article, a method for carrying out nonlinear re-
gression analysis of data with user-input functions us-
ing a template created in an Excel spreadsheet is de-
scribed in detail. A preliminary description of this
procedure has previously appeared [1].
Method
The method involves manual data entry and graph-
ing of data, followed by curve fitting and displaying the
resulting curve fit on top of the data. The goodness of fit
is calculated so that the accuracy of fit can be assessed.
Nonlinear regression
The description of data by a function is carried out by
the process of iterative (i.e., cyclical) nonlinear
regression. This process minimizes the sum of the squared
differences between the data and the fit:
$$SS = \sum_{i=1}^{n} \left[ y_i - y_{\mathrm{fit},i} \right]^2 \qquad (1)$$

where $y_i$ is the ith data point, $y_{\mathrm{fit},i}$ is
the value of the fitted curve at the corresponding x value,
n is the number of data points, and SS is the sum of the
squared differences over all the data points. The user
provides initial estimates of the parameter values, from
which the first iteration calculates an initial SS value.
The second iteration changes the parameter values by a
small amount and recalculates SS. This process is repeated
many times until further changes in the parameter values
no longer reduce SS, that is, until the smallest possible
value of SS is found. SOLVER employs the generalized
reduced gradient (GRG) method of iteration [2].

58 / OCTOBER 2001
Application Note
Nonlinear regression analysis of data using a spreadsheet
BY ANGUS M. BROWN
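The iterative idea of nudging the parameters and keeping changes that lower SS can be sketched in a few lines of Python. This is a minimal illustration and not SOLVER's GRG algorithm: it uses a simple compass search that perturbs one parameter at a time. The function names (`sum_of_squares`, `minimize_ss`), the exponential test model, and the synthetic data are all assumptions made for this example.

```python
import math


def sum_of_squares(model, params, xs, ys):
    """Equation 1: sum of squared differences between data and fit."""
    return sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))


def minimize_ss(model, initial, xs, ys, step=1.0, tol=1e-9, max_iter=100_000):
    """Compass search (NOT GRG): perturb each parameter by +/- step,
    keep any change that lowers SS, and halve the step when none does."""
    params = list(initial)
    best = sum_of_squares(model, params, xs, ys)
    for _ in range(max_iter):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                ss = sum_of_squares(model, trial, xs, ys)
                if ss < best:
                    params, best, improved = trial, ss, True
                    break
        if not improved:
            step /= 2.0
            if step < tol:
                break
    return params, best


def decay(x, params):
    """Hypothetical nonlinear test model: y = a * exp(-k * x)."""
    a, k = params
    return a * math.exp(-k * x)


# Synthetic, noise-free data generated from a = 2.0, k = 0.5.
xs = [float(i) for i in range(10)]
ys = [decay(x, (2.0, 0.5)) for x in xs]

# Start from deliberately wrong initial estimates and let the search refine them.
fitted, ss = minimize_ss(decay, (1.0, 1.0), xs, ys)
print(fitted, ss)
```

As with SOLVER, the result depends on the initial estimates: a search of this kind only moves downhill from where it starts, so a poor starting point can leave it in a local minimum.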
The following example illustrates how to use the
spreadsheet program's SOLVER function to fit data
with user-input nonlinear functions. The example used
is the Boltzmann equation, but any nonlinear function
can be used simply by substituting the relevant equation.
$$y = \frac{1}{1 + \exp\left( \dfrac{V - x}{\mathrm{Slope}} \right)} \qquad (2)$$
where y is the dependent variable, x is the independent
variable (Voltage), and V and Slope are the parameter
values. V is the half activation voltage and Slope de-
scribes the slope at the point V and indicates the steep-
ness of the curve. This paper does not address the critical
issue of which functions are suitable to describe
individual data, but this topic is discussed in detail
elsewhere [3, 4].
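To make the roles of V and Slope concrete, the Boltzmann function can be written as a small Python function. This is only an illustrative sketch: the name `boltzmann` and the x values chosen for evaluation are assumptions, while the parameter values are the fitted ones reported later in this note.

```python
import math


def boltzmann(x, v_half, slope):
    """Equation 2: y = 1 / (1 + exp((V - x) / Slope))."""
    return 1.0 / (1.0 + math.exp((v_half - x) / slope))


# Fitted parameter values reported in this note.
V, SLOPE = 99.366, 24.388

# At x = V the exponent is zero, so the curve sits at half its maximum.
half = boltzmann(V, V, SLOPE)

# Evaluate the curve at a few hypothetical x values.
curve = [boltzmann(x, V, SLOPE) for x in (0.0, 50.0, 100.0, 150.0, 200.0)]
print(half, curve)
```

With the sign convention of Equation 2 and a positive Slope, y rises from 0 toward 1 as x increases, and a smaller Slope gives a steeper transition around V.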
Configuring the spreadsheet for nonlinear regression
1. Input onto a spreadsheet the raw data in two
columns, the x column containing the independent
variable, and the y column containing the dependent
variable. This is illustrated as Columns A and B of
Figure 1a.
2. Graph the data contained in cells A2 to B20.
3. Enter labels in cells G1 to G8 to describe the
contents of the adjacent cells. In cell G1 enter V, which
describes the parameter in cell H1. Select cell H1, then
from the Insert menu choose Name, then Define, and name
the cell V. Similarly, in cells G2 to G8 enter Slope,
Mean of y, df, SE of y, R2, Critical t, and CI,
respectively. Name cells H2 to H8 Slope, Mean_of_y, df,
SE_of_y, RSQ, Critical_t, and CI, respectively.
4. Insert initial estimates of the parameters V and
Slope into cells H1 and H2, respectively. Approximate
estimates are 80 and 30, respectively.
5. In Column C (Boltzmann) enter the equation de-
scribing the Boltzmann function. This has been rear-
ranged from Equation 2 into a form that the program
recognizes: =(1/(1+EXP((V-A2)/Slope))), where V and
Slope refer to the parameter values in cells H1 and H2.
6. Copy the equation from cell C2 down to and in-
cluding C20.
7. The mean of the y values is calculated by entering
the following formula in H3: =AVERAGE(B2:B20).
8. The degrees of freedom (df) is defined as the num-
ber of data points minus the number of parameters in the
function. It is calculated by entering the following for-
mula in H4: =COUNT(B2:B20)-COUNT(H1:H2).
9. The standard error of the y values is calculated by
entering this formula in H5:
=SQRT(SUM((B2:B20-C2:C20)^2)/df).
However, because this formula must be expressed as an
array formula, press Ctrl+Shift+Enter. This encloses the
whole formula within a pair of curly brackets ({}),
denoting it as an array formula.
10. The R² value, the correlation index or coefficient
of determination, is calculated by entering the following
formula in H6 and expressing it as an array formula as
described above:
=1-SUM((B2:B20-C2:C20)^2)/SUM((B2:B20-Mean_of_y)^2).
11. In order for the confidence interval of the fit to
be calculated, the critical t value at the 95% confidence
level (a significance level of 0.05) is calculated by
entering the following formula in H7: =TINV(0.05,df). The
confidence interval is calculated by entering the
following formula in H8: =Critical_t*SE_of_y. Enter the
following formula in D2: =C2+CI, and copy it down to D20.
Similarly, enter =C2-CI in E2 and copy down to E20. This
calculates the upper and lower confidence limits (95%) of
the fit.

Figure 1 Spreadsheet template for nonlinear regression.
a) Formulae used in the curve fitting procedure. The (x, y)
data are entered into Columns A and B, respectively, with
Column C used to generate the fit based on the parameters
in Cells H1 and H2. Columns D and E calculate the 95%
confidence interval around the fit. b) The solution of the
fit calculated by SOLVER.
12. The SE of the y values, R², and CI are automatically
calculated: 0.122, 0.895, and 0.257, respectively.
13. Figure 1a illustrates the spreadsheet template
with the formulas used in the fitting protocol displayed.
14. Graph Columns C, D, and E versus Column A such that
they are displayed as continuous lines on the graph
(shown in Figure 2a). It can be seen that the initial
estimate (blue line) is a poor fit of the data, with wide
confidence limits (red line).
15. Open the SOLVER function, which can be found under
the Tools menu.
16. In the Set Target Cell box, enter RSQ.
17. Set the Equal To option to Max. SOLVER attempts to
maximize the value of R².
18. In the By Changing Cells box, enter V, Slope.
19. Choose Solve to perform the fit. The program will
iteratively cycle through the fitting routine, changing
the parameter values of V and Slope until the largest
value of R² is calculated. These changes will be displayed
on the spreadsheet template, as illustrated in cells H1
and H2 of Figure 1b. The optimal values of V and Slope are
99.366 and 24.388, respectively, and the maximal value of
R² is 0.997. The blue line in Figure 2b illustrates the
best fit, and it is clear that it is an improvement over
the fit provided by the initial parameter values.
Additionally, the confidence intervals (red line) around
the fit have been reduced.

Figure 2 Boltzmann fit. a) This graph displays the
experimental data points (filled circles), the fit based
on the initial parameter estimates (blue line), and the
95% confidence intervals (red line) around the fit. b) The
fit as calculated by SOLVER.

Conclusion
The procedure described in this paper allows the user
to carry out nonlinear regression analysis of data within
an Excel spreadsheet without the need of specialist
curve fitting programs. The procedure involves manually
entering data and graphing it. The curve fitting procedure
utilizing the SOLVER function is then performed and the
resulting curve fit overlaid on the data. In addition,
the R² value, an index of goodness of fit, and the 95%
confidence intervals are calculated and displayed.
Once the spreadsheet template has been set up, it
can be repeatedly used for new sets of data. If a new
function is used to describe data, it is entered manually
in Column C and the appropriate parameters are designated
on the worksheet. It is important that the function
be entered in the correct format, since it is very easy to
make mistakes when converting a formula into the
single-line format that the program recognizes. Care
should also be taken when entering the initial parameter
estimates, because the iteration procedure may proceed
in the wrong direction and never find a solution
if inappropriate values are entered.
The spreadsheet template described in this paper is
available for download from the author's Web site at
http://faculty.washington.edu/ambrown/.

References
1. Brown AM. A step-by-step guide to non-linear regression
analysis of experimental data using a Microsoft Excel
spreadsheet. Comp Prog Meth Biomed 2001; 65:191-200.
2. Smith S, Lasdon L. Solving large sparse nonlinear
programs using GRG. ORSA Journal on Computing 1992; 4:2-15.
3. Johnson ML. Why, when, and how biochemists should use
least squares. Anal Biochem 1992; 206:215-25.
4. Dempster J. Computer Analysis of Electrophysiological
Signals. London: Academic Press, 1993; 104-32.

Dr. Brown is Assistant Professor, Department of Neurology,
Box 356465, University of Washington School of Medicine,
Seattle, WA 98195-6465, U.S.A.; tel.: 206-616-8278; fax:
206-685-8100; e-mail: [email protected].
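As a cross-check on the spreadsheet formulas entered in steps 7 to 11, the same goodness-of-fit quantities can be sketched in Python. This is an illustrative sketch under stated assumptions rather than part of the published template: the function name `fit_statistics`, the dictionary keys, and the four data points are hypothetical, and the critical t value is passed in by hand because the Python standard library has no equivalent of Excel's TINV.

```python
import math


def fit_statistics(ys, yfits, n_params, critical_t):
    """Mirror spreadsheet cells H3 to H8 for data ys and fitted values yfits.

    critical_t must be supplied by the caller (Excel: =TINV(0.05, df)).
    """
    n = len(ys)
    mean_of_y = sum(ys) / n                                   # H3: mean of y
    df = n - n_params                                         # H4: degrees of freedom
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, yfits))   # Equation 1
    ss_total = sum((y - mean_of_y) ** 2 for y in ys)
    se_of_y = math.sqrt(ss_resid / df)                        # H5: SE of y
    rsq = 1.0 - ss_resid / ss_total                           # H6: R squared
    ci = critical_t * se_of_y                                 # H8: confidence interval
    return {
        "mean_of_y": mean_of_y,
        "df": df,
        "se_of_y": se_of_y,
        "rsq": rsq,
        "ci": ci,
    }


# Hypothetical data and fitted values (not the data from this note).
ys = [1.0, 2.0, 3.0, 4.0]
yfits = [1.1, 1.9, 3.2, 3.8]

# Two-tailed 5% critical t value for df = 2, looked up by hand.
stats = fit_statistics(ys, yfits, n_params=2, critical_t=4.303)

# Columns D and E of the template: upper and lower confidence limits.
upper = [f + stats["ci"] for f in yfits]
lower = [f - stats["ci"] for f in yfits]
print(stats)
```

Because the total sum of squares around the mean is fixed for a given data set, maximizing R² in SOLVER is equivalent to minimizing SS in Equation 1. For the 19 data points and two parameters used in this note, df is 17, for which the two-tailed 5% critical t value is about 2.11.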
