Excel Regression


Spreadsheet Problem Solving

fitting models to data


straight-line regression
multilinear regression
nonlinear regression
model building and selection
using
Trendline
Data Analysis Regression tool
Solver

Review of Straight-line Linear Regression


[ from Class #6 ]
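[Figure: data points and the model line y = ax + b. At x11, the error e11 is the vertical distance between the measured y11 and the value predicted by the model.]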

For each data point, there is an error between that point and the model line. Fitting the model has to do with minimizing these errors.

Finding the model parameters that give the best fit


For the straight-line model, the model parameters are
the slope (a) and the intercept (b).
The problem is then to find the values of a and b that
give the best fit. What is meant by the best fit?
The standard measure of goodness of fit is the sum
of squares of the errors:
$$SSE = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = a\,x_i + b$$

So, the problem reduces to finding the minimum of SSE by adjusting a and b.
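
The SSE for any trial values of a and b can be computed directly. A minimal sketch in Python, using a small made-up data set (the x and y values are hypothetical, not the CO2 data, and the function name is illustrative):

```python
def sse_line(xs, ys, a, b):
    """Sum of squared errors between measured ys and the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

good_trial = sse_line(xs, ys, a=2.0, b=0.0)   # line close to the data
poor_trial = sse_line(xs, ys, a=1.0, b=1.0)   # line far from the data
print(good_trial, poor_trial)
```

The trial line with the smaller SSE is the better fit; the best fit is the pair (a, b) for which no other pair gives a smaller SSE.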

Fitting a straight-line model to data


The minimization of SSE can be solved by calculus
to give formulas for the best values of a and b:

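$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}, \qquad b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}$$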

and Excel solves problems like this with either formulas or built-in tools (Data Analysis Regression & Trendline).
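
The calculus result for a and b is just a handful of sums. A sketch in Python, assuming the usual closed-form least-squares expressions; the data are hypothetical and the name fit_line is illustrative:

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the model y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

# Hypothetical data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

a, b = fit_line(xs, ys)
print(a, b)   # approximately 1.94 and 0.15
```

In Excel, the worksheet functions SLOPE and INTERCEPT return the same two values.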

Example: straight-line fit

Transfer the data to an Excel spreadsheet and create a graph.

[Figure: CO2 Emissions for the US. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1320 to 1520.]

Calculating the slope and intercept using Excel formulas

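$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}, \qquad b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}$$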

The formulas behind the numbers

Use the straight-line model equation to compute the predictions, and copy these to the graph, displaying them as a straight line.

[Figure: CO2 Emissions for the US, data with the fitted line y = 21.32x - 41090. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1300 to 1550.]
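
Predictions come from plugging each year into the model equation. A small sketch using only the coefficients printed on the charts in these slides (21.32 on this chart, 21.315 on the Trendline chart); the function name is illustrative. It also shows why the number of displayed digits matters when x is a large number such as a year:

```python
def predict(year, slope, intercept):
    """Model prediction for the straight line y = slope*x + intercept."""
    return slope * year + intercept

years = list(range(1989, 2001))

# Coefficients as displayed on the two charts (rounded by Excel for display)
pred_4_digits = [predict(yr, 21.32, -41090) for yr in years]
pred_5_digits = [predict(yr, 21.315, -41090) for yr in years]

# A change of 0.005 in the slope shifts the year-2000 prediction by 10 MMT C
shift_2000 = pred_4_digits[-1] - pred_5_digits[-1]
print(pred_4_digits[-1], pred_5_digits[-1], shift_2000)
```

For calculations, it is safer to use the full-precision coefficients from the worksheet than the rounded values in a chart label.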

Using an alternate, shortcut approach

Trendline

Start with a simple graph of the data.
Select the data series by clicking on it.
Right-click on a data point to get the context-sensitive menu.
Select the Add Trendline option.

[Figure: CO2 Emissions for the US. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1320 to 1520.]

The Add Trendline dialog box

Linear is selected by default, which is OK for this problem.
Click on the Options tab.

Options tab

Set for Display equation on chart.
Click OK.

Initial form of graph with straight line added

[Figure: CO2 Emissions for the US with the Trendline equation y = 21.315x - 41090 displayed on the chart. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1300 to 1550.]

Fix up the equation display.

[Figure: CO2 Emissions for the US with Trendline y = 21.315x - 41090. x-axis: Year, 1989 to 2000; y-axis: CO2 Emissions (MMT C), 1300 to 1550.]

Looks just like before, but we got there quicker.
But neither of these approaches gives us much information about the model, how good it is, etc.

A 2nd alternate approach


Data Analysis Regression tool

Tools > Data Analysis

Recall that, if Data Analysis does not appear on the Tools menu, you will need to check Analysis ToolPak in the Add-Ins dialog box [if it's not there, you will have to go back to Microsoft Office/Excel set-up].

Initial, empty Regression dialog box

Regression dialog box set up for our problem

Checking Residuals will also give us the model predictions.

Initial (poorly formatted) Regression output display


[ on new worksheet ]

Format > AutoFormat > OK, and fix up the display for appropriate significant figures.

Final Display of Regression Output


[ tons of info, most of which you will not understand for a couple of years ]

Regression Statistics: used to judge goodness of fit
Coefficients: intercept and slope values
P-values: used to judge whether terms belong in the model
Predicted values from the residual output: add to data graph for visual comparison with model

Judging Goodness of Fit

Multiple R (correlation coefficient): if close to +1 or -1, indicates strong correlation between x and y [something we already know from the original graph!]
R Square (coefficient of determination): percentage of the variability in y that's accounted for by the model
Standard Error: gives an idea of how far off the model predictions will be
Adjusted R Square: adjustment to R^2 that penalizes the value for using a model with too many terms

Adjusted R^2 or Standard Error can be used to compare different models and choose which fits best. The higher the value of Adjusted R^2 the better, the lower the value of Standard Error the better.
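
These three measures follow from the SSE and the spread of y about its mean. A sketch under the usual definitions for a model with an intercept (the function name, the data, and the fitted values a = 1.94, b = 0.15 are hypothetical illustrations, not the CO2 results):

```python
import math

def goodness_of_fit(ys, predictions, n_terms):
    """Return (R^2, adjusted R^2, standard error).

    n_terms counts the fitted x-terms, excluding the intercept.
    """
    n = len(ys)
    mean_y = sum(ys) / n
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    sst = sum((y - mean_y) ** 2 for y in ys)     # total variability in y
    dof = n - n_terms - 1                        # residual degrees of freedom
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof    # penalizes extra terms
    std_err = math.sqrt(sse / dof)
    return r2, adj_r2, std_err

# Hypothetical data and its least-squares line y = 1.94x + 0.15
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
predictions = [1.94 * x + 0.15 for x in xs]

r2, adj_r2, std_err = goodness_of_fit(ys, predictions, n_terms=1)
print(r2, adj_r2, std_err)
```

Adding a term always raises R^2 or leaves it unchanged, which is why Adjusted R^2 or Standard Error is the fairer basis for comparing models of different size.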

Judging whether terms belong in the model


A P-value estimates how likely it is that a coefficient this far from zero would turn up by chance if its true value were zero.

A P-value of 5% (0.05) or greater causes suspicion that the coefficient may not be significant and that the term should probably be dropped from the model.

P-values that are quite small, like these, indicate that there is little question about the significance of the term coefficients. In our case here, that means that both the intercept term and the slope term belong in the model.
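
For a straight-line fit, each P-value comes from a t statistic (coefficient divided by its standard error) with n - 2 degrees of freedom. The sketch below is one way to reproduce that calculation without a statistics library; it integrates the t density with Simpson's rule, which is an implementation choice made here and not necessarily how Excel computes it. The data and function names are hypothetical:

```python
import math

def t_pdf(t, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return c * (1 + t * t / dof) ** (-(dof + 1) / 2)

def two_sided_p(t, dof, steps=2000):
    """Two-sided P-value: 1 minus the area of the density between -|t| and +|t|."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

def line_p_values(xs, ys):
    """P-values for the slope and intercept of a least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b = mean_y - a * mean_x
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))                           # standard error of the fit
    se_a = s / math.sqrt(sxx)                              # standard error of the slope
    se_b = s * math.sqrt(1 / n + mean_x ** 2 / sxx)        # standard error of the intercept
    return two_sided_p(a / se_a, n - 2), two_sided_p(b / se_b, n - 2)

# Hypothetical data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
p_slope, p_intercept = line_p_values(xs, ys)
print(p_slope, p_intercept)
```

For this made-up data the slope has a small P-value and the intercept a large one, so by the 0.05 rule the intercept would be a candidate for dropping. That differs from the CO2 example, where both terms belong in the model.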


The Data Analysis Regression tool appears much more complicated and involved than the shortcut Trendline tool, so . . .
Why use Data Analysis Regression?
1) It provides more information that lets us judge the goodness of fit and significance of model terms
2) It can handle model forms that cannot be handled by Trendline
So, generally, when using Excel, we prefer the Data Analysis Regression tool over Trendline, but Trendline is still quite good for quick and dirty looks at the data.
Learn to use both!

More complicated models


Polynomial models

$$y = a + bx + cx^2 + dx^3 + \cdots$$

Note: it is called linear regression, even when there are nonlinear terms in x, because the terms are linear in the model parameters, a, b, c, etc.
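
Because the parameters enter linearly, the best-fit values solve a set of linear equations (the normal equations), whatever functions of x multiply them. A sketch of that idea, assuming plain Gaussian elimination for the solve; the function names and both synthetic data sets are illustrative only:

```python
import math

def solve(A, rhs):
    """Solve A c = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

def fit_linear_model(basis, xs, ys):
    """Least-squares coefficients for y = sum of coeff[j] * basis[j](x)."""
    rows = [[f(x) for f in basis] for x in xs]
    m = len(basis)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(m)]
    return solve(ata, aty)

# Synthetic data generated exactly from y = 1 + 2x + 3x^2
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
poly_coeffs = fit_linear_model([lambda x: 1.0, lambda x: x, lambda x: x * x], xs, ys)

# Synthetic data generated exactly from y = 2 + 3/x + 0.5 ln x
xs2 = [1.0, 2.0, 3.0, 4.0, 5.0]
ys2 = [2 + 3 / x + 0.5 * math.log(x) for x in xs2]
log_coeffs = fit_linear_model([lambda x: 1.0, lambda x: 1 / x, math.log], xs2, ys2)

print(poly_coeffs, log_coeffs)
```

In the spreadsheet, the same thing is done by adding one column per term (x, x^2, 1/x, ln x, and so on) and selecting all of those columns as the X input range.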

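General linear models

$$y = a\,f_1(x) + b\,f_2(x) + c\,f_3(x) + d\,f_4(x) + \cdots$$

Examples:
polynomial models above
$$y = a + b\,\frac{1}{x} + c \ln x$$

Multilinear models

$$y = a\,f_1(x_1, x_2, \ldots) + b\,f_2(x_1, x_2, \ldots) + c\,f_3(x_1, x_2, \ldots) + \cdots$$

Examples:
$$y = a + b\,x_1 + c\,x_2 + d\,x_1 x_2$$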

$$y = a\,e^{x_1 / x_2}$$

Nonlinear models

Transformable to linear:
$$y = a\,e^{bx} \quad\Longrightarrow\quad \ln y = \ln a + b\,x$$
(straight-line regression!)

Not transformable:
$$P = 10^{A - B/(T + C)}$$

We can use the Data Analysis Regression tool for everything except the nonlinear models that can't be transformed into linear. For those, we can use the Solver.
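
The exponential model is the standard transformable case: taking logs turns it into a straight line in ln y versus x, so the straight-line formulas apply. A minimal sketch with synthetic data generated from assumed values a = 2 and b = 0.5 (the names fit_line and fit_exponential are illustrative):

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for a straight line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

def fit_exponential(xs, ys):
    """Fit y = a*exp(b*x) by straight-line regression of ln y on x."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope      # a = exp(intercept), b = slope

# Synthetic data generated exactly from y = 2*exp(0.5*x)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a_fit, b_fit = fit_exponential(xs, ys)
print(a_fit, b_fit)
```

One caution: this minimizes the squared errors in ln y, not in y, so with noisy data the result can differ somewhat from a direct nonlinear fit of the original equation.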


Example: polynomial regression


Curvature is evident in the data.

[Figure: Viscosity of Water at Atmospheric Pressure. x-axis: Temperature (degF), 0 to 250; y-axis: Viscosity (cp), 0.000 to 2.000.]

Setting up for polynomial fits

Select for quadratic model, etc


Data Analysis Regression tool

Check Labels because headings are included in the selections for Y and X.
Check Residuals.

Quadratic model regression results

model performance
adjR2

model coefficients
copy to graph


Quadratic model really doesn't capture behavior of data

[Figure: Viscosity of Water at Atmospheric Pressure, Data and Quadratic fit. x-axis: Temperature (degF), 0 to 250; y-axis: Viscosity (cp), 0.000 to 2.000.]

Continue with fits of cubic, 4th- & 5th-order polynomials

Summary of results

Looks like 5th-order offers best performance, but improvement is marginal over 4th-order.
Resulting model:
$$\text{Visc} = 3.161 - 0.05699\,T + 5.023 \times 10^{-4}\,T^2 - 2.162 \times 10^{-6}\,T^3 + 3.593 \times 10^{-9}\,T^4$$
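
The resulting 4th-order model can be evaluated for any temperature in the range of the data. A sketch, assuming the coefficient signs alternate (3.161, -0.05699, +5.023e-4, -2.162e-6, +3.593e-9), which keeps the predictions inside the plotted 0 to 2 cp range, with T in degF and viscosity in cp. The function name is illustrative, and Horner's rule is simply a convenient way to evaluate a polynomial:

```python
# Coefficients of T^0 .. T^4; the signs are an assumption (see lead-in)
COEFFS = [3.161, -0.05699, 5.023e-4, -2.162e-6, 3.593e-9]

def viscosity_cp(temp_f):
    """Evaluate the 4th-order viscosity model at temp_f (degF) by Horner's rule."""
    result = 0.0
    for c in reversed(COEFFS):
        result = result * temp_f + c
    return result

for temp in (32.0, 100.0, 212.0):
    print(temp, round(viscosity_cp(temp), 3))
```

The model should only be used within the temperature range of the data; a polynomial can behave badly when extrapolated.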


[Figure: Viscosity of Water at Atmospheric Pressure, Data with Quadratic, Cubic, and 4th Order fits. x-axis: Temperature (degF), 20 to 220; y-axis: Viscosity (cp), 0.000 to 2.000.]

Precautions on polynomial fitting


Try to use the lowest-order model that gives a good fit.
Higher-order models will have wiggles between data
points that will cause prediction errors.
In fact, an (n-1)th-order polynomial will provide a perfect
fit to the n data points, but it will usually do bizarre things
in between the data points.
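
The perfect-fit polynomial can be demonstrated without any regression at all: the (n-1)th-order polynomial through n points is the interpolating polynomial, evaluated below in Lagrange form (a swapped-in technique chosen for brevity, not what the Regression tool does). The seven data points are made up: flat at zero with a single bump.

```python
def lagrange(xs, ys, x):
    """Value at x of the unique polynomial of order len(xs)-1 through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical data: seven points, so the perfect fit is a 6th-order polynomial
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

at_points = [lagrange(xs, ys, x) for x in xs]   # reproduces ys exactly
between = lagrange(xs, ys, 0.5)                 # halfway between two zero points
print(at_points, between)
```

The polynomial passes through every point, yet halfway between the first two points (both zero) it predicts about 0.90, which is the kind of wiggle the precaution warns about.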


Example: multi-linear regression

Model 1: $y = a + b\,x_1 + c\,x_2$

Model 2: $y = b\,x_1 + c\,x_2$

X-input range includes two independent variables: x1 and x2.
A high P-value for the intercept in Model 1 suggests Model 2 without an intercept, but there is a significant loss in adjR2.
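
The two models differ only in whether an intercept is fitted. A sketch of both fits for two independent variables, solving the 2x2 normal equations directly (with centered variables when the intercept is included). The data are synthetic, generated from y = 1 + 2 x1 + 3 x2 so that Model 1 is exact; the function names are illustrative:

```python
def fit_two_var(x1, x2, y, intercept=True):
    """Least-squares fit of y = a + b*x1 + c*x2, or y = b*x1 + c*x2 if intercept=False."""
    n = len(y)
    m1 = sum(x1) / n if intercept else 0.0
    m2 = sum(x2) / n if intercept else 0.0
    my = sum(y) / n if intercept else 0.0
    u1 = [v - m1 for v in x1]
    u2 = [v - m2 for v in x2]
    uy = [v - my for v in y]
    s11 = sum(p * p for p in u1)
    s22 = sum(q * q for q in u2)
    s12 = sum(p * q for p, q in zip(u1, u2))
    s1y = sum(p * r for p, r in zip(u1, uy))
    s2y = sum(q * r for q, r in zip(u2, uy))
    det = s11 * s22 - s12 * s12
    b = (s22 * s1y - s12 * s2y) / det
    c = (s11 * s2y - s12 * s1y) / det
    a = my - b * m1 - c * m2          # evaluates to 0.0 when intercept=False
    return a, b, c

def sse_two_var(x1, x2, y, coeffs):
    a, b, c = coeffs
    return sum((yi - (a + b * p + c * q)) ** 2 for p, q, yi in zip(x1, x2, y))

# Synthetic data from y = 1 + 2*x1 + 3*x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * p + 3 * q for p, q in zip(x1, x2)]

model1 = fit_two_var(x1, x2, y, intercept=True)
model2 = fit_two_var(x1, x2, y, intercept=False)
sse1 = sse_two_var(x1, x2, y, model1)
sse2 = sse_two_var(x1, x2, y, model2)
print(model1, sse1)
print(model2, sse2)
```

Dropping the intercept can never lower the SSE; the question the P-value and adjR2 help answer is whether the increase is large enough to matter.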


Multilinear Model Performance

Model performance isn't that great for either model, and Model 1 doesn't appear dramatically better than Model 2.

[Figure: Predicted y versus Measured y for Model 1 and Model 2. Both axes run from 0 to 12.]

Note: for multi-linear models, we plot Predicted vs Measured y. A perfect model would place points directly on the 45-degree line.

Nonlinear Regression
Fitting the parameters of the van der Waals equation of state
Data for SO2

$$P = \frac{RT}{V - b} - \frac{a}{V^2}$$

Find the values of a and b that give the best predictions for P, when compared to the measured values of P.

Strategy for Nonlinear Regression


1) estimate initial values for a and b
2) compute predicted Ps using data for V and T
3) compute errors between predicted Ps and measured Ps
4) sum the squares of these errors to compute SSE
5) have the Solver minimize SSE
by adjusting the values of a and b
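
The five steps can be sketched outside the spreadsheet as well. Everything specific below is an assumption made for illustration: the "measured" pressures are synthetic, generated from made-up values a = 0.68 and b = 5.6e-5 in SI units (not the SO2 data from the slides), and a simple compass search stands in for Solver, which uses a different algorithm.

```python
R = 8.314  # gas constant, J/(mol K)

def p_vdw(v, t, a, b):
    """van der Waals pressure (Pa) for molar volume v (m^3/mol) and temperature t (K)."""
    return R * t / (v - b) - a / v ** 2

def sse(params, data):
    """Steps 2 to 4: predicted P, errors, and sum of squared errors."""
    a, b = params
    return sum((p - p_vdw(v, t, a, b)) ** 2 for v, t, p in data)

def compass_search(f, start, steps, feasible, tol=1e-6, max_iter=5000):
    """Step 5 stand-in: try +/- one step in each parameter, halve the steps when stuck."""
    x, steps = list(start), list(steps)
    first_steps = list(steps)
    best = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * steps[i]
                if feasible(trial):
                    value = f(trial)
                    if value < best:
                        x, best, improved = trial, value, True
        if not improved:
            steps = [s / 2 for s in steps]
            if all(s < tol * s0 for s, s0 in zip(steps, first_steps)):
                break
    return x, best

# Synthetic "measurements" from made-up parameter values (illustration only)
A_TRUE, B_TRUE = 0.68, 5.6e-5
data = [(v, t, p_vdw(v, t, A_TRUE, B_TRUE))
        for t in (350.0, 400.0, 450.0)
        for v in (3e-4, 5e-4, 1e-3, 2e-3)]
v_min = min(v for v, _, _ in data)

def feasible(params):
    """Constraint b >= 0 as in the Solver set-up, plus b < V so that V - b stays positive."""
    return 0.0 <= params[1] < v_min

start = [0.0, 0.0]   # step 1: a = b = 0 reduces the model to the ideal gas law
sse_start = sse(start, data)
fit, sse_fit = compass_search(lambda p: sse(p, data), start, [0.1, 1e-5], feasible)
print(fit, sse_start, sse_fit)
```

As with Solver, the answer can depend on the initial estimates, so it is worth trying more than one starting point and checking the fit on a predicted versus measured plot.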


Basic data

Calculated Pressure by both the ideal gas law and van der Waals
Sum of squares of this column

Ideal Gas Calculation
Sum of Squares Calculation
van der Waals Calculation
Error Calculation

Setting up Solver Parameters


SSE as Target Cell
Minimize
by adjusting a and b
with b>=0 constraint

Results


Fit of van der Waals Eqn for SO2 and Comparison to Ideal Gas Law

Note departure of ideal gas predictions at higher pressures.

[Figure: Predicted Pressure (Pa) versus Measured Pressure (Pa) for van der Waals and Ideal Gas. Both axes run from 0 to 12000000.]
