Application:
After fitting data with one or more models, you should evaluate the goodness of
fit. A visual examination of the fitted curve displayed in the Curve Fitting Tool
should be your first step. Beyond that, the toolbox provides these goodness of fit
measures for both linear and nonlinear parametric fits:
- Residuals
- Goodness of fit statistics
- Confidence and prediction bounds
You can group these measures into two types: graphical and numerical. The
residuals and prediction bounds are graphical measures, while the goodness of fit
statistics and confidence bounds are numerical measures.
Generally speaking, graphical measures are more beneficial than numerical
measures because they allow you to view the entire data set at once, and they can
easily display a wide range of relationships between the model and the data. The
numerical measures are more narrowly focused on a particular aspect of the data and
often try to compress that information into a single number. In practice, depending on
your data and analysis requirements, you might need to use both types to determine
the best fit.
Note that it is possible that none of your fits can be considered the best one. In this
case, it might be that you need to select a different model. Conversely, it is also
possible that all the goodness of fit measures indicate that a particular fit is the best
one. However, if your goal is to extract fitted coefficients that have physical meaning,
but your model does not reflect the physics of the data, the resulting coefficients are
useless. In this case, understanding what your data represents and how it was
measured is just as important as evaluating the goodness of fit.
5.2.1.1. Residuals
The residuals from a fitted model are defined as the differences between the
response data and the fit to the response data at each predictor value.
residual = data - fit
You display the residuals in the Curve Fitting Tool by selecting the View->Residuals
menu item.
Mathematically, the residual for a specific predictor value is the difference
between the observed response value y and the predicted response value ŷ.
Assuming the model you fit to the data is correct, the residuals approximate the
random errors. Therefore, if the residuals appear to behave randomly, it suggests that
the model fits the data well. However, if the residuals display a systematic pattern, it
is a clear sign that the model fits the data poorly.
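As a concrete sketch of this calculation outside the toolbox, the following Python/NumPy snippet (with invented example data, not the data from the figures) fits a first-degree polynomial and computes residual = data - fit; for a well-specified model the residuals scatter randomly around zero.

```python
import numpy as np

# Invented example data: a linear trend with additive noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# First-degree polynomial fit, then residual = data - fit.
coeffs = np.polyfit(x, y, deg=1)
fitted = np.polyval(coeffs, x)
residuals = y - fitted

# For a least-squares fit with an intercept, the residuals sum to
# (numerically) zero and should show no systematic pattern against x.
```

Plotting `residuals` against `x` reproduces the kind of residual display described above: random scatter about the zero line indicates an adequate fit.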
A graphical display of the residuals for a first degree polynomial fit is shown
below. The top plot shows that the residuals are calculated as the vertical distance
from the data point to the fitted curve. The bottom plot shows that the residuals are
displayed relative to the fit, which is the zero line.
The residuals appear randomly scattered around zero, indicating that the model
describes the data well.
A graphical display of the residuals for a second-degree polynomial fit is shown
below. The model includes only the quadratic term, and does not include a linear or
constant term.
The residuals are systematically positive for much of the data range, indicating that
this model is a poor fit for the data.
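To mirror this scenario, one can force a quadratic-only model (no linear or constant term) onto data that actually follows a straight line; the hypothetical Python sketch below (invented data) shows the resulting systematic sign pattern in the residuals.

```python
import numpy as np

# Invented data with a genuinely linear trend.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Misspecified model y ~ a * x**2: least-squares estimate of the single
# coefficient a, then residual = data - fit.
a = np.sum(x**2 * y) / np.sum(x**4)
residuals = y - a * x**2

# The residuals stay positive over most of the range and turn negative at
# the high end -- a systematic pattern rather than random scatter.
```

The sign pattern, not the individual residual values, is what flags the misspecification.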
The 95% non-simultaneous prediction bounds for new observations are shown
below. To display prediction bounds in the Curve Fitting Tool, select the
View->Prediction Bounds menu item. Alternatively, you can view prediction bounds
for the function or for new observations using the Analysis GUI.
The prediction bounds for poly3, the cubic fit, indicate that new observations can be
predicted accurately throughout the entire data range. This is not the case for poly5,
the fifth-degree fit, which has wider
prediction bounds in the area of the missing data, apparently because the data does
not contain enough information to estimate the higher degree polynomial terms
accurately. In other words, a fifth-degree polynomial overfits the data. You can
confirm this by using the Analysis GUI to compute bounds for the functions
themselves.
The 95% prediction bounds for poly5 are shown below. As you can see, the
uncertainty in estimating the function is large in the area of the missing data.
Therefore, you would conclude that more data must be collected before you can make
accurate predictions using a fifth-degree polynomial.
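Non-simultaneous prediction bounds for a new observation can be computed directly from the standard regression formula ŷ0 ± t·s·sqrt(1 + x0ᵀ(XᵀX)⁻¹x0). The Python sketch below (invented data) is an illustration of that formula, not the toolbox's own implementation.

```python
import numpy as np
from scipy import stats

# Invented quadratic data with noise.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 1.0 + 0.5 * x - 0.05 * x**2 + rng.normal(scale=0.3, size=x.size)

def prediction_bounds(x, y, degree, x_new, level=0.95):
    """Non-simultaneous prediction bounds for a polynomial fit."""
    X = np.vander(x, degree + 1)              # design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    n, p = X.shape
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)              # residual variance estimate
    XtX_inv = np.linalg.inv(X.T @ X)
    X0 = np.vander(np.atleast_1d(x_new), degree + 1)
    y0 = X0 @ beta
    # Bound for a *new observation*: the "+1" term adds the observation's
    # own noise variance on top of the uncertainty in the fitted function.
    se = np.sqrt(s2 * (1.0 + np.sum(X0 * (X0 @ XtX_inv), axis=1)))
    t = stats.t.ppf(0.5 + level / 2.0, n - p)
    return y0 - t * se, y0 + t * se

lo, hi = prediction_bounds(x, y, degree=2, x_new=5.0)
```

Dropping the "+1" under the square root gives the narrower bounds for the fitted function itself, which is exactly the distinction exploited above when the functional bounds for poly5 confirm the overfitting.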
In conclusion, you should examine all available goodness of fit measures before
deciding on the best fit. A graphical examination of the fit and residuals should
always be your initial approach. However, some fit characteristics are revealed only
through numerical fit results, statistics, and prediction bounds.
There are four principal assumptions that justify the use of linear models for
purposes of inference or prediction:
(i) Linearity and additivity of the relationship between dependent and independent
variables:
(a) The expected value of the dependent variable is a straight-line function of each
independent variable, holding the others fixed.
(b) The slope of that line does not depend on the values of the other variables.
(c) The effects of different independent variables on the expected value of the
dependent variable are additive.
(ii) Statistical independence of the errors (in particular, no correlation between
consecutive errors in the case of time series data)
(iii) Homoscedasticity (constant variance) of the errors