

$U = I_n - A C^{-1} A^{T}$ and the vector $z = y - A a_{\mathrm{true}}$ (check the agreement with $\hat{S}$ above by multiplication).

Properties of $z$: $E[z] = 0$ and covariance matrix $V[z] = \sigma^2 I_n$, i.e., $V[z_i] = E[z_i^2] = \sigma^2$ and $E[z_i z_k] = 0$ for $i \neq k$.

Application: estimate the data variance (for $n \gg p$) by $\hat{\sigma}^2 = S/(n - p)$.
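A minimal MATLAB sketch of this estimator (the design matrix A, data y, and the sizes n and p below are illustrative assumptions, not values from the text):

n = 50; p = 3;
t = linspace(0, 1, n)';
A = [ones(n,1) t t.^2];                % n-by-p design matrix
y = A*[1; -2; 0.5] + 0.2*randn(n,1);   % simulated data, true sigma = 0.2
ahat = A \ y;                          % least-squares parameter estimate
S = norm(y - A*ahat)^2;                % residual sum of squares
sigma2_hat = S / (n - p)               % unbiased estimate of sigma^2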

5.2. EVALUATING THE GOODNESS OF FIT

After fitting data with one or more models, you should evaluate the goodness of
fit. A visual examination of the fitted curve displayed in the Curve Fitting Tool
should be your first step. Beyond that, the toolbox provides these goodness of fit
measures for both linear and nonlinear parametric fits:
• Residuals
• Goodness of fit statistics
• Confidence and prediction bounds
You can group these measures into two types: graphical and numerical. The
residuals and prediction bounds are graphical measures, while the goodness of fit
statistics and confidence bounds are numerical measures.
Generally speaking, graphical measures are more beneficial than numerical
measures because they allow you to view the entire data set at once, and they can
easily display a wide range of relationships between the model and the data. The
numerical measures are more narrowly focused on a particular aspect of the data and
often try to compress that information into a single number. In practice, depending on
your data and analysis requirements, you might need to use both types to determine
the best fit.
Note that it is possible that none of your fits can be considered the best one. In this
case, it might be that you need to select a different model. Conversely, it is also
possible that all the goodness of fit measures indicate that a particular fit is the best
one. However, if your goal is to extract fitted coefficients that have physical meaning,
but your model does not reflect the physics of the data, the resulting coefficients are
useless. In this case, understanding what your data represents and how it was
measured is just as important as evaluating the goodness of fit.

5.2.1.1. Residuals
The residuals from a fitted model are defined as the differences between the
response data and the fit to the response data at each predictor value:

residual = data - fit

You display the residuals in the Curve Fitting Tool by selecting the
View->Residuals menu item.

Mathematically, the residual for a specific predictor value is the difference
between the observed response value $y$ and the predicted response value
$\hat{y}$: $r = y - \hat{y}$.
Assuming the model you fit to the data is correct, the residuals approximate the
random errors. Therefore, if the residuals appear to behave randomly, it suggests that
the model fits the data well. However, if the residuals display a systematic pattern, it
is a clear sign that the model fits the data poorly.
A graphical display of the residuals for a first-degree polynomial fit is shown
below. The top plot shows that the residuals are calculated as the vertical distance
from the data point to the fitted curve. The bottom plot shows the residuals
displayed relative to the fit, which is the zero line.
The residuals appear randomly scattered around zero, indicating that the model
describes the data well.
A graphical display of the residuals for a second-degree polynomial fit is shown
below. The model includes only the quadratic term, and does not include a linear or
constant term.
The residuals are systematically positive for much of the data range, indicating that
this model is a poor fit for the data.
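A base-MATLAB sketch of the two situations (with made-up illustrative data, not the toolbox's example figures):

x = linspace(0, 2, 40)';
y = 3 + 2*x + 0.2*randn(40,1);         % data with a true linear trend
p = polyfit(x, y, 1);                  % first-degree polynomial fit
r1 = y - polyval(p, x);                % residuals scatter randomly about zero
b = (x.^2) \ y;                        % quadratic-term-only model, y ~ b*x^2
r2 = y - b*x.^2;                       % residuals show a systematic pattern
subplot(2,1,1), plot(x, r1, 'o', x, 0*x, 'k-'), title('first-degree fit: random residuals')
subplot(2,1,2), plot(x, r2, 'o', x, 0*x, 'k-'), title('x^2-only fit: systematic residuals')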

5.2.1.2. Goodness of Fit Statistics


After using graphical methods to evaluate the goodness of fit, you should examine
the goodness of fit statistics. The Curve Fitting Toolbox supports these goodness of
fit statistics for parametric models:
• The sum of squares due to error (SSE)
• R-square
• Adjusted R-square
• Root mean squared error (RMSE)
For the current fit, these statistics are displayed in the Results list box in the Fit
Editor. For all fits in the current curve-fitting session, you can compare the goodness
of fit statistics in the Table of fits.
Sum of Squares Due to Error. This statistic measures the total deviation of the
response values from the fit to the response values. It is also called the summed
square of residuals and is usually labeled as SSE.
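For reference, the usual definitions of these statistics, with $n$ data points, $m$ fitted coefficients, and residual degrees of freedom $v = n - m$ (these are the standard textbook formulas; consult the toolbox documentation for its exact conventions):

$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2,$

$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE}\,(n-1)}{\mathrm{SST}\,v}, \qquad \mathrm{RMSE} = \sqrt{\mathrm{SSE}/v}.$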
Example: Evaluating the Goodness of Fit
This example fits several polynomial models to generated data and evaluates the
goodness of fit. The data is cubic and includes a range of missing values.
rand('state',0)                        % seed the (legacy) random number generator
x = [1:0.1:3 9:0.1:10]';               % predictor values with a gap from 3 to 9
c = [2.5 -0.5 1.3 -0.1];               % true cubic coefficients
y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + (rand(size(x))-0.5);  % cubic signal plus uniform noise
After you import the data, fit it using a cubic polynomial and a fifth-degree
polynomial. The data, fits, and residuals are shown below. You display the residuals
in the Curve Fitting Tool with the View->Residuals menu item.
Both models appear to fit the data well, and the residuals appear to be randomly
distributed around zero. Therefore, a graphical evaluation of the fits does not reveal
any obvious differences between the two equations.
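If you prefer the command line, a rough equivalent of these GUI steps (assuming the Curve Fitting Toolbox and the x, y generated above) is:

[f3, gof3] = fit(x, y, 'poly3');       % cubic fit, with goodness-of-fit statistics
[f5, gof5] = fit(x, y, 'poly5');       % fifth-degree fit
subplot(2,1,1), plot(f3, x, y, 'residuals'), title('poly3 residuals')
subplot(2,1,2), plot(f5, x, y, 'residuals'), title('poly5 residuals')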
The numerical fit results are shown below.
As expected, the fit results for poly3 are reasonable because the generated data is
cubic. The 95% confidence bounds on the fitted coefficients indicate that they are
acceptably accurate. However, the 95% confidence bounds for poly5 indicate that the
fitted coefficients are not known accurately.
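At the command line, the same coefficient bounds can be inspected with confint, continuing the f3 and f5 fits sketched above:

ci3 = confint(f3, 0.95)                % narrow intervals: coefficients well determined
ci5 = confint(f5, 0.95)                % wide intervals: coefficients poorly determined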
The goodness of fit statistics are shown below. By default, the adjusted R-square
and RMSE statistics are not displayed in the Table of Fits. To display these statistics,
open the Table Options GUI by clicking the Table options button. The statistics do
not reveal a substantial difference between the two equations.
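Programmatically, these statistics are returned in the second output of fit; a sketch of a side-by-side comparison (field names as in the toolbox's gof structure):

fprintf('poly3: SSE=%.3f R^2=%.4f adjR^2=%.4f RMSE=%.3f\n', ...
        gof3.sse, gof3.rsquare, gof3.adjrsquare, gof3.rmse)
fprintf('poly5: SSE=%.3f R^2=%.4f adjR^2=%.4f RMSE=%.3f\n', ...
        gof5.sse, gof5.rsquare, gof5.adjrsquare, gof5.rmse)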
The 95% non-simultaneous prediction bounds for new observations are shown
below. To display prediction bounds in the Curve Fitting Tool, select the
View->Prediction Bounds menu item. Alternatively, you can view prediction bounds
for the function or for new observations using the Analysis GUI.
The prediction bounds for poly3 indicate that new observations can be predicted
accurately throughout the entire data range. This is not the case for poly5. It has wider
prediction bounds in the area of the missing data, apparently because the data does
not contain enough information to estimate the higher-degree polynomial terms
accurately. In other words, a fifth-degree polynomial overfits the data. You can
confirm this by using the Analysis GUI to compute bounds for the functions
themselves.
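A command-line sketch of both kinds of bounds via predint, continuing the assumed f5 fit ('observation' gives bounds for new observations, 'functional' gives bounds for the fitted function itself, and 'off' requests non-simultaneous bounds):

xq = linspace(1, 10, 200)';                           % query points spanning the gap
pobs = predint(f5, xq, 0.95, 'observation', 'off');   % bounds for new observations
pfun = predint(f5, xq, 0.95, 'functional', 'off');    % bounds for the function itself
plot(x, y, 'o', xq, f5(xq), 'k-', xq, pobs, 'r--', xq, pfun, 'b:')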
The 95% prediction bounds for poly5 are shown below. As you can see, the
uncertainty in estimating the function is large in the area of the missing data.
Therefore, you would conclude that more data must be collected before you can make
accurate predictions using a fifth-degree polynomial.
In conclusion, you should examine all available goodness of fit measures before
deciding on the best fit. A graphical examination of the fit and residuals should
always be your initial approach. However, some fit characteristics are revealed only
through numerical fit results, statistics, and prediction bounds.

5.3. TESTING A LINEAR MODEL

There are four principal assumptions which justify the use of linear models for
purposes of inference or prediction:
(i) Linearity and additivity of the relationship between dependent and independent
variables:
(a) The expected value of dependent variable is a straight-line function of each
independent variable, holding the others fixed.
(b) The slope of that line does not depend on the values of the other variables.
(c) The effects of different independent variables on the expected value of the
dependent variable are additive.
(ii) Statistical independence of the errors (in particular, no correlation between
consecutive errors in the case of time series data)
(iii) Homoscedasticity (constant variance) of the errors
(iv) Normality of the error distribution
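Residual plots are the usual way to check these assumptions. A minimal base-MATLAB sketch with hypothetical data (the variables X and y below are illustrative):

rng(0)                                   % reproducible illustration
X = linspace(0, 5, 60)';                 % single hypothetical predictor
y = 1 + 2*X + 0.3*randn(60,1);           % linear signal plus Gaussian noise
b = [ones(60,1) X] \ y;                  % least-squares fit with intercept
r = y - [ones(60,1) X]*b;                % residuals
yhat = y - r;                            % fitted values
subplot(1,3,1), plot(yhat, r, 'o'), title('residuals vs fitted')     % linearity, constant variance
subplot(1,3,2), plot(r(1:end-1), r(2:end), 'o'), title('lag plot')   % independence of consecutive errors
subplot(1,3,3), histogram(r), title('residual histogram')            % normality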
