Matlab Mathworks Data Analysis
Data Analysis
R2013b
How to Contact MathWorks
www.mathworks.com Web
comp.soft-sys.matlab Newsgroup
www.mathworks.com/contact_TS.html Technical Support
[email protected] Product enhancement suggestions
[email protected] Bug reports
[email protected] Documentation error reports
[email protected] Order status, license renewals, passcodes
[email protected] Sales, pricing, and general information
508-647-7000 (Phone)
508-647-7001 (Fax)
The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098
For contact information about worldwide offices, see the MathWorks Web site.
MATLAB
Data Analysis
COPYRIGHT 2005–2013 by The MathWorks, Inc.
The software described in this document is furnished under a license agreement. The software may be used
or copied only under the terms of the license agreement. No part of this manual may be photocopied or
reproduced in any form without prior written consent from The MathWorks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation
by, for, or through the federal government of the United States. By accepting delivery of the Program
or Documentation, the government hereby agrees that this software or documentation qualifies as
commercial computer software or commercial computer software documentation as such terms are used
or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and
conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern
the use, modification, reproduction, release, performance, display, and disclosure of the Program and
Documentation by the federal government (or other entity acquiring for or through the federal government)
and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the
government's needs or is inconsistent in any respect with federal procurement law, the government agrees
to return the Program and Documentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History
September 2005 Online only New for MATLAB 7.1 (Release 14SP3)
March 2006 Online only Revised for MATLAB Version 7.2 (Release 2006a)
September 2006 Online only Revised for MATLAB Version 7.3 (Release 2006b)
March 2007 Online only Revised for MATLAB Version 7.4 (Release 2007a)
September 2007 Online only Revised for MATLAB Version 7.5 (Release 2007b)
March 2008 Online only Revised for MATLAB Version 7.6 (Release 2008a)
October 2008 Online only Revised for MATLAB Version 7.7 (Release 2008b)
March 2009 Online only Revised for MATLAB 7.8 (Release 2009a)
September 2009 Online only Revised for MATLAB 7.9 (Release 2009b)
March 2010 Online only Revised for MATLAB 7.10 (Release 2010a)
September 2010 Online only Revised for MATLAB Version 7.11 (R2010b)
April 2011 Online only Revised for MATLAB Version 7.12 (R2011a)
September 2011 Online only Revised for MATLAB Version 7.13 (R2011b)
March 2012 Online only Revised for MATLAB Version 7.14 (R2012a)
September 2012 Online only Revised for MATLAB Version 8.0 (R2012b)
March 2013 Online only Revised for MATLAB Version 8.1 (R2013a)
September 2013 Online only Revised for MATLAB Version 8.2 (R2013b)
Contents
Data Processing
1
Importing and Exporting Data . . . . . . . . . . . . . . . . . . . . . . 1-2
Importing Data into the Workspace . . . . . . . . . . . . . . . . . . . 1-2
Exporting Data from the Workspace . . . . . . . . . . . . . . . . . . 1-2
Plotting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
Example: Loading and Plotting Data . . . . . . . . . . . . . . . . . 1-3
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
Representing Missing Data Values . . . . . . . . . . . . . . . . . . . 1-6
Calculating with NaNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
Removing NaNs from Data . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7
Interpolating Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
Inconsistent Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-9
Filtering Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11
Filter Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11
Example: Moving Average Filter . . . . . . . . . . . . . . . . . . . . . 1-12
Example: Discrete Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-13
Detrending Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
Example: Removing Linear Trends from Data . . . . . . . . . . 1-16
Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-20
Functions for Calculating Descriptive Statistics . . . . . . . . . 1-20
Example: Using MATLAB Data Statistics . . . . . . . . . . . . . 1-23
Interactive Data Exploration
2
What Is Interactive Data Exploration? . . . . . . . . . . . . . . . 2-2
Interacting with MATLAB Data Graphs . . . . . . . . . . . . . . . 2-2
Marking Up Graphs with Data Brushing . . . . . . . . . . . . . 2-4
What Is Data Brushing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
How to Brush Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
Effects of Brushing on Data . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
Other Data Brushing Aspects . . . . . . . . . . . . . . . . . . . . . . . 2-10
Making Graphs Responsive with Data Linking . . . . . . . 2-12
What Is Data Linking? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
Why Use Linked Plots? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
How to Link Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
How Linked Plots Behave . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
Linking vs. Refreshing Plots . . . . . . . . . . . . . . . . . . . . . . . . 2-18
Using Linked Plot Controls . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Interacting with Graphed Data . . . . . . . . . . . . . . . . . . . . . 2-23
Data Brushing with the Variables Editor . . . . . . . . . . . . . . 2-23
Using Data Tips to Explore Graphs . . . . . . . . . . . . . . . . . . . 2-24
Example: Visually Exploring Demographic Statistics . . 2-26
Regression Analysis
3
Linear Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Correlation Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Residuals and Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . 3-7
Fitting Data with Curve Fitting Toolbox Functions . . . . . . 3-11
Interactive Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
The Basic Fitting GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
Preparing for Basic Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Opening the Basic Fitting GUI . . . . . . . . . . . . . . . . . . . . . . 3-14
Example: Using Basic Fitting GUI . . . . . . . . . . . . . . . . . . . 3-16
Programmatic Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
MATLAB Functions for Polynomial Models . . . . . . . . . . . . 3-35
Linear Model with Nonpolynomial Terms . . . . . . . . . . . . . . 3-41
Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-42
Example: Programmatic Fitting . . . . . . . . . . . . . . . . . . . . . 3-43
Time Series Analysis
4
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Time Series Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Time Series Data Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Example: Time Series Objects and Methods . . . . . . . . . . . . 4-6
Time Series Constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-29
Time Series Collection Constructor . . . . . . . . . . . . . . . . . . . 4-30
Index
1
Data Processing
Importing and Exporting Data on page 1-2
Plotting Data on page 1-3
Missing Data on page 1-6
Inconsistent Data on page 1-9
Filtering Data on page 1-11
Detrending Data on page 1-16
Descriptive Statistics on page 1-20
Importing and Exporting Data
Importing Data into the Workspace on page 1-2
Exporting Data from the Workspace on page 1-2
Importing Data into the Workspace
The first step in analyzing data is to import it into the MATLAB
workspace.
See Methods for Importing Data for information about importing data from
specific file formats.
Exporting Data from the Workspace
When you analyze your data, you might create new variables or modify
imported variables. You can export variables from the MATLAB workspace to
various file formats, both character-based and binary. You can, for example,
create HDF and Microsoft Excel files.
$$Y(z) = H(z)X(z) = \frac{b(1) + b(2)z^{-1} + \cdots + b(N_b+1)z^{-N_b}}{a(1) + a(2)z^{-1} + \cdots + a(N_a+1)z^{-N_a}}\,X(z)$$
Here Y(z) is the z-transform of the filtered output y(n). The coefficients b and
a are unchanged by the z-transform.
In digital signal processing (DSP), it is customary to write transfer functions
as rational expressions in $z^{-1}$ and to order the numerator and denominator
terms in ascending powers of $z^{-1}$.
Consider the following transfer function:
$$H(z) = \frac{b(z^{-1})}{a(z^{-1})} = \frac{2 + 3z^{-1}}{1 + 0.2z^{-1}}$$
To apply this transfer function to the data in count.dat:
1 Load the matrix count into the workspace:
load count.dat;
2 Extract the first column and assign it to x:
x = count(:,1);
3 Enter the coefficients of the denominator ordered in ascending powers of
$z^{-1}$ to represent $1 + 0.2z^{-1}$:
a = [1 0.2];
4 Enter the coefficients of the numerator to represent $2 + 3z^{-1}$:
b = [2 3];
5 Call the filter function:
y = filter(b,a,x);
6 Compare the original data and the shaped data with an overlaid plot of the
two curves:
t = 1:length(x);
plot(t,x,'-.',t,y,'-'), grid on
legend('Original Data','Shaped Data','Location','northwest')
The plot shows this filter primarily modifies the amplitude of the original data.
Plot of Original and Shaped Data
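The transfer function in this example corresponds to a difference equation in the time domain. The following is a minimal sketch, assuming zero initial conditions and a short made-up input vector (not the count data), that compares the output of filter with an explicit loop over that difference equation. The variable names yloop and maxDiff are illustrative only.

```matlab
% Made-up input; any column vector works here.
x = [3; 1; 4; 1; 5; 9; 2; 6];
b = [2 3];      % numerator coefficients:   2 + 3z^-1
a = [1 0.2];    % denominator coefficients: 1 + 0.2z^-1

y = filter(b,a,x);

% Difference equation with zero initial conditions:
%   y(n) = 2*x(n) + 3*x(n-1) - 0.2*y(n-1)
yloop = zeros(size(x));
for n = 1:length(x)
    yloop(n) = b(1)*x(n);
    if n > 1
        yloop(n) = yloop(n) + b(2)*x(n-1) - a(2)*yloop(n-1);
    end
end

% Largest difference between the two results
maxDiff = max(abs(y - yloop))
```

If the two approaches agree, maxDiff is zero or at the level of floating-point round-off.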
Detrending Data
In this section...
Introduction on page 1-16
Example: Removing Linear Trends from Data on page 1-16
Introduction
The MATLAB function detrend subtracts the mean or a best-fit line (in
the least-squares sense) from your data. If your data contains several data
columns, detrend treats each data column separately.
Removing a trend from the data enables you to focus your analysis on the
fluctuations in the data about the trend. A linear trend typically indicates
a systematic increase or decrease in the data. A systematic shift can result
from sensor drift, for example. While trends can be meaningful, some types of
analyses yield better insight once you remove trends.
Whether it makes sense to remove trend effects in the data often depends on
the objectives of your analysis.
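As a small illustration of the two behaviors described above, the following sketch applies detrend to a made-up two-column matrix whose columns are exactly linear. It assumes the 'constant' option of detrend for mean removal; the variable names are illustrative.

```matlab
t = (0:9)';
X = [2*t + 5, 10 - t];            % two made-up columns, each exactly linear

dLinear = detrend(X);             % removes the best-fit line from each column
dConst  = detrend(X,'constant');  % removes only the mean of each column

% Nothing but round-off remains after the line is removed
maxLinear = max(abs(dLinear(:)))

% Each column has (nearly) zero mean after mean removal
colMeans = mean(dConst)
```

Because detrend works column by column, the result for one column does not depend on the other.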
Example: Removing Linear Trends from Data
This example shows how to remove a linear trend from daily closing stock
prices to emphasize the price fluctuations about the overall increase. If the
data does have a trend, detrending it forces its mean to zero and reduces
overall variation. The example simulates stock price fluctuations using a
distribution taken from the gallery function.
Follow the steps in this example to learn how to detrend time-varying data.
1 Create a simulated data set and compute its mean. dailyFluct holds the
simulated daily price changes, and sdata holds the resulting daily stock prices:
t = 0:300;
dailyFluct = gallery('normaldata',size(t),2);
sdata = cumsum(dailyFluct) + 20 + t/100;
mean(sdata)
ans =
39.4851
2 Plot and label the data. Notice the systematic increase in the stock prices
that the data displays:
figure
plot(t,sdata);
legend('Original Data','Location','northwest');
xlabel('Time (days)');
ylabel('Stock Price (dollars)');
3 Apply detrend, which performs a linear fit to sdata and then removes the
trend from it. Subtracting the output from the input yields the computed
trend line:
detrend_sdata=detrend(sdata);
trend = sdata - detrend_sdata;
% As expected, the detrended data has a mean very close to 0.
mean(detrend_sdata)
ans =
3.1420e-014
4 Display the results by adding the trend line, the detrended data, and its
mean to the graph:
hold on
plot(t,trend,':r')
plot(t,detrend_sdata,'m')
plot(t,zeros(size(t)),':k')
legend('Original Data','Trend','Detrended Data',...
'mean(Detrended)','Location','northwest')
xlabel('Time (days)');
ylabel('Stock Price (dollars)');
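The trend line that detrend removes is a least-squares line, so it should match a first-degree polynomial fit. The following sketch checks this on a made-up series (not the gallery data used in this example); trendPoly, trendDetrend, and maxDiff are illustrative names.

```matlab
t = (0:300)';
sdata = 20 + t/100 + sin(t/10);    % made-up series with a linear trend

% Trend from an explicit first-degree least-squares fit
p = polyfit(t,sdata,1);            % p(1) is the slope, p(2) is the intercept
trendPoly = polyval(p,t);

% Trend recovered by subtracting the detrended data from the input
trendDetrend = sdata - detrend(sdata);

% The two trend lines agree to within round-off
maxDiff = max(abs(trendPoly - trendDetrend))
```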
Descriptive Statistics
In this section...
Functions for Calculating Descriptive Statistics on page 1-20
Example: Using MATLAB Data Statistics on page 1-23
If you need more advanced statistics features, you might want to use the
Statistics Toolbox software.
Functions for Calculating Descriptive Statistics
Use the following MATLAB functions to calculate the descriptive statistics
for your data.
Note For matrix data, descriptive statistics for each column are calculated
independently.
Statistics Function Summary
Function    Description
max         Maximum value
mean        Average or mean value
median      Median value
min         Smallest value
mode        Most frequent value
std         Standard deviation
var         Variance, which measures the spread or dispersion of the values
The following examples apply MATLAB functions to calculate descriptive
statistics:
Example 1: Calculating Maximum, Mean, and Standard Deviation on page 1-21
Example 2: Subtracting the Mean on page 1-23
Example 1: Calculating Maximum, Mean, and Standard Deviation
This example shows how to use MATLAB functions to calculate the maximum,
mean, and standard deviation values for a 24-by-3 matrix called count.
MATLAB computes these statistics independently for each column in the
matrix.
% Load the sample data
load count.dat
% Find the maximum value in each column
mx = max(count)
% Calculate the mean of each column
mu = mean(count)
% Calculate the standard deviation of each column
sigma = std(count)
The results are
mx =
114 145 257
mu =
32.0000 46.5417 65.5833
sigma =
25.3703 41.4057 68.0281
To get the row numbers where the maximum data values occur in each data
column, specify a second output parameter indx to return the row index.
For example:
[mx,indx] = max(count)
These results are
mx =
114 145 257
indx =
20 20 20
Here, the variable mx is a row vector that contains the maximum value in each
of the three data columns. The variable indx contains the row indices in each
column that correspond to the maximum values.
To find the minimum value in the entire count matrix, reshape this 24-by-3
matrix into a 72-by-1 column vector by using the syntax count(:). Then, to
find the minimum value in the single column, use the following syntax:
min(count(:))
ans =
7
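The same pattern can locate an overall extreme value, not just report it. The following sketch uses a small made-up matrix A in place of count, and assumes you want the row and column of the overall minimum; ind2sub converts the linear index from A(:) back to subscripts.

```matlab
A = [4 9 2; 7 1 8; 3 6 5];     % made-up 3-by-3 data

% Column-wise maxima and the rows where they occur
[mx,indx] = max(A)

% Overall minimum and its linear index in the column vector A(:)
[mn,k] = min(A(:));

% Convert the linear index to row and column subscripts
[r,c] = ind2sub(size(A),k)
```

For this matrix, the overall minimum is in row 2, column 2.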
Example 2: Subtracting the Mean
Subtract the mean from each column of the matrix by using the following
syntax:
% Get the size of the count matrix
[n,p] = size(count)
% Compute the mean of each column
mu = mean(count)
% Create a matrix of mean values by
% replicating the mu vector for n rows
MeanMat = repmat(mu,n,1)
% Subtract the column mean from each element
% in that column
x = count - MeanMat
Note Subtracting the mean from the data is also called detrending. For
more information about removing the mean or the best-fit line from the data,
see Detrending Data on page 1-16.
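As an alternative sketch, bsxfun can subtract the row vector of column means without building the replicated matrix first. The example below uses a small made-up matrix A rather than count and checks that both approaches give the same result.

```matlab
A = [1 10; 2 20; 3 60];            % made-up data
mu = mean(A);                      % column means: [2 30]

% Approach used in this example: replicate mu, then subtract
x1 = A - repmat(mu,size(A,1),1);

% Alternative: let bsxfun expand mu across the rows
x2 = bsxfun(@minus,A,mu);

sameResult = isequal(x1,x2)
```

Either way, each column of the result has a mean of zero.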
Example: Using MATLAB Data Statistics
The Data Statistics dialog box helps you calculate and plot descriptive
statistics with the data. This example shows how to use MATLAB Data
Statistics to calculate and plot statistics for a 24-by-3 matrix, called count.
The data represents how many vehicles passed by traffic counting stations
on three streets.
This section contains the following topics:
Calculating and Plotting Descriptive Statistics on page 1-24
Formatting Data Statistics on Plots on page 1-26
Saving Statistics to the MATLAB Workspace on page 1-29
Generating Code Files on page 1-30
Note MATLAB Data Statistics is available for 2-D plots only.
Calculating and Plotting Descriptive Statistics
1 Load and plot the data:
load count.dat
[n,p] = size(count);
% Define the x-values
t = 1:n;
% Plot the data and annotate the graph
plot(t,count)
legend('Station 1','Station 2','Station 3',...
'Location','northwest')
xlabel('Time'), ylabel('Vehicle Count')
Note The legend contains the name of each data set, as specified by the
legend function: Station 1, Station 2, and Station 3. A data set refers
to each column of data in the array you plotted. If you do not name the data
sets, default names are assigned: data1, data2, and so on.
2 In the Figure window, select Tools > Data Statistics.
The Data Statistics dialog box opens and displays descriptive statistics for
the X- and Y-data of the Station 1 data set.
Note The Data Statistics GUI calculates the range, which is the difference
between the minimum and maximum values in the selected data set. The
Data Statistics GUI does not display the range on the plot.
3 Select a different data set in the Statistics for list: Station 2.
This displays the statistics for the X and Y data of the Station 2 data set.
4 Select the check box for each statistic you want to display on the plot, and
then click Save to workspace.
For example, to plot the mean of Station 2, select the mean check box
in the Y column.
This plots a horizontal line to represent the mean of Station 2 and
updates the legend to include this statistic.
Formatting Data Statistics on Plots
The Data Statistics GUI uses colors and line styles to distinguish statistics
from the data on the plot. This portion of the example shows how to customize
the display of descriptive statistics on a plot, such as the color, line width,
line style, or marker.
Note Do not edit display properties of statistics until you finish plotting all
the statistics with the data. If you add or remove statistics after editing plot
properties, the changes to plot properties are lost.
To modify the display of data statistics on a plot:
1 In the MATLAB Figure window, click the Edit Plot button in the toolbar.
This step enables plot editing.
2 Double-click the statistic on the plot for which you want to edit display
properties. For example, double-click the horizontal line representing the
mean of Station 2.
This step opens the Property Editor below the MATLAB Figure window,
where you can modify the appearance of the line used to represent this
statistic.
3 In the Property Editor, specify the Line and Marker styles, sizes, and
colors.
Tip Alternatively, right-click the statistic on the plot, and select an option
from the shortcut menu.
Saving Statistics to the MATLAB Workspace
This portion of the example shows how to save statistics in the Data Statistics
GUI to the MATLAB workspace.
Note When your plot contains multiple data sets, save statistics for each
data set individually. To display statistics for a different data set, select it
from the Statistics for list in the Data Statistics GUI.
1 In the Data Statistics dialog box, click the Save to workspace button.
2 In the Save Statistics to Workspace dialog box, select options to save statistics
for either X data, Y data, or both. Then, enter the corresponding variable
names.
In this example, save only the Y data. Enter the variable name as
Loc2countstats.
3 Click OK.
This step saves the descriptive statistics to a structure. The new variable is
added to the MATLAB workspace.
To view the new structure variable, type the variable name at the MATLAB
prompt:
Loc2countstats
Loc2countstats =
min: 9
max: 145
mean: 46.5417
median: 36
mode: 9
std: 41.4057
range: 136
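A structure with the same field names can also be assembled at the command line. The following is a sketch only: it uses a short made-up vector y rather than the Station 2 data, and the variable name stats is illustrative. It is not the code that the Data Statistics GUI itself runs.

```matlab
y = [9; 12; 9; 36; 145];           % made-up data, not the count data

% Build a structure with the same fields that the saved variable contains
stats = struct( ...
    'min',    min(y), ...
    'max',    max(y), ...
    'mean',   mean(y), ...
    'median', median(y), ...
    'mode',   mode(y), ...
    'std',    std(y), ...
    'range',  max(y) - min(y));    % range is max minus min

stats
```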
Generating Code Files
This portion of the example shows how to generate a file containing MATLAB
code that reproduces the format of the plot and the plotted statistics with
new data.
1 In the Figure window, select File > Generate Code.
This step creates a function code file and displays it in the MATLAB Editor.
The code can programmatically reproduce what you did interactively with the
Data Statistics GUI and the Property Editor.
2 Change the name of the function on the first line of the file from createfigure
to something more specific, like countplot. Save the file to your current
folder with the file name countplot.m.
3 Generate some new, random count data:
randcount = 300*rand(24,3);
4 Reproduce the plot with the new data and the recomputed statistics:
countplot(t,randcount)
2
Interactive Data
Exploration
What Is Interactive Data Exploration? on page 2-2
Marking Up Graphs with Data Brushing on page 2-4
Making Graphs Responsive with Data Linking on page 2-12
Interacting with Graphed Data on page 2-23
What Is Interactive Data Exploration?
Interacting with MATLAB Data Graphs
The MATLAB data analysis and graphics tools for visual data exploration
leverage its Handle Graphics
Here, $s^2_{ij}$ is the sample covariance between column i and column j of the data.
Because the count matrix contains three columns, the covariance matrix
is 3-by-3.
Note In the special case when a vector is the argument of cov, the function
returns the variance.
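The properties described here can be checked numerically. The following sketch uses a made-up 4-by-3 matrix A in place of count: the covariance matrix is 3-by-3 and symmetric, its diagonal holds the column variances, and cov of a single column returns that column's variance.

```matlab
A = [1 2 10; 2 4 8; 3 7 7; 4 9 3];   % made-up data with three columns

C = cov(A);
sizeC = size(C)                       % 3-by-3 for three columns

% Symmetry: the (i,j) and (j,i) entries agree
asym = max(max(abs(C - C')))

% The diagonal holds the variance of each column
diagDiff = max(abs(diag(C)' - var(A)))

% A vector argument returns the variance
vecDiff = abs(cov(A(:,1)) - var(A(:,1)))
```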
Correlation Coefficients
The MATLAB function corrcoef produces a matrix of sample correlation
coefficients for a data matrix (where each column represents a separate
quantity). The correlation coefficients range from -1 to 1, where
Values close to 1 indicate that there is a positive linear relationship
between the data columns.
Values close to -1 indicate that one column of data has a negative linear
relationship to another column of data (anticorrelation).
Values close to or equal to 0 suggest there is no linear relationship between
the data columns.
For an m-by-n matrix, the correlation-coefficient matrix is n-by-n. The
arrangement of the elements in the correlation coefficient matrix corresponds
to the location of the elements in the covariance matrix, as described in
Covariance on page 3-3.
For an example of calculating correlation coefficients, load the sample data in
count.dat that contains a 24-by-3 matrix:
load count.dat
Type the following syntax to calculate the correlation coefficients:
corrcoef(count)
This results in the following 3-by-3 matrix of correlation coefficients:
ans =
1.0000 0.9331 0.9599
0.9331 1.0000 0.9553
0.9599 0.9553 1.0000
Because all correlation coefficients are close to 1, there is a strong positive
correlation between each pair of data columns in the count matrix.
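Correlation coefficients are covariances scaled by the standard deviations of the two columns involved. As a sketch of that relationship, the following code rebuilds the correlation matrix from cov and std for a made-up matrix A and compares it with the output of corrcoef; the names Rmanual and maxDiff are illustrative.

```matlab
A = [1 2 10; 2 4 8; 3 7 7; 4 9 3];   % made-up data with three columns

C = cov(A);                           % covariance matrix
s = std(A);                           % row vector of column standard deviations

% Divide each covariance by the product of the two standard deviations
Rmanual = C ./ (s' * s);

maxDiff = max(max(abs(Rmanual - corrcoef(A))))
```

The diagonal entries are 1 because each column is perfectly correlated with itself.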
Linear Regression
In this section...
Introduction on page 3-6
Residuals and Goodness of Fit on page 3-7
Fitting Data with Curve Fitting Toolbox Functions on page 3-11
Introduction
A data model explicitly describes a relationship between predictor and
response variables. Linear regression fits a data model that is linear in
the model coefficients. The most common type of linear regression is a
least-squares fit, which can fit both lines and polynomials, among other linear
models.
Before you model the relationship between pairs of quantities, it is a good
idea to perform correlation analysis to establish if a linear relationship
exists between these quantities. Be aware that variables can have nonlinear
relationships, which correlation analysis cannot detect. For more information,
see Linear Correlation on page 3-2.
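To illustrate the caution about nonlinear relationships, the following sketch uses made-up data in which y is completely determined by x, yet the linear correlation coefficient is essentially zero because the relationship is symmetric about x = 0.

```matlab
x = (-5:5)';
y = x.^2;                 % y depends on x exactly, but not linearly

R = corrcoef([x y]);      % 2-by-2 matrix of correlation coefficients
r = R(1,2)                % correlation between x and y
```

A value of r near zero here reflects the absence of a linear relationship, not the absence of any relationship.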
The MATLAB Basic Fitting GUI helps you to fit your data, so you can
calculate model coefficients and plot the model on top of the data. For an
example, see Example: Using Basic Fitting GUI on page 3-16. You also
can use the MATLAB polyfit and polyval functions to fit your data to
a model that is linear in the coefficients. For an example, see Example:
Programmatic Fitting on page 3-43.
If you need to fit data with a nonlinear model, try transforming the variables
to make the relationship linear. Alternatively, try to fit a nonlinear
function directly using the Statistics Toolbox nlinfit function, the
Optimization Toolbox lsqcurvefit function, or functions in
the Curve Fitting Toolbox.
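One common transformation is taking logarithms. The following sketch assumes an exponential model y = A*exp(B*x) with made-up, noise-free data; because log(y) = log(A) + B*x is linear in x, polyfit can recover both parameters. The model and the names A, B, Afit, and Bfit are assumptions for illustration.

```matlab
% Made-up, noise-free data from y = A*exp(B*x)
A = 3;
B = 0.7;
x = (0:0.5:4)';
y = A*exp(B*x);

% Fit a line to log(y) versus x
p = polyfit(x,log(y),1);

Bfit = p(1)           % slope estimates B
Afit = exp(p(2))      % exponential of the intercept estimates A
```

With noisy data, keep in mind that the fit minimizes residuals in log(y), not in y.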
This topic explains how to:
Use correlation analysis to determine whether two quantities are related to
justify fitting the data.
Fit a linear model to the data.
Evaluate the goodness of fit by plotting residuals and looking for patterns.
Calculate measures of goodness of fit $R^2$ and adjusted $R^2$.
Residuals and Goodness of Fit
Residuals are the difference between the observed values of the response
(dependent) variable and the values that a model predicts. When you
fit a model that is appropriate for your data, the residuals approximate
independent random errors. That is, the distribution of residuals ought not to
exhibit a discernible pattern.
Producing a fit using a linear model requires minimizing the sum of
the squares of the residuals. This minimization yields what is called a
least-squares fit. You can gain insight into the goodness of a fit by visually
examining a plot of the residuals. If the residual plot has a pattern (that is,
residual data points do not appear to have a random scatter), the lack of
randomness indicates that the model does not properly fit the data.
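As a sketch of what a patterned residual plot looks like numerically, the following code fits a straight line to made-up data that is actually quadratic. The residuals are positive at both ends and negative in the middle, which is a systematic pattern rather than random scatter.

```matlab
x = (1:10)';
y = x.^2;                       % made-up data with a quadratic relationship

p = polyfit(x,y,1);             % fit a straight line anyway
yresid = y - polyval(p,x);      % residuals of the linear fit

% Signs of the residuals, in order of increasing x
residSigns = sign(yresid)'
```

Plotting yresid against x (for example, with plot(x,yresid,'o')) shows the same U-shaped pattern.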
Evaluate each fit you make in the context of your data. For example, if
your goal of fitting the data is to extract coefficients that have physical
meaning, then it is important that your model reflect the physics of the data.
Understanding what your data represents, how it was measured, and how it
is modeled is important when evaluating the goodness of fit.
One measure of goodness of fit is the coefficient of determination, or $R^2$
(pronounced r-square). This statistic indicates how closely values you obtain
from fitting a model match the dependent variable the model is intended
to predict. Statisticians often define $R^2$ using the residual variance from a
fitted model:
$$R^2 = 1 - \frac{SS_{\text{resid}}}{SS_{\text{total}}}$$
$SS_{\text{resid}}$ is the sum of the squared residuals from the regression.
$SS_{\text{total}}$ is the sum of the squared differences from the mean of the
dependent variable (total sum of squares). Both are positive scalars.
To learn how to compute $R^2$ when you use the Basic Fitting tool, see Derive
$R^2$, the Coefficient of Determination on page 3-21. To learn more about
calculating the $R^2$ statistic and its multivariate generalization, continue
reading here.
Example: Computing $R^2$ from Polynomial Fits
You can derive $R^2$ from the coefficients of a polynomial regression to determine
how much variance in y a linear model explains, as the following example
describes:
1 Create two variables, x and y, from the first two columns of the count
variable in the data file count.dat:
load count.dat
x = count(:,1);
y = count(:,2);
2 Use polyfit to compute a linear regression that predicts y from x:
p = polyfit(x,y,1)
p =
1.5229 -2.1911
p(1) is the slope and p(2) is the intercept of the linear predictor. You can
also obtain regression coefficients using the Basic Fitting GUI.
3 Call polyval to use p to predict y, calling the result yfit:
yfit = polyval(p,x);
Using polyval saves you from typing the fit equation yourself, which in
this case looks like:
yfit = p(1) * x + p(2);
4 Compute the residual values as a vector of signed numbers:
yresid = y - yfit;
5 Square the residuals and total them to obtain the residual sum of squares:
SSresid = sum(yresid.^2);
6 Compute the total sum of squares of y by multiplying the variance of y by
the number of observations minus 1:
SStotal = (length(y)-1) * var(y);
7 Compute $R^2$ using the formula given in the introduction of this topic:
rsq = 1 - SSresid/SStotal
rsq =
0.8707
This demonstrates that the linear equation 1.5229 * x -2.1911 predicts
87% of the variance in the variable y.
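For a straight-line fit with an intercept, $R^2$ equals the square of the correlation coefficient between x and y. This is consistent with the values reported for the count data, where 0.9331 squared is approximately 0.8707. The following sketch checks the identity on made-up data; rsqDiff is an illustrative name.

```matlab
x = [1 2 3 4 5 6]';
y = [2.1 3.9 6.2 7.8 10.1 12.2]';     % made-up, roughly linear data

% R-square from the residuals of a degree-1 fit
p = polyfit(x,y,1);
yresid = y - polyval(p,x);
SSresid = sum(yresid.^2);
SStotal = (length(y)-1) * var(y);
rsq = 1 - SSresid/SStotal;

% Square of the correlation coefficient between x and y
R = corrcoef([x y]);
rsqDiff = abs(rsq - R(1,2)^2)
```

This identity holds for the degree-1 case only; it does not carry over to higher-degree polynomial fits.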
Computing Adjusted $R^2$ for Polynomial Regressions
You can usually reduce the residuals in a model by fitting a higher degree
polynomial. When you add more terms, you increase the coefficient of
determination, $R^2$. You get a closer fit to the data, but at the expense of a
more complex model, for which $R^2$ cannot account. However, a refinement of
this statistic, adjusted $R^2$, does include a penalty for the number of terms
in a model. Adjusted $R^2$, therefore, is more appropriate for comparing how
different models fit to the same data. The adjusted $R^2$ is defined as:
$$R^2_{\text{adjusted}} = 1 - \frac{SS_{\text{resid}}}{SS_{\text{total}}} \cdot \frac{n-1}{n-d-1}$$
where n is the number of observations in your data, and d is the degree of
the polynomial. (A linear fit has a degree of 1, a quadratic fit 2, a cubic
fit 3, and so on.)
The following example repeats the steps of the previous example, Example:
Computing $R^2$ from Polynomial Fits on page 3-8, but performs a cubic (degree
3) fit instead of a linear (degree 1) fit. From the cubic fit, you compute both
simple and adjusted $R^2$ values to evaluate whether the extra terms improve
predictive power:
1 Create two variables, x and y, from the first two columns of the count
variable in the data file count.dat:
load count.dat
x = count(:,1);
y = count(:,2);
2 Call polyfit to generate a cubic fit to predict y from x:
p = polyfit(x,y,3)
p =
-0.0003 0.0390 0.2233 6.2779
p(1) through p(4) are the coefficients of the cubic predictor in descending
powers of x, so p(1) multiplies x.^3 and p(4) is the constant term. You can
also obtain regression coefficients using the Basic Fitting GUI.
3 Call polyval to use the coefficients in p to predict y, naming the result yfit:
yfit = polyval(p,x);
polyval evaluates the explicit equation you could manually enter as:
yfit = p(1) * x.^3 + p(2) * x.^2 + p(3) * x + p(4);
4 Compute the residual values as a vector of signed numbers:
yresid = y - yfit;
5 Square the residuals and total them to obtain the residual sum of squares:
SSresid = sum(yresid.^2);
6 Compute the total sum of squares of y by multiplying the variance of y by
the number of observations minus 1:
SStotal = (length(y)-1) * var(y);
7 Compute simple $R^2$ for the cubic fit using the formula given in the
introduction of this topic:
rsq = 1 - SSresid/SStotal
rsq =
0.9083
8 Finally, compute adjusted $R^2$ to account for degrees of freedom:
rsq_adj = 1 - SSresid/SStotal * (length(y)-1)/(length(y)-length(p))
rsq_adj =
0.8945
The adjusted $R^2$, 0.8945, is smaller than simple $R^2$, 0.9083. It provides a
more reliable estimate of the power of your polynomial model to predict.
In many polynomial regression models, adding terms to the equation
increases both $R^2$ and adjusted $R^2$. In the preceding example, using a cubic fit
increased both statistics compared to a linear fit. (You can compute adjusted
$R^2$ for the linear fit for yourself to demonstrate that it has a lower value.)
However, it is not always true that a linear fit is worse than a higher-order
fit: a more complicated fit can have a lower adjusted $R^2$ than a simpler fit,
indicating that the increased complexity is not justified. Also, while $R^2$ always
varies between 0 and 1 for the polynomial regression models that the Basic
Fitting tool generates, adjusted $R^2$ for some models can be negative, indicating
a model that has too many terms.
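As a rough sketch of the comparison suggested above, the following loop repeats the preceding steps for a linear and a cubic fit of the same count.dat columns. The loop variable degree and the printed format are illustrative choices and are not part of the original example.

```matlab
% Compare simple and adjusted R^2 for a linear and a cubic fit
load count.dat
x = count(:,1);
y = count(:,2);
n = length(y);
SStotal = (n-1) * var(y);          % same as sum((y - mean(y)).^2)
for degree = [1 3]                 % degrees to compare (illustrative choice)
    p = polyfit(x,y,degree);
    SSresid = sum((y - polyval(p,x)).^2);
    rsq = 1 - SSresid/SStotal;
    rsq_adj = 1 - (SSresid/SStotal) * (n-1)/(n-length(p));
    fprintf('degree %d: R^2 = %.4f, adjusted R^2 = %.4f\n', degree, rsq, rsq_adj);
end
```

For the cubic case, the printed values should agree with rsq and rsq_adj computed in the steps above.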
Correlation does not imply causality. Always interpret coefficients of
correlation and determination cautiously. The coefficients only quantify how
much variance in a dependent variable a fitted model removes. Such measures
do not describe how appropriate your model (or the independent variables
you select) is for explaining the behavior of the variable the model predicts.
Fitting Data with Curve Fitting Toolbox Functions
The Curve Fitting Toolbox software extends core MATLAB functionality by
enabling the following data-fitting capabilities:
Linear and nonlinear parametric fitting, including standard linear least
squares, nonlinear least squares, weighted least squares, constrained least
squares, and robust fitting procedures
Nonparametric fitting
Statistics for determining the goodness of fit
Extrapolation, differentiation, and integration
GUI that facilitates data sectioning and smoothing
Saving fit results in various formats, including MATLAB code files,
MAT-files, and workspace variables
For more information, see the Curve Fitting Toolbox documentation.
Interactive Fitting
In this section...
The Basic Fitting GUI
Preparing for Basic Fitting
Opening the Basic Fitting GUI
Example: Using Basic Fitting GUI
The Basic Fitting GUI
The MATLAB Basic Fitting GUI allows you to interactively:
Model data using a spline interpolant, a shape-preserving interpolant, or a
polynomial up to the tenth degree
Plot one or more fits together with data
Plot the residuals of the fits
Compute model coefficients
Compute the norm of the residuals (a statistic you can use to analyze how
well a model fits your data)
Use the model to interpolate or extrapolate outside of the data
Save coefficients and computed values to the MATLAB workspace for use
outside of the GUI
Generate MATLAB code to recompute fits and reproduce plots with new
data
Note The Basic Fitting GUI is only available for 2-D plots. For more
advanced fitting and regression analysis, see the Curve Fitting Toolbox
documentation and the Statistics Toolbox documentation.
Preparing for Basic Fitting
The Basic Fitting GUI sorts your data in ascending order before fitting. If
your data set is large and the values are not sorted in ascending order, it will
take longer for the Basic Fitting GUI to preprocess your data before fitting.
You can speed up the Basic Fitting GUI by first sorting your data. To create
sorted vectors x_sorted and y_sorted from data vectors x and y, use the
MATLAB sort function:
[x_sorted, i] = sort(x);
y_sorted = y(i);
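A minimal sketch of this preprocessing step, using small made-up vectors in place of real data. The issorted check and the variable name idx are illustrative additions; idx is used instead of i only to avoid shadowing the imaginary unit.

```matlab
% Hypothetical unsorted data, for illustration only
x = [3 1 2];
y = [30 10 20];
if issorted(x)
    x_sorted = x;                  % already ascending, nothing to do
    y_sorted = y;
else
    [x_sorted, idx] = sort(x);     % idx records the reordering
    y_sorted = y(idx);             % apply the same reordering to y
end
plot(x_sorted, y_sorted, 'o')      % plot before opening Basic Fitting
```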
Opening the Basic Fitting GUI
To use the Basic Fitting GUI, you must first plot your data in a figure window,
using any MATLAB plotting command that produces (only) x and y data.
To open the Basic Fitting GUI, select Tools > Basic Fitting from the menus
at the top of the figure window.
When you fully expand it by twice clicking the arrow button in the lower
right corner, the window displays three panels. Use these panels to:
Select a model and plotting options
Examine and export model coefficients and norms of residuals
Examine and export interpolated and extrapolated values.
To expand or collapse panels one-by-one, click the arrow button in the lower
right corner of the interface.
Example: Using Basic Fitting GUI
This example shows how to use the Basic Fitting GUI to fit, visualize, analyze,
save, and generate code for polynomial regressions.
Load and Plot Census Data
Predict the Census Data with a Cubic Polynomial Fit
View and Save the Cubic Fit Parameters
Derive $R^2$, the Coefficient of Determination
Interpolate and Extrapolate Population Values
Generate a Code File to Reproduce the Result
Learn How the Basic Fitting Tool Computes Fits
Load and Plot Census Data
The file census.mat contains U.S. population data for the years 1790
through 1990 at 10-year intervals.
To load and plot the data, type the following commands at the MATLAB
prompt:
load census
plot(cdate,pop,'ro')
The load command adds the following variables to the MATLAB workspace:
cdate: A column vector containing the years from 1790 to 1990 in
increments of 10. It is the predictor variable.
pop: A column vector with U.S. population for each year in cdate. It is
the response variable.
The data vectors are sorted in ascending order, by year. The plot shows the
population as a function of year.
Now you are ready to fit an equation to the data to model population growth
over time.
Predict the Census Data with a Cubic Polynomial Fit
1 Open the Basic Fitting dialog box by selecting Tools > Basic Fitting in
the Figure window.
2 In the Plot fits area of the Basic Fitting dialog box, select the cubic check
box to fit a cubic polynomial to the data.
MATLAB uses your selection to fit the data, and adds the cubic regression
line to the graph as follows.
In computing the fit, MATLAB encounters problems and issues the
following warning:
Polynomial is badly conditioned.
Add points with distinct X values,
select a polynomial with a lower degree,
or select "Center and scale X data."
This warning indicates that the computed coefficients for the model are
sensitive to random errors in the response (the measured population). It
also suggests some things you can do to get a better fit.
3 Continue to use a cubic fit. As you cannot add new observations to the
census data, improve the fit by transforming the values you have to z-scores
before recomputing a fit. Select the Center and scale X data check box in
the GUI to make the Basic Fitting tool perform the transformation.
To learn how centering and scaling data works, see Learn How the Basic
Fitting Tool Computes Fits on page 3-32.
4 Now view the equations and display residuals. In addition to selecting the
Center and scale X data and cubic check boxes, select the following
options:
Show equations
Plot residuals
Show norm of residuals
Selecting Plot residuals creates a subplot of them as a bar graph. The
following figure displays the results of the Basic Fitting GUI options you
selected.
The cubic fit is a poor predictor before the year 1790, where it indicates a
decreasing population. The model seems to approximate the data reasonably
well after 1790. However, a pattern in the residuals shows that the model does
not meet the assumption of normal error, which is a basis for the least-squares
fitting. The data 1 series identified in the legend shows the observed x (cdate) and
y (pop) data values. The cubic regression line presents the fit after centering
and scaling data values. Notice that the figure shows the original data units,
even though the tool computes the fit using transformed z-scores.
For comparison, try fitting another polynomial equation to the census data
by selecting it in the Plot fits area.
Tip You can change the default plot settings and rename data series with
the Property Editor.
View and Save the Cubic Fit Parameters
In the Basic Fitting dialog box, click the arrow button to display the
estimated coefficients and the norm of the residuals in the Numerical
results panel.
To view a specific fit, select it from the Fit list. This displays the coefficients
in the Basic Fitting dialog box, but does not plot the fit in the figure window.
Note If you also want to display a fit on the plot, you must select the
corresponding Plot fits check box.
Save the fit data to the MATLAB workspace by clicking the Save to
workspace button on the Numerical results panel. The Save Fit to
Workspace dialog box opens.
With all check boxes selected, click OK to save the fit parameters as a
MATLAB structure:
fit
fit =
type: 'polynomial degree 3'
coeff: [0.9210 25.1834 73.8598 61.7444]
Now, you can use the fit results in MATLAB programming, outside of the
Basic Fitting GUI.
Derive $R^2$, the Coefficient of Determination
You can get an indication of how well a polynomial regression predicts your
observed data by computing the coefficient of determination, or R-square
(written as $R^2$). The $R^2$ statistic, which ranges from 0 to 1, measures how
useful the independent variable is in predicting values of the dependent
variable:
An $R^2$ value near 0 indicates that the fit is not much better than the model
y = constant.
An $R^2$ value near 1 indicates that the independent variable explains most
of the variability in the dependent variable.
To compute $R^2$, first compute a fit, and then obtain residuals from it. A
residual is the signed difference between an observed dependent value and
the value your fit predicts for it.
$\text{residuals} = y_{\text{observed}} - y_{\text{fitted}}$
The Basic Fitting tool can generate residuals for any fit it calculates. To
view a graph of residuals, select the Plot residuals check box. You can view
residuals as a bar, line or scatter plot.
After you have residual values, you can save them to the workspace, where
you can compute $R^2$. Complete the preceding part of this example to fit a cubic
polynomial to the census data, and then perform these steps:
Compute Residual Data and $R^2$ for a Cubic Fit.
1 Click the arrow button at the lower right to open the Numerical
results tab if it is not already visible.
2 From the Fit drop-down menu, select cubic if it does not already show.
3 Save the fit coefficients, norm of residuals, and residuals by clicking Save
to Workspace.
The Save Fit to Workspace dialog box opens with three check boxes and
three text fields.
4 Select all three check boxes to save the fit coefficients, norm of residuals,
and residual values.
5 Identify the saved variables as belonging to a cubic fit. Change the variable
names by adding a 3 to each default name (for example, fit3, normresid3,
and resids3). The dialog box should look like this figure.
6 Click OK. Basic Fitting saves residuals as a column vector of numbers, fit
coefficients as a struct, and the norm of residuals as a scalar.
Notice that the value that Basic Fitting computes for norm of residuals is
12.2380. This number is the square root of the sum of squared residuals
of the cubic fit.
7 Optionally, you can verify the norm-of-residuals value that the Basic
Fitting tool provided. Compute the norm-of-residuals yourself from the
resids3 array that you just saved:
mynormresid3 = sum(resids3.^2)^(1/2)
mynormresid3 =
12.2380
8 Compute the total sum of squares of the dependent variable, pop, to compute
$R^2$. Total sum of squares is the sum of the squared differences of each value
from the mean of the variable. For example, use this code:
SSpop = (length(pop)-1) * var(pop)
SSpop =
1.2356e+005
var(pop) computes the variance of the population vector. You multiply it
by the number of observations after subtracting 1 to account for degrees
of freedom. Both the total sum of squares and the norm of residuals are
positive scalars.
9 Now, compute $R^2$, using the square of normresid3 and SSpop:
rsqcubic = 1 - normresid3^2 / SSpop
rsqcubic =
0.9988
10 Finally, compute $R^2$ for a linear fit and compare it with the cubic $R^2$ value
that you just derived. The Basic Fitting GUI also provides you with the
linear fit results. To obtain the linear results, repeat steps 2-6, modifying
your actions as follows:
To calculate least-squares linear regression coefficients and statistics,
in the Fit drop-down on the Numerical results pane, select linear
instead of cubic.
In the Save to Workspace dialog, append 1 to each variable name to
identify it as deriving from a linear fit, and click OK. The variables fit1,
normresid1, and resids1 now exist in the workspace.
Use the variable normresid1 (98.778) to compute $R^2$ for the linear fit, as
you did in step 9 for the cubic fit:
rsqlinear = 1 - normresid1^2 / SSpop
rsqlinear =
0.9210
This result indicates that a linear least-squares fit of the population data
explains 92.1% of its variance. As the cubic fit of this data explains 99.9% of
that variance, the latter seems to be a better predictor. However, because
a cubic fit predicts using three variables ($x$, $x^2$, and $x^3$), a basic $R^2$ value
does not fully reflect how robust the fit is. A more appropriate measure for
evaluating the goodness of multivariate fits is adjusted $R^2$. For information
about computing and using adjusted $R^2$, see Residuals and Goodness of
Fit on page 3-7.
Caution $R^2$ measures how well your polynomial equation predicts the
dependent variable, not how appropriate the polynomial model is for your
data. When you analyze inherently unpredictable data, a small value of
$R^2$ indicates that the independent variable does not predict the dependent
variable precisely. However, it does not necessarily mean that there is
something wrong with the fit.
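The same $R^2$ values can be approximated without the GUI. The sketch below assumes that requesting the third output of polyfit reproduces the Center and scale X data option; the variable names mirror those saved from the Basic Fitting dialog box but are otherwise arbitrary.

```matlab
load census
SSpop = (length(pop)-1) * var(pop);         % total sum of squares

[p1,~,mu1] = polyfit(cdate,pop,1);          % linear fit, centered and scaled
[p3,~,mu3] = polyfit(cdate,pop,3);          % cubic fit, centered and scaled

resids1 = pop - polyval(p1,cdate,[],mu1);   % residuals of the linear fit
resids3 = pop - polyval(p3,cdate,[],mu3);   % residuals of the cubic fit

rsqlinear = 1 - norm(resids1)^2 / SSpop
rsqcubic  = 1 - norm(resids3)^2 / SSpop
```

If the assumption holds, the displayed values should be close to the 0.9210 and 0.9988 obtained through the GUI.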
Compute Residual Data and $R^2$ for a Linear Fit. In this next example,
use the Basic Fitting GUI to perform a linear fit, save the results to the
workspace, and compute $R^2$ for the linear fit. You can then compare linear $R^2$
with the cubic $R^2$ value that you derive in the example Compute Residual
Data and $R^2$ for a Cubic Fit on page 3-22.
1 Click the arrow button at the lower right to open the Numerical
results tab if it is not already visible.
2 Select the linear check box in the Plot fits area.
3 From the Fit drop-down menu, select linear if it does not already show.
The Coefficients and norm of residuals area displays statistics for the
linear fit.
4 Save the fit coefficients, norm of residuals, and residuals by clicking Save
to Workspace.
The Save Fit to Workspace dialog box opens with three check boxes and
three text fields.
5 Select all three check boxes to save the fit coefficients, norm of residuals,
and residual values.
6 Identify the saved variables as belonging to a linear fit. Change the
variable names by adding a 1 to each default name (for example, fit1,
normresid1, and resids1).
7 Click OK. Basic Fitting saves residuals as a column vector of numbers, fit
coefficients as a struct, and the norm of residuals as a scalar.
Notice that the value that Basic Fitting computes for norm of residuals is
98.778. This number is the square root of the sum of squared residuals
of the linear fit.
8 Optionally, you can verify the norm-of-residuals value that the Basic
Fitting tool provided. Compute the norm-of-residuals yourself from the
resids1 array that you just saved:
mynormresid1 = sum(resids1.^2)^(1/2)
mynormresid1 =
98.7783
9 Compute the total sum of squares of the dependent variable, pop, to compute
$R^2$. Total sum of squares is the sum of the squared differences of each value
from the mean of the variable. For example, use this code:
SSpop = (length(pop)-1) * var(pop)
SSpop =
1.2356e+005
var(pop) computes the variance of the population vector. You multiply it
by the number of observations after subtracting 1 to account for degrees
of freedom. Both the total sum of squares and the norm of residuals are
positive scalars.
10 Now, compute $R^2$, using the square of normresid1 and SSpop:
rsqlinear = 1 - normresid1^2 / SSpop
rsqlinear =
0.9210
This result indicates that a linear least-squares fit of the population data
explains 92.1% of its variance. As the cubic fit of this data explains 99.9%
of that variance, the latter seems to be a better predictor. However, a cubic
fit has four coefficients ($x$, $x^2$, $x^3$, and a constant), while a linear fit has two
coefficients ($x$ and a constant). A simple $R^2$ statistic does not account for the
different degrees of freedom. A more appropriate measure for evaluating
polynomial fits is adjusted $R^2$. For information about computing and using
adjusted $R^2$, see Residuals and Goodness of Fit on page 3-7.
Caution $R^2$ measures how well your polynomial equation predicts the
dependent variable, not how appropriate the polynomial model is for your
data. When you analyze inherently unpredictable data, a small value of
$R^2$ indicates that the independent variable does not predict the dependent
variable precisely. However, it does not necessarily mean that there is
something wrong with the fit.
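To make the degrees-of-freedom point concrete, the following sketch applies the adjusted $R^2$ formula to the norms of residuals reported in this example. The numeric values are copied from the text; n = 21 assumes one observation per decade from 1790 through 1990, and the variable names are illustrative.

```matlab
n = 21;                    % census observations, 1790:10:1990
SSpop = 1.2356e+005;       % total sum of squares reported in this example
normresid1 = 98.778;       % norm of residuals, linear fit (2 coefficients)
normresid3 = 12.2380;      % norm of residuals, cubic fit (4 coefficients)

% Adjusted R^2 = 1 - (SSresid/SStotal) * (n-1)/(n - number of coefficients)
rsq_adj_linear = 1 - (normresid1^2/SSpop) * (n-1)/(n-2)
rsq_adj_cubic  = 1 - (normresid3^2/SSpop) * (n-1)/(n-4)
```

With these inputs, the cubic fit should still come out ahead after the adjustment, although by a slightly smaller margin than simple $R^2$ suggests.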
Interpolate and Extrapolate Population Values
Suppose you want to use the cubic model to interpolate the U.S. population
in 1965 (a date not provided in the original data).
1 In the Basic Fitting dialog box, click the button to specify a vector of x
values at which to evaluate the current fit.
2 In the Enter value(s)... field, type the following value:
1965
Note Use unscaled and uncentered x values. You do not need to center
and scale first, even though you selected to scale x values to obtain the
coefficients in Predict the Census Data with a Cubic Polynomial Fit on
page 3-17. The Basic Fitting tool makes the necessary adjustments behind
the scenes.
3 Click Evaluate.
The x values and the corresponding values for f(x) are computed from the fit
and displayed in a table.
4 Select the Plot evaluated results check box to display the interpolated
value as a magenta diamond marker:
5 Save the interpolated population in 1965 to the MATLAB workspace by
clicking Save to workspace.
This opens the following dialog box, where you specify the variable names:
6 Click OK, but keep the Figure window open if you intend to follow the
steps in the next section, Generate a Code File to Reproduce the Result
on page 3-30.
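The interpolation in the steps above can also be sketched at the command line. This assumes that the three-output form of polyfit plays the role of the Center and scale X data option; pop1965 is an illustrative variable name.

```matlab
load census
[p,S,mu] = polyfit(cdate,pop,3);     % cubic fit on centered and scaled years
pop1965 = polyval(p,1965,S,mu)       % pass the unscaled year; mu handles scaling
```

As in the GUI, the year is supplied in its original units, and polyval applies the centering and scaling stored in mu.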
Generate a Code File to Reproduce the Result
After completing a Basic Fitting session, you can generate MATLAB code that
recomputes fits and reproduces plots with new data.
1 In the Figure window, select File > Generate Code.
This creates a function and displays it in the MATLAB Editor. The code
shows you how to programmatically reproduce what you did interactively
with the Basic Fitting dialog box.
2 Change the name of the function on the first line from createfigure to
something more specific, like censusplot. Save the code file to your current
folder with the file name censusplot.m. The function begins with:
function censusplot(X1, Y1, X2, Y2, valuesToEvaluate1)
3 Generate some new, randomly perturbed census data:
randpop = pop + 10*randn(size(pop));
4 Reproduce the plot with the new data and recompute the fit:
censusplot(cdate,randpop,cdate,randpop,1965)
You need five input arguments: two pairs of x,y vectors (data 1 and data 2) plotted in
the original graph, plus an x-value for a marker. For this invocation, set the
inputs X2, Y2 to be the same as X1, Y1 when you call censusplot.
The following figure displays the plot that the generated code produces. The
new plot matches the appearance of the figure from which you generated code
except for the y data values, the equation for the cubic fit, and the residual
values in the bar graph, as expected.
Learn How the Basic Fitting Tool Computes Fits
The Basic Fitting tool calls the polyfit function to compute polynomial fits.
It calls the polyval function to evaluate the fits. polyfit analyzes its inputs
to determine if the data is well conditioned for the requested degree of fit.
When it finds badly conditioned data, polyfit computes a regression as
well as it can, but it also returns a warning that the fit could be improved.
The Basic Fitting example section Predict the Census Data with a Cubic
Polynomial Fit on page 3-17 displays this warning.
One way to improve model reliability is to add data points. However, adding
observations to a data set is not always feasible. An alternative strategy is
to transform the predictor variable to normalize its center and scale. (In the
example, the predictor is the vector of census dates.)
The polyfit function normalizes by computing z-scores:
$z = \frac{x - \mu}{\sigma}$
where $x$ is the predictor data, $\mu$ is the mean of $x$, and $\sigma$ is the standard
deviation of x. The z-scores give the data a mean of 0 and a standard deviation
of 1. In the Basic Fitting GUI, you transform the predictor data to z-scores by
selecting the Center and scale x data check box.
After centering and scaling, model coefficients are computed for the y data as
a function of z. These are different from (and more robust than) the coefficients
computed for y as a function of x. The form of the model and the norm of the
residuals do not change. The Basic Fitting GUI automatically rescales the
z-scores so that the fit plots on the same scale as the original x data.
To understand the way in which the centered and scaled data is used as an
intermediary to create the final plot, run the following code in the Command
Window:
close
load census
x = cdate;
y = pop;
z = (x-mean(x))/std(x); % Compute z-scores of x data
plot(x,y,'ro') % Plot data as red markers
hold on % Prepare axes to accept new graph on top
zfit = linspace(z(1),z(end),100);
pz = polyfit(z,y,3); % Compute conditioned fit
yfit = polyval(pz,zfit);
xfit = linspace(x(1),x(end),100);
plot(xfit,yfit,'b-') % Plot conditioned fit vs. x data
The centered and scaled cubic polynomial plots as a blue line, as shown here:
In the code, computation of z illustrates how to normalize data. The polyfit
function performs the transformation itself if you provide three return
arguments when calling it:
[p,S,mu] = polyfit(x,y,n)
The returned regression parameters, p, now are based on normalized x. The
returned vector, mu, contains the mean and standard deviation of x. For more
information, see the polyfit reference page.
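To check that the two routes agree, the following sketch fits the census data once against manually computed z-scores and once using the three-output form of polyfit. The names coeffdiff and mucheck are illustrative.

```matlab
load census
z = (cdate - mean(cdate)) / std(cdate);   % manual z-scores
pz = polyfit(z,pop,3);                    % fit against z directly
[p,S,mu] = polyfit(cdate,pop,3);          % let polyfit center and scale

coeffdiff = max(abs(p - pz))              % expected to be at rounding level
mucheck = [mean(cdate); std(cdate)] - mu  % expected to be zeros
yfit = polyval(p,cdate,S,mu);             % evaluate using unscaled years
```

Passing mu to polyval lets you evaluate the fit at unscaled x values, so you do not need to compute z yourself.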
Programmatic Fitting
In this section...
MATLAB Functions for Polynomial Models
Linear Model with Nonpolynomial Terms
Multiple Regression
Example: Programmatic Fitting
MATLAB Functions for Polynomial Models
Two MATLAB functions can model your data with a polynomial.
Polynomial Fit Functions
Function   Description
polyfit    polyfit(x,y,n) finds the coefficients of a polynomial
           p(x) of degree n that fits the y data by minimizing the
           sum of the squares of the deviations of the data from
           the model (least-squares fit).
polyval    polyval(p,x) returns the value of a polynomial of
           degree n that was determined by polyfit, evaluated
           at x.
For example, suppose you measure a quantity y at several values of time t:
t = [0 0.3 0.8 1.1 1.6 2.3];
y = [0.6 0.67 1.01 1.35 1.47 1.25];
plot(t,y,'o')
Plot of y Versus t
You can try modeling this data using a second-degree polynomial function:
$y = a_2 t^2 + a_1 t + a_0$
The unknown coefficients $a_0$, $a_1$, and $a_2$ are computed by minimizing the sum
of the squares of the deviations of the data from the model (least-squares fit).
To find the polynomial coefficients, type the following at the MATLAB prompt:
p=polyfit(t,y,2)
MATLAB calculates the polynomial coefficients in descending powers:
p =
-0.2942 1.0231 0.4981
The second-degree polynomial model of the data is given by the following
equation:
$y = -0.2942 t^2 + 1.0231 t + 0.4981$
To plot the model with the data, evaluate the polynomial at uniformly spaced
times t2 and overlay the original data on a plot:
t2 = 0:0.1:2.8; % Define a uniformly spaced time vector
y2=polyval(p,t2); % Evaluate the polynomial at t2
figure
plot(t,y,'o',t2,y2) % Plot the fit on top of the data
% in a new Figure window
Plot of Data (Points) and Model (Line)
Use the following syntax to calculate the residuals:
y2=polyval(p,t); % Evaluate model at the data time vector
res=y-y2; % Calculate the residuals by subtracting
figure, plot(t,res,'+') % Plot the residuals
Plot of the Residuals
Notice that the second-degree fit roughly follows the basic shape of the data,
but does not capture the smooth curve on which the data seems to lie. There
appears to be a pattern in the residuals, which indicates that a different
model might be necessary. A fifth-degree polynomial (shown next) does a
better job of following the fluctuations in the data.
Repeat the exercise, this time using a fifth-degree polynomial from polyfit:
p5= polyfit(t,y,5)
p5 =
0.7303 -3.5892 5.4281 -2.5175 0.5910 0.6000
y3 = polyval(p5,t2); % Evaluate the polynomial at t2
figure
plot(t,y,'o',t2,y3) % Plot the fit on top of the data
% in a new Figure window
Fifth-Degree Polynomial Fit
Note If you are trying to model a physical situation, it is always important
to consider whether a model of a specific order is meaningful in your situation.
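One way to see why the order matters: the data set contains six points, and a fifth-degree polynomial has six coefficients, so the fit can pass through every point whether or not the model is meaningful. The sketch below checks this; the names res2, res5, maxres2, and maxres5 are illustrative.

```matlab
t = [0 0.3 0.8 1.1 1.6 2.3];
y = [0.6 0.67 1.01 1.35 1.47 1.25];

p2 = polyfit(t,y,2);
p5 = polyfit(t,y,5);

res2 = y - polyval(p2,t);     % residuals of the second-degree fit
res5 = y - polyval(p5,t);     % residuals of the fifth-degree fit

maxres2 = max(abs(res2))
maxres5 = max(abs(res5))      % expected to be at rounding level
```

Because length(y) - length(p5) is 0, no degrees of freedom remain, and the adjusted $R^2$ formula cannot be evaluated for the fifth-degree fit.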
Linear Model with Nonpolynomial Terms
When a polynomial function does not produce a satisfactory model of your
data, you can try using a linear model with nonpolynomial terms. For
example, consider the following function that is linear in the parameters $a_0$,
$a_1$, and $a_2$, but nonlinear in the t data:
$y = a_0 + a_1 e^{-t} + a_2 t e^{-t}$
You can compute the unknown coefficients $a_0$, $a_1$, and $a_2$ by constructing and
solving a set of simultaneous equations. The
following syntax accomplishes this by forming a design matrix, where each
column represents a variable used to predict the response (a term in the
model) and each row corresponds to one observation of those variables:
% Enter t and y as columnwise vectors
t = [0 0.3 0.8 1.1 1.6 2.3]';
y = [0.6 0.67 1.01 1.35 1.47 1.25]';
% Form the design matrix
X = [ones(size(t)) exp(-t) t.*exp(-t)];
% Calculate model coefficients
a = X\y
a =
1.3983
-0.8860
0.3085
Therefore, the model of the data is given by
$y = 1.3983 - 0.8860 e^{-t} + 0.3085 t e^{-t}$
Now evaluate the model at regularly spaced points and plot the model with
the original data, as follows:
T = (0:0.1:2.5)';
Y = [ones(size(T)) exp(-T) T.*exp(-T)]*a;
plot(T,Y,'-',t,y,'o'), grid on
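As a quick sanity check on the backslash solution, the residuals of a least-squares fit should be orthogonal to every column of the design matrix. The following sketch tests that property for the model above; the names res, orthcheck, and normres are illustrative.

```matlab
t = [0 0.3 0.8 1.1 1.6 2.3]';
y = [0.6 0.67 1.01 1.35 1.47 1.25]';
X = [ones(size(t)) exp(-t) t.*exp(-t)];   % design matrix from above
a = X\y;                                  % least-squares coefficients

res = y - X*a;                            % residuals at the data points
orthcheck = X' * res                      % expected to be near zero
normres = norm(res)                       % norm of the residuals
```

You can compare normres with the norm of the residuals from a polynomial fit of the same data to judge which model follows the data more closely.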
Multiple Regression
When y is a function of more than one predictor variable, the matrix equations
that express the relationships among the variables must be expanded to
accommodate the additional data. This is called multiple regression.
Suppose you measure a quantity y for several values of $x_1$ and $x_2$. Enter these
variables in the MATLAB Command Window, as follows:
x1 = [.2 .5 .6 .8 1.0 1.1]';
x2 = [.1 .3 .4 .9 1.1 1.4]';
y = [.17 .26 .28 .23 .27 .24]';
A model of this data is of the form
$y = a_0 + a_1 x_1 + a_2 x_2$
Multiple regression solves for unknown coefficients $a_0$, $a_1$, and $a_2$ by
minimizing the sum of the squares of the deviations of the data from the
model (least-squares fit).
Construct and solve the set of simultaneous equations by forming a design
matrix, X, and solving for the parameters by using the backslash operator:
X = [ones(size(x1)) x1 x2];
a = X\y
a =
0.1018
0.4844
-0.2847
The least-squares fit model of the data is
$y = 0.1018 + 0.4844 x_1 - 0.2847 x_2$
To validate the model, find the maximum of the absolute value of the
deviation of the data from the model:
Y = X*a;
MaxErr = max(abs(Y - y))
MaxErr =
0.0038
This value is much smaller than any of the data values, indicating that this
model accurately follows the data.
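Once the coefficients are known, predicting the response at new predictor values only requires a design matrix with the same column layout. In this sketch, the new values x1new and x2new are made up for illustration.

```matlab
x1 = [.2 .5 .6 .8 1.0 1.1]';
x2 = [.1 .3 .4 .9 1.1 1.4]';
y  = [.17 .26 .28 .23 .27 .24]';
a = [ones(size(x1)) x1 x2] \ y;       % coefficients a0, a1, a2

x1new = [0.7; 0.9];                   % hypothetical new predictor values
x2new = [0.6; 1.0];
Xnew = [ones(size(x1new)) x1new x2new];
ynew = Xnew * a                       % predicted responses
```

The new points lie inside the range of the measured data; predictions far outside that range deserve more caution.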
Example: Programmatic Fitting
In this example, you use MATLAB functions to accomplish the following:
Calculating Correlation Coefficients
Fitting a Polynomial to the Data
Plot and Calculate Confidence Bounds
This example uses the data in census.mat, which contains U.S. population
data for the years 1790 to 1990.
To load and plot the data, type the following commands at the MATLAB
prompt:
load census
plot(cdate,pop,'ro')
This adds the following two variables to the MATLAB workspace:
cdate is a column vector containing the years 1790 to 1990 in increments
of 10.
pop is a column vector with the U.S. population numbers corresponding to
each year in cdate.
The following plot of the data shows a strong pattern, which indicates a high
correlation between the variables.
U.S. Population from 1790 to 1990
Calculating Correlation Coefficients
In this portion of the example, you determine the statistical correlation
between the variables cdate and pop to justify modeling the data. For more
information about correlation coefficients, see Linear Correlation on page
3-2.
Type the following syntax at the MATLAB prompt:
corrcoef(cdate,pop)
MATLAB calculates the following correlation-coefficient matrix:
ans =
1.0000 0.9597
0.9597 1.0000
The diagonal matrix elements represent the perfect correlation of each
variable with itself and are equal to 1. The off-diagonal elements are very
close to 1, indicating that there is a strong statistical correlation between
the variables cdate and pop.
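For a straight-line fit with an intercept, the squared correlation coefficient equals the coefficient of determination, which ties this matrix to the $R^2$ discussion earlier in the chapter. The sketch below checks the identity on the census data; the variable names are illustrative, and the third polyfit output is requested only to keep the fit well conditioned.

```matlab
load census
R = corrcoef(cdate,pop);
r = R(1,2);                                  % correlation between cdate and pop

[p1,~,mu] = polyfit(cdate,pop,1);            % straight-line least-squares fit
resid = pop - polyval(p1,cdate,[],mu);
rsq = 1 - sum(resid.^2) / ((length(pop)-1) * var(pop));

[r^2 rsq]                                    % the two values should agree
```

The identity applies only to the linear case; for higher-degree polynomials, compute $R^2$ from the residuals.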
Fitting a Polynomial to the Data
This portion of the example applies the polyfit and polyval MATLAB
functions to model the data:
% Calculate fit parameters
[p,ErrorEst] = polyfit(cdate,pop,2);
% Evaluate the fit
pop_fit = polyval(p,cdate,ErrorEst);
% Plot the data and the fit
plot(cdate,pop_fit,'-',cdate,pop,'+');
% Annotate the plot
legend('Polynomial Model','Data','Location','NorthWest');
xlabel('Census Year');
ylabel('Population (millions)');
The following figure shows that the quadratic-polynomial fit provides a good
approximation to the data:
To calculate the residuals for this fit, type the following syntax at the
MATLAB prompt:
res = pop - pop_fit;
figure, plot(cdate,res,'+')
title('Residuals for the Quadratic Polynomial Model')
Notice that the plot of the residuals exhibits a pattern, which indicates that a
second-degree polynomial might not be appropriate for modeling this data.
Plot and Calculate Confidence Bounds
Confidence bounds are confidence intervals for a predicted response. The
width of the interval indicates the degree of certainty of the fit.
This example applies polyfit and polyval to the census sample data to
produce confidence bounds for a second-order polynomial model.
The following syntax uses an interval of $\pm 2\Delta$, which corresponds to a 95%
confidence interval for large samples:
% Evaluate the fit and the prediction error estimate (delta)
[pop_fit,delta] = polyval(p,cdate,ErrorEst);
% Plot the data, the fit, and the confidence bounds
plot(cdate,pop,'+',...
cdate,pop_fit,'g-',...
cdate,pop_fit+2*delta,'r:',...
cdate,pop_fit-2*delta,'r:');
% Annotate the plot
xlabel('Census Year');
ylabel('Population (millions)');
title('Quadratic Polynomial Fit with Confidence Bounds')
grid on
The 95% interval indicates that you have a 95% chance that a new observation
will fall within the bounds.
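A simple way to inspect the bounds is to count how many of the observed values fall inside them. The sketch below reuses the quadratic fit; inside is an illustrative variable name. These are the same observations used to compute the fit, so the count is only a rough check and not a test of the 95% statement for new observations.

```matlab
load census
[p,ErrorEst] = polyfit(cdate,pop,2);
[pop_fit,delta] = polyval(p,cdate,ErrorEst);

inside = abs(pop - pop_fit) <= 2*delta;      % logical vector, one entry per year
fprintf('%d of %d observations lie within the bounds\n', sum(inside), numel(pop));
cdate(~inside)                               % years outside the bounds, if any
```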
4
Time Series Analysis
Introduction
Time Series Objects
Introduction
Time series are data vectors sampled over time, in order, often at regular
intervals. They are distinguished from randomly sampled data, which
form the basis of many other data analyses. Time series represent the
time-evolution of a dynamic population or process. The linear ordering of
time series gives them a distinctive place in data analysis, with a specialized
set of techniques.
Time series analysis is concerned with:
Identifying patterns
Modeling patterns
Forecasting values
Several dedicated MATLAB functions perform time series analysis. This
section introduces objects and interactive tools for time series analysis.
Time Series Objects
In this section...
Introduction
Time Series Data Sample
Example: Time Series Objects and Methods
Time Series Constructor
Time Series Collection Constructor
Introduction
MATLAB time series objects are of two types:
timeseries: Stores data and time values, as well as the metadata
information that includes units, events, data quality, and interpolation
method
tscollection: Stores a collection of timeseries objects that share a
common time vector, convenient for performing operations on synchronized
time series with different units
This section discusses the following topics:
Using time series constructors to instantiate time series classes
Modifying object properties using set methods or dot notation
Calling time series functions and methods
To get a quick overview of programming with timeseries and tscollection
objects, follow the steps in Example: Time Series Objects and Methods
on page 4-6.
Time Series Data Sample
To understand the descriptions of timeseries object properties and
methods in this documentation, it is important to clarify two terms related
to storing data in a timeseries object: a data value and a data sample.
A data value is a single, scalar value recorded at a specific time. A data
sample consists of one or more values associated with a specific time in the
timeseries object. The number of data samples in a time series is the same
as the length of the time vector.
For example, consider data that consists of three sensor signals: two signals
represent the position of an object in meters, and the third represents its
velocity in meters/second.
To enter the data matrix, type the following at the MATLAB prompt:
x = [-0.2 -0.3 13;
-0.1 -0.4 15;
NaN 2.8 17;
0.5 0.3 NaN;
-0.3 -0.1 15]
The NaN value represents a missing data value. MATLAB displays the
following 5-by-3 matrix:
x =
-0.2000 -0.3000 13.0000
-0.1000 -0.4000 15.0000
NaN 2.8000 17.0000
0.5000 0.3000 NaN
-0.3000 -0.1000 15.0000
The first two columns of x contain quantities with the same units and you
can create a multivariate timeseries object to store these two time series.
For more information about creating timeseries objects, see Time Series
Constructor on page 4-29. The following command creates a timeseries
object ts_pos to store the position values:
ts_pos = timeseries(x(:,1:2), 1:5, 'name', 'Position')
MATLAB responds by displaying the following properties of ts_pos:
timeseries
Common Properties:
Name: 'Position'
Time: [5x1 double]
TimeInfo: [1x1 tsdata.timemetadata]
Data: [5x2 double]
DataInfo: [1x1 tsdata.datametadata]
More properties, Methods
The Length of the time vector, which is 5 in this example, equals the number
of data samples in the timeseries object. Find the size of the data sample in
ts_pos by typing the following at the MATLAB prompt:
getdatasamplesize(ts_pos)
ans =
1 2
Similarly, you can create a second timeseries object to store the velocity data:
ts_vel = timeseries(x(:,3), 1:5, 'name', 'Velocity');
Find the size of each data sample in ts_vel by typing the following:
getdatasamplesize(ts_vel)
ans =
1 1
Notice that ts_vel has one data value in each data sample and ts_pos has
two data values in each data sample.
Note In general, when the time series data is an M-by-N-by-P-by-...
multidimensional array with M samples, the size of each data sample is
N-by-P-by-... .
If you want to perform operations on the ts_pos and ts_vel timeseries
objects while keeping them synchronized, group them in a time series
collection. For more information, see Time Series Collection Constructor
Syntax on page 4-30.
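As a brief sketch of that grouping step, the following code rebuilds the two time series from the data matrix x and places them in one collection. The variable name tsc_motion and the collection name 'motion' are hypothetical and chosen only for this illustration.

```matlab
% Sketch: group position and velocity series that share a time vector
x = [-0.2 -0.3 13; -0.1 -0.4 15; NaN 2.8 17; 0.5 0.3 NaN; -0.3 -0.1 15];
ts_pos = timeseries(x(:,1:2),1:5,'name','Position');   % two values per sample
ts_vel = timeseries(x(:,3),1:5,'name','Velocity');     % one value per sample

% Hypothetical collection name; members keep their own units and sample sizes
tsc_motion = tscollection({ts_pos ts_vel},'name','motion');

names = gettimeseriesnames(tsc_motion)   % member names taken from the Name properties
n = length(tsc_motion)                   % length of the shared time vector
```

The members can have different data sample sizes because the collection only requires that they share a time vector.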
Example: Time Series Objects and Methods
Creating Time Series Objects on page 4-6
Viewing Time Series Objects on page 4-8
Modifying Time Series Units and Interpolation Method on page 4-11
Defining Events on page 4-12
Creating Time Series Collection Objects on page 4-16
Resampling a Time Series Collection Object on page 4-18
Adding a Data Sample to a Time Series Collection Object on page 4-22
Removing and Interpolating Missing Data on page 4-23
Removing a Time Series from a Time Series Collection on page 4-24
Displaying Time Vector Values as Date Strings on page 4-25
Plotting Time Series Collection Members on page 4-26
Creating Time Series Objects
This portion of the example illustrates how to create several timeseries
objects from an array. For more information about the timeseries object, see
Time Series Constructor on page 4-29.
The sample data provided with this example consists of a 24-by-3 matrix of
double values, where each column represents hourly vehicle counts at each
of three town intersections.
Load the sample data, which adds the variable count to the MATLAB workspace:
%% Import the sample data
load count.dat
To view the count matrix, type
count
MATLAB displays the following 24-by-3 matrix:
11 11 9
7 13 11
14 17 20
11 13 9
43 51 69
38 46 76
61 132 186
75 135 180
38 88 115
28 36 55
12 12 14
18 27 30
18 19 29
17 15 18
19 36 48
32 47 10
42 65 92
57 66 151
44 55 90
114 145 257
35 58 68
11 12 15
13 9 15
10 9 7
Create three timeseries objects to store the data collected at each
intersection:
count1 = timeseries(count(:,1), 1:24,'name', 'intersection1');
count2 = timeseries(count(:,2), 1:24,'name', 'intersection2');
count3 = timeseries(count(:,3), 1:24,'name', 'intersection3');
Note In the above construction, timeseries objects have both a variable
name (e.g., count1) and an internal object name (e.g., intersection1).
The variable name is used with MATLAB functions. The object name is a
property of the object, accessed with object methods. For more information on
timeseries object properties and methods, see Time Series Properties on
page 4-30 and Time Series Methods on page 4-30.
By default, a time series has a time vector having units of seconds and a
start time of 0 sec. The example constructs the count1, count2, and count3
time series objects with start times of 1 sec, end times of 24 sec, and 1-sec
increments. You will change the time units to hours in Modifying Time
Series Units and Interpolation Method on page 4-11.
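To confirm these time settings, you can inspect the TimeInfo property directly. The following is a small sketch; the variable ti is introduced only for this illustration.

```matlab
% Sketch: inspect the time metadata of a newly constructed timeseries
load count.dat
count1 = timeseries(count(:,1),1:24,'name','intersection1');

ti = count1.TimeInfo;   % time metadata object (illustrative variable name)
fprintf('Units: %s, Start: %g, End: %g, Increment: %g\n', ...
    ti.Units,ti.Start,ti.End,ti.Increment);
```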
Note If you want to create a timeseries object that groups the three data
columns in count, use the following syntax:
count_ts = timeseries(count, 1:24,'name','traffic_counts')
This is useful when all time series have the same units and you want to keep
them synchronized during calculations.
Viewing Time Series Objects
After creating a timeseries object, as described in Creating Time Series
Objects on page 4-6, you can view it in the Variables editor.
To view a timeseries object like count1 in the Variables editor, use either of
the following methods:
Type open('count1') at the command prompt.
On the Home tab, in the Variable section, click Open Variable and
select count1.
Modifying Time Series Units and Interpolation Method
After creating a timeseries object, as described in Creating Time Series
Objects on page 4-6, you can modify its units and interpolation method using
dot notation.
To view the current properties of count1, type
get(count1)
MATLAB responds by displaying the current property values of the count1
timeseries object:
Events: []
Name: 'intersection1'
UserData: []
Data: [24x1 double]
DataInfo: [1x1 tsdata.datametadata]
Time: [24x1 double]
TimeInfo: [1x1 tsdata.timemetadata]
Quality: []
QualityInfo: [1x1 tsdata.qualmetadata]
IsTimeFirst: 1
TreatNaNasMissing: 1
Length: 24
View the current DataInfo properties using dot notation:
count1.DataInfo
MATLAB responds with:
tsdata.datametadata
Package: tsdata
Common Properties:
Units: ''
Interpolation: linear (tsdata.interpolation)
More properties, Methods
Change the data units and the default interpolation method for count1, as
follows:
count1.DataInfo.Units = 'cars'; % Specify new data units
% Set the interpolation method to zero-order hold
count1.DataInfo.Interpolation = tsdata.interpolation('zoh');
To verify that the DataInfo properties have been modified, type:
count1.DataInfo
tsdata.datametadata
Package: tsdata
Common Properties:
Units: 'cars'
Interpolation: zoh (tsdata.interpolation)
More properties, Methods
Modify the time units to be 'hours' for the three time series:
count1.TimeInfo.Units = 'hours';
count2.TimeInfo.Units = 'hours';
count3.TimeInfo.Units = 'hours';
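As an alternative sketch, the same settings can be applied in a loop, using the setinterpmethod and getinterpmethod methods of the timeseries object instead of assigning a tsdata.interpolation object. The cell array counts and the variables method1 and method2 are illustrative names that are not used elsewhere in this example.

```matlab
% Sketch: set units for all three series in a loop and verify interpolation
load count.dat
counts = cell(1,3);                        % illustrative container for the series
for k = 1:3
    ts = timeseries(count(:,k),1:24,'name',sprintf('intersection%d',k));
    ts.TimeInfo.Units = 'hours';           % time units
    ts.DataInfo.Units = 'cars';            % data units
    counts{k} = ts;
end

% Use zero-order hold for the first series only
counts{1} = setinterpmethod(counts{1},'zoh');

method1 = getinterpmethod(counts{1})       % interpolation method of the first series
method2 = getinterpmethod(counts{2})       % second series keeps the default method
```

Because timeseries objects behave as values, the modified object must be assigned back, as in the call to setinterpmethod.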
Defining Events
This portion of the example illustrates how to define events for a timeseries
object by using the tsdata.event auxiliary object. Events mark the data
at specific times. When you plot the data, event markers are displayed on
the plot. Events also provide a convenient way to synchronize multiple time
series.
Use the following syntax to add two events to the data that mark the times of
the AM commute and PM commute:
% Construct and add the first event to all time series
% The first event occurs at 8 AM
e1 = tsdata.event('AMCommute',8);
e1.Units = 'hours'; % Specify the units for time
count1 = addevent(count1,e1); % Add the event to count1
count2 = addevent(count2,e1); % Add the event to count2
count3 = addevent(count3,e1); % Add the event to count3
% Construct and add the second event to all time series
% The second event occurs at 6 PM
e2 = tsdata.event('PMCommute',18);
e2.Units = 'hours'; % Specify the units for time
count1 = addevent(count1,e2); % Add the event to count1
count2 = addevent(count2,e2); % Add the event to count2
count3 = addevent(count3,e2); % Add the event to count3
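The following sketch shows one way to inspect the events stored in a time series and to extract the samples between them. It assumes that the Events property holds the tsdata.event objects in the order they were added, and it uses the gettsbetweenevents method with event names. The variable daytime is a hypothetical name.

```matlab
% Sketch: list the events of a series and extract the samples between them
load count.dat
count1 = timeseries(count(:,1),1:24,'name','intersection1');
count1.TimeInfo.Units = 'hours';

e1 = tsdata.event('AMCommute',8);   e1.Units = 'hours';
e2 = tsdata.event('PMCommute',18);  e2.Units = 'hours';
count1 = addevent(count1,e1);
count1 = addevent(count1,e2);

% Print the name and time of each stored event
for k = 1:numel(count1.Events)
    ev = count1.Events(k);
    fprintf('%s at %g %s\n',ev.Name,ev.Time,ev.Units);
end

% New timeseries limited to the samples between the two commute events
daytime = gettsbetweenevents(count1,'AMCommute','PMCommute');
```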
When you plot any of the time series, the plot method defined for time series
objects displays events as markers. By default, markers are red filled circles.
figure
plot(count1)
The plot reflects that count1 uses zero-order-hold interpolation.
If you plot time series count2, it replaces the count1 display. You see its
events and that it uses linear interpolation:
plot(count2)
You can overlay time series plots by setting hold on. When you hold the plot
and add new data to it, the title, data units and time units do not display.
The plot method cannot determine if the units are the same, so it does not
attempt to display x and y axis labels.
hold on
plot(count3)
Creating Time Series Collection Objects
This portion of the example illustrates how to create a tscollection object.
Each individual time series in a collection is called a member. For more
information about the tscollection object, see Time Series Collection
Constructor on page 4-30.
Note Typically, you use the tscollection object to group synchronized time
series that have different units. In this simple example, all time series have
the same units and the tscollection object does not provide an advantage
over grouping the three time series in a single timeseries object. For an
example of how to group several time series in one timeseries object, see
Creating Time Series Objects on page 4-6.
Use the following syntax to create a tscollection object named count_coll,
stored in the workspace variable tsc. The constructor call immediately adds
two of the three time series currently in the MATLAB workspace (you will add
the third time series later):
tsc = tscollection({count1 count2},'name', 'count_coll')
MATLAB responds with
Time Series Collection Object: count_coll
Time vector characteristics
Start time 1 hours
End time 24 hours
Member Time Series Objects:
intersection1
intersection2
Note The time vectors of the timeseries objects you are adding to the
tscollection must match.
Notice that the Name property of the timeseries objects is used to name the
collection members as intersection1 and intersection2.
Add the third timeseries object in the workspace to the tscollection by
using the following syntax:
tsc = addts(tsc, count3)
All three members in the collection are listed:
Time Series Collection Object: count_coll
Time vector characteristics
Start time 1 hours
End time 24 hours
Member Time Series Objects:
intersection1
intersection2
intersection3
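Because the time vectors of all members must match, it can help to check them before building the collection. The following is a minimal sketch of such a check; the error message text is illustrative.

```matlab
% Sketch: verify matching time vectors before building a collection
load count.dat
count1 = timeseries(count(:,1),1:24,'name','intersection1');
count2 = timeseries(count(:,2),1:24,'name','intersection2');
count3 = timeseries(count(:,3),1:24,'name','intersection3');

if isequal(count1.Time,count2.Time) && isequal(count1.Time,count3.Time)
    tsc = tscollection({count1 count2},'name','count_coll');
    tsc = addts(tsc,count3);          % add the third member afterward
else
    error('Time vectors differ; resample the series to a common time vector first.');
end

names = gettimeseriesnames(tsc)       % lists the three member names
```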
Resampling a Time Series Collection Object
This portion of the example illustrates how to resample each member in a
tscollection using a new time vector. The resampling operation is used to
either select existing data at specific time values, or to interpolate data at
finer intervals. If the new time vector contains time values that did not exist
in the previous time vector, the new data values are calculated using the
default interpolation method you associated with the time series.
To resample the time series to include data values every 2 hours instead of
every hour and save it as a new tscollection object, enter the following
syntax:
tsc1 = resample(tsc,1:2:24)
The result is:
Time Series Collection Object: count_coll
Time vector characteristics
Start time 1 hours
End time 23 hours
Member Time Series Objects:
intersection1
intersection2
intersection3
In some cases you might need a finer sampling of information than you
currently have and it is reasonable to obtain it by interpolating data values.
For example, the following syntax interpolates values at each half-hour mark:
tsc1 = resample(tsc,1:0.5:24)
The result is:
Time Series Collection Object: count_coll
Time vector characteristics
Start time 1 hours
End time 24 hours
Member Time Series Objects:
intersection1
intersection2
intersection3
To add values at each half-hour mark, the default interpolation method of a
time series is used. For example, the new data points in intersection1
are calculated by using the zero-order hold interpolation method, which
holds the value of the previous sample constant. You set the interpolation
method for intersection1 as described in Modifying Time Series Units and
Interpolation Method on page 4-11.
The new data points in intersection2 and intersection3 are calculated
using linear interpolation, which is the default method. Plot the members of
tsc1 with markers to see the results of interpolating:
hold off % Allow axes to clear before plotting
plot(tsc1.intersection1,'-xb','Displayname','Intersection 1')
You can see that data points have been interpolated at half-hour intervals,
and that Intersection 1 uses zero-order-hold interpolation, while the other
two members use linear interpolation.
Maintain the graph in the figure while you add the other two members to the
plot. Because the plot method suppresses the axis labels while hold is on,
also add a legend to describe the three series:
hold on
plot(tsc1.intersection2,'-.xm','Displayname','Intersection 2')
plot(tsc1.intersection3,':xr','Displayname','Intersection 3')
legend('show','Location','NorthWest')
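To check the interpolated values numerically rather than visually, you can read one resampled sample from each member. The following sketch rebuilds the collection from count.dat and looks up the values at 2.5 hours; the variables idx, d1, d2, d3, v1, v2, and v3 are illustrative names.

```matlab
% Sketch: compare zero-order-hold and linear interpolation at t = 2.5
load count.dat
count1 = timeseries(count(:,1),1:24,'name','intersection1');
count2 = timeseries(count(:,2),1:24,'name','intersection2');
count3 = timeseries(count(:,3),1:24,'name','intersection3');
count1.DataInfo.Interpolation = tsdata.interpolation('zoh');  % hold previous value

tsc  = tscollection({count1 count2 count3},'name','count_coll');
tsc1 = resample(tsc,1:0.5:24);        % interpolate at each half-hour mark

idx = find(tsc1.Time == 2.5);         % index of the new sample at 2.5
d1 = tsc1.intersection1.Data;  v1 = d1(idx)   % held from t = 2
d2 = tsc1.intersection2.Data;  v2 = d2(idx)   % midpoint of the values at t = 2 and t = 3
d3 = tsc1.intersection3.Data;  v3 = d3(idx)   % midpoint of the values at t = 2 and t = 3
```

With zero-order hold, the value at 2.5 repeats the sample at hour 2, while the linearly interpolated members take the midpoint of the samples at hours 2 and 3.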
Adding a Data Sample to a Time Series Collection Object
This portion of the example illustrates how to add a data sample to a
tscollection.
You can use the following syntax to add a data sample to the intersection1
collection member at 3.25 hours (i.e., 15 minutes after the hour):
tsc1 = addsampletocollection(tsc1,'time',3.25,...
'intersection1',5)
There are three members in the tsc1 collection, and adding a data sample
to one member adds a data sample to the other two members at 3.25 hours.
However, because you did not specify the data values for intersection2
and intersection3 in the new sample, the missing values are represented
by NaNs for these members. To learn how to remove or interpolate missing
data values, see Removing Missing Data on page 4-23 and Interpolating
Missing Data on page 4-23.
tsc1 Data from 2.0 to 3.5 Hours
Hours   Intersection 1   Intersection 2   Intersection 3
2.0     7                13               11
2.5     7                15               15.5
3.0     14               17               20
3.25    5                NaN              NaN
3.5     14               15               14.5
To view all intersection1 data (including the new sample at 3.25 hours),
type
tsc1.intersection1
Similarly, to view all intersection2 data (including the new sample at 3.25
hours containing a NaN value), type
tsc1.intersection2
Removing and Interpolating Missing Data
Time series objects use NaNs to represent missing data. This portion of the
example illustrates how to either remove missing data or interpolate values
for it by using the interpolation method you specified for that time series. In
Adding a Data Sample to a Time Series Collection Object on page 4-22, you
added a new data sample to the tsc1 collection at 3.25 hours.
As the tsc1 collection has three members, adding a data sample to one
member added a data sample to the other two members at 3.25 hours.
However, because you did not specify the data values for the intersection2
and intersection3 members at 3.25 hours, they currently contain missing
values, represented by NaNs.
Removing Missing Data. Use the following syntax to find and remove the
data samples containing NaN values in the tsc1 collection:
tsc1 = delsamplefromcollection(tsc1,'index',...
find(isnan(tsc1.intersection2.Data)));
This command searches one tscollection member at a time, in this case
intersection2. When a missing value is located in intersection2, the data
at that time is removed from all members of the tscollection.
Note Use dot-notation syntax to access the Data property of the
intersection2 member in the tsc1 collection:
tsc1.intersection2.Data
For a complete list of timeseries properties, see Time Series Properties
on page 4-30.
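Because the command above searches only one member, a collection with missing values in several members needs one search per member. The following sketch generalizes the idea by building a mask over all members. It uses a small hypothetical collection with members a and b, and it assumes that each member stores its samples along the first dimension of Data.

```matlab
% Sketch: delete every sample that has a NaN in any member
t = 1:4;
a = timeseries([10 20 30 40]',t,'name','a');    % hypothetical member
b = timeseries([1 2 3 4]',t,'name','b');        % hypothetical member
tsc = tscollection({a b});

% Adding a sample for a only leaves a NaN in b at the same time
tsc = addsampletocollection(tsc,'time',2.5,'a',25);
nBefore = length(tsc);

% Mark the samples that contain a NaN in at least one member
names = gettimeseriesnames(tsc);
mask = false(nBefore,1);
for k = 1:numel(names)
    d = tsc.(names{k}).Data;
    mask = mask | any(isnan(d),2);
end

tsc = delsamplefromcollection(tsc,'index',find(mask));
nAfter = length(tsc);
```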
Interpolating Missing Data. For the sake of this example, you must
reintroduce the NaN values in intersection2 and intersection3 (which you
removed in the previous step):
tsc1 = addsampletocollection(tsc1,'time',3.25,...
'intersection1',5);
To interpolate the missing values in tsc1 using the current time vector
(tsc1.Time), type the following syntax:
tsc1 = resample(tsc1,tsc1.Time);
This replaces the NaN values in intersection2 and intersection3 by using
linear interpolation, the default interpolation method for these time series.
Note Dot notation tsc1.Time is used to access the Time property of the tsc1
collection. For a complete list of tscollection properties, see Time Series
Collection Properties on page 4-31.
To view intersection2 data after interpolation, for example, type
tsc1.intersection2
New tsc1 Data from 2.0 to 3.5 Hours
Hours   Intersection 1   Intersection 2   Intersection 3
2.0     7                13               11
2.5     7                15               15.5
3.0     14               17               20
3.25    5                16               17.3
3.5     14               15               14.5
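The interpolated entries at 3.25 hours can be reproduced with a short, self-contained sketch. For brevity it leaves all members with the default linear interpolation, which does not affect intersection2 or intersection3; the variables idx, d2, d3, v2, and v3 are illustrative names.

```matlab
% Sketch: add a sample with missing values, then fill them by resampling
load count.dat
count1 = timeseries(count(:,1),1:24,'name','intersection1');
count2 = timeseries(count(:,2),1:24,'name','intersection2');
count3 = timeseries(count(:,3),1:24,'name','intersection3');
tsc  = tscollection({count1 count2 count3},'name','count_coll');
tsc1 = resample(tsc,1:0.5:24);

% Only intersection1 gets a value; the other members get NaN at 3.25
tsc1 = addsampletocollection(tsc1,'time',3.25,'intersection1',5);

% Resample on the existing time vector to interpolate the missing values
tsc1 = resample(tsc1,tsc1.Time);

idx = find(tsc1.Time == 3.25);
d2 = tsc1.intersection2.Data;  v2 = d2(idx)   % halfway between 17 and 15
d3 = tsc1.intersection3.Data;  v3 = d3(idx)   % halfway between 20 and 14.5
```

The value for intersection3 is 17.25, which the table rounds to 17.3.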
Removing a Time Series from a Time Series Collection
To remove the intersection3 time series from the tscollection object
tsc1, type:
tsc1 = removets(tsc1,'intersection3')
The collection now lists two member time series:
Time Series Collection Object: count_coll
Time vector characteristics
Start time 1 seconds
End time 24 seconds
Member Time Series Objects:
intersection1
intersection2
Displaying Time Vector Values as Date Strings
This portion of the example illustrates how to control the format in which
numerical time vector values are displayed, using MATLAB date strings. For a
complete list of the MATLAB date-string formats supported for timeseries and
tscollection objects, see the definition of the time vector on the
timeseries reference page.
To use date strings, you must set the StartDate field of the TimeInfo
property. All values in the time vector are converted to date strings using
StartDate as a reference date.
For example, suppose the reference date occurs on December 25, 2009:
tsc1.TimeInfo.Units = 'hours';
tsc1.TimeInfo.StartDate = 'DEC-25-2009 00:00:00';
Similar to how you set units for the count1, count2, and count3 time series
objects, set the data units of the tsc1 members to the string 'car count':
tsc1.intersection1.DataInfo.Units = 'car count';
tsc1.intersection2.DataInfo.Units = 'car count';
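To see the date strings that result from a reference date, you can call the getabstime method listed under the time series collection methods. The following sketch uses a small hypothetical collection, tsc_demo, rather than tsc1.

```matlab
% Sketch: convert a numeric time vector to date strings via StartDate
ts = timeseries([5 7 9]',1:3,'name','demo');      % hypothetical series
tsc_demo = tscollection(ts);                      % collection with one member

tsc_demo.TimeInfo.Units = 'hours';
tsc_demo.TimeInfo.StartDate = 'DEC-25-2009 00:00:00';

% Cell array of date strings, one per time value, measured from StartDate
dates = getabstime(tsc_demo)
```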
Plotting Time Series Collection Members
To plot data in a time series collection, you plot its members one at a time.
First graph tsc1 member intersection1:
hold off
plot(tsc1.intersection1);
When you plot a member of a time series collection, its time units display
on the x-axis, its data units display on the y-axis, and the plot title is
displayed as 'Time Series Plot:<member name>'. If you use the same
figure to plot a different member of the collection, no annotations display. The
time series plot method does not attempt to update labels and titles when
hold is on because the descriptors for the series can be different. To describe
multiple series, add a legend. Set the DisplayName property of the line series
to label each member, as follows:
plot(tsc1.intersection1,'-xb','Displayname','Intersection 1')
% Prevent overwriting plot, but remove axis labels and title:
hold on
plot(tsc1.intersection2,'-.xm','Displayname','Intersection 2')
legend('show','Location','NorthWest')
The plot now includes the two time series in the collection: intersection1
and intersection2. Plotting the second graph erased the labels on the first
graph.
Finally, change the date strings on the x-axis to hours and plot the two time
series collection members again with a legend.
% Specify time units to be 'hours' for the collection:
tsc1.TimeInfo.Units = 'hours';
% Specify the format for displaying time
tsc1.TimeInfo.Format='HH:MM';
% Recreate the last plot with new time units:
hold off
plot(tsc1.intersection1,'-xb','Displayname','Intersection 1')
% Prevent overwriting plot, but remove axis labels and title:
hold on
plot(tsc1.intersection2,'-.xm','Displayname','Intersection 2')
legend('show','Location','NorthWest')
Restore the labels with the xlabel and ylabel commands and overlay a data
grid:
xlabel('Time (hours)')
ylabel('car count')
grid on
The final plot looks like this.
For more information on plotting options for time series, see timeseries.
Time Series Constructor
Before implementing the various MATLAB functions and methods specifically
designed to handle time series data, you must create a timeseries object to
store the data. See timeseries for the timeseries object constructor syntax.
For an example of using the constructor, see Creating Time Series Objects
on page 4-6.
Time Series Properties
See timeseries for a description of all the timeseries object properties.
You can specify the Data, IsTimeFirst, Name, Quality, and Time properties
as input arguments in the constructor. To assign other properties, use the
set function or dot notation.
Note To get property information from the command line, type help
timeseries/tsprops at the MATLAB prompt.
For an example of editing timeseries object properties, see Modifying Time
Series Units and Interpolation Method on page 4-11.
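As a minimal sketch of the two ways to assign properties, the following code sets Name in the constructor, changes it with the set function, and assigns the unit properties with dot notation. The series ts and its values are hypothetical.

```matlab
% Sketch: assign properties in the constructor, with set, and with dot notation
ts = timeseries([1.5 2.5 3.5]',[0 1 2],'Name','demo');  % Name set in the constructor

ts = set(ts,'Name','renamed');     % set function; assign the result back to ts
ts.DataInfo.Units = 'm';           % dot notation on the data metadata
ts.TimeInfo.Units = 'minutes';     % dot notation on the time metadata

currentName = get(ts,'Name')       % query a single property value
```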
Time Series Methods
For a description of all the time series methods, see timeseries.
Time Series Collection Constructor
Introduction on page 4-30
Time Series Collection Constructor Syntax on page 4-30
Time Series Collection Properties on page 4-31
Time Series Collection Methods on page 4-33
Introduction
A tscollection object is a MATLAB variable that groups several time series
with a common time vector. The timeseries objects that you include in the
tscollection object are called members of this collection. The tscollection
object provides several methods for convenient analysis and manipulation of
its member timeseries.
Time Series Collection Constructor Syntax
Before you implement the MATLAB methods specifically designed to operate
on a collection of timeseries objects, you must create a tscollection object
to store the data.
The following table summarizes the syntax for using the tscollection
constructor. For an example of using this constructor, see Creating Time
Series Collection Objects on page 4-16.
Time Series Collection Syntax Descriptions
Syntax: tsc = tscollection(ts)
Description: Creates a tscollection object tsc that includes one or more
timeseries objects. The ts argument can be one of the following:
  Single timeseries object in the MATLAB workspace
  Cell array of timeseries objects in the MATLAB workspace
The timeseries objects share the same time vector in the tscollection.
Syntax: tsc = tscollection(Time)
Description: Creates an empty tscollection object with the time vector Time.
When time values are date strings, you must specify Time as a cell array of
date strings.
Syntax: tsc = tscollection(Time, TimeSeries, 'Parameter', Value, ...)
Description: Optionally enter the following parameter-value pairs after the
Time and TimeSeries arguments:
  Name (see Time Series Collection Properties on page 4-31)
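The first two constructor forms can be sketched as follows. The variable names tsc_a and tsc_b, the collection name 'demo', and the member name evens are hypothetical. The name-value form shown for tsc_a follows the usage in Creating Time Series Collection Objects.

```matlab
% Sketch: two ways to construct a tscollection
ts = timeseries([2 4 6 8 10]',1:5,'name','evens');   % hypothetical member

% Form 1: build the collection from a timeseries object, with a Name
tsc_a = tscollection(ts,'name','demo');

% Form 2: start from a time vector only, then add the member
tsc_b = tscollection(1:5);
tsc_b = addts(tsc_b,ts);

names_a = gettimeseriesnames(tsc_a)
names_b = gettimeseriesnames(tsc_b)
```

Both collections end up with the same member and the same time vector.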
Time Series Collection Properties
This table lists the properties of the tscollection object. You can specify the
Name, Time, and TimeInfo properties as input arguments in the tscollection
constructor.
Time Series Collection Property Descriptions
Property: Name
Description: tscollection object name entered as a string. This name can
differ from the name of the tscollection variable in the MATLAB workspace.
Property: Time
Description: A vector of time values.
When TimeInfo.StartDate is empty, the numerical Time values are measured
relative to 0 in specified units. When TimeInfo.StartDate is defined, the
time values represent date strings measured relative to StartDate in
specified units.
The length of Time must match either the first or the last dimension of the
Data property of each tscollection member.
Property: TimeInfo
Description: Uses the following fields to store contextual information
about Time:
  Units: Time units with the following values: 'weeks', 'days', 'hours',
  'minutes', 'seconds', 'milliseconds', 'microseconds', and 'nanoseconds'
  Start: Start time
  End: End time (read-only)
  Increment: Interval between two subsequent time values. The increment is
  NaN when times are not uniformly sampled.
  Length: Length of the time vector (read-only)
  Format: String defining the date string display format. See the MATLAB
  datestr function reference page for more information.
  StartDate: Date string defining the reference date. See the MATLAB
  setabstime (tscollection) function reference page for more information.
  UserData: Stores any additional user-defined information
Time Series Collection Methods
General Time Series Collection Methods on page 4-33
Data and Time Manipulation Methods on page 4-34
General Time Series Collection Methods. Use the following methods to
query and set object properties, and plot the data.
Methods for Querying Properties
Method Description
get (tscollection) Query tscollection object property values.
isempty (tscollection) Evaluate to true for an empty tscollection
object.
length (tscollection) Return the length of the time vector.
plot Plot the time series in a collection.
set (tscollection) Set tscollection property values.
size (tscollection) Return the size of a tscollection object.
Data and Time Manipulation Methods. Use the following methods to add
or delete data samples, and manipulate the tscollection object.
Methods for Manipulating Data and Time
Method                              Description
addts                               Add a timeseries object to a tscollection object.
addsampletocollection               Add data samples to a tscollection object.
delsamplefromcollection             Delete one or more data samples from a tscollection object.
getabstime (tscollection)           Extract a date-string time vector from a tscollection object into a cell array.
getsampleusingtime (tscollection)   Extract data samples from an existing tscollection object into a new tscollection object.
gettimeseriesnames                  Return a cell array of time series names in a tscollection object.
horzcat (tscollection)              Horizontal concatenation of tscollection objects. Combines several timeseries objects with the same time vector into one time series collection.
removets                            Remove one or more timeseries objects from a tscollection object.
resample (tscollection)             Select or interpolate data in a tscollection object using a new time vector.
setabstime (tscollection)           Set the time values in the time vector of a tscollection object as date strings.
settimeseriesnames                  Change the name of the selected timeseries object in a tscollection object.
vertcat (tscollection)              Vertical concatenation of tscollection objects. Joins several tscollection objects along the time dimension.
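The following sketch exercises a few of the methods from this table on a small hypothetical collection. The member names a and b, the new name b_renamed, and the variable names are chosen only for this illustration. It assumes that getsampleusingtime includes samples at both the start and end times.

```matlab
% Sketch: rename a member, extract a time range, and remove a member
t = 1:5;
a = timeseries((10:10:50)',t,'name','a');     % hypothetical member
b = timeseries((1:5)',t,'name','b');          % hypothetical member
tsc = tscollection({a b},'name','demo');

% Rename member b
tsc = settimeseriesnames(tsc,'b','b_renamed');
names = gettimeseriesnames(tsc)

% Extract the samples from t = 2 through t = 4 into a new collection
sub = getsampleusingtime(tsc,2,4);
nSub = length(sub)

% Remove member a from the original collection
tsc = removets(tsc,'a');
remaining = gettimeseriesnames(tsc)
```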