Regression-Time Series Summary
Introduction
All businesses have to plan their future activities, both short and long term. Managers will have
to make forecasts of the future values of important variables such as sales, interest rates, costs
etc. In this chapter, we will look at ways of using past information to make these forecasts.
For example, we may wish to explain the variability in sales by looking at the ways in which they have changed with time, ignoring other factors. If we can explain the past pattern, we can use it to forecast future values. A data set in which the independent variable is time is referred to as a time series.
Care is required, since the historical pattern is not always relevant for particular forecasts. A company may deliberately plan to change its pattern if, for example, it has been making a loss.
There may be large external factors which completely modify the pattern. There may be a major
change in raw material prices, world inflation may suddenly increase or a natural disaster may
affect the business unpredictably.
In this section we begin by looking at time series which contain components such as trend,
seasonal variation and cyclical variation. The components can be combined in a number of
ways. We will look at two specific models: the additive components model and the
multiplicative
component model. As the names imply, the components are added or multiplied, respectively.
For each of these models, there are different ways of calculating the trend component. We will
use a combination of moving averages and linear regression.
You should note that the techniques described in this chapter are neither the only ones nor necessarily the best forecasting methods for any particular situation. There are many more
sophisticated statistical techniques; there are qualitative methods which must be used when
there is little or no past data. The Delphi technique and the scenario writing method are examples
of these.
Definitions of key terms
1. Time series - A time series is a sequence of data points, measured typically at
successive times, spaced at (often uniform) time intervals.
2. Trend – This is a pattern manifested by collected information over a period of time, often used to predict future events. It can also be used to estimate uncertain events in the past.
3. Forecast - Forecasting is the process of estimation in unknown situations. Prediction is a similar, but more general, term. Both can refer to estimation of time series, cross-sectional or longitudinal data. Usage can differ between areas of application: for example, in hydrology the terms "forecast" and "forecasting" are sometimes reserved for estimates of values at certain specific future times.
Industry context
All businesses have to plan their future activities. When making both short and long term plans,
managers will have to make forecasts of the future values of important variables such as sales,
interest rates, costs, etc. Care should be taken, since the historical pattern is not always relevant for particular forecasts. A company may deliberately plan to change its pattern if it has been making a loss.
Exam context
Time is money. Students should have an in-depth understanding of time series. The topic has been previously examined as follows:
12/06, 6/06, 12/05, 6/05, 12/04, 6/04, 12/03, 6/03, 6/02, 12/01, 6/01, 12/00, 6/00
5.1 Linear Regression
Introduction
Consider a company which regularly places advertisements for one of its products in a local
newspaper. The company keeps records, on a monthly basis, of the amount of money spent on
advertising and the corresponding sales of this product. If advertising is effective at all, then we
can see intuitively that there is likely to be a relationship between the amount of money spent on
advertising and the corresponding monthly sales. We would expect that the larger the sum spent
on advertising, the greater the sales, at least within certain limits. There are a number of factors which will work together to determine the exact value of sales each month, such as the price of a competitor's product, the time of the year, or the weather conditions. Nevertheless, if the expenditure on advertising is thought to be a major factor in determining sales, knowledge of the relationship between the two variables will be of great use in the estimation of sales and in related budgeting and planning activities.
The term association is used to refer to the relationship between variables. For the purposes of
the statistical analysis two aspects of the problem are defined. The term regression is used to
describe the nature of the relationship, while the term correlation is used to describe the strength
of the relationship.
We need to know, for example, whether the monthly advertising expenditure is strongly related to the monthly sales, and will therefore provide a reliable estimate of sales, or whether the relationship is weak and will provide a general indicator only.
The general procedure in the analysis of the relationship between variables is to use a sample
of corresponding values of the variables to establish the nature of the relationship. We can then
develop a mathematical equation, or model, to describe this relationship.
From the mathematical point of view, linear equations are the simplest to set up and analyse.
Consequently, unless a linear relationship is clearly out of the question, we would normally try to
describe the relationship between the variables by means of a linear model. This procedure is
called Linear Regression.
A measure of the fit of the linear model to the data is an indicator of the strength of the linear
relationship between the variables and, hence, the reliability of any estimates made using this
model. For Example:
Diagram 5.1 A linear relationship (scatter plot of sales per month against advertising expenditure per month)

This graph indicates that a linear regression model will be a suitable way of assessing the relationship between sales and advertising expenditure.

Diagram 5.2 A non-linear relationship (scatter plot of sales per month against advertising expenditure per month)
This graph indicates that a linear model would not be suitable in describing the relationship
between sales and advertising expenditure.
Linear regression is our first example of the use of mathematical models. The purpose of a model is to help us to understand a particular situation, possibly to explain it and then analyse it. We may use the model to make forecasts or predictions. A model is usually a simplification of the real situation. We have to make assumptions so that we can construct a manageable model. Models range from the simple two-variable type to complicated multivariable models. These models are widely used because there are many easy-to-use computer programmes available to carry out the required calculations.
This chapter will cover the steps in the analysis of a simple linear regression model. We will take
one sample of data and use it to illustrate each of the steps.
This chapter concludes with sections on multiple regression models, examples of non-linear
relationships and finally a non-parametric measure of correlation, Spearman’s rank correlation
coefficient.
Simple linear Regression Model
We are interested in whether there is any linear relationship between the two variables. For
example, we may be interested in the heights and weights of a number of people, the price and quantity of a product sold, employees' ages and salaries, chickens' ages and weights, weekly departmental costs and hours worked, or distance travelled and the time taken.
As an example, let us say that a poultry farmer wishes to predict the weights of the chicken he
is rearing. Weight is the variable which we wish to predict, therefore weight is the dependent
variable. We will plot the dependent variable on the y axis. It is suggested to the farmer that
weight depends on the chicken’s age. Age is said to be an independent variable. The independent
variable will be plotted on the x-axis. If we can establish the nature of the relationship between
the age and the weight of the chickens, then we can predict the weight of a chicken by looking
at the age.
Setting up of a simple linear regression model: Illustration
We are running a special delivery service for short distances in a city. We wish to cost the service, and to do this we must estimate the time for deliveries of any given distance.
There are factors other than the distance travelled which will affect the times taken: traffic congestion, the time of day, road works, the weather, the road system, the driver and the vehicle used. However, the initial investigation will be as simple as possible. We consider distance only, measured as the shortest practical route in miles, and the time taken in minutes.
The relevant population is all of the possible journeys, with their times, which could be made
in the city. It is an infinite population and we require a random sample from this population. For
simplicity, let us use a systematic sample design for this preliminary sample. We will measure the time and distance for every tenth journey, starting from a randomly selected day and a randomly selected hour of next week. The firm works a 6-day week, excluding Sundays. The random number, chosen by throwing a die, is 2, so next Tuesday is the chosen day. The service runs from 8 am to 6 pm. A random number between 0 and 9 is chosen from random number tables to select the starting time. The number chosen is 6, so the first journey chosen is the first one after 1 pm (i.e. in the sixth hour, beginning at 8 am); we then take every tenth delivery after that.
The sample data for the first 10 deliveries will be used for the analysis.
Table 5.1 Sample data for delivery distances and times

Distance, miles    Time, minutes
3.5                16
2.4                13
4.9                19
4.2                18
3.0                12
1.3                11
1.0                8
3.0                14
1.5                9
4.1                16
We wish to explain variations in the time taken, the dependent variable (y), by introducing distance as the independent variable (x). Generally, we would expect the time taken to increase as distance increases. If this data is plotted, the points do not lie exactly on a straight line, but they do cluster around one. This means that we could use a linear model to describe the relationship between these two variables. The points are not exactly on a line; it would be surprising if they were, in view of all the other factors which we know can affect journey time. A linear model will be an approximation only to the true relationship between journey time and distance, but the evidence of the plot is that it is the best available.
In the population from which we have taken our sample, for each distance, there are numerous
different journeys and numerous different times for each of the journeys. In fact, for any distance
there is a distribution of possible delivery times. Our sample of 10 journeys is, in effect, a
number
of different samples, each taken from these different distributions. We have taken a sample of size 1 from the 1.0 mile deliveries and the 1.3 mile deliveries, a sample of size 2 from the 3.0 mile deliveries, and we have taken no samples from the distributions for distances not in our list.
We now require a method of finding the most suitable line to fit through our sample of points. This line is referred to as the line of best fit.
The line of best fit is called the least squares regression line.
The equations for the slope and the intercept of the least squares regression line are:

Slope, b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)

Intercept, a = (Σy − bΣx) / n

where n is the sample size. The calculations for our sample of size n = 10 are given below. The linear model is:

y = a + bx
Table 5.2 Calculations for the regression line

Distance x, miles    Time y, minutes    xy       x²       y²
3.5                  16                 56.0     12.25    256
2.4                  13                 31.2     5.76     169
4.9                  19                 93.1     24.01    361
4.2                  18                 75.6     17.64    324
3.0                  12                 36.0     9.00     144
1.3                  11                 14.3     1.69     121
1.0                  8                  8.0      1.00     64
3.0                  14                 42.0     9.00     196
1.5                  9                  13.5     2.25     81
4.1                  16                 65.6     16.81    256
Totals: Σx = 28.9    Σy = 136           Σxy = 435.3    Σx² = 99.41    Σy² = 1972
The final column is used in later calculations
The slope, b = (10 × 435.3 − 28.9 × 136) / (10 × 99.41 − 28.9²) = 422.6 / 158.9 = 2.66

The intercept, a = (136 − 2.66 × 28.9) / 10 = 5.91

We now insert these values in the linear model, giving:

y = 5.91 + 2.66x

or

Delivery time (minutes) = 5.91 + 2.66 × delivery distance (miles)
The slope of the regression line (2.66 minutes per mile) is the estimated number of minutes per
mile needed for a delivery. The intercept (5.91 minutes) is the estimated time to prepare for the
journey and to deliver the goods, that is, the time needed for each journey other than the actual
travelling time. The intercept gives the average effect on journey time of all of the influential
factors except distance which is explicitly included in the model. It is important to remember
that these values are based on a small sample of data. We must determine the reliability of the
estimates, that is, we must calculate the confidence intervals for the population parameters.
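These formulas are easy to verify with a short program. The following is a minimal sketch in Python, with the data from Table 5.2 hard-coded; the variable names are ours, for illustration only:

```python
# Least squares slope and intercept for the delivery data (Table 5.2).
x = [3.5, 2.4, 4.9, 4.2, 3.0, 1.3, 1.0, 3.0, 1.5, 4.1]  # distance, miles
y = [16, 13, 19, 18, 12, 11, 8, 14, 9, 16]               # time, minutes

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx**2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: a = (Sy - b*Sx) / n
a = (sum_y - b * sum_x) / n

print(f"b = {b:.2f}, a = {a:.2f}")  # approximately b = 2.66, a = 5.91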
Strength of the linear relationship: the correlation coefficient, r
We now need to assess how well the linear model fits the data. Let us consider two variables, x and y, between which we believe a relationship is likely. The ratio of the explained variation to the total variation is used as a measure of the strength of the linear relationship. The stronger the linear relationship, the closer this ratio will be to one. The ratio is called the coefficient of determination and is given the symbol r², where:

r² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
The coefficient of determination is frequently expressed as a percentage, and tells us the amount of the variation in y which is explained by the introduction of x into the model. A perfect linear relationship means r² = 1, or 100%; no linear relationship at all means r² = 0, or 0%. The coefficient of determination does not indicate whether y increases or decreases as x increases. This information may be obtained from the Pearson product moment correlation coefficient, r. This coefficient is the square root of the coefficient of determination:

r = √( Σ(ŷ − ȳ)² / Σ(y − ȳ)² )
For the purpose of calculations it is useful to re-arrange this expression algebraically to give:
r = nΣ xy − ΣxΣy
√ (nΣx2 − (Σx)2)(nΣy2 − (Σy)2)
This is the sample correlation coefficient.
The value of r always lies between-1 and +1. The sign of r is the same as the sign of the slope,
b. If b is positive, showing a positive relationship between the variables, then the correlation
coefficient, r, will also be positive. If the regression coefficient, b, is negative then the correlation
coefficient, r, is also negative.
As the strength of the linear relationship between the variables increases, the plotted points will
lie more closely along a straight line and the magnitude of r will be closer to 1. As the strength of
the linear relationship diminishes, the value of r is closer to zero. When r is zero, there is no linear relationship between the variables. This does not necessarily mean that there is no relationship of any kind. Diagrams 5.8 and 5.9 below illustrate the two extremes: no relationship of any kind (r ≈ 0) and a very strong linear relationship (r ≈ 1).
Return to the example above, in which a model was set up to predict delivery times for journeys of a given distance within a city. The correlation coefficient, r, is calculated as follows:

r = (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²)(nΣy² − (Σy)²))

Notice that we have already evaluated the top line, and part of the bottom line, in the calculation for the slope of the regression line, b:

r = (10 × 435.3 − 28.9 × 136) / √((10 × 99.41 − 28.9²)(10 × 1972 − 136²))
  = 422.6 / √(158.9 × 1224)
  = 0.958
This value of the correlation coefficient is very close to +1 which indicates a very strong linear
relationship between the delivery distance and the time taken. This conclusion confirms the
subjective assessment made from the scatter diagram.
The coefficient of determination (r² × 100%) gives the percentage of the total variability in delivery time which we have explained in terms of the linear relationship with the delivery distance. In this example, the coefficient of determination is high:

r² = 0.958² × 100% = 91.8%

The sample model, time (minutes) = 5.91 + 2.66 × distance (miles), has explained 91.8% of the variability in the observed times. The remaining 8.2% of the variability in journey time is due to factors which we have not included in the model.
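A minimal sketch of the same calculation, again assuming the delivery data from Table 5.2:

```python
import math

x = [3.5, 2.4, 4.9, 4.2, 3.0, 1.3, 1.0, 3.0, 1.5, 4.1]
y = [16, 13, 19, 18, 12, 11, 8, 14, 9, 16]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx**2)(n*Sy2 - Sy**2))
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

print(f"r = {r:.3f}, r squared = {r * r:.1%}")  # r = 0.958, r squared = 91.8%
```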
Diagram 5.8 No relationship of any kind between the variables, r ≈ 0 (random scatter of y against x)

Diagram 5.9 A very strong linear relationship between the variables, r ≈ 1 (plot of y against x lying close to a straight line)
Prediction and estimation using the linear regression model
Predictions within the range of the sample data
We can use the model to predict the mean journey time for any given distance. If the distance is 4.0 miles, then our estimated mean journey time is:

ŷ = 5.91 + 2.66 × 4.0 = 16.55 minutes

A word of caution: it is not good practice to use the model to make predictions for values of the independent variable which are outside the range of the data used.
The relationship between time and distance may change as distance increases. For example, a
longer journey might include the use of a high speed motorway, but the sample data was drawn
only from slower city journeys. In the same way, longer journeys may have to include meal or
rest
stops which will considerably distort the time taken.
The information from which we have been working is a sample from the population of journey times within the distance range 1.0 to 4.9 miles. If we wish to extrapolate to distances outside this range, we should collect more data. If we are unable to do this, we must be very careful when using the model to predict journey times. These predictions are likely to be unreliable.
Estimation, error and residuals
How accurate are our predictions likely to be? In the next section we will consider this question in terms of the familiar idea of confidence intervals. However, it is also possible to assess the reliability of the predictions by means of the differences between the observed value of the dependent variable, y, and the predicted value, ŷ, for each value of the independent variable, x. These errors or residuals, e, are the unexplained part of each observation and are important for two reasons. Firstly, they allow us to check that the model and its underlying assumptions are sound. Secondly, we can use them to give a crude estimate of the likely errors in the predictions made using the line.
The table below gives the residuals for the example above.
Table 5.3 Calculation of residuals (y − ŷ)

Distance, miles x    Observed time, mins y    Estimated time, mins ŷ = 5.91 + 2.66x    Residual e = (y − ŷ)
3.5                  16                       15.22                                    +0.78
2.4                  13                       12.29                                    +0.71
4.9                  19                       18.94                                    +0.06
4.2                  18                       17.08                                    +0.92
3.0                  12                       13.89                                    −1.89
1.3                  11                       9.37                                     +1.63
1.0                  8                        8.57                                     −0.57
3.0                  14                       13.89                                    +0.11
1.5                  9                        9.90                                     −0.90
4.1                  16                       16.82                                    −0.82
We can examine the suitability of the model by plotting the residuals on the y-axis against either the calculated values of ŷ or, in bivariate problems, the x-values. This procedure is particularly important in multiple regression, when the original data cannot be plotted initially on a scatter diagram so that the linearity of the proposed relationship may be assessed. If the linear model is
a good fit, the residuals will be randomly and closely scattered about zero. There should be no
pattern apparent in the plot.
If the underlying relationship had in fact been a curve, then the residual pattern would have shown this very clearly: the residuals would follow a systematic curved pattern rather than scattering randomly about zero. This is the effect of fitting a linear model when the relationship between the variables is actually curvilinear. The residuals also allow us to assess the spread of the errors. One of the basic assumptions behind the least squares method is that the spread of the data about the line is the same for all values of x, that is, the amount of variability in the data is the same across the range of x.
The residual pattern for the journey time example shows two large residuals. This may indicate
that the data used do not conform to the assumptions of uniform spread. The consequences of
this will be that the confidence limits, described in the next section, will be unusable.
The only way to continue with the statistical analysis of confidence intervals and hypothesis testing in this case is to transform the data (often by taking logs of the x values) until the residual plot gives a random scatter of points about e = 0, with no large values.
Computer generated solution
The setting up and evaluation of a linear regression model can be a lengthy task, especially if the data set is moderately large. Fortunately, there are many computer packages available which will take the drudgery out of the work by performing the entire arithmetic task to produce the required statistics. Unfortunately, it is all too easy to collect data and, without any further thought, to enter it into a computer. The programme will produce a linear model, no matter how unsuitable that may be.
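As one illustration of such a package, the sketch below applies SciPy's linregress function to the delivery data; this assumes SciPy is installed, and is only one of many tools that perform the same calculations:

```python
from scipy.stats import linregress

x = [3.5, 2.4, 4.9, 4.2, 3.0, 1.3, 1.0, 3.0, 1.5, 4.1]
y = [16, 13, 19, 18, 12, 11, 8, 14, 9, 16]

result = linregress(x, y)
print(f"slope     = {result.slope:.2f}")      # about 2.66 minutes per mile
print(f"intercept = {result.intercept:.2f}")  # about 5.91 minutes
print(f"r         = {result.rvalue:.3f}")     # about 0.958
print(f"p-value   = {result.pvalue:.4f}")     # two-sided test of H0: slope = 0
```

Comparing this output with the hand calculation is a useful check that you understand what the package is doing.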
Illustration
Leisure Publishers Ltd. recently published 20 romantic novels by 20 different authors. Sales
ranged from just over 5,000 copies for one novel to about 24,000 copies for another novel.
Before publishing, each novel had been assessed by a reader who had given it a rating between
1 and 10. The managing director suspects that the main influence on sales is the cover of the
book. The illustrations on the front covers were drawn either by artist A or artist B. The short
description on the back cover of the novel was written by either editor C or editor D.
A multiple regression analysis was done using the following variables:
Y: sales (millions of shillings)
X1: 1 if the front cover is by artist A; 2 if the front cover is by artist B
X2: readers' rating
X3: 1 if the short description of the novel is by editor C; 2 if it is by editor D
The computer analysis produced the following results:
Correlation coefficient r = 0.921265
Standard error of estimate = 2.04485
Analysis of variance
             Degrees of freedom    Sum of squares    Mean square    F ratio
Regression   3                     375.37            125.12         29.923
Residual     16                    66.903            4.1814
Individual analysis of variables:

Variable    Coefficient    Standard error    F value
Constant    15.7588        2.54389           38.375
X1          −6.25485       0.961897          42.284
X2          0.0851136      0.298272          0.081428
X3          5.86599        0.922233          40.457741

Correlation coefficients:

1    −0.307729    0            −0.674104
     1            0.123094     0.310838
                  1            0.627329
                               1
Required:
(a) The regression equation.
(b) Does the regression analysis provide useful information? Explain.
(c) Explain whether the covers were more important for sales than the known quality of the novels.
(d) State with 95% confidence the difference in sales of a novel if its cover illustrations were done by artist B instead of artist A.
(e) State with 95% confidence the difference in sales of a novel if its short description was by editor D and not editor C.
Solution
(a) Regression equation: ŷ = 15.7588 − 6.25485X1 + 0.0851136X2 + 5.86599X3
(b) The regression analysis provides useful information.
r = 0.921265 ⇒ R² = 0.85
The regression equation explains about 85% of the variation in sales. There is a high positive linear relationship between the sales and the independent variables.
(c) Generally, the bigger the value of F, the better the predictor. Readers' rating has a small value of F compared with the other variables, while the cover variables have large F values. This implies that the covers were more important for sales than the known quality of the novels.
(d) α = 5%, α/2 = 0.025
degrees of freedom = 20 − 4 = 16, so t-critical = 2.12
95% confidence interval = - 6.25485 ± 2.12 × 0.961897
= - 6.25485 ± 2.03922
= - 8.29407 < B1 < - 4.21563
We are 95% confident that a cover by artist B instead of artist A reduces sales by between
4216 and 8294 copies.
(e) 95% confidence interval = 5.86599 ± 2.12 × 0.922233
= 5.86599 ± 1.95513
= 3.911 < B3 < 7.821
We are 95% confident that using short descriptions by editor D instead of editor C will increase sales by between 3,911 and 7,821 copies.
Statistical inference in linear regression analysis
The underlying assumptions
In this section, we will discuss some of the necessary assumptions underlying the further analysis
of the linear regression model. The data from which a linear regression model is constructed is a
sample of the population of pairs of x and y values. Essentially we are using the sample to build
a
model which we hope will represent the relationship in the population as a whole. The
relationship
between the dependent variable, y, and the independent variable, x, is described by:

y = α + βx + ε

where ε is the deviation of the actual value of y from the line

y = α + βx

for a given value of x. This is the linear model we would set up if we had all the population data.
For any given x, the population of y values is assumed to be normally distributed about the population line, with the same variance for all x's. μy/x is the mean of all the y values for a given x. As before, Greek letters refer to the population parameters, such as α and β. ε is the error, or residual: the difference between the actual y value and the mean value from the line. If the least squares method is used to determine the line of best fit, then we are minimising Σε². The linear model which we calculate from the sample is:

ŷ = a + bx

where ŷ is an estimate of the population mean of y for a given value of x, and a and b are the sample statistics used to estimate the population parameters α and β.
As in any sampling situation, if we take a second sample, different values of a and b will arise. There is an exact analogy between the use of x̄ to estimate μ and the use of a and b to estimate α and β. By making assumptions about the sampling distribution of x̄, we can find confidence intervals for the value of the population mean μ. Exactly the same procedure may be used for α and β, by making inferences from the sample values of a and b. Our basic model is:

y = α + βx + ε
Assumptions
1. The underlying relationship is linear.
2. The values of the independent variable, x, are measured without error.
3. The errors, or residuals, ε, are normally distributed.
4. For any given x, the expected value of ε is zero, i.e. E(ε) = 0.
5. The variance of ε is constant for all values of x, i.e. var(ε) = σ².
6. The errors are independent.
These assumptions hold if the population of y values, for a given x, is normal with mean:

μy/x = α + βx

where μy/x denotes the mean of y for a given x, and with variance σ².
The line set up from the sample data is the estimate of this population line, with a as the best estimate of α and b as the best estimate of β. Since there are many possible samples which could be drawn from a given population, it is not possible to be sure that a particular set of sample data is actually drawn from the given population. Hypothesis tests are used to assess the compatibility of the sample with the population; in particular, they show how confident we may be about the linearity of the parent population. If there is no linear relationship in the population, then r, the population correlation coefficient, will be zero, and β, the slope of the regression line, will also be zero. Once we have tested the overall linearity, we may wish to calculate confidence intervals for the slope β, for the intercept α, for the mean value of y given a value of x, or for individual values of y for a given x. We will use a random sample to calculate sample statistics and to estimate the corresponding population parameters.
Hypothesis tests to assess the overall linearity of the relationship
We are using sample data which has been drawn at random from a population in order to
estimate
a suitable linear relationship for the population. We do not actually know that the underlying
relationship in the population is linear. The random sampling process could, quite legitimately
result in a sample which exhibits linear properties but which was actually drawn from a
population
in which the underlying relationship is not linear.
We require some means of assessing the likelihood that a linear relationship in the sample
implies a linear relationship in the population. Hypothesis tests help with this assessment. As in
any situation in which hypothesis tests are used, we can never prove beyond all doubt that the
population relationship is compatible with the relationship derived from the sample. We simply
determine the consistency, or otherwise, of the sample evidence with the given null hypothesis.
Linear regression generates several statistical figures, and it is possible to perform separate hypothesis tests on these. We therefore build up a cumulative picture of the evidence for or against the basic hypothesis of a linear relationship in the population.
We will now look at these hypothesis tests in turn. The null hypothesis is essentially the same for all of the tests: that there is no linear relationship between the dependent and independent variables in the population.
Testing the population correlation coefficient
The evaluation of the Pearson product moment correlation coefficient, r, depends on the size of the sample. The interpretation of the value of r within the sample is independent of the sample size, but the implications for the population relationship are different for different sample sizes. A different inference will be drawn when considering a correlation coefficient of, say, 0.90 which arises from a sample of 6 items, compared to the same value arising from a sample of 20 items. We can feel more confident that the underlying relationship is linear in the second case, since the chance of obtaining a sample which exhibits linearity, from a population which does not, decreases as the size of the sample increases. The correlation coefficient is assessed using a t test:
H0: r = 0; there is no linear relationship between the y and x variables. The independent variable does not help in predicting the values of y.
H1: r ≠ 0; there is some linear relationship between the x and y variables. x does help to predict the y values.
Using this alternative hypothesis, we have a two-sided test. If we had decided that only a positive value for r would be sensible, then H1: r > 0 and we would use a one-sided test. The test statistic is:
t = r √(n − 2) / √(1 − r²)
The number of degrees of freedom is (n − 2), because we have calculated x̄ and ȳ to find r, using up two degrees of freedom; n is the number of pairs of values in the sample. If we wish to test at the 5% level using a two-tail test, the test statistic is compared with t0.025,(n−2) found from the tables.
To illustrate the procedure, we will return to the example which was concerned with the estimation of journey times from the journey distance. Previously we found that r = 0.958, therefore the test statistic is:

t = 0.958 × √(10 − 2) / √(1 − 0.958²) = 2.709 / √0.082 = 9.45
The number of degrees of freedom is: (10-2) =8
From the tables
t0.025, 8=2.306
The test statistic (9.45) is greater than 2.306; therefore, we reject H0 at the 5% level of
significance
and choose to accept H1. The evidence is not consistent with the null hypothesis at this level.
We assume that the correlation coefficient in the population is not zero and that there is a linear
relationship between journey time and distance. This is the result we would expect with such a
high value of the sample correlation coefficient, r.
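A minimal sketch of this test in code, assuming SciPy is available for the t distribution tables:

```python
import math
from scipy.stats import t

r, n = 0.958, 10

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r**2)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_crit = t.ppf(1 - 0.025, df=n - 2)  # two-tailed test at the 5% level

print(f"t = {t_stat:.2f}, critical value = {t_crit:.3f}")  # 9.45 vs 2.306
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```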
Hypothesis test on the slope of the simple regression line
In simple linear regression, the hypothesis test on the slope of the line does exactly the same
job as the test on the correlation coefficient. We do either one test or the other, but not both. In
multiple regressions, however, where we have a regression coefficient for each of the
independent
variables, the two tests fulfill different functions and both are needed.
H0: β = 0; there is no linear relationship between the variables, and x does not help to predict y.
H1: β ≠ 0; there is a linear relationship, and x does help to predict the y values.
In this case a two-sided test is used. However, as in the test on r, we can alter this to a one-sided test if we think that β > 0 or β < 0 is a more appropriate alternative hypothesis. When the population variance is unknown, the test statistic for a sample mean is:
t = (sample statistic − parameter assumed in H0) / (best estimate of the standard error of the statistic) = (x̄ − μ) / (s/√(n − 1))
The test statistic for the regression coefficient, b, is:

t = (b − 0) / (estimated standard error of b)

The estimated standard error of b is:

se(b) = σ̂e / √(Σ(x − x̄)²)
where σe² is the variance of the distribution of the residuals about the population regression line. Remember, we are assuming that this variance is the same for all values of x. σ̂e² is the best estimate of this population variance:

σ̂e² = Σe² / (n − 2) = Σ(y − ŷ)² / (n − 2)
For computational purposes, this expression may be re-arranged algebraically as:

σ̂e² = (Σy² − aΣy − bΣxy) / (n − 2)

Again, to illustrate the procedure, refer to the example about the journey times and distances. Using the first expression:

σ̂e² = (0.78² + 0.71² + … + 0.82²) / 8 = 10.01 / 8 = 1.25

Therefore: σ̂e = 1.12 minutes
And:

Σ(x − x̄)² = Σx² − (Σx)²/n = 99.41 − 28.9²/10 = 15.889, so √(Σ(x − x̄)²) = 3.99

Therefore:

se(b) = 1.12 / 3.99 = 0.281

The test statistic for β is:

t = (2.66 − 0) / 0.281 = 9.47
If allowance is made for rounding errors, this value of t is the same as the t statistic obtained in
the test on the correlation coefficient, 9.47 compared with 9.45.
To test at the 5% level using a two-sided test, we compare the test statistic with the boundary
value found from the tables.
t 0.025, 8=2.306
Since 9.47>2.306, we reject H0 and choose to accept H1. At the 5% decision level, the evidence
is not consistent with the null hypothesis. This is the same conclusion as before. We choose to
assume that there is a linear relationship between the journey time and distance. i.e. β≠0, x does
help to explain the variability in y.
Confidence intervals in linear regression analysis
Confidence interval for the slope of the population regression line β
The (1 − p)100% confidence interval for the slope, β, defines a range of values about b, the sample estimate of β, within which we may be (1 − p)100% confident that the actual value of β lies. Put another way, for (1 − p)100% of samples, the true value of β will lie within the confidence interval.
From the above we know that the (1 − p)100% confidence interval for β is:

b ± tp/2,(n−2) × se(b)

Let us calculate the 95% confidence interval for the slope of the regression model derived in the example about journey times and distances:

b ± t0.025,8 × se(b) = 2.66 ± 2.31 × 0.281 = 2.66 ± 0.65
We are 95% confident that the population slope, β lies between 2.01 and 3.31 minutes per mile.
There is a 5% chance that β lies outside the range.
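A short sketch of this interval calculation, using the figures already obtained for the delivery example (SciPy assumed for the t tables):

```python
from scipy.stats import t

b, se_b, n = 2.66, 0.281, 10   # slope, its standard error, sample size

t_crit = t.ppf(1 - 0.025, df=n - 2)   # about 2.306 for 8 degrees of freedom
half_width = t_crit * se_b            # about 0.65

print(f"95% CI for beta: {b - half_width:.2f} to {b + half_width:.2f}")
# approximately 2.01 to 3.31 minutes per mile
```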
Confidence interval for the mean value of y for a given value of x
We now return to the basic assumption in regression analysis that for a given value of x, which
we will call x0, the possible values of y are normally distributed. The mean value of these
normal
distributions is the value of y on the population regression line. We will call this mean value
μy/x.
The (1 − p)100% confidence interval for μy/x is:

ŷ ± tp/2,(n−2) × σ̂e × √(1/n + (x0 − x̄)²/Σ(x − x̄)²)

where ŷ is the estimated value calculated from the sample regression line, ŷ = a + bx0.
Notice that this confidence interval depends on the value of x0. The width of the interval about the regression line therefore varies as x0 varies. The interval is at its narrowest when x0 = x̄, the sample mean of the x's. The interval is then the more familiar:

ŷ ± tp/2,(n−2) × σ̂e / √n

As x0 moves further away from x̄, in either direction, the width increases.
In the example which we have been following, the 95% confidence interval for μy/x is:

ŷ ± 2.31 × 1.12 × √(1/10 + (x0 − 2.89)²/15.89), where ŷ = 5.91 + 2.66x0

Values for this interval will be calculated in the next section.
Confidence interval for individual values of y for a given value of x
A further assumption of the model is that the y values are distributed about the regression line with variance σe², which is the same for all values of x. Since we are using a sample, there are two elements of variability for individual y values. One arises from the estimated position of the mean, μy/x, and the other arises from the variability of the individual values about this mean.
The two elements are different, in that the first is due to the fluctuations inherent in sampling, and its effect can be reduced if the sample size is increased. The second is due to the nature of the variables and is unavoidable. It can be argued, therefore, that the confidence interval for individual values of y is not like other confidence intervals, which arise entirely from sampling fluctuation. Some books refer to this as a 'prediction' interval rather than a confidence interval. Whatever name is used, it is important to understand the distinction between the (1 − p)100% interval for μy/x and that for the individual y's, when x is given. The confidence interval expressions look similar; the only difference is that the variance for individual y's given x is increased by σe². The (1 − p)100% confidence interval for individual y values, given x = x0, is:

ŷ ± tp/2,(n−2) × σ̂e × √(1 + 1/n + (x0 − x̄)²/Σ(x − x̄)²)

where ŷ = a + bx0.
The 95% confidence interval for an individual y in the example above is:

ŷ ± 2.31 × 1.12 × √(1 + 1/10 + (x0 − 2.89)²/15.89)
The table below illustrates the behaviour of the two confidence intervals as x0 changes.

Table 5.4 Calculation of confidence intervals for μy/x and for y given x0, for the example above

Distance x0, miles    Estimated time ŷ = 5.91 + 2.66x0, minutes    95% CI for μy/x, ± mins    95% CI for y given x0, ± mins
1.0                   8.57                                         1.47                       2.98
2.0                   11.23                                        1.00                       2.77
2.89 (x̄)              13.60                                        0.82                       2.71
3.0                   13.89                                        0.82                       2.71
4.0                   16.55                                        1.09                       2.81
4.9                   18.94                                        1.54                       3.01
These are long and complicated calculations for a sample of only 10 pairs of values. Much of the work can be taken away by a computer package, but it is important to understand what the package is doing and how to interpret its output. Unfortunately, different packages use slightly different terms and symbols. If you have a regression analysis package available to you, it will help if you work through a simple example, like the one in this book, by hand, and then use the package. Compare the computer output with the manual calculation until you thoroughly understand it. A clear understanding of the two-variable linear model will also help considerably when we tackle multiple regression, the calculations for which are always done by computer.
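The interval calculations of Table 5.4 are a good candidate for such a manual-versus-computer check. A sketch using the summary figures from the delivery example (SciPy assumed for the t tables; the helper name intervals is ours):

```python
import math
from scipy.stats import t

# Figures from the worked example (n = 10 journeys)
a, b = 5.91, 2.66                 # fitted line
sigma_e = 1.12                    # estimated residual standard deviation
x_bar, sxx, n = 2.89, 15.89, 10   # mean of x, sum of (x - x_bar)**2, sample size

def intervals(x0, p=0.05):
    """Return (fit, half-width for mean y, half-width for individual y) at x0."""
    y_hat = a + b * x0
    t_crit = t.ppf(1 - p / 2, df=n - 2)
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    mean_hw = t_crit * sigma_e * math.sqrt(core)       # CI for mean y given x0
    indiv_hw = t_crit * sigma_e * math.sqrt(1 + core)  # interval for a single y
    return y_hat, mean_hw, indiv_hw

for x0 in (1.0, 2.89, 4.9):
    y_hat, m, i = intervals(x0)
    print(f"x0={x0}: fit={y_hat:.2f}, mean +/-{m:.2f}, individual +/-{i:.2f}")
```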
Multiple linear regression models
In the previous section it was mentioned that the chosen independent variable is unlikely to be
the only factor which affects the dependent variable. There will be many situations in which we
can identify more than one factor which we feel must influence the dependent variable. For example, we may wish to predict the cost per week of a production department. It is reasonable to suppose that departmental costs will be affected by production hours worked, raw materials used, number of items produced and maintenance hours worked.
It seems sensible to use all of the factors we have identified to predict the departmental costs
for a sample week, we can collect the data on costs, production hours, raw material usage, etc,
but we will no longer be able to investigate the nature of the relationship between costs and
other variables by means of a scatter diagram. In the absence of any evidence to the contrary,
we begin by assuming a linear relationship and only if this proves to be unsuitable, will we try to
establish a non-linear model. The linear model for multiple linear regression is:

y = α + β1x1 + β2x2 + … + βnxn + ε
The variation in y is explained in terms of the variation in a number of independent variables which, ideally, should be independent of each other. For example, if we decide to use 5 independent variables, the model is:

y = α + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε

As with simple linear regression, the sample data is used to estimate α, β1, β2, etc. The line of best fit for the sample data is:

ŷ = a + b1x1 + b2x2 + … + bnxn
The principles of the least squares method are the same as those for the simple regression case. However, here they lead to very complex calculations. Fortunately, a computer package performs these calculations, leaving us free to concentrate on the interpretation and evaluation of the multiple linear regression model.
In the following section, we will indicate the steps which should be taken when setting up and
using a multiple regression model, but at all stages, we assume that the actual calculations will
be done by computer.
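Although the arithmetic is left to the computer, it may help to see how such a model can be fitted in practice. A minimal sketch using NumPy's least squares solver; the departmental cost figures and factor values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical sample: weekly departmental cost explained by production
# hours (x1) and raw material usage (x2).  Illustrative numbers only.
X = np.array([
    [1, 30, 10],   # the leading 1 estimates the intercept, a
    [1, 35, 12],
    [1, 40, 15],
    [1, 28,  9],
    [1, 45, 20],
    [1, 38, 14],
])
y = np.array([520, 590, 670, 500, 760, 640])

# Least squares estimates of a, b1, b2 for y = a + b1*x1 + b2*x2 + e
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("a, b1, b2 =", np.round(coeffs, 3))
```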
The time series components
The value of a variable, such as sales will change over time due to a number of factors. For
example, a company may be expanding hence there will be an upward movement in the sales
of the product as time progresses. The general change in the value of a variable over time is
referred to as the trend, T. In time series analysis the trend is the line of best fit; the model is then used to forecast future trend values. In practice, there may be no trend at all, with demand fluctuating about a fixed value, or, more likely, there may be a non-linear trend. The graphs below represent the trend in demand at different stages of a product's life cycle.
There is an underlying upward trend for the newly launched product, and a dying curve for the old product reaching the end of its economic life. It is difficult to fit equations to these trend curves.
The moving average technique, described in the following sections, can be used to separate
the trend from the seasonal pattern. The technique uncovers the historical trend by smoothing
away the seasonal fluctuations. However, the moving average trend is not used to forecast future
trend values because there is too much uncertainty involved in extending such a series.
Diagram 5.10 Sales of a successful new product (sales rising over time)
1 8 4 quantitative techniques
STUD Y TEX T
Diagram 5.12 Sales of a product nearing the end of its life (sales declining over time)
Generally, it is found that the values do not follow the trend exactly; there are regular fluctuations about it. If these regular fluctuations occur in the short term, they are referred to as seasonal variation, S. Longer term fluctuation is called cyclical variation.
The seasonal pattern in examples used in this chapter refers to the traditional seasons, but
in forecasting generally, the term ‘season” is applied to any systematic pattern. It may be the
pattern of retail sales during the week, in which case the ‘season’ is a day. We may be interested
in a seasonal pattern of traffic flow during the day and during the week. This will give us an
hourly
‘seasonal’ pattern, superimposed on a daily ‘seasonal’ pattern, which both fluctuate about a daily
trend. If we use annual data, we cannot identify a seasonal pattern. Any fluctuation about the
annual trend data would be described by the cyclical component. This cyclical factor will not be
included in our examples. It is seen only in long-term data, covering 10, 15 or 20 years, where
large scale economic factors cause additional fluctuations about the trend.
These cyclical factors were apparent in economic data from about 1960 to 1975. This was the time when many of these forecasting ideas were being developed, but since then the overall economic pattern has changed. We will concentrate on short-term models which exclude the cyclical component.
The final term in our model also arises in the linear regression model. It is error or residual, the
part of the actual observation that we cannot explain using the model. We can use the errors to
give us a measure of how well our model fits the data. Two measures are usually used. These are the mean absolute deviation,

MAD = Σ|Et| / n = Σ|actual − forecast| / n

the sum of all the errors, ignoring their sign, divided by the number of forecasts; and the mean square error,

MSE = ΣEt² / n
This is the sum of the squares of the errors, divided by the number of forecasts. This second
measure emphasises the large errors.
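Both measures are straightforward to compute. A minimal sketch (the actual and forecast figures below are made up for illustration):

```python
def mad(actual, forecast):
    """Mean absolute deviation: sum of |actual - forecast| divided by n."""
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)

def mse(actual, forecast):
    """Mean square error: sum of (actual - forecast)**2 divided by n."""
    errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)

# Tiny illustrative check with made-up numbers
actual = [239, 201, 182, 297]
forecast = [242.6, 199.3, 178.0, 300.1]
print(mad(actual, forecast), mse(actual, forecast))
```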
In the analysis of a time series, we attempt to identify the factors which are present and to build
a model which combines them in an appropriate way.
Example: To illustrate the choice of an appropriate time series model.
The data below are quantities of a product sold by Lewplan Plc during the last 13 three-monthly periods.

Table 5.11 Quantities of product sold over the last 13 three-monthly periods

Date                 Quantity sold, '000
Jan – Mar 19x6       239
Apr – Jun            201
Jul – Sep            182
Oct – Dec            297
Jan – Mar 19x7       324
Apr – Jun            278
Jul – Sep            257
Oct – Dec            384
Jan – Mar 19x8       401
Apr – Jun            360
Jul – Sep            335
Oct – Dec            462
Jan – Mar 19x9       481
We wish to analyse this data set to see if we can find a historical pattern. If there is a consistent pattern, we will use it to forecast the quantity sold in subsequent three-monthly periods.
Solution: The figures are plotted below. In time series diagrams it is customary to join the points with straight lines so that any pattern can be seen more clearly.

Diagram 5.10 Lewplan Plc, sales per 3 months (quantity sold per 3 months, '000s, plotted by quarter for 19x6–19x9; approximately equal seasonal variation, indicating an additive model)
The diagram suggests that there may be an increasing trend, overlaid by seasonal fluctuations. The sales in the winter seasons, quarters 1 and 4, are consistently higher than those in the summer seasons, quarters 2 and 3. The seasonal component appears to be fairly constant over the 3 years. The trend is for the sales to increase overall from around 230 in 19x6 to 390 in 19x8, but the seasonal fluctuations have not increased. This indicates that the additive component model should be the more suitable, as described in the next section.
The analysis of an additive component model, A = T + S + E
The additive component model is one in which the variation of the value of the variable over time can be described by adding the relevant components. Assuming that cyclical variation is not included, the actual value of the variable, A, may be modelled by:
Actual value = trend + seasonal variation + error
That is:
A = T + S + E
In both the additive and the multiplicative component models, the general analysis procedure is the same:
Step 1: Calculate the seasonal components.
Step 2: Remove the seasonal component from the actual values (this is called deseasonalising the data), then calculate the trend from these deseasonalised figures.
Step 3: Deduct the trend figures from the deseasonalised figures to leave the errors.
Step 4: Calculate the mean absolute deviation (MAD) or the mean square error (MSE) to judge whether the model is reasonable, or to select the best from different models.
Calculate the seasonal components for the additive model
Example: Setting up the additive component model for a time series
Refer to the example in the previous section, which relates to the quarterly sales of Lewplan Plc. We have already decided that an additive model is appropriate for these data; therefore the actual sales may be modelled by:
A = T + S + E
To eliminate the seasonal components, we will use the method of moving averages. If we add together the first 4 data points, we obtain the total sales for 19x6; dividing this by 4 gives the quarterly average for 19x6, i.e.
(239 + 201 + 182 + 297)/4 = 229.75
This figure contains no seasonal component because we have averaged the seasons out over the year; it is an estimate of the trend at the middle of the year, that is, at the mid-point between quarters 2 and 3. If we move on 3 months, we can calculate the average quarterly figure for April 19x6 – March 19x7 (251), for July 19x6 – June 19x7 (270.25), and so on. This process generates the 4-point moving averages for this set of data. The set of moving averages represents the best estimate of the trend in the demand.
We now use these trend figures to produce estimates of the seasonal components. We calculate:
A – T = S + E
Unfortunately, the estimated values of the trend given by the 4-point moving averages are for points in time which are different from those for the actual data. The first value, 229.75, represents a point in the middle of 19x6, exactly between the April–June and the July–September quarters. The second value, 251, falls between the July–September and the October–December actual figures. We require a deseasonalised average which corresponds with the figure for an actual quarter. The position of the deseasonalised averages is moved by re-averaging pairs of values. The first and second values are averaged, centring them on July–September 19x6, i.e.:
(229.75 + 251)/2 = 240.4
This is the deseasonalised average for July–September 19x6. This deseasonalised value, called the centred moving average, can be compared directly with the actual value for July–September 19x6 of 182. Notice that this means we have no estimated trend figures for the first 2 or the last 2 quarters of the time series. The results of these calculations are shown in the table below.
Table 5.12 Calculation of the centred 4-point moving average trend values for the model A = T + S + E
(The 4-quarter totals and moving averages fall between pairs of quarters and are shown on the intermediate lines.)

Date              Quantity '000 A    4-quarter total    4-quarter moving average    Centred moving average T    Estimated seasonal component A − T = S + E
Jan–Mar 19x6      239
Apr–Jun           201
                                     919                229.75
Jul–Sep           182                                                               240.4                       −58.4
                                     1004               251
Oct–Dec           297                                                               260.6                       +36.4
                                     1081               270.25
Jan–Mar 19x7      324                                                               279.6                       +44.4
                                     1156               289
Apr–Jun           278                                                               299.9                       −21.9
                                     1243               310.75
Jul–Sep           257                                                               320.4                       −63.4
                                     1320               330
Oct–Dec           384                                                               340.3                       +43.8
                                     1402               350.5
Jan–Mar 19x8      401                                                               360.2                       +40.8
                                     1480               370
Apr–Jun           360                                                               379.8                       −19.8
                                     1558               389.5
Jul–Sep           335                                                               399.5                       −64.5
                                     1638               409.5
Oct–Dec           462
Jan–Mar 19x9      481
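The moving average calculations of Table 5.12 can be reproduced with a few lines of code. A sketch using the Lewplan quarterly sales:

```python
# Centred 4-point moving average trend, as in Table 5.12.
sales = [239, 201, 182, 297, 324, 278, 257, 384, 401, 360, 335, 462, 481]

# 4-quarter moving averages (one per group of 4 consecutive quarters)
ma4 = [sum(sales[i:i + 4]) / 4 for i in range(len(sales) - 3)]

# Centre by averaging adjacent pairs; the first centred value lines up
# with the third quarter, so two quarters at each end have no trend figure.
centred = [(ma4[i] + ma4[i + 1]) / 2 for i in range(len(ma4) - 1)]

for quarter, trend in enumerate(centred, start=3):
    actual = sales[quarter - 1]
    print(f"quarter {quarter}: trend = {trend:.1f}, A - T = {actual - trend:+.1f}")
```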
For each quarter of the year, we have estimates of the seasonal components, which include some error or residual. There are two further stages in the calculations before we have usable seasonal components. First, we average the seasonal estimates for each season of the year; this should remove some of the errors. Finally, we adjust the averages, moving them all up or down by the same amount, until their total is zero. This is done because we require the seasonal components to average out over the year. The correction required is to subtract (the sum of the estimated seasonal values)/4 from each estimate. The estimates from the last column in Table 5.12 are shown under their corresponding quarter numbers, and the procedure is shown in the table below.
Table 5.13 Calculation of the average seasonal components

Quarter:                        1          2          3           4
19X6                            -          -          −58.4       +36.4
19X7                            +44.4      −21.9      −63.4       +43.8
19X8                            +40.8      −19.8      −64.5       -
Total                           +85.2      −41.7      −186.3      +80.2
Average                         85.2÷2     −41.7÷2    −186.3÷3    80.2÷2
Estimated seasonal component    +42.6      −20.8      −62.1       +40.1    (sum = −0.2)
Adjusted seasonal component     +42.6      −20.7      −62.0       +40.1    (sum = 0)
In this example, two of the seasonal components have been rounded up, and two have been
rounded down, so that the sum equals zero.
The seasonal components confirm our comments on the diagram at the end of the section
above. Both winter quarters are above the trend by about 40 thousand units and the two summer
quarters are below the trend by 21 and 62 thousand units, respectively.
A similar procedure is used for seasonal variation over any period. For example, if the seasons
are days of the week, take a 7-point moving average to remove the daily ‘seasonal’ effect rather
than a 4-point moving average. This average will represent the trend at the middle of the week,
that is, on day 4, therefore it is not necessary to centre these moving averages.
Deseasonalise the data to find the trend.
Fast Forward
The term "trend analysis" refers to the concept of collecting information and attempting to spot
a pattern.
Step 2 is to deseasonalise the basic data. This is shown below by deducting the appropriate
seasonal component from each quarter’s actual sales, i.e. A – S = T + E
Table 5.14 Calculation of the deseasonalised data

Date              Quarter no    Quantity sold '000 A    Seasonal component S    Deseasonalised quantity '000 A − S = T + E
Jan–Mar 19x6      1             239                     (+42.6)                 196.4
Apr–Jun           2             201                     (−20.7)                 221.7
Jul–Sep           3             182                     (−62.0)                 244.0
Oct–Dec           4             297                     (+40.1)                 256.9
Jan–Mar 19x7      5             324                     (+42.6)                 281.4
Apr–Jun           6             278                     (−20.7)                 298.7
Jul–Sep           7             257                     (−62.0)                 319.0
Oct–Dec           8             384                     (+40.1)                 343.9
Jan–Mar 19x8      9             401                     (+42.6)                 358.4
Apr–Jun           10            360                     (−20.7)                 380.7
Jul–Sep           11            335                     (−62.0)                 397.1
Oct–Dec           12            462                     (+40.1)                 421.9
Jan–Mar 19x9      13            481                     (+42.6)                 438.4
These re-estimated trend values, with errors, can be used to set up a model for the underlying trend. The values are plotted on the original diagram, which now shows a clear linear trend.
Diagram 5.11 Lewplan Plc, actual and deseasonalised sales per 3 months (actual values plotted with the estimated trend line)
The equation of the trend line is:

T = a + b × quarter number

where a and b represent the intercept and slope of the line. The least squares method can be used to determine the line of best fit; therefore the equations for a and b, from the earlier section on linear regression, are:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)  and  a = Σy/n − bΣx/n

where x is the quarter number and y is (T + E) in the above table. Using a calculator, we find:

Σx = 91, Σx² = 819, Σy = 4158.7, Σxy = 32747.1, n = 13

It follows by substitution that b = 19.978 and a = 180.046. Hence the equation of the trend model may be written:

Trend quantity ('000s) = 180.0 + 20.0 × quarter number
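The trend-fitting step can be checked with a short program. A sketch applying the least squares formulas to the deseasonalised figures of Table 5.14:

```python
# Fitting the trend line to the deseasonalised data of Table 5.14.
deseasonalised = [196.4, 221.7, 244.0, 256.9, 281.4, 298.7, 319.0,
                  343.9, 358.4, 380.7, 397.1, 421.9, 438.4]
quarters = list(range(1, 14))

n = len(quarters)
sx, sy = sum(quarters), sum(deseasonalised)
sxy = sum(q * d for q, d in zip(quarters, deseasonalised))
sx2 = sum(q ** 2 for q in quarters)

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = sy / n - b * sx / n
print(f"T = {a:.1f} + {b:.1f} x quarter number")  # about 180.0 + 20.0 x quarter
```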
Calculate the errors
Step 3 in the procedure, before using the model to forecast, is to calculate the errors or residuals. The model is:
A = T + S + E
We have calculated S and T. We can now deduct each of these from A, the actual quantity, to find the errors in the model.
Table 5.15 Calculation of errors for the additive component model

Date              Quarter no    Quantity sold '000 A    Seasonal component S    Trend component '000 T    Error '000 E = A − S − T
Jan–Mar 19x6      1             239                     (+42.6)                 200                       −3.6
Apr–Jun           2             201                     (−20.7)                 220                       +1.7
Jul–Sep           3             182                     (−62.0)                 240                       +4.0
Oct–Dec           4             297                     (+40.1)                 260                       −3.1
Jan–Mar 19x7      5             324                     (+42.6)                 280                       +1.4
Apr–Jun           6             278                     (−20.7)                 300                       −1.3
Jul–Sep           7             257                     (−62.0)                 320                       −1.0
Oct–Dec           8             384                     (+40.1)                 340                       +3.9
Jan–Mar 19x8      9             401                     (+42.6)                 360                       −1.6
Apr–Jun           10            360                     (−20.7)                 380                       +0.7
Jul–Sep           11            335                     (−62.0)                 400                       −3.0
Oct–Dec           12            462                     (+40.1)                 420                       +1.9
Jan–Mar 19x9      13            481                     (+42.6)                 440                       −1.6
The final column in the table can be used for step 4, the calculation of the mean absolute deviation (MAD) or the mean square error (MSE) of the errors:

MAD = Σ|Et| / n = 28.7/13 = 2.2 ('000 units)
MSE = ΣEt² / n = 78.85/13 = 6.1

The errors are small, at about 1% or 2% of the actual values. The historical pattern is highly consistent and should give a good short-term forecast.

Forecasting using the additive model
The forecast for this additive component model is:

F = T + S ('000 units per quarter)

where the trend component is T = 180 + 20 × quarter number, and the seasonal components, S, are +42.6 for January–March, −20.7 for April–June, −62.0 for July–September and +40.1 for October–December.
The quarter number for the next three-monthly period, April–June 19x9, is 14; therefore the forecast trend is:

T14 = 180 + 20 × 14 = 460 ('000 units per quarter)

The appropriate seasonal component is −20.7 ('000 units). Therefore the forecast for this quarter is:

F(April–June 19x9) = 460 − 20.7 = 439.3 ('000 units)
It is important to remember that the further ahead the forecast, the more unreliable it becomes.
We are assuming that the historical pattern continues uninterrupted. This assumption may hold
for short periods but is less and less likely to be true the further we go into the future.
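The complete additive forecast can be wrapped in a small function. A sketch using the trend equation and adjusted seasonal components found above (the function name and dictionary are ours, for illustration):

```python
# Forecasting with the additive model F = T + S.
SEASONALS = {1: +42.6, 2: -20.7, 3: -62.0, 4: +40.1}  # adjusted components

def forecast(quarter_number):
    """Forecast quantity ('000 units); quarter 1 = Jan-Mar 19x6."""
    trend = 180 + 20 * quarter_number
    season = SEASONALS[(quarter_number - 1) % 4 + 1]
    return trend + season

print(forecast(14))  # Apr-Jun 19x9: 460 - 20.7 = 439.3
```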
The analysis of the multiplicative component model: A = T × S × E
In some time series, the seasonal component is not a fixed amount each year. Instead it is a
percentage of the trend values. As the trend increases, so does the seasonal variation.
Example: Setting up a multiplicative component model for a time series.
CD Plc sells a range of products. The quarterly sales of one product for the last 13 quarters are given below.

Table 5.16 CD Plc quarterly sales

Date              Quarter number    Quantity sold A
Jan–Mar 19x6      1                 70
Apr–Jun           2                 66
Jul–Sep           3                 65
Oct–Dec           4                 71
Jan–Mar 19x7      5                 79
Apr–Jun           6                 66
Jul–Sep           7                 67
Oct–Dec           8                 82
Jan–Mar 19x8      9                 84
Apr–Jun           10                69
Jul–Sep           11                72
Oct–Dec           12                87
Jan–Mar 19x9      13                94
The scatter diagram for these data (shown below) has a similar seasonal pattern to the previous example, with high winter values and low summer values, but the size of the variations about the trend line is increasing. A multiplicative component model should be suitable for these data:

Actual values = trend × seasonal variation × error
That is: A = T × S × E

In this example, the trend looks linear but this will become clearer when we have smoothed the series.
Calculations of the seasonal components
Initially, the same procedure is followed as for the additive model. The centred moving average trend values are calculated, but the estimated seasonal components are ratios derived from A/T = S × E. The calculations are shown in the table below.
The seasonal component ratios are derived from the quarterly estimates in a similar way to those for the additive model. Since the seasonal values are ratios and there are 4 seasons, we require the seasonal components to total 4 rather than zero. (If the data comprised 7 daily seasons in each week, the seasonal components would be required to total 7.) If the total is not 4, the values are adjusted as before. The estimates from the last column of Table 5.17 are shown under their corresponding quarter numbers in Table 5.18 below.
[Scatter diagram, CD Plc: quantity sold per 3 months against time (quarters, 19x6-19x9). Increasing seasonal variation as the trend increases indicates a multiplicative model.]
Table 5.17 Calculations of the seasonal components, CD Plc

Date           Quarter   Quantity sold   4-point     Centred 4-point      Seasonal
               no        '000, A         moving      moving average, T    component, A/T
                                         average
Jan-Mar 19x6   1         70
Apr-Jun        2         66
                                         68.00
Jul-Sep        3         65                          69.13                0.940
                                         70.25
Oct-Dec        4         71                          70.25                1.011
                                         70.25
Jan-Mar 19x7   5         79                          70.50                1.121
                                         70.75
Apr-Jun        6         66                          72.13                0.915
                                         73.50
Jul-Sep        7         67                          74.13                0.904
                                         74.75
Oct-Dec        8         82                          75.13                1.092
                                         75.50
Jan-Mar 19x8   9         84                          76.13                1.103
                                         76.75
Apr-Jun        10        69                          77.38                0.892
                                         78.00
Jul-Sep        11        72                          79.25                0.909
                                         80.50
Oct-Dec        12        87
Jan-Mar 19x9   13        94
Table 5.18 Calculations of seasonal components, CD Plc

Year                 Quarter 1    Quarter 2    Quarter 3    Quarter 4
19x6                                           0.940        1.011
19x7                 1.121        0.915        0.904        1.092
19x8                 1.103        0.892        0.909
Total                2.224        1.807        2.753        2.103
Average              2.224 ÷ 2    1.807 ÷ 2    2.753 ÷ 3    2.103 ÷ 2
Estimated seasonal   1.112        0.903        0.918        1.051        sum = 3.984
Adjusted seasonal    1.116        0.907        0.922        1.055        sum = 4
The adjusted value is obtained by multiplying each estimated component ratio by (4/3.984). The seasonal effect on sales in the January-March quarter is estimated to increase sales by 11.6% of the trend value (1.116). Similarly, the effect of the October-December quarter is to raise sales by 5.5% of the trend. For the other two quarters, the seasonal effect is to depress sales below the trend, to 90.7% and 92.2% of the trend value respectively.
Deseasonalise the data and fit the trend line
We have now found estimates of the seasonal components and can deseasonalise the data by calculating A/S = T × E. These estimated trend values are calculated below.
Table 5.19 Calculations of the trend for CD Plc

Date           Quantity sold   Seasonal component   Deseasonalised quantity
               A               ratio, S             '000, A/S = T × E
Jan-Mar 19x6   70              1.116                62.7
Apr-Jun        66              0.907                72.8
Jul-Sep        65              0.922                70.6
Oct-Dec        71              1.055                67.3
Jan-Mar 19x7   79              1.116                70.8
Apr-Jun        66              0.907                72.8
Jul-Sep        67              0.922                72.7
Oct-Dec        82              1.055                77.7
Jan-Mar 19x8   84              1.116                75.2
Apr-Jun        69              0.907                76.1
Jul-Sep        72              0.922                78.2
Oct-Dec        87              1.055                82.4
Jan-Mar 19x9   94              1.116                84.2
The trend values are superimposed on the original scatter diagram:
Diagram 5.12 CD Plc actual and deseasonalised sales per 3 months
The trend which emerges is erratic. The sales values in this time series are not as consistent as those in the first example for Lewplan Plc; CD Plc is probably a more realistic example. We now have to decide how to model the trend. It is not a curve, but looks roughly linear even though the values are erratic, particularly in 19x6. For simplicity, we will assume that the trend is linear and use the least squares method to fit the best line to the data. The trend line, using the same procedure as before, is:

T = 64.6 + 1.36 × quarter number (000 units per 3 months)
We use this equation to estimate the value of the trend sales for each of the periods.
Calculation of the errors: A/(T × S) = E or A - (T × S) = E
We have now calculated the trend and seasonal components. We can use these to find the errors between the observed sales, A, and the sales forecast by the model, T × S. The table below gives the errors both as a proportion, E = A/(T × S), and as absolute values, A - (T × S).
Table 5.20 Errors for CD Plc

Date           Qtr   Qty sold   Seasonal     Trend        T × S   Error      Error
               no    '000, A    component    component,           A/(T×S)    A - (T×S)
                                ratio, S     '000, T
Jan-Mar 19x6   1     70         1.116        66.0         73.7    0.95       -3.7
Apr-Jun        2     66         0.907        67.3         61.0    1.08       +5.0
Jul-Sep        3     65         0.922        68.7         63.3    1.03       +1.7
Oct-Dec        4     71         1.055        70.0         73.9    0.96       -2.9
Jan-Mar 19x7   5     79         1.116        71.4         79.7    0.99       -0.7
Apr-Jun        6     66         0.907        72.8         66.0    1.00       0
Jul-Sep        7     67         0.922        74.1         68.3    0.98       -1.3
Oct-Dec        8     82         1.055        75.5         79.7    1.03       +2.3
Jan-Mar 19x8   9     84         1.116        76.8         85.7    0.98       -1.7
Apr-Jun        10    69         0.907        78.2         70.9    0.97       -1.9
Jul-Sep        11    72         0.922        79.6         73.3    0.98       -1.3
Oct-Dec        12    87         1.055        80.9         85.4    1.02       +1.6
Jan-Mar 19x9   13    94         1.116        82.3         91.9    1.02       +2.1
The errors are high in the first year, as we could have guessed from the plot of the deseasonalised figures. However, from the January-March 19x7 quarter onwards, the errors are all within 2% or 3% of the actual values and the model looks reasonably satisfactory.
Forecasting with the multiplicative component model

Forecasting with either model assumes that we can fit an equation to the trend values. In both of the examples used we have been lucky: the trend has been clearly linear. If the trend had been a curve, we would have had to guess the relationship and use some of the techniques mentioned in the previous chapter for dealing with non-linear relationships. Once we have established the trend equation, the calculation of the forecast is straightforward. The forecast is:

F = T × S

where:

T = 64.6 + 1.36 × quarter number (000 units per 3 months)
and the seasonal component ratios are 1.116 for quarter 1, 0.907 for quarter 2, 0.922 for quarter 3 and 1.055 for quarter 4. The next quarter is April-June 19x9, which is quarter 14 in the series and quarter 2 in the year. The forecast sales are:

F = T × S
  = (64.6 + 1.36 × 14) × 0.907
  = 83.64 × 0.907 = 75.9 (000 units per quarter)
Given the errors for the model, we would hope that this estimate will be within 2% or 3% of the actual value. Similarly, the forecast for October-December 19x9 is found using quarter number 16 and the seasonal component for quarter 4:

F = T × S
  = (64.6 + 1.36 × 16) × 1.055
  = 86.36 × 1.055 = 91.1 (000 units per quarter)
We expect the error on this forecast to be larger than the previous one because it is further into the future.
CHAPTER SUMMARY
A time series is any set of data which is recorded over time. It may, for example, be annual,
quarterly, monthly or weekly data. Models use the historical pattern of the time series to forecast
how the variable will behave in the future. Short-term forecasts will be more accurate than longer-term ones. The further ahead we forecast, the less likely it is that the historical pattern will
remain unchanged, and the larger will be the errors.
There are two basic models. In both cases, it is assumed that the value of the variable is made up of a number of components. The series may contain a trend (general movement in the value of the variable), seasonal variation (short-term periodic fluctuations in the variable values), cyclical variation (long-term periodic fluctuations in the variable values) and an error component (the residual term). Data sets long enough to include the cyclical component were not considered in this text.
The component models are:

Additive: A = T + S + E
Multiplicative: A = T × S × E

In both cases, moving averages are used to deseasonalise the time series. The deseasonalised data are used to set up a model describing the trend, which is then used to forecast future values. Two measures give guidance on how well a model fits the past data. These are:
Mean absolute deviation (MAD) = Σ|Et| / n

Mean square error (MSE) = Σ(Et)² / n
Chapter Quiz
1. A data set in which the independent variable is time is referred to as…………..
2. …………. is the process of estimation in unknown situations.
3. Variance is that part of the actual observation that we cannot explain using the model.
(a) True
(b) False
4. Is the following component model right? A + T = S + E
5. The formula for the mean absolute deviation is ΣEt²/n.
(a) True
(b) False
Answers
1. Time series
2. Forecasting
3. (b) False. This is called the error or residual.
4. Wrong. It is A - T = S + E
5. False. MAD = Σ|Et| / n
Questions from previous exams
December 2000 Question 5
An increasing number of organisations operate sophisticated corporate planning models. Quite
often these models feature integrated models for the production, marketing and financing
subsystems.
While linkage between these models is important, forecasts of the external environment
are also needed and econometric models are frequently used to provide these forecasts.
Required
a) Under production, marketing and finance sub-systems, what information about the
environment might be provided by the econometric models? (9 marks)
b) The monthly electricity bill at the Chez Paul Restaurant over the past 12 months has
been as follows:
Month Amount (Sh.)
December 30,660
January 27,190
February 30,570
March 30,640
April 29,730
May 31,530
June 29,720
July 33,070
August 30,010
September 27,550
October 30,130
November 27,940
201
STUD Y TEX T
Regression, Time Series and Forecasting
Paul is considering using exponential smoothing, with a smoothing index α = 0.5, to forecast future electricity bills.
Required:
Determine next January’s forecast. (11 marks)
(Total: 20 marks)
June 2002 Question 5
The following regression equation was calculated for a class of 24 CPA II students.
Ŷ = 3.1 + 0.021X1 + 0.075X2 + 0.043X3
Standard error   (0.019)   (0.034)   (0.018)
Whereby: Y = standard score on a theory examination
X1 = student's rank (from the bottom) in high school
X2 = student's verbal aptitude score
X3 = a measure of the student's character
Required:
a) Calculate the t ratio and the 95% confidence interval for each regression coefficient.
(8 marks)
b) What assumptions did you make in (a) above? How reasonable are they?
(4 marks)
c) Which regressor gives the strongest evidence of being statistically discernible?
(2 marks)
d) In writing up a final report, should one keep the first regressor in the equation, or drop
it? Why? (6 marks)
(Total 20 marks)
June 2003 Question Five
a) For an additive time series model, what does the term "residual variation" mean?
Describe briefly its main constituents. (4 marks)
b) Explain moving average centering and why it is employed. (4 marks)
c) Given a time series with trend figures already calculated, describe, in words only, the
method for calculating seasonal variation values using the additive model. (4 marks)
d) When projecting a moving average trend, what basis would make the choice of the
following appropriate?
(i) Projecting 'by eye'. (2 marks)
(ii) Using the method of semi-averages. (1 mark)
(iii) Using the average change in trend per period from the range. (1 mark)
e) Explain how the management of an organisation might use seasonal variation figures and
seasonally adjusted data. (4 marks)
(Total: 20 marks)
December 2003 Question 5
a) Explain the purpose of selecting participants from different functional fields to participate
in a Delphi Study. Could the strategy backfire? (5 marks)
b) Business at M and K Sunshine Boutique can be viewed as falling into three distinct
seasons:
(1) Christmas (November – December)
(2) Rainy Season (April – June)
(3) All other times
The average weekly sales in thousands of shillings during each of these seasons in the past four years have been as follows:

Season   1999    2000    2001    2002
1        1,856   1,995   2,241   2,280
2        2,012   2,168   2,306   2,408
3        985     1,072   1,105   1,120
Appropriate software was used to analyse the above data and the following output is provided:
Period   Sales (Sh.)   Moving Average (Sh.)   Seasonal-Irregular Component   Deseasonalised Data (Sh.)
1        1,856         -                      1.156                          1,605.667
2        2,012         1,618                  1.225                          1,642.648
3        985           1,664                  0.586                          1,679.630
4        1,995         1,716                  1.162                          1,716.612
5        2,168         1,745                  1.236                          1,753.594
6        1,072         1,827                  0.599                          1,790.576
7        2,241         1,873                  1.226                          1,827.558
8        2,306         1,884                  1.237                          1,864.539
9        1,105         1,897                  0.581                          1,901.521
10       2,280         1,931                  1.176                          1,938.503
11       2,408         1,936                  1.219                          1,975.485
12       1,120         -                      0.557                          2,012.467
Number of moving periods: 3
Number of periods: 12
Mean Absolute Deviation (MAD) = 493.3333
Trend line of the deseasonalised data: Tt = 1,580.11 + 33.96t
Required:
(i) Explain the above output. (3 marks)
(ii) Determine the seasonal indexes and explain what they mean. (8 marks)
(iii) Determine the short-range forecasts for the 13th, 14th and 15th periods. (4 marks)
(Total: 20 marks)
June 2004 Question 5
a) (i) State the principal components of a time series. (2 marks)
(ii) Explain the difference between multiplicative and additive models as used in time series. (2 marks)
(iii) State the conditions under which each model is used. (2 marks)
b) The table below shows the sales of new cars by quarters during a period of three years.
Year   Quarter 1       Quarter 2       Quarter 3       Quarter 4
       (Sh. million)   (Sh. million)   (Sh. million)   (Sh. million)
2001   55.0            76.5            61.2            77.8
2002   54.4            65.9            52.7            81.4
2003   59.3            83.2            78.5            93.0
Required:
(i) Explain the purpose of the seasonal index. (2 marks)
(ii) Determine the seasonal index for each quarter, assuming an additive model. (12 marks)
(Total: 20 marks)