Chapter 0
INTRODUCTION
0.1 PREFACE
Much research is devoted to the topic of modeling, that is, trying to describe how variables are related. For example, an advertising agency might be interested in modeling the relationship between sales revenue and the amount spent on advertising. A lecturer might want to know whether student performance is related to the number of hours spent on revision and practicing tutorials. Regression analysis is one of the most widely used techniques for analyzing relationships among variables. Its wide appeal and usefulness come from the conceptually logical process of using an equation to express the relationship between a variable of interest (also known as the response or dependent variable, DV) and a set of related predictor variables (also known as explanatory or independent variables, IV).
Regression analysis is also interesting theoretically because of its elegant underlying mathematics and well-developed statistical theory. Successful use of regression requires an appreciation of both the theory and the practical problems that typically arise when the technique is applied to real-world data.
This course assumes that the student has taken an introductory course in statistics and is familiar with hypothesis tests (involving one sample, two samples and $k$ samples), confidence intervals, and the normal, $t$, $\chi^2$ and $F$ distributions. Some knowledge of matrix algebra is also necessary.
volume, as the impression is that the data points generally, but not exactly, fall along a straight line. Figure 1b illustrates this straight-line relationship.
Figure 1a Scatter plot of delivery time, y, against delivery volume, x.
Figure 1b Straight-line relationship between delivery time, y, and delivery volume, x.
If we let y represent delivery time and x represent delivery volume, then the equation of a straight line relating these two variables is
$$y = \beta_0 + \beta_1 x$$
where $\beta_0$ is the intercept and $\beta_1$ is the slope. Now the data points do not fall exactly on a straight line, so the equation above should be modified to account for this. Let the difference between the observed value of $y$ and the straight line $\beta_0 + \beta_1 x$ be an error $\varepsilon$. It is convenient to think of $\varepsilon$ as a statistical error; that is, it is a random variable that accounts for the failure of the model to fit the data
exactly. The error may be made up of the effects of other variables on delivery time, measurement errors, and so forth. Thus, a more plausible model for the delivery time data is
$$y = \beta_0 + \beta_1 x + \varepsilon$$
The equation above is called a linear regression model. Customarily, $x$ is called the independent variable and $y$ the dependent variable. However, this often causes confusion with the concept of statistical independence, so we refer to $x$ as the explanatory, predictor or regressor variable and $y$ as the response variable.
Because the equation above involves only one regressor variable, it is called a
simple linear regression model.
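To make this concrete, here is a minimal sketch of fitting a simple linear regression by least squares. The delivery-time numbers below are invented purely for illustration, and the least-squares formulas used are developed later in the course.

```python
# A minimal sketch: fit a simple linear regression y = b0 + b1*x by
# least squares, using hypothetical delivery data (values invented).
import numpy as np

# Hypothetical observations: delivery volume (cases) and time (minutes)
x = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5], dtype=float)
y = np.array([16.7, 11.5, 12.0, 14.9, 13.8, 18.1, 8.0, 17.8, 79.2, 21.5])

# Least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"Fitted line: y_hat = {b0:.2f} + {b1:.2f} x")
```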
To gain some additional insight into the linear regression model, suppose that we can fix the value of the regressor variable $x$ and observe the corresponding value of the response $y$. Now if $x$ is fixed, the random component $\varepsilon$ on the right-hand side of the equation above determines the properties of $y$. Suppose that the mean and variance of $\varepsilon$ are 0 and $\sigma^2$, respectively. Then the mean response at any value of the regressor variable is
$$E(y \mid x) = \mu_{y|x} = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x$$
Notice that this is the same relationship that we initially wrote down following
inspection of the scatter diagram in Figure 1a. The variance of y given any value of
x is
$$\operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \operatorname{Var}(\varepsilon) = \sigma^2$$
Thus, the true regression model $\mu_{y|x} = \beta_0 + \beta_1 x$ is a line of mean values; that is, the height of the regression line at any value of $x$ is just the expected value of $y$ for that $x$. The slope, $\beta_1$, can be interpreted as the change in the mean of $y$ for a unit change in $x$. Furthermore, the variability of $y$ at a particular value of $x$ is determined by the variance of the error component of the model, $\sigma^2$. This implies that there is a distribution of $y$ values at each $x$ and that the variance of this distribution is the same (homoscedastic) at each $x$.
For example, suppose that the true regression model relating delivery time to delivery volume is $\mu_{y|x} = 3.5 + 2x$, and suppose that the variance is $\sigma^2 = 2$. Figure 1.2 illustrates this situation. Notice that we have used a normal distribution to describe the random variation in $\varepsilon$. Since $y$ is the sum of a constant $\beta_0 + \beta_1 x$ (the mean) and a normally distributed random variable, $y$ is a normally distributed random variable. For example, if $x = 10$ cases, then delivery time $y$ has a normal distribution with mean $3.5 + 2(10) = 23.5$ minutes and variance 2. The variance $\sigma^2$ determines the amount of variability or noise in the observations $y$ on delivery time. When $\sigma^2$ is small, the observed values of delivery time will fall close to the line, and when $\sigma^2$ is large, the observed values of delivery time may deviate considerably from the line.
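As a sketch of this generative view, the following simulation draws observations from the stated true model $\mu_{y|x} = 3.5 + 2x$ with $\sigma^2 = 2$; the particular $x$ values and the number of draws per $x$ are arbitrary choices for illustration.

```python
# Simulating how observations arise in the linear regression model:
# at each fixed x, y = 3.5 + 2*x + e with e ~ N(0, sigma^2 = 2).
import numpy as np

rng = np.random.default_rng(0)
sigma = np.sqrt(2.0)          # error standard deviation (variance = 2)

for x in (10, 20, 30):
    mean = 3.5 + 2 * x        # height of the true regression line at x
    y = mean + rng.normal(0.0, sigma, size=5)  # five observed times
    print(f"x = {x}: E(y|x) = {mean:.1f}, simulated y = {np.round(y, 2)}")
```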
Figure 1.2 How observations are generated in linear regression: at each value of $x$ (here 10, 20 and 30), the observed $y$ is sampled from a $N(\beta_0 + \beta_1 x, \sigma^2 = 2)$ distribution.
Figure 1.3 Linear regression approximation of a complex relationship: a straight-line approximation to the true (nonlinear) relationship between $y$ and $x$.
the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$, not because $y$ is a linear function of the $x$'s. We shall see subsequently that many models in which $y$ is related to the $x$'s in a nonlinear fashion can still be treated as linear regression models as long as the equation is linear in the $\beta$'s.
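For instance, the quadratic model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$ is nonlinear in $x$ but linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$, so it can be fitted by the same least-squares machinery once $x^2$ is treated as an additional regressor.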
problems for the analysis and its interpretation. The following example illustrates
these three methods.
Statistical data come in a variety of types. While some regression methods can be
applied with little or no modification to many different types of data sets, the
special features of some data sets must be accounted for or should be exploited.
A cross-sectional data set consists of a sample of individuals, households, firms,
cities, states, countries, or a variety of other units, taken at a given point in time.
Sometimes, the data on all units do not correspond to precisely the same time
period. Typically, we would ignore any minor timing differences in collecting the
data. For example, if a set of families was surveyed during different weeks of the
same year, we would still view this as a cross-sectional data set. An important feature of cross-sectional data is that we can often assume that they have been obtained by random sampling from the underlying population. Random sampling justifies the assumption that the observations are independent.
A time series data set consists of observations on a variable or several variables
over (a reasonably long period of) time. Since past events can influence future
events and lags in behavior are prevalent in social sciences, economics, finance
and other areas, time is an important dimension in a time series data set. Unlike the
arrangement of cross-sectional data, the chronological ordering of the
observations in a time series conveys potentially important information. A key feature of time series data that makes them more difficult to analyze than cross-sectional data is that time series observations can rarely, if ever, be assumed to be independent across time. Most time series are related, often strongly related, to their recent histories.
For example, consider $y = \beta_1 x_1^2 + \beta_2 x_2 + \beta_3 \log x_3$. The model is linear since $\partial y / \partial \beta_i$, for $i = 1, 2, 3$, are independent of the parameters $\beta_i$.
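As a minimal sketch of why linearity in the parameters is what matters, the model above can be fitted by ordinary least squares after transforming the regressors; the data and parameter values below are invented for illustration.

```python
# Fitting y = b1*x1^2 + b2*x2 + b3*log(x3), which is linear in the
# parameters, by ordinary least squares on transformed regressors.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(1, 5, n)
x2 = rng.uniform(0, 10, n)
x3 = rng.uniform(1, 20, n)

# Hypothetical true parameters (chosen only to generate example data)
y = 2.0 * x1**2 + 0.5 * x2 + 3.0 * np.log(x3) + rng.normal(0, 1, n)

# The design matrix uses the transformed columns x1^2, x2, log(x3)
X = np.column_stack([x1**2, x2, np.log(x3)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated (b1, b2, b3):", np.round(beta_hat, 2))
```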
The true values of $\beta_1$ and $\beta_2$ exist in nature but are unknown to the researcher. Some values of $y$ are recorded by providing different values of $x_1$ and $x_2$. There exists some relationship between $y$ and $x_1, x_2$ which gives rise to systematically behaved data on $y$, $x_1$ and $x_2$. However, this relationship is unknown to the researcher. To determine the model, we move in the backward direction, in the sense that the collected data are used to determine (estimate) the parameters $\beta_1$ and $\beta_2$ of the model.
c) Specification of model
The researcher working on the subject should know, or needs to determine, the form of the model. A general form can be written as follows:
$$y = f(x_1, x_2, \ldots, x_k; \beta_1, \beta_2, \ldots, \beta_k) + \varepsilon$$
where $\varepsilon$ is the random error reflecting the difference between the observed value of $y$ and the value of $y$ obtained (predicted) by the model. The tentative model depends on some unknown parameters. The form of $f(x_1, x_2, \ldots, x_k; \beta_1, \beta_2, \ldots, \beta_k)$ can be linear or nonlinear, depending on the form of the parameters $\beta_1, \beta_2, \ldots, \beta_k$.
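By contrast, a model such as $y = \beta_1 e^{\beta_2 x} + \varepsilon$ is nonlinear in the parameters (here $\partial y / \partial \beta_1 = e^{\beta_2 x}$ depends on $\beta_2$) and cannot be handled by linear regression methods without, for example, a linearizing transformation.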
Once the parameters have been estimated, the fitted model is
$$\hat{y} = f(x_1, x_2, \ldots, x_k; \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k)$$
When the value of $y$ is obtained for given values of the $x$'s, it is denoted $\hat{y}$ and is called the fitted value.
The fitted equation is often used for prediction, in which case $\hat{y}$ is termed the predicted value. Note that a fitted value arises when the values used for the explanatory variables correspond to one of the observations in the data, whereas a predicted value is obtained for any set of values of the explanatory variables. It is generally not recommended to predict $y$-values for values of the explanatory variables that lie outside the range of the data. When the values of the explanatory variables are future values, the predicted values are called forecasted values.
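A minimal sketch of the fitted-versus-predicted distinction, using invented numbers: fitted values are computed at the observed $x$'s, while a predicted value may use any new $x$, preferably inside the range of the data.

```python
# Fitted values (at the observed x's) versus a predicted value (at a
# new x), for a simple linear regression fitted by least squares.
import numpy as np

x = np.array([2., 4., 5., 7., 9.])
y = np.array([9.1, 14.8, 18.0, 22.3, 28.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_fitted = b0 + b1 * x        # fitted values: x's are the observed ones
x_new = 6.0                   # a new x inside the observed range [2, 9]
y_pred = b0 + b1 * x_new      # predicted value at x_new

print("fitted:", np.round(y_fitted, 2))
print(f"predicted at x = {x_new}: {y_pred:.2f}")
# Predicting far outside [2, 9] (e.g. x = 50) is not recommended.
```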
e) Model validation/diagnostic
The validity of the statistical methods used for regression analysis depends on various assumptions. These assumptions are essentially assumptions about the model and the data. The quality of statistical inferences depends heavily on whether these assumptions are satisfied. Ensuring that the assumptions hold requires care from the beginning of the experiment. One has to be careful in choosing the required assumptions and in determining whether they are valid under the given conditions. It is also important to know the situations in which the assumptions may not be met.
The validity of the assumptions must be checked before drawing any statistical conclusions. Any departure from the assumptions will be reflected in the statistical inferences. In fact, regression analysis is an iterative process in which the outputs are used to diagnose, validate, criticize and modify the inputs and the model.
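As a small illustration of this diagnostic loop (with invented data), one can inspect the residuals $e_i = y_i - \hat{y}_i$ of a fitted model; systematic patterns in the residuals signal that an assumption is violated and the model should be modified.

```python
# A minimal residual check: after fitting, examine e = y - y_hat.
# Systematic patterns or non-constant spread suggest violated
# assumptions (e.g., nonlinearity or heteroscedasticity).
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

print("residuals:", np.round(resid, 2))
print("mean ~ 0:", round(resid.mean(), 4))  # least squares forces ~0
# In practice one would also plot resid against x and against y_hat.
```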
Example:
Suppose an analyst wishes to model an executive's annual compensation (bonus). The data below were collected from one branch of the company.
                                                         Executive
Variable                                         1       2       3       4       5
Compensation, y (RM)                         85420   61333  107500   59225   98400
Experience, x1 (years)                           8       2       7       3      11
College education, x2 (years)                    4       8       6       7       2
Employees supervised, x3                        13       6      24       9       4
Corporate assets, x4 (RM million)              1.6    0.25    3.14    0.10    2.22
Age, x5 (years)                                 42      30      53      36      51
Board of directors, x6 (1 = yes, 0 = no)         0       0       1       0       1
International responsibility, x7                 1       0       1       0       0
  (1 = yes, 0 = no)
How large does the sample size $n$ need to be? Recall that regression analysis aims to estimate the mean response, $E(y)$. The mean response is related to a set of independent variables through the parameters $\beta$ of the specified model. The sample size should be large enough that the $\beta$'s are both estimable and testable. This will not occur unless the sample size $n$ is at least as large as the number of parameters in the model. To ensure a sufficiently large sample, a good rule of thumb is to select $n$ greater than or equal to 10 times the number of parameters in the model.
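Applied to the compensation example: a model with an intercept and the seven predictors $x_1, \ldots, x_7$ has eight parameters ($\beta_0, \beta_1, \ldots, \beta_7$), so the rule of thumb calls for $n \geq 10 \times 8 = 80$ executives. The five records shown above are far too few, since $n$ must be at least 8 even for the parameters to be estimable.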