
CHAPTER 0

INTRODUCTION

0.1 PREFACE
Much research is devoted to the topic of modeling, that is, trying to describe how
variables are related. For example, an advertising agency might be interested in
modeling the relationship between sales revenue and the amount spent on
advertising. A lecturer might want to know whether student performance is related to the
number of hours spent on revision and practicing tutorials. Regression analysis is
one of the most widely used techniques for analyzing relationships among
variables. Its wide appeal and usefulness come from the conceptually logical process
of using an equation to express the relationship between a variable of interest (also
known as the response or dependent variable, DV) and a set of related predictor
variables (also known as explanatory or independent variables, IV).
Regression analysis is also interesting theoretically due to its elegant underlying
mathematics and a well-developed statistical theory. Successful use of regression
requires an appreciation of both the theory and the practical problems that typically
arise when the technique is applied to real-world data.
This course assumes that the student has taken an introductory course in statistics
and is familiar with hypothesis tests (involving 1 sample, 2 samples and k
samples) and confidence intervals, and with the normal, $t$, $\chi^2$ and $F$ distributions. Some
knowledge of matrix algebra is also necessary.

0.2 REGRESSION AND MODEL BUILDING


Regression analysis is a statistical technique for investigating and modeling the
relationship between variables. Applications of regression are numerous and
occur in almost every field, including engineering, the physical and chemical
sciences, economics, management, life and biological sciences, and the social
sciences. In fact, regression analysis may be the most widely used statistical
modeling technique.
As an example of a problem in which regression analysis may be helpful, suppose
that an industrial engineer employed by a soft drink beverage bottler is analyzing
the product delivery and service operations for vending machines. He suspects that
the time required by a route deliveryman to load and service a machine is related to
the number of cases of product delivered. The engineer visits 25 randomly chosen
retail outlets having vending machines, and the in-outlet delivery time (in minutes)
and the volume of product delivered (in cases) are observed for each. The 25
observations are plotted in Figure 1a. This graph is called a scatter plot/diagram.
This display clearly suggests a relationship between delivery time and delivery
volume: the impression is that the data points generally, but not exactly, fall along
a straight line. Figure 1b illustrates this straight-line relationship.
[Scatter plot of delivery time, y, against delivery volume, x]
Figure 1a Scatter plot/diagram for delivery volume
[Straight line through the plot of delivery time, y, against delivery volume, x]
Figure 1b Straight-line relationship between delivery time and delivery volume

If we let $y$ represent delivery time and $x$ represent delivery volume, then the
equation of a straight line relating these two variables is

$$y = \beta_0 + \beta_1 x$$

where $\beta_0$ is the intercept and $\beta_1$ is the slope. Now the data points do not fall
exactly on a straight line, so the equation above should be modified to account for
this. Let the difference between the observed value of $y$ and the straight line
$\beta_0 + \beta_1 x$ be an error $\varepsilon$. It is convenient to think of $\varepsilon$ as a statistical error; that is, it
is a random variable that accounts for the failure of the model to fit the data
exactly. The error may be made up of the effects of other variables on delivery
time, measurement errors, and so forth. Thus, a more plausible model for the
delivery time data is

$$y = \beta_0 + \beta_1 x + \varepsilon$$
The equation above is called a linear regression model. Customarily $x$ is called the
independent variable and $y$ is called the dependent variable. However, this often
causes confusion with the concept of statistical independence, so we refer to $x$ as
the explanatory, predictor or regressor variable and $y$ as the response variable.
Because the equation above involves only one regressor variable, it is called a
simple linear regression model.
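As a quick illustration (a minimal sketch, not part of the original example: the data below are simulated, not the 25 delivery observations described above), a simple linear regression can be fitted in Python with NumPy:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical delivery data: x = cases delivered, y = delivery time (minutes).
    # The line and noise level here are invented for illustration only.
    x = rng.uniform(2, 30, size=25)
    y = 3.5 + 2.0 * x + rng.normal(0.0, 3.0, size=25)

    # Fit y = b0 + b1*x by least squares; polyfit returns [slope, intercept].
    b1, b0 = np.polyfit(x, y, deg=1)
    print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")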
To gain some additional insight into the linear regression model, suppose that we
can fix the value of the regressor variable $x$ and observe the corresponding value of
the response $y$. Now if $x$ is fixed, the random component $\varepsilon$ on the right-hand side
of the equation determines the properties of $y$. Suppose that the mean and
variance of $\varepsilon$ are 0 and $\sigma^2$, respectively. Then the mean response at any value of
the regressor variable is

$$E(y \mid x) = \mu_{y|x} = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x$$

Notice that this is the same relationship that we initially wrote down following
inspection of the scatter diagram in Figure 1a. The variance of $y$ given any value of
$x$ is

$$\operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \operatorname{Var}(\varepsilon) = \sigma^2$$

Thus, the true regression model $\mu_{y|x} = \beta_0 + \beta_1 x$ is a line of mean values; that is, the
height of the regression line at any value of $x$ is just the expected value of $y$ for that
$x$. The slope, $\beta_1$, can be interpreted as the change in the mean of $y$ for a unit
change in $x$. Furthermore, the variability of $y$ at a particular value of $x$ is
determined by the variance of the error component of the model, $\sigma^2$. This implies
that there is a distribution of $y$ values at each $x$ and that the variance of this
distribution is the same (homoscedastic) at each $x$.
For example, suppose that the true regression model relating delivery time to
delivery volume is $\mu_{y|x} = 3.5 + 2x$, and suppose that the variance is $\sigma^2 = 2$. Figure
1.2 illustrates this situation. Notice that we have used a normal distribution to
describe the random variation in $\varepsilon$. Since $y$ is the sum of a constant $\beta_0 + \beta_1 x$ (the
mean) and a normally distributed random variable, $y$ is a normally distributed
random variable. For example, if $x = 10$ cases, then delivery time $y$ has a normal
distribution with mean $3.5 + 2(10) = 23.5$ minutes and variance 2. The variance $\sigma^2$
determines the amount of variability or noise in the observations $y$ on delivery
time. When $\sigma^2$ is small, the observed values of delivery time will fall close to the
line, and when $\sigma^2$ is large, the observed values of delivery time may deviate
considerably from the line.

[At each of x = 10, 20, 30, the observed values of y are sampled from a normal distribution $N(\beta_0 + \beta_1 x,\ \sigma^2 = 2)$ centered on the regression line]
Figure 1.2 How observations are generated in linear regression
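To make Figure 1.2 concrete, the following sketch (assuming the true model stated above) simulates how observed delivery times are generated at x = 10 cases:

    import numpy as np

    rng = np.random.default_rng(1)

    # True model from the text: E(y|x) = 3.5 + 2x, Var(y|x) = 2.
    beta0, beta1, sigma2 = 3.5, 2.0, 2.0

    x = 10.0                            # 10 cases delivered
    mean_y = beta0 + beta1 * x          # 23.5 minutes
    y_obs = rng.normal(mean_y, np.sqrt(sigma2), size=5)

    print(f"E(y|x=10) = {mean_y}")      # 23.5
    print("simulated delivery times:", np.round(y_obs, 2))

Each run draws delivery times scattered around 23.5 minutes with variance 2, which is exactly the mechanism the figure depicts.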

In almost all applications of regression, the regression equation is only an
approximation to the true functional relationship between the variables of interest.
These functional relationships are often based on physical, chemical, or other
engineering or scientific theory, that is, knowledge of the underlying mechanism.
Consequently, these types of models are often called mechanistic models.
Regression models, on the other hand, are thought of as empirical models. Figure
1.3 illustrates a situation where the true relationship between y and x is relatively
complex, yet it may be approximated quite well by a linear regression equation.

[A curved true relationship between y and x, overlaid with its straight-line approximation]
Figure 1.3 Linear regression approximation of a complex relationship

In general, the response variable $y$ may be related to $k$ regressors, $x_1, x_2, \ldots, x_k$, so
that

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

This is called a multiple linear regression model because more than one regressor
is involved. The adjective linear is employed to indicate that the model is linear in
the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$, not because $y$ is a linear function of the $x$'s. We
shall see subsequently that many models in which $y$ is related to the $x$'s in a
nonlinear fashion can still be treated as linear regression models as long as the
equation is linear in the $\beta$'s.
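For instance, a quadratic model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$ is nonlinear in $x$ but linear in the $\beta$'s, so it can still be fitted by linear least squares. A minimal sketch (with invented data) in Python:

    import numpy as np

    rng = np.random.default_rng(2)

    # Invented data from a curved relationship.
    x = np.linspace(0, 4, 40)
    y = 1.0 + 0.5 * x - 0.3 * x**2 + rng.normal(0, 0.2, size=x.size)

    # y = b0 + b1*x + b2*x^2 is nonlinear in x but linear in the betas,
    # so it is still a (multiple) linear regression model.
    X = np.column_stack([np.ones_like(x), x, x**2])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimated betas:", np.round(beta_hat, 3))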

An important objective of regression analysis is to estimate the unknown
parameters in the regression model. This process is also called fitting the model to
the data. Several parameter estimation techniques exist; one of these
is the method of least squares (introduced in Chapter 2).
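As a preview of Chapter 2: in matrix form, and assuming the design matrix $X$ has full column rank, the least squares estimator solves the normal equations $(X^\top X)\hat{\beta} = X^\top y$. A minimal sketch (one of several equivalent ways to compute it):

    import numpy as np

    def ols(X, y):
        # Ordinary least squares via the normal equations (X'X) beta = X'y.
        # Assumes X already contains a column of ones and has full column rank.
        return np.linalg.solve(X.T @ X, X.T @ y)

In practice, routines such as np.linalg.lstsq (which uses an SVD-based solver) are numerically preferable to forming $X^\top X$ explicitly.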
The next phase of a regression analysis is called model adequacy checking, in
which the appropriateness of the model is studied and the quality of the fit
ascertained. Through such analyses the usefulness of the regression model may be
determined. The outcome of adequacy checking may indicate either that the model
is reasonable or that the original fit must be modified. Thus, regression analysis is
an iterative procedure, in which data lead to a model and a fit of the model to the
data is produced. The quality of the fit is then investigated, leading either to
modification of the model or the fit or to adoption of the model. This process will
be illustrated several times in subsequent chapters.
A regression model does not imply a cause-and-effect relationship between the
variables. Even though a strong empirical relationship may exist between two or
more variables, this cannot be considered evidence that the regressor variables and
the response are related in a cause-and-effect manner. To establish causality, the
relationship between the regressors and the response must have a basis outside the
sample data; for example, the relationship may be suggested by theoretical
considerations. Regression analysis can be used to aid in confirming a cause-and-effect
relationship, but it cannot be the sole basis of such a claim.
Finally, it is important to remember that regression analysis is part of a broader
data-analytical approach to problem solving. That is, the regression equation itself
may not be the primary objective of the study. It is usually more important to gain
insight and understanding concerning the system generating the data.

0.3 DATA TYPE/COLLECTION


An essential aspect of regression analysis is data collection. Any regression
analysis is only as good as the data that it is based on. Three basic methods for
collecting data are
• A retrospective study based on historical data
• An observational study
• A designed experiment
A good data collection scheme can ensure a simplified analysis and a generally
more applicable model. A poor data collection scheme can induce serious
problems for the analysis and its interpretation. The following example illustrates
these three methods.
Statistical data come in a variety of types. While some regression methods can be
applied with little or no modification to many different types of data sets, the
special features of some data sets must be accounted for or should be exploited.
A cross-sectional data set consists of a sample of individuals, households, firms,
cities, states, countries, or a variety of other units, taken at a given point in time.
Sometimes, the data on all units do not correspond to precisely the same time
period. Typically, we would ignore any minor timing differences in collecting the
data. For example, if a set of families was surveyed during different weeks of the
same year, we would still view this as a cross-sectional data set. An important
feature of cross-sectional data is that we can often assume that they have been
obtained by random sampling from the underlying population. Random
sampling, in turn, justifies the assumption that the observations are independent.
A time series data set consists of observations on a variable or several variables
over (a reasonably long period of) time. Since past events can influence future
events and lags in behavior are prevalent in social sciences, economics, finance
and other areas, time is an important dimension in a time series data set. Unlike the
arrangement of cross-sectional data, the chronological ordering of the
observations in a time series conveys potentially important information. A key
feature of time series data that makes them more difficult to analyze than cross-
sectional data is that time series observations can rarely, if ever, be
assumed to be independent across time. Most time series are related, often strongly
related, to their recent histories.

0.4 LINEARITY OF REGRESSION MODEL


Linear regression models provide a rich and flexible framework that suits the needs
of many researchers. However, linear regression models are not appropriate for all
situations, as there are many problems in engineering, the sciences (particularly
physics) and some medical areas where the response variable and the explanatory
variables are related through a non-linear function.
A model or relationship is termed linear if it is linear in the parameters and
nonlinear if it is not linear in the parameters. In other words, if all the partial
derivatives of $y$ with respect to each of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are
independent of the parameters, then the model is called a linear model. If any of
the partial derivatives of $y$ with respect to any of $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ is not
independent of the parameters, the model is called nonlinear. Note that the
linearity or non-linearity of the model is not determined by the linearity or
nonlinearity of the explanatory variables in the model.

For example, consider $y = \beta_1 x_1^2 + \beta_2 x_2 + \beta_3 \log x_3 + \varepsilon$. The model is linear since
$\partial y / \partial \beta_i$ for $i = 1, 2, 3$ are independent of the parameters $\beta_i$.

On the other hand, $y = \beta_1^2 x_1 + \sin(\beta_2 x_2) + \beta_3 \log x_3 + \varepsilon$ is a non-linear model since
$\partial y / \partial \beta_1 = 2\beta_1 x_1$ depends on $\beta_1$ and $\partial y / \partial \beta_2 = x_2 \cos(\beta_2 x_2)$ depends on $\beta_2$, although
$\partial y / \partial \beta_3 = \log x_3$ is independent of $\beta_3$.

Sometimes it is useful to apply a transformation to induce linearity in the model
function. A non-linear model that can be transformed to an equivalent linear form
is said to be intrinsically linear. This occurs if the error structure can be assumed to
be multiplicative. For example, for
$y = \beta_1 e^{\beta_2 x} \varepsilon$, taking logarithms produces $\ln y = \ln \beta_1 + \beta_2 x + \ln \varepsilon$.

0.5 ESTIMATION OF REGRESSION MODEL


The statistical linear regression problem essentially consists of developing
approaches and tools to determine the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ in the linear
model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

given the observations on $y$ and $x_1, x_2, \ldots, x_k$.

Different statistical estimation procedures, e.g., the method of maximum likelihood,
the principle of least squares and the method of moments, can be employed to estimate the
parameters of the model. The method of maximum likelihood needs further
knowledge of the distribution of $y$, whereas the method of moments and the principle
of least squares do not need any knowledge about the distribution of $y$.
Regression analysis is a tool to determine the values of the parameters given
the data on $y$ and $x_1, x_2, \ldots, x_k$. The literal meaning of regression is "to move in
the backward direction". Before discussing and understanding the meaning of
"backward direction", let us find out which of the following statements is correct:
Statement 1: the model generates the data, or Statement 2: the data generate the model.
Most people would think Statement 2 is correct. You are wrong!!!
In regression analysis we broadly take the view that the model exists in nature but is
unknown to the researcher. When some values of the explanatory variables are
provided, the values of the output or study variable are generated
accordingly, depending on the form of the function $f$ and the nature of the
phenomenon. So, ideally, the pre-existing model gives rise to the data. Our
objective is to determine the functional form of this model.
Now, we move in the backward direction. We propose to first collect the data on
the study variable and the possible explanatory variables. Then we employ some statistical
techniques and use these data to learn the form of the function $f$. Equivalently, the
data from the model are recorded first and then used to determine the parameters of
the model. Regression analysis is thus a technique which helps in determining the
statistical model by using the collected data.
Consider a simple example to understand the meaning of "regression". Suppose the
yield of a crop, $y$, depends linearly on two explanatory variables, the quality of
fertilizer, $x_1$, and the level of irrigation, $x_2$, through the function

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

The true values of $\beta_1$ and $\beta_2$ exist in nature but are unknown to the
researcher. Some values of $y$ are recorded by providing different values of $x_1$ and
$x_2$. There exists some relationship between $y$ and $(x_1, x_2)$ which gives rise to
systematically behaved data on $y$, $x_1$ and $x_2$. However, this relationship is
unknown to the researcher. To determine the model, we move in the backward
direction in the sense that the collected data are used to determine (estimate) the
parameters, $\beta_1$ and $\beta_2$, of the model.
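The "backward" idea can be made concrete with a small simulation (all numbers here are invented for illustration): nature's model generates the data, and least squares recovers the parameters from those data:

    import numpy as np

    rng = np.random.default_rng(4)

    # Nature's model (unknown to the researcher): y = b0 + b1*x1 + b2*x2 + eps
    b0, b1, b2 = 10.0, 3.0, 1.5                # true but unknown parameters
    x1 = rng.uniform(0, 5, size=60)            # quality of fertilizer
    x2 = rng.uniform(0, 10, size=60)           # level of irrigation
    y = b0 + b1 * x1 + b2 * x2 + rng.normal(0, 1, size=60)

    # Moving backward: use the recorded data to estimate the parameters.
    X = np.column_stack([np.ones_like(x1), x1, x2])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimates of (b0, b1, b2):", np.round(beta_hat, 2))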

0.6 PROCEDURES OF REGRESSION ANALYSIS


Regression analysis includes the following steps.
a) Statement of the problem under consideration
The first important step in conducting any regression analysis is to specify the
problem and the objectives to be addressed by the regression analysis. A wrong
formulation or a wrong understanding of the problem will lead to wrong
statistical inferences. The choice of variables depends upon the objectives of the study
and the understanding of the problem. For example, if the height and weight of children
are related, there can be two issues to be addressed:
i) determination of height for a given weight, or
ii) determination of weight for a given height
In case (i), height is the response variable, whereas weight is the response
variable in case (ii). As such, the roles of the explanatory variable and the response
variable are interchanged between case (i) and case (ii).

b) Choice and collection of relevant variables


Once the problem is carefully formulated and the objectives have been decided, the
next task is to choose the relevant variables. It has to be kept in mind that the
correct choice of variables not only determines correct statistical inferences, but
also ensures the adequacy of the model and identifies the important factors
influencing the response variable.
Once the objective of the study is clearly stated and the variables are chosen, the
next task is to collect the data on the relevant variables. The data are essentially
the measurements on these variables. It is important to know how the data are
recorded. Moreover, it is also important to decide whether the data are to be collected
as quantitative variables or qualitative variables, since the regression methods and
approaches for quantitative and qualitative variables are different.

c) Specification of model
The researcher working on the subject should know or needs to determine the form
of the model. A general form can be written as

$$y = f(x_1, x_2, \ldots, x_k;\ \beta_1, \beta_2, \ldots, \beta_k) + \varepsilon$$

where $\varepsilon$ is the random error reflecting the difference between the observed value
of $y$ and the value of $y$ obtained (predicted) by the model. The tentative model
depends on some unknown parameters. The form of
$f(x_1, x_2, \ldots, x_k;\ \beta_1, \beta_2, \ldots, \beta_k)$ can be linear or non-linear, depending on how
it involves the parameters $\beta_1, \beta_2, \ldots, \beta_k$.

d) Fitting the model

After the model has been defined and the data have been collected, the next task is
to estimate the parameters of the model based on the collected data. This is also
referred to as parameter estimation or model fitting. The most commonly used
method of estimation is the least squares method. Under certain
assumptions, the least squares method produces estimators with desirable
properties. Other estimation methods include the maximum likelihood method, the
ridge method, the principal components method, etc.
The estimation of the unknown parameters using an appropriate method provides the
values of the model parameters. Substituting these values in the model/equation
gives us a usable fitted model, generally written as

$$\hat{y} = f(x_1, x_2, \ldots, x_k;\ \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k)$$

When the value of $y$ is obtained for given values of $x$, it is denoted $\hat{y}$ and is
called the fitted value.

The fitted equation is often used for prediction, in which case $\hat{y}$ is termed the
predicted value. Note that a fitted value is one where the values used for the explanatory
variables correspond to one of the observations in the data, whereas a predicted value
is one obtained for any set of values of the explanatory variables. It is not generally
recommended to predict $y$-values for values of the explanatory
variables which lie outside the range of the data. When the values of the explanatory
variables are future values of the explanatory variables, the predicted values are
called forecasted values.
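A short sketch (with invented data) of the distinction between fitted values, predicted values and ill-advised extrapolation:

    import numpy as np

    rng = np.random.default_rng(5)

    # Invented training data observed for x between 0 and 10.
    x = rng.uniform(0, 10, size=30)
    y = 5.0 + 1.2 * x + rng.normal(0, 1, size=30)
    b1_hat, b0_hat = np.polyfit(x, y, deg=1)

    fitted = b0_hat + b1_hat * x        # y-hat at the observed x's: fitted values
    y_pred = b0_hat + b1_hat * 7.3      # y-hat at a new x inside [0, 10]: a predicted value
    y_extrap = b0_hat + b1_hat * 50.0   # x = 50 lies far outside the data range

    print(f"prediction at x = 7.3: {y_pred:.2f}")
    print(f"extrapolation at x = 50: {y_extrap:.2f} (not recommended)")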

e) Model validation/diagnostic
The validity of the statistical methods used for regression analysis depends on
various assumptions, which become essentially the assumptions for
the model and the data. The quality of statistical inferences heavily depends on
whether these assumptions are satisfied. To keep these assumptions
valid, care is needed from the beginning of the experiment. One
has to be careful in choosing the required assumptions and in determining whether the
assumptions hold under the given conditions. It is also important to know the
situations in which the assumptions may not be met.
The validation of the assumptions must be made before drawing any statistical
conclusions. Any departure from the validity of the assumptions will be reflected in the
statistical inferences. In fact, regression analysis is an iterative process in which
the outputs are used to diagnose, validate, criticize and modify the inputs/model.

Example:
Suppose an analyst wishes to model executives' annual compensation (bonus).
Below are data collected from one branch of the company.

Variable                                              Exec 1   Exec 2   Exec 3   Exec 4   Exec 5
Compensation, y (RM)                                   85420    61333   107500    59225    98400
Experience, x1 (years)                                     8        2        7        3       11
College education, x2 (years)                              4        8        6        7        2
Employees supervised, x3                                  13        6       24        9        4
Corporate assets, x4 (RM million)                        1.6     0.25     3.14     0.10     2.22
Age, x5 (years)                                           42       30       53       36       51
Board of directors, x6 (1 if yes, 0 if no)                 0        0        1        0        1
International responsibility, x7 (1 if yes, 0 if no)       1        0        1        0        0

How large does the sample size, $n$, need to be? Recall that regression analysis aims to
estimate the mean response, $E(y)$. The mean response is related to a set of
independent variables through the parameters, $\beta$, of the specified model. The
sample size should be large enough so that the $\beta$'s are both estimable and testable.
This will not occur unless the sample size, $n$, is at least as large as the number of
parameters in the model. To ensure a sufficiently large sample, a good rule of
thumb is to select $n$ greater than or equal to 10 times the number of parameters in
the model.
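For the compensation model above, for example, a model with all seven regressors plus an intercept has $7 + 1 = 8$ parameters, so the rule of thumb suggests $n \ge 10 \times 8 = 80$ executives; the five observations in the illustrative table are far too few to fit such a model.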
