CHAPTER 1 - INTRODUCTION - Introduction To Linear Regression Analysis, 5th Edition
CHAPTER 1
INTRODUCTION
If we let y represent delivery time and x represent delivery volume, then the
equation of a straight line relating these two variables is
$$y = \beta_0 + \beta_1 x \tag{1.1}$$
where β0 is the intercept and β1 is the slope. Now the data points do not fall
exactly on a straight line, so Eq. (1.1) should be modified to account for this. Let
the difference between the observed value of y and the straight line (β0 + β1x) be
an error ε. It is convenient to think of ε as a statistical error; that is, it is a
random variable that accounts for the failure of the model to fit the data exactly.
The error may be made up of the effects of other variables on delivery time,
measurement errors, and so forth. Thus, a more plausible model for the delivery
time data is
$$y = \beta_0 + \beta_1 x + \varepsilon \tag{1.2}$$
Figure 1.1 (a) Scatter diagram for delivery volume. (b) Straight-line
relationship between delivery time and delivery volume.
To gain some additional insight into the linear regression model, suppose that we
can fix the value of the regressor variable x and observe the corresponding value
of the response y. Now if x is fixed, the random component ε on the right-hand
side of Eq. (1.2) determines the properties of y. Suppose that the mean and
variance of ε are 0 and σ², respectively. Then the mean response at any value of
the regressor variable is
$$E(y \mid x) = \mu_{y|x} = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x$$
Notice that this is the same relationship that we initially wrote down following
inspection of the scatter diagram in Figure 1.1a. The variance of y given any
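value of x is
$$\operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \sigma^2$$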
Thus, the true regression model μy|x = β0 + β1x is a line of mean values, that is, the
height of the regression line at any value of x is just the expected value of y for
that x. The slope, β1, can be interpreted as the change in the mean of y for a unit
change in x. Furthermore, the variability of y at a particular value of x is
determined by the variance of the error component of the model, σ². This implies
that there is a distribution of y values at each x and that the variance of this
distribution is the same at each x.
Figure 1.2 How observations are generated in linear regression.
For example, suppose that the true regression model relating delivery time to
delivery volume is μy|x = 3.5 + 2x, and suppose that the variance is σ² = 2. Figure
1.2 illustrates this situation. Notice that we have used a normal distribution to
describe the random variation in ε. Since y is the sum of a constant β0 + β1x (the
mean) and a normally distributed random variable, y is a normally distributed
random variable. For example, if x = 10 cases, then delivery time y has a normal
distribution with mean 3.5 + 2(10) = 23.5 minutes and variance 2. The variance
σ² determines the amount of variability or noise in the observations y on delivery
time. When σ² is small, the observed values of delivery time will fall close to the
line, and when σ² is large, the observed values of delivery time may deviate
considerably from the line.
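The way observations arise under this model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficients 3.5 and 2 and the variance 2 come from the example above, while the function names, the sample size, and the random seed are hypothetical choices made here.

```python
import random
import statistics

# Values taken from the example in the text.
BETA0, BETA1, SIGMA2 = 3.5, 2.0, 2.0

def mean_response(x):
    """Mean of y at a fixed x: beta0 + beta1 * x."""
    return BETA0 + BETA1 * x

def simulate_delivery_times(x, n, seed=0):
    """Draw n values of y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    sd = SIGMA2 ** 0.5                 # gauss() expects a standard deviation
    return [mean_response(x) + rng.gauss(0.0, sd) for _ in range(n)]

ys = simulate_delivery_times(x=10, n=20000)
print("mean response at x = 10:", mean_response(10))
print("sample mean of simulated y:", round(statistics.fmean(ys), 2))
print("sample variance of simulated y:", round(statistics.variance(ys), 2))
```

With a large simulated sample, the sample mean should sit near 23.5 and the sample variance near 2, which is the sense in which the line gives the mean of y and σ² gives the scatter around it.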
Generally, regression equations are valid only over the region of the regressor
variables contained in the observed data. For example, consider Figure 1.5.
Suppose that data on y and x were collected in the interval x1 ≤ x ≤ x2. Over this
interval the linear regression equation shown in Figure 1.5 is a good
approximation of the true relationship. However, suppose this equation were used
to predict values of y for values of the regressor variable in the region x2 ≤ x ≤ x3.
Clearly the linear regression model is not going to perform well over this range
of x because of model error or equation error.
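The danger of extrapolation can be illustrated with a small sketch. The curved "true" relationship below, the interval from 1 to 4, and the extrapolation point x = 9 are all hypothetical choices made for this illustration; they are not the relationship shown in Figure 1.5. The line is fitted by ordinary least squares for a single regressor.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def true_mean(x):
    """Hypothetical curved relationship, chosen only for illustration."""
    return 4.0 * x ** 0.5

# Data are collected only on the interval from x1 = 1 to x2 = 4.
xs = [1.0 + 0.5 * i for i in range(7)]
ys = [true_mean(x) for x in xs]
b0, b1 = fit_line(xs, ys)

def line(x):
    """Fitted straight line."""
    return b0 + b1 * x

worst_inside = max(abs(line(x) - true_mean(x)) for x in xs)
error_outside = abs(line(9.0) - true_mean(9.0))   # x = 9 lies beyond x2
print(f"largest error inside the data range: {worst_inside:.2f}")
print(f"error when extrapolating to x = 9:   {error_outside:.2f}")
```

Inside the sampled interval the straight line tracks the curve closely, but the error grows quickly once x leaves that interval, even though no noise was added to the data.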
In general, the response variable y may be related to k regressors, x1, x2,…, xk, so
that
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
This is called a multiple linear regression model because more than one
regressor is involved. The adjective linear is employed to indicate that the model
is linear in the parameters β0, β1,…,βk, not because y is a linear function of the x's.
We shall see subsequently that many models in which y is related to the x's in a
nonlinear fashion can still be treated as linear regression models as long as the
equation is linear in the β's.
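What "linear in the β's" means can be checked numerically. The sketch below is an illustration only; both model forms and all names are assumptions chosen here, not models from this chapter. A model that is quadratic in x still satisfies superposition in its parameters, whereas an exponential model does not.

```python
import math

def quadratic_model(beta, x):
    """beta0 + beta1*x + beta2*x**2: curved in x, but linear in the betas."""
    b0, b1, b2 = beta
    return b0 + b1 * x + b2 * x ** 2

def exponential_model(beta, x):
    """beta0 * exp(beta1 * x): not linear in the betas."""
    b0, b1 = beta
    return b0 * math.exp(b1 * x)

def combine(a, beta, b, gamma):
    """Parameter vector a*beta + b*gamma."""
    return tuple(a * p + b * q for p, q in zip(beta, gamma))

def is_linear_in_parameters(model, beta, gamma, a, b, x):
    """Superposition check: model(a*beta + b*gamma) == a*model(beta) + b*model(gamma)."""
    lhs = model(combine(a, beta, b, gamma), x)
    rhs = a * model(beta, x) + b * model(gamma, x)
    return math.isclose(lhs, rhs)

print(is_linear_in_parameters(quadratic_model, (1, 2, 3), (4, -1, 2), 2, 3, x=5))
print(is_linear_in_parameters(exponential_model, (1, 1), (1, 1), 1, 1, x=1))
```

The first check passes and the second fails, so the quadratic model can be treated as a linear regression model with regressors x and x², while the exponential form cannot be handled that way without transformation.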
An important objective of regression analysis is to estimate the unknown
parameters in the regression model. This process is also called fitting the model
to the data. We study several parameter estimation techniques in this book. One
of these techniques is the method of least squares (introduced in Chapter 2). For
example, the least-squares fit to the delivery time data is
A retrospective study based on historical data
An observational study
A designed experiment
A good data collection scheme can ensure a simplified and a generally more
applicable model. A poor data collection scheme can result in serious problems
for the analysis and its interpretation. The following example illustrates these
three methods.
Example 1.1
Consider the acetone–butyl alcohol distillation column shown in Figure 1.6. The
nominal reflux rate is supposed to be constant for this process. Only
infrequently does production change this rate. We now discuss how the three
different data collection strategies listed above could be applied to this process.
Retrospective Study We could pursue a retrospective study that would use either
all or a sample of the historical process data over some period of time to
determine the relationships among the two temperatures and the reflux rate on the
acetone concentration in the product stream. In so doing, we take advantage of
previously collected data and minimize the cost of the study. However, there are
several problems:
do not correspond directly. Constructing an approximate
correspondence usually requires a great deal of effort.
3. Production controls temperatures as tightly as possible to specific
target values through the use of automatic controllers. Since the two
temperatures vary so little over time, we will have a great deal of
difficulty seeing their real impact on the concentration.
4. Within the narrow ranges that they do vary, the condensate
temperature tends to increase with the reboil temperature. As a
result, we will have a great deal of difficulty separating out the
individual effects of the two temperatures. This leads to the
problem of collinearity or multicollinearity, which we discuss
in Chapter 9.
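Problem 4 can be made concrete with a small simulation. Everything in this sketch is hypothetical: the temperature deviations are simulated rather than taken from any production log, and the variance inflation factor 1 / (1 - r²) is a standard collinearity diagnostic for the two-regressor case that is used here for illustration and is not introduced in this section.

```python
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)   # seeded so the run is repeatable

# Hypothetical deviations of the reboil temperature from its target value.
reboil = [rng.uniform(-1.0, 1.0) for _ in range(200)]
# Hypothetical condensate deviations that rise and fall with the reboil values.
condensate = [t + rng.gauss(0.0, 0.1) for t in reboil]

r = pearson_r(reboil, condensate)
vif = 1.0 / (1.0 - r ** 2)   # variance inflation factor with two regressors
print(f"correlation between the two temperatures: {r:.3f}")
print(f"variance inflation factor: {vif:.1f}")
```

When the two temperatures move together this closely, the data carry very little information about what happens when one changes and the other does not, which is why their individual effects are hard to separate.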
The reliability and quality of the data are often highly questionable.
The nature of the data often may not allow us to address the
problem at hand.
The analyst often tries to use the data in ways they were never
intended to be used.
Using historical data always involves the risk that, for whatever reason, some of
the data were not recorded or were lost. Typically, historical data consist of
information considered critical and of information that is convenient to collect.
The convenient information is often collected with great care and accuracy. The
essential information often is not. Consequently, historical data often suffer from
transcription errors and other problems with data quality. These errors make
historical data prone to outliers, or observations that are very different from the
bulk of the data. A regression analysis is only as reliable as the data on which it is
based.
Just because data are convenient to collect does not mean that these data are
particularly useful. Often, data not considered essential for routine process
monitoring and not convenient to collect do have a significant impact on the
process. Historical data cannot provide this information since they were never
collected. For example, the ambient temperature may impact the heat losses from
our distillation column. On cold days, the column loses more heat to the
environment than during very warm days. The production logs for this acetone–
butyl alcohol column do not record the ambient temperature. As a result,
historical data do not allow the analyst to include this factor in the analysis even
though it may have some importance.
In some cases, we try to use data that were collected as surrogates for what we
really needed to collect. The resulting analysis is informative only to the extent
that these surrogates really reflect what they represent. For example, the nature of
the inlet mixture of acetone and butyl alcohol can significantly affect the
column's performance. The column was designed for the feed to be a saturated
liquid (at the mixture's boiling point). The production logs record the feed
temperature but do not record the specific concentrations of acetone and butyl
alcohol in the feed stream. Those concentrations are too hard to obtain on a
regular basis. In this case, inlet temperature is a surrogate for the nature of the
inlet mixture. It is perfectly possible for the feed to be at the correct specific
temperature and the inlet feed to be either a subcooled liquid or a mixture of
liquid and vapor.
In some cases, the data collected most casually, and thus with the lowest quality,
the least accuracy, and the least reliability, turn out to be very influential for
explaining our response. This influence may be real, or it may be an artifact
related to the inaccuracies in the data. Too many analyses reach invalid
conclusions because they lend too much credence to data that were never meant
to be used for the strict purposes of analysis.
Finally, the primary purpose of many analyses is to isolate the root causes
underlying interesting phenomena. With historical data, these interesting
phenomena may have occurred months or years before. Logs and notebooks
often provide no significant insights into these root causes, and memories clearly
begin to fade over time. Too often, analyses based on historical data identify
interesting phenomena that go unexplained.
Observational Study We could use an observational study to collect data for this
problem. As the name implies, an observational study simply observes the
process or population. We interact with or disturb the process only as much as is
required to obtain relevant data. With proper planning, these studies can ensure
accurate, complete, and reliable data. On the other hand, these studies often
provide very limited information about specific relationships among the data.
In this example, we would set up a data collection form that would allow the
production personnel to record the two temperatures and the actual reflux rate at
specified times corresponding to the observed concentration of acetone in the
product stream. The data collection form should provide the ability to add
comments in order to record any interesting phenomena that may occur. Such a
procedure would ensure accurate and reliable data collection and would take care
of problems 1 and 2 above. This approach also minimizes the chances of
observing an outlier related to some error in the data. Unfortunately, an
observational study cannot address problems 3 and 4. As a result, observational
studies can lend themselves to problems with collinearity.
Designed Experiment The best data collection strategy for this problem uses a
designed experiment where we would manipulate the two temperatures and the
reflux ratio, which we would call the factors, according to a well-defined
strategy, called the experimental design. This strategy must ensure that we can
separate out the effects on the acetone concentration related to each factor. In the
process, we eliminate any collinearity problems. The specified values of the
factors used in the experiment are called the levels. Typically, we use a small
number of levels for each factor, such as two or three. For the distillation column
example, suppose we use a “high” or +1 and a “low” or –1 level for each of the
factors. We thus would use two levels for each of the three factors. A treatment
combination is a specific combination of the levels of each factor. Each time we
carry out a treatment combination is an experimental run or setting. The
experimental design or plan consists of a series of runs.
For the distillation example, a very reasonable experimental strategy uses every
possible treatment combination to form a basic experiment with eight different
settings for the process. Table 1.1 presents these combinations of high and low
levels.
Figure 1.7 illustrates that this design forms a cube in terms of these high and low
levels. With each setting of the process conditions, we allow the column to reach
equilibrium, take a sample of the product stream, and determine the acetone
concentration. We then can draw specific inferences about the effect of these
factors. Such an approach allows us to proactively study a population or process.
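The eight treatment combinations described above can be enumerated mechanically. The following is a minimal sketch: the factor names are labels assumed here for the two temperatures and the reflux rate, and the run order produced by the code is not necessarily the order used in Table 1.1.

```python
from itertools import product

# Factor labels are assumptions made for this sketch.
FACTORS = ("reboil_temp", "condensate_temp", "reflux_rate")
LEVELS = (-1, +1)   # "low" and "high" settings

def full_factorial(factors, levels):
    """Every treatment combination of the given levels, one dict per run."""
    return [dict(zip(factors, combo))
            for combo in product(levels, repeat=len(factors))]

def column_dot(runs, a, b):
    """Sum over runs of the product of two factor columns."""
    return sum(run[a] * run[b] for run in runs)

runs = full_factorial(FACTORS, LEVELS)
for number, run in enumerate(runs, start=1):
    print(number, [run[name] for name in FACTORS])

# Each pair of factor columns has a zero dot product, so the columns are orthogonal.
print(column_dot(runs, "reboil_temp", "condensate_temp"))
```

Because every pair of factor columns is orthogonal, the coded factors are uncorrelated in this design, which is how a designed experiment avoids the collinearity that limits the retrospective and observational studies.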
TABLE 1.1 Designed Experiment for the Distillation Column
1. Data description
2. Parameter estimation
3. Prediction and estimation
4. Control
Regression models may be used for control purposes. For example, a chemical
engineer could use regression analysis to develop a model relating the tensile
strength of paper to the hardwood concentration in the pulp. This equation could
then be used to control the strength to suitable values by varying the level of
hardwood concentration. When a regression equation is used for control
purposes, it is important that the variables be related in a causal manner. Note
that a cause-and-effect relationship may not be necessary if the equation is to be
used only for prediction. In this case it is only necessary that the relationships
that existed in the original data used to build the regression equation are still
valid. For example, the daily electricity consumption during August in Atlanta,
Georgia, may be a good predictor for the maximum daily temperature in August.
However, any attempt to reduce the maximum temperature by curtailing
electricity consumption is clearly doomed to failure.
Figure 1.8 Regression model-building process.