
Lecture 1

Basic Concepts in Regression

By Prof. Ann Mwangi


Introduction

I Regression analysis is a statistical technique used to describe relationships among variables.
I The simplest case to examine is one in which a variable Y, referred to as the dependent or outcome variable, may be related to one variable X, called an independent or explanatory variable, or a regressor.
I The purpose of regression is to try to find the best line of
fit or equation that expresses the relationship between Y
and X.
Example
Consider the following data points
X 1 2 3 4 5 6
Y 3 5 7 9 11 13

A graph of the (x, y) pairs would appear as


Introduction
I Regression analysis is not needed to obtain the equation
that describes Y and X because it is readily seen that
Y = 1 + 2X .
I This is an exact or deterministic relationship.
I Deterministic relationships are sometimes (although very
rarely) encountered in business environments. For
example, in accounting:

assets = liabilities + owner equity


total costs = fixed costs + variable costs
I In business and other social science disciplines,
deterministic relationships are the exception rather than
the norm.
Example 2
Data encountered in a business environment are more likely to
appear like the data points in this graph, where Y and X
largely obey an approximately linear relationship, but it is not
an exact relationship:
I Still, it may be useful to describe the relationship in equation form, expressing Y as a function of X alone.
I the equation can be used for forecasting and policy
analysis, allowing for the existence of errors (since the
relationship is not exact).
I So how do we fit a line to describe the "broadly linear" relationship between Y and X when the (x, y) pairs do not all lie on a straight line?
Functional relation between variables
I Expressed by mathematical formula Y = f (x)
I Given a particular value of X, the function f indicates the
corresponding value of Y.
I Example
Period Number of units sold Dollar sales
1 75 150
2 25 50
3 130 260

I All points fall directly on the line of functional relationship. This is characteristic of all functional relations.
Statistical relation between variables
I Unlike the functional relationship the relation is not a
perfect one
I The observations do not fall directly on the curve of
relationship
I Eg: Performance evaluations for 10 employees obtained at
midyear and at year-end. Year-end evaluations are taken
as Y, and midyear evaluations as X. Below is the plot
Statistical relations
I Plot suggests that there is a relation between midyear and
year-end evaluations, in the sense that the higher the
midyear evaluation, the higher tends to be the year-end
evaluation.
I The relation is not a perfect one. There is a scattering of
points, suggesting that some of the variation in year-end
evaluations is not accounted for by midyear performance
assessments.
I E.g Two employees had midyear evaluations of X = 80,
yet they received somewhat different year-end evaluations.
I Because of the scattering of points in a statistical relation, the plot is called a scatter diagram or scatter plot. In statistical terminology, each point in the scatter diagram represents a trial or a case.
Statistical relations with a line

I Here we have plotted a line of relationship that describes the statistical relation between X and Y.
I It indicates the general tendency by which Y varies with the level of X.
I Most of the points do not fall directly on the line of
statistical relationship.
I Scattering of points around the line represents variation in
Y that is not associated with X and that is usually
considered to be of a random nature.
Basic concepts

A regression model is a formal means of expressing the two essential ingredients of a statistical relation:
1. A tendency of the response variable Y to vary with the
predictor variable X in a systematic fashion.
2. A scattering of points around the curve of statistical
relationship.
These two characteristics are embodied in a regression model
by postulating that:
I There is a probability distribution of Y for each level of X.

I The means of these probability distributions vary in some systematic fashion with X.
Regression model

I The year-end evaluation Y is treated in a regression model as a random variable. For each level of midyear
performance evaluation, there is postulated a probability
distribution of Y.
I The figure above shows such a probability distribution for
X = 90, which is the midyear evaluation for the first
employee.
I The actual year-end evaluation of this employee, Y = 94,
is then viewed as a random selection from this probability
distribution.
Example: Performance Evaluation

I The Figure also shows probability distributions of Y for midyear evaluation levels X = 50 and X = 70. Note that
the means of the probability distributions have a
systematic relation to the level of X.
I This systematic relationship is called the regression
function of Y on X.
I The graph of the regression function is called the
regression curve. In this case the regression function is
slightly curvilinear. This implies that the increase in the
expected (mean) year-end evaluation with an increase in
midyear performance evaluation is retarded at higher
levels of midyear performance.
Regression Models with More than One Predictor
Variable
I Efficiency study of 67 branch offices of a consumer finance chain; response: direct operating cost for the year just ended. Four predictors: average size of loan outstanding during the year, average number of loans outstanding, total number of new loan applications processed, and an index of office salaries.
I Tractor purchase study; response: volume (in horsepower) of tractor purchases in a sales territory of a farm equipment firm. Nine predictors: average age of tractors on farms in the territory, number of farms in the territory, a quantity index of crop production in the territory, etc.
I In a medical study of short children; response: peak plasma growth hormone level. 14 predictor variables: age, gender, height, weight, and 10 skinfold measurements.
Construction of Regression models

Selection of predictor variables: consider the extent to which a variable contributes to explaining the variation in Y.
Functional form of regression relation: tied to the choice of predictor variables but may also be informed by relevant theory.
Scope of model: determined by the design of the investigation or the range of data at hand.
Uses of regression analysis

Description: Tractor example
Control: Efficiency study
Prediction: Growth hormone example
Regression and Causality
I The existence of a statistical relation between the
response variable Y and the explanatory or predictor
variable X does not imply in any way that Y depends
causally on X.
I No matter how strong is the statistical relation between X
and Y, no cause-and-effect pattern is necessarily implied
by the regression model.
I For example, data on size of vocabulary (X) and writing
speed (Y) for a sample of young children aged 5-10 will
show a positive regression relation.
I This relation does not imply, however, that an increase in
vocabulary causes a faster writing speed. Here, other
explanatory variables, such as age of the child and
amount of education, affect both the vocabulary (X) and
the writing speed (Y).
I Older children have a larger vocabulary and a faster writing speed.
Lecture 2

Simple Linear Regression Model and LSE

By Prof. Ann Mwangi


Simple Linear Regression Model with Unspecified Distribution of Errors
Basic simple linear regression model

Yi = β0 + β1 Xi + εi

where
I Yi is the value of the response variable in the ith trial
I β0 and β1 are parameters
I Xi is the value of the predictor in the ith trial
I εi is a random error term with mean E[εi] = 0 and variance σ²[εi] = σ²; εi and εj are uncorrelated, so that their covariance is zero
I i = 1, 2, ..., n

The model is linear in the parameters and in the predictor variable. It is also called a first-order model.
Features of Linear Model
1. The response Yi in the ith trial is the sum of two components: (1) the constant term β0 + β1 Xi and (2) the random term εi. Hence, Yi is a random variable.
2. Since E[εi] = 0, it follows that E[Yi] = E[β0 + β1 Xi + εi] = β0 + β1 Xi + E[εi] = β0 + β1 Xi. Thus the response Yi comes from a probability distribution whose mean is E[Yi] = β0 + β1 Xi.
3. The response Yi in the ith trial exceeds or falls short of the value of the regression function by the error term amount εi.
4. The error terms εi are assumed to have constant variance σ², so the responses Yi have the same constant variance, σ²[Yi] = σ², regardless of the level of the predictor variable X.
5. We assume the error terms are uncorrelated. Since εi and εj are uncorrelated, Yi and Yj are uncorrelated.
Interpreting regression parameters

I The parameters β0 and β1 in the regression model are called regression coefficients.
I β1 is the slope of the regression line. It indicates the
change in the mean of the probability distribution of Y
per unit increase in X.
I The parameter β0 is the Y intercept of the regression line.
I When the scope of the model includes X = 0, β0 gives
the mean of the probability distribution of Y at X = 0.
I When the scope of the model does not cover X = 0, β0
does not have any particular meaning as a separate term
in the regression model.
Example: electrical distributor

The figure shows the regression function E[Y] = 9.5 + 2.1X.

I The slope β1 = 2.1 indicates that the preparation of one additional bid in a week leads to an increase in the mean of the probability distribution of Y of 2.1 hours.
I The intercept β0 = 9.5 indicates the value of the regression function at X = 0.
Example:Estimation of Regression Function

In a small-scale study of persistence, an experimenter gave three subjects a very difficult task. Data on the age of the subject (X) and on the number of attempts to accomplish the task before giving up (Y) follow:

Subject i: 1 2 3
Age Xi 20 55 30
Number of attempts Yi 5 12 10
Notation: n = 3, the observations for the first subject were
(X1 , Y1 ) = (20, 5), and similarly for the other subjects.
Method of Least Squares
I To find "good" estimators of the regression parameters β0 and β1, we employ the method of least squares.
I For each observation (Xi, Yi), the method of least squares considers the deviation of Yi from its expected value: Yi − (β0 + β1 Xi).
I The method of least squares requires that we consider the sum of the n squared deviations. This criterion is denoted by

Q = Σ_{i=1}^{n} (Yi − β0 − β1 Xi)²

I According to the method of least squares, the estimators of β0 and β1 are the values b0 and b1, respectively, that minimize the criterion Q for the given sample observations (X1, Y1), (X2, Y2), ..., (Xn, Yn).
Example: Least Squares Criterion Q

1. For Y = 9.0 + 0X: Q = (5 − 9)² + (12 − 9)² + (10 − 9)² = 26
2. For Y = 2.81 + 0.177X: Q = (5 − 6.35)² + (12 − 12.55)² + (10 − 8.12)² ≈ 5.7
I Thus, a better fit of the regression line to the data corresponds to a smaller sum Q.
I The objective of the method of least squares is to find estimates b0 and b1 for β0 and β1, respectively, for which Q is a minimum.
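As a quick illustration, the criterion Q for these two candidate lines can be checked in R with the persistence data above (a minimal sketch; the function name Q is ours):

x <- c(20, 55, 30)                              # ages of the three subjects
y <- c(5, 12, 10)                               # number of attempts
Q <- function(b0, b1) sum((y - b0 - b1 * x)^2)  # least squares criterion
Q(9.0, 0)                                       # line Y = 9.0 + 0X gives Q = 26
Q(2.81, 0.177)                                  # line Y = 2.81 + 0.177X gives Q of about 5.7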
Least Squares Estimators

The estimators b0 and b1 that satisfy the least squares criterion can be found in two basic ways:
1. Numerical search procedures can be used that evaluate in a systematic fashion the least squares criterion Q for different estimates b0 and b1 until the ones that minimize Q are found.
2. Analytical procedures can often be used to find the values of b0 and b1 that minimize Q. The analytical approach is feasible when the regression model is not mathematically complex.
Analytical approach
It can be shown for the regression model that the values b0 and b1 that minimize Q for any particular set of sample data are given by the following simultaneous equations, also known as the normal equations:

ΣYi = n b0 + b1 ΣXi

ΣXi Yi = b0 ΣXi + b1 ΣXi²

b0 and b1 are called point estimators of β0 and β1, respectively. Solving the simultaneous equations gives

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

b0 = (1/n)(ΣYi − b1 ΣXi) = Ȳ − b1 X̄

where X̄ and Ȳ are the means of the Xi and the Yi observations, respectively.
Example: Toluca Company

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = 70690 / 19800 = 3.5702

b0 = Ȳ − b1 X̄ = 312.28 − 3.5702(70.0) = 62.37
Example: Fitted Regression

We estimate that the mean number of work hours increases by 3.57 hours for each additional unit produced in the lot. This estimate applies to the range of lot sizes in the data from which the estimates were derived, namely to lot sizes ranging from about 20 to about 120.
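As a sketch of how these formulas are applied in R, the estimates can be computed either from the analytical solution or with lm(); since the Toluca raw data are not listed in these notes, the persistence data from the earlier example are used for illustration:

x <- c(20, 55, 30); y <- c(5, 12, 10)        # persistence example data
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)                          # about 2.81 and 0.177
coef(lm(y ~ x))                              # lm() reproduces the same estimates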
Point estimation of the mean response: Estimated
Regression Function
I Given sample estimators b0 and b1 of the parameters in the regression function

E[Y] = β0 + β1 X

where E[Y] is the mean response.
I The estimated regression function is

Ŷ = b0 + b1 X

where Ŷ is the value of the estimated regression function at the level X of the predictor variable.
I Ŷ is an unbiased estimator of E[Y], with minimum variance in the class of unbiased linear estimators.
I Ŷi = b0 + b1 Xi is the fitted value for the ith case.
Toluca Company example
I The least squares estimates of the regression coefficients
are: b0 = 62.37 b1 = 3.5702.
I Hence, the estimated regression function is:
Ŷ = 62.37 + 3.5702X
I To estimate the mean response for any level X of the
predictor variable, we simply substitute that value of X in
the estimated regression function.
I Suppose that we are interested in the mean number of
work hours required when the lot size is X = 65; our
point estimate is: Ŷ = 62.37 + 3.5702(65) = 294.4
I We interpret this to mean that if many lots of 65 units
are produced under the conditions of the 25 runs on
which the estimated regression function is based, the
mean labor time for these lots is about 294 hours
Residuals

I The ith residual is the difference between the observed value Yi and the corresponding fitted value Ŷi, and is denoted by ei.
I In general, ei = Yi − Ŷi.
I ei = Yi − (b0 + b1 Xi) = Yi − b0 − b1 Xi
I In the Toluca example we see that the residual for the first case is e1 = Y1 − Ŷ1 = 399 − 347.98 = 51.02.
Fitted and Residuals: Toluca example
Illustration of Residuals: Toluca example
Properties of Fitted Regression line

I The sum of the residuals is zero: Σei = 0.
I The sum of the squared residuals, Σei², is a minimum; this is the least squares requirement.
I The sum of the observed values Yi equals the sum of the fitted values Ŷi: ΣYi = ΣŶi.
I The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the predictor variable or by the fitted value for the ith trial: ΣXi ei = ΣŶi ei = 0.
I The regression line always goes through the point (X̄, Ȳ).
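These properties are easy to verify numerically; a minimal R sketch (again using the persistence data, since the Toluca raw data are not listed here):

x <- c(20, 55, 30); y <- c(5, 12, 10)
fit <- lm(y ~ x)
e <- resid(fit)
sum(e)                                # residuals sum to zero (up to rounding)
c(sum(y), sum(fitted(fit)))           # observed and fitted values have equal sums
c(sum(x * e), sum(fitted(fit) * e))   # weighted residual sums are zero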
Point Estimator of σ 2: Single population

I We know that the variance σ² of a single population is estimated by the sample variance s²:

s² = Σ(Yi − Ȳ)² / (n − 1)

which is an unbiased estimator of the variance σ² of an infinite population.
I The sample variance is often called a mean square, because a sum of squares has been divided by the appropriate number of degrees of freedom.
Point Estimator of σ 2:Regression Model

I SSE stands for the error sum of squares or residual sum of squares:

SSE = Σ(Yi − Ŷi)² = Σei²

I The residual mean square or error mean square is

s² = MSE = SSE/(n − 2) = Σ(Yi − Ŷi)²/(n − 2) = Σei²/(n − 2)

I MSE is an unbiased estimator of σ² for the regression model.
Lecture 3

Maximum Likelihood Estimation

By Prof. Ann Mwangi


Estimation of parameters: Method of Least Squares
The method of least squares for linear regression (which extends naturally to polynomial regression) involves minimizing the sum of squared residuals

Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − b1 − b2 xi)²

(in this lecture b1 denotes the intercept and b2 the slope).

I To find the values of b1 and b2 that lead to the minimum, set the partial derivatives to zero:

∂(Σei²)/∂b1 = −2 Σ_{i=1}^{n} (yi − b1 − b2 xi) = 0    (1)

∂(Σei²)/∂b2 = −2 Σ_{i=1}^{n} xi (yi − b1 − b2 xi) = 0    (2)

I Equations (1) and (2) are known as the normal equations. Rearranging them gives

n b1 + (Σxi) b2 = Σyi

(Σxi) b1 + (Σxi²) b2 = Σxi yi

or, in matrix notation,

[ n     Σxi  ] [ b1 ]   [ Σyi    ]
[ Σxi   Σxi² ] [ b2 ] = [ Σxi yi ]
Example

Table: Example

Case   X   Y      XY      X²
1      0   2.1    0.0     0
2      1   7.7    7.7     1
3      2   13.6   27.2    4
4      3   27.2   81.6    9
5      4   40.9   163.6   16
6      5   61.1   305.5   25

n = 6, Σxi = 15, Σyi = 152.6, Σxi yi = 585.6 and Σxi² = 55

[ 6    15 ] [ b1 ]   [ 152.6 ]
[ 15   55 ] [ b2 ] = [ 585.6 ]

6 b1 + 15 b2 = 152.6
15 b1 + 55 b2 = 585.6

Multiplying the first equation by 5 and the second by 2 and subtracting eliminates b1; we get b2 = 11.66 and b1 = −3.72.
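The same system can be solved directly in R with solve(), using the sums computed above (a brief sketch):

A   <- matrix(c(6, 15, 15, 55), nrow = 2, byrow = TRUE)  # [n, sum(x); sum(x), sum(x^2)]
rhs <- c(152.6, 585.6)                                   # [sum(y), sum(xy)]
solve(A, rhs)        # returns b1 (intercept) about -3.72 and b2 (slope) about 11.66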
Introduction

I When the functional form of the probability distribution of the error terms is specified, estimators of the parameters β0, β1, and σ² can be obtained by the method of maximum likelihood.
I Essentially, the method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.
Single population

I Consider a normal population whose standard deviation is known to be σ = 10 and whose mean is unknown.
I A random sample of n = 3 observations is selected from the population and yields the results Y1 = 250, Y2 = 265, Y3 = 259.
I We now wish to ascertain which value of µ is most consistent with the sample data.
Normal distribution with µ = 230 and σ = 10
I Consider µ = 230. The figure below shows the normal
distribution with µ = 230 and σ = 10 ; also shown there
are the locations of the three sample observations.

I The sample observations would be in the right tail of the distribution if µ = 230.
I Since these are unlikely occurrences, µ = 230 is not consistent with the sample data.
Normal distribution with µ = 259 and σ = 10
I Figure below shows the population and the locations of
the sample data if µ = 259

I Now the observations would be in the center of the distribution and much more likely.
I Hence, µ = 259 is more consistent with the sample data than µ = 230.
Method of maximum likelihood
I The method of maximum likelihood uses the density of
the probability distribution at Yi (i.e., the height of the
curve at Yi ) as a measure of consistency for the
observation Yi .
I Consider observation Yi in our example. If Yi is in the
tail, as in Figure µ = 230, the height of the curve will be
small. If Yi is nearer to the center of the distribution, as
in Figure µ = 259, the height will be larger.
I Using the density function for a normal probability distribution,

f(x) = (1/√(2πσ²)) exp[ −½ ((x − µ)/σ)² ]
Densities
We find the densities for Y1, denoted by f1, for the two cases of µ as shown in the figures. The densities for all three sample observations for the two cases of µ are obtained in the same way (the numerical values appear in the original figures, which are not reproduced here).
Method of Maximum likelihood
I The method of maximum likelihood uses the product of
the densities (i.e., here, the product of the three heights)
as the measure of consistency of the parameter value with
the sample data.
I The product is called the likelihood value of the
parameter value µ, and is denoted by L(µ),
I If the value of µ, is consistent with the sample data, the
densities will be relatively large and so will be the product
( L(µ), the likelihood value).
I If the value of µ, is not consistent with the data, the
densities will be small and the product L(µ) will be small.
Maximum likelihood Estimate
I There are two methods of finding maximum likelihood
estimates: by a systematic numerical search and by use of
an analytical solution.
I For some problems, analytical solutions for the maximum
likelihood estimators are available.
I For others, a computerized numerical search must be
conducted.
I The product of the densities viewed as a function of the
unknown parameters is called the likelihood function. For
our example, where σ = 10, the likelihood function is:

L(µ) = [1/√(2π · 10²)]³ exp[ −½ ((250 − µ)/10)² ] exp[ −½ ((265 − µ)/10)² ] exp[ −½ ((259 − µ)/10)² ]    (3)
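A short R sketch of this idea: evaluate the likelihood for candidate values of µ and maximize it numerically with σ = 10 fixed; the maximizing value is the sample mean, 258.

y <- c(250, 265, 259); sigma <- 10
L <- function(mu) prod(dnorm(y, mean = mu, sd = sigma))       # likelihood L(mu)
L(230)                       # very small: mu = 230 is inconsistent with the data
L(259)                       # much larger: mu = 259 is far more consistent
optimize(L, interval = c(200, 300), maximum = TRUE)$maximum   # about 258, the sample mean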
Regression model
I The density of an observation Yi for the normal error regression model Yi = β0 + β1 Xi + εi, where the εi are independent N(0, σ²), is as follows, using the facts that E[Yi] = β0 + β1 Xi and σ²[Yi] = σ²:

fi = (1/√(2πσ²)) exp[ −½ ((Yi − β0 − β1 Xi)/σ)² ]

I The likelihood function for the n observations Y1, Y2, ..., Yn is the product of the individual densities.
I The variance σ² of the error terms is usually unknown, so the likelihood is a function of three parameters, β0, β1, and σ²:

L(β0, β1, σ²) = Π_{i=1}^{n} (2πσ²)^(−1/2) exp[ −(1/(2σ²)) (Yi − β0 − β1 Xi)² ]
             = (2πσ²)^(−n/2) exp[ −(1/(2σ²)) Σ_{i=1}^{n} (Yi − β0 − β1 Xi)² ]
MLE

I The values of β0, β1, and σ² that maximize this likelihood function are the maximum likelihood estimators, denoted by β̂0, β̂1, and σ̂², respectively.
I The estimators can be found analytically.
I The estimators β̂0 and β̂1 are the same as those provided by the method of least squares.
I The maximum likelihood estimator σ̂² = Σ(Yi − Ŷi)²/n is biased.
Finding MLE

I We find the values of β0, β1, and σ² that maximize the likelihood function L by taking partial derivatives of L with respect to β0, β1, and σ², equating each of the partials to zero, and solving the system of equations thus obtained.
I We can work with logₑ L rather than L, because both L and logₑ L are maximized for the same values of β0, β1, and σ²:

logₑ L = −(n/2) logₑ 2π − (n/2) logₑ σ² − (1/(2σ²)) Σ(Yi − β0 − β1 Xi)²
Finding MLE

Partial differentiation of the logarithm of the likelihood function is much easier; it yields:

∂ logₑ L / ∂β0 = (1/σ²) Σ_{i=1}^{n} (Yi − β0 − β1 Xi)

∂ logₑ L / ∂β1 = (1/σ²) Σ_{i=1}^{n} Xi (Yi − β0 − β1 Xi)

∂ logₑ L / ∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (Yi − β0 − β1 Xi)²
Finding MLE

Setting the partial derivatives equal to zero and replacing β0, β1, and σ² by the estimators β̂0, β̂1, and σ̂² gives:

Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi) = 0

Σ_{i=1}^{n} Xi (Yi − β̂0 − β̂1 Xi) = 0

σ̂² = Σ_{i=1}^{n} (Yi − β̂0 − β̂1 Xi)² / n

The first two are the least squares normal equations, and the third is a biased estimator of σ².
Properties of MLE
The maximum likelihood estimators β̂0 and β̂1 are the same as the least squares estimators b0 and b1, so they have the properties of all least squares estimators:
I They are unbiased.
I They have minimum variance among all unbiased linear
estimators.
In addition, the maximum likelihood estimators b0 and b1 for
the normal error regression model have other desirable
properties:
I They are consistent
I They are sufficient
I They are minimum variance unbiased; that is, they have
minimum variance in the class of all unbiased estimators
(linear or otherwise).
Thus, for the normal error model, the estimators b0 and b1 ,
have many desirable properties.
Exercise
The data below, involving 10 shipments, were collected on the
number of times the carton was transferred from one aircraft
to another over the shipment route (X) and the number of
ampules found to be broken upon arrival (Y). Assume that
simple linear regression model is appropriate.

1. Obtain the estimated regression function (by hand and using R) and compare the results.
2. Obtain a point estimate of the expected number of
broken ampules when X = 1 transfer is made.
3. Estimate increase in expected number of ampules broken
when there are 2 transfers as compared to 1 transfer.
4. Verify that your fitted regression line goes through the
point (X̄ , Ȳ ).
Lecture 4

Inference in Regression

By Prof. Ann Mwangi


Inference for β1
I We consider the normal error regression model

Yi = β0 + β1 Xi + εi

where β0 and β1 are parameters, the Xi are known constants, and the εi are independent N(0, σ²).
I We are interested in drawing inferences about β1, the slope of the regression line in the model.
I E.g., a market research analyst studying the relation between sales (Y) and advertising expenditures (X) may wish to obtain an interval estimate of β1, because it will provide information as to how many additional sales dollars, on the average, are generated by an additional dollar of advertising expenditure.
I Tests concerning β1 of the following form are of interest:

H0 : β1 = 0
Ha : β1 ≠ 0
β1 = 0

β1 = 0 for the normal error regression model implies:

I There is no linear association between Y and X.
I There is no relation of any type between Y and X, since the probability distributions of Y are then identical at all levels of X.
Sampling distribution of b1
I The point estimator b1 was given by

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
   = [Σ(Xi − X̄)Yi − Ȳ Σ(Xi − X̄)] / Σ(Xi − X̄)²
   = Σ(Xi − X̄)Yi / Σ(Xi − X̄)²

I Thus b1 is a linear combination of the Yi:

b1 = ΣKi Yi,   where Ki = (Xi − X̄) / Σ(Xi − X̄)²
Notes

ΣKi = Σ(Xi − X̄) / Σ(Xi − X̄)² = [1/Σ(Xi − X̄)²] Σ(Xi − X̄) = 0

ΣKi Xi = ΣXi(Xi − X̄) / Σ(Xi − X̄)² = [1/Σ(Xi − X̄)²] (ΣXi² − X̄ ΣXi) = [1/Σ(Xi − X̄)²] (ΣXi² − nX̄²) = 1

ΣKi² = Σ[(Xi − X̄)/Σ(Xi − X̄)²]² = [1/(Σ(Xi − X̄)²)²] Σ(Xi − X̄)² = 1/Σ(Xi − X̄)²
Sampling distribution of b1

I For the normal error regression model, the sampling distribution of b1 is normal, with mean and variance:

E[b1] = E[ΣKi Yi] = ΣKi E[Yi] = ΣKi(β0 + β1 Xi) = β0 ΣKi + β1 ΣKi Xi = β1

σ²[b1] = σ²[ΣKi Yi] = ΣKi² σ²[Yi] = σ² ΣKi² = σ² / Σ(Xi − X̄)²
Estimated Variance

We can estimate the variance of the sampling distribution of b1,

σ²[b1] = σ² / Σ(Xi − X̄)²,

by replacing the parameter σ² with MSE, the unbiased estimator of σ²:

s²[b1] = MSE / Σ(Xi − X̄)²

The point estimator s²[b1] is an unbiased estimator of σ²[b1], where

MSE = SSE/(n − 2) = Σ(Yi − Ŷi)²/(n − 2) = Σei²/(n − 2)
Inference

(b1 − β1) / s[b1] ∼ t(n − 2)

The 1 − α confidence interval for β1 is

b1 ± t(1 − α/2; n − 2) s[b1]


Toluca Example

A cost analyst in the Toluca Company is interested in testing, using the regression model, whether or not there is a linear association between work hours and lot size, i.e., whether or not β1 = 0. The two alternatives then are:

H0 : β1 = 0

Ha : β1 ≠ 0

The analyst wishes to control the risk of a Type I error at α = .05.
Test statistic
I An explicit test of the alternatives is based on the test statistic:

t* = b1 / s[b1]

I The decision rule with this test statistic for controlling the level of significance at α is:

If |t*| ≤ t(1 − α/2; n − 2), conclude H0
If |t*| > t(1 − α/2; n − 2), conclude Ha

I For the Toluca Company example, where α = .05, b1 = 3.5702, and s[b1] = √(2384/19800) = .3470, the critical value is t(.975; 23) = 2.069.
I Since |t*| = 3.5702/.3470 = 10.29 > 2.069, we conclude Ha, that β1 ≠ 0, or that there is a linear association between work hours and lot size.
Confidence Interval for β1
I For a 95 percent confidence coefficient, we require t(.975; 23) = 2.069 from the t-distribution tables.
I The 95 percent confidence interval is:

b1 ± t(1 − α/2; n − 2) s[b1]

3.5702 − 2.069(.3470) ≤ β1 ≤ 3.5702 + 2.069(.3470)

2.85 ≤ β1 ≤ 4.29

I Thus, with confidence coefficient .95, we estimate that the mean number of work hours increases by somewhere between 2.85 and 4.29 hours for each additional unit produced in the lot.
I The 95 percent confidence interval for β1 does not include 0, hence we conclude Ha.
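The Toluca test and interval can be reproduced in R from the summary quantities quoted above (a sketch; with the raw data, summary() and confint() applied to an lm() fit give the same results):

b1 <- 3.5702; MSE <- 2384; Sxx <- 19800; n <- 25   # values from the notes
s_b1 <- sqrt(MSE / Sxx)                            # estimated s[b1], about 0.347
b1 / s_b1                                          # t* of about 10.29
t_crit <- qt(0.975, df = n - 2)                    # t(.975; 23) = 2.069
b1 + c(-1, 1) * t_crit * s_b1                      # 95% CI: roughly (2.85, 4.29)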
Inferences Concerning β0

I There are few occasions when we wish to make inferences concerning β0, the intercept of the regression line.
I These occur when the scope of the model includes X = 0.
I The point estimator of β0 is

b0 = Ȳ − b1 X̄

I The sampling distribution of b0 is normal, with mean and variance:

E[b0] = β0

σ²[b0] = σ² [1/n + X̄² / Σ(Xi − X̄)²]
Estimator for σ 2[b0]

An estimator of σ²[b0] is obtained by replacing σ² by its estimator MSE:

s²[b0] = MSE [1/n + X̄² / Σ(Xi − X̄)²]

Toluca Example

s²[b0] = 2384 [1/25 + 70.00²/19800] = 685.34

s[b0] = √685.34 = 26.18
Confidence Interval for β0
The 1 − α confidence interval for β0 is:

b0 ± t(1 − α/2; n − 2) s[b0]

For the Toluca example the 90% confidence interval for β0 is

62.37 − 1.714(26.18) ≤ β0 ≤ 62.37 + 1.714(26.18)

17.5 ≤ β0 ≤ 107.2

I This confidence interval does not necessarily provide meaningful information.
I For instance, it does not necessarily provide information about the "setup" cost (the cost incurred in setting up the production process for the part), since we are not certain whether a linear regression model is appropriate when the scope of the model is extended to X = 0.
Lecture 5

Prediction

By Prof. Ann Mwangi


Inference for E[Yh]
Ŷh is normal, with mean

E[Ŷh] = E[Yh]

Thus Ŷh is an unbiased estimator of E[Yh]:

E[Ŷh] = E[b0 + b1 Xh] = E[b0] + Xh E[b1] = β0 + β1 Xh

and variance

σ²[Ŷh] = σ² [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]

Replacing σ² with MSE,

s²[Ŷh] = MSE [1/n + (Xh − X̄)² / Σ(Xi − X̄)²]

where MSE = Σ(Yi − Ŷi)²/(n − 2) = Σei²/(n − 2).
Confidence Interval for E[Yh]
The 1 − α confidence limits for E[Yh] are:

Ŷh ± t(1 − α/2; n − 2) s[Ŷh]

Example (Toluca Company): Find a 90 percent confidence interval for E[Yh] when the lot size is Xh = 65 units.

Ŷh = b0 + b1 Xh = 62.37 + 3.5702 ∗ 65 = 294.4

s²[Ŷh] = MSE [1/n + (Xh − X̄)²/Σ(Xi − X̄)²] = 2384 [1/25 + (65 − 70)²/19800] = 98.37
Confidence interval

s[Ŷh] = √(s²[Ŷh]) = 9.918

The 90 percent confidence limits for E[Yh] are:

294.4 ± 1.714 ∗ 9.918

277.4 ≤ E[Yh] ≤ 311.4

We conclude with confidence coefficient .90 that the mean number of work hours required when lots of 65 units are produced is somewhere between 277.4 and 311.4 hours. We see that our estimate of the mean number of work hours is moderately precise.
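A brief R check of this interval from the quantities quoted above (with the raw data, predict(fit, newdata, interval = "confidence", level = 0.90) would give the same limits):

b0 <- 62.37; b1 <- 3.5702; MSE <- 2384; n <- 25
xbar <- 70; Sxx <- 19800; xh <- 65
yhat_h <- b0 + b1 * xh                               # 294.4
s_yhat <- sqrt(MSE * (1 / n + (xh - xbar)^2 / Sxx))  # 9.918
yhat_h + c(-1, 1) * qt(0.95, n - 2) * s_yhat         # 90% limits: about (277.4, 311.4)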
Exercise I

Suppose the Toluca Company wishes to estimate E[Yh] for lots with Xh = 100 units, using a 90 percent confidence interval. Provide the confidence interval for the mean.
Prediction of New Observation
We consider now the prediction of a new observation Y
corresponding to a given level X of the predictor variable.
Examples:
I In the Toluca Company example, the next lot to be
produced consists of 100 units and management wishes to
predict the number of work hours for this particular lot.
I An economist has estimated the regression relation
between company sales and no. of persons > 16 years old
from data. Using a reliable demographic projection of the
number of persons > 16 years for next year, the
economist wishes to predict next year’s company sales.
I An admissions officer at a university has estimated the regression relation between the high school GPA of admitted students and first-year college GPA. The officer wishes to predict the first-year college GPA for an applicant whose high school GPA is 3.5 as part of the admissions process.
Prediction Interval for Yh(new) when Parameters Unknown
The 1 − α prediction limits for a new observation Yh(new) are:

Ŷh ± t(1 − α/2; n − 2) s[pred]

where

s²[pred] = MSE + s²[Ŷh]
         = MSE + MSE [1/n + (Xh − X̄)²/Σ(Xi − X̄)²]
         = MSE [1 + 1/n + (Xh − X̄)²/Σ(Xi − X̄)²]
Example

The company is interested in whether the regression relationship is useful for predicting the required work hours for individual lots. Suppose that the next lot to be produced consists of Xh = 100 units and that a 90 percent prediction interval is desired.

Solution

Ŷh = b0 + b1 Xh = 62.37 + 3.5702 ∗ 100 = 419.4

s²[pred] = MSE [1 + 1/n + (Xh − X̄)²/Σ(Xi − X̄)²]
         = 2384 [1 + 1/25 + (100 − 70)²/19800] = 2587.72
Confidence interval
s[pred] = √(s²[pred]) = 50.87

The 90 percent prediction limits for Yh(new) are:

419.4 ± 1.714 ∗ 50.87

332.2 ≤ Yh(new) ≤ 506.6

I With confidence coefficient .90, we predict that the number of work hours for the next production run of 100 units will be somewhere between 332 and 507 hours.
I This prediction interval is rather wide and may not be too useful for planning worker requirements for the next lot.
I The interval can still be useful for control purposes, though.
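The corresponding R check for the prediction interval (with the raw data, predict(fit, newdata, interval = "prediction", level = 0.90) would give the same limits):

b0 <- 62.37; b1 <- 3.5702; MSE <- 2384; n <- 25
xbar <- 70; Sxx <- 19800; xh <- 100
yhat_h <- b0 + b1 * xh                                   # 419.4
s_pred <- sqrt(MSE * (1 + 1 / n + (xh - xbar)^2 / Sxx))  # 50.87
yhat_h + c(-1, 1) * qt(0.95, n - 2) * s_pred             # 90% limits: about (332, 507)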
Lecture 7

Analysis of Variance Approach to Linear Regression

By Prof. Ann Mwangi


Partitioning of Total Sum of Squares
I The analysis of variance approach is based on the
partitioning of sums of squares and degrees of freedom
associated with the response variable Y i.e (Yi − Ȳ ).

I The deviations are shown by the vertical lines above


SSTO:Total Sum of Squares

I The measure of total variation, denoted by SSTO, is the sum of the squared deviations:

SSTO = Σ_{i=1}^{n} (Yi − Ȳ)²

I SSTO is a measure of the uncertainty pertaining to the work hours required for a lot, when the lot size is not taken into account.
I If all Yi observations are the same, SSTO = 0.
Cont:

I When we utilize the predictor variable X, the variation reflecting the uncertainty concerning the variable Y is that of the Yi observations around the fitted regression line Ŷi.
I These deviations are shown by the vertical lines in the figure above.
SSE:Error Sum of Squares

I The measure of variation in the Yi observations that is present when the predictor variable X is taken into account is the sum of the squared deviations, which is the familiar SSE:

SSE = Σ_{i=1}^{n} (Yi − Ŷi)²

I If all Yi observations fall on the fitted regression line, SSE = 0.
I The greater the variation of the Yi observations around the fitted regression line, the larger is SSE.
SSR:Regression Sum of Squares
I What accounts for the substantial difference between these two sums of squares is another sum of squares:

SSR = Σ_{i=1}^{n} (Ŷi − Ȳ)²

I SSR is a sum of squared deviations, the deviations being Ŷi − Ȳ.
I If the regression line is horizontal, so that Ŷi − Ȳ = 0, then SSR = 0. Otherwise, SSR is positive.
I SSR may be considered a measure of that part of the variability of the Yi which is associated with the regression line.
I The larger SSR is in relation to SSTO, the greater is the effect of the regression relation in accounting for the total variation in the Yi observations.
Formal Development of Partitioning
.
I The total deviation Yi − Ȳ, used in the measure of the total variation of the observations Yi without taking the predictor variable into account, can be decomposed into two components:
1. The deviation of the fitted value Ŷi around the mean Ȳ.
2. The deviation of the observation Yi around the fitted regression line.

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)

I The sums of these squared deviations have the same relationship:

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} (Yi − Ŷi)²

SSTO = SSR + SSE


Proof

Σ(Yi − Ȳ)² = Σ[(Ŷi − Ȳ) + (Yi − Ŷi)]²
           = Σ[(Ŷi − Ȳ)² + (Yi − Ŷi)² + 2(Ŷi − Ȳ)(Yi − Ŷi)]
           = Σ(Ŷi − Ȳ)² + Σ(Yi − Ŷi)² + 2Σ(Ŷi − Ȳ)(Yi − Ŷi)

Now

2Σ(Ŷi − Ȳ)(Yi − Ŷi) = 2ΣŶi(Yi − Ŷi) − 2Ȳ Σ(Yi − Ŷi) = 0

since ΣŶi ei = 0 and Σei = 0 by the properties of the fitted regression line.
Important formula
SSR = b1² Σ_{i=1}^{n} (Xi − X̄)²
Breakdown of Degrees of Freedom
I We have n-1 degrees of freedom associated with SSTO.
I One degree of freedom lost because deviations Yi − Ȳ
are subject to one constraint:they must sum to zero.
I Equivalently, one degree of freedom is lost because the
sample mean Ȳ is used to estimate the population mean.
I SSE has n-2 degrees of freedom associated with it.
I Two degrees of freedom are lost because the two
parameters β0 and β1 are estimated in obtaining the
fitted values Ŷi
I SSR has one degree of freedom associated with it.
I Although there are n deviations (Ŷi − Ȳ), all fitted values Ŷi are calculated from the same estimated regression line.
I Two degrees of freedom are associated with a regression
line, corresponding to the intercept and the slope.
I One of the two degrees of freedom lost because
(Ŷi − Ȳ ) are subject to constraint: must sum to zero.
I Degrees of freedom are additive: n − 1 = 1 + (n − 2)
Mean Square
I A sum of squares divided by its associated degrees of
freedom is called a mean square (abbreviated MS).
I For instance, an ordinary sample variance is a mean square, since a sum of squares, Σ(Yi − Ȳ)², is divided by its associated degrees of freedom, n − 1.
I We are interested here in the regression mean square, denoted by MSR:

MSR = SSR/1 = SSR

I The error mean square is

MSE = SSE/(n − 2)
I Note: Mean squares are not additive.
ANOVA table

Source of variation   SS                   df      MS                  E{MS}
Regression            SSR = Σ(Ŷi − Ȳ)²     1       MSR = SSR/1         σ² + β1² Σ(Xi − X̄)²
Error                 SSE = Σ(Yi − Ŷi)²    n − 2   MSE = SSE/(n − 2)   σ²
Total                 SSTO = Σ(Yi − Ȳ)²    n − 1
F Test of β1 = 0 versus β1 6= 0
I The analysis of variance approach provides us with a battery of highly useful tests for regression models (and other linear statistical models).
I In the simple linear regression case considered here, the analysis of variance provides us with a test for:

H0 : β1 = 0
Ha : β1 ≠ 0

I Test statistic F*: it compares MSR and MSE in the following fashion:

F* = MSR / MSE

I Sampling distribution: under H0, F* follows the F(1, n − 2) distribution.
Decision Rule

I If F ∗ ≤ F (1 − α; 1, n − 2) , conclude H0
I If F ∗ > F (1 − α; 1, n − 2), conclude Ha
where F (1 − α; 1, n − 2) is the (1 − α) 100 percentile of the
appropriate F distribution
Example: Toluca Company

SSTO = 307203, SSE = 54825

thus SSR = SSTO − SSE = 307203 − 54825 = 252378

MSR = SSR/1 = 252378

MSE = SSE/(n − 2) = 54825/23 = 2384

F* = MSR/MSE = 252378/2384 = 105.88
Toluca: ANOVA

I As before, let α = .05.


I Since n = 25, we require F (.95; 1, 23) = 4.28.
I In this case 105.88 > 4.28 thus we conclude Ha
I Conclusion; There is a linear association between work
hours and lot size. Same result as when we used the t test
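A quick R sketch of this F test from the sums of squares quoted above (with the raw data, anova() applied to the lm() fit would produce the full ANOVA table; the Toluca column names are not given in these notes):

SSTO <- 307203; SSE <- 54825; n <- 25
SSR <- SSTO - SSE                       # 252378
MSR <- SSR / 1; MSE <- SSE / (n - 2)    # 252378 and 2384
MSR / MSE                               # F* of about 105.9
qf(0.95, 1, n - 2)                      # critical value F(.95; 1, 23) = 4.28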
General Linear Test Approach
I The analysis of variance test of β1 = 0 versus β1 ≠ 0 is an example of the general test for a linear statistical model.
I The general linear test approach involves three basic steps:
1. Fit the full model and obtain the error sum of squares SSE(F).
2. Fit the reduced model under H0 and obtain the error sum of squares SSE(R).
3. Use the test statistic

F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] ÷ [SSE(F) / dfF]

and the decision rule
I If F* ≤ F(1 − α; dfR − dfF, dfF), conclude H0
I If F* > F(1 − α; dfR − dfF, dfF), conclude Ha.
Coefficient of Determination (R 2)
R² = SSR/SSTO = 1 − SSE/SSTO

I 0 ≤ R² ≤ 1
I We may interpret R² as the proportionate reduction of total variation associated with the use of the predictor variable X.
I The larger R² is, the more the total variation of Y is reduced by introducing the predictor variable X.
I When all observations fall on the fitted regression line, then SSE = 0 and R² = 1: the predictor variable X accounts for all the variation in the observations Yi.
I When the fitted regression line is horizontal, so that b1 = 0 and Ŷi ≡ Ȳ, then SSE = SSTO and R² = 0. Here there is no linear association between X and Y in the sample data, and the predictor variable X is of no help in reducing the variation in the observations Yi.
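For the Toluca example, a one-line check using the sums of squares from the ANOVA slide gives R² of about 0.82; with a fitted lm() object, summary(fit)$r.squared returns the same quantity.

SSR <- 252378; SSTO <- 307203   # Toluca sums of squares from above
SSR / SSTO                      # R squared, about 0.82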
Lecture 8

Correlation in relation to Linear Regression

By Prof. Ann Mwangi


Introduction

Correlation is a statistical technique used to determine the degree to which two variables are related.
Sample dataset
nation immunize under5
Bolivia 77 118
Brazil 69 65
Cambodia 32 184
Canada 85 8
China 94 43
Czech Republic 99 12
Egypt 89 55
Ethiopia 13 208
Finland 95 7
France 95 9
Greece 54 9
India 89 124
Italy 95 10
Japan 87 6
Mexico 91 33
Poland 98 16
Russian Federation 73 32
Senegal 47 145
Turkey 76 87
United Kingdom 90 9
Correlation

Consider the diphtheria, pertussis, and tetanus (DPT) immunization rates presented on the previous slide.
Now consider the following question:
Is there any association between the proportion of
newborns immunized and the level of infant
mortality?
Example: DPT Immunization and Infant Mortality
Consider the following two-way scatter plot of the under-5
mortality rate on the y axis and the DPT levels (percent of
the population immunized) on the x axis (under five mortality
rate data set).
By simple inspection of the graph it is clear that as the proportion
of infants immunized against DPT increases, the infant mortality
rate decreases.
Now consider:
I X : Percent of infants immunized against DPT
I Y : Infant mortality (number of infants under 5 dying per
1,000 live births)
Pearson Correlation Coefficient ρ
The Pearson’s correlation or product moment correlation
coefficient is a measure of the nature and strength of the
relationship between two quantitative variables
I The sign of r denotes the nature of association

I while the value of r denotes the strength of association.

The correlation coefficient can take values from -1 to +1.


I Positive values of ρ (close to +1) imply a proportional
relationship between x and y .
I Negative values of ρ (close to -1) imply an inversely
proportional relationship between x and y .
I Note that if |ρ| is close to 1, this implies a functional
(perfect) relationship between x and y , meaning that if
we know one of them, it is like knowing the other exactly.
I Independent variables are uncorrelated.
How to calculate Pearson correlation coefficient

r = [Σxy − (Σx)(Σy)/n] / √{ [Σx² − (Σx)²/n] [Σy² − (Σy)²/n] }

  = cov(x, y) / [(sd of x) × (sd of y)]
Example
A sample of 6 children was selected, data about their age in
years and weight in kilograms was recorded as shown in the
following table . It is required to find the correlation between
age and weight.

SNO Age(yrs) Weight (kg)


1 7 12
2 6 8
3 8 12
4 5 10
5 6 11
6 9 13

Applying the formula


Worked Example

SNO   Age (X)   Weight (Y)   XY    X²    Y²
1     7         12           84    49    144
2     6         8            48    36    64
3     8         12           96    64    144
4     5         10           50    25    100
5     6         11           66    36    121
6     9         13           117   81    169

ΣX = 41, ΣY = 66, ΣXY = 461, ΣX² = 291, ΣY² = 742

r = [Σxy − (Σx)(Σy)/n] / √{ [Σx² − (Σx)²/n] [Σy² − (Σy)²/n] }
  = (461 − 41·66/6) / √{ (291 − 41²/6)(742 − 66²/6) }
  = 10 / √(10.83 × 16) ≈ 0.76
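As a quick check in R (a sketch with the same six observations):

age    <- c(7, 6, 8, 5, 6, 9)
weight <- c(12, 8, 12, 10, 11, 13)
cor(age, weight)               # Pearson correlation, about 0.76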
Exercise

Consider the immunization data and compute the correlation between immunization uptake and infant mortality.
Relationship with simple linear regression coefficients
Recall the simple linear regression equation

Yi = β0 + β1 Xi + εi

Using the method of least squares,

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]

One can show that

b1 = r × (SD of Y) / (SD of X)

where r is the correlation coefficient, and SD of Y and SD of X are the standard deviations of Y and X respectively.
Example
Consider the example for age and weight. Now we wish to
determine the least squares regression equation for the
relationship between the two variables

SNO   Age (X)   Weight (Y)   XY    X²    Y²
1     7         12           84    49    144
2     6         8            48    36    64
3     8         12           96    64    144
4     5         10           50    25    100
5     6         11           66    36    121
6     9         13           117   81    169

ΣX = 41, ΣY = 66, ΣXY = 461, ΣX² = 291, ΣY² = 742

b1 = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n] = (461 − 41·66/6) / (291 − 41²/6) = 10 / 10.83 = 0.923
Using correlation and SD

Now we know r = 0.76.

Compute the SD of X:

SD of X = √(ΣX² − (ΣX)²/n) = √(291 − 41²/6) = 3.29

Compute the SD of Y:

SD of Y = √(ΣY² − (ΣY)²/n) = √(742 − 66²/6) = 4

(Strictly these are root sums of squares rather than standard deviations, but the common divisor cancels in the ratio, so the slope is unaffected.) Hence

b1 = r × (SD of Y)/(SD of X) = 0.76 × 4 / 3.29 ≈ 0.92

in agreement with the value 0.923 obtained directly.
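A short R sketch confirming the identity b1 = r × (SD of Y)/(SD of X) for these data:

age <- c(7, 6, 8, 5, 6, 9); weight <- c(12, 8, 12, 10, 11, 13)
cor(age, weight) * sd(weight) / sd(age)   # about 0.923
cov(age, weight) / var(age)               # the same slope
coef(lm(weight ~ age))[2]                 # lm() agrees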
Lecture 9

Multiple Linear Regression

By Prof. Ann Mwangi


Multiple Regression

I Multiple regression is an extension of the simple regression situation.
I We are trying to describe the observations Y as a linear combination of several predictors (X's).
I The predictors can be powers of one another,

y = α + β1 x1 + β2 x1² + E

i.e., y = α + β1 x1 + β2 x2 + E (where x2 = x1²),
I or they can be distinct, such as

Y = α + β1 x1 + β2 x2 + ... + βq xq + E.
Recall simple linear regression

In the first case, the graphical representation of the problem is as follows:
Graphical representation for Multiple linear regression
I In the second case, the model is harder to visualize, and
impossible to do so beyond the two-predictor situation
(when the dimension of the problem rises above three).
I In all cases, the regression surface (notice we have
departed from the simple line) is going to be a hyperplane
(a plane in three or more dimensions).
I The figure shows the two-predictor situation.
The Least-Squares Regression Surface

I The idea for finding the "best" regression surface is identical to that of the simple linear case.
I That is, the best surface is the one that minimizes the squared deviations of the estimated values from the observations.
I The least-squares surface is the one that minimizes

Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} (yi − α̂ − β̂1 x1i − ... − β̂q xqi)²

As with simple linear regression,

ŷi = α̂ + β̂1 x1i + β̂2 x2i + ... + β̂q xqi,   i = 1, ..., n.
Assumptions of Multiple Regression

1. Independence: The Y observations are statistically


independent of each other. Usually this is not the case
when multiple measurements are taken on the same
subject. Other techniques must then be used that
account for this dependency.
2. Linearity: The mean value of Y for each combination of X1, X2, ..., Xq is a linear combination of them. That is, E(Yi) = µ_{y|x1,x2,...,xq} = β0 + β1 X1i + ... + βq Xqi.
3. Homoscedasticity: The variance of Y is the same for any fixed combination of X1, X2, ..., Xq. That is, σ²_{y|x1,x2,...,xq} is constant for all x1, x2, ..., xq.
4. Normality: For any fixed combination of X1, X2, ..., Xq, the variable Y is normally distributed. That is, y ∼ N(µ_{y|x1,x2,...,xq}, σ²_{y|x1,x2,...,xq}).
Explaining Variability

I Our task is to explain the variability in the data.


I Using similar methods as before, we have

Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} (yi − ŷi)²
(total sum of squares = regression sum of squares + residual sum of squares)

Source of variability   SS                df          MS                    F             Prob > F
Model                   SSR               q           MSR = SSR/q           F = MSR/MSE   p-value from the F(q, n−q−1) distribution
Residual (error)        SSE               n − q − 1   MSE = SSE/(n−q−1)
Corrected Total         SST = SSR + SSE   n − 1
F Test for the Significance of the Overall Regression
Model

I With similar methods as in the simple linear regression case, we can carry out an overall F test.
I This is based on the statistic F = MSR/MSE .
I This statistic is compared against the tail of the F
distribution with q and n − q − 1 degrees of freedom.
F Test for the Significance of the Overall Regression
Model

The test is constructed as follows:


1. Ho : No linear association between y and x1 , x2 , . . . , xq .
2. Ha : Linear association exists between y and x1 , x2 , . . . , xq .
3. The test is carried out at the (1 − α)% level of
significance.
4. The test statistic is F = MSR/MSE, where the numerator is the part of the variability that can be explained through the regression model and the denominator is the unaccounted-for variability or error.
5. Rejection rule: Reject Ho , if F > Fq,n−q−1;1−α . This will
happen if F is far from unity (just like in the ANOVA
case).
Individual and Adjusted Contributions from the
Explanatory Variables
I The regression sum of squares (SSR) receives
contributions from all the predictors. However, not all
contributions are equally important.
I Some predictors (the x’s) may be significant in explaining
the response (y ) and some may not be.
I Another problem involves the fact that the predictors
themselves may be correlated to one another.
I Thus, including one predictor in the model provides some
information about the other predictor as well.
I Then, when the second predictor is included, its individual
contribution (in the presence of the first predictor) may
not be as significant as it would have been if the second
were the only predictor in the model.
I In some cases, predictors may be significant individually
but non-significant in the presence of others.
Inference on Regression Coefficients
A way to test whether the addition of a new variable Xi adds significant
information to the prediction of Y is to use a t test in a similar fashion
as in the simple linear regression. This test is defined as follows:

1. Ho : βi = 0 (i.e., addition of Xi to the model does not add significantly to the prediction of Y)
2. Ha : βi ≠ 0 (two-sided test), or Ha : βi > 0, or Ha : βi < 0 (one-sided tests)
3. Specify the significance level (1 − α)%.
4. The test statistic is T = β̂i / s.e.(β̂i) ∼ t_{n−q−1}
5. Decision rule: Reject the null hypothesis if
   T > t_{n−q−1, 1−α/2} or T < t_{n−q−1, α/2}   (two-sided test, Ha : βi ≠ 0)
   T > t_{n−q−1, 1−α}   (one-sided test, Ha : βi > 0)
   T < t_{n−q−1, α}   (one-sided test, Ha : βi < 0)
Example: Head Circumference and Birth Weight of Low
Birth Weight Infants
I We wish to explore the relationship between gestational age
and head circumference in low birth weight infants.
I Another factor that could have impact on the head
circumference may be the birth weight of the infant.
I A plot of birth weight versus head circumference is given
below.

I A positive linear relationship exists between birth weight and


head circumference.
Example (continued):
I We may now wish to ask the question of whether there is a
linear relationship between birth weight together with
gestational age and head circumference.
I Consider the following graph:
Correlation Among the Predictors

We can also investigate these relationships by estimating the correlation coefficient between birth weight and gestational age.

Number of obs = 100


Spearman’s rho = 0.6426
Test of Ho: gestage and birthwt independent
Pr > |t| = 0.0000

The estimated correlation coefficient is 0.6426 with a p value


less than 0.0001. Thus, the correlation of gestational age and
birth weight is statistically significant.
This means that knowing gestational age provides us with a
great deal of information about birth weight and vice versa.
Univariate Analyses: Head Circumference Versus Gestational
Age

Source | SS df MS Number of obs = 100


---------+------------------------------ F( 1, 98) = 152.95
Model | 386.867366 1 386.867366 Prob > F = 0.0000
Residual | 247.882634 98 2.52941463 R-squared = 0.6095
---------+------------------------------ Adj R-squared = 0.6055
Total | 634.75 99 6.41161616 Root MSE = 1.5904

------------------------------------------------------------------------------
headcirc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gestage | .7800532 .0630744 12.367 0.000 .6548841 .9052223
_cons | 3.914264 1.829147 2.140 0.035 .2843818 7.544146
------------------------------------------------------------------------------

Recall the (simple) regression of head circumference on gestational


age. Notice that this model predicts that each increase of
gestational age by one week, results in 0.78 centimeters increase in
head circumference.
Univariate Analyses: Head Circumference Versus Birth
Weight

. lm(headcirc~birthwt)

Source | SS df MS Number of obs = 100


---------+------------------------------ F( 1, 98) = 172.82
Model | 405.05995 1 405.05995 Prob > F = 0.0000
Residual | 229.69005 98 2.34377602 R-squared = 0.6381
---------+------------------------------ Adj R-squared = 0.6344
Total | 634.75 99 6.41161616 Root MSE = 1.5309

------------------------------------------------------------------------------
headcirc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
birthwt | .0074918 .0005699 13.146 0.000 .0063609 .0086228
_cons | 18.21758 .6446606 28.259 0.000 16.93827 19.49689
------------------------------------------------------------------------------

Here is the (simple) regression of head circumference on birth weight.


The regression model predicts that each increase in birth weight by one gram will result in an increase in head circumference of 0.007 centimeters (alternatively, an increase of a kilo, or 1,000 grams, will result in an increase of about 7.5 centimeters).
Information on the Dependent Variable Adjusted for a
Third Variable

A significant linear relationship exists between gestational age


and birth weight. Thus, when we explore the relationship
between head circumference and gestational age, we already
furnish some information (through gestational age) about
birth weight.
The question that arises then is:
“Does birth weight provide additional information
about head circumference, after all information
about gestational age has been accounted for?”
The Analysis of the Combined Effect on Head Circumference

We carry out the following multiple regression

. lm( headcirc ~ gestage+birthwt)

Source | SS df MS Number of obs = 100


---------+------------------------------ F( 2, 97) = 147.06
Model | 477.326905 2 238.663453 Prob > F = 0.0000
Residual | 157.423095 97 1.6229185 R-squared = 0.7520
---------+------------------------------ Adj R-squared = 0.7469
Total | 634.75 99 6.41161616 Root MSE = 1.2739

------------------------------------------------------------------------------
headcirc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gestage | .4487328 .067246 6.673 0.000 .3152682 .5821975
birthwt | .0047123 .0006312 7.466 0.000 .0034596 .005965
_cons | 8.308015 1.578943 5.262 0.000 5.174251 11.44178
------------------------------------------------------------------------------
Results
1. The overall F statistic of the regression is associated with a p
value p < 0.0001. This means that there is a significant
linear association overall, between gestational age and birth
weight (the independent variables) and head circumference
(the dependent variable). Note that the F test does not
differentiate about individual variable contributions or
significance.
2. The least-squares equation that describes head circumference (Y) in terms of gestational age (X1) and birth weight (X2) is as follows:

Y = 8.308 + 0.4487 X1 + 0.0047 X2

This means that, all else being equal, each additional week of gestational age results in a 0.45 centimeter increase in head circumference, while (again all else being equal) each additional gram of weight is associated with a 0.005 centimeter increase in head circumference (or about 4.7 centimeters per kilogram, i.e., 1,000 grams).
Results (continued)

3. To decide on the relative contribution of each variable, we


look at each t test. Both t tests associated with gestational
age (gestage) and birth weight (birthwt) are statistically
significant (p < 0.0001). For example, the t test associated
with gestage is as follows:
3.1 The null hypothesis is Ho : β1 = 0 versus Ha : β1 ≠ 0. The t statistic is

t = β̂1 / s.e.(β̂1) = 0.4487/0.0672 = 6.68 > 1.985, the two-sided 5% critical value of the t distribution with 97 degrees of freedom. [1]
3.2 The t statistic associated with birth weight is similarly
calculated as 7.47 and is also statistically significant.
The test associated with birthwt is derived in an identical
manner.
This means that, adjusted for the other, each variable adds
statistically significant information to the estimation of head
circumference.
[1] The critical value can be obtained with the STATA display command: display invt(97,0.95)
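The same critical value can be obtained in R with qt() (the notes show the Stata command above):

qt(0.975, df = 97)    # two-sided 5% critical value, about 1.985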
Discussion

Notice that the estimates of the effect on head circumference of gestational age adjusted for birth weight, and of birth weight adjusted for gestational age, are nowhere near those derived from the univariate analyses (0.4487 versus 0.7800 and 0.005 versus 0.007, respectively). This is because inclusion of either variable
contributes information about the other variable.
What is the proper interpretation of the estimates of the two
slopes? The proper interpretation is that for two infants with the
same gestational age, the one that is one gram heavier will have
head circumference 0.005 centimeters longer. Alternatively, for two
infants with identical weight, the one that has one week longer gestational age will have a 0.45 centimeter longer head circumference on average.
Note that these estimates of the effect of gestational age and birth
weight on head circumference, are not the estimates of the
contribution of, say, birth weight, “above and beyond” gestational
age.
Model Comparison

As an illustration of some of the criteria used for model


comparison, we compare here the two models containing
gestational age.

Criterion               [gestage]   [gestage, birthwt]
Coefficient (gestage)   0.7800      0.4487
Standard error          0.0631      0.0672
Test statistic          12.367      6.673
p-value                 < 0.0001    < 0.0001
R²                      0.6095      0.7520
Adjusted R²             0.6055      0.7469
Discussion
1. The coefficient β1 associated with gestational age decreases when birth weight is included in the model. This change may be due to the large correlation between gestational age and birth weight or to poor fit of the model. The fact that the standard error of the estimate barely changes (0.0631 to 0.0672) indicates that the reason is probably the former.
2. The test statistics are different but both are extremely significant,
indicating that even adjusted for birth weight, gestational age is a
significant predictor of head circumference.
3. The coefficient of determination (R 2 ) denotes the proportion of the
total variability accounted by the regression model. It will always
increase with the addition of a new variable. That is why it is
inappropriate for comparison between models of different dimension.
4. The adjusted R² is a more appropriate measure because it adjusts for the difference (in this case by one) in the dimension of the model. Since the adjusted R² of the two-predictor model is much higher than that of the one-predictor model, we can conclude that the two-predictor model is superior.
Lecture 10

Linear Regression: Matrix Approach

By Prof. Ann Mwangi


Linear regression equation

Yi = β0 + β1 Xi + εi,   i = 1, 2, ..., n

Thus

Y1 = β0 + β1 X1 + ε1
Y2 = β0 + β1 X2 + ε2
...
Yn = β0 + β1 Xn + εn
Matrix

     
Y = (Y1, Y2, ..., Yn)ᵀ,   X is the n × 2 matrix whose ith row is (1, Xi),   β = (β0, β1)ᵀ,   ε = (ε1, ε2, ..., εn)ᵀ

Thus the linear regression equation in matrix form is

Y = Xβ + ε

and

E[Y] = Xβ
Normal equation
Recall the normal equations

n b0 + b1 ΣXi = ΣYi

b0 ΣXi + b1 ΣXi² = ΣXi Yi

In matrix notation,

XᵀX b = XᵀY

where b = (b0, b1)ᵀ is the vector of the least squares regression coefficients. Thus

[ n     ΣXi  ] [ b0 ]   [ ΣYi    ]
[ ΣXi   ΣXi² ] [ b1 ] = [ ΣXi Yi ]
Estimated regression coefficients

b = (XᵀX)⁻¹ XᵀY

Example: Toluca Company

Y = (399, 121, ..., 323)ᵀ and X has rows (1, 80), (1, 30), ..., (1, 70).

XᵀX = [ 25     1750   ]
      [ 1750   142300 ]
Cont: Example

XᵀY = [ 7807   ]
      [ 617180 ]

(XᵀX)⁻¹ = [  0.287475   −0.003535   ]
          [ −0.003535    0.00005051 ]

b = (b0, b1)ᵀ = (XᵀX)⁻¹ (XᵀY)

  = [  0.287475   −0.003535   ] [ 7807   ]   [ 62.37  ]
    [ −0.003535    0.00005051 ] [ 617180 ] = [ 3.5702 ]
Fitted values

Ŷ = Xb
Thus
Ŷ = X(XᵀX)⁻¹(XᵀY)
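A minimal R sketch of these matrix computations (since the Toluca raw data are not listed here, the persistence data from Lecture 2 are used for illustration):

X <- cbind(1, c(20, 55, 30))              # design matrix: a column of 1s and the ages
y <- c(5, 12, 10)                         # number of attempts
b <- solve(t(X) %*% X) %*% t(X) %*% y     # b = (X'X)^{-1} X'Y, about (2.81, 0.177)
X %*% b                                   # fitted values Y-hat = Xb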
