
Week 6.

Lecture 1&2
Estimation 2: Least Squares

The second approach to estimating the population parameters for a distribution is to use the technique of least squares, and this requires using the actual or empirical distribution. The basic idea can be illustrated using the exponential distribution, whose cdf is given by

F(x) = 1 − e^(−λx)

This can be linearised as follows:

ln[1 − F(x)] = −λx

Thus a plot of ln[1 − F(x)] against a sample of x should produce a straight line through the origin whose slope equals −λ. The slope of a best fit line on such a plot therefore provides a sample estimate for λ. This is the least squares technique. However, there is a duality problem here in that to get values for F(x) to form such a plot a value for λ is needed, but the least squares technique is trying to estimate λ. The trick is to replace F(x) with the empirical or actual cdf, F̂(x), that does not require a value for λ and is found solely from the data itself. Then

ln[1 − F̂(x)] = −λx

A. The Empirical cdf.

Consider a data set that consists of observations from some unknown cumulative distribution function F(x). The empirical cumulative distribution function, often also referred to as the estimated, sample or actual cumulative distribution function, is the best guess, made from a sample of data, of the true (but unknown) population cumulative distribution function. This actual cumulative distribution function is given the symbol F̂(x). To construct this empirical distribution, the data must first be arranged from smallest to largest. The following notation is taken from lecture 1. Let x(1) ≤ x(2) ≤ … ≤ x(n) represent this ascendingly ordered data set (where n is the sample size). The bracketed (i) in the ordered series is called the rank index of the particular data value and so i = 1, …, n. Consider the following hypothetical values for x(i).

i                         1         2         3         4         5         6
x(i)                      2         6         8         10        12        17
F̂(x(i)) = i/n            0.1667    0.3333    0.5       0.6667    0.8333    1
F̂(x(i)) = (i-1)/n        0         0.1667    0.3333    0.5       0.6667    0.8333
F̂(x(i)) = (i-0.5)/n      0.0833    0.25      0.4167    0.5833    0.75      0.9167

Two possible methods can be used to quantify this empirical distribution. First, F̂(x(i)) = i/n. In which case the empirical probability of observing a value for x less than or equal to x(1) = 2 is 1/6, the empirical probability of observing a value for x less than or equal to x(2) = 6 is 2/6, and so on. The problem with this method is that the empirical probability of observing a value for x less than or equal to x(n) = 17 is 6/6 or 1. This would suggest that it is impossible for x to exceed 17 in value. It is, however, not impossible to record an x value of more than 17 if the sample size were to be increased at a later date. But the formula i/n would not allow for this possibility.

To avoid this problem the empirical distribution could be defined as F̂(x(i)) = (i − 1)/n. In which case the empirical probability of observing a value for x less than or equal to x(n) = 17 is (6 − 1)/6 = 5/6, the empirical probability of observing a value for x less than or equal to x(2) = 6 is (2 − 1)/6 = 1/6, and so on. The problem with this method is that the empirical probability of observing a value for x less than or equal to x(1) = 2 is 0/6 or 0. This suggests it is impossible for x to be less than 2 in value. It is, however, not impossible to record an x value of less than 2 if the sample size were to be increased at a later date. But the formula (i − 1)/n would not allow for this possibility. As the graph of the previous table (shown below) illustrates, these two methods create problems at either end of the empirical distribution.

An obvious solution is to average these two estimators:

F̂(x(i)) = (i − 0.5)/n

as this formula allows a small chance of observing values of x less than 2 and x more than 17
in future testing. This estimator is called the mean estimator of the population cdf. This
averaging is seen in the grey empirical cdf in the figure below.

[Figure: Step plots of the three empirical cdf estimators, F̂(x) = i/n, F̂(x) = (i−1)/n and F̂(x) = (i−0.5)/n (grey), against x for the six-point sample above.]

In this illustration the smallest value x(1) characterises the lowest (1/6)*100 = 16.66% of
the unknown cumulative distribution. The next smallest value x(2) characterises the next
16.66% of the distribution and so on. Since the smallest value represents the proportion of the
distribution between 0% and 16.66% the best guess is to assign it a cumulative probability
exactly in-between these two numbers, i.e. 8.33%. That is, F̂(x(1)) = 8.33%. Similarly, the next
smallest value represents the proportion of the distribution between 16.66% and 33.33% and
so the best guess as to its cumulative probability is that it lies exactly in-between these two
numbers, i.e. 25%.

An alternative to this mean formula that is sometimes used is the median estimate of the population cdf

F̂(x(i)) = (i − 0.3)/(n + 0.4)

In all these calculations there is no need to use distribution parameters, such as λ in the exponential distribution. As such the empirical distribution is parameter free and so is sometimes also called a non-parametric estimator of the cdf.
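As an aside, the rank formulas above are easy to compute directly. The short Python sketch below is not part of the module's Excel workings; the function name and structure are illustrative only.

```python
import numpy as np

def empirical_cdf(data, method="mean"):
    """Empirical cdf F_hat(x(i)) of a sample, using one of the rank formulas above."""
    x = np.sort(np.asarray(data, dtype=float))   # ascending order x(1) <= ... <= x(n)
    n = len(x)
    i = np.arange(1, n + 1)                      # rank index i = 1, ..., n
    if method == "i/n":
        F = i / n
    elif method == "(i-1)/n":
        F = (i - 1) / n
    elif method == "mean":                       # mean rank estimator (i - 0.5)/n
        F = (i - 0.5) / n
    elif method == "median":                     # median rank estimator (i - 0.3)/(n + 0.4)
        F = (i - 0.3) / (n + 0.4)
    else:
        raise ValueError("unknown method")
    return x, F

# The hypothetical six-point sample from the table above
x, F = empirical_cdf([2, 6, 8, 10, 12, 17], method="mean")
print(np.round(F, 4))   # 0.0833, 0.25, 0.4167, 0.5833, 0.75, 0.9167
```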

B. Distributions as Linear Models

In what follows let the random variable W be some (possible) transformation of the
empirical cdf and V some (possible) transformation of the random variable X.

The simplest possible relationship that can exist is when w is related to the random
variable v in a proportional manner

w = b1v

A natural extension is to allow for non-proportionality by allowing the line defined by the equation to be offset from the origin through the inclusion of an additional unknown constant b0

w = b0 + b1v

b0 is called the intercept and b1 the slope of the line and these parameters can both be either
positive or negative in value. Many of the distributions looked at so far in this module can be
written out in one of these two ways.

i. The Exponential Distribution

The exponential distribution is an example of a proportional linear model. Normally the cdf is written as a function of the random variable X

F(x) = 1 − e^(−λx)

But defining w as

w = ln[1 − F(x)]

then gives

w = b1v with v = x and b1 = −λ.

ii. The Uniform Distribution

The cdf for a uniform random variable is written as

F(x) = (x − a)/(b − a)

But defining w as

w = F(x)

then

w = b0 + b1v with v = x, b0 = −a/(b − a) and b1 = 1/(b − a).

iii. The Weibull Distribution

For the Weibull distribution, its cdf is normally written as

F(x) = 1 − e^(−(ηx)^β)

But defining w as

w = ln{−ln[1 − F(x)]}

then gives

w = b0 + b1v

with

v = ln(x), b0 = β·ln(η) and b1 = β.

iv. The Normal Distribution

For the Normal distribution, w is the standardised or Z value and is found by reading off from the Z table the Z value associated with F(x). In Excel w is found using

w = z = NORMSINV(F(x))

But

z = (x − μ)/σ = −μ/σ + (1/σ)x

where μ is the population mean of x and σ the population standard deviation for x. Thus

w = b0 + b1v

with v = x, b0 = −μ/σ and b1 = 1/σ.
v. The Log Normal Distribution

If y = ln(x), then for the Log Normal distribution, w is the standardised or Z value and is found by reading off from the Z table the Z value associated with F(y). In Excel w is found using

w = z = NORMSINV(F(y))

But

z = (y − μy)/σy = −μy/σy + (1/σy)y

where μy is the population mean of y (the log mean) and σy the population standard deviation for y (the log standard deviation). Thus

w = b0 + b1v

with v = y = ln(x), b0 = −μy/σy and b1 = 1/σy.
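For reference, the transforms in this section can be collected into one helper. The Python sketch below is an illustration only (it is not taken from the module's Excel files); it uses scipy.stats.norm.ppf, which plays the same role as Excel's NORMSINV.

```python
import numpy as np
from scipy.stats import norm

def linearise(x, F_hat, dist):
    """Return (v, w) so that the chosen distribution implies w = b0 + b1*v."""
    x = np.asarray(x, dtype=float)
    F_hat = np.asarray(F_hat, dtype=float)
    if dist == "exponential":        # w = -lambda*x                  (b0 = 0, b1 = -lambda)
        return x, np.log(1.0 - F_hat)
    if dist == "uniform":            # w = -a/(b - a) + x/(b - a)
        return x, F_hat
    if dist == "weibull":            # w = beta*ln(eta) + beta*ln(x)
        return np.log(x), np.log(-np.log(1.0 - F_hat))
    if dist == "normal":             # w = -mu/sigma + x/sigma; ppf plays the role of NORMSINV
        return x, norm.ppf(F_hat)
    if dist == "lognormal":          # as normal, but with v = y = ln(x)
        return np.log(x), norm.ppf(F_hat)
    raise ValueError("unknown distribution")
```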

C. Linear Least Squares Method

When using the above linear models to represent the distributions, some complications emerge. First, F(x) is not known. An obvious solution to this is to replace it with the empirical cdf, F̂(x(i)). For example, the proportional linear representation of the exponential distribution then becomes

w(i) = ln[1 − F̂(x(i))]

with

w(i) = b1v(i) and v(i) = x(i) and b1 = −λ.

With w and v now quantifiable, the second problem is that if their values were plotted
out, they would not all fall on a straight line. The data points will be scattered about a line. This
scatter is accounted for by adding a random error or residual e(i) term to the above linear
models. For example,

w(i) = b1v(i) + e(i) or w(i) =b0 + b1v(i) + e(i)

Each of the e(i) residuals then measures the vertical distance between the drawn line
and the actual data point and as such some residuals will be positive and some negative.

Example 1: Fatigue of ceramic ball bearings

Consider again the ceramic ball bearings data and assume this data has an exponential distribution. To estimate λ, the calculations shown in the next table need to be carried out. Look first at the 2nd column of the table below. It contains the actual fatigue lives; these times are sorted from lowest to highest. Assuming the data has an exponential distribution, x(i) also equals v(i).

Index, i   Sorted Fatigue Life, v(i) = x(i)   Empirical cdf, F̂(x(i))    w(i) = ln[1 − F̂(x(i))]
1          1.67                               =(1-0.5)/10 = 0.05         =ln(1-0.05) = -0.05129
2          2.2                                =(2-0.5)/10 = 0.15         =ln(1-0.15) = -0.16252
3          2.51                               0.25                       -0.287682072
4          3                                  0.35                       -0.430782916
5          3.9                                0.45                       -0.597837001
6          4.7                                0.55                       -0.798507696
7          7.53                               0.65                       -1.049822124
8          14.7                               0.75                       -1.386294361
9          27.8                               0.85                       -1.897119985
10         37.4                               0.95                       -2.995732274

Column 1 is the rank index used to calculate the empirical cdf given in column 3. The empirical probability that x is 1.67 or less is (i − 0.5)/10 = (1 − 0.5)/10 = 0.05 or 5%. Column 3 illustrates some additional calculations for the rest of the data. Assuming that X has an exponential distribution, w(i) is defined as w(i) = ln[1 − F̂(x(i))]. So for the smallest value of x, w(1) = ln[1 − 0.05] = −0.05129. The last column in the table above shows additional calculations for the larger values of X. The graph below then plots column 2 against column 4. No matter how you draw a line out from the origin it is impossible to have all the data points falling on the line. The line shown in the graph is the best fit to the data given it must start at the origin. The seventh row of the above table has the biggest vertical distance between the data point and the best fit line, so e(7) is the largest error; it is negative in value because the data point lies below the line.

[Figure: Plot of the transformed empirical cdf, w(i), against v(i), with the best fit line through the origin, w(i) = −0.0805v(i). The largest residual, e(7) = −1.0498 − (−0.6061) = −0.4437, is marked below the line.]
In what sense is the line shown in the above graph a best fit line? Why is b1 = −0.0805? Ideally, the line should be drawn so as to best fit the data. Best fit in turn can be defined in terms of all the residuals, with the best fit line being the one which minimises the sum of all the residuals once they have been squared, hence the name least squares estimator. The squares of the residuals are taken to stop positive and negative residuals offsetting each other in the summation. Consider first the proportional linear model. b1 is chosen to minimise the sum of the squares of the residuals (SSres)


SSres = Σ e(i)² = Σ [w(i) − b1v(i)]² = Σ w(i)² − 2b1 Σ w(i)v(i) + b1² Σ v(i)²   (all sums over i = 1, …, n)

where n equals the number of experimental data points available on v and w. From calculus,
b1 minimises SSres if it satisfies the equation

∂SSres/∂b1 = −2 Σ w(i)v(i) + 2b1 Σ v(i)² = 0

Solving for b1 gives

b1 = Σ w(i)v(i) / Σ v(i)²

The predicted value for w, often labelled ŵ(i), is given by

ŵ(i) = b1v(i)

This is often referred to as the regression line or best fit line. The residuals can then be calculated from

e(i) = w(i) − ŵ(i) = w(i) − b1v(i)

By definition, a best fit line should go through the middle of all the data points so that
the negative and positive residuals should cancel to zero when summed, meaning the average
value for the residuals should be zero. The variance of the residuals, often called the mean
squared residuals (MSres), can therefore be calculated using the standard formula for a variance

MSres = [1/(n − 1)] Σ (e(i) − ē)² = [1/(n − 1)] Σ e(i)² = SSres/(n − 1)

Given that the MSres is a measure of variation or scatter in the data about the regression
line, and so not picked up by this best fit line, this statistic is a measure of lack of fit. The
variance in w, called the mean square total (MStotal) in regression analysis, is by definition
given by

s² = MStotal = Σ [w(i) − w̄]² / (n − 1)

It is useful to know what proportion of the variation in W is made up of the residual variation. One minus this ratio is called the adjusted coefficient of determination, or R²adj for short

R²adj = 1 − MSres/MStotal

As such R²adj measures the percentage of the variation in w that can be explained by the regression line or line of best fit. R²adj = 1 only when MSres = 0, i.e. when all the data points fall on the best fit line. Then the regression line explains all the variation in W.
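A minimal Python sketch of this proportional (through-origin) fit, assuming the formulas above; the function and variable names are illustrative and it is not part of the module's Excel workings.

```python
import numpy as np

def fit_proportional(v, w):
    """Least squares fit of w = b1*v through the origin, with lack-of-fit statistics."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(v)
    b1 = np.sum(w * v) / np.sum(v ** 2)     # b1 = sum[w(i)v(i)] / sum[v(i)^2]
    e = w - b1 * v                          # residuals about the fitted line
    ms_res = np.sum(e ** 2) / (n - 1)       # MSres = SSres/(n - 1), one parameter estimated
    ms_total = np.var(w, ddof=1)            # MStotal = sample variance of w
    r2_adj = 1.0 - ms_res / ms_total        # adjusted coefficient of determination
    return b1, ms_res, ms_total, r2_adj
```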

Example 2: Fatigue of ceramic ball bearings

The application of the above formulas to the fatigue life data is shown in sheet Exponential
of the Excel file least squares Excel Workings. The screen shot below is taken from this file
and shows the above formulas executed in Excel.

Range G4:G13 squares the v(i) values and range H4:H13 multiplies v(i) by w(i). In cell N6, each of these columns is summed using the SUM function and the ratio of the two totals is calculated to yield the value for b1. The negative of b1 is the least squares estimate for λ of the exponential distribution, as shown in cell N9. This is the value shown in the last graph and so defines the best fit line in the graph. With b1 known it is possible to work out the predicted values or points on the best fit line and this is done in column range J4:J13. For example, the first predicted value is given by b1v(1) = −0.08049(1.67) = −0.1344. Thus the first residual is the difference between this prediction and the actual value for w(i), or −0.05129 − (−0.1344) = 0.083. The other residuals are shown in range K4:K13, including the one shown in the above graph.

In cell N12 these residuals are all squared and these squared residuals added up using the function SUMSQ. Dividing this by the sample size less one gives the mean squared residuals. In Excel MStotal can be obtained using the VAR function because MStotal is nothing more than the sample variance of w. Then in cell N13 these formulas are combined to give an R²adj value of 0.9125. That is, the best fit line explains 91.25% of the variation present in w(i). Study this Excel sheet to become familiar with these calculations.
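The same calculation can be checked outside Excel. The hedged Python sketch below applies the proportional-model formulas to the fatigue data in the table above and should reproduce approximately the values reported (λ ≈ 0.0805, R²adj ≈ 0.9125); it is an illustration, not the Excel sheet itself.

```python
import numpy as np

# Ceramic ball bearing fatigue lives, as in the table of Example 1
life = np.array([1.67, 2.2, 2.51, 3.0, 3.9, 4.7, 7.53, 14.7, 27.8, 37.4])
n = len(life)
i = np.arange(1, n + 1)

F_hat = (i - 0.5) / n                 # mean rank empirical cdf
v = life                              # exponential model: v = x
w = np.log(1.0 - F_hat)               # w = ln[1 - F_hat(x(i))]

b1 = np.sum(w * v) / np.sum(v ** 2)   # through-origin least squares slope
lam = -b1                             # lambda estimate is the negative of b1

e = w - b1 * v                        # residuals
ms_res = np.sum(e ** 2) / (n - 1)
ms_total = np.var(w, ddof=1)
r2_adj = 1.0 - ms_res / ms_total

print(round(lam, 4), round(r2_adj, 4))   # roughly 0.0805 and 0.9125
```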
For the non-proportional model:

SSres = Σ e(i)² = Σ [w(i) − (b0 + b1v(i))]²

where n equals the number of experimental data points available on v and w. From calculus, the values for b0 and b1 that minimise SSres are found by solving both

∂SSres/∂b0 = ∂/∂b0 Σ [w(i) − (b0 + b1v(i))]² = 0

and

∂SSres/∂b1 = ∂/∂b1 Σ [w(i) − (b0 + b1v(i))]² = 0

This involves solving two equations simultaneously and it can be shown that the solution to this is given by

b1 = SSvw / SSvv

where

SSvv = Σ [v(i) − v̄]²   and   SSvw = Σ [w(i) − w̄][v(i) − v̄]

and

b0 = w̄ − b1v̄

where w̄ and v̄ are the sample mean values of w and v respectively. The variance of the residuals, often called the mean squared residuals (MSres), can therefore be calculated using the formula for a variance

MSres = [1/(n − k)] Σ (e(i) − ē)² = [1/(n − k)] Σ e(i)² = SSres/(n − k)

where k is the number of unknown parameters in the model that require estimation. Notice the denominator is n − k and not the more usual n − 1 for a normal sample variance calculation. This is because normally only 1 degree of freedom is lost in calculating the average in the sample variance formula. But in calculating SSres, k parameters have to be estimated, so k degrees of freedom are lost.
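A corresponding Python sketch of the two-parameter fit, following the SSvv, SSvw and n − k formulas above; again the code is illustrative rather than the module's Excel implementation.

```python
import numpy as np

def fit_linear(v, w, k=2):
    """Least squares fit of w = b0 + b1*v, with lack-of-fit statistics."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(v)
    ss_vv = np.sum((v - v.mean()) ** 2)               # SSvv
    ss_vw = np.sum((w - w.mean()) * (v - v.mean()))   # SSvw
    b1 = ss_vw / ss_vv
    b0 = w.mean() - b1 * v.mean()
    e = w - (b0 + b1 * v)                             # residuals
    ms_res = np.sum(e ** 2) / (n - k)                 # k parameters estimated
    ms_total = np.var(w, ddof=1)
    r2_adj = 1.0 - ms_res / ms_total
    return b0, b1, ms_res, ms_total, r2_adj
```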

Example 3: Fatigue of ceramic ball bearings

Consider again the ceramic ball bearings data and assume this time that the data has a Weibull distribution. To find β and η, the calculations shown in the next table need to be carried out. Look first at the 2nd column of the table below. It contains the actual fatigue lives and these times are sorted from lowest to highest. Assuming the data has a Weibull distribution, ln(x(i)) equals v(i) and so in column 3 these log failure times are calculated.
Index, i   Sorted Fatigue Life, x(i)   Log Fatigue Life, v(i) = ln[x(i)]   Empirical cdf, F̂(x(i))   w(i) = ln{−ln[1 − F̂(x(i))]}
1          1.67                        =LN(1.67) = 0.5128                  =(1-0.5)/10 = 0.05        =ln{-ln(1-0.05)} = -2.9702
2          2.2                         =LN(2.2) = 0.7885                   =(2-0.5)/10 = 0.15        =ln{-ln(1-0.15)} = -1.81696
3          2.51                        0.920282753                         0.25                      -1.245899324
4          3                           1.098612289                         0.35                      -0.842150991
5          3.9                         1.360976553                         0.45                      -0.514437136
6          4.7                         1.547562509                         0.55                      -0.225010673
7          7.53                        2.018895042                         0.65                      0.048620745
8          14.7                        2.687847494                         0.75                      0.32663426
9          27.8                        3.325036021                         0.85                      0.640336939
10         37.4                        3.621670704                         0.95                      1.0971887

Assuming that X has a Weibull distribution, w(i) is defined as w(i) = ln{−ln[1 − F̂(x(i))]}. So for the smallest value of x, w(1) = ln{−ln[1 − 0.05]} = −2.9702. The last column in the table above shows additional calculations for the larger values of X. The graph below then plots column 3 against column 5. No matter how you draw a line with an intercept it is impossible to have all the data points falling on the line. The line shown in the graph is the best fit to the data given a non-zero intercept. The first row of the above table has the biggest vertical distance between the data point and the best fit line, so e(1) is the largest error; it is negative in value because the data point lies below the line.

[Figure: Plot of the transformed empirical cdf, w(i), against v(i) = ln[x(i)], with the best fit line w(i) = 1.015v(i) − 2.3652. The largest residual, e(1) = −2.970 − (−1.845) = −1.13, is marked below the line.]

The application of the above non-proportional model formulas to the fatigue life data is shown in sheet Weibull of the Excel file least squares Excel Workings. The screen shot below is taken from this file and shows the above formulas executed in Excel, and reveals how the best fit line shown in the figure above is obtained.

In range E4:E13, the deviations of w(i) from its mean value are computed and in range F4:F13 the deviations of v(i) from its mean value are computed. Range H4:H13 then multiplies these two deviations together, whilst range G4:G13 squares the v(i) deviations. The above formulas show that SSvv and SSvw are the sums of the numbers in columns G and H respectively, and so the SUM function is used in cells N3 and N4 to compute these values. The ratio of these two values yields the value for b1 and this is computed in cell N6. This is also the least squares estimate for β and this is calculated in cell N10. The formulas above show that b0 is then given by the mean value for w(i) minus the product of b1 and the mean of v(i). This calculation is done in cell N7. ln(η) is then given by the ratio of b0 to b1 and so is calculated this way in cell N9. These values for b0 and b1 are the same as those shown in the last graph and so define the best fit line in that graph.

With b0 and b1 known it is possible to work out the predicted values or points on the best fit line and this is done in range J4:J13. For example, the first predicted value is given by b0 + b1v(1) = −2.3652 + 1.0149(0.5128) = −1.8447. Thus the first residual is the difference between this prediction and the actual value for w(i), or −2.9702 − (−1.8447) = −1.125. The other residuals are shown in range K4:K13, including the one shown in the above graph.

In cell N12 these residuals are all squared and these squared residuals added up using the function SUMSQ. Dividing this by the sample size less k (in this case, with two parameters, k = 2) gives the mean squared residuals. In Excel MStotal can be obtained using the VAR function because MStotal is nothing more than the variance of w. Thus in cell N13 these formulas are combined to give an R²adj value of 0.7991. That is, the best fit line explains 79.91% of the variation present in w(i). R², or the coefficient of determination, does not adjust for the degrees of freedom and so

R² = 1 − [MSres (n − k)] / [MStotal (n − 1)]

Note that when b0 = 0, R² = R²adj because k (the number of estimated parameters) = 1.


Study this Excel sheet to become familiar with these calculations.
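As a hedged cross-check on Example 3 (illustrative Python, not the Excel sheet itself), the Weibull fit can be reproduced from the same fatigue data, with β taken as the slope b1 and ln(η) = b0/b1 as described above.

```python
import numpy as np

life = np.array([1.67, 2.2, 2.51, 3.0, 3.9, 4.7, 7.53, 14.7, 27.8, 37.4])
n = len(life)
i = np.arange(1, n + 1)

F_hat = (i - 0.5) / n                       # mean rank empirical cdf
v = np.log(life)                            # Weibull model: v = ln(x)
w = np.log(-np.log(1.0 - F_hat))            # w = ln{-ln[1 - F_hat(x(i))]}

ss_vv = np.sum((v - v.mean()) ** 2)
ss_vw = np.sum((w - w.mean()) * (v - v.mean()))
b1 = ss_vw / ss_vv                          # slope = beta estimate
b0 = w.mean() - b1 * v.mean()               # intercept = beta*ln(eta)
beta = b1
eta = np.exp(b0 / b1)                       # ln(eta) = b0/b1

e = w - (b0 + b1 * v)
ms_res = np.sum(e ** 2) / (n - 2)           # k = 2 parameters estimated
r2_adj = 1.0 - ms_res / np.var(w, ddof=1)

print(round(beta, 3), round(b0, 4), round(r2_adj, 4))   # roughly 1.015, -2.365, 0.799
```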

Sheets Normal and Log Normal of the Excel file least squares Excel Workings show how to obtain the least squares estimates of the parameters of the normal and log normal distributions. Study them in detail and relate them to the formulas above.

D. Probability Plots

By plotting various types of cumulative distributions (such as the normal, log normal or Weibull) against the actual or empirical distribution, it is possible to see which one best describes the sample of data being analysed. Such a cross plot is called a probability plot.

Example 4: Fatigue of ceramic ball bearings

Consider a probability plot for the exponential distribution. The cdf for this distribution is F(x) = 1 − e^(−λx). Substituting in the above least squares estimate for λ gives F(x) = 1 − e^(−0.0805x). In sheet Exponential of the Excel file least squares Excel Workings this formula is used to calculate the numbers shown in cells E22:E31. This is often referred to as the modelled cdf. This is then plotted against the empirical cdf, calculated as above, to give the following probability plot.

[Figure: Probability plot of the empirical cdf against the modelled cdf based on λ = 0.0805, with a 45 degree reference line.]

If the exponential distribution perfectly described this fatigue data, all the points on this plot should fall on the 45 degree line (the blue line above) because then the empirical and modelled cdfs are identical. So large deviations of the data points from the 45 degree line can be taken as evidence to suggest the selected distribution (the exponential in this illustration) is not a good description of the data (in this case fatigue life) and so another distribution may be better. To this effect, in cells F22:F31 the absolute deviations of the data points in the probability plot from the 45 degree line are calculated. The largest of these deviations (shown in red and called the maximum absolute deviation, or MAD for short) can then be used to compare distributions. The best distribution to use for a given set of data would then be the one with the smallest maximum absolute deviation. Looking through all the other sheets in the Excel file reveals that the Weibull distribution has a smaller maximum absolute deviation than the exponential, but the log normal distribution has the smallest maximum deviation at 0.133. Thus fatigue life is best described using the log normal distribution; that is, fatigue life appears to have a log normal distribution.
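A short, illustrative Python sketch of this probability-plot comparison for the exponential distribution; the λ value is the least squares estimate found earlier, and the MAD calculation mirrors the description above rather than the exact Excel layout.

```python
import numpy as np

life = np.array([1.67, 2.2, 2.51, 3.0, 3.9, 4.7, 7.53, 14.7, 27.8, 37.4])
n = len(life)
F_hat = (np.arange(1, n + 1) - 0.5) / n          # empirical cdf (mean rank)

# Modelled cdf under the fitted exponential distribution, lambda = 0.0805
F_model = 1.0 - np.exp(-0.0805 * np.sort(life))

# In the probability plot the empirical cdf is plotted against the modelled cdf;
# perfect agreement puts every point on the 45 degree line, so the maximum
# absolute deviation from that line is one way to compare candidate distributions.
mad = np.max(np.abs(F_hat - F_model))
print(round(mad, 3))
```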
E. Bias and Consistency

Play around with the Excel file Sim2 to see whether the least squares estimators of distribution parameters are biased in small and large samples, and compare this to the findings from the Excel file Sim1 to see which estimator (least squares or method of moments) is best for which distribution. Your findings will be required to tackle some of the questions in your 3rd assignment.
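Sim2 itself is not reproduced here, but a hedged Monte Carlo sketch along the same lines (illustrative Python, with an arbitrarily chosen true λ and sample size) would repeatedly draw samples from a known exponential distribution, apply the least squares estimator, and compare the average estimate with the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_lam = 0.1                              # arbitrarily chosen true parameter
n, reps = 10, 5000                          # small sample; increase n to study consistency

estimates = np.empty(reps)
for r in range(reps):
    x = np.sort(rng.exponential(1.0 / true_lam, size=n))
    F_hat = (np.arange(1, n + 1) - 0.5) / n
    w = np.log(1.0 - F_hat)
    estimates[r] = -np.sum(w * x) / np.sum(x ** 2)   # least squares estimate of lambda

print(estimates.mean())                     # compare with true_lam to judge bias
```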
