
1 The Gaussian or Normal Probability Density Function

The document discusses various probability distributions, focusing primarily on the Gaussian or normal distribution, which describes the population of possible errors in measurements influenced by multiple independent sources of random error. It also covers the Student's t-distribution and the chi-squared distribution, explaining their applications in statistical analysis. Additionally, the document highlights confidence levels associated with the Gaussian distribution, including the empirical rule for standard deviations and the significance of confidence intervals in engineering statistics.


Analysis of Experimental

Data
The Gaussian or Normal Probability Density Function
Probability Distributions
• The distributions that will be considered in this lesson are:
1. The Gaussian, or normal, probability distribution.
• When examining experimental data, this distribution is undoubtedly
the first that is considered.
• The Gaussian distribution describes the population of possible errors
in a measurement when many independent sources of error
contribute simultaneously to the total precision error in a
measurement.
Probability Distributions
• These sources of error must be unrelated, random, and of roughly the
same size.
• Although we will emphasize this particular distribution, you must
keep in mind that data do not always abide by the normal
distribution.
• For tabulation and calculation, the Gaussian distribution is recast in a
standard form, sometimes called the z-distribution, via the transformation
z = (x − μ)/σ (see table).
Probability Distributions
2. Student’s t-distribution. This distribution is used in predicting the
mean value of a Gaussian population when only a small sample of
data is available:

t = (x̄ − μ) / (S/√n)

where x̄ is the sample mean, S is the sample standard deviation, n is the
number of data points in the sample, and μ is the population mean
or expected value.
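As a quick numerical illustration of the t statistic, here is a minimal Python sketch; the sample values and hypothesized mean below are made up for illustration and do not come from the text.

```python
import math
import statistics

# Hypothetical sample data (for illustration only)
sample = [9.8, 10.2, 10.1, 9.9, 10.0]
mu = 10.1                                # hypothesized population mean

n = len(sample)
x_bar = statistics.mean(sample)          # sample mean
s = statistics.stdev(sample)             # sample standard deviation (n - 1 divisor)

# Student's t statistic: t = (x_bar - mu) / (s / sqrt(n))
t = (x_bar - mu) / (s / math.sqrt(n))
print(round(t, 3))  # -1.414
```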
Probability Distributions
3. The 2 – distribution. This distribution helps in predicting the width or
scatter of a population’s distribution, in comparing the uniformity of
samples, and in checking the goodness of fit for assumed
distributions.
The Gaussian or Normal Error Distribution
• Suppose an experimental observation is made and some particular
result is recorded.
• We know (or would strongly suspect) that the observation has been
subjected to many random errors.
• These random errors may make the final reading either too large or
too small, depending on many circumstances which are unknown to
us.
• Assuming that there are many small errors that contribute to the final
error and that each small error is of equal magnitude and equally
likely to be positive or negative, the Gaussian or normal error
distribution may be derived.
Gaussian or normal PDF
• Gaussian or normal PDF – The Gaussian probability density function
(also called the normal probability density function or simply the
normal PDF) is the vertically normalized PDF that is produced from a
signal or measurement that has purely random errors.
• The normal probability density function is

f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π))
Gaussian or normal PDF
• If the measurement is designated by x, the Gaussian distribution gives
the probability that the measurement will lie between x and x + dx and
is written

P(x) dx = exp(−(x − xm)² / (2σ²)) / (σ√(2π)) dx

• In this expression xm is the mean reading and σ is the standard
deviation.
• Some may prefer to call P(x) the probability density.
Gaussian or normal PDF
• Here are some of the properties of this special distribution:
• It is symmetric about the mean.
• The mean and median are both equal to μ, the expected value (at the
peak of the distribution); for the Gaussian, the mode coincides with
them as well.
• Its plot is commonly called a “bell curve” because of its shape.
• The actual shape depends on the magnitude of the standard
deviation. Namely, if σ is small, the bell will be tall and skinny, while
if σ is large, the bell will be short and fat, as sketched.
Standard normal density function
• All of the Gaussian PDF cases, for any mean value and for any
standard deviation, can be collapsed into one normalized curve called
the standard normal density function.
• This normalization is accomplished through the variable
transformation introduced previously, i.e.,

z = (x − μ)/σ,  f(z) = exp(−z²/2) / √(2π)
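The standard normal density can be evaluated directly; a small Python sketch (the function name `f` is mine, mirroring the notation in the text):

```python
import math

def f(z):
    """Standard normal density: f(z) = exp(-z**2 / 2) / sqrt(2*pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

print(round(f(0), 4))   # 0.3989, the peak of the bell curve
print(round(f(1), 4))   # 0.242, one standard deviation out
```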
Standard normal density function
• This standard normal density
function is valid for any signal
measurement, with any mean,
and with any standard deviation,
provided that the errors
(deviations) are purely random.
• A plot of the standard normal
(Gaussian) density function was
generated in Excel, using the
above equation for f(z). It is
shown to the right.
Standard normal density function
• It turns out that the probability that variable x lies between some
range x1 and x2 is the same as the probability that the transformed
variable z lies between the corresponding range z1 and z2, where z is
the transformed variable defined above. In other words,

P(x1 ≤ x ≤ x2) = P(z1 ≤ z ≤ z2)

• Note that z is dimensionless, so there are no units to worry about, so
long as the mean and the standard deviation are expressed in the
same units.
Standard normal density function
• Furthermore, since the total area under f(z) must be unity,

∫ f(z) dz (from z = −∞ to ∞) = 1

it follows that the probability is the area under the curve,

P(z1 ≤ z ≤ z2) = ∫ f(z) dz (from z1 to z2)

• We define A(z) as the area under the curve between 0 and z, i.e., the
special case where z1 = 0 in the above integral, and z2 is simply z. In
other words, A(z) is the probability that a measurement lies between
0 and z, or

A(z) = ∫ f(z′) dz′ (from 0 to z)
Standard normal density function
• Below is a table of A(z), produced using Excel, which has a built-in
error function, ERF(value). Excel has another function that can be
used to calculate A(z), namely A(z) = NORMSDIST(ABS(z)) − 0.5
• To read the value of A(z) at a particular value of z,
• Go down to the row representing the first two digits of z.
• Go across to the column representing the third digit of z.
• Read the value of A(z) from the table.
• Example: At z = 2.54, A(z) = A(2.5 + 0.04) = 0.49446. These values are
highlighted in the above table as an example.
• Since the normal PDF is symmetric, A(-z) = A(z), so there is no need to
tabulate negative values of z.
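A(z) can also be computed without a table, using the error function; Python's `math.erf` plays the role of Excel's ERF here (the function name `A` follows the text's notation):

```python
import math

def A(z):
    """Area under the standard normal PDF between 0 and z.

    A(z) = erf(|z| / sqrt(2)) / 2; the PDF is symmetric, so A(-z) = A(z).
    """
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

print(round(A(2.54), 5))   # 0.49446, matching the table-lookup example
print(round(A(-2.54), 5))  # 0.49446, by symmetry
```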
Linear interpolation: for a value of z between two tabulated entries z1 and z2,

A(z) ≈ A(z1) + [(z − z1)/(z2 − z1)] · [A(z2) − A(z1)]
Special cases:
• If z = 0, obviously the integral A(z) = 0. This means physically that
there is zero probability that x will exactly equal the mean! (To be
exactly equal would require equality out to an infinite number of
decimal places, which will never happen.)
• If z = ∞, A(z) = 1/2 since f(z) is symmetric. This means that there is a
50% probability that x is greater than the mean value. In other words,
z = 0 represents the median value of x.
• Likewise, if z = –∞ , A(z) = 1/2. There is a 50% probability that x is less
than the mean value.
Special cases:
• If z = 1, it turns out that A(1) = 0.3413 to four significant digits.
This is a special case, since by definition z = (x − μ)/σ. Therefore, z = 1
represents a value of x exactly one standard deviation greater than
the mean.
• A similar situation occurs for z = –1 since f(z) is symmetric, and
A(–1) = 0.3413 to four significant digits. Thus, z = –1 represents a
value of x exactly one standard deviation less than the mean.
Special cases:
• Because of this symmetry, we conclude that the probability that z lies
between –1 and 1 is 2(0.3413) = 0.6826 or 68.26%. In other words,
there is a 68.26% probability that for some measurement, the
transformed variable z lies within ± one standard deviation from the
mean (which is zero for this pdf).
• Translated back to the original measured variable x,
P(μ − σ < x < μ + σ) = 68.26%. In other words, the probability that a
measurement lies within ± one standard deviation from the mean is 68.26%.
Confidence level
• The above illustration leads to an important concept called
confidence level. For the above case, we are 68.26% confident that
any random measurement of x will lie within ± one standard
deviation from the mean value.
• I would not bet my life savings on something with a 68% confidence
level. A higher confidence level is obtained by choosing a larger z
value. For example, for z = 2 (two standard deviations away from the
mean), it turns out that A(2) = 0.4772 to four significant digits.
Confidence level
• Again, due to symmetry, multiplication by two yields the probability
that x lies within two standard deviations from the mean value, either
to the right or to the left. Since 2(0.4772) = 0.9544, we are 95.44%
confident that x lies within ± two standard deviations of the mean.
• Since 95.44 is close to 95, most engineers and statisticians ignore the
last two digits and state simply that there is about a 95% confidence
level that x lies within ± two standard deviations from the mean. This
is in fact the engineering standard, called the “two sigma confidence
level” or the “95% confidence level.”
Confidence level
• For example, when a manufacturer reports the value of a property,
like resistance, the report may state “R = 100 ± 9 Ω (ohms) with 95%
confidence.” This means that the mean value of resistance is 100 Ω,
and that 9 ohms represents two standard deviations from the mean.
• In fact, the words “with 95% confidence” are often not even written
explicitly, but are implied. In this example, by the way, you can easily
calculate the standard deviation. Namely, since 95% confidence level
is about the same as 2 sigma confidence, 2 σ ≈ 9 Ω, or σ = 4.5 Ω.
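The arithmetic in the resistor example can be sketched as follows (the numbers come from the text; the variable names are mine):

```python
# "R = 100 +/- 9 ohms with 95% confidence": the half-width is ~ two sigma
mean_R = 100.0      # mean resistance, ohms
half_width = 9.0    # ohms; 95% confidence half-width, i.e. ~ 2 sigma

# 95% confidence ~ 2 sigma, so sigma is half the half-width
sigma = half_width / 2
print(sigma)  # 4.5
```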
Confidence level
• For more stringent standards, the confidence level is sometimes
raised to three sigma. For z = 3 (three standard deviations away from
the mean), it turns out that A(3) = 0.4987 to four significant digits.
• Multiplication by two (because of symmetry) yields the probability
that x lies within ± three standard deviations from the mean value.
Since 2(0.4987) = 0.9974, we are 99.74% confident that x lies within ±
three standard deviations from the mean.
Confidence level
• Most engineers and statisticians round down and state simply that
there is about a 99.7% confidence level that x lies within ± three
standard deviations from the mean. This is in fact a stricter
engineering standard, called the “three sigma confidence level” or the
“99.7% confidence level.”
• Summary of confidence levels: The empirical rule states that for any
normal or Gaussian PDF,
Confidence level
• Approximately 68% of the values fall within 1 standard deviation from
the mean in either direction.
• Approximately 95% of the values fall within 2 standard deviations
from the mean in either direction. [This one is the standard “two
sigma” engineering confidence level for most measurements.]
• Approximately 99.7% of the values fall within 3 standard deviations
from the mean in either direction. [This one is the stricter “three
sigma” engineering confidence level for more precise measurements.]
• More recently, many manufacturers are striving for “six sigma”
confidence levels.
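The empirical-rule percentages follow directly from the standard normal integral; a quick check in Python (the function name is mine):

```python
import math

def fraction_within(k):
    """Fraction of a Gaussian population lying within +/- k standard
    deviations of the mean: erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# prints 0.6827, 0.9545, 0.9973 for k = 1, 2, 3
```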
Example:
• Given: The same 1000 temperature measurements used in a previous
example for generating a histogram and a PDF. The data are provided
in an Excel spreadsheet.
• To do: (a) Compare the normalized PDF of these data to the normal
(Gaussian) PDF. Are the measurement errors in this sample purely
random? (b) Predict how many of the temperature measurements are
greater than 33.0 °C, and compare with the actual number.
Solution:
• We plot the experimentally generated
PDF (blue circles) and the theoretical
normal PDF (red curve) on the same
plot. The agreement is excellent,
indicating that the errors are very nearly
random. Of course, the agreement is not
perfect – this is because n is finite. If n
were to increase, we would expect the
agreement to get better (less scatter and
difference between the experimental
and theoretical PDFs).
Solution:
• For this data set, we had calculated the sample mean to be x̄ =
31.009 and the sample standard deviation to be S = 1.488. Since n = 1000,
the sample size is large enough to assume that the expected value μ is
nearly equal to x̄, and the standard deviation σ is nearly equal to S. At the
given value of temperature (set x = 33.0 °C), we normalize to obtain z,
namely,

z = (x − μ)/σ ≈ (33.0 − 31.009)/1.488 = 1.338
Solution:
• We calculate area A(z), either by
interpolation from the above table or by
direct calculation. The table yields A(z) =
0.40955
• This means that 40.955% of the
measurements are predicted to lie between
the mean (31.009 °C) and the given value of
33.0 °C (red area on the plot). The percentage
of measurements greater than 33.0 °C is
50% − 40.955% = 9.045% (blue area on the
plot).
Solution:
• Since n = 1000, we predict that
0.09045 × 1000 = 90.45 of the
measurements exceed 33.0 °C.
• Rounding to the nearest integer, we
predict that 90 measurements are
greater than 33.0 °C. Looking at the
actual data, we count 81 temperature
readings greater than 33.0 °C.
Discussion:
• The percentage error between the actual and
predicted number of measurements is around
−10%. This error would be expected to decrease if
n were larger.
• If we had asked for the probability that T lies
between the mean value and 33.0 °C, the result
would have been 0.4096 (to four digits), as
indicated by the red area in the above plot.
• However, we are concerned here with the
probability that T is greater than 33.0 °C, which is
represented by the blue area on the plot.
Discussion:
• This is why we had to subtract from 50% in the
above calculation (50% of the measurements are
greater than the mean), i.e., the probability that T
is greater than 33.0 °C is 0.5000 − 0.4096 = 0.0904.
• Excel’s built-in NORMSDIST function returns the
cumulative area from −∞ to z, the orange-colored
area in the plot to the right. Thus, at z = 1.338,
NORMSDIST(z) = 0.909552. This is the entire area
on the left half of the Gaussian PDF (0.5) plus the
area labeled A(z) in the above plot. The desired
blue area is therefore equal to 1 − NORMSDIST(z).
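The whole temperature calculation can be reproduced in Python; `math.erf` stands in for Excel's NORMSDIST via Φ(z) = (1 + erf(z/√2))/2 (the names `Phi`, `p_above`, `predicted` are mine):

```python
import math

x_bar, s, n = 31.009, 1.488, 1000   # sample mean, std dev, size (from the text)
x = 33.0                            # threshold temperature, deg C

z = (x - x_bar) / s                 # standardized value, ~1.338

def Phi(z):
    """Cumulative standard normal CDF, equivalent to Excel's NORMSDIST."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p_above = 1 - Phi(z)                # probability that T > 33.0 deg C, ~0.0904
predicted = round(n * p_above)      # predicted count above threshold, ~90
print(round(z, 3), round(p_above, 4), predicted)
```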
Confidence level and level of significance
• Confidence level, c, is defined as the
probability that a random variable lies
within a specified range of values. The
range of values itself is called the
confidence interval. For example, as
discussed above, we are 95.44% confident
that a purely random variable lies within ±
two standard deviations from the mean.
We state this as a confidence level of c =
95.44%, which we usually round off to
95% for practical engineering statistical
analysis.
Confidence level and level of significance
• Level of significance, 𝛼, is defined as
the probability that a random variable
lies outside of a specified range of
values. In the above example, we are
100 – 95.44 = 4.56% confident that a
purely random variable lies either
below or above two standard
deviations from the mean. (We usually
round this off to 5% for practical
engineering statistical analysis.)
Confidence level and level of significance
• Mathematically, confidence level and level of significance must add to
1 (or in terms of percentage, to 100%) since they are complementary,
i.e., 𝛼+c=1 or c=1- 𝛼.
• Confidence level is sometimes given the symbol c% when it is
expressed as a percentage; e.g., at 95% confidence level, c = 0.95, c%
= 95%, and 𝛼 = 1 – c = 0.05.
• Both 𝛼 and confidence level c represent probabilities, or areas under
the PDF, as sketched above for the normal or Gaussian PDF.
Confidence level and level of significance
• The blue areas in the above plot are
called the tails. There are two tails, one
on the far left and one on the far right.
The two tails together represent all the
data outside of the confidence interval,
as sketched.
• Caution: The area of one of the tails is
only 𝛼/2, not 𝛼. This factor of two has
led to much grief, so be careful that
you do not forget it.
Range | Expected fraction of population inside range | Approximate expected frequency outside range | Approximate frequency for daily event
μ ± 0.5σ | 0.382924923 | 3 in 5 | Four or five times a week
μ ± σ | 0.682689492 | 1 in 3 | Twice a week
μ ± 1.5σ | 0.866385597 | 1 in 7 | Weekly
μ ± 2σ | 0.954499736 | 1 in 22 | Every three weeks
μ ± 2.5σ | 0.987580669 | 1 in 81 | Quarterly
μ ± 3σ | 0.997300204 | 1 in 370 | Yearly
μ ± 3.5σ | 0.999534742 | 1 in 2149 | Every 6 years
μ ± 4σ | 0.999936658 | 1 in 15787 | Every 43 years (twice in a lifetime)
μ ± 4.5σ | 0.999993205 | 1 in 147160 | Every 403 years (once in the modern era)
μ ± 5σ | 0.999999427 | 1 in 1744278 | Every 4776 years (once in recorded history)
μ ± 5.5σ | 0.999999962 | 1 in 26330254 | Every 72090 years (thrice in history of modern humankind)
μ ± 6σ | 0.999999998 | 1 in 506797346 | Every 1.38 million years (twice in history of humankind)
μ ± 6.5σ | 0.99999999992 | 1 in 12450197393 | Every 34 million years (twice since the extinction of dinosaurs)
μ ± 7σ | 0.99999999999744 | 1 in 390682215445 | Every 1.07 billion years (four occurrences in history of Earth)
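The "outside range" column above is just the reciprocal of the two-tailed probability; a sketch of how the rows can be regenerated (the function name is mine):

```python
import math

def one_in_n_outside(k):
    """Approximate '1 in N' frequency of a value falling outside
    mu +/- k sigma for a Gaussian population."""
    p_outside = 1 - math.erf(k / math.sqrt(2))
    return 1 / p_outside

print(round(one_in_n_outside(2)))   # ~22, matching the mu +/- 2 sigma row
print(round(one_in_n_outside(3)))   # ~370, matching the mu +/- 3 sigma row
```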
