The Gaussian or Normal Probability Density Function
Probability Distributions
• The distributions considered in this lesson are the following:
1. The Gaussian, or normal, probability distribution.
• When examining experimental data, this distribution is undoubtedly
the first that is considered.
• The Gaussian distribution describes the population of possible errors
in a measurement when many independent sources of error
contribute simultaneously to the total precision error in a
measurement.
Probability Distributions
• These sources of error must be unrelated, random, and of roughly the
same size.
• Although we will emphasize this particular distribution, you must
keep in mind that data do not always abide by the normal
distribution.
• For tabulation and calculation, the Gaussian distribution is recast in a
standard form, sometimes called the z-distribution (see formula and
table).
Probability Distributions
2. Student’s t-distribution. This distribution is used in predicting the
mean value of a Gaussian population when only a small sample of
data is available (see formula and table).
• We define A(z) as the area under the standard normal curve between 0 and z,
i.e., A(z) = (1/√(2π)) ∫ exp(−ζ²/2) dζ, where the integral runs from ζ = 0 to
ζ = z (the special case z1 = 0 and z2 = z of the general probability integral).
In other words, A(z) is the probability that a measurement lies between 0 and z.
Standard normal density function
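• In standard form, z = (x − μ)/σ, and the standard normal density is
f(z) = (1/√(2π)) exp(−z²/2), which has zero mean and unit standard deviation.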
• Below is a table of A(z), produced using Excel, which has a built-in
error function, ERF(value); in terms of it, A(z) = 0.5*ERF(ABS(z)/SQRT(2)).
Excel has another function that can be used to calculate A(z), namely
A(z) = NORMSDIST(ABS(z)) − 0.5.
• To read the value of A(z) at a particular value of z:
• Go down to the row for the first two digits of z (e.g., 2.5).
• Go across to the column for the second decimal digit of z (e.g., 0.04).
• Read the value of A(z) from the table.
• Example: At z = 2.54, A(z) = A(2.5 + 0.04) = 0.49446. These values are
highlighted in the above table as an example.
• Since the normal PDF is symmetric, A(-z) = A(z), so there is no need to
tabulate negative values of z.
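• As a cross-check of the tabulated values, A(z) can also be computed directly; here is a minimal sketch using Python's standard library (the helper name A is ours, chosen to match the notation above):

from statistics import NormalDist

def A(z):
    # Area under the standard normal PDF between 0 and z; abs() exploits the symmetry A(-z) = A(z).
    return NormalDist().cdf(abs(z)) - 0.5

print(round(A(2.54), 5))    # 0.49446, matching the tabulated example above
print(A(-2.54) == A(2.54))  # True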
Linear interpolation:
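• A(z) at a value of z falling between two tabulated entries is estimated by
interpolating linearly between the neighboring entries. For example, using the
standard four-decimal table values A(1.33) = 0.4082 and A(1.34) = 0.4099 (this
is the z value used in the worked example later in this lesson):
A(1.338) ≈ 0.4082 + 0.8 × (0.4099 − 0.4082) ≈ 0.4096.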
Special cases:
• If z = 0, obviously the integral A(z) = 0. This means physically that
there is zero probability that x will exactly equal the mean! (To be
exactly equal would require equality out to an infinite number of
decimal places, which will never happen.)
• If z = ∞, A(z) = 1/2 since f(z) is symmetric. This means that there is a
50% probability that x is greater than the mean value. In other words,
z = 0 represents the median value of x.
• Likewise, if z = –∞ , A(z) = 1/2. There is a 50% probability that x is less
than the mean value.
Special cases:
• If z = 1, it turns out that A(1) = 0.3413 to four significant digits.
This is a special case, since by definition z = (x − μ)/σ. Therefore, z = 1
represents a value of x exactly one standard deviation greater than
the mean.
• A similar situation occurs for z = –1: since f(z) is symmetric,
A(–1) = A(1) = 0.3413 to four significant digits. Thus, z = –1 represents a
value of x exactly one standard deviation less than the mean.
Special cases:
• Because of this symmetry, we conclude that the probability that z lies
between –1 and 1 is 2(0.3413) = 0.6826 or 68.26%. In other words,
there is a 68.26% probability that for some measurement, the
transformed variable z lies within ± one standard deviation from the
mean (which is zero for this pdf).
• Translated back to the original measured variable x, P (μ-σ<x< μ+σ )=
68.26% . In other words, the probability that a measurement lies
within ± one standard deviation from the mean is 68.26%.
Confidence level
• The above illustration leads to an important concept called
confidence level. For the above case, we are 68.26% confident that
any random measurement of x will lie within ± one standard
deviation from the mean value.
• I would not bet my life savings on something with a 68% confidence
level. A higher confidence level is obtained by choosing a larger z
value. For example, for z = 2 (two standard deviations away from the
mean), it turns out that A(2) = 0.4772 to four significant digits.
Confidence level
• Again, due to symmetry, multiplication by two yields the probability
that x lies within two standard deviations from the mean value, either
to the right or to the left. Since 2(0.4772) = 0.9544, we are 95.44%
confident that x lies within ± two standard deviations of the mean.
• Since 95.44 is close to 95, most engineers and statisticians ignore the
last two digits and state simply that there is about a 95% confidence
level that x lies within ± two standard deviations from the mean. This
is in fact the engineering standard, called the “two sigma confidence
level” or the “95% confidence level.”
Confidence level
• For example, when a manufacturer reports the value of a property,
like resistance, the report may state “R = 100 ± 9 Ω (ohms) with 95%
confidence.” This means that the mean value of resistance is 100 Ω,
and that 9 ohms represents two standard deviations from the mean.
• In fact, the words “with 95% confidence” are often not even written
explicitly, but are implied. In this example, by the way, you can easily
calculate the standard deviation. Namely, since 95% confidence level
is about the same as 2 sigma confidence, 2 σ ≈ 9 Ω, or σ = 4.5 Ω.
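• A quick Python sketch of this arithmetic (the variable names are illustrative, not from the manufacturer's report):

mean_R = 100.0       # ohms, reported mean resistance
half_width_95 = 9.0  # ohms, quoted at ~95% confidence, i.e. about two standard deviations

sigma = half_width_95 / 2                      # 4.5 ohms
print(sigma)                                   # 4.5
print(mean_R - 2 * sigma, mean_R + 2 * sigma)  # 91.0 109.0 (the ~95% confidence interval)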
Confidence level
• For more stringent standards, the confidence level is sometimes
raised to three sigma. For z = 3 (three standard deviations away from
the mean), it turns out that A(3) = 0.4987 to four significant digits.
• Multiplication by two (because of symmetry) yields the probability
that x lies within ± three standard deviations from the mean value.
Since 2(0.4987) = 0.9974, we are 99.74% confident that x lies within ±
three standard deviations from the mean.
Confidence level
• Most engineers and statisticians round down and state simply that
there is about a 99.7% confidence level that x lies within ± three
standard deviations from the mean. This is in fact a stricter
engineering standard, called the “three sigma confidence level” or the
“99.7% confidence level.”
• Summary of confidence levels: The empirical rule states that for any
normal or Gaussian PDF,
Confidence level
• Approximately 68% of the values fall within 1 standard deviation from
the mean in either direction.
• Approximately 95% of the values fall within 2 standard deviations
from the mean in either direction. [This one is the standard “two
sigma” engineering confidence level for most measurements.]
• Approximately 99.7% of the values fall within 3 standard deviations
from the mean in either direction. [This one is the stricter “three
sigma” engineering confidence level for more precise measurements.]
• More recently, many manufacturers are striving for “six sigma”
confidence levels.
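• These percentages follow directly from the standard normal CDF; here is a minimal sketch in Python (standard library only) that reproduces them:

from statistics import NormalDist

for k in (1, 2, 3, 6):
    inside = 2 * NormalDist().cdf(k) - 1  # fraction within ±k standard deviations of the mean
    print(f"within ±{k} sigma: {inside:.9f}")
# within ±1 sigma: 0.682689492
# within ±2 sigma: 0.954499736
# within ±3 sigma: 0.997300204
# within ±6 sigma: 0.999999998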
Example:
• Given: The same 1000 temperature measurements used in a previous
example for generating a histogram and a PDF. The data are provided
in an Excel spreadsheet.
• To do: (a) Compare the normalized PDF of these data to the normal
(Gaussian) PDF. Are the measurement errors in this sample purely
random? (b) Predict how many of the temperature measurements are
greater than 33.0°C, and compare with the actual number.
Solution:
• We plot the experimentally generated
PDF (blue circles) and the theoretical
normal PDF (red curve) on the same
plot. The agreement is excellent,
indicating that the errors are very nearly
random. Of course, the agreement is not
perfect – this is because n is finite. If n
were to increase, we would expect the
agreement to get better (less scatter and
difference between the experimental
and theoretical PDFs).
Solution:
• For this data set, we had calculated the sample mean to be x̄ =
31.009°C and the sample standard deviation to be S = 1.488°C. Since n = 1000,
the sample size is large enough to assume that the expected value μ is
nearly equal to x̄, and the standard deviation σ is nearly equal to S. At the
given value of temperature (set x = 33.0°C), we normalize to obtain z,
namely z = (x − μ)/σ ≈ (33.0 − 31.009)/1.488 = 1.338.
Solution:
• We calculate the area A(z), either by
interpolation from the above table or by
direct calculation. The table yields A(z) ≈
0.40955 (direct calculation gives 0.409552).
• This means that 40.9552% of the
measurements are predicted to lie between
the mean (31.009°C) and the given value of
33.0°C (red area on the plot). The percentage
of measurements greater than 33.0°C is 50%
− 40.9552% = 9.0448% (blue area on the
plot).
Solution:
• Since n = 1000, we predict that
0.090448 × 1000 = 90.448 of the
measurements exceed 33.0°C.
• Rounding to the nearest integer, we
predict that 90 measurements are
greater than 33.0°C. Looking at the
actual data, we count 81 temperature
readings greater than 33.0°C.
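• A short Python check of this prediction (a minimal sketch; the variable names are ours, and the statistics are those quoted above):

from statistics import NormalDist

x_bar, S, n = 31.009, 1.488, 1000  # sample mean (°C), sample standard deviation (°C), sample size
x = 33.0                           # threshold temperature (°C)

z = (x - x_bar) / S                # ≈ 1.338
p_above = 1 - NormalDist().cdf(z)  # ≈ 0.0904 (the blue tail area)
print(z, p_above, n * p_above)     # ≈ 1.338, 0.0904, about 90 readings predicted above 33.0°C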
Discussion:
• The percentage error between the actual and
predicted number of measurements is around
−10%. This error would be expected to decrease if
n were larger.
• If we had asked for the probability that T lies
between the mean value and 33.0°C, the result
would have been 0.4096 (to four digits), as
indicated by the red area in the above plot.
• However, we are concerned here with the
probability that T is greater than 33.0°C, which is
represented by the blue area on the plot.
Discussion:
• This is why we had to subtract from 50% in the
above calculation (50% of the measurements are
greater than the mean), i.e., the probability that T
is greater than 33.0°C is 0.5000 – 0.4096 = 0.0904.
• Excel’s built-in NORMSDIST function returns the
cumulative area from -∞ to z, the orange-colored
area in the plot to the right. Thus, at z = 1.338,
NORMSDIST(z) = 0.909552. This is the entire area
on the left half of the Gaussian PDF (0.5) plus the
area labeled A(z) in the above plot. The desired
blue area is therefore equal to 1 - NORMSDIST(z).
Confidence level and level of significance
• Confidence level, c, is defined as the
probability that a random variable lies
within a specified range of values. The
range of values itself is called the
confidence interval. For example, as
discussed above, we are 95.44% confident
that a purely random variable lies within ±
two standard deviations from the mean.
We state this as a confidence level of c =
95.44%, which we usually round off to
95% for practical engineering statistical
analysis.
Confidence level and level of significance
• Level of significance, 𝛼, is defined as
the probability that a random variable
lies outside of a specified range of
values. In the above example, we are
100 – 95.44 = 4.56% confident that a
purely random variable lies either
below or above two standard
deviations from the mean. (We usually
round this off to 5% for practical
engineering statistical analysis.)
Confidence level and level of significance
• Mathematically, the confidence level and the level of significance must add to
1 (or, in terms of percentage, to 100%) since they are complementary,
i.e., α + c = 1, or c = 1 − α.
• Confidence level is sometimes given the symbol c% when it is
expressed as a percentage; e.g., at 95% confidence level, c = 0.95, c%
= 95%, and 𝛼 = 1 – c = 0.05.
• Both 𝛼 and confidence level c represent probabilities, or areas under
the PDF, as sketched above for the normal or Gaussian PDF.
Confidence level and level of significance
• The blue areas in the above plot are
called the tails. There are two tails, one
on the far left and one on the far right.
The two tails together represent all the
data outside of the confidence interval,
as sketched.
• Caution: The area of one of the tails is
only α/2, not α. This factor of two has
led to much grief, so be careful that
you do not forget it.
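• A minimal Python sketch of this bookkeeping: for a two-sided confidence level c, each tail holds α/2, so the cutoff z is taken at cumulative probability 1 − α/2, not 1 − α:

from statistics import NormalDist

c = 0.95                                     # two-sided confidence level
alpha = 1 - c                                # level of significance
z_cut = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff uses alpha/2, one tail's worth, not alpha
print(round(alpha / 2, 3), round(z_cut, 2))  # 0.025 1.96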
Range | Expected fraction of population inside range | Approximate expected frequency outside range | Approximate frequency for daily event
μ ± 0.5σ | 0.382924923 | 3 in 5 | Four or five times a week
μ ± σ | 0.682689492 | 1 in 3 | Twice a week
μ ± 1.5σ | 0.866385597 | 1 in 7 | Weekly
μ ± 2σ | 0.954499736 | 1 in 22 | Every three weeks
μ ± 2.5σ | 0.987580669 | 1 in 81 | Quarterly
μ ± 3σ | 0.997300204 | 1 in 370 | Yearly
μ ± 3.5σ | 0.999534742 | 1 in 2149 | Every 6 years
μ ± 4σ | 0.999936658 | 1 in 15787 | Every 43 years (twice in a lifetime)
μ ± 4.5σ | 0.999993205 | 1 in 147160 | Every 403 years (once in the modern era)
μ ± 5σ | 0.999999427 | 1 in 1744278 | Every 4776 years (once in recorded history)
μ ± 5.5σ | 0.999999962 | 1 in 26330254 | Every 72090 years (thrice in history of modern humankind)
μ ± 6σ | 0.999999998 | 1 in 506797346 | Every 1.38 million years (twice in history of humankind)
μ ± 6.5σ | 0.99999999992 | 1 in 12450197393 | Every 34 million years (twice since the extinction of dinosaurs)