Gaussian Distributions: Overview: This worksheet introduces the properties of Gaussian distributions.
by
Kevin Lehmann
Department of Chemistry
Princeton University
Princeton, NJ 08544
[email protected]
© Copyright Kevin Lehmann, 1997. All rights reserved. You are welcome to use this document in your
own classes but commercial use is not allowed without the permission of the author. The author
welcomes any constructive criticisms or other comments from either educators or students.
Prerequisites: It is assumed in this worksheet that the reader already knows the
basic principles of probability and integral calculus, as well as how to use Mathcad.
Introduction:
The function dnorm(x, µ, σ) is one of the predefined functions in Mathcad. It describes the familiar
"Bell Curve", with mean µ and standard deviation σ. The special case of µ = 0 and
σ = 1 is known as the standard Gaussian. Given any variable x that is distributed
according to a Gaussian distribution with mean µ and standard deviation σ, it is
possible to define a new variable z = (x − µ)/σ, which will be distributed according to
the standard Gaussian distribution.
The probability of finding the independent variable x in the interval [x, x+dx] is
given by dnorm(x,µ,σ)dx. This is only rigorously correct in the limit that dx ->0, but it
is a good approximation for finite dx as long as dx << σ.
For a standard Gaussian, what is the probability that the variable will be
found in the interval between [1,1.01]?
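As a quick numerical check outside Mathcad, here is a minimal sketch in Python, with scipy.stats.norm standing in for dnorm and pnorm (the variable names are mine, not the worksheet's):

    from scipy.stats import norm

    dx = 0.01
    exact = norm.cdf(1.01) - norm.cdf(1.0)  # exact probability of [1, 1.01]
    approx = norm.pdf(1.0) * dx             # the dnorm(x, 0, 1)*dx approximation
    print(exact, approx)                    # both ~0.0024, since dx << sigma = 1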
$$\mathrm{pnorm}(x, \mu, \sigma) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left( -\frac{(x' - \mu)^2}{2\sigma^2} \right) dx'$$
We express the probability of finding x in the interval [a,b] (a < b), as:
$$P(a \le x \le b) = \mathrm{pnorm}(b, \mu, \sigma) - \mathrm{pnorm}(a, \mu, \sigma)$$
Use the rules of integration to show that this expression is correct. (Hint:
remember that one can break an integral over [−∞, b] into two integrals,
over [−∞, a] and [a, b].)
Use qnorm to find the value of 'a' such that a random point from a standard
Gaussian distribution has a 95% probability of being less than this value of
'a'. Repeat for a 95% probability of being greater than 'a'.
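If you want to check your answer outside Mathcad, scipy's norm.ppf is the analog of qnorm (a minimal sketch, not part of the worksheet):

    from scipy.stats import norm

    print(norm.ppf(0.95))   # ~1.645: 95% probability of being below this value
    print(norm.ppf(0.05))   # ~-1.645: 95% probability of being above this value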
µ := 50    Average value of the distribution we will select from
These values are not exactly equal to µ and σ due to fluctuations, which reflect that
we only have a finite number of data points. If you select a different distribution and
then select "Calculate Worksheet" from the Math menu, these values will change.
[Figure: the selected data points y_i plotted against their index i, forming a noisy horizontal band centered at µ = 50.]
This looks very much like what you would see on an oscilloscope as you looked at
the output of a good amplifier with the gain up high. It is what is known as "White
Noise". Note that it consists of a "band" centered at µ, with a width of ~5σ.
Verify that the width in the above figure is approximately 5σ.
Let us now compare our distribution with the Gaussian function. We do this by
'binning' our data into a histogram. Again, Mathcad makes this easy.
i := 0 .. 199    x₀ := µ − 5·σ    δx := 0.05·σ    x_{i+1} := x_i + δx

The "bins" are defined between the values in vector x. Recall that µ is the
mean. x₀ is the location of the first bin, and δx defines the size of each bin. The
bins are filled automatically. How many bins are being generated here? What
range of values do the bins cover?
f := hist(x, y)    f is a vector with the number of times points of y fall in each bin.
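For readers following along outside Mathcad, the same selection and binning can be sketched in Python with numpy. Only µ = 50 is stated above; the values of σ and N below are my assumptions (the worksheet defines them in lines lost here):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, N = 50.0, 20.0, 40000     # mu from the worksheet; sigma, N assumed
    y = rng.normal(mu, sigma, N)         # points selected from the Gaussian
    edges = mu - 5*sigma + 0.05*sigma*np.arange(201)  # 200 bins spanning mu +/- 5*sigma
    f, _ = np.histogram(y, bins=edges)   # f[i] = number of y values in bin i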
[Figure: histogram counts f_i and the Gaussian prediction f2_i plotted against the bin centers x_i + 0.5·δx, for x from −50 to 150.]
Why does the number of points in each bin not exactly match the
predictions of the Gaussian Distribution Function used to generate the data?
In most applications of statistics to laboratory data, we want to know the true value of
the mean for the distribution. We cannot determine this exactly, but we can find an
estimate by using a finite number of measurements. Let us now consider splitting
our large data set into a set of smaller data sets, so that we can compare how our
estimates do.
Ns := 4    Number of data points per set; start with 4

Nt := floor(N/Ns)    Number of data sets (floor(x) is the largest integer ≤ x)
For a Gaussian distribution, the best estimate of µ from a data set is just the mean
of the data. By "best", we mean that this will give on average the lowest root mean
squared deviation from the true value. Next we compute the average for each set of
4 points and create the avg vector.
k := 0 .. Nt − 1

$$\mathrm{avg}_k = \frac{1}{N_s} \sum_{j=0}^{N_s - 1} y_{N_t \cdot j + k}$$
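The same partition can be sketched in numpy (continuing the assumed arrays above). Note that reshape(Ns, Nt) places y[Nt*j + k] at position [j, k], which is exactly the striding the worksheet's sum uses:

    Ns = 4                              # data points per set
    Nt = N // Ns                        # number of data sets, floor(N/Ns)
    sets = y[:Ns*Nt].reshape(Ns, Nt)    # column k holds the k'th set
    avg = sets.mean(axis=0)             # avg[k] = mean of the k'th set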
Express the mean of avg in terms of µ, the mean of the initial distribution, by
selecting the result and then putting µ where one would type units (the black
box after the number). Express the stdev in terms of σ, the standard deviation
of the initial distribution.
Compare these results with those calculated for the entire distribution above.
Note: the average value of the means is just the mean for the total distribution. The
standard deviation of the distribution of means is the standard deviation of the initial
distribution divided by the square root of the number of data points averaged. This
result will hold for any distribution function as long as the standard deviation is finite.
$$f2_i = N_t \cdot \delta x \cdot \mathrm{dnorm}\!\left( \frac{x_i + x_{i+1}}{2},\ \mu,\ \frac{\sigma}{\sqrt{N_s}} \right)$$

Gaussian with width scaled by √Ns
[Figure: histogram of the set means, f_i, and the rescaled Gaussian f2_i, plotted against the bin centers x_i + 0.5·δx; the distribution of means is much narrower than the original data.]
Notice that the distribution of means is again a normal distribution, but with the width
reduced by the square root of the number of points averaged. For a general
distribution, the distribution of the mean of Ns measurements will not be Gaussian,
but will have a standard deviation smaller than that of the sample by the factor of
square root of the number of data points. Further, for a broad class of distribution
functions, the Central Limit Theorem asserts that the distribution of the mean of Ns
values approaches arbitrarily close to a Gaussian distribution for sufficiently large Ns.
$$\mathrm{Interval}(p, \mu, \sigma) = \begin{pmatrix} \mathrm{qnorm}\!\left(\dfrac{1-p}{2}, \mu, \sigma\right) \\[6pt] \mathrm{qnorm}\!\left(\dfrac{1+p}{2}, \mu, \sigma\right) \end{pmatrix}$$

Gives the confidence interval for a normal distribution. p is the probability level; you must provide a value for p, as shown below.
Why are the first arguments of the qnorm function (1-p)/2 and (1+p)/2 for the
lower and upper limit of the confidence interval?
Let us calculate a few confidence intervals for the standard Gaussian with unit width
and zero mean. The results can be interpreted for a general Gaussian in terms of
how many σ we must go from the mean:
Interval(0.9, 0, 1) = (−1.645, 1.645)        Interval(0.95, 0, 1) = (−1.96, 1.96)
Interval(0.99, 0, 1) = (−2.576, 2.576)       Interval(0.999, 0, 1) = (−3.291, 3.291)
Thus we see we expect the points to fall within ~3.3σ of the mean 99.9% of the
time. Let us see how well this predicts our selected set of points.
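A sketch of the Interval function in Python that reproduces the table above, with scipy's norm.ppf standing in for qnorm:

    from scipy.stats import norm

    def interval(p, mu=0.0, sigma=1.0):
        # limits that bracket a fraction p of the normal distribution
        return norm.ppf((1 - p)/2, mu, sigma), norm.ppf((1 + p)/2, mu, sigma)

    for p in (0.9, 0.95, 0.99, 0.999):
        print(p, interval(p))   # (-1.645, 1.645), (-1.96, 1.96), ...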
It turns out that we want to first get an estimate for the square of the standard
deviation, which is known as the variance. We get an estimate for this by the
following expression:
$$\mathrm{var}_k = \frac{\displaystyle\sum_{j=0}^{N_s - 1} \left( y_{N_t \cdot j + k} - \mathrm{avg}_k \right)^2}{N_s - 1}$$
vark is the variance of the k'th set of Ns data points. We can compare the mean of
var with the variance for the original normal distribution:
Calculate the mean of the vector var and express it in terms of σ². The result
should be that the mean of var is almost equal to σ².
Define a new vector, var2k, which is the average of the squared deviation of
the points in the k'th data set from µ :
$$\mathrm{var2}_k = \frac{\displaystyle\sum_{j=0}^{N_s - 1} \left( y_{N_t \cdot j + k} - \mu \right)^2}{N_s}$$
Note that we average the variances, and then take the square root to get the
standard deviation. Show that, for our data set, the average of the sample
standard deviations (the square roots of the elements of var) does not give a
good estimate of σ.
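The bias this exercise asks about is easy to see numerically; a sketch continuing the assumed arrays above (for Ns = 4 the averaged sample standard deviation comes out roughly 8% low):

    var = sets.var(axis=0, ddof=1)     # sample variances, dividing by Ns - 1
    print((var.mean())**0.5 / sigma)   # ~1.0: square root of the averaged variance
    print((var**0.5).mean() / sigma)   # ~0.92 for Ns = 4: averaged std. devs. run low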
The χ² function:
We have seen from above that on average the sample variance (with division by
N s -1) gives the variance of the Gaussian distribution from which the data was
selected. This is true for most distribution functions. However, to understand how
good an estimate it is for a single set of data, we need to know what the
distribution of sample variances is. In the theory of statistics, it is shown that the
distribution function for the sample variance can be expressed in terms of the χ²
distribution function ('chi-squared function'), defined as:
$$\chi^2(x, d) = \frac{e^{-x/2}\, x^{d/2 - 1}}{2^{d/2}\, \Gamma(d/2)}$$

Note: χ² here is a function of real x ≥ 0 and positive integer d. This equation is toggled off.
where x is defined on the range [0,∞), and d is an integer parameter known as the
'degrees of freedom'. Γ(d/2) is the Gamma function. The χ² distribution function
is predefined in Mathcad as dchisq(x, d).
Let us now compare the observed distribution of sample variance values, var_k,
with that predicted by the χ² distribution:
f := hist(x, var)    Put calculated variances into bins
What is contained in the m'th element of vector f?
d := Ns − 1    "Degrees of freedom"
Why does this expression have the given prefactor for the dchisq
function?
[Figure: histogram of the sample variances, f_m, and the χ² prediction f2_m, plotted against (x_m + 0.5·δx)/σ², for values from 0 to 4.]
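One way to read the prefactor question above: for Gaussian data, d·s²/σ² with d = Ns − 1 follows the χ² distribution with d degrees of freedom, so the density of the sample variance s² itself is (d/σ²)·dchisq(d·s²/σ², d). A sketch of the comparison in Python, continuing the assumed arrays, with scipy's chi2 standing in for dchisq:

    import numpy as np
    from scipy.stats import chi2

    d = Ns - 1
    u = d * var / sigma**2                    # scaled sample variances
    counts, edges = np.histogram(u, bins=40, density=True)
    centers = 0.5*(edges[:-1] + edges[1:])
    print(np.abs(counts - chi2.pdf(centers, d)).max())  # small when the match is good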
The important lesson to learn from the above graph is that for a sample of
modest size, the estimate of the standard deviation calculated from the
computed variance can be quite different from the true value for the distribution.
In particular, we have a high probability of significantly underestimating the real σ.
Thus, if we compute a confidence interval using this too small σ, we will get an
unrealistically small value for our true uncertainty! This mistake is all too
common!
To demonstrate this effect, let us calculate how often our calculated mean value
(avg) differs from the true mean by more than "2σ", where we use the observed
variance to estimate σ. We expect, as shown above, that the standard deviation
for the mean of Ns measurements is smaller by a factor of √Ns.
If instead, we use the true value for σ, we of course get a reasonable estimate:
$$\frac{1}{N_t} \sum_k \left( \left| \mathrm{avg}_k - \mu \right| > 2\sqrt{\frac{\sigma^2}{N_s}} \right) = 0.04816$$

This is close to the 5% expected for a Gaussian distribution.
Student's t distribution:
range := qt(0.975, d)    range = 3.182
The reason we use p = 0.975 for the first argument of the qt function instead of 0.95
is that qt returns the value of x (in units of the calculated standard deviation) such
that an observed value will occur less than a fraction p of the time. The second
argument is d, the degrees of freedom. You will see by changing values of the
second argument that the true error confidence interval for a small d is quite a bit
larger than when we can assume we know the value for σ. For d=1, the 95% interval
is greater than six times the traditional "2σ" estimate, while for d = 10 it is only
about 10% larger.
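The qt values quoted here are easy to reproduce; scipy's t.ppf is the analog of Mathcad's qt:

    from scipy.stats import t

    for d in (1, 3, 10, 100):
        print(d, t.ppf(0.975, d))  # 12.71, 3.182, 2.228, 1.984 -- vs. 1.96 for known sigma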
Let us test the fraction of sample means that differ from µ by more than the range
predicted by the Student's t distribution.
$$\frac{1}{N_t} \sum_k \left( \left| \mathrm{avg}_k - \mu \right| > \mathrm{range} \cdot \sqrt{\frac{\mathrm{var}_k}{N_s}} \right) = 0.05016$$

Much closer to the 5% that we expect!
There are many cases where we need to know the average value of a function,
f(x), of a variable that follows a Gaussian Distribution. This can be expressed as:
$$\mathrm{mean\_of}(f) = \int_{-\infty}^{\infty} f(x)\, \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) dx$$
In favorable cases, we can compute the integral analytically. In most other cases, it
can easily be calculated by numerical integration. There exists a very efficient
method, known as Gauss-Hermite integration, for the numerical evaluation of the
integral of a smooth function times a Gaussian. Another method that we can use is
to take N points from the Gaussian distribution, xi, and compute the average value of
f(xi). This method is known as Monte-Carlo Integration. The average of an infinite
set of f(xi) values is just mean_of(f), as defined above. We cannot, of course, use an
infinite number of points, but if we use a finite number, their average will provide an
estimate for mean_of(f). However, we will still have statistical fluctuations. The
standard deviation of the distribution of f(xi) values is given by:
$$\mathrm{std\_of}(f) = \sqrt{ \mathrm{mean\_of}\!\left(f^2\right) - \mathrm{mean\_of}(f)^2 }$$
Using our data set, y, calculate the average value of f(x) = |x − µ|. Based
upon the expression for the standard deviation of the distribution of f(x)
values, estimate the likely 2σ uncertainty of this estimate.
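A sketch of the Monte-Carlo estimate for this particular f, continuing the assumed arrays above (the worksheet notes further down that the exact answer is √(2/π)·σ ≈ 0.798·σ):

    import numpy as np

    fvals = np.abs(y - mu)                        # f(x) = |x - mu| at each sample point
    est = fvals.mean()                            # Monte-Carlo estimate of mean_of(f)
    err2 = 2*fvals.std(ddof=1)/np.sqrt(len(y))    # ~2-sigma uncertainty of the estimate
    print(est/sigma, err2/sigma)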
By the central limit theorem, we can predict that in the limit of large N, the
distribution of the average value of N values of f(xi) will become a Gaussian
distribution with the above standard deviation divided by √N and mean equal to
mean_of(f). The central limit theorem states that for any distribution function P(x) with
a mean µ and finite standard deviation σ, and satisfying some other mathematical
conditions, the distribution function for the mean of N independent points selected at
random from P(x) will, in the limit of large N, approach a Gaussian distribution with
mean µ and a standard deviation of σ/√N.
The normalized fourth central moment is often called the kurtosis. It is always
positive and has a value of 3 for a normal distribution function. A value below 3
means that the distribution dies faster in the wings than a Gaussian. For example,
a uniform distribution (P(x) = constant if x is in [a,b], zero otherwise) has a kurtosis
of 9/5. A kurtosis greater than 3 means that the distribution has a 'longer' tail than a
Gaussian.
Derive the skew and kurtosis of the uniform distribution in the interval [0,1].
Use Mathcad's symbolics to evaluate the necessary integrals.
For a Gaussian distribution, the mean absolute deviation has a value of
√(2/π)·σ = 0.798·σ. The ratio of the mean absolute deviation to the standard
deviation is larger for a distribution more compact than a Gaussian (it equals
0.866 for a uniform distribution), and smaller for a distribution with longer tails.
Since most common statistical tests make the assumption that the sample follows
a Gaussian distribution, it is important to test this assumption to detect at least gross
deviations. One common use for the low order moments is to provide such a test.
Let us consider the samples of Gaussian data points we selected earlier in this
worksheet and let us calculate the distribution of the normalized absolute deviation
as well as the normalized third and fourth central moments. In order for the test to be
at all meaningful, we need at least ~10 points per sample, so go back up to page 8
and change Ns to 10. For small sample sizes, we cannot directly use the expected
Gaussian distribution since often we do not know the true σ of the Gaussian but must
instead use the sample variance to estimate σ. This introduces additional error, as
we saw above with the Student's t distribution. As an example,
let us compare the 'moments' calculated using the estimated quantities and the true
values for the initial distribution.
stdev(avg_abs_dev) = 0.075    This gives a measure of the range of variation in the
computed set of mean absolute deviations. Note
that it is relatively small.
Let us now make the same comparisons for the third and fourth central moments.
First the third:
$$\frac{1}{N} \sum_{j=0}^{N-1} \frac{\left( y_j - \mu \right)^3}{\sigma^3} = 0.013$$

Near zero, as it should be for a Gaussian.
$$\mathrm{skew}_k = \frac{1}{N_s - 1} \sum_{j=0}^{N_s - 1} \frac{\left( y_{N_t \cdot j + k} - \mathrm{avg}_k \right)^3}{\mathrm{var}_k^{3/2}}$$

Third central moments for each set of Ns points, computed using sample means and standard deviations.
mean(skew) = 1.97·10⁻³    The average value is almost the same as above,
indicating that in this case the bias is not large.

stdev(skew) = 0.506    This shows the size of the fluctuations of the computed
third moments, which are not small.
The above results suggest a strategy for testing if a set of points are likely
sampled from a Gaussian distribution. We compute the above moments for the
data, and then compare the results to those found from sets of the same number of
random numbers known to be drawn from a Gaussian distribution. However,
because of fluctuations, we will not be able to decide with certainty. Any finite
set of points has some probability of being observed when we pull points from a
distribution, like the Gaussian, that has nonzero probability density for every value
of x.
In order to tell if the moments calculated from the observed points are likely for
points from a Gaussian distribution, we need to look at the distribution of values
we obtained in our simulation.
Let us make a histogram of the results, first for the mean absolute deviation.
m := 0 .. 39    f := hist(x, avg_abs_dev)/(Nt · 0.025)
[Figure: histogram of the normalized mean absolute deviations, f_m versus x_m + 0.5·δx, for values from about 0.6 to 1.4; horizontal axis: Average Absolute Deviation/st.dev. Remember that to see the correct figure you need to change the value of Ns on page 8.]
This distribution is rather 'tight', making for a sensitive test. We see that values of
the mean absolute deviation less than 0.6 and more than 1.0 times the sample
standard deviation are highly unlikely.
From the above results, we can estimate a 90% confidence interval for this
statistic. We do this by putting the set of calculated values, avg_abs_dev, in
ascending order using the Mathcad sort function. Then determine the values of
the elements with arguments closest to 0.05·Nt and 0.95·Nt.
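In Python the same recipe could look like this, with np.sort standing in for Mathcad's sort (avg_abs_dev is assumed to have been computed as described in the text):

    import numpy as np

    s = np.sort(avg_abs_dev)     # ascending order
    Nt = len(s)
    lo = s[int(0.05*Nt)]         # element closest to 0.05*Nt
    hi = s[int(0.95*Nt)]         # element closest to 0.95*Nt
    print(lo, hi)                # estimated 90% confidence interval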
m := 0 .. 40    δx := 0.1    x_m := −2 + δx·m    m := 0 .. 39    f := hist(x, skew)/(Nt · 0.1)
[Figure: histogram of the computed skew values, f_m versus x_m + 0.5·δx, for values from −2 to 2; horizontal axis: Calculated Third Central Moment/σ³. Remember that to see the correct figure you need to change the value of Ns on page 8.]
Determine the 90% confidence interval for the skew. By symmetry, the
upper and lower values should be symmetrically placed around zero, but
the values you determine are unlikely to be so. Why did you not get
symmetric values?
m := 0 .. 60    δx := 0.1    x_m := δx·m    m := 0 .. 59    f := hist(x, kurtosis)/(Nt · 0.1)
[Figure: histogram of the computed kurtosis values, f_m versus x_m + 0.5·δx, for values from 0 to 6; horizontal axis: Calculated Fourth Central Moment/σ⁴. Remember that to see the correct figure you need to change the value of Ns on page 8.]
Thus, we see that we can use the above three statistics to test a data set of Ns
points. Note that even if the computed statistics are well within the range
expected for points drawn from a Gaussian distribution, we cannot conclude that
our observed data are drawn from a Gaussian distribution with any statistical
confidence. However, we often make the "null hypothesis" that the observed
data follow a Gaussian distribution and do our analysis of the data accordingly,
unless the data indicate that this assumption is unlikely. Note that if we reject
the data when they fail any one of the three tests, each with a 90% confidence
interval, we will, on average, reject Gaussian data far more often than 10% of the
time! If the tests are independent, we would keep the data only (0.9)³ ≈ 73% of
the time. To be 'safe', we should make the confidence interval for each of the
three tests three times more stringent than we want for the overall test. We
rarely test moments higher than the fourth, since the fluctuations of such
statistics grow very rapidly.
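A sketch of the combined screen in Python; the limits dictionary is a placeholder, and in practice each (lo, hi) pair would come from simulated Gaussian samples of the same size, with the per-test intervals tightened as described above (the normalizations here are simplified relative to the worksheet's exact 1/(Ns−1) factors):

    import numpy as np

    def looks_gaussian(data, limits):
        # Compute the three normalized statistics and check each against its
        # simulated interval; failing any one test rejects the null hypothesis.
        z = (data - data.mean()) / data.std(ddof=1)
        stats = {
            'abs_dev': np.abs(z).mean(),  # normalized mean absolute deviation
            'skew': (z**3).mean(),        # normalized third central moment
            'kurtosis': (z**4).mean(),    # normalized fourth central moment
        }
        return all(limits[k][0] <= v <= limits[k][1] for k, v in stats.items())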
Problems:
Molecules in a gas or liquid at equilibrium at temperature T have a distribution of
each component of velocity described by a Gaussian distribution with mean zero
and variance equal to RT/M, where R = 8.314 J K⁻¹ mol⁻¹ is the gas constant, and M
is the molar mass.
2. What is the mean value of vz? The root mean square value of vz?
4. What fraction of the molecules have -100 m/s < vz < +100 m/s?
Mathcad has predefined the cumulative χ² distribution function, pchisq(x, d), and the
inverse cumulative χ² distribution function, qchisq(p, d). d is the degrees of freedom
and p a probability value.
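scipy exposes the same pair of functions, which may help in checking the problems below (chi2.cdf and chi2.ppf are the analogs of pchisq and qchisq):

    from scipy.stats import chi2

    d = 9                                        # degrees of freedom for Ns = 10
    print(chi2.cdf(4.5, d))                      # like pchisq(4.5, 9)
    print(chi2.ppf(0.05, d), chi2.ppf(0.95, d))  # like qchisq(0.05, 9), qchisq(0.95, 9)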
7. Use the pchisq function to determine what fraction of samples of Ns = 10 will have
a sample variance less than 50% of the variance of the Gaussian population from
which the data was selected.
8. Use the qchisq function to determine a 90% confidence interval for the variance of
a Gaussian distribution from knowledge of the sample variance of a single sample of
Ns = 10 data points.
10. What is the mean and variance of this uniform distribution function?
11. Calculate the mean of each data set and plot the distribution of mean values.
Compare that plot with a Gaussian of the mean and variance predicted by the central
limit theorem.
12. Calculate the variance of each set of data points, and plot the observed
distribution. Compare to the χ² form predicted for a Gaussian distribution of the
same variance.
13. Calculate the distribution of normalized mean absolute deviations for the sets of
data points. Plot the results and compare to that found previously for a Gaussian
distribution. What fraction of the data sets have values that fall outside the 90%
confidence interval for a Gaussian defined above?
14. Repeat question 13, but for the skew. What is the true skew of the uniform
distribution?
15. Repeat question 13, but for the kurtosis. What is the kurtosis for the uniform
distribution?
16. Based upon the results of questions 13-15, discuss if one can decide with
reasonable reliability if a data set of 10 points is drawn from a Gaussian and not a
uniform distribution function.