Normal approximation to data
The center is at 𝑥 = 0, and the changes in concavity (that make the elegant bell shape) occur at
𝑥 = −1 and 𝑥 = 1. The horizontal axis units are standard units (z-scores) as discussed in the
previous lesson. We can shift the center and scale the spread to get another normal curve, but
the standard normal curve is the place of reference. When we say “the normal curve” we mean
the curve plotted above.
Perhaps the first to discover something like the normal curve was the French mathematician
Abraham de Moivre (1667-1754) while working out probability approximations (more on that in
a later lesson). Sometimes, a data histogram is similar in shape to the normal curve.
There are several key properties of the normal curve to keep in mind. The total area under the
normal curve and above the x-axis equals 1 (or 100%). This is true for any density curve. We
work on the density scale, so that areas represent percentages. Next, the normal curve never
actually touches the x-axis. It gets really close, so it looks like it touches, but it never does. The
curve stretches off to negative infinity and infinity. Practically speaking, that’s not something to
worry about, because only about 0.000063 (that is, 63 out of 1,000,000) of the total area is
outside of −4 and 4. Finally, the normal curve is perfectly symmetric about its center at 𝑥 = 0.
(The vertical line 𝑥 = 0 is a reflection line for the curve.) For example, the area under the curve
and to the left of 𝑥 = −1.5 is equal to the area under the curve and to the right of 𝑥 = +1.5
and equals about 6.68%.
The so-called empirical rule percentages come from the normal curve:
• The area under the normal curve between −1 and +1 is about 0.6827 or about 68%.
• The area under the normal curve between −2 and +2 is about 0.9545 or about 95%.
• The area under the normal curve between −3 and +3 is about 0.9973 or about 99.7%.
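If you want to check these numbers yourself, here is a quick sketch using the pnorm function discussed later in this lesson:

pnorm(1) - pnorm(-1)   # about 0.6827
pnorm(2) - pnorm(-2)   # about 0.9545
pnorm(3) - pnorm(-3)   # about 0.9973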
Computing areas under the normal curve required printed tables in the days before
calculators. Nowadays, most handheld calculators and all statistics software have the normal
curve functions built in. To appreciate the mathematics involved, observe that the area under
the normal curve between 𝑥 = 𝑎 and 𝑥 = 𝑏 equals the integral
$$\int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$$
This integral cannot be computed directly using antiderivatives, because $e^{-x^2}$ has no
elementary antiderivative. Instead, the function $e^{-x^2}$ is expressed as a Taylor series:

$$e^{-x^2} = \sum_{k=0}^{\infty} \frac{(-x^2)^k}{k!} = 1 - \frac{x^2}{1!} + \frac{x^4}{2!} - \frac{x^6}{3!} + \cdots$$

The terms go on forever, but only a few are needed to estimate $e^{-x^2}$ within several decimal
places. The series can be adjusted to account for the $1/2$ coefficient of $-x^2$ and the constant
multiple 1/√2𝜋. Before calculators, mathematicians had to work out approximations by hand,
but once that tedious work was done, their approximations were published in tables the likes of
which you can still find included in many statistics books. Calculators make the work super-fast,
but they still use an approximation like the one the mathematicians used; computers cannot
magically evaluate the integral.
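As a rough illustration of the kind of computation involved (not the exact method any particular calculator uses), here is a sketch that integrates the series term by term; the helper name series_area is made up here for illustration:

# Integrating the series for e^(-t^2/2) term by term from 0 to z gives
#   sum over k of (-1)^k * z^(2k+1) / (2^k * k! * (2k + 1)),
# and dividing by sqrt(2*pi) gives the area under the standard normal curve from 0 to z.
series_area <- function(z, terms = 15) {
  k <- 0:(terms - 1)
  sum((-1)^k * z^(2 * k + 1) / (2^k * factorial(k) * (2 * k + 1))) / sqrt(2 * pi)
}
series_area(1)        # about 0.3413
pnorm(1) - pnorm(0)   # the built-in answer, also about 0.3413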
For example, the women’s heights in the hanes data have an average of about 64 inches and
an SD of about 3 inches. We shift the normal curve so that the center is at 64 inches, then scale
the curve so that the concavity changes at a distance of 3 inches away from average on either
side (that is, at 61 and at 67 inches).
Now we have an approximate histogram of the data. This normal curve, centered at 64 inches
with a spread of 3 inches, gives a smoothed-out picture of the actual histogram (possibly too
smoothed out), and we might use areas under this normal curve to get rough approximations
about the actual percentages of the women with heights in certain intervals. Incidentally, the
equation graphed above is
$$y = \frac{1}{3\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-64}{3}\right)^2}$$
If we don’t have a data set, but we do know (i) that the shape is approximately normal, (ii) what
the center (avg) is, and (iii) what the spread (SD) is, then we have all we need to know about
the distribution of data values. Areas under the normal curve give approximations of the areas
of the rectangles in the data set’s true histogram.
If we have the data available, there is really no reason to use normal approximations. Statistical
software can compute the true percentages for us. If you have an especially large data set and
know that the histogram is approximately normal, then you might summarize it using a normal
curve just to save on computing power and time. (The Taylor series for $e^{-x^2}$ is easy for a
computer to approximate in less than a second, much faster than organizing millions of data
values for you.)
The R functions dnorm, pnorm, qnorm, and rnorm each have mean and sd arguments that default to 0 and 1, respectively, corresponding
to the standard scale and the standard normal curve. You change these to the average and SD
of your data. You can read more about these functions by executing the ?dnorm command.
We’ll discuss the main aspects of the dnorm and pnorm functions now. The qnorm function
comes up later in this lesson. The rnorm function simulates random sampling, which we’ll
cover after probability.
dnorm
The “d” stands for “density.” Input an x value and dnorm outputs the height of the curve over
that x value. The height of the curve represents the density, just like the height of rectangles in
density histograms.
You will primarily use dnorm to plot the normal curve, not to do calculations.
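For example, a minimal sketch that plots the standard normal curve with dnorm (the plotting range of -4 to 4 is an arbitrary choice):

curve(dnorm(x), from = -4, to = 4,
      xlab = "standard units (z)", ylab = "density")
dnorm(0)   # the height of the curve at its center, about 0.3989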
pnorm
The “p” stands for “probability,” but it is also correct to think of it as standing for “proportion”
or “percentage.” Use pnorm to compute areas under the normal curve. Input a q value and
pnorm outputs the “cumulative probability” of q.
Given a list of numbers, the cumulative relative frequency of a number q is the proportion of
numbers in the list that are less than or equal to q. The previous lesson had several examples.
For instance, the list of probscores has an average of about 84.8. Of those numbers, 47 of
them are less than or equal to 84.8, so the cumulative relative frequency of 84.8 is 47/119 or
about 0.395. In a histogram, this corresponds to the area under the histogram to the left of
84.8.
That’s what the pnorm function does for the normal histogram. It tells you what proportion of
the distribution is less than or equal to a given value, because that proportion corresponds to
the left area under the density curve. From a function that only outputs left areas, you can
easily get right-side areas and in-between areas.
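A quick sketch on the standard scale (mean 0 and SD 1 are the defaults):

pnorm(-1.5)            # left area, about 0.0668
1 - pnorm(1.5)         # right area, by subtracting the left area from the total of 1
pnorm(1) - pnorm(-1)   # area between -1 and +1, about 0.68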
Solution
a) The histogram is shown, and the density scale is used. (You can review the process in the
previous lesson.)
b) The average height is 64.085 inches and the SD is about 3.07 inches. In R, you can store
these for quick reference. The solid line marks the average, and the dashed lines are a
distance of 1 SD from the average.
c) Here is where you use the dnorm function. The normal density curve is plotted on the
existing histogram plot. (Remember, your histogram must be on the density scale, or else
the normal curve will not appear!) Is the curve a good approximation? Not really. The data
have roughly that “mound” or “bell” shape, but obviously the data have a taller peak in the
center and higher proportions of extremely large and small values than the normal curve
would suggest.
d) Here is where you use the pnorm function. Specify the upper bound q as 60 inches, the
mean as the average of the data, and the sd as the SD of the list of heights. The output is
about 0.09175, or about 9%. (If you round the average to 64 and the SD to 3, you’ll get
about 0.09121, still about 9%, and just fine for practical purposes.)
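For reference, a sketch of the calls described above, with the average and SD from part (b):

pnorm(q = 60, mean = 64.085, sd = 3.07)   # about 0.09175
pnorm(q = 60, mean = 64, sd = 3)          # about 0.09121 with the rounded values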
In fact, 6 of these women were 60 inches or shorter. It happens that there are 60 women in the
data set. So, the actual proportion is 6 out of 60; that is, 10%. The graph below shows the data
histogram with the rectangles at 60 inches and below shaded; their area equals 10%. The
normal approximation did a decent job.
In a report, we prefer the 10%, because that is computed from the actual data, while the 9% is
only an approximation. Always give preference to an actual value when it is available.
Example – compare a normal approximation to the actual data over a right-hand interval
a) For the women in the hanes data, about what proportion of the women were 6 feet tall or
taller, according to the normal approximation?
b) What is the actual proportion?
Solution
a) The heights in the data frame are in inches, and 6 feet equals 72 inches. Because we want to
know about women 72 inches tall or taller, we want the area under the normal curve to the
right of 72 inches. The total area equals 1, so we deduct the left area from 1 to get the right
area. (Alternatively, you can set lower.tail=FALSE in the pnorm function.)
Either way, the output is about 0.00498, just under half a percent. If you rounded the
average to 64 inches and the SD to 3 inches, the output is about 0.00383.
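A sketch of both versions of the computation, using the average and SD from the earlier example:

1 - pnorm(q = 72, mean = 64.085, sd = 3.07)                   # about 0.005
pnorm(q = 72, mean = 64.085, sd = 3.07, lower.tail = FALSE)   # same thing
1 - pnorm(q = 72, mean = 64, sd = 3)                          # about 0.00383 with the rounded values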
b) In fact, 2 of the 60 women were 72 inches tall or taller. The actual proportion equals 2/60,
and in the histogram below, the shaded rectangle over 73 inches has an area of 2/60 (which
is about 0.033). Compared to 3.3%, the normal approximation of about 0.5% is quite an
underestimate.
Solution
a) The pnorm function gives left areas. When we want the area between two values, we use
the subtraction method, as illustrated:
The output is about 0.6711. If you round the average to 64 inches and the SD to 3 inches, the
output is about 0.6827, or about 68%. This is the same 68% as the empirical rule (see the note
below).
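In code, the subtraction method looks like this, using the average and SD from the heights example:

pnorm(q = 67, mean = 64.085, sd = 3.07) - pnorm(q = 61, mean = 64.085, sd = 3.07)   # about 0.67
pnorm(q = 67, mean = 64, sd = 3) - pnorm(q = 61, mean = 64, sd = 3)                 # about 0.6827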
Note: The average is about 64 inches, and the SD is about 3 inches. The interval from 61 to 67
inches corresponds to the heights within 1 SD of average. By the empirical rule, the percentage
of values falling in this interval is about 68%. The empirical rule comes from the normal curve,
so using the empirical rule is the same as giving a normal approximation.
b) In fact, 40 of the 60 women were between 61 and 67 inches tall: that’s 2 out of 3. The
normal approximation is remarkably good here. In the histogram below, the rectangles
sticking out past the curve compensate for the rectangles that fall short of the curve.
Solution
a) The area under the normal curve above a single point equals 0%. This happens for any curve
𝑦 = 𝑓(𝑥), because of the calculus fact that, for any real number 𝑎,
$$\int_a^a f(x)\, dx = 0$$
In fact, one of the women’s heights is recorded as 64 inches; the proportion is 1/60 or about
0.0167.
b) The interval “within half an inch of 64 inches” is the interval from 63.5 to 64.5, so we
compute the area under the normal curve between 63.5 and 64.5 by the subtraction
method. The output is about 0.1293, just under 13%.
In fact, 13 of the 60 women were between 63.5 and 64.5 inches tall, so the true proportion
is 13/60 or about 0.2167, just under 22%. The normal curve does badly, as shown in the
graph. The area under the curve is about 13%, but the area under the actual histogram is
about 22%.
Note: When “continuous” data are involved, it is better to ask about intervals rather than exact
values. The woman whose height was recorded as 64 inches was not exactly 64 inches tall (or
rather, whoever took the measurement could not know her height to that much precision).
What do we mean when we say someone is 64 inches tall? “Around 64 inches, give or take a
little bit.” This is our way of dealing with measurement error, to be discussed later in this
lesson.
For the normal curve, there is no ambiguity. For each proportion 𝑝 in the interval 0 < 𝑝 < 1,
there is precisely one value 𝑞 such that
$$\int_{-\infty}^{q} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = p$$
Mathematically, the qnorm function is the inverse of the pnorm function. Practically speaking,
that means you can specify a percentage and compute the boundary cutoff for that percentage.
The value 𝑞 is often called the (100p)th percentile; for example, the value 𝑞 with left area 𝑝 = 0.25 is the 25th percentile.
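A brief sketch of this inverse relationship on the standard scale:

qnorm(p = 0.0668)          # about -1.5, the value whose left area is 6.68%
pnorm(qnorm(p = 0.0668))   # back to 0.0668, since qnorm undoes pnorm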
Solution
We don’t have the average and SD of the data, but we don’t need them, because we only need
to give the standard units (z-scores), and we know the data histogram is normal shaped. Use
the normal quantile function qnorm and leave the mean and sd at their default values of 0
and 1 for standard units.
a) The left area of the value we seek is 0.10, and the normal quantile function outputs about
−1.28.
b) The left area of the value we seek is 0.95. From the normal quantile function, we get that
the value, in standard units, rounds to about 1.645.
c) The 90th percentile is the x-value whose left area under the histogram is 0.90. From the
normal quantile function, that value in standard units is about 1.28. Note that, by the
symmetry of the normal curve, the 10th percentile is the negative of the 90th percentile.
d) Recall that standard units are also called z-scores. The z-score of the 25th percentile, for a
normal distribution, is about −0.6745. In particular, the 25th percentile is less than 1 SD
below average.
e) The 50th percentile is the median. The normal curve is symmetric, so the median is the same
as the average. In standard units, the 50th percentile is 0.
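Since the solution says to leave mean and sd at their defaults, the calls behind parts (a) through (e) look like this:

qnorm(p = 0.10)   # about -1.28
qnorm(p = 0.95)   # about 1.645
qnorm(p = 0.90)   # about 1.28, the negative of the 10th percentile
qnorm(p = 0.25)   # about -0.6745
qnorm(p = 0.50)   # exactly 0, the median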
Quantiles refer to the boundaries between intervals that have equal percentages of data
values. We’ve already discussed medians in the previous lesson. A median is a boundary
between intervals that each hold 50% of the data values:
Note that there could be more than one median. (For the numbers on a die, 1,2,3,4,5,6, any
number between 3 and 4 inclusive is a median, by the mathematical definition. At least half of
the values are at most the median, and at least half of the values are at least the median.)
Another common quantile choice is quartiles, where each bin has 25% (one quarter) of the data
values:
When we divide the data into percentiles, each bin has 1% of the data values. The quartiles are
then the 25th, 50th, and 75th percentiles.
The quantile functions for continuous density curves have no ambiguity. But for an arbitrary
data set, there are issues. We illustrate some of the issues with the list 1, 2, 3, 4.
Question: What are the 25th, 50th, and 75th percentiles of the list 1, 2, 3, 4?
Answer 1: The 𝑝-quantile is defined to be the smallest data value 𝑞 whose cumulative relative
frequency is at least 𝑝. The cumulative relative frequencies for the list 1, 2, 3, 4 are
Number                          1      2      3      4
Cumulative relative frequency   25%    50%    75%    100%

By this definition, the 25th percentile is 1, the 50th percentile is 2, and the 75th percentile is 3.
Answer 2: Define the smallest number to be the 0th-percentile, the biggest number to be the
100th-percentile, and take the average (midpoint) of the values at the discontinuities for all the
percentiles in-between. By this definition, the 25th, 50th, and 75th percentiles of the list are 1.5, 2.5, and 3.5, respectively.
That’s 2 methods. All told, there are 9 different methods of defining the quantile function
included in R, which is a bit ridiculous. For an actual data set (list of numbers), if you want to
work with percentiles, we recommend you “work forward” with cumulative relative
frequencies, which are unambiguous (and do correspond to percentiles), rather than “work
backward” with quantiles.
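For instance, with the list 1, 2, 3, 4, the built-in quantile function gives different answers depending on its type argument (type = 1 appears to match Answer 1 above, and type = 2 appears to match Answer 2):

x <- c(1, 2, 3, 4)
quantile(x, probs = c(0.25, 0.50, 0.75), type = 1)   # 1, 2, 3
quantile(x, probs = c(0.25, 0.50, 0.75), type = 2)   # 1.5, 2.5, 3.5
quantile(x, probs = c(0.25, 0.50, 0.75))             # the default, type = 7: 1.75, 2.50, 3.25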
a) Assuming the normal approximation is valid, the fastest 10% of the runners had a net time
of how many seconds or less? Give the value in minutes, too.
b) Assuming the normal approximation is valid, the slowest 25% of the runners had a net time
of how many seconds or more? Convert to minutes.
Solution
a) Using the normal quantile function, the value with a left area of 0.10 is about 4356 seconds
(or about 72.6 minutes). That is, the fastest 10% of the runners completed the race within
about 72.6 minutes, start to finish.
b) Using the normal quantile function, the value with a right area of 0.25 has a left area of
0.75, and this value is about 6253 seconds (or about 104.2 minutes). That is, the slowest
25% of the runners completed the race in 104.2 minutes or more.
Note: The quantile function in R with method type 1 gives the 10th-percentile as 4409 seconds
and the 75th-percentile as 6169 seconds.
But that’s not all. Lots of powerful mathematical theorems can be proven when it is assumed
that the distributions in question are normal (including 𝑡-tests, 𝐹-tests, and inferential
regression).
It’s no wonder then that investigators want their data to be approximately normal, no matter
how dubious the claim. A histogram of the data with a normal curve plotted over it usually tells
the whole story, but there are plenty of fancy-looking methods to make the claim “approximately
normal” more believable. (When those don’t work, they sometimes claim that the technique
they want to apply is “robust against departures from normality” after all.)
We’ll discuss some of these methods now. Their purpose is to compare the distribution of a list
of numbers to the normal distribution.
QQ plots
A quantile-quantile plot compares the quantiles of any two distributions. A point (𝑥, 𝑦) pairs
together a value from the first distribution 𝑥 and a value from the second distribution 𝑦 that
both have the same percentile-rank. (For instance, they could both be the 25th percentile of
their respective distributions.) If the distributions are similar, then the points will be close to
the straight line 𝑦 = 𝑥. If the points are close to a different straight line, then one distribution
can be shifted and scaled to look like the other distribution. Otherwise, the points have no
pattern or have a pattern not at all like a line, and we cannot say the distributions have similar
shapes.
Most often, you will compare a real data set to a theoretical distribution to see whether the
theory can be applied to your data set. And the theoretical distribution most often brought into
the discussion is the one described by the normal curve.
If you’re in a hurry, the R function qqnorm delivers a Q-Q plot where the 𝑦-values are the
quantiles of your data set plotted over the corresponding quantiles of the standard normal
distribution, given by the 𝑥-values. We will take a little extra effort and use the qqplot
command (because after all, you might not always want to compare to normal).
To illustrate what a Q-Q plot communicates, we compare the list of numbers 1 through 100 to
the normal distribution.
We mark dashed lines through the point (117.65, 99). Why is this point on the Q-Q plot? The
number 99 is in the “data,” so a point with 𝑦 = 99 is plotted. On what 𝑥-value? Answer: the 𝑥-
value under the corresponding normal curve with the same quantile rank.
The number 99 has a percentile-rank of 99% in the list 1:100. Under the normal curve (with the
same average and SD as 1:100), the 99th-percentile is about 117.65.
The same idea applies for every other number in the “data” 1:100, except for 1 and 100. (Why
do they not have points in the plot?)
The Q-Q plot does not follow a straight line (except near the center), confirming what we know
to be obvious—a histogram of 1:100 is nothing like the normal curve.
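A sketch of how such a plot can be built with qqplot, mirroring the pattern used later in this lesson (SD is the "SD of a list" function from the previous lesson; the built-in sd gives nearly the same picture):

x <- 1:100
qqplot(x = qnorm(p = seq(from = 1/100, to = 99/100, by = 1/100),
                 mean = mean(x), sd = SD(x)),
       y = x,
       xlab = "Normal quantiles",
       ylab = "Quantiles of 1:100")
abline(a = 0, b = 1)   # the reference line y = x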
Solution
a) The qqplot function with the average and SD specified for the normal distribution gives
the graph below:
b) Taking the effort to make a more detailed Q-Q plot allows us to add the line easily:
The plot suggests what we saw earlier with the histogram. The women’s height distribution
in the hanes data is not perfectly normal shaped, but it is not very different from normal,
except at the extremes.
c) The tallest woman represented in the Q-Q plot has a height of 72.6 inches. She is the
second tallest in the data set. The tallest woman’s height is 72.7 inches, but no point in the
Q-Q plot has this 𝑦-value. Why? The biggest value has a percentile-rank of 100%. But for the
normal curve, there is no value with a percentile-rank of 100%. (Some students might want
to say that it’s “infinity,” but infinity is not a number.) The 72.6 inch value has a quantile
rank of 59/60, or about 98.3%. The value in the corresponding normal distribution with that
quantile rank is about 70.6 inches.
Why make a Q-Q plot when we already saw the histogram with the normal curve for
comparison? One reason might be because a histogram’s appearance depends on the bin
choices, but a Q-Q plot does not. There are choices about how quantiles are defined, however,
so both plots depend on our choices somewhat. Some might say a visual comparison of
histograms is subjective, but in a Q-Q plot we are visually comparing the dots to a straight
line—also subjective.
A histogram with the normal curve and a Q-Q plot give different vantage points of the same
general picture. If you really need to check normality, it never hurts to include both.
The skew of a list of numbers (henceforth, skew) is defined to be the average of the cubes of
the standardized values (z-scores) of the numbers in the list. The kurtosis of a list of numbers
(henceforth, kurtosis) is defined to be the average of the fourth powers of the standardized
values of the numbers in the list.
In some contexts, these are called the third standardized moment and the fourth standardized
moment, respectively. There are modifications to these formulas and alternative definitions,
but they are all rhapsodies on this theme.
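In R, the definitions translate into one-liners. This is just a sketch; the names skew and kurt are made up here, and SD is the "SD of a list" function used in this course:

skew <- function(x) mean(((x - mean(x)) / SD(x))^3)   # average of cubed z-scores
kurt <- function(x) mean(((x - mean(x)) / SD(x))^4)   # average of fourth-power z-scores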
If a histogram has a long right tail, those big numbers towards the right will have z-scores
greater than 1. Z-scores greater than 1 get even bigger when cubed, making the skew more
positive. On the other hand, if a histogram has a long left tail, those numbers towards the left
will have z-scores below −1. Such z-scores get even more negative when cubed, making the
skew more negative.
Whether positive or negative, z-scores bigger than 1 in magnitude become large and positive
when raised to the fourth power, raising the kurtosis. When a lot of the data values are farther
than 1 SD away from average, the kurtosis will be bigger. Note that kurtosis is never negative.
The normal curve has a skew of exactly 0 and a kurtosis of exactly 3. For density curves, defining
and calculating skew and kurtosis requires an integral. For the normal curve, the skew and
kurtosis integrals are:
$$\int_{-\infty}^{\infty} x^3 \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 0$$

$$\int_{-\infty}^{\infty} x^4 \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 3$$
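If you want to check these two values numerically without worrying about the series, R's built-in integrate function does the job:

integrate(function(x) x^3 * dnorm(x), lower = -Inf, upper = Inf)   # essentially 0
integrate(function(x) x^4 * dnorm(x), lower = -Inf, upper = Inf)   # about 3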
The women’s heights in the hanes data have a skew of about 0.4088, not too far from 0 but still
a bit positive. Looking at the histogram we made before, we can see why. The heights are
roughly symmetric, but there are a few extremely large values (tall women) on the right tail.
The kurtosis is about 3.637, bigger than the normal curve’s kurtosis of 3. The tails of the data
histogram on both sides are a bit thicker than the normal curve’s tails.
Solution
The skew of the list of endscores equals about −1.47. A histogram with bins of width 2 looks
like this:
The histogram has a long left tail; these extremely low values have large negative z-scores that
contributed large negative terms in the skew calculation. This histogram is said to be left
skewed. Note that the skew does not depend on the choice of bins. The skew depends only on
the data set itself.
The kurtosis is about 5.38, more than the normal curve’s 3 for comparison, representing the
fact that the tails are more predominant for the data than the normal curve would suggest.
Grades often have negative skew; that is, left-skewed histograms. That’s good (for most of the
students), because that suggests that low scores were unusual compared to the bulk of the
data.
Solution
The skew of the list of Income2005 values equals about +3.936. A histogram with 200 bins
looks like:
The histogram has a long right tail, and these values have large positive z-scores. The cubes of
these extremely large z-scores make the skew more positive. The histogram is right skewed.
Again, the skew calculation depends only on the list of numbers and not on how we set up the
histogram.
The kurtosis is a whopping 33.05. (Remember, it's only 3 for the normal curve.) The list
has many more incomes with z-scores outside of −1 to 1 than an approximately normal
distribution would have.
For fun, we add the normal curve to this histogram to see how non-normal it is:
Income distributions often have positive skew (that is, right-skewed histograms) due to a few
very large incomes.
Notation station
For theoretical distributions, it is common to use the Greek letters mu 𝜇 and sigma 𝜎 for the
average and standard deviation. (Note that mu corresponds to “m” for mean and sigma
corresponds to “s” for “standard deviation” or “spread.”) From the standard normal curve, shift
it over by subtracting 𝜇 from 𝑥, then scale it by dividing the input by 𝜎. Adjust the coefficient so
that the total area is still 1. You get:
$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Note that the transformation is the same as the z-score conversion $z = \frac{x-\mu}{\sigma}$, and in this case, the
center of the curve is at 𝑥 = 𝜇 and the concavity changes at 𝑥 = 𝜇 ± 𝜎 (at the points 1 SD away
from average).
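The dnorm function computes exactly this formula. For example, taking mu = 64 and sigma = 3 from the women's heights discussion:

mu <- 64; sigma <- 3
dnorm(61, mean = mu, sd = sigma)                                    # about 0.0806
(1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((61 - mu) / sigma)^2)    # the same value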
Practice Problems
The Pearson data set consists of survey data gathered by Karl Pearson and Alice Lee and used
for their important paper about regression published in 1903. We will discuss their work in
more detail in Lesson 05. For now, use the list of the heights of the fathers Pearson$father
in the families they surveyed to practice normal approximations. Note there are 1,078 fathers
represented in this list.
a) What percentage of the father heights are 60 inches or less, according to a normal
approximation?
b) What is the actual percentage in the data?
a) What percentage of the father heights are 72 inches or more, according to a normal
approximation?
b) What is the actual percentage in the data?
a) What percentage of the father heights are between 65 and 70 inches, according to a
normal approximation?
b) What is the actual percentage in the data?
a) Create a QQ-plot, like the one shown above, that compares the father heights data to a
normal distribution. The top-right circle is marked by dashed lines. What are its
coordinates, and what does it represent?
b) Find the coordinates of the bottom-left circle and interpret.
c) The tallest father height is 75.4 inches, and its exact quantile-rank is 1078/1078. Why
isn’t this father represented in the q-q plot?
hist(Pearson$father, breaks=75,
prob=TRUE, col="white",
xlab="Height (inches)",
xaxt='n',
main="The 1,078 fathers in the Pearson data")
axis(1, at=59:76)
b) No. Taller peaks represent more people there. The tall peaks are above the middle
(average) heights. The height of the histogram represents the density, or congestion, of
data values in each bin. The peaks are taller above the more common heights; that is,
there are more men with those heights.
c) The right tail represents the very tall men. The histogram is short there, because there
were not many very tall men.
d) The left tail represents the very short men. The histogram is short there, because there
were not many very short men.
c) 1, or 100%
d) 1, or 100%
b) The shortest father height is 59.0 inches. This is the y-coordinate of the bottom-left
point. Its exact quantile rank is 1/1078, and the value under the normal curve with the
same quantile rank is about 59.14 inches. This is the x-coordinate of the bottom-left
point.
# Check for the shortest heights
Pearson[Pearson$father<60,]
c) The exact quantile rank is 1078/1078 = 1, or 100%. The normal distribution does not
have a 100th-quantile. (The curve never touches the x-axis, so the area under the curve
to the left of any number is less than 100%, even if only by a tiny bit.) For fun, you can
see what R returns for qnorm of p=1.
length(scf2013$INCOME)
The plot is obviously not like a straight line. We conclude that a normal distribution is not like
the distribution of the income values. (Not by a long shot.) You might consider checking a qq-
plot for the log-incomes, though!
calc<-na.omit(calcgrades)
length(calc$endscore)
qqplot(x=qnorm(p=seq(from=1/1368, to=1367/1368, by=1/1368),
mean=mean(calc$endscore),
sd=SD(calc$endscore)),
y=calc$endscore,
xlab="Normal quantiles",
ylab="Data quantiles")
The q-q plot is obviously not like a straight line. The distribution of endscore values does not
look like a normal curve.
[1] In R, you can convert an entire list of numbers (vector) into its standard units rather easily,
because R applies the arithmetic component-wise. Subtract the average of the list from the
vector, then divide that vector by the SD of the vector. (Here, we use our “SD of a list” function.
The result is almost the same if you use the built-in sd function.) Then raise the vector to the
third power, then take the average:
mean(((Pearson$father - mean(Pearson$father))/SD(Pearson$father))^3)
Similarly for kurtosis, but raise to the fourth power instead:
mean(((Pearson$father - mean(Pearson$father))/SD(Pearson$father))^4)
The skew is about −0.088, very close to 0. The Pearson father heights have a nearly
symmetrical distribution.
The kurtosis is about 2.84, close to and a little bit smaller than 3. The heights have a little less
action in the tails than the normal curve.
[2]
The INCOME values in the scf2013 data have a skew of about 16.6 and a kurtosis of about
357.
The skew of 16.6 is large positive, indicating a long right tail. There are a few really big values,
giving the distribution a “positive skew.”
The whopping 357 for kurtosis suggests there is a lot more action in the tails of the data
histogram, on either side, than the normal curve would predict.
[3]
The skew is about 1.9, and the kurtosis is about 6.85. The positive skew corresponds to the long
right tail we see in the histogram. The kurtosis is more than 3, and we see that the data
histogram’s tails are quite different than the normal curve tails:
mean(((scf2013[scf2013$INCOME<400000,]$INCOME -
        mean(scf2013[scf2013$INCOME<400000,]$INCOME)) /
       SD(scf2013[scf2013$INCOME<400000,]$INCOME))^3)

mean(((scf2013[scf2013$INCOME<400000,]$INCOME -
        mean(scf2013[scf2013$INCOME<400000,]$INCOME)) /
       SD(scf2013[scf2013$INCOME<400000,]$INCOME))^4)

curve(dnorm(x, mean=mean(scf2013[scf2013$INCOME<400000,]$INCOME),
            sd=SD(scf2013[scf2013$INCOME<400000,]$INCOME)),
      add=TRUE)
[4] The endscore values in the calculus grades have a skew of about −1.47 (negative,
suggesting a long left tail) and a kurtosis of about 5.38 (more than 3, suggesting thicker tails
than the normal curve’s). The normal curve starts to taper down right of 80, but the data
histogram is tapering up towards its peak there.
calc<-na.omit(calcgrades)
mean(((calc$endscore - mean(calc$endscore))/SD(calc$endscore))^3)
mean(((calc$endscore - mean(calc$endscore))/SD(calc$endscore))^4)
hist(calc$endscore,
breaks=seq(from=0, to=100, by=2),
prob=TRUE,
xlab="calculus end scores",
ylab=NULL,
yaxt='n',
main=NULL,
col="white")
curve(dnorm(x, mean=mean(calc$endscore), sd=SD(calc$endscore)),
add=TRUE)