Normal approximation to data
The center is at 𝑥 = 0, and the changes in concavity (that make the elegant bell shape) occur at
𝑥 = −1 and 𝑥 = 1. The horizontal axis units are standard units (z-scores) as discussed in the
previous lesson. We can shift the center and scale the spread to get another normal curve, but
the standard normal curve is the place of reference. When we say “the normal curve” we mean
the curve plotted above.
Perhaps the first to discover something like the normal curve was the French mathematician
Abraham de Moivre (1667-1754) while working out probability approximations (more on that in
a later lesson). Sometimes, a data histogram is similar in shape to the normal curve.
There are several key properties of the normal curve to keep in mind. The total area under the
normal curve and above the x-axis equals 1 (or 100%). This is true for any density curve. We
work on the density scale, so that areas represent percentages. Next, the normal curve never
actually touches the x-axis. It gets really close, so it looks like it touches, but it never does. The
curve stretches off to negative infinity and infinity. Practically speaking, that’s not something to
worry about, because only about 0.000063 (that is, 63 out of 1,000,000) of the total area is
outside of −4 and 4. Finally, the normal curve is perfectly symmetric about its center at 𝑥 = 0.
(The vertical line 𝑥 = 0 is a reflection line for the curve.) For example, the area under the curve
and to the left of 𝑥 = −1.5 is equal to the area under the curve and to the right of 𝑥 = +1.5
and equals about 6.68%.
The so-called empirical rule percentages come from the normal curve:
• The area under the normal curve between −1 and +1 is about 0.6827 or about 68%.
• The area under the normal curve between −2 and +2 is about 0.9545 or about 95%.
• The area under the normal curve between −3 and +3 is about 0.9973 or about 99.7%.
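If you want to check these numbers yourself, here is a quick sketch using the pnorm function discussed later in this lesson:

pnorm(1) - pnorm(-1)   # about 0.6827
pnorm(2) - pnorm(-2)   # about 0.9545
pnorm(3) - pnorm(-3)   # about 0.9973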
Computing areas under the normal curve required printed tables in the days before
calculators. Nowadays, most handheld calculators and all statistics software have the normal
curve functions built in. To appreciate the mathematics involved, observe that the area under
the normal curve between 𝑥 = 𝑎 and 𝑥 = 𝑏 equals the integral
$$\int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$$
This integral cannot be computed directly using antiderivatives, because $e^{-x^2}$ has no
elementary antiderivative. Instead, the function $e^{-x^2}$ is expressed as a Taylor series:

$$e^{-x^2} = \sum_{k=0}^{\infty} \frac{(-x^2)^k}{k!} = 1 - \frac{x^2}{1!} + \frac{x^4}{2!} - \frac{x^6}{3!} + \cdots$$

The terms go on forever, but only a few are needed to estimate $e^{-x^2}$ within several decimal
places. The series can be adjusted to account for the $1/2$ coefficient of $-x^2$ and the constant
multiple 1/√2𝜋. Before calculators, mathematicians had to work out approximations by hand,
but once that tedious work was done, their approximations were published in tables the likes of
which you can still find included in many statistics books. Calculators make the work super-fast,
but they still use an approximation like the one the mathematicians used; computers cannot
magically evaluate the integral.
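As a rough illustration of the kind of computation involved (not the exact method any particular calculator uses), here is a sketch that integrates the series term by term; the helper name series_area is made up here for illustration:

# Integrating the series for e^(-t^2/2) term by term from 0 to z gives
#   sum over k of (-1)^k * z^(2k+1) / (2^k * k! * (2k + 1)),
# and dividing by sqrt(2*pi) gives the area under the standard normal curve from 0 to z.
series_area <- function(z, terms = 15) {
  k <- 0:(terms - 1)
  sum((-1)^k * z^(2 * k + 1) / (2^k * factorial(k) * (2 * k + 1))) / sqrt(2 * pi)
}
series_area(1)        # about 0.3413
pnorm(1) - pnorm(0)   # the built-in answer, also about 0.3413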
For example, the women’s heights in the hanes data have an average of about 64 inches and
an SD of about 3 inches. We shift the normal curve so that the center is at 64 inches, then scale
the curve so that the concavity changes at a distance of 3 inches away from average on either
side (that is, at 61 and at 67 inches).
Now we have an approximate histogram of the data. This normal curve, centered at 64 inches
with a spread of 3 inches, gives a smoothed-out picture of the actual histogram (possibly too
smoothed out), and we might use areas under this normal curve to get rough approximations
about the actual percentages of the women with heights in certain intervals. Incidentally, the
equation graphed above is
$$y = \frac{1}{3\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-64}{3}\right)^2}$$
If we don’t have a data set, but we do know (i) that the shape is approximately normal, (ii) what
the center (avg) is, and (iii) what the spread (SD) is, then we have all we need to know about
the distribution of data values. Areas under the normal curve give approximations of the areas
of the rectangles in the data set’s true histogram.
If we have the data available, there is really no reason to use normal approximations. Statistical
software can compute the true percentages for us. If you have an especially large data set and
know that the histogram is approximately normal, then you might summarize it using a normal
curve just to save on computing power and time. (The Taylor series for $e^{-x^2}$ is easy for a
computer to approximate in less than a second, much faster than organizing millions of data
values for you.)
The R functions dnorm, pnorm, qnorm, and rnorm each have mean and sd arguments that default to 0 and 1, respectively, corresponding
to the standard scale and the standard normal curve. You change these to the average and SD
of your data. You can read more about these functions by executing the ?dnorm command.
We’ll discuss the main aspects of the dnorm and pnorm functions now. The qnorm function
comes up later in this lesson. The rnorm function simulates random sampling, which we’ll
cover after probability.
dnorm
The “d” stands for “density.” Input an x value and dnorm outputs the height of the curve over
that x value. The height of the curve represents the density, just like the height of rectangles in
density histograms.
You will primarily use dnorm to plot the normal curve, not to do calculations.
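For example, a minimal sketch that plots the standard normal curve with dnorm (the plotting range of -4 to 4 is an arbitrary choice):

curve(dnorm(x), from = -4, to = 4,
      xlab = "standard units (z)", ylab = "density")
dnorm(0)   # the height of the curve at its center, about 0.3989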
pnorm
The “p” stands for “probability,” but it is also correct to think of it as standing for “proportion”
or “percentage.” Use pnorm to compute areas under the normal curve. Input a q value and
pnorm outputs the “cumulative probability” of q.
Given a list of numbers, the cumulative relative frequency of a number q is the proportion of
numbers in the list that are less than or equal to q. The previous lesson had several examples.
For instance, the list of probscores has an average of about 84.8. Of those numbers, 47 of
them are less than or equal to 84.8, so the cumulative relative frequency of 84.8 is 47/119 or
about 0.395. In a histogram, this corresponds to the area under the histogram to the left of
84.8.
That’s what the pnorm function does for the normal histogram. It tells you what proportion of
the distribution is less than or equal to a given value, because that proportion corresponds to
the left area under the density curve. From a function that only outputs left areas, you can
easily get right-side areas and in-between areas.
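A quick sketch on the standard scale (mean 0 and SD 1 are the defaults):

pnorm(-1.5)            # left area, about 0.0668
1 - pnorm(1.5)         # right area, by subtracting the left area from the total of 1
pnorm(1) - pnorm(-1)   # area between -1 and +1, about 0.68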
Solution
a) The histogram is shown, and the density scale is used. (You can review the process in the
previous lesson.)
b) The average height is 64.085 inches and the SD is about 3.07 inches. In R, you can store
these for quick reference. The solid line marks the average, and the dashed lines are a
distance of 1 SD from the average.
c) Here is where you use the dnorm function. The normal density curve is plotted on the
existing histogram plot. (Remember, your histogram must be on the density scale, or else
the normal curve will not appear!) Is the curve a good approximation? Not really. The data
have roughly that “mound” or “bell” shape, but obviously the data have a taller peak in the
center and higher proportions of extremely large and small values than the normal curve
would suggest.
d) Here is where you use the pnorm function. Specify the upper bound q as 60 inches, the
mean as the average of the data, and the sd as the SD of the list of heights. The output is
about 0.09175, or about 9%. (If you round the average to 64 and the SD to 3, you’ll get
about 0.09121, still about 9%, and just fine for practical purposes.)
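For reference, a sketch of the calls described above, with the average and SD from part (b):

pnorm(q = 60, mean = 64.085, sd = 3.07)   # about 0.09175
pnorm(q = 60, mean = 64, sd = 3)          # about 0.09121 with the rounded values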
In fact, 6 of these women were 60 inches or shorter. It happens that there are 60 women in the
data set. So, the actual proportion is 6 out of 60; that is, 10%. The graph below shows the data
histogram with the rectangles at 60 inches and below shaded; their area equals 10%. The
normal approximation did a decent job.
In a report, we prefer the 10%, because that is computed from the actual data, while the 9% is
only an approximation. Always give preference to an actual value when it is available.
Example – compare a normal approximation to the actual data over a right-hand interval
a) For the women in the hanes data, about what proportion of the women were 6 feet tall or
taller, according to the normal approximation?
b) What is the actual proportion?
Solution
a) The heights in the data frame are in inches, and 6 feet equals 72 inches. Because we want to
know about women 72 inches tall or taller, we want the area under the normal curve to the
right of 72 inches. The total area equals 1, so we deduct the left area from 1 to get the right
area. (Alternatively, you can set lower.tail=FALSE in the pnorm function.)
Either way, the output is about 0.00498, just under half a percent. If you rounded the
average to 64 inches and the SD to 3 inches, the output is about 0.00383.
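A sketch of both versions of the computation, using the average and SD from the earlier example:

1 - pnorm(q = 72, mean = 64.085, sd = 3.07)                   # about 0.005
pnorm(q = 72, mean = 64.085, sd = 3.07, lower.tail = FALSE)   # same thing
1 - pnorm(q = 72, mean = 64, sd = 3)                          # about 0.00383 with the rounded values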
b) In fact, 2 of the 60 women were 72 inches tall or taller. The actual proportion equals 2/60,
and in the histogram below, the shaded rectangle over 73 inches has an area of 2/60 (which
is about 0.033). Compared to 3.3%, the normal approximation of about 0.5% is quite an
underestimate.
Solution
a) The pnorm function gives left areas. When we want the area between two values, we use
the subtraction method, as illustrated:
The output is about 0.6711. If you round the average to 64 inches and the SD to 3 inches, the
output is about 0.6827, or about 68%. This is the same 68% as the empirical rule (see the note
below).
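In code, the subtraction method looks like this, using the average and SD from the heights example:

pnorm(q = 67, mean = 64.085, sd = 3.07) - pnorm(q = 61, mean = 64.085, sd = 3.07)   # about 0.67
pnorm(q = 67, mean = 64, sd = 3) - pnorm(q = 61, mean = 64, sd = 3)                 # about 0.6827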
Note: The average is about 64 inches, and the SD is about 3 inches. The interval from 61 to 67
inches corresponds to the heights within 1 SD of average. By the empirical rule, the percentage
of values falling in this interval is about 68%. The empirical rule comes from the normal curve,
so using the empirical rule is the same as giving a normal approximation.
b) In fact, 40 of the 60 women were between 61 and 67 inches tall: that’s 2 out of 3. The
normal approximation is remarkably good here. In the histogram below, the rectangles
sticking out past the curve compensate for the rectangles that fall short of the curve.
Solution
a) The area under the normal curve above a single point equals 0%. This happens for any curve
𝑦 = 𝑓(𝑥), because of the calculus fact that, for any real number 𝑎,
$$\int_a^a f(x)\, dx = 0$$
In fact, one of the women’s heights is recorded as 64 inches; the proportion is 1/60 or about
0.0167.
b) The interval “within half an inch of 64 inches” is the interval from 63.5 to 64.5, so we
compute the area under the normal curve between 63.5 and 64.5 by the subtraction
method. The output is about 0.1293, just under 13%.
In fact, 13 of the 60 women were between 63.5 and 64.5 inches tall, so the true proportion
is 13/60 or about 0.2167, just under 22%. The normal curve does badly, as shown in the
graph. The area under the curve is about 13%, but the area under the actual histogram is
about 22%.
Note: When “continuous” data are involved, it is better to ask about intervals rather than exact
values. The woman whose height was recorded as 64 inches was not exactly 64 inches tall (or
rather, whoever took the measurement could not know her height to that much precision).
What do we mean when we say someone is 64 inches tall? “Around 64 inches, give or take a
little bit.” This is our way of dealing with measurement error, to be discussed later in this
lesson.
For the normal curve, there is no ambiguity. For each proportion 𝑝 in the interval 0 < 𝑝 < 1,
there is precisely one value 𝑞 such that
$$\int_{-\infty}^{q} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = p$$
Mathematically, the qnorm function is the inverse of the pnorm function. Practically speaking,
that means you can specify a percentage and compute the boundary cutoff for that percentage.
The value 𝑞 is often called the (100p)th percentile; for example, the value 𝑞 with left area 𝑝 = 0.25 is the 25th percentile.
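A brief sketch of this inverse relationship on the standard scale:

qnorm(p = 0.0668)          # about -1.5, the value whose left area is 6.68%
pnorm(qnorm(p = 0.0668))   # back to 0.0668, since qnorm undoes pnorm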
Solution
We don’t have the average and SD of the data, but we don’t need them, because we only need
to give the standard units (z-scores), and we know the data histogram is normal shaped. Use
the normal quantile function qnorm and leave the mean and sd at their default values of 0
and 1 for standard units.
a) The left area of the value we seek is 0.10, and the normal quantile function outputs about
−1.28.
b) The left area of the value we seek is 0.95. From the normal quantile function, we get that
the value, in standard units, rounds to about 1.645.
c) The 90th percentile is the x-value whose left area under the histogram is 0.90. From the
normal quantile function, that value in standard units is about 1.28. Note that, by the
symmetry of the normal curve, the 10th percentile is the negative of the 90th percentile.
d) Recall that standard units are also called z-scores. The z-score of the 25th percentile, for a
normal distribution, is about −0.6745. In particular, the 25th percentile is less than 1 SD
below average.
e) The 50th percentile is the median. The normal curve is symmetric, so the median is the same
as the average. In standard units, the 50th percentile is 0.
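Since the solution says to leave mean and sd at their defaults, the calls behind parts (a) through (e) look like this:

qnorm(p = 0.10)   # about -1.28
qnorm(p = 0.95)   # about 1.645
qnorm(p = 0.90)   # about 1.28, the negative of the 10th percentile
qnorm(p = 0.25)   # about -0.6745
qnorm(p = 0.50)   # exactly 0, the median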
Quantiles refer to the boundaries between intervals that have equal percentages of data
values. We’ve already discussed medians in the previous lesson. A median is a boundary
between intervals that each hold 50% of the data values:
Note that there could be more than one median. (For the numbers on a die, 1,2,3,4,5,6, any
number between 3 and 4 inclusive is a median, by the mathematical definition. At least half of
the values are at most the median, and at least half of the values are at least the median.)
Another common quantile choice is quartiles, where each bin has 25% (one quarter) of the data
values:
When we divide the data into percentiles, each bin has 1% of the data values. The quartiles are
then the 25th, 50th, and 75th percentiles.
The quantile functions for continuous density curves have no ambiguity. But for an arbitrary
data set, there are issues. We illustrate some of the issues with the list 1, 2, 3, 4.
Question: What are the 25th, 50th, and 75th percentiles of the list 1, 2, 3, 4?
Answer 1: The 𝑝-quantile is defined to be the smallest data value 𝑞 whose cumulative relative
frequency is at least 𝑝. The cumulative relative frequencies for the list 1, 2, 3, 4 are
Number                          1      2      3      4
Cumulative relative frequency   25%    50%    75%    100%

By this definition, the 25th percentile is 1, the 50th percentile is 2, and the 75th percentile is 3.
Answer 2: Define the smallest number to be the 0th-percentile, the biggest number to be the
100th-percentile, and take the average (midpoint) of the values at the discontinuities for all the
percentiles in-between. By this definition, the 25th, 50th, and 75th percentiles of the list are 1.5, 2.5, and 3.5, respectively.
That’s 2 methods. All told, there are 9 different methods of defining the quantile function
included in R, which is a bit ridiculous. For an actual data set (list of numbers), if you want to
work with percentiles, we recommend you “work forward” with cumulative relative
frequencies, which are unambiguous (and do correspond to percentiles), rather than “work
backward” with quantiles.
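For instance, with the list 1, 2, 3, 4, the built-in quantile function gives different answers depending on its type argument (type = 1 appears to match Answer 1 above, and type = 2 appears to match Answer 2):

x <- c(1, 2, 3, 4)
quantile(x, probs = c(0.25, 0.50, 0.75), type = 1)   # 1, 2, 3
quantile(x, probs = c(0.25, 0.50, 0.75), type = 2)   # 1.5, 2.5, 3.5
quantile(x, probs = c(0.25, 0.50, 0.75))             # the default, type = 7: 1.75, 2.50, 3.25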
a) Assuming the normal approximation is valid, the fastest 10% of the runners had a net time
of how many seconds or less? Give the value in minutes, too.
b) Assuming the normal approximation is valid, the slowest 25% of the runners had a net time
of how many seconds or more? Convert to minutes.
Solution
a) Using the normal quantile function, the value with a left area of 0.10 is about 4356 seconds
(or about 72.6 minutes). That is, the fastest 10% of the runners completed the race within
about 72.6 minutes, start to finish.
b) Using the normal quantile function, the value with a right area of 0.25 has a left area of
0.75, and this value is about 6253 seconds (or about 104.2 minutes). That is, the slowest
25% of the runners completed the race in 104.2 minutes or more.
Note: The quantile function in R with method type 1 gives the 10th-percentile as 4409 seconds
and the 75th-percentile as 6169 seconds.
But that’s not all. Lots of powerful mathematical theorems can be proven when it is assumed
that the distributions in question are normal (including 𝑡-tests, 𝐹-tests, and inferential
regression).
It’s no wonder then that investigators want their data to be approximately normal, no matter
how dubious the claim. A histogram of the data with a normal curve plotted over it usually tells
the whole story, but there are plenty of fancy-looking methods to make the claim “approximately
normal” more believable. (When those don’t work, they sometimes claim that the technique
they want to apply is “robust against departures from normality” after all.)
We’ll discuss some of these methods now. Their purpose is to compare the distribution of a list
of numbers to the normal distribution.
QQ plots
A quantile-quantile plot compares the quantiles of any two distributions. A point (𝑥, 𝑦) pairs
together a value from the first distribution 𝑥 and a value from the second distribution 𝑦 that
both have the same percentile-rank. (For instance, they could both be the 25th percentile of
their respective distributions.) If the distributions are similar, then the points will be close to
the straight line 𝑦 = 𝑥. If the points are close to a different straight line, then one distribution
can be shifted and scaled to look like the other distribution. Otherwise, the points have no
pattern or have a pattern not at all like a line, and we cannot say the distributions have similar
shapes.
Most often, you will compare a real data set to a theoretical distribution to see whether the
theory can be applied to your data set. And the theoretical distribution most often brought into
the discussion is the one described by the normal curve.
If you’re in a hurry, the R function qqnorm delivers a Q-Q plot where the 𝑦-values are the
quantiles of your data set plotted over the corresponding quantiles of the standard normal
distribution, given by the 𝑥-values. We will take a little extra effort and use the qqplot
command (because after all, you might not always want to compare to normal).
To illustrate what a Q-Q plot communicates, we compare the list of numbers 1 through 100 to
the normal distribution.
We mark dashed lines through the point (117.65, 99). Why is this point on the Q-Q plot? The
number 99 is in the “data,” so a point with 𝑦 = 99 is plotted. On what 𝑥-value? Answer: the 𝑥-
value under the corresponding normal curve with the same quantile rank.
The number 99 has a percentile-rank of 99% in the list 1:100. Under the normal curve (with the
same average and SD as 1:100), the 99th-percentile is about 117.65.
The same idea applies for every other number in the “data” 1:100, except for 1 and 100. (Why
do they not have points in the plot?)
The Q-Q plot does not follow a straight line (except near the center), confirming what we know
to be obvious—a histogram of 1:100 is nothing like the normal curve.
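A sketch of how such a plot can be built with qqplot, mirroring the pattern used later in this lesson (SD is the "SD of a list" function from the previous lesson; the built-in sd gives nearly the same picture):

x <- 1:100
qqplot(x = qnorm(p = seq(from = 1/100, to = 99/100, by = 1/100),
                 mean = mean(x), sd = SD(x)),
       y = x,
       xlab = "Normal quantiles",
       ylab = "Quantiles of 1:100")
abline(a = 0, b = 1)   # the reference line y = x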
Solution
a) The qqplot function with the average and SD specified for the normal distribution gives
the graph below:
b) Taking the effort to make a more detailed Q-Q plot allows us to add the line easily:
The plot suggests what we saw earlier with the histogram. The women’s height distribution
in the hanes data is not perfectly normal shaped, but it is not very different from normal,
except at the extremes.
c) The tallest woman represented in the Q-Q plot has a height of 72.6 inches. She is the
second tallest in the data set. The tallest woman’s height is 72.7 inches, but no point in the
Q-Q plot has this 𝑦-value. Why? The biggest value has a percentile-rank of 100%. But for the
normal curve, there is no value with a percentile-rank of 100%. (Some students might want
to say that it’s “infinity,” but infinity is not a number.) The 72.6 inch value has a quantile
rank of 59/60, or about 98.3%. The value in the corresponding normal distribution with that
quantile rank is about 70.6 inches.
Why make a Q-Q plot when we already saw the histogram with the normal curve for
comparison? One reason might be because a histogram’s appearance depends on the bin
choices, but a Q-Q plot does not. There are choices about how quantiles are defined, however,
so both plots depend on our choices somewhat. Some might say a visual comparison of
histograms is subjective, but in a Q-Q plot we are visually comparing the dots to a straight
line—also subjective.
A histogram with the normal curve and a Q-Q plot give different vantage points of the same
general picture. If you really need to check normality, it never hurts to include both.
The skew of a list of numbers (henceforth, skew) is defined to be the average of the cubes of
the standardized values (z-scores) of the numbers in the list. The kurtosis of a list of numbers
(henceforth, kurtosis) is defined to be the average of the fourth powers of the standardized
values of the numbers in the list.
In some contexts, these are called the third standardized moment and the fourth standardized
moment, respectively. There are modifications to these formulas and alternative definitions,
but they are all rhapsodies on this theme.
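In R, the definitions translate into one-liners. This is just a sketch; the names skew and kurt are made up here, and SD is the "SD of a list" function used in this course:

skew <- function(x) mean(((x - mean(x)) / SD(x))^3)   # average of cubed z-scores
kurt <- function(x) mean(((x - mean(x)) / SD(x))^4)   # average of fourth-power z-scores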
If a histogram has a long right tail, those big numbers towards the right will have z-scores
greater than 1. Z-scores greater than 1 get even bigger when cubed, making the skew more
positive. On the other hand, if a histogram has a long left tail, those numbers towards the left
will have z-scores below −1. Such z-scores get even more negative when cubed, making the
skew more negative.
Whether positive or negative, z-scores bigger than 1 in magnitude become large and positive
when raised to the fourth power, raising the kurtosis. When a lot of the data values are farther
than 1 SD away from average, the kurtosis will be bigger. Note that kurtosis is never negative.
The normal curve has a skew of exactly 0 and a kurtosis of exactly 3. For density curves, defining
and calculating skew and kurtosis requires an integral. For the normal curve, the skew and
kurtosis integrals are:
$$\int_{-\infty}^{\infty} x^3 \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 0$$

$$\int_{-\infty}^{\infty} x^4 \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 3$$
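If you want to check these two values numerically without worrying about the series, R's built-in integrate function does the job:

integrate(function(x) x^3 * dnorm(x), lower = -Inf, upper = Inf)   # essentially 0
integrate(function(x) x^4 * dnorm(x), lower = -Inf, upper = Inf)   # about 3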
The women’s heights in the hanes data have a skew of about 0.4088, not too far from 0 but still
a bit positive. Looking at the histogram we made before, we can see why. The heights are
roughly symmetric, but there are a few extremely large values (tall women) on the right tail.
The kurtosis is about 3.637, bigger than the normal curve’s kurtosis of 3. The tails of the data
histogram on both sides are a bit thicker than the normal curve’s tails.
Solution
The skew of the list of endscores equals about −1.47. A histogram with bins of width 2 looks
like this:
The histogram has a long left tail; these extremely low values have large negative z-scores that
contributed large negative terms in the skew calculation. This histogram is said to be left
skewed. Note that the skew does not depend on the choice of bins. The skew depends only on
the data set itself.
The kurtosis is about 5.38, more than the normal curve’s 3 for comparison, representing the
fact that the tails are more predominant for the data than the normal curve would suggest.
Grades often have negative skew; that is, left-skewed histograms. That’s good (for most of the
students), because that suggests that low scores were unusual compared to the bulk of the
data.
Solution
The skew of the list of Income2005 values equals about +3.936. A histogram with 200 bins
looks like:
The histogram has a long right tail, and these values have large positive z-scores. The cubes of
these extremely large z-scores make the skew more positive. The histogram is right skewed.
Again, the skew calculation depends only on the list of numbers and not on how we set up the
histogram.
The kurtosis is a whopping 33.05. (Remember, it's only 3 for the normal curve.) The list
has many more incomes with z-scores outside of −1 to 1 than an approximately normal
distribution would have.
For fun, we add the normal curve to this histogram to see how non-normal it is:
Income distributions often have positive skew (that is, right-skewed histograms) due to a few
very large incomes.
Notation station
For theoretical distributions, it is common to use the Greek letters mu 𝜇 and sigma 𝜎 for the
average and standard deviation. (Note that mu corresponds to “m” for mean and sigma
corresponds to “s” for “standard deviation” or “spread.”) From the standard normal curve, shift
it over by subtracting 𝜇 from 𝑥, then scale it by dividing the input by 𝜎. Adjust the coefficient so
that the total area is still 1. You get:
$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Note that the transformation is the same as the z-score conversion $z = \frac{x-\mu}{\sigma}$, and in this case, the
center of the curve is at 𝑥 = 𝜇 and the concavity changes at 𝑥 = 𝜇 ± 𝜎 (at the points 1 SD away
from average).
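The dnorm function computes exactly this formula. For example, taking mu = 64 and sigma = 3 from the women's heights discussion:

mu <- 64; sigma <- 3
dnorm(61, mean = mu, sd = sigma)                                    # about 0.0806
(1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((61 - mu) / sigma)^2)    # the same value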
Practice Problems
The Pearson data set consists of survey data gathered by Karl Pearson and Alice Lee and used
for their important paper about regression published in 1903. We will discuss their work in
more detail in Lesson 05. For now, use the list of the heights of the fathers Pearson$father
in the families they surveyed to practice normal approximations. Note there are 1,078 fathers
represented in this list.
a) What percentage of the father heights are 60 inches or less, according to a normal
approximation?
b) What is the actual percentage in the data?
a) What percentage of the father heights are 72 inches or more, according to a normal
approximation?
b) What is the actual percentage in the data?
a) What percentage of the father heights are between 65 and 70 inches, according to a
normal approximation?
b) What is the actual percentage in the data?
a) Create a QQ-plot, like the one shown above, that compares the father heights data to a
normal distribution. The top-right circle is marked by dashed lines. What are its
coordinates, and what does it represent?
b) Find the coordinates of the bottom-left circle and interpret.
c) The tallest father height is 75.4 inches, and its exact quantile-rank is 1078/1078. Why
isn’t this father represented in the q-q plot?
hist(Pearson$father, breaks=75,
prob=TRUE, col="white",
xlab="Height (inches)",
xaxt='n',
main="The 1,078 fathers in the Pearson data")
axis(1, at=59:76)
b) No. Taller peaks represent more people there. The tall peaks are above the middle
(average) heights. The height of the histogram represents the density, or congestion, of
data values in each bin. The peaks are taller above the more common heights; that is,
there are more men with those heights.
c) The right tail represents the very tall men. The histogram is short there, because there
were not many very tall men.
d) The left tail represents the very short men. The histogram is short there, because there
were not many very short men.
c) 1, or 100%
d) 1, or 100%
b) The shortest father height is 59.0 inches. This is the y-coordinate of the bottom-left
point. Its exact quantile rank is 1/1078, and the value under the normal curve with the
same quantile rank is about 59.14 inches. This is the x-coordinate of the bottom-left
point.
# Check for the shortest heights
Pearson[Pearson$father<60,]
c) The exact quantile rank is 1078/1078 = 1, or 100%. The normal distribution does not
have a 100th-quantile. (The curve never touches the x-axis, so the area under the curve
to the left of any number is less than 100%, even if only by a tiny bit.) For fun, you can
see what R returns for qnorm of p=1.
length(scf2013$INCOME)
The plot is obviously not like a straight line. We conclude that a normal distribution is not like
the distribution of the income values. (Not by a long shot.) You might consider checking a qq-
plot for the log-incomes, though!
calc<-na.omit(calcgrades)
length(calc$endscore)
qqplot(x=qnorm(p=seq(from=1/1368, to=1367/1368, by=1/1368),
mean=mean(calc$endscore),
sd=SD(calc$endscore)),
y=calc$endscore,
xlab="Normal quantiles",
ylab="Data quantiles")
The q-q plot is obviously not like a straight line. The distribution of endscore values does not
look like a normal curve.
[1] In R, you can convert an entire list of numbers (vector) into its standard units rather easily,
because R applies the arithmetic component-wise. Subtract the average of the list from the
vector, then divide that vector by the SD of the vector. (Here, we use our “SD of a list” function.
The result is almost the same if you use the built-in sd function.) Then raise the vector to the
third power, then take the average:
mean(((Pearson$father - mean(Pearson$father))/SD(Pearson$father))^3)
Similarly for kurtosis, but raise to the fourth power instead:
mean(((Pearson$father - mean(Pearson$father))/SD(Pearson$father))^4)
The skew is about −0.088, very close to 0. The Pearson father heights have a nearly
symmetrical distribution.
The kurtosis is about 2.84, close to and a little bit smaller than 3. The heights have a little less
action in the tails than the normal curve.
[2]
The INCOME values in the scf2013 data have a skew of about 16.6 and a kurtosis of about
357.
The skew of 16.6 is large positive, indicating a long right tail. There are a few really big values,
giving the distribution a “positive skew.”
The whopping 357 for kurtosis suggests there is a lot more action in the tails of the data
histogram, on either side, than the normal curve would predict.
[3]
The skew is about 1.9, and the kurtosis is about 6.85. The positive skew corresponds to the long
right tail we see in the histogram. The kurtosis is more than 3, and we see that the data
histogram’s tails are quite different than the normal curve tails:
mean(((scf2013[scf2013$INCOME<400000,]$INCOME -
        mean(scf2013[scf2013$INCOME<400000,]$INCOME)) /
       SD(scf2013[scf2013$INCOME<400000,]$INCOME))^3)

mean(((scf2013[scf2013$INCOME<400000,]$INCOME -
        mean(scf2013[scf2013$INCOME<400000,]$INCOME)) /
       SD(scf2013[scf2013$INCOME<400000,]$INCOME))^4)

curve(dnorm(x, mean=mean(scf2013[scf2013$INCOME<400000,]$INCOME),
            sd=SD(scf2013[scf2013$INCOME<400000,]$INCOME)),
      add=TRUE)
[4] The endscore values in the calculus grades have a skew of about −1.47 (negative,
suggesting a long left tail) and a kurtosis of about 5.38 (more than 3, suggesting thicker tails
than the normal curve’s). The normal curve starts to taper down right of 80, but the data
histogram is tapering up towards its peak there.
calc<-na.omit(calcgrades)
mean(((calc$endscore - mean(calc$endscore))/SD(calc$endscore))^3)
mean(((calc$endscore - mean(calc$endscore))/SD(calc$endscore))^4)
hist(calc$endscore,
breaks=seq(from=0, to=100, by=2),
prob=TRUE,
xlab="calculus end scores",
ylab=NULL,
yaxt='n',
main=NULL,
col="white")
curve(dnorm(x, mean=mean(calc$endscore), sd=SD(calc$endscore)),
add=TRUE)