1 The Mathematics Behind Polling
1.1 Introduction
One place where we see inferential statistics used every day is in polling. Opinion polls, like it or not, are part of our political system. Polling is also used in marketing, sales, and entertainment. The intricacies of polling are far too complicated for us to treat completely. We can, however, understand the basic ideas behind this discipline and see how probability theory is used in polling. The view we take is greatly simplified and will necessarily gloss over some practical difficulties. It will, however, make it easier to interpret the kind of poll results normally reported in the news.
The basic idea is that in a large population of people a certain percentage
will agree on one particular issue. We would like to know what percentage of
people this is. Asking everyone and computing the exact result is out of the
question given the size of the population. We might at least try to estimate
the percentage by choosing a representative sample from the population and
determining the percentage in the sample. Assuming that our sample is truly
representative of the population, the percentage holding that opinion in the
sample should provide a reasonable estimate of the percentage in the entire
population.
Two natural questions might be: what exactly is "a reasonable estimate," and how confident are you in this estimate? Here is where statistics can use probability theory to quantify the results.
1.2 Setting Up the Model
Suppose that a certain percentage P0 of the population would answer "yes" to the question we ask, and that we choose one person from the population at random. That would mean that the probability that the person chosen will answer "yes" is

p0 = P0/100.

All we have done is convert the percentage to a ratio of the whole
and changed that ratio into a real number between 0 and 1. Our one rule about
our question is that a good number of people will answer "yes" and a good
number will answer "no." That means that p0 should not be too close to 0 or to
1. Now that we are thinking of a random experiment where a particular event
has probability p0 , we can consider that experiment as a Bernoulli trial where
an answer of "yes" is a success. We still only know that the probability exists,
and we do not know what it is.
Once we have chosen one person from the population, we do not want to
choose them again. So we will not. Now our other overriding assumption
about the population is that its actual size overwhelms any particular number
of people we pick from it. This means that removing one single person from the
population will have no measurable impact on the percentage who would answer
"yes." If we choose a second person, the probability that they will answer "yes"
is still p0. The same goes for a third, fourth, or fifth. Thus we reasonably
assume that every time we choose a person from the population, the probability
they will answer "yes" is always p0. That is to say, choosing n people randomly
from the population amounts to repeating the same Bernoulli trial n times.
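As a small illustration (my own sketch, not part of the original presentation), the model just described can be simulated in a few lines of Python; the value of p0 is of course unknown in a real poll and is fixed here only so that we can generate data.

    import random

    def simulate_poll(n, p0, seed=0):
        """Ask n randomly chosen people a yes/no question, treating each
        answer as an independent Bernoulli trial with "yes" probability p0."""
        rng = random.Random(seed)
        answers = [1 if rng.random() < p0 else 0 for _ in range(n)]
        return sum(answers)  # number of "yes" answers in the sample

    # With p0 = 0.675 and n = 1000 we expect roughly 675 "yes" answers.
    print(simulate_poll(1000, 0.675))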
We know a lot about the probability model that comes from repeating a
Bernoulli trial a number of times. We also have a quick way of computing probabilities in that model if the number of repetitions is large. At least, we can if we actually know the probability of success in one trial. We do not just yet, but let us go on.
If we choose a large sample of people, say n = 100, n = 500, n = 1000 or
more, the probability model should be very close to the normal distribution.
Now, as usual in a random experiment, anything can happen, but we still know what to expect. We expect that the number of successes in the sample will be close to the mean of the experiment, and that it will be within one or two standard deviations of this mean. Unfortunately we do not know this mean or this standard deviation, but that does not change where the result would be if we did know them. We expect that the result will end up somewhere in the middle part of the normal distribution approximating the probability model.
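To make "what to expect" concrete, here is a short sketch (again my own illustration, using only Python's standard library) that repeats the simulated poll many times and records how often the count of "yes" answers lands within one and within two standard deviations of the mean np0; the proportions come out near 0.68 and 0.95, as the normal approximation predicts.

    import random
    from math import sqrt

    def within_k_sigma(n=1000, p0=0.675, trials=2000, seed=1):
        """Estimate how often the number of successes in n Bernoulli trials
        falls within 1 and within 2 standard deviations of the mean n*p0."""
        rng = random.Random(seed)
        mean = n * p0
        sigma = sqrt(n * p0 * (1 - p0))
        hits1 = hits2 = 0
        for _ in range(trials):
            count = sum(1 for _ in range(n) if rng.random() < p0)
            if abs(count - mean) <= sigma:
                hits1 += 1
            if abs(count - mean) <= 2 * sigma:
                hits2 += 1
        return hits1 / trials, hits2 / trials

    print(within_k_sigma())  # roughly (0.68, 0.95)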
We have a chart that quantifies the center and spread of this distribution:

μ = np0,
σ² = np0(1 − p0),
σ = √(np0(1 − p0)).

Suppose, for example, that we ask n = 1000 people our question and that 675 of them answer "yes." The ratio of "yes" answers in the sample is s = 675/1000 = 0.675, and since we do not know the population probability we approximate it by

p0 ≈ s = 0.675.

Substituting this approximation into the formulas above gives a sample mean

m = ns = 1000 (0.675) = 675

and a sample variance

d² = ns(1 − s) = 1000 (0.675) (1 − 0.675) = 1000 (0.675) (0.325) = 219.38.
And finally a sample standard deviation

d = √(ns(1 − s)) = √219.38 = 14.811.
Since we are 90% confident that the sample mean will be no more than 1.645 standard deviations away from the actual mean, we can say we are 90% confident that the actual mean will be no more than 1.645 sample standard deviations away from the sample mean. So we are 90% confident that the actual mean μ is between

m − 1.645d and m + 1.645d.

That is to say, we get

675 − (1.645)(14.811) ≤ μ ≤ 675 + (1.645)(14.811),
650.64 ≤ μ ≤ 699.36.
But we know the relationship between the population mean and the population probability, μ = np0. Thus

650.64 ≤ 1000p0 ≤ 699.36.
So we are 90% confident that the actual probability p0 is in the interval

0.65064 ≤ p0 ≤ 0.69936.
Converting this to a percentage and doing a bit of rounding off, we have estimated, with 90% confidence, that the percentage of people in the population who would answer yes to our question is between 65% and 70%. In other words, the percentage is approximately 67.5% with a margin of error of 2.5% and a confidence level of 90%.
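The whole calculation above fits in a few lines of code. The sketch below is my own summary of the procedure described in this section (the helper name confidence_interval is mine); with n = 1000 and 675 "yes" answers it reproduces the numbers above: an estimate of 67.5%, a margin of error of about 2.4% (rounded to 2.5% in the text), and an interval of roughly 65.1% to 69.9%.

    from math import sqrt

    def confidence_interval(n, yes, z=1.645):
        """Confidence interval for the population ratio p0 using the normal
        approximation; z = 1.645 corresponds to 90% confidence."""
        s = yes / n                     # sample ratio of "yes" answers
        m = n * s                       # sample mean
        d = sqrt(n * s * (1 - s))       # sample standard deviation
        low, high = (m - z * d) / n, (m + z * d) / n
        return s, z * d / n, low, high  # estimate, margin of error, interval

    s, margin, low, high = confidence_interval(1000, 675)
    print(f"estimate {s:.1%} +/- {margin:.1%}, interval {low:.1%} to {high:.1%}")
    # estimate 67.5% +/- 2.4%, interval 65.1% to 69.9%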
1.3 Examples
Example 1 Suppose you would like to estimate the percentage of people in Ari-
zona that say they enjoy the summer heat. You survey 500 people and …nd
that 267 of them say that they do. This translates into a ratio of the whole of 267/500 = 0.534, or a percentage of 53.4%. What is the margin of error in the estimate if you use a confidence level of 90%?
First, the sample size is n = 500. We represented the number of people who answered yes as a ratio of the whole: s = 0.534. This will approximate the unknown population ratio:

p0 ≈ s = 0.534.
When we know the population probability, the formulas for the population pa-
rameters are
μ = np0,
σ² = np0(1 − p0),
σ = √(np0(1 − p0)).
We use these and the approximation of p0 to compute a sample mean, a sample variance, and a sample standard deviation:

m = ns = 500 (0.534) = 267,
d² = ns(1 − s) = 500 (0.534) (0.466) = 124.42,
d = √(ns(1 − s)) = √124.42 ≈ 11.15.

We are 90% confident that the actual mean is between m − 1.645d and m + 1.645d, that is, between 267 − (1.645)(11.15) ≈ 248.7 and 267 + (1.645)(11.15) ≈ 285.3. Dividing by n = 500, the population ratio p0 lies between approximately 0.497 and 0.571. That is,

49% ≤ P0 ≤ 58%.
We round off, making sure to widen the interval so that we do not lose any confidence in our estimation interval.
The final result we obtain is an estimate of 53.5% with a margin of error of 4.5% and a confidence level of 90%.
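For comparison, here is the same computation carried out for Example 1 (a self-contained sketch of my own; the text then widens the resulting interval to 49% to 58%).

    from math import sqrt

    n, yes, z = 500, 267, 1.645        # Example 1: 267 "yes" answers out of 500
    s = yes / n                        # 0.534
    d = sqrt(n * s * (1 - s))          # sample standard deviation, about 11.15
    low, high = (yes - z * d) / n, (yes + z * d) / n
    print(f"{low:.1%} to {high:.1%}")  # about 49.7% to 57.1%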
Example 2 Suppose that in a larger poll you ask n = 5000 people a similar yes-or-no question and 3254 of them answer "yes." The ratio of "yes" answers in the sample is s = 3254/5000 = 0.6508, so we approximate the unknown population ratio by

p0 ≈ s = 0.6508.

As before, the formulas for the population parameters are

μ = np0,
σ² = np0(1 − p0),
σ = √(np0(1 − p0)).
That allows us to compute a sample mean, a sample variance, and a sample standard deviation:

m = ns = 5000 (0.6508) = 3254,
d² = ns(1 − s) = 5000 (0.6508) (0.3492) ≈ 1136.3,
d = √(ns(1 − s)) ≈ 33.71.

That is, we are 90% confident that the actual mean lies between

m − 1.645d ≈ 3254 − 55.5 = 3198.5 and m + 1.645d ≈ 3254 + 55.5 = 3309.5.

Using μ = np0,

3198.5 ≤ 5000p0 ≤ 3309.5.

So we are 90% confident that the actual probability p0 is in the interval

3198.5/5000 ≤ p0 ≤ 3309.5/5000, that is, 0.6397 ≤ p0 ≤ 0.6619.
Thus the population percentage P0 is in the interval

63% ≤ P0 ≤ 67%.
One practical difficulty we have glossed over is how the sample is chosen. If the people a pollster questions are not drawn fairly from the population he actually wants to study, for example the citizens of a legal age to vote, the resulting sample is not likely to be representative of that population. As a result, as much statistics goes
into the process of selecting a random sample as in analyzing the data collected
from the sample.
A …nal note is that we have only seen two simple examples of statistics in
use. There are many, many more. There are methods, not that dissimilar from the ones above, that apply when the problem limits the size of a sample or test. These use mathematical distributions other than the normal
distribution. There are also estimation methods that can be used to sharpen
the results of a complete survey of a large population. Rather than beginning
with a small sample and roughly estimating counts in an entire population, these
techniques take the results of a comprehensive survey of a population and adjust
the raw data to account for errors in the counting and tabulation of the data
collected. The US Census Bureau uses these techniques to strengthen the quality of the results it reports involving demographic information about the country. However, there is a long-standing controversy that does not allow it to use these adjusted results in the data used by Congress to apportion legislative representation or to distribute federal funds to states and local communities. While this debate often revolves around the validity of the statistical methods, the real issue is the perceived advantage that using, or not using, statistics might have for one party or the other.