Descriptive and Inferential Statistics, VOL 1, NO 1
ABSTRACT
This paper introduces two basic concepts in statistics: (i) descriptive statistics and (ii) inferential
statistics. Descriptive statistics is the statistical description of the data set. Common descriptions
include: mean, median, mode, variance, and standard deviation. Inferential statistics is the drawing
of inferences or conclusions based on a set of observations. These observations have been described
by the descriptive statistics. From these descriptive statistics, an inference is made subject to a
predefined limit of error or confidence interval. The error in concluding the inference is called
inferential error. There are two types of inferential errors: (i) Type I error and (ii) Type II error.
Type I error occurs when the researcher accepts the alternative hypothesis although the null
hypothesis is true. Type II error occurs when the researcher rejects the alternative hypothesis
although the alternative hypothesis is true.
CITATION:
Sutanapong, C. and Louangrath, P.I. (2015). “Descriptive and Inferential Statistics.” Inter. J. Res.
Methodol. Soc. Sci., Vol. 1, No. 1: pp. 22-35. (Jan.–Mar. 2015).
1.0 INTRODUCTION
1.1 Population
Population is defined as the finite totality of the data set. Generally, population size is given by the
symbol N. Statistics is commonly defined as a population study. There are two ways that a
population may be studied: (i) the entire population may be studied in detail, i.e. a census study
where every head is counted, or (ii) a portion of the population may be sampled and an inference made
from the descriptive statistics obtained from the sample. A census study is generally not
economically feasible if the population is large. Sampling is a common form of population study.
1.2 Sample
Sample n is a portion of a population N where $n < N$, i.e. $n \subset N$. A sample is taken from a population
for the purpose of learning the characteristics of the population through estimation. The estimation
made from a sample is called an inference. The inference is made from the basic descriptive statistic
International Journal of Research & Methodology in Social Science
Vol. 1, No. 1, p.23 (Jan. – Mar. 2015). Online Publication
of a sample. These descriptions include: (i) sample size, (ii) sample mean, (iii) sample variance, and
(iv) sample standard deviation. Relevant to the discussion of “sample” is the idea of randomness;
hence, random sampling. The issue of randomness is not insignificant and deserves due attention
and independent treatment.
$$SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad \text{(A)}$$
…where $\sigma$ is the estimated population standard deviation and n is the size of the test sample. A
“test sample” is a sample taken for the purpose of preliminary determination of the characteristics of
the population; it is not the minimum sample needed to complete the study or research. If the
population is standard normal, i.e. N(0,1) with mean zero and variance of 1, equation (A) would become:
$$SE_{\bar{x}} = \frac{1}{\sqrt{n}} \qquad \text{(B)}$$
At this point, researchers would attempt to determine the value for n by assuming that $SE = 0.05$.
The calculation under equation (B) follows:
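$$
\begin{aligned}
SE_{\bar{x}} &= \frac{1}{\sqrt{n}} \\
0.05 &= \frac{1}{\sqrt{n}} \\
0.05\sqrt{n} &= 1 \\
\sqrt{n} &= \frac{1}{0.05} \\
n &= \left(\frac{1}{0.05}\right)^{2} = \frac{1}{0.0025} \\
n &= 400
\end{aligned}
$$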
It is concluded: “therefore,” the minimum sample size is 400. This reasoning and logic is faulty on
several grounds.
Primo, the logic is faulty because the researcher assumes that the data is normally
distributed. This assumption is not reasonable unless it has been tested and verified that the data or
population was normally distributed. This distribution verification may be accomplished by the
Anderson-Darling Test.
Secundo, the logic is faulty because the parameter (symbol) n used in equation (B) is not the
“minimum sample size” within the understanding of sample size needed to prove the condition:
$\bar{x} = \mu$ and $S = \sigma$ for sample-population inferential analysis.
Tertio, the logic is faulty by assuming that $SE = 0.05$. The parameter SE is known as
standard error, hence the abbreviation SE. This abbreviation and its attendant meaning have been
misinterpreted to mean “sampling error,” and the sampling error is mistakenly limited to an
arbitrary value of 0.05 by the further faulty reasoning that 0.05 comes
from the precision level or random error level used in the normal distribution curve. Further points of
criticism follow from this third fault of logic:
(i) The parameter SE in equation (A) stands for standard error. The word standard
refers to the standard score. The standard score is determined by the measurement unit
that is counted in “standard deviation,” or the distance counted in standard
deviation units away from the mean. To the right of the mean, it is called +S and
to the left of the mean, it is called -S, where S stands for standard deviation. However,
in the equation, instead of S, the Greek symbol sigma ($\sigma$) is used. It means further that
this sigma comes from the assumption of normal distribution where the sample and
population are assumed to have equivalent statistical information, i.e. $t = Z$ and
$\sigma = [(\bar{x} - \mu)/Z]\sqrt{n}$. For that reason, the expression of SE is followed by a subscript of
the sample mean, thus: $SE_{\bar{x}}$. When this reasoning and definition of SE are explained,
it becomes clear why the attempt to re-define SE to mean “sampling error” is
truly erroneous; and
(ii) the use of 0.05 is also faulty. The precision level or random chance error of
0.05 is used in statistical significance tests. In order to reach any conclusion in a
significance test, there must be a test statistic equation whose result is used as a
yardstick against the critical value read from the significance test table, i.e. t-Table, Z-Table,
chi-squared Table, F-table, etc. The value 0.05 in the erroneously interpreted equation
(B) comes from nowhere. It is arbitrarily picked and equated to the precision level that is
“commonly used” in statistics. This type of approach to statistics is spurious.
It is concluded that equations (A) and (B) are not formulae used to determine minimum
sample size. Minimum sample size is given by the following formulae:
$$n_Y = \frac{N}{1 + N(\alpha^{2})} \qquad \text{(C)}$$
where N is the population size and α is the error level, which is set at 0.05 for a 0.95 confidence interval.
This equation is known as the Yamane equation. It may be used only when the population
size is known. This is called the finite population formula. However, if the population size is not
known, then the Yamane equation is useless. In real life, we are faced with non-finite or unknown
population size.
If the size of the population is not known, the following formula is used:
$$n = \frac{Z^{2}\sigma^{2}}{E^{2}} \qquad \text{(D)}$$
…where Z is the standard score determined by $Z = \frac{\bar{x} - \mu}{S/\sqrt{n}}$ for the data set and $Z = \frac{x_i - \bar{x}}{S}$ for an
item within a set. A set is defined as $x_i : x_1, x_2, \ldots, x_n$. The parameter $E = SE$ in equation (A). We
will revisit the issue of minimum sample size in Sect. 5, infra.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$
where i = item; $x_i$ = each item observed; n = total count of items of $x_i$, i.e. $x_i : (x_1, x_2, \ldots, x_n)$; and
$\sum_{i=1}^{n}$ = sum of items from the 1st to the nth term. Therefore, the mean of the data set 195,170,165,165,160
is $\bar{x} = (195 + 170 + 165 + 165 + 160)/5 = 855/5 = 171$. The mean of 171 is an
estimate of all 5 elements in the set. However, this estimate does not give an exact number. Some of
the items in the set may be located above 171 and some may be found below 171. This difference is
called dispersion. This dispersion is illustrated by the mean difference. The mean difference is given
by:
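$$x_i - \bar{x} \qquad (2)$$
For the data set 195,170,165,165,160 with the mean of 171, the mean difference may be
calculated as: $195 - 171 = 24$; $170 - 171 = -1$; $165 - 171 = -6$; $165 - 171 = -6$; and $160 - 171 = -11$.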
This estimation is not accurate. This inaccuracy is shown through the mean difference
$x_i - \bar{x}$. Since some data points are located above and some data points are located below the mean,
the mean difference is calculated in order to get the dispersion per observation. The total
dispersion among all data points from the mean is determined by the sum of the squared individual mean
differences.
The difference of each data point from the mean, i.e. the missed estimation, may be located above
the mean, at the mean or below the mean. This dispersion is made uniform by squaring each
mean difference: $(x_i - \bar{x})^{2}$. For the data set 195,170,165,165,160 the squared mean differences
are 576, 1, 36, 36 and 121.
The total sum of the squared mean differences is 770. This is the measurement of the total dispersion
of all data points from the mean. This total dispersion is not helpful because it does not give any
information about the individual dispersion in the data set 195,170,165,165,160; therefore, it is
necessary to distribute the total dispersion to each data point in the set by
dividing the total dispersion by $n = 5$. This calculation follows the formula:
$$S^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2} \qquad (3)$$
Equation (3) is called sample variance. The calculation for the sample variance of the data set
195,170,165,165,160 follows:
$$
\begin{aligned}
S^{2} &= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2} = \frac{770}{5} \\
S^{2} &= 154
\end{aligned}
$$
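The arithmetic behind equations (1) to (3) can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative and are not part of the paper, and the variance divides by n exactly as equation (3) does.

```python
# Data set used in the text (heights of five individuals).
heights = [195, 170, 165, 165, 160]

n = len(heights)
mean = sum(heights) / n                       # equation (1)
deviations = [x - mean for x in heights]      # equation (2): x_i minus the mean
squared = [d ** 2 for d in deviations]        # squared mean differences
total_dispersion = sum(squared)               # sum of squared mean differences
variance = total_dispersion / n               # equation (3): divide by n

print("mean:", mean)
print("mean differences:", deviations)
print("sum of squares:", total_dispersion)
print("variance:", variance)
```

Running the sketch reproduces the mean of 171, the total dispersion of 770 and the variance of 154 reported in the text.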
The variance or average dispersion per data point in the set 195,170,165,165,160 with
mean 171 is $S^{2} = 154$. The variance represents the error of the estimate. Recall that the estimate
was the mean. The mean value was 171. Comparing the variance of 154 to the mean of 171, the error of
the estimate appears large. The variance appears to be a large error because the variance is the square
of the dispersion, to accommodate values “above” and “below” the estimated value: $\bar{x} = 171$.
In order to minimize the error of the estimate, it is necessary to standardize the error into a
standard score called standard deviation. Sample standard deviation is given by:
$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}} \qquad (4)$$
Equation (4) is the square root of the variance. Standard deviation is defined as the square root of
the variance. If the variance represents the “error” of the estimate, standard deviation represents the
“minimization” of the error. The calculation for the standard deviation of the data set
195,170,165,165,160 follows:
$$
\begin{aligned}
S &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}} = \sqrt{154} \\
S &= 12.41
\end{aligned}
$$
The standardized error of the estimate is 12.41; it is expressed in the same unit of measurement as
the data. The standard deviation represents the common parlance of “give and take” or “plus
or minus” when a person gives a certain value. For example, the average height of students
in this class is 171 plus or minus 12.41.
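As a cross-check, the same figures can be obtained from Python's standard `statistics` module. This is a sketch under one stated assumption: equations (3) and (4) divide by n, which corresponds to `pvariance` and `pstdev`; the divide-by-(n - 1) function `stdev` is shown only for comparison and is not the formula used in this paper.

```python
import statistics

heights = [195, 170, 165, 165, 160]

# Divide-by-n versions, matching equations (3) and (4) in the text.
var_n = statistics.pvariance(heights)
sd_n = statistics.pstdev(heights)

# Divide-by-(n - 1) version, for comparison only; the paper does not use it.
sd_n_minus_1 = statistics.stdev(heights)

print("variance (divide by n):", var_n)
print("standard deviation (divide by n):", round(sd_n, 2))
print("standard deviation (divide by n - 1):", round(sd_n_minus_1, 2))
```

The divide-by-n result rounds to the 12.41 used throughout the text; the divide-by-(n - 1) result is somewhat larger for a sample this small.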
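To make the calculation easier, we generally construct a table to calculate all descriptive
statistics of a sample. This table is produced below.

  i      x_i    x_i - mean    (x_i - mean)^2
  1      195        24             576
  2      170        -1               1
  3      165        -6              36
  4      165        -6              36
  5      160       -11             121
  Sum    855         0             770

Mean = 171
Variance = 154
Standard deviation = 12.41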
The variance gives uniformity of the dispersion, but it does not standardize the dispersion
about (around) the mean (estimate); therefore, to standardize the measurement of the dispersion, the
square root of the variance is taken. This is called standard deviation.
Standard deviation is the standard score, i.e. the uniform dispersion of data, about the mean
of the data set. The standard deviation is used as a correcting value for the estimated mean.
Therefore, the standard deviation is given as plus or minus S about the mean. When the mean is
given, it must be given with plus or minus standard deviation in the form $\bar{x} \pm S$ because $\bar{x}$ is an
estimated value and this estimate is not 100% accurate.
$$t = \frac{\bar{x} - \mu}{S/\sqrt{n}} \qquad (5)$$
where …
t = sample standardized score, i.e. how far away from the mean;
$\bar{x}$ = sample mean;
$\mu$ = ideal mean or assumed population mean;
S = standard deviation of the sample data; and
n = sample size.
The meaning of the t-formula may be described as the probability distribution of the data
points in the set. For example, we are working with the data set 195,170,165,165,160 which may
be illustrated by the histogram below.
[Figure: bar chart of HEIGHT (vertical axis, 0 to 300) for INDIVIDUALS 1 to 5 (n = 5), with bar values 195, 170, 165, 165, 160.]
The estimated height for the group is 171; this is the mean for the group. However, the data set
195,170,165,165,160 shows that each data point does not equal 171. The t-equation is a
tool to provide the distance between each data point and the mean in a standard score form.
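To make the idea of distance in standard-score form concrete, the sketch below applies the per-item form $Z = (x_i - \bar{x})/S$ given earlier in the paper to each height. The variable names are illustrative assumptions, and S is computed with the divide-by-n formula of equation (4).

```python
import math

heights = [195, 170, 165, 165, 160]

n = len(heights)
mean = sum(heights) / n
# Standard deviation with the divide-by-n formula of equation (4).
sd = math.sqrt(sum((x - mean) ** 2 for x in heights) / n)

# Distance of each item from the mean, counted in standard deviation units.
z_items = [(x - mean) / sd for x in heights]

for x, z in zip(heights, z_items):
    print(f"height {x}: {z:+.2f} standard deviations from the mean")
```

Positive values lie above the mean of 171 and negative values below it; the scores sum to zero because the deviations from the mean do.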
The standard score measurement assumes that if there were an ideal data
distribution, it would be normally distributed in a perfect bell-shaped curve called a normal
curve. This curve is illustrated below.
For the data set 195,170,165,165,160, the standard score under the t-equation may be
calculated. Using equation (5), the standard score is calculated thus:
$$t = \frac{\bar{x} - \mu}{S/\sqrt{n}} = \frac{171 - \mu}{12.41/\sqrt{5}}$$
The value for $\mu$ is missing. The value of $\mu$ is the ideal height which may be estimated via
the t-equation if the value of t is known. The t-value is called the critical value. The critical value is
given by the t-table. In order to read the t-table, two pieces of information are required: (i) the degree of
freedom of the data set and (ii) the level of confidence.
The degree of freedom is defined as the range of the data points from its first data point to its
last data point. The degree of freedom is formally defined as: $df = n - 1$. In the present case, $n = 5$;
therefore, the degree of freedom is $df = n - 1 = 5 - 1 = 4$. This degree of freedom may be read on the
t-table at the first column.
The confidence level is the percentage distribution of the data within the probability
distribution curve (see Figure 2.0) which we accept as “normal.” If the observation falls
within this range of confidence, it is said that there is no significance because the data value is
classified as a normal occurrence. If the data value falls outside of the confidence
range, it is said to be significant because it is not within the range of normal expectation. Generally,
by common practice a range of $\bar{x} \pm 2S$, or plus or minus two units of standard deviation about the
mean, is used. This range of $\bar{x} \pm 2S$ encompasses 0.95 or 95% of the data under the curve shown in
Figure 2.0.
With 4 degrees of freedom at the 95% confidence level, the critical value of the t-score is
2.13 (the one-tailed value at the 0.05 level). The reading of the t-table is illustrated below.
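$$
\begin{aligned}
t &= \frac{\bar{x} - \mu}{S/\sqrt{n}} \\
t\left(\frac{S}{\sqrt{n}}\right) &= \bar{x} - \mu \\
2.13\left(\frac{12.41}{\sqrt{5}}\right) &= 171 - \mu \\
2.13\left(\frac{12.41}{2.24}\right) &= 171 - \mu \\
2.13(5.55) &= 171 - \mu \\
11.82 &= 171 - \mu \\
\mu &= 159.18
\end{aligned}
$$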
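The rearrangement of the t-equation above can be reproduced numerically. The sketch below is illustrative only: the variable names are assumptions, and the inputs are the rounded values used in the text (critical t of 2.13, S of 12.41, sample mean of 171, n of 5).

```python
import math

# Values taken from the text.
x_bar = 171.0    # sample mean
s = 12.41        # sample standard deviation
n = 5            # sample size
t_crit = 2.13    # critical value read from the t-table at df = 4

# Rearranged t-equation: t * (S / sqrt(n)) = x_bar - mu
margin = t_crit * s / math.sqrt(n)
mu = x_bar - margin

print("t * S / sqrt(n):", round(margin, 2))
print("estimated population mean:", round(mu, 2))
```

With these inputs the margin rounds to 11.82 and the estimated population mean to 159.18, in agreement with the hand calculation.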
The value of $\mu$ is 159.18; it means that the “expected” average height of the population from
which the sample data set 195,170,165,165,160 was drawn is 159.18. However, this number is an estimate.
Like the estimated sample mean, the estimated population mean is also not 100% accurate. The standard
used in this estimation is the 0.95 or 95% confidence interval. Therefore, it is necessary to give the
estimated population height of 159.18 in an interval form. In order to construct an interval, it is
necessary to determine the population standard deviation.
The t-equation gives us the population mean; however, it does not give a population standard deviation.
We need to look for a population standard deviation elsewhere. We have mentioned the term
“assumed population,” which we construct as an ideal population. This ideal population must also be
fitted to the 0.95 confidence interval in order to give us a standard score for the population. The
standard score for the population is given by the Z-equation. The Z-equation is given by:
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \qquad (6)$$
where Z = population standardized score, i.e. how far away from the mean; $\bar{x}$ = sample mean; $\mu$ =
ideal mean or assumed population mean; $\sigma$ = standard deviation of the ideal population; and n =
sample size.
Similar to the exercise we did above in finding the critical value for t, under equation (6)
with the facts given, we need to find the critical value for Z. The Z-table gives the critical value for Z
by confidence level. We have been using the 0.95 confidence level in our calculation in the
t-equation. Using 0.95 as the confidence level, the Z-critical value is 1.645 (one-tailed), which may be
rounded to 1.65. The reading of this value is illustrated below.
Throughout this Tutorial Note, the value 1.65 is used as the standard critical value for Z.
Using the Z-equation, the estimated population standard deviation may be calculated. The
calculation for the population standard deviation follows:
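$$
\begin{aligned}
Z &= \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \\
\sigma &= \left(\frac{\bar{x} - \mu}{Z}\right)\sqrt{n} \\
\sigma &= \left(\frac{171 - 159.18}{1.65}\right)\sqrt{5} \\
\sigma &= \left(\frac{11.82}{1.65}\right)(2.24) \\
\sigma &= 7.16(2.24) \\
\sigma &= 16.05
\end{aligned}
$$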
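The same rearrangement of the Z-equation can be sketched in Python. The names are illustrative assumptions and the inputs are the values from the text. Because the code keeps the full precision of the square root of 5 rather than the rounded 2.24, its result differs slightly from the hand calculation.

```python
import math

# Values taken from the text.
x_bar = 171.0    # sample mean
mu = 159.18      # estimated population mean from the t-equation
n = 5            # sample size
z_crit = 1.65    # critical value of Z used in the text

# Rearranged Z-equation: sigma = ((x_bar - mu) / Z) * sqrt(n)
sigma = (x_bar - mu) / z_crit * math.sqrt(n)

print("estimated population standard deviation:", round(sigma, 2))
```

The sketch gives about 16.02, against 16.05 in the text; the gap comes only from rounding the intermediate values 7.16 and 2.24 by hand.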
The estimated population standard deviation is $\sigma = 16.05$. Using the 0.95 confidence interval,
where 0.95 of the data falls within 2 units of standard deviation about the mean, the range of the 0.95
confidence interval for the estimated population may be calculated thus: $\mu \pm 2\sigma$. The value $\mu + 2\sigma$
may be called the upper range and $\mu - 2\sigma$ may be called the lower range.
The upper range is: $\mu + 2\sigma = 159.18 + 2(16.05) = 159.18 + 32.10 = 191.28$ and the lower
range is $\mu - 2\sigma = 159.18 - 2(16.05) = 159.18 - 32.10 = 127.08$. The estimated population mean of
159.18 now is more meaningful because it has a range between 127.08 and 191.28.
Recall that the sample data set was 195,170,165,165,160, the sample mean was 171 and
the sample standard deviation was 12.41. To construct a range for the 0.95 confidence interval, we simply
write $\bar{x} + 2S$ for the upper range and $\bar{x} - 2S$ for the lower range. The values of these two end points
of the interval may be calculated thus: $\bar{x} + 2S = 171 + 2(12.41) = 171 + 24.82 = 195.82$ for the upper
end of the range and $\bar{x} - 2S = 171 - 2(12.41) = 171 - 24.82 = 146.18$ for the lower end of the range of
the sample.
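Both ranges follow the same plus or minus two standard deviations rule, so they can be computed with one small helper. The function name `two_sd_range` is a hypothetical name chosen for this sketch; the inputs are the rounded values from the text.

```python
def two_sd_range(center, spread):
    """Return (lower, upper) as center minus and plus two units of spread."""
    return center - 2 * spread, center + 2 * spread

# Estimated population: mu = 159.18, sigma = 16.05
pop_low, pop_high = two_sd_range(159.18, 16.05)

# Sample: x_bar = 171, S = 12.41
samp_low, samp_high = two_sd_range(171.0, 12.41)

print("population range:", round(pop_low, 2), "to", round(pop_high, 2))
print("sample range:", round(samp_low, 2), "to", round(samp_high, 2))
```

The output matches the ranges in the text: 127.08 to 191.28 for the estimated population and 146.18 to 195.82 for the sample.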
Given that $t = \frac{\bar{x} - \mu}{S/\sqrt{n}}$ and $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, set the two equations equal to one another, thus
$t = Z$. The simplification of the terms follows:
$$\frac{\bar{x} - \mu}{S/\sqrt{n}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
Multiply both sides by one of the denominators:
$$\frac{\sigma/\sqrt{n}}{S/\sqrt{n}} = \frac{\bar{x} - \mu}{\bar{x} - \mu} = 1$$
The term $\bar{x} - \mu$ on both sides is reduced to 1, so that $S = \sigma$. The only term that
may vary is the individual item within a set of $x_i : [x_1, x_2, \ldots, x_n]$. The only fixed term is the estimated value or the mean:
$\bar{x}$. Therefore, the comparison $S = \sigma$ also yields another property of the normal distribution:
$\bar{x} = \mu$; the sample mean is equal to the population mean within the 0.95 confidence interval. A
question arises: what sample size would yield this condition: $\bar{x} = \mu$?
$$n_Y = \frac{N}{1 + N(\alpha^{2})} \qquad (7)$$
where N = known population size and $\alpha$ is the precision level or error level. Conventionally, the
value of $\alpha$ is set at 0.05 where the confidence interval is defined as 0.95.
In the example that we are working with, the data set 195,170,165,165,160, the population
size was not given. Assume that the population size is 2000 people. From the Yamane equation the
minimum sample size may be determined. The calculation under equation (7) follows:
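$$
\begin{aligned}
n_Y &= \frac{N}{1 + N(\alpha^{2})} = \frac{2000}{1 + 2000(0.05^{2})} = \frac{2000}{1 + 2000(0.0025)} \\
&= \frac{2000}{1 + 5} = \frac{2000}{6} \\
&= 333.33 \\
n_Y &\approx 333
\end{aligned}
$$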
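Equation (7) is simple enough to wrap in a function. The sketch below is a minimal illustration; the function name `yamane_sample_size` and its default argument are assumptions made for this example, not notation from the paper.

```python
def yamane_sample_size(population_size, alpha=0.05):
    """Yamane finite-population formula: N / (1 + N * alpha**2)."""
    return population_size / (1 + population_size * alpha ** 2)

# Worked example from the text: assumed population of 2000 people.
n_y = yamane_sample_size(2000)

print("Yamane sample size:", round(n_y, 2))
print("rounded to a whole count as in the text:", round(n_y))
```

For N = 2000 and an error level of 0.05 the function returns about 333.33, which the text reports as 333.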
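$$n = \frac{Z^{2}\sigma^{2}}{E^{2}} \qquad (8)$$
where Z = standard score where $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$; $\sigma$ = population standard deviation; and E = standard
error where $E = \frac{\sigma}{\sqrt{n}}$.
Recall the t-equation: $t = \frac{\bar{x} - \mu}{S/\sqrt{n}}$; solve for $\mu$, thus: $t\left(\frac{S}{\sqrt{n}}\right) = \bar{x} - \mu$. From our prior calculation for
the data set 195,170,165,165,160 we have $t = 2.13$, $S = 12.41$ and $\bar{x} = 171.00$. Solve for $\mu$ in
$t\left(\frac{S}{\sqrt{n}}\right) = \bar{x} - \mu$: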
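$$
\begin{aligned}
t &= \frac{\bar{x} - \mu}{S/\sqrt{n}} \\
t\left(\frac{S}{\sqrt{n}}\right) &= \bar{x} - \mu \\
2.13\left(\frac{12.41}{\sqrt{5}}\right) &= 171 - \mu \\
2.13\left(\frac{12.41}{2.24}\right) &= 171 - \mu \\
2.13(5.55) &= 171 - \mu \\
11.82 &= 171 - \mu \\
\mu &= 159.18
\end{aligned}
$$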
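Now solve for the population standard deviation using the Z-equation:
$$
\begin{aligned}
Z &= \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \\
\sigma &= \left(\frac{\bar{x} - \mu}{Z}\right)\sqrt{n} \\
\sigma &= \left(\frac{171 - 159.18}{1.65}\right)\sqrt{5} \\
\sigma &= \left(\frac{11.82}{1.65}\right)(2.24) \\
\sigma &= 7.16(2.24) \\
\sigma &= 16.05
\end{aligned}
$$
Recall that $E = \frac{\sigma}{\sqrt{n}}$; therefore, $E = \frac{16.05}{\sqrt{5}} = \frac{16.05}{2.24} = 7.17$.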
We are now ready to determine the minimum sample size using equation (8):
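$$
\begin{aligned}
n &= \frac{Z^{2}\sigma^{2}}{E^{2}} = \frac{(1.65^{2})(16.05^{2})}{7.17^{2}} = \frac{(2.7225)(257.6025)}{51.34} = \frac{701.32}{51.34} \\
n &= 13.66
\end{aligned}
$$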
The minimum sample size is about 14 compared to 333 under the Yamane equation. It is clear
from this example that the Yamane equation is not an efficient means for minimum sample size
calculation.
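The calculation under equation (8) can be sketched as follows, using the values derived above. The variable names are illustrative assumptions. The code keeps full floating-point precision, so its unrounded result differs slightly from the hand-rounded 13.66 while rounding up to the same whole count.

```python
import math

# Values taken from the text.
z_crit = 1.65    # critical value of Z
sigma = 16.05    # estimated population standard deviation
n_test = 5       # size of the test sample

# Standard error E = sigma / sqrt(n), as defined under equation (8).
e = sigma / math.sqrt(n_test)

# Equation (8): n = Z^2 * sigma^2 / E^2
n_min = (z_crit ** 2 * sigma ** 2) / e ** 2

print("standard error E:", round(e, 2))
print("minimum sample size (unrounded):", round(n_min, 2))
print("minimum sample size (rounded up):", math.ceil(n_min))
```

Rounded up to a whole observation, the result is 14, the figure compared with the Yamane result of 333 in the text.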
6.0 CONCLUSION
In this introduction to descriptive and inferential statistics, we provide a series of illustrations of how
descriptive and inferential statistics are calculated. In addition, the minimum sample size required
for a non-biased representation of the population by the sample was also explained. We also pointed
out that the use of 400 counts as the sample size under the Yamane method is a misuse and
misunderstanding of how to calculate sample size. The Yamane method is reserved for a finite
population scenario. In the non-finite case, the Yamane method is not appropriate.
REFERENCES
Brewer, Ken (2002). Combined Survey Sampling Inference: Weighing of Basu's Elephants. Hodder
Arnold. p. 6. ISBN 978-0340692295.
Evans, Michael J. and Rosenthal, Jeffrey S. (2004). Probability and Statistics: The Science
of Uncertainty. W. H. Freeman and Company. p. 267. ISBN 9780716747420.
Glaser, B. (1965). “The constant comparative method of qualitative analysis.” Social Problems, 12,
436–445.
Kish, Leslie (1965). Survey Sampling. New York: John Wiley and Sons, Inc. p. 17.
Onwuegbuzie, A. J. and Leech, N. L. (2007). “A call for qualitative power analyses.” Quality &
Quantity, 41, 105–121. doi:10.1007/s11135-005-1098-1
Sandelowski, M. (1995). “Sample size in qualitative research.” Research in Nursing & Health, 18,
179–183.
Sudman, Seymour (1976). Applied Sampling. New York: Academic Press.
Yamane, Taro (1967). Statistics, An Introductory Analysis, 2nd Ed. New York: Harper and Row.