
Sampling Methods and Estimation of Sample Size: Known

1. The document discusses different methods of sampling from a population, including probability and non-probability sampling. 2. Probability sampling methods ensure that every member of the population has a known chance of being selected, allowing researchers to calculate sampling error. These include random sampling, systematic sampling, and stratified sampling. 3. Random sampling involves randomly selecting subjects from the entire population. Systematic sampling selects every nth subject from a list. Stratified sampling divides the population into relevant subgroups and then randomly samples from each subgroup.

Uploaded by

urooj
Copyright
© © All Rights Reserved

Unit 15
Sampling Methods and Estimation of Sample Size
Contents
15.1 Introduction
15.2 Sampling
15.3 Classification of Sampling Methods
15.4 Sample Size
15.5 Conclusion

Learning Objectives
It is expected that after reading Unit 15 you would be able to
• Define what sampling is
• Classify sampling methods
• Calculate sample size.

15.1 Introduction
Unit 15 deals with the procedure of sampling that helps you arrive at a
subset of the universe of your research. It discusses the various methods
of sampling and tells you how to work out a sample size. You will again
read about sampling in Block 6. This is a subject you will need to master
carefully because, no matter what type of research you wish to carry out,
you will need to apply your skill in the craft of sampling.

15.2 Sampling
A sample is a subset of the population that represents the entire group.
When the population (or universe) is too large for the researcher to
survey all its members because of the cost, the number of personnel to
be employed, or the time constraint, a small carefully chosen sample is
extracted to represent the whole (see Figure 15.1). The sample, as
drawn in Figure 15.1, is expected to reflect the characteristics of the
population.
A well-selected sample may provide superior results. For example, in
research where well-trained interviewers are required, it may be possible
to get a few trained interviewers to collect a sample rather than to get
many trained interviewers to investigate the entire population. The
trained interviewers may gather better quality information than
non-trained or less trained interviewers. By contrast, if the population is
sufficiently small, the entire population should be studied. When data
are gathered on each and every member of the population, the study is
known as a census study. The researcher is expected to clearly define
the target population.
Quantitative and
Survey Methods

Figure 15.1 Relationship between Population, Parameter, Sample and Statistics
(Parameters: measures describing population characteristics. Statistics:
measures describing sample characteristics. Statistics estimate the
parameters.)

A population may be defined as an aggregate of individuals possessing a
common trait or traits. There are two important factors: first, that a
population is the complete group about which knowledge is sought, and
second, that each and every individual has some specified attribute or
attributes.
Let us now complete Reflection and Action 15.1.

Reflection and Action 15.1
Work out the relationship between population, parameter, sample and statistics
to reflect the characteristics of the population of the unit of your research
project.

15.3 Classification of Sampling Methods


Sampling methods are classified into Probability or Non-probability. If the
purpose of research is to draw conclusions or make predictions affecting
the population as a whole (as is usual in most research), then one must use
probability sampling. But, if one is only interested in exploring how a
small group, perhaps even a representative group, is doing for purposes of
illustration or explanation, then one may use non-probability sampling.
Let us first discuss probability sampling.
(A) Probability Sampling
In probability samples, each member of the population has a known
non-zero probability of being selected. The key point behind all probabilistic
sampling approaches is random selection. The advantage of probability
sampling is that sampling error, which is the degree to which a sample
might differ from the population, can be calculated. Probability methods
include random sampling, systematic sampling, stratified sampling and
cluster sampling. We shall discuss each of them.
a) Random sampling is the purest form of probability sampling. Each
member of the population has an equal and known chance of being selected.
The prerequisite for a random sample is that each and every item of the
universe has to be identified. Random selection is effective in a clearly
defined population that is relatively small and self-contained. When the
population is large, it is often difficult or impossible to identify each
and every member, so the assemblage of available subjects becomes biased.
In practice, one obtains a list of all residents, the voters list or a
telephone directory, and then selects a sample using a sequence of numbers
from a random numbers table. Random numbers can also be generated with many
computer software packages. See Figure 15.2, which illustrates the selection
of a sample using a random number table.
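The selection described above can be sketched in a few lines of Python, using the standard library's random number generator in place of a printed random number table. The sampling frame of 100 numbered households and the function name are hypothetical, chosen only for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the sampling frame; every member
    has the same chance of being selected (sampling without replacement)."""
    rng = random.Random(seed)   # a fixed seed makes the draw repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: 100 households numbered 1 to 100.
frame = list(range(1, 101))
sample = simple_random_sample(frame, 10, seed=42)
print(sorted(sample))
```

The complete list of members (the frame) must exist before the draw, which is the prerequisite mentioned in the text.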

Figure 15.2 Population and Simple Random Sample Drawn (dark colour) Using Random Number Table
Source: Fisher, R. A. and F. Yates 1982. Statistical Tables. Longman: New York

b) Systematic sampling is also called an "Nth-name selection" technique.
After the required sample size has been calculated, every Nth record is
selected from a list of population members. As long as the list does not
contain any hidden order, this sampling method is as good as the random
sampling method. Its only advantage over the random sampling technique
is simplicity. Systematic sampling is frequently used to select a specified
number of records from a computer file. Figure 15.3 illustrates the
systematic random sampling method. The first number (2) has been selected
by random number, followed by the selection of every 5th item in the series.
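A minimal sketch of the "every Nth record" rule follows. The list of 50 numbered records and the function name are assumptions made for illustration; the interval is taken as the list length divided by the required sample size, and the starting point is chosen at random within the first interval.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start within the first interval, then take
    every k-th record, where k = len(frame) // n."""
    rng = random.Random(seed)
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random position 0 .. k-1
    return [frame[start + j * k] for j in range(n)]

# Hypothetical list of 50 records; a sample of 10 gives an interval of 5.
records = list(range(1, 51))
chosen = systematic_sample(records, 10, seed=3)
print(chosen)
```

If the list has a hidden periodic order that matches the interval, every selected record may share the same characteristic, which is the caution raised in the text.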
Figure 15.3 Systematic Random Sampling Method

c) Stratified sampling is a commonly used probability method that is
often superior to random sampling because it can reduce the sampling error.
A stratum is a subset of the population that shares at least one common
characteristic. Examples of strata might be males and females, or
managers and non-managers. The researcher first identifies the relevant
strata and their actual representation in the population. Random sampling
is then used to select a 'sufficient' number of subjects from each stratum.
'Sufficient' refers to a sample size large enough for the researcher to be
reasonably confident that the stratum represents the population.
Stratified sampling is most successful when (i) the within variance of
each stratum is less than the overall variance of the population; (ii)
the strata in the population are of unequal size or have unequal
incidence; and (iii) sampling is cheaper in the strata. Figure 15.4
shows the stratified random sampling method. Samples from the three strata
have been extracted in proportion to their numbers.
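Proportional allocation across strata can be sketched as below. The three strata, their sizes (60, 30 and 10 members) and the function name are hypothetical; each stratum's share of the sample is its share of the population, and members are then drawn at random within each stratum.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Sample each stratum at random, in proportion to its size.
    Rounding may make the overall size differ slightly from total_n."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population of 100 split into three strata.
strata = {
    "A": [f"A{i}" for i in range(60)],
    "B": [f"B{i}" for i in range(30)],
    "C": [f"C{i}" for i in range(10)],
}
sample = stratified_sample(strata, 20, seed=1)
print({name: len(members) for name, members in sample.items()})
# {'A': 12, 'B': 6, 'C': 2}
```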

Figure 15.4 Stratified Random Sampling Method
d) Cluster random sampling is useful when the population is dispersed
across a wide geographic region. This method allows one to divide the
population into clusters and then select the clusters at random. Thereafter
one can either study all the members of the selected clusters or again
take random (simple or systematic) samples of these sampled clusters. If
the latter system is followed, it is called multi-stage sampling. This
method, for example, could be effective to study a tribal group or a
community that is dispersed. The villages could be used as clusters and
can be randomly selected. Figure 15.5 shows that five blocks (2, 7, 10
and 14) out of sixteen have been selected by random number. Each
block contains a series of samples, as illustrated.
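The two variants described above (studying whole clusters, or sampling again within the selected clusters) can be sketched as follows. The sixteen blocks of eight units each, the identifiers and the function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Select clusters at random. With per_cluster=None every member of a
    chosen cluster is kept; otherwise a second random draw is made inside
    each chosen cluster (multi-stage sampling)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for cluster_id in chosen:
        members = clusters[cluster_id]
        if per_cluster is None:
            result[cluster_id] = list(members)
        else:
            size = min(per_cluster, len(members))
            result[cluster_id] = rng.sample(members, size)
    return result

# Hypothetical population: 16 blocks (clusters) of 8 units each.
clusters = {b: [f"block{b}-unit{u}" for u in range(1, 9)] for b in range(1, 17)}
one_stage = cluster_sample(clusters, 5, seed=7)
two_stage = cluster_sample(clusters, 5, per_cluster=3, seed=7)
print(sorted(one_stage))
print({b: len(units) for b, units in two_stage.items()})
```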

Figure 15.5 Cluster Random Sampling Method

Reflection and Action 15.2
Following the figures in the text, make figures based on the population pertaining
to your research project that you selected while completing R & A 13.1 & 13.2 to
show
i) population simple random sample drawn in dark colour, using random number
table
ii) systematic random sampling method
iii) stratified random sampling method
iv) cluster random sampling method

(B) Non-probability Sampling


In non-probability sampling, members are selected from the population in
some non-random manner. In this method, the degree to which the
sample differs from the population remains unknown. Non-probability
methods include Convenience sampling, Judgment sampling, Purposive
sampling, Quota sampling and Snowball sampling. Let us now discuss each
of the non-probability sampling methods.
a) Convenience sampling is used in exploratory research where the
investigator is interested in getting an inexpensive approximation of the
fact. As the name implies, the sample is selected because it is convenient.
Also called haphazard or accidental sampling, this method is based on using
people who are a captive audience, just happen to be walking by, or show a
special interest in the research. The use of volunteers is an example of
convenience sampling. This method is often used during preliminary
research efforts to get a gross estimate of the results, without incurring
the cost or time required to select a random sample.
b) Judgment sampling is a common non-probability method. The
researcher selects the sample based on judgment. This is usually an
extension of convenience sampling. For example, a researcher may decide
to draw the entire sample from one 'representative' village, even though
the population may be distributed over a number of villages. When using
this method, the researcher 'feels' that the chosen sample is
representative of the entire population.
c) Purposive sampling, much like judgment sampling, is where
the researcher targets a group of people believed to be typical or average,
or a group specially picked for some unique purpose. The researcher
never knows if the sample is representative of the population, and this
method is largely limited to exploratory research.
d) Quota sampling is the non-probability equivalent of stratified sampling.
As in stratified sampling, the researcher first identifies the strata and
their proportions in the population. Then convenience or judgment sampling
is used to select the required number of subjects from each stratum.
The researcher resorts to haphazard or accidental sampling, and makes
no effort to contact people who are difficult to reach. This differs from
stratified sampling, where the strata are filled by random sampling.
e) Snowball sampling is a special non-probability method used when the
desired sample characteristic is rare. It may be extremely difficult or cost
prohibitive to locate respondents in these situations. Snowball sampling
relies on referrals from initial subjects to generate additional subjects.
In other words, snowball sampling comprises identification of respondents
who in turn refer researchers to other respondents. This technique provides
a means to access relatively invisible and vulnerable social groups. While
this technique can dramatically lower the search costs, it comes at the
expense of introducing bias because the technique itself reduces the
likelihood that the sample will represent a good cross-section of the
population. For example, an investigator finds a rare genetic trait in a
person, and starts tracing his pedigree to understand the origin,
inheritance and etiology of the disease.
You may have heard that only quantitative research requires sampling.
The fact is that qualitative research also uses sampling procedures (see
Box 15.1).
Box 15.1 Use of Sampling in Qualitative Research
As Berger (1989) and Sarantakos have pointed out, it is fairly common for qualitative
researches to use sampling procedures in the following manner.
i) Sampling is relatively small, dealing with typical cases.
ii) Use of samples that are flexible in size, not requiring statistical calculations
iii) Use of purposive sampling dealing with non-probability
iv) Use of sampling to achieve suitability rather than representativeness
v) Sampling occurs while the research is in progress, rather than selecting a
sample before starting it.

We will now focus on the procedure of calculating the sample size.

15.4 Sample Size


A prudent choice of the sample size for a particular survey involves many
considerations, among which are the resources in manpower, the cost per
sample unit, the funds available, and the number and type of parameters to
be estimated. Obviously, these specifics will vary from one survey to
another. All the same, a framework can be constructed within which
general and viable decisions with respect to sample size can be taken.
Sampling theory aids in arriving at good estimates of the sample size.
The standard error here too provides the key.
Apart from the size of the universe the sample size may depend on the
following conditions.
i) The confidence limit set up for estimation;
ii) The heterogeneity of the population; and
iii) Frequency/proportion of the trait/attribute to be examined.
The estimation of sample size also differs according to the purpose or
the parameter under investigation, for example, whether the sample size is
being estimated for calculating a mean, or a proportion, or for comparing
means. For illustration, let us consider, in Box 15.2 and Box 15.3, two
cases, namely, the estimation of the mean of a normally distributed
variable and the estimation of a proportion. In these cases there are
two assumptions: first, that sampling is simple, random and without
replacement, and second, that the population sampled is infinitely large.

Box 15.2 Case One
It is known that the standard error of mean can be calculated from the following
formula.
$SE_{\bar{x}} = \sigma / \sqrt{n}$ ..........(1)
Where $SE_{\bar{x}}$ is the standard error of mean, $\sigma$ is the standard deviation, and n is the
sample size. Thus one can calculate sample size (n) using the following equation
derived from equation 1.
$n = (\sigma / SE_{\bar{x}})^2$ ..........(2)

Sample size can be calculated using the following steps.
Step 1: One requires the standard deviation of the universe, which is
unknown. A rough estimate of this measure, however, is sufficient for
suggesting sample size.
a) In many instances, the experience with similar problems will be a
good guide for making this estimate of the standard deviation.
b) In other instances, an exploratory sample study on a small scale
may be conducted in order to arrive at an estimate of $\sigma$.
c) To estimate the standard deviation of the universe, the range of
the values in the universe may be estimated and used as a guide.
It is known that in a normal distribution the range is about six
times the standard deviation. For practical purposes, an estimate
of somewhere around one-fifth of the estimated range is often
used.
Suppose the range is roughly 300; that is, the difference between
the lowest value in the universe and its highest value is 300.
One-fifth of this rough estimate is 60. Therefore, one may take 60 as
a rough approximation of $\sigma$.
Step 2: It must be decided how precise one wants the future sampling
estimate to be. Thus, one may state that the estimate of the true mean
is sufficiently precise if confidence limits of ±12 are attached to it. Such
an answer might be practicable for this particular problem.
Step 3: In this step the researcher has to decide the confidence limit.
He may wish to be almost certain or be satisfied with, say, a 95% degree
of confidence, that the specified limits will contain the true mean. The
degree of confidence decided upon makes it possible to translate the
interval decided upon in step 2 into standard error. If one is to be
practically certain that the true mean will lie within the interval of ±12
around the sample mean then the interval of ±12 becomes ±3 $SE_{\bar{x}}$.
Therefore, $SE_{\bar{x}}$ = 4. If, on the other hand, one is willing to settle for a
95% degree of confidence, then ±12 becomes ±2 $SE_{\bar{x}}$ and $SE_{\bar{x}}$ = 6.
Using equation 2, the sample size for the above example will be
1) Case 1
At the level of practical certainty:
Sample size (n) = $(60 / 4)^2 = 15^2 = 225$
2) Case 2
At the level of 95% confidence limit:
Sample size (n) = $(60 / 6)^2 = 10^2 = 100$
(In case 1 $SE_{\bar{x}}$ = 4, whereas in case 2 $SE_{\bar{x}}$ = 6)
Thus, in the above example, the sample size should be somewhere
around 225 if one wishes to be practically certain that the true mean will
lie within an interval of ±12; but the sample need contain only 100
items if one settles for a 95% degree of confidence that the true mean will
lie within an interval of ±12.
Sometimes the acceptable difference between the sample mean and the
true mean is expressed in percentage (say 3%) rather than absolute
terms (as for example, ±12 in step 2 of the above example). Suppose the
expected mean is around 500; then the acceptable interval would
be ±15. But this necessitates an approximate knowledge of the
expected mean.
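The three steps of Case One can be checked with a short script. The function and parameter names are hypothetical; `z` stands for the number of standard errors that the chosen limits represent (3 for practical certainty, 2 for a 95% degree of confidence, as in the text).

```python
def sample_size_for_mean(sigma, half_width, z):
    """Equation 2: n = (sigma / SE)^2, with SE = half_width / z."""
    standard_error = half_width / z
    return (sigma / standard_error) ** 2

# Step 1: rough sigma taken as one-fifth of an estimated range of 300.
sigma = 300 / 5

# Steps 2 and 3: limits of +/-12 read as 3 SE (practical certainty)
# or as 2 SE (95% degree of confidence).
n_certain = sample_size_for_mean(sigma, 12, 3)
n_95 = sample_size_for_mean(sigma, 12, 2)
print(n_certain, n_95)  # 225.0 100.0

# Limits stated as a percentage: 3% of an expected mean of 500.
half_width = 3 * 500 / 100
print(half_width)  # 15.0
```

Because n grows with the square of sigma / SE, halving the acceptable interval roughly quadruples the required sample.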

Box 15.3 Case Two
Sample size when sampling for proportion
Consider the estimation of the proportion of individuals in a population with
some particular attribute, for example those who own tractors for agriculture.
This proportion, though not precisely known to the investigator, is generally
known to him to an order of magnitude at least; that is to say, he will often know
that owning tractors is quite rare (say, less than 3 in 1,000 persons), somewhat
infrequent (3 in 100 to 3 in 1,000 persons), fairly common (3 in 10 to 3 in 100), or
very common (more than 3 in 10). If owning tractors is known to be more
infrequent than 3 in 100, a simple random sampling would invariably be much too
inefficient and the other sampling methods appropriate to the estimation of rare
events should be used. To assume random sampling amounts to assuming that the
investigator's interest centers on only those attributes whose frequencies are
at least 3 in 100. Even within these limits it is clear that if the population proportion
is to be known exactly, the entire population must be examined. This is
impracticable and generally unnecessary, for the investigator usually does not
require this degree of exactness. His requirements are related, of course, to
the use to which the estimate (or estimates) is to be put, and thus may vary from
one investigator to another and with the proportion itself.

It is known that the standard error of a proportion can be calculated
from the following formula.
$SE_p = \sqrt{PQ / n}$ ........(3)
Where, $SE_p$ is the standard error of proportion, P is the proportion of an
attribute in a population, Q = 1 - P, and n is the sample size. Thus
one can calculate sample size (n) using the following equation derived
from equation 3.
$n = PQ / SE_p^2$ ........(4)

Sample size can be calculated using the following steps:
Step 1: One requires an estimate of P, from which Q follows (Q = 1 - P),
which is, of course, unknown. A rough estimate of this measure, however,
is sufficient for suggesting sample size.
1) In many instances, experience with similar problems will be a
good guide for making this estimate of the proportion.
2) In other instances, an exploratory sample study on a small scale
may be conducted in order to arrive at an estimate of proportion.
If, however, neither of these two approaches is possible, then one can
conservatively assume that P = 50%, which leads to a larger sample size
than any other value of P. This is because in a 50% - 50% break-up, the
numerator (PQ) in the formula in equation 4 ($n = PQ / SE_p^2$) is the
largest. However, for the following example, let us consider that P = 30%
or 0.3.
Step 2: It must be decided how precise one wants the sampling
estimate to be. The researcher may consider an interval of, say, ±6%
around a sample proportion as satisfactory in this situation.
Step 3: In this step the researcher has to decide the confidence limit.
He may wish to be almost certain or be satisfied with, say, a 95% degree
of confidence, that the specified limits will contain the true proportion. In
the former case, ±6% will be equal to ±3 $SE_p$ and consequently $SE_p$ = 2%,
whereas in the latter case, ±6% will be equal to ±2 $SE_p$ and $SE_p$ = 3%.
Using equation 4, the sample size for the above example will be
1) Case 1
At the level of practical certainty:
Sample size (n) = $(0.3 \times 0.7) / (0.02)^2 = 0.21 / 0.0004 = 525$
2) Case 2
At the level of 95% confidence limit:
Sample size (n) = $(0.3 \times 0.7) / (0.03)^2 = 0.21 / 0.0009 \approx 233$
(P = 30% or 0.3; in case 1 $SE_p$ = 0.02, whereas in case 2 $SE_p$ = 0.03)
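The same calculation for a proportion can be sketched as below, under the same assumptions as Case Two. The function name and the list of candidate proportions are hypothetical; `z` is again the number of standard errors that the ±6% limits represent.

```python
def sample_size_for_proportion(p, half_width, z):
    """n = PQ / SE^2, with Q = 1 - P and SE = half_width / z."""
    standard_error = half_width / z
    return p * (1 - p) / standard_error ** 2

n_certain = sample_size_for_proportion(0.3, 0.06, 3)
n_95 = sample_size_for_proportion(0.3, 0.06, 2)
print(round(n_certain), round(n_95))  # 525 233

# P = 0.5 gives the largest PQ, hence the most conservative sample size.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
largest = max(candidates, key=lambda p: sample_size_for_proportion(p, 0.06, 2))
print(largest)  # 0.5
```

The last lines illustrate why P = 50% is the safe assumption when no prior estimate of the proportion is available.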
The use of a formula to obtain an estimate of sample size does not give
us more than a rough approximation. In practice it is advisable to take
the sample-size estimate as a bare minimum, to be increased for safety.
Let us now complete the Reflection and Action 15.3.

Reflection and Action 15.3
Suppose in your research project you wish to estimate sample size for calculating
mean and the assumption is that sampling is simple and the population sampled is
infinitely large. Further, you are in the stage of taking the three steps as elaborated
in Case One given in the text. The exercise for you is to work out in detail each
step and write it down in the fashion given just after Box 15.2.

15.5 Conclusion
Unit 15 discussed the important subject of sampling and provided you
with relevant information on different methods of sampling. Further, it
brought to you the skills of calculating the sample size.
You may like to keep in mind what Mitchell (1984: 239) said about
sampling theory in statistics, that it "devotes itself to providing numerical
estimates of the likelihood that the population values be within some
defined range of that established from the sample - provided that the
sample has been chosen in such a way as to meet the mathematical
conditions to justify the computation of the probabilities concerned."
Further he clarified about another type of inference that is derived
while using quantitative data to support theoretical interpretation and
said, "The sophistication and elaboration for choosing a 'representative'
sample in this restricted sense has overshadowed the other kind of
inference involved when analytical statements are made from associations
uncovered in a statistical sample. This is the inference that the theoretical
relationship among conceptually defined elements in the sample will also
apply in the parent population. The basis of an inference of this sort is
the cogency of the theoretical argument linking the elements in an
intelligible way rather than the statistical representativeness of the
sample."

Further Reading
Burgess, R.G. (ed.) 1982. Field Research: A Sourcebook and Field Manual.
(Contemporary Social Research 4). George Allen and Unwin: London
(Read page 76 onward for discussions of random and non-random
sampling).
Denzin, N. K. (ed.) 1970. Sociological Methods: A Sourcebook.
Butterworths: London (Read page 81 onward for useful information on
sampling techniques).
Unit 16
Measures of Central Tendency
Contents
16.1 Introduction
16.2 Mean
16.3 Median
16.4 Mode
16.5 Relationship between Mean, Mode and Median
16.6 Choosing a Measure of Central Tendency
16.7 Conclusion

Learning Objectives
It is expected that after reading Unit 16 you would be able to
• Understand the procedure of arriving at measures of central
tendency of the data collected
• Work out the ways of finding out mean, mode and median
measures of central tendency
• Decide which of the three measures is more appropriate in the
case of your data.

16.1 Introduction
After dealing with the skills of sampling techniques for studying large
complex social groups, we will now discuss the matter of measuring
central tendency and its application.
Unit 16 deals with the basic measures of central tendency and their
application for those of you who may lack a strong background in
mathematics. In doing so, complex mathematical derivations of formulae
have been omitted. Only a minimal number of essential 'shorthand'
mathematical symbols are used, and familiar examples drawn from social
science data are presented in a non-mathematical form.

16.2 Mean
Mean is the most common and widely used measure of central tendency.
Each observation in a population may be referred to as $X_i$ (read "X sub
i") value. Thus, one observation might be denoted as $X_1$, another as $X_2$,
a third as $X_3$ and so on. The subscript i might be any integer value up
through N, the total number of values in the population. The mean of
the population is denoted by the Greek letter $\mu$ (lower case mu).
Calculating the mean from ungrouped data
Mean (M) is the most familiar and useful measure used to describe the
central tendency, or average, of a distribution of scores for any group of
individuals, objects or events. It is computed by dividing the sum of the
scores by the total number of scores.
$M = \sum X_i / N$ ........(1)
Where, M is the mean (sample), $X_i$ are the scores, N is the total number
of scores and $\sum$ is 'the sum of'. See Box 16.1 and Box 16.2 for examples
1 and 2.

Box 16.1
Example 1: The Number of Cattle Owned by Members of a Community is
Recorded Below.
12, 11, 13, 20, 16, 18, 19, 17, 22 and 23
$\sum X_i$ = 12 + 11 + 13 + 20 + 16 + 18 + 19 + 17 + 22 + 23 = 171
N = 10
$M = \sum X_i / N$; M = 171 / 10 = 17.1
The mean is the balance point in a distribution such that if you subtract each
value in the distribution from the mean and add all these deviation scores, the
result will be zero.
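The calculation in Example 1, and the balance-point property stated above, can be verified with a short sketch that uses the ten values listed in the example. The variable names are chosen only for illustration.

```python
# Number of cattle owned, as listed in Example 1.
cattle = [12, 11, 13, 20, 16, 18, 19, 17, 22, 23]

total = sum(cattle)      # the sum of the scores
n = len(cattle)          # the number of scores
mean = total / n
print(total, n, mean)

# Balance point: deviations from the mean add up to zero
# (up to floating-point rounding).
deviations = [mean - x for x in cattle]
print(abs(sum(deviations)) < 1e-9)  # True
```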

Calculating mean from grouped data
Calculation of mean from grouped data is slightly different from calculation
from ungrouped data.
$M = \sum F_i X_i / \sum F_i$ ........(2)
where, M is the mean, $X_i$ are the midpoints of class intervals, $F_i$ are the
number of cases in the various intervals, and $\sum F_i$ is the total number of
scores or sum of frequencies of the various intervals.

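Box 16.2
Example 2: Following is the frequency (8, 9, 12, 9, 7, and 5) of households in a
community owning numbers of chickens, arranged in six groups (1-3, 4-6, 7-9,
10-12, 13-15 and 16-18).

Number of Chickens   Midpoint ($X_i$)   Frequency ($F_i$)   $F_i X_i$
1-3                  2                  8                   16
4-6                  5                  9                   45
7-9                  8                  12                  96
10-12                11                 9                   99
13-15                14                 7                   98
16-18                17                 5                   85

$\sum F_i X_i$ = 439; $\sum F_i$ = 50
$M = \sum F_i X_i / \sum F_i$ = 439 / 50 = 8.78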
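A minimal sketch of the grouped-data calculation follows, using the frequencies listed in Example 2. It assumes the fifth group is 13-15, so that all six groups have the same width and do not overlap; the variable names are hypothetical.

```python
# Class intervals (number of chickens) and household frequencies.
# Assumption: the fifth interval is taken as 13-15.
intervals = [(1, 3), (4, 6), (7, 9), (10, 12), (13, 15), (16, 18)]
frequencies = [8, 9, 12, 9, 7, 5]

# Midpoint of each class interval stands in for every score in it.
midpoints = [(low + high) / 2 for low, high in intervals]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
mean = sum_fx / sum_f
print(midpoints)
print(sum_fx, sum_f, mean)
```

Replacing each score by its class midpoint is what makes the grouped mean an approximation of the mean of the raw data.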
A short method of calculating mean from grouped data
There is a shorter way of calculating mean from grouped data, which
saves time and labour in computation, particularly when one has to deal
with a large number of cases. It involves the assumption of a mean, that
is, making a guess at the interval in which the mean probably
falls (generally among the central groups of intervals). A different guess
of the interval alters the calculations, but not the mean.
$M = AM + (\sum F_i D_i / \sum F_i) \times i$ ........(3)
Also,
$D_i = (X_i - AM) / i$ ........(4)
Where, M is the mean, AM = Assumed mean, $X_i$ are the midpoints of class
intervals, $F_i$ are the number of cases in the various intervals, $\sum$ is the
symbol of sum total, $D_i$ are the deviations of the midpoints of the various
classes from the midpoint of the class having the assumed mean, divided by
the size of the class interval (equation 4), and i is the size of the class
intervals. See Box 16.3 for example 3.

Box 16.3 Example 3: Marital Distance (the distance between the villages
of the spouses)
The marital distance was investigated in a community. Following was the frequency
(88, 93, 72, 97, 79, and 54) when the data were arranged in six groups according
to marital distance (25 - 30, 30 - 35, 35 - 40, 40 - 45, 45 - 50, 50 - 55). Let us find the
mean marital distance, taking the midpoint of the 40 - 45 group (42.5) as the
assumed mean.

Marital Distance   Midpoint ($X_i$)   Frequency ($F_i$)   $D_i$   $F_i D_i$
25 - 30            27.5               88                  -3      -264
30 - 35            32.5               93                  -2      -186
35 - 40            37.5               72                  -1      -72
40 - 45            42.5               97                  0       0
45 - 50            47.5               79                  1       79
50 - 55            52.5               54                  2       108

$\sum F_i$ = 483; $\sum F_i D_i$ = -335
$M = AM + (\sum F_i D_i / \sum F_i) \times i$
= 42.5 + (-335 / 483) × 5 = 42.5 - 3.468 = 39.032
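The short method can be cross-checked against the direct grouped-data formula. The sketch below uses the marital-distance frequencies from Example 3 and takes the deviations as $D_i = (X_i - AM) / i$, so that classes below the assumed mean count as negative; the function names are hypothetical.

```python
intervals = [(25, 30), (30, 35), (35, 40), (40, 45), (45, 50), (50, 55)]
frequencies = [88, 93, 72, 97, 79, 54]
midpoints = [(low + high) / 2 for low, high in intervals]
width = 5  # size of the class interval (i)

def direct_mean(xs, fs):
    """Direct method: sum(F * X) / sum(F)."""
    return sum(f * x for f, x in zip(fs, xs)) / sum(fs)

def short_method_mean(xs, fs, assumed_mean, i):
    """Short method: AM + (sum(F * D) / sum(F)) * i, with D = (X - AM) / i."""
    deviations = [(x - assumed_mean) / i for x in xs]
    sum_fd = sum(f * d for f, d in zip(fs, deviations))
    return assumed_mean + (sum_fd / sum(fs)) * i

m_direct = direct_mean(midpoints, frequencies)
m_short = short_method_mean(midpoints, frequencies, 42.5, width)
m_other_guess = short_method_mean(midpoints, frequencies, 37.5, width)
print(round(m_direct, 3), round(m_short, 3), round(m_other_guess, 3))
```

All three figures agree, which illustrates the remark that a different guess of the assumed mean alters the intermediate calculations but not the mean.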
After the three examples for calculating mean for ungrouped and grouped
data, we will now discuss the technique of finding the Median.
16.3 Median
Median is the score that divides the distribution into halves; half of the
scores are above the median and the other half are below it when the
data are arranged in a numerical order. Median is also referred to as
the score at the 50th percentile in the distribution.
Calculating median from ungrouped data
• Arrange the series in numerical order (ascending or descending).
• Find the median location of N numbers by the formula (N + 1) / 2.
When N is an odd number, for example 7, then the value of the
4th item ((7 + 1) / 2 = 4) is the median. For example, in an ordered
distribution of seven values whose 4th item is 9, the median is 9.
• Whereas, when N is an even number, say 12, then the median is
half-way between the 6th and 7th items ((12 + 1) / 2 = 6.5).
See Box 16.4 for examples 4 and 5.

Box 16.4 Example 4: Finding the Median
When N is an odd number: Find the median in the distribution of numbers: 1, 13, 8,
3, 4, 11, and 7.
The median location is (N + 1) / 2 or (7 + 1) / 2 = 4.
The ordered distribution is: 1, 3, 4, 7, 8, 11 and 13.
The value of the 4th item in the distribution is 7 and thus the median is 7.
Example 5: When N is an even number:
Find the median in the distribution of numbers: 1, 8, 3, 13, 11, and 7.
The median location is (6 + 1) / 2 = 3.5.
The ordered distribution is 1, 3, 7, 8, 11 and 13.
The halfway value between the 3rd and 4th item is 7.5 ((7 + 8) / 2), and thus the
median is 7.5.
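The median-location rule used in Examples 4 and 5 can be sketched as a small function and compared with Python's built-in `statistics.median`. The function name `median_by_location` is hypothetical.

```python
import statistics

def median_by_location(values):
    """Order the values, then use the location (N + 1) / 2:
    the middle item for odd N, the mean of the two middle items for even N."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        location = (n + 1) // 2            # 1-based position of the median
        return ordered[location - 1]
    lower = ordered[n // 2 - 1]            # item just below location N/2 + 0.5
    upper = ordered[n // 2]                # item just above it
    return (lower + upper) / 2

example_4 = [1, 13, 8, 3, 4, 11, 7]
example_5 = [1, 8, 3, 13, 11, 7]
print(median_by_location(example_4))  # 7
print(median_by_location(example_5))  # 7.5
print(statistics.median(example_4), statistics.median(example_5))  # 7 7.5
```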

Calculating median from grouped data
Finding the median score in the frequency distribution below involves
five steps.
Step 1: Divide the total number (N or $\sum F_i$) by two.
Step 2: Start at the low end of the frequency distribution and sum the scores
in each interval until the interval containing the median is reached (C.F.).
Step 3: Subtract the sum obtained in step two above from the number
necessary (calculated at step 1) to reach the median (N/2 - C.F.).
Step 4: Now calculate the proportion of the median interval that must be
added to its lower limit in order to reach the median score. This is done by
dividing the number obtained in step 3 above by the number of scores
(f) in the median interval and then multiplying by the size of the class
interval (i), i.e. [(N/2 - C.F.) / f] × i.
Step 5: Finally, add the number obtained in step 4 above to the exact
lower limit of the median interval.
$Median = L + [(N/2 - C.F.) / f] \times i$
Where, L = the exact lower limit of the median interval; N = the total
number of scores; C.F. = the sum of the scores in the intervals below the
median interval; f = the number of scores in the median interval; i = the
size of the class interval.
[Figure: Graphical representation of calculating the median from grouped data, showing class intervals from 10 to 100, their cumulative frequencies and the position of the median]

See Box 16.5 for example 6 on finding the median for grouped data.

Box 16.5 Example 6: Find the Median of the following distribution

Class Interval   18-21 21-24 24-27 27-30 30-33 33-36 36-39 39-42 42-45 45-48 48-51
Frequency

Class Interval   Frequency   Cumulative Frequency
18-21            1           1

Median = L + [(N/2 - C.F.) / f] * i
N/2 (or Σf_i / 2) = 50 / 2 = 25
Lower limit of the median class (L) = 33
Cumulative frequency of the class preceding the median class (C.F.) = 19
Frequency of the median class (f) = 8
Size of the class interval (i) = 3
Median = 33 + [(25 - 19) / 8] * 3 = 33 + 2.25 = 35.25
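
The five steps can also be sketched as a short Python function. The function name and the frequency list below are assumptions of this sketch: the frequencies are hypothetical values, chosen only so that they agree with the figures quoted in Box 16.5 (N = 50, C.F. = 19 below the class 33-36, and f = 8 in it).

```python
def median_grouped(lower_limits, frequencies, width):
    """Median = L + [(N/2 - C.F.) / f] * i for equal-width class intervals."""
    n = sum(frequencies)
    half = n / 2                                  # step 1
    cumulative = 0                                # running C.F. (step 2)
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= half:      # median interval reached
            # steps 3 to 5: interpolate inside the median interval
            return lower + ((half - cumulative) / f) * width
        cumulative += f
    raise ValueError("the frequency distribution is empty")


lower_limits = list(range(18, 51, 3))             # 18, 21, ..., 48
# Hypothetical frequencies (NOT taken from the text); they sum to 50.
hypothetical_frequencies = [1, 2, 4, 5, 7, 8, 9, 6, 4, 3, 1]
grouped_median = median_grouped(lower_limits, hypothetical_frequencies, 3)
print(grouped_median)
```

With these assumed frequencies the median interval is 33-36 and the function reproduces 33 + [(25 - 19) / 8] * 3 = 35.25.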
Let us now complete Reflection and Action 16.1 to check whether the
calculation methods have become clearer and easier to perform.
After Reflection and Action 16.1, you will learn about calculating the
mode from ungrouped and grouped data.

Reflection and Action 16.1
Following the examples given in the text for calculating the mean and median for
ungrouped and grouped data and the short method of calculating the mean of grouped
data, provide your own examples of each of the five calculations in a manner
similar to the examples in the text. This exercise will give you an opportunity
to practise such calculations. These calculation exercises will come in handy
when you carry out your own mini research project.

16.4 Mode
The mode of a distribution is simply defined as the most frequent or common
score in the distribution. The mode is the point (or value) of X that
corresponds to the highest point on the distribution. If the highest
frequency is shared by more than one value, the distribution is said to be
multimodal. It is not uncommon to see distributions that are bimodal,
reflecting peaks in scoring at two different points in the distribution.

Calculating mode from ungrouped data


The most frequent value in the series is the mode. It can be determined
by viewing the series (if the series is small) or looking at the frequency
distribution (if the series is large). See Box 16.6 for example 7.

Box 16.6 Example 7: Find the Mode of the following Distribution.


Serial number of family   1  2  3  4  5  6  7  8  9  10
Number of children        1  2  3  4  3  3  2  1  2  3

In the above example 3 occurs the maximum number of times (4 times),
and hence 3 is the mode of the distribution.
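
For a large series, counting by eye becomes error-prone, and a frequency count can be sketched with Python's standard `collections.Counter`. The helper name `modes` is an assumption of this sketch; it returns a list so that bimodal and multimodal series are reported in full.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)                 # frequency distribution
    if not counts:
        raise ValueError("the mode of an empty series is undefined")
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Number of children in the ten families of Box 16.6
children = [1, 2, 3, 4, 3, 3, 2, 1, 2, 3]
family_modes = modes(children)
print(family_modes)
```

For the data of Box 16.6 the list contains the single value 3, which occurs four times.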

Calculating mode from grouped data


Mode of the grouped data can be calculated using the following steps:
Step 1: Identify the modal class (class with maximum frequency) by
inspection or analysis.
Step 2: Apply the following formula
Mode = L + [(f_m - f_1) / ((f_m - f_1) + (f_m - f_2))] * i
or
Mode = L + [(f_m - f_1) / (2f_m - f_1 - f_2)] * i
Where, L = the exact lower limit of the modal interval, f_m = frequency of
the modal class, f_1 = frequency of the class preceding the modal class, f_2 =
frequency of the class succeeding the modal class, i = the size of the class
interval.
You can find the graphical representation of mode in grouped data in
Figure 16.1.

Figure 16.1 Graphical Representation of Mode in Grouped Data

The sample mode is the best estimate of the population mode. When one
samples a symmetrical unimodal population, the mode is an unbiased and
consistent estimate of the mean and median, but it is relatively inefficient
and should not be so used. As a measure of central tendency, the mode is
affected by skewness less than is the mean or the median, but it is affected by
sampling more than these other two measures. The mode, but neither the
median nor the mean, may be used for data on the nominal, as well as the
ordinal, interval, and ratio scales of measurement. The mode is not used
often in social or biological research, although it is often interesting to
report the number of modes detected in a population, if there is more
than one. See Box 16.7 for example 8.

Box 16.7 Example 8: Find the Modal Income on the Basis of the Following
Data.

Income (in Thousands)   5 - 10   10 - 15   15 - 20   20 - 25   25 - 30   30 - 35
No. of Households       8        16        29        22        14        12

The mode lies in the class 15 - 20, which has the maximum frequency (29).
Lower limit of the modal class (L) = 15
Frequency of the modal class (f_m) = 29
Frequency of the class preceding the modal class (f_1) = 16
Frequency of the class succeeding the modal class (f_2) = 22
Size of the class interval (i) = 5
Mode = L + [(f_m - f_1) / (2f_m - f_1 - f_2)] * i
Mode = 15 + [(29 - 16) / (2 * 29 - 16 - 22)] * 5 = 15 + (13 / 20) * 5 = 15 +
3.25 = 18.25
The modal income is 18.25 thousand.
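
The grouped-mode formula can be checked with a short Python sketch. The function name is an assumption of this example, and so is the choice to treat a missing neighbouring class (when the modal class is the first or last one) as having frequency 0. The data are the household incomes of Box 16.7, with class intervals of width 5 starting at 5, 10, 15, 20, 25 and 30 thousand.

```python
def mode_grouped(lower_limits, frequencies, width):
    """Mode = L + [(f_m - f_1) / (2*f_m - f_1 - f_2)] * i."""
    m = max(range(len(frequencies)), key=frequencies.__getitem__)  # modal class
    f_m = frequencies[m]
    f_1 = frequencies[m - 1] if m > 0 else 0                   # preceding class
    f_2 = frequencies[m + 1] if m + 1 < len(frequencies) else 0  # succeeding class
    denominator = 2 * f_m - f_1 - f_2
    if denominator == 0:
        raise ValueError("no single peak: the formula is not applicable")
    return lower_limits[m] + (f_m - f_1) / denominator * width


income_lower_limits = [5, 10, 15, 20, 25, 30]     # in thousands
households = [8, 16, 29, 22, 14, 12]
modal_income = mode_grouped(income_lower_limits, households, 5)
print(modal_income)
```

Here f_m - f_1 = 29 - 16 = 13 and 2f_m - f_1 - f_2 = 58 - 38 = 20, so the function evaluates 15 + (13 / 20) * 5 = 18.25 thousand.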
After learning about mean, median and mode, we will discuss in Section
16.5 the relationship among the three measures of central tendency.
But before going on to Section 16.5, let us complete Reflection and
Action 16.2.

Reflection and Action 16.2
Make a graphical representation of mode in grouped data of your choice along
the lines of Figure 16.1. You may then use a similar type of graphic representation
of grouped data in your own mini research project.

16.5 Relationship between mean, mode and median
Mean, mode and median (the three measures of central tendency) are
related to each other. For a moderately skewed, unimodal distribution the
relationship is given approximately by the following empirical equation.
Mode = 3 * Median - 2 * Mean
The values of mean, mode and median are the same when the frequency
is normally distributed, but their values differ when the frequency is
positively or negatively skewed.

Fig. 16.2: Relationship of Mean, Mode and Median in various Types of Frequency Distributions

Fig. 16.2 shows the relationship of mean, mode and median in various
types of frequency distributions: (A) Normal distribution (B) Bimodal
distribution (C) Positively skewed distribution (D) Negatively skewed
distribution. Values of the variables are along the x axis and the frequencies
are along the y axis.
After learning about the relationship among the three measures of central
tendency, let us find out how to decide which of the three to choose for
one's research.
16.6 Choosing a measure of central tendency
Sometimes the researcher has to decide which of the three measures of
central tendency to use. The following advice may be of help.
Mean is doubtless the most commonly used measure of central tendency.
It is the only one of the three measures which uses all the information
available in a set of data, that is to say, it reflects the value of each score
in a distribution. It has the decided advantage of being capable of combining
with the means of other groups measured on the same variable. For
example, from the average unemployment levels in various states of
India one can compute the overall mean unemployment rate of India.
Since neither the median nor the mode is based on arithmetic, this
useful application is not possible with them. The precisely defined mathematical
value of the mean also allows other advanced statistical techniques to be
based on it.

There are occasions, however, when taking into account the value of
every score in a distribution can give a distorted picture of the data. For
example, marriage distance (the distance between the places of residence
of the two partners) in five cases is 40, 60, 60, 80 and 810. Without the
very atypical score of 810, the mean score of the group is 60 and the
median, likewise, is 60. The effect of introducing the score of 810 is to
pull the mean in the direction of that extreme value. The mean now
becomes 210, a value that is unrepresentative of the series. The median
remains 60, providing a more realistic description of the distribution
than the mean.
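
The pull of a single extreme score on the mean can be verified with Python's standard `statistics` module. This is a small sketch using the marriage-distance figures quoted above; the variable names are illustrative only.

```python
from statistics import mean, median

typical = [40, 60, 60, 80]            # the four ordinary marriage distances
with_outlier = typical + [810]        # the same series plus the atypical score

mean_typical = mean(typical)          # 240 / 4
median_typical = median(typical)      # half-way between 60 and 60
mean_outlier = mean(with_outlier)     # 1050 / 5
median_outlier = median(with_outlier) # middle item of the ordered series

print(mean_typical, median_typical)
print(mean_outlier, median_outlier)
```

The mean moves from 60 to 210 once the score of 810 is added, while the median stays at 60.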
With these observations in mind:

Use the mean


i) When the scores in a distribution are more or less symmetrically
grouped about a central point.
ii) When the research problem requires a measure of central tendency
that will also form the basis of other statistics (such as measures
of variability or measures of association).
iii) When the research problem requires the combination of the mean
with the means of other groups measured on the same variable.
iv) To measure the central tendency in a sample of observations
when one needs to estimate the value of the corresponding mean
of the population from which the sample is taken.
v) When the data are at the interval or ratio level, provided that the
distribution of scores approximates a normal curve.

Use the median


i) When the research problem calls for knowledge of the exact
midpoint of a distribution.
ii) When there are extreme scores in the series, as they distort the
mean but not the median. This applies particularly when dealing with
'oddly-shaped' distributions, for example, those in which a high proportion
of extremely high scores occur as well as a low proportion of
extremely low ones.

Use the mode


i) When all that is required is a quick and appropriate way of
determining central tendency.
ii) When, in referring to what is 'average', the word is used in the
sense of the 'typical' or the 'most usual'. For example, in talking
about the average take-home pay of the coffee plantation worker,
it is the modal wage that is being alluded to rather than an exact
arithmetic average.

Reflection and Action 16.3
Provide examples of data that require mean, median and mode types of calculation
for reflecting the central tendency of the data.

16.7 Conclusion
Succinctly, mode would be the appropriate statistic to use as a measure
of the 'most fashionable' or 'most popular' when data are collected
using a nominal scale. Median would generally be associated with the
ordinal level data. Mean will be used with interval level or ratio level data
providing that the distribution of scores approximates a normal curve.
You can take the mean to be a mathematical measure and the median and mode
to be positional measures. You can always cluster your observations
around a central value. A central value manifests both the distribution
and the comparison of various distributions. It is always useful for a
researcher to provide measures that indicate the average feature of a
frequency distribution. Unit 16 has discussed the three measures of
central tendency and provided skills of basic statistical tools for application
in your research.
It would have become apparent to you that the three measures of
central tendency, namely, i) the average of all the values in the distribution,
or mean, ii) the mid-point of the distribution, or median, and iii) the highest
density in the distribution, or mode, are not to be applied in a mechanical
way. In the light of the objective of your study you would need to
determine when you are to use which measure. You have learnt in Unit
16 that a graphic representation of distributions shows either a
symmetrical or a skewed pattern. In the symmetrical type, you will find that
the three values coincide. This provides you the option of using the
mean. In the case of bi-modal or multi-modal representation, you would
do better to use the mode. In a skewed distribution, if the tail is on the
right side, it indicates positive skewing of the distribution. If the tail is
on the left side, it shows negative skewing of the distribution. For both
the negative and the positive types of skewing, you would
do better by using the median as the measure of central tendency. You may
want to work out with the help of Unit 16 the type of measure of central
tendency you will use in your mini research project.

Further Reading
Black, Thomas R. 1999. Doing Quantitative Research in the Social
Science. An Integrated Approach to Research Design, Measurement and
Statistics
Nachmias, David and Chava Nachmias 1981. Research Methods in Social
Sciences. St. Martin Press: New York.
Unit 17
Measures of Dispersion and Variability
Contents
17.1 Introduction
17.2 The Range
17.3 The Variance
17.4 The Standard Deviation
17.5 Coefficient of Variation
17.6 Conclusion

Learning Objectives
It is expected that after reading Unit 17 you would be able to
• Obtain a measure of dispersion of data
• Explain the meaning of the term 'range' and work out how to
measure the range of one's data
• Discuss the element of variance in one's data and find out the
standard deviation in it
• Work out the coefficient of variation in the data.

17.1 Introduction
In addition to a measure of central tendency, it is generally desirable to
have a measure of dispersion of data. A measure of dispersion (or a
measure of variability, as it is sometimes called) is an indication of the
clustering of measurements around the center of the distribution, or,
conversely, it is an indication of how variable the measurements are.
Sanders (1955) held that you need to measure dispersion to evaluate the
extent to which the average value depicts the data. Another reason for
measuring dispersion is to find out the spread in order to improve or
control the existing variations.

17.2 The Range

The difference between the highest and the lowest measurements in a
group of data is termed the range. If sample measurements are arranged
in an increasing order of magnitude, as if the median were about to be
determined, then
Sample range = X_n - X_1 ..........1
Where, X_1 and X_n are the lowest and the highest values of the series
respectively.

See Box 17.1 for Example 1.


Box 17.1 Example 1
The number of cattle owned by members of a community is recorded as: 12, 11,
13, 20, 15, 18, 19, 17, 22 and 23. Calculate the range.
X_1 = 11; X_n = 23
Sample range = 23 - 11 = 12

The range is a relatively crude measure of dispersion, inasmuch as it
does not take into account any measurement except the highest and the
lowest. Furthermore since it is unlikely that a sample will contain both
the highest and the lowest values in the population, the sample range
usually underestimates the population range; therefore, it is a biased
and inefficient estimator. Nonetheless, it is useful in some circumstances
to present the sample range as an estimate (although a poor one) of the
population range. Whenever the range is specified in reporting data, it is
usually a good practice to report another measure of dispersion as well.
The Mean Deviation
It is clear that no information is provided by the range about the distribution
of the measurements in the middle. Since the mean is so useful a measure
of central tendency, one might express dispersion in terms of deviations
from the mean.
The sum of all deviations from the mean (Σ(X_i - M)) will always be zero,
therefore such a summation would be useless as a measure of dispersion.
On the other hand, the sum of the absolute values of the deviations from
the mean expresses dispersion about the mean. Dividing this sum by the
total number of scores yields a measure that is known as the mean deviation,
or mean absolute deviation, of the sample.
Sample mean deviation = (Σ |X_i - M|) / n ..........2
Where, M is the mean (sample), X_i are the scores, n is the total
number of scores, Σ is 'the sum of' and the vertical lines indicate that
the values are absolute (irrespective of sign). See Box 17.2 for example 2.

Box 17.2 Example 2


The number of cattle owned by members of a community is recorded as: 12, 11, 13,
20, 15, 18, 19, 17, 22 and 23. Calculate the mean deviation.
ΣX_i = 12 + 11 + 13 + 20 + 15 + 18 + 19 + 17 + 22 + 23 = 170
N = 10
M = ΣX_i / N; M = 170 / 10 = 17
Σ|X_i - M| = |12 - 17| + |11 - 17| + |13 - 17| + |20 - 17| + |15 - 17| + |18 - 17| +
|19 - 17| + |17 - 17| + |22 - 17| + |23 - 17| = 5 + 6 + 4 + 3 + 2 + 1 + 2 + 0 + 5 + 6 = 34
Sample mean deviation = 34 / 10 = 3.4
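
Both the range (Equation 1) and the mean deviation (Equation 2) can be sketched in Python for the cattle data of Examples 1 and 2. The function names are assumptions of this sketch.

```python
def sample_range(scores):
    """Highest value minus lowest value (Equation 1)."""
    return max(scores) - min(scores)


def mean_deviation(scores):
    """Sum of absolute deviations from the mean, divided by n (Equation 2)."""
    n = len(scores)
    m = sum(scores) / n
    return sum(abs(x - m) for x in scores) / n


cattle = [12, 11, 13, 20, 15, 18, 19, 17, 22, 23]
cattle_range = sample_range(cattle)
cattle_mean_deviation = mean_deviation(cattle)
print(cattle_range, cattle_mean_deviation)
```

For this series the range is 23 - 11 = 12 and the mean deviation is 34 / 10 = 3.4.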

Two samples may have the same range but different mean deviations.
The mean deviation can also be defined by using the sum of
the absolute deviations from the median rather than from the mean.

17.3 The Variance
Another method of eliminating the signs of deviations from the mean is
to square the deviations. The sum of the squares of the deviations from the
mean is called the sum of squares, abbreviated SS, and is defined as
follows:
Sample SS = Σ (X_i - M)^2 ..........3
Where, M is the mean (sample), X_i are the scores, and Σ is 'the sum of'.
From the sample SS, the population SS can be estimated.
Population SS = Σ (X_i - μ)^2 ..........4
Where μ is the mean (population), X_i are the scores, and Σ is 'the sum of'.
The mean sum of squares is called the variance (or mean square, the latter
being short for mean squared deviation), and for a population is denoted
by σ^2 ("sigma squared", using the lowercase Greek letter).
Calculating variance from ungrouped data
Population Variance = σ^2 = Σ (X_i - μ)^2 / N ..........5
The best estimate of the population variance, σ^2, is the sample variance,
s^2:
Sample Variance = s^2 = Σ (X_i - M)^2 / (n - 1) ..........6
Where M is the mean (sample), X_i are the scores, n is the total number
of scores (sample) and Σ is 'the sum of'.
Replacing μ by M and N by n in Equation 5 results in a
quantity which is a biased estimate of σ^2. Dividing the sample's sum of
squares by n - 1 (called the degrees of freedom, abbreviated DF) rather than
by n yields an unbiased estimate, and Equation 6 should be used
to calculate the sample variance. If all observations are equal, then there
is no variability and s^2 = 0; s^2 becomes increasingly large as the amount
of variability, or dispersion, increases. Since s^2 is a mean sum of squares,
it can never be a negative quantity.
The variance expresses the same type of information as does the mean
deviation, but it has certain important properties relative to probability
and hypothesis testing that makes it distinctly superior. Thus, the mean
deviation is very seldom encountered in social or bio-statistical analysis.
The variance has square units. If measurements are in grams, their variance
will be in grams squared, or if the measurements are in cubic centimeters,
their variance will be in terms of cubic centimeters squared, even though
such squared units have no physical interpretation.
The sample variance can also be calculated using the following formula
Sample variance = s^2 = (Σ X_i^2 - (Σ X_i)^2 / n) / (n - 1) ..........7
The above formula is often called the machine formula, because of its
computational advantages. There are, in fact, two major advantages in
calculating SS by Equation 7 rather than by Equation 6. First, fewer
computational steps are involved, a fact that decreases the chance of
error. On a good desk calculator, the summed quantities, Σ X_i and Σ X_i^2,
can both be obtained with only one pass through the data, whereas
Equation 6 requires one pass through the data to calculate M, and at
least one more pass to calculate and sum the squares of the deviations,
X_i - M. Second, there may be a good deal of rounding error in calculating
each X_i - M, a situation which leads to decreased accuracy in computation,
but which is avoided by the use of Equation 7. See Box 17.3 for example 3.

Box 17.3 Example 3

The number of cattle owned by members of a community is recorded as: 12, 11, 13,
20, 15, 18, 19, 17, 22 and 23. Calculate the sample variance.
ΣX_i = 12 + 11 + 13 + 20 + 15 + 18 + 19 + 17 + 22 + 23 = 170
n = 10
M = ΣX_i / n; M = 170 / 10 = 17

Sample variance = s^2 = Σ (X_i - M)^2 / (n - 1) = 156 / 9 = 17.33

Alternate formula (often called machine formula)
Sample variance = s^2 = (Σ X_i^2 - (Σ X_i)^2 / n) / (n - 1)

Sample variance = s^2 = (3046 - ((170)^2) / 10) / 9 = 156 / 9 = 17.33
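
The agreement between Equation 6 and Equation 7 can be checked with a short Python sketch. The two function names are assumptions of this example; `statistics.variance` from the standard library is used only as an independent check, since it also divides by n - 1.

```python
import statistics


def variance_definitional(scores):
    """Equation 6: sum of squared deviations from M, divided by n - 1."""
    n = len(scores)
    m = sum(scores) / n                       # first pass: the mean
    return sum((x - m) ** 2 for x in scores) / (n - 1)   # second pass


def variance_machine(scores):
    """Equation 7: one pass that accumulates sum(X) and sum(X^2)."""
    n = len(scores)
    sum_x = 0
    sum_x2 = 0
    for x in scores:
        sum_x += x
        sum_x2 += x * x
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


cattle = [12, 11, 13, 20, 15, 18, 19, 17, 22, 23]
v_definitional = variance_definitional(cattle)
v_machine = variance_machine(cattle)
v_library = statistics.variance(cattle)
print(round(v_definitional, 2), round(v_machine, 2), round(v_library, 2))
```

All three routes give 156 / 9, about 17.33, for the cattle data. One caution for computer work: in floating-point arithmetic the one-pass form can lose precision when the values are very large relative to their spread, so a library routine is usually the safer choice there.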


Calculating the variance from grouped data
The sample variance in the grouped data can be calculated using the
following formula.
Sample Variance = s^2 = Σ f_i (X_i - M)^2 / (n - 1) ..........8
Where, M is the mean (sample), f_i is the frequency of observations with
magnitude X_i, n is the total number of scores (sample) and Σ is 'the sum
of'.
The manual calculation becomes cumbersome if the mean has
several decimal places. A commonly used alternative works from an assumed
mean. The formula is listed below.
Sample Variance = s^2 = [(Σ f_i * d_i^2) / n - (Σ f_i * d_i / n)^2] * i^2 ..........9
Where, d_i is the deviation of the mid-point X_i from the assumed mean,
expressed in class-interval units, i is the size of the class interval, f_i is the
frequency of observations with magnitude X_i, n is the total number of scores
(sample) and Σ is 'the sum of'. Because d_i is measured in class-interval
units, the bracketed quantity is multiplied by i^2 to return the variance to
the squared units of the original measurements. See Box 17.4 for example 4.

Sample Variance = [(Σ f_i * d_i^2) / n - (Σ f_i * d_i / n)^2] * i^2
Sample Variance = [(417 / 110) - (-27 / 110)^2] * 10^2 = (3.7909 - 0.0602) * 100 = 373.07
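
The assumed-mean method can be sketched in Python from the summary totals alone. The function names are assumptions of this sketch, and it takes d_i to be the deviation of each mid-point from the assumed mean in class-interval units (the same convention under which the mean is recovered as assumed mean + (Σ f d / n) * i). Under that convention the bracketed quantity is scaled by the square of the class interval, and the divisor is n, as in Equation 9.

```python
def grouped_sums(midpoints, frequencies, assumed_mean, width):
    """Return n, sum(f*d) and sum(f*d^2) with d in class-interval units."""
    d = [(x - assumed_mean) / width for x in midpoints]
    n = sum(frequencies)
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(frequencies, d))
    return n, sum_fd, sum_fd2


def step_deviation_summary(n, sum_fd, sum_fd2, assumed_mean, width):
    """Mean and variance (divisor n) recovered from step-deviation totals."""
    mean_d = sum_fd / n
    mean = assumed_mean + mean_d * width
    variance = (sum_fd2 / n - mean_d ** 2) * width ** 2
    return mean, variance


# Totals quoted for Example 4: n = 110, sum(f*d) = -27, sum(f*d^2) = 417,
# assumed mean 55, class interval 10.
mean_4, variance_4 = step_deviation_summary(110, -27, 417, 55, 10)
print(round(mean_4, 2), round(variance_4, 2))
```

With the Example 4 totals this gives a mean of about 52.55 and a variance of about 373.07. `grouped_sums` is included so that the totals can be built from a frequency table of your own.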
The variance in the grouped data can also be calculated using the following
equation (often called machine formula).
Sample variance (s^2) = ((Σ f_i * X_i^2) - (Σ f_i * X_i)^2 / n) / (n - 1) ..........10
Where f_i is the frequency of observations with magnitude X_i.
But with a desk calculator it is often faster to use Equation 7 for each
individual observation, disregarding the class groupings. See Box 17.5
for example 5.
Box 17.5 Example 5
An investigation in a community on the bride price yielded the following data.
Find the variance in bride price.

Bride Price (in Thousand Rs)   Frequency (f_i)   Mid-Point of the Interval (X_i)   f_i * X_i   X_i^2
10 - 20
20 - 30
30 - 40
40 - 50
50 - 60

Σ f_i * X_i^2 = 82650; Σ f_i * X_i = 1880; (Σ f_i * X_i)^2 = (1880)^2 = 3534400
n = 50
Sample variance (s^2) = ((Σ f_i * X_i^2) - (Σ f_i * X_i)^2 / n) / (n - 1)
Sample variance (s^2) = (82650 - (3534400 / 50)) / 49 = (82650 - 70688) /
49 = 11962 / 49 = 244.12
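
Equation 10 can be sketched in Python as follows. The function name is an assumption of this sketch, and the frequency column is hypothetical: it is not taken from the text, and was chosen only because it reproduces the totals quoted in Box 17.5 (n = 50, Σ f X = 1880, Σ f X^2 = 82650) for the mid-points 15, 25, 35, 45 and 55.

```python
def variance_grouped(midpoints, frequencies):
    """Equation 10: machine formula for grouped data, divisor n - 1."""
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


midpoints = [15, 25, 35, 45, 55]               # mid-points of 10-20 ... 50-60
# Hypothetical frequencies (NOT from the text) that match the quoted totals.
hypothetical_frequencies = [11, 6, 8, 9, 16]

bride_price_variance = variance_grouped(midpoints, hypothetical_frequencies)
print(round(bride_price_variance, 2))
```

With these assumed frequencies the function returns 11962 / 49, about 244.12, the same value as the worked example.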

Reflection and Action 17.1
Following the examples in the text, provide your own examples for calculating
variance from ungrouped and grouped data.

17.4 The Standard Deviation


The standard deviation is the positive square root of the variance;
therefore, it has the same units as the original measurements. It can be
calculated using the following formula.
Standard deviation (s) = √(Sample Variance)
In Example 5, you found the sample variance to be 244.12, and therefore
you can work out the standard deviation to be s = √244.12 = 15.62
Thus the various examples given above for the calculation of variance also
explain the procedure of calculating the standard deviation.

17.5 Coefficient of Variation


Ratio scales are useful in social science research when an investigator is
interested in the variability of a sample on one characteristic as compared
to another.
The coefficient of variation is the percentage ratio of the standard deviation
to the mean and it is calculated using the following formula.
Coefficient of variation = (standard deviation * 100) / Mean
It is a useful measure of dispersion when the variability of variables
that are of unequal magnitude and/or have different units of measurement,
for example height and weight, is being compared.
In example 4, with the class interval i = 10, you would find that
Mean (M) = AM + (Σ f_i * d_i / n) * i = 55 + (-27 / 110) * 10 = 55 - 2.45 =
52.55,
Sample variance = [(417 / 110) - (-27 / 110)^2] * 10^2 = 373.07 and
Standard deviation (s) = √(Sample Variance) = √373.07 = 19.31
Coefficient of variation = s * 100 / M = 19.31 * 100 / 52.55 = 36.75
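
The standard deviation and the coefficient of variation can be sketched with Python's standard library. The function name `coefficient_of_variation` is an assumption of this sketch; the cattle series is the one used in the earlier boxes of this unit, and 244.12 is the bride-price variance found in Example 5.

```python
import math
import statistics


def coefficient_of_variation(scores):
    """Standard deviation as a percentage of the mean."""
    s = math.sqrt(statistics.variance(scores))   # sample variance, divisor n - 1
    return s * 100 / statistics.mean(scores)


cattle = [12, 11, 13, 20, 15, 18, 19, 17, 22, 23]
cattle_sd = statistics.stdev(cattle)             # square root of 156 / 9
cattle_cv = coefficient_of_variation(cattle)

bride_price_sd = math.sqrt(244.12)               # variance from Example 5

print(round(cattle_sd, 2), round(cattle_cv, 2), round(bride_price_sd, 2))
```

For the cattle data the standard deviation is about 4.16 head, which is about 24.5 per cent of the mean of 17. Because the coefficient is a pure percentage, it can be compared with the coefficient for a variable measured in quite different units.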
Reflection and Action 17.2
Work out the standard deviation and coefficient of variation of the examples you
selected in Reflection and Action 17.1.
17.6 Conclusion
After working out in Unit 16 how to measure the central tendency in
one's data, in Unit 17 you acquired the skill of measuring the dispersion of
data, which indicates the clustering of measurements around the center
of the distribution or, in other words, how
variable the measurements are.
You may agree with Sanders (1955: 90-91), who said that the range is an
easy measure to work out and understand because it requires only one
subtraction and it places stress on the extreme values. The mean absolute
deviation, on the other hand, gives equal weight to the deviation of
every observation and it is equally easy to work out and understand.
The squaring of deviations in calculating the standard deviation emphasises
the extreme values. The standard deviation is a more common measure
of dispersion. The value of every observation in a series affects the
value of this measure. A change in the value of any observation will
generate a change in the standard deviation value. Relatively few extreme
values can distort its value. The standard deviation cannot be
computed from an open-ended distribution. Finally, the co-efficient of
deviation is similar to the range as it is based on only two values, which
identify the range of the middle fifty percent of the values. It is mostly
used in sets of skewed data and it is possible to compute it in an
open-ended distribution.
Further Reading
Sanders, Donald 1955. Statistics. McGraw-Hill: New York
Unit 18
Statistical Inference: Tests of
Hypothesis
Contents
18.1 Introduction
18.2 Statistical Inference
18.3 Cases
18.4 Tests of Significance
18.5 Conclusion

Learning Objectives
It is expected that after reading Unit 18 you would be able to
• Draw statistical inferences on the basis of the concept of probability
• Use the tool of statistical inference to test hypotheses
• Apply the tool of statistical inference for estimating the unknown
parameter of the population under research.

18.1 Introduction
Unit 18 deals with statistical inference, which uses the concepts of
probability to explain the element of uncertainty in decision-making.
You will find that, though it occupies a lower status among statistical
tests, the chi-square test can be used in a wide variety of
researches. If you have a relatively small sample, it would be better to
use Student's t test. You will learn in Unit 18
in detail about both the chi-square and Student's t tests. For hypothesis
testing, Unit 18 is going to prove most helpful in the mini research
project that you have to complete as a part of your assignment of MSO
002.

18.2 Statistical Inference


Statistical inference uses the concept of probability to deal with uncertainty
in decision-making. It refers to the process of selecting and using a sample
statistic to draw inferences about a population parameter, based on a
sample drawn from the population. Statistical inference deals with
two classes of problems.
A. Hypothesis testing: It tests some hypotheses about the parent
population based on the sample drawn from the population.
B. Estimation: It uses the 'statistics' obtained from the sample as an
estimate of the unknown 'parameter' of the population from which
the sample is drawn.

A. Hypothesis testing
It begins with an assumption called a hypothesis that one makes about a
population parameter.
Steps in testing a hypothesis
i) Formulate a hypothesis
ii) Decide an appropriate significance level
iii) Select a test criterion
iv) Carry out calculations
v) Make Decisions
Let us discuss in brief each of the five steps.
i) Formulate a hypothesis: First of all, a hypothesis is set up about a
population parameter. Thereafter, sample data are collected, sample
statistics are calculated and the information is used to assess how far the
hypothesised parameter is correct. The validity of the assumption is tested
by examining the difference between the hypothesised value and the actual
value of the sample mean.
Conventionally, rather than a single hypothesis two are constructed.
These hypotheses are constructed in such a way that if one hypothesis is
accepted the other is rejected. The two hypotheses are called:
a. Null hypothesis (designated as H_0)
b. Alternative hypothesis (designated as H_A)
In the simplest form, a null hypothesis states that there is no true
difference between the sample statistic and the population parameter.
It asserts that the observed difference is accidental and/or unimportant,
arising out of the fluctuations in sampling.
A researcher, for instance, who wishes to test whether the annual per
capita income in a community is higher than Rs. 10,000 might formulate
the null and alternate hypotheses as under.
Null hypothesis (H_0): μ = 10,000
Alternative hypothesis (H_A): μ > 10,000
In another instance, a researcher might wish to test the mean difference
between the annual per capita incomes of two groups. In this case, she
might formulate the null and alternate hypotheses as under.
Null hypothesis (H_0): μ_1 - μ_2 = 0
Alternative hypothesis (H_A): μ_1 - μ_2 ≠ 0
ii) Decide an appropriate significance level: The next step in testing
hypotheses is to set up a suitable significance level to test the validity of
H_0 as against H_A. The confidence with which a null hypothesis is accepted
or rejected depends on the adopted significance level.
Figure 18.1 Acceptance (or rejection) of Null hypothesis (two-tailed) at the 5% and 1% probability levels, respectively
Conventionally, the significance level is expressed as a percentage, such
as 5% or 1%. In the former case, it would mean that there is a 5% probability
of rejecting a null hypothesis even if it is true. This means that there are 5
out of 100 chances that the investigator would reject a true hypothesis
(see Figure 18.1).
iii) Select a test criterion: The next step in hypothesis testing is to set
up a test criterion. An appropriate probability distribution that can be
applied is selected for the particular test. Some of the common probability
distributions are χ^2, t and F.
iv) Carry out calculations: Various statistics and their standard errors
are computed from the sample.
v) Make decisions: In this step statistical conclusions and decisions are
made to reject or accept the null hypothesis, depending on whether the
computed value falls in the region of acceptance or rejection (see Case
1 and Case 2 in 18.3).

Reflection and Action 18.1
Let us say that you are carrying out a research that has both the null and
alternative hypotheses. You need now to set up a suitable significance level to
test the validity of the null hypothesis as against the alternative hypothesis. For this task
as well as subsequent tasks, follow the procedure given in the text. Next, you
would need to set up a test criterion. For this purpose select an appropriate
probability distribution that can be applied for the particular test. Then carry
out the computation of various statistics and their standard errors. Now, based on
sample statistical conclusions, make decisions to reject or accept the null
hypothesis. This would depend on whether the computed value falls in the region
of acceptance or rejection. Work out the steps in concrete terms of your own
research project and incorporate them in your research work report.

18.3 Cases
Case 1: If the hypothesis is being tested at the 5% level and the observed
result has a probability of less than 5%, then the difference between the
sample statistic and the population parameter is significant and cannot
be explained by chance alone. Thus the null hypothesis (H_0) is rejected
and, in turn, the alternative hypothesis (H_A) is accepted.
Case 2: If the hypothesis is being tested at the 5% level and the observed
result has a probability of more than 5%, then the difference between
the sample statistic and the population parameter is not significant and
can be explained by chance variation. Thus the null hypothesis (H_0) is
accepted and, in turn, the alternative hypothesis (H_A) is rejected.
In hypothesis testing it is important to understand the following:
i) One-tailed and two-tailed tests of hypothesis; and
ii) Type I and Type II errors
i) One-tailed and two-tailed test of hypothesis
Depending on the research problem, the null and alternative hypotheses
are defined in such a way that the test is known as one-tailed or two-tailed.
A two-tailed test of hypothesis will reject the null hypothesis if
the sample statistic is significantly higher or lower than the population
parameter. Thus, in a two-tailed test at the 5% level the rejection region is
located on both tails, the size of the rejection region in each tail is .025,
and the central acceptance region is .95 (Figure 18.2). If the sample
mean falls within $\mu \pm 1.96$ SD (i.e. in the acceptance region), the
hypothesis is accepted. If, on the other hand, it falls beyond $\mu \pm 1.96$ SD,
then the hypothesis is rejected, as it falls in the rejection region.
Let us take an example of the two-tailed hypothesis. Suppose a researcher
is interested in knowing whether there is a gender difference in IQ. You can
formulate the following hypotheses.
IQ of Females = IQ of Males (Null hypothesis)
IQ of Females $\neq$ IQ of Males (Alternative hypothesis) or, in other words,
the IQ of females may be lower or higher than that of males.

Figure 18.2 One-Tailed and Two-Tailed Tests of Hypothesis. (A) and (B) are
One-Tailed, whereas (C) is Two-Tailed.
In contrast to the two-tailed hypothesis, in a one-tailed hypothesis the
rejection region is located on only one tail (see Figure 18.2). In this
case, the size of the rejection region is .05 if one is testing the
hypothesis at the 5% probability level. If the sample mean falls above
$\mu + 1.645$ SD (see Case A of Figure 18.2) or below $\mu - 1.645$ SD (see
Case B of Figure 18.2), then the hypothesis is rejected, as it falls in the
rejection region.
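The cut-off values 1.96 and 1.645 quoted above can be checked from the standard normal distribution. The following is only a minimal sketch using Python's standard library; the function name `critical_z` is a hypothetical helper introduced for illustration and is not part of the unit.

```python
from statistics import NormalDist


def critical_z(alpha, two_tailed):
    """Return the positive critical z value for a test at level alpha.

    In a two-tailed test the rejection region alpha is split equally
    between the two tails; in a one-tailed test it lies in one tail.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


z_two = critical_z(0.05, two_tailed=True)   # boundary for mu +/- z SD
z_one = critical_z(0.05, two_tailed=False)  # boundary for mu + z SD (or mu - z SD)
print(round(z_two, 3), round(z_one, 3))
```

At the 5% level the two results round to 1.96 and 1.645, the values used in the text.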
Let us take an example of the one-tailed hypothesis. Suppose a researcher
is interested in knowing whether the IQ of females is higher than that of
males. In this case, you can formulate the following hypotheses.
IQ of Females = IQ of Males (Null hypothesis)
IQ of Females > IQ of Males (Alternative hypothesis)
ii) Type I and Type II errors
A researcher's decision is correct when a true hypothesis is accepted
and a false hypothesis is rejected. Two kinds of wrong decision are
possible: rejecting a null hypothesis that is true (Type I error) and
accepting a null hypothesis that is false (Type II error), as shown in
Figure 18.3.

                   Accept $H_0$         Reject $H_0$
$H_0$ is true      Correct decision     Type I error
$H_0$ is false     Type II error        Correct decision

Figure 18.3 Type I and Type II Errors in Testing a Hypothesis.

The probability of a Type I error is designated as $\alpha$ (alpha), whereas
the probability of a Type II error is designated as $\beta$ (beta). It is
important to note that both types of errors cannot be reduced simultaneously,
as a reduction in one leads to an increase in the other if the sample size
remains unchanged. Thus if Type I error decreases, Type II error will
increase. In most statistical tests, the level of significance is fixed at
the 5% probability level ($\alpha$ = 0.05). This means that the probability
of accepting a true hypothesis is 95%. Sometimes the level of significance
is fixed at the 1% probability level ($\alpha$ = 0.01). In that case, the
probability of accepting a true hypothesis is 99%, but the probability of
accepting a false hypothesis also increases.
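The trade-off between the two errors can be illustrated numerically. The sketch below assumes a one-tailed test based on the normal distribution and a true mean that lies 2 standard errors above the hypothesised mean; both the function `type_ii_error` and the value 2.0 are assumptions made purely for illustration.

```python
from statistics import NormalDist


def type_ii_error(alpha, effect_in_se):
    """Probability of accepting a false H0 in a one-tailed z test.

    effect_in_se is the (assumed) distance between the true mean and the
    hypothesised mean, measured in standard errors.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)        # start of the rejection region
    return z.cdf(critical - effect_in_se)  # area that stays in the acceptance region


for alpha in (0.05, 0.01):
    print(alpha, round(type_ii_error(alpha, effect_in_se=2.0), 3))
```

With the sample size held fixed, moving from the 5% level to the 1% level gives a larger value of beta, which is the point made in the paragraph above.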
Reflection and Action 18.2
Carry on with hypothesis testing with the same example as you had taken in
Reflection and Action 18.1. You now need to carry out both one-tailed and
two-tailed tests of your hypothesis. A two-tailed test of hypothesis will reject
the null hypothesis if the sample statistic is significantly higher or lower than
the population parameter. In a two-tailed test of hypothesis the rejection region
is located on both tails and the size of the rejection region in each tail is
.025, whereas the central acceptance region is .95. In contrast to the two-tailed
hypothesis, you would notice that in a one-tailed hypothesis the rejection region
is located on only one tail. Similarly, you would need to consider the Type I and
Type II errors. Designate Type I error as $\alpha$ (alpha) and Type II error as
$\beta$ (beta). As pointed out in the text above, you need to remember that both
types of errors cannot be reduced simultaneously, as a reduction in one leads to
an increase in the other if the sample size remains unchanged. If you follow the
text in Section 18.3, you would be able to carry out both sets of tests on the
hypothesis of your research. Make sure to include them in your research work
report.

18.4 Tests of Significance
i) Chi-square test ($\chi^2$)

Chi-square is probably the most commonly used of all non-parametric
tests. It is applicable when data are nominal and grouped in categories.
You can examine the difference between the observed and the expected
frequencies.

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Where O and E are the observed and expected frequencies respectively.


The calculated value of $\chi^2$ is compared with the table value of $\chi^2$
for the given degrees of freedom at a certain specified level of significance
(e.g. 5%). If the calculated value of $\chi^2$ is higher than the table value
of $\chi^2$, then the difference between the theory and observation is
considered to be significant. On the other hand, if the calculated value of
$\chi^2$ is lower than the table value of $\chi^2$, then the difference between
the theory and observation is considered to be non-significant.
As mentioned above, while comparing the calculated value of $\chi^2$ with the
table value of $\chi^2$, one has to determine the degrees of freedom. Degree
of freedom is the number of classes to which the values can be allocated
at will or arbitrarily without violating the limitations or restrictions. For
instance, if one has to choose four numbers whose sum is 100, the
freedom of choice exists only for selecting three numbers, and the
fourth is determined automatically. If, for example, the first three numbers
are 14, 26 and 32, then the fourth is fixed and must be 28 (100 - (14 + 26 +
32)). In this case the degree of freedom is three. Chi-square is used for
a variety of purposes. There are also numerous tests that are close to
$\chi^2$. Here, the test of goodness of fit and the test of homogeneity /
association are presented.
ii) Test of goodness of fit: We often want to know whether the observed
frequencies are in agreement with the expected theoretical (probability)
distribution or not. The following steps may be followed:
Step 1: Define null and alternative hypotheses.
Step 2: Decide probability level.
Step 3: Estimate the expected frequency E for each category based on
theory and/or probability.
Step 4: Calculate chi-square.
Step 5: Determine the degree of freedom.
Step 6: Compare the observed chi-square with the tabulated chi-square.
Accept/reject the null hypothesis.
See Box 18.1 for an example.

Box 18.1 Example: Test whether one form of transport is favoured
significantly more than another

Mode of Transport

Frequencies   Car   Bus   Metro   Scooter   Train   Total
Observed      18    21    19      20        22      100
Expected*     20    20    20      20        20      100

Solution:
Step 1: Null hypothesis: There is no significant difference in the choice of the
type of transportation.
Alternative hypothesis: There is a significant difference in the choice of the
type of transportation.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: The expected frequency (20) in all the categories is based on the
assumption that each type of transportation is equally likely to be chosen.
Step 4: Calculations:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
$\chi^2 = \frac{(18 - 20)^2}{20} + \frac{(21 - 20)^2}{20} + \frac{(19 - 20)^2}{20} + \frac{(20 - 20)^2}{20} + \frac{(22 - 20)^2}{20}$
$\chi^2 = \frac{4}{20} + \frac{1}{20} + \frac{1}{20} + 0 + \frac{4}{20} = \frac{10}{20} = 0.5$

Step 5: Degree of freedom = k - 1 = 5 - 1 = 4

Step 6: The table value of chi-square at 5% probability level for 4 degrees of
freedom is 9.49. The calculated value of $\chi^2$ (0.5) is lower than the table
value of $\chi^2$ (9.49). Thus the null hypothesis is accepted: the difference
between the theory and observation is non-significant, and there is no
significant difference in the choice of the type of transportation.
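The six steps of Box 18.1 can be sketched in a few lines of code. This is only an illustrative sketch: the function name `chi_square_gof` is hypothetical, and the table value 9.49 is simply copied from the box rather than computed.

```python
def chi_square_gof(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


observed = [18, 21, 19, 20, 22]  # Car, Bus, Metro, Scooter, Train (Box 18.1)
expected = [20] * 5              # equal choice among the five modes
statistic, df = chi_square_gof(observed, expected)

TABLE_VALUE = 9.49               # 5% level, 4 d.f., as quoted in Box 18.1
decision = "reject" if statistic > TABLE_VALUE else "accept"
print(round(statistic, 3), df, decision)
```

The statistic comes to 0.5 with 4 degrees of freedom, so the null hypothesis is accepted, as in the box.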

iii) Test of association/homogeneity: This type of test is used for two
purposes. The first purpose is to examine whether or not two or
more attributes are associated (test of association). The second purpose
is to determine whether two samples are drawn from the same population
or not (test of homogeneity). In the former case the data are based on
one sample, whereas in the latter there are two or more samples.
Chi-square, a non-parametric test, is a rough estimate of confidence; it
accepts weaker, less accurate data as input than parametric tests
such as t-tests and the analysis of variance, and therefore has less status in
the pantheon of statistical tests. Nonetheless, its limitations are also its
strengths: because chi-square is more 'forbearing' in the data it will
accept, it can be used in a wide variety of research.
The steps in the chi-square method for the test of homogeneity remain
the same as those for the test of goodness of fit, except that in Step 3 the
expected frequencies are calculated for each cell as illustrated.

Populations     Attribute                                  Total
                Category 1    Category 2    Category 3
Population 1    A             B             C              $N_1$
Population 2    D             E             F              $N_2$
Total           $N_3$         $N_4$         $N_5$          N

Expected Frequency of Cell A = $(N_1 \times N_3) / N$
Expected Frequency of Cell B = $(N_1 \times N_4) / N$
Expected Frequency of Cell C = $(N_1 \times N_5) / N$
Expected Frequency of Cell D = $(N_2 \times N_3) / N$
Expected Frequency of Cell E = $(N_2 \times N_4) / N$
Expected Frequency of Cell F = $(N_2 \times N_5) / N$
See Box 18.2 for an example to find out if there was a difference in the
income of the two groups. On the basis of this example, you may take up
another case to test association/homogeneity.

Box 18.2 Example to Examine if the Bhils and Minas Differ in their Income

Populations     Income groups                 Total
                High    Middle    Low
Population 1    28      41        65          134
Population 2    31      43        55          129
Total           59      84        120         263

Solution:
Step 1: Null hypothesis: There is no significant difference in income
between Bhils and Minas.
Alternative hypothesis: There is a significant difference in income between
Bhils and Minas.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: The expected frequencies are as given below.

Populations     High                   Middle                 Low                    Total
                Observed   Expected    Observed   Expected    Observed   Expected
Population 1    28         30.06       41         42.80       65         61.14       134
Population 2    31         28.94       43         41.20       55         58.86       129
Total           59                     84                     120                    263

Expected Frequency of Cell A = $(N_1 \times N_3) / N$ = 134 * 59 / 263 = 30.06
Expected Frequency of Cell B = $(N_1 \times N_4) / N$ = 134 * 84 / 263 = 42.80
Expected Frequency of Cell C = $(N_1 \times N_5) / N$ = 134 * 120 / 263 = 61.14
Expected Frequency of Cell D = $(N_2 \times N_3) / N$ = 129 * 59 / 263 = 28.94
Expected Frequency of Cell E = $(N_2 \times N_4) / N$ = 129 * 84 / 263 = 41.20
Expected Frequency of Cell F = $(N_2 \times N_5) / N$ = 129 * 120 / 263 = 58.86
Step 4: Calculations:
$\chi^2 = \sum \frac{(O - E)^2}{E}$ = 0.141 + 0.076 + 0.244 + 0.147 + 0.079 + 0.253 = 0.940

Step 5: Degree of freedom = (No. of rows - 1) * (No. of columns - 1) =
(2 - 1) * (3 - 1) = 2
Step 6: The table value of chi-square at 5% probability level for 2 degrees
of freedom is 5.991. The calculated value of $\chi^2$ (0.940) is lower than the
table value of $\chi^2$ (5.991). Thus the null hypothesis is accepted: the
difference between the theory and observation is non-significant, and there
is no significant difference in the income of Bhils and Minas.
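The expected-frequency rule (row total * column total / N) and the resulting chi-square can be sketched as below. The function name `chi_square_contingency` is hypothetical. The first row of observed frequencies (28, 41, 65) is the one given in Box 18.2; the second row (31, 43, 55) is obtained here by subtracting the first row from the column totals 59, 84 and 120 used in Step 3.

```python
def chi_square_contingency(table):
    """Chi-square test of association/homogeneity for an r x c table.

    Returns (statistic, degrees of freedom, expected frequencies).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of a cell = row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


income = [
    [28, 41, 65],  # first population: High, Middle, Low
    [31, 43, 55],  # second population, derived from the column totals
]
statistic, df, expected = chi_square_contingency(income)
print(round(statistic, 2), df, [round(e, 2) for e in expected[0]])
```

The statistic agrees with the value 0.940 in the box up to rounding, and the first row of expected frequencies reproduces 30.06, 42.80 and 61.14.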
There is a short-cut method for the calculation of $\chi^2$ if the frequency
distribution is arranged in a '2x2 contingency table', as illustrated in Figure
18.3.

Variable 1    Variable 2                   Total
              Category 1    Category 2
Sample 1      A             B              A+B
Sample 2      C             D              C+D
Total         A+C           B+D            N=A+B+C+D

Figure 18.3 Short-cut Method to Calculate $\chi^2$

$\chi^2 = \frac{N (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)}$

The calculated value of $\chi^2$ is examined against the tabulated value at
1 d.f. at the specified probability level to ascertain significance. See Box
18.3 to find out whether there is a significant difference between males and
females in terms of their occupations.
Box 18.3 Example to Examine Significant Gender-based Difference in
Occupation as Skilled/Unskilled Labourers
Gender     Skilled Labourers    Unskilled Labourers
Males      47                   56
Females    32                   71
Solution
Step 1: Null hypothesis: There is no significant sex difference in skilled
and unskilled labourers.
Alternative hypothesis: There is a significant sex difference in skilled
and unskilled labourers.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
Gender     Skilled Labourers    Unskilled Labourers    Total
Males      47                   56                     103
Females    32                   71                     103
Total      79                   127                    206

N = 206, A*D = 3337, B*C = 1792
A + B = 103, C + D = 103, A + C = 79, B + D = 127
$\chi^2 = \frac{N (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)}$
$\chi^2 = \frac{206 \times (3337 - 1792)^2}{103 \times 103 \times 79 \times 127}$
$\chi^2 = \frac{491727150}{106440097} = 4.620$
Step 4: Degree of freedom = (No. of rows - 1) * (No. of columns - 1) =
(2 - 1) * (2 - 1) = 1
Step 5: The table value of chi-square at 5% probability level for 1 degree
of freedom is 3.841. The calculated value of $\chi^2$ (4.620) is higher than
the table value of $\chi^2$ (3.841).
So you can conclude that the null hypothesis is rejected and
the sex difference between skilled and unskilled labourers is significant.
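The short-cut formula for a 2x2 table is easy to verify in code. The sketch below uses the frequencies of Box 18.3; the function name `chi_square_2x2` is a hypothetical helper, and the table value 3.841 is copied from the box.

```python
def chi_square_2x2(a, b, c, d):
    """Short-cut chi-square for a 2x2 table with cells A, B (row 1) and C, D (row 2)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Males: 47 skilled, 56 unskilled; Females: 32 skilled, 71 unskilled
statistic = chi_square_2x2(47, 56, 32, 71)
TABLE_VALUE = 3.841  # 5% level, 1 d.f., as quoted in Box 18.3
print(round(statistic, 3), statistic > TABLE_VALUE)
```

The result rounds to 4.620, which exceeds 3.841, so the null hypothesis is rejected.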
iv) Student's t test (t)
Student's t test is a parametric test most suitable for a small sample. It
is probably the most widely used statistical test and certainly the most
widely known. It is simple, straightforward, easy to use, and adaptable
to a broad range of situations. No statistical toolbox should ever be
without it. "Student" (real name: W. S. Gosset) developed the statistical
methods to solve problems stemming from his employment in a brewery.
Like chi-square, the following steps may be followed for the use of the
Student's t test:
Step 1: Define null and alternative hypotheses.
Step 2: Decide probability level.
Step 3: Calculate the value of t using the appropriate formula.
Step 4: Determine the degree of freedom.
Step 5: Compare the observed value of t with the tabulated value of t.
Accept or reject the null hypothesis.
Student's t test is applied in different conditions, such as
a) To test the significance of the mean of a random sample
b) To test the difference between the means of the two independent
samples
c) To test the difference between the means of the two dependent
samples
d) To test the significance of the correlation coefficient.
Let us discuss each of the above conditions.
a) To test the significance of the mean of a random sample: This test
is used when the researcher is interested in examining whether the
mean of a sample from a normal population deviates significantly
from the hypothetical population mean. The following formula is used
for its calculation:
$t = \frac{(M - \mu) \sqrt{n}}{S}$
When using actual mean:
$S = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$
When using assumed mean:
$S = \sqrt{\frac{\sum d^2 - n \bar{d}^2}{n - 1}}$
Where M and $\mu$ are the means of the sample and population respectively;
n is the sample size;
S is the standard deviation of the sample;
d = X - A, X being the variable;
$\bar{d}$ is the mean of the deviations; and
A is the assumed mean.
Let us take an example, in Box 18.4, of testing the mean nutritional intake.

Box 18.4 Example to Test whether the Mean Nutritional Intake in the
Population is 2000 Calories
Nutritional Intake (Calories)
2300 2000 2150 1950 2000 2150 1900 1900 2250 2050

Solution:
Step 1: Null hypothesis: The mean nutritional intake in the population,
from which the sample is drawn, is 2000 Calories.
Alternative hypothesis: The mean nutritional intake in the population,
from which the sample is drawn, is not 2000 Calories.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
Nutritional Intake (Calories)    d = X - A    $d^2$
2300                             300          90000
2000                             0            0
2150                             150          22500
1950                             -50          2500
2000                             0            0
2150                             150          22500
1900                             -100         10000
1900                             -100         10000
2250                             250          62500
2050                             50           2500
20650                            650          222500

M = 20650 / 10 = 2065, $\mu$ = 2000, $\bar{d}$ = 650 / 10 = 65

$S = \sqrt{\frac{\sum d^2 - n \bar{d}^2}{n - 1}}$

$S = \sqrt{\frac{222500 - 10 \times 65^2}{9}} = 141.52$

$t = \frac{(M - \mu) \sqrt{n}}{S} = \frac{(2065 - 2000) \times \sqrt{10}}{141.52} = 1.452$
Step 4: Degree of freedom = 10 - 1 = 9
Step 5: The table value of t at 5% probability level for 9 degrees of
freedom is 2.262. The calculated value of t (1.452) is lower than the
table value of t (2.262).
Thus the null hypothesis is accepted and the difference is not significant.
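The calculation in Box 18.4 can be reproduced with the actual-mean form of the formula. This is a minimal sketch using Python's standard library; `one_sample_t` is a hypothetical helper name, and the ten intake values are those listed in the calculation table of the box.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu):
    """Return (t, degrees of freedom) for testing a sample mean against mu.

    statistics.stdev divides by n - 1, which matches the formula for S above.
    """
    n = len(sample)
    t = (mean(sample) - mu) * sqrt(n) / stdev(sample)
    return t, n - 1


intake = [2300, 2000, 2150, 1950, 2000, 2150, 1900, 1900, 2250, 2050]
t, df = one_sample_t(intake, mu=2000)
print(round(t, 3), df)
```

The result agrees with the assumed-mean calculation in the box (t of about 1.452 with 9 degrees of freedom), which illustrates that the two forms of S are equivalent.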
b) To test the difference between the means of the two independent
samples: This test is used when the researcher is interested in examining
whether the respective means of two independent samples differ
significantly from each other. The following formula is used for its
calculation:
$t = \frac{(M_1 - M_2) \sqrt{n_1 n_2 / (n_1 + n_2)}}{S}$
When using actual mean:
$S = \sqrt{\frac{\sum (X_1 - M_1)^2 + \sum (X_2 - M_2)^2}{n_1 + n_2 - 2}}$
When using assumed mean:
$S = \sqrt{\frac{\sum d_1^2 + \sum d_2^2 - n_1 (M_1 - A_1)^2 - n_2 (M_2 - A_2)^2}{n_1 + n_2 - 2}}$
Where $d_1 = X_1 - A_1$ and $d_2 = X_2 - A_2$ respectively;
$M_1$ and $M_2$ are the respective means of the two samples;
$A_1$ and $A_2$ are the assumed means of the two samples;
$n_1$ and $n_2$ are the sample sizes; and S is the common standard deviation.
Let us take an example, in Box 18.5, to find out whether marital distance
differs between the Santhals and Murias.

Box 18.5 Example to Examine whether Santhals and Murias Differ in Marital
Distance
Marital Distance (km)
Santhals   10 12 15 17 18 17 19 22 22 12
Murias     22 19 21 23 18 21 23 20 19 21

Solution:
Step 1: Null hypothesis: Santhals and Murias do not differ in their marital
distance.
Alternative hypothesis: Santhals and Murias differ in their marital distance.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:

Santhals ($A_1$ = 16)                      Murias ($A_2$ = 20)
$X_1$   $d_1 = X_1 - A_1$   $d_1^2$        $X_2$   $d_2 = X_2 - A_2$   $d_2^2$
10      -6                  36             22      2                   4
12      -4                  16             19      -1                  1
15      -1                  1              21      1                   1
17      1                   1              23      3                   9
18      2                   4              18      -2                  4
17      1                   1              21      1                   1
19      3                   9              23      3                   9
22      6                   36             20      0                   0
22      6                   36             19      -1                  1
12      -4                  16             21      1                   1
164     4                   156            207     7                   31

$A_1$ = 16, $A_2$ = 20, $M_1$ = 16.4, $M_2$ = 20.7
$n_1$ = 10, $n_2$ = 10, $\sum d_1^2$ = 156, $\sum d_2^2$ = 31
$S = \sqrt{\frac{\sum d_1^2 + \sum d_2^2 - n_1 (M_1 - A_1)^2 - n_2 (M_2 - A_2)^2}{n_1 + n_2 - 2}}$
$S = \sqrt{\frac{156 + 31 - 10 (16.4 - 16)^2 - 10 (20.7 - 20)^2}{10 + 10 - 2}}$
$S = \sqrt{\frac{180.5}{18}} = \sqrt{10.028} = 3.167$
$t = \frac{(M_1 - M_2) \sqrt{n_1 n_2 / (n_1 + n_2)}}{S}$
$t = \frac{(16.4 - 20.7) \times \sqrt{100 / 20}}{3.167} = \frac{-4.3 \times 2.236}{3.167} = -3.036$, i.e. 3.036 ignoring the sign
Step 4: Degree of freedom = 10 + 10 - 2 = 18
Step 5: The table value of t at 5% probability level for 18 degrees of
freedom is 2.101. The calculated value of t (3.036) is higher than the
table value of t (2.101).
You can say that the null hypothesis is rejected and the difference in
marital distance between Santhals and Murias is significant.
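The two-sample calculation of Box 18.5 can be checked with the actual-mean form of the common standard deviation. The sketch below is illustrative only; `pooled_t` is a hypothetical helper name, and the table value 2.101 is copied from the box.

```python
from math import sqrt
from statistics import mean


def pooled_t(x1, x2):
    """Return (t, degrees of freedom) for two independent samples.

    S is the common standard deviation computed from deviations
    about the actual means of the two samples.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    sum_sq = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    s = sqrt(sum_sq / (n1 + n2 - 2))
    t = (m1 - m2) * sqrt(n1 * n2 / (n1 + n2)) / s
    return t, n1 + n2 - 2


santhals = [10, 12, 15, 17, 18, 17, 19, 22, 22, 12]
murias = [22, 19, 21, 23, 18, 21, 23, 20, 19, 21]
t, df = pooled_t(santhals, murias)
TABLE_VALUE = 2.101  # 5% level, 18 d.f., as quoted in Box 18.5
print(round(t, 3), df, abs(t) > TABLE_VALUE)
```

The sign of t is negative because the Santhal mean is the smaller one; it is the absolute value, about 3.036, that is compared with the table value.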
c) To test the difference between the means of the two dependent
samples: This test is used when the researcher is interested in examining
whether the means of two dependent samples differ significantly from
each other. The following formula is used for its calculation:
$t = \frac{\bar{d} \sqrt{n}}{S}$
$S = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}}$ or
$S = \sqrt{\frac{\sum d^2 - n \bar{d}^2}{n - 1}}$
Where $d = X_1 - X_2$;
$\bar{d}$ is the mean of the deviations;
n is the number of pairs of observations; and
S is the standard deviation of the deviations. Let us take an example, in
Box 18.6, to find out differences in the observations of two researchers.

Observer 1   2400 1950 2200 1800 2050 2250 2000 1950 2300 2000

Solution:
Step 1: Null hypothesis: The difference in the observation by the two
observers is not significant.
Alternative hypothesis: The difference in the observation by the two
observers is significant.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
$S = \sqrt{\frac{\sum d^2 - n \bar{d}^2}{n - 1}}$
$S = \sqrt{\frac{67500 - 10 \times 25^2}{9}}$
S = 82.496
$t = \frac{\bar{d} \sqrt{n}}{S} = \frac{25 \times \sqrt{10}}{82.496} = 0.958$

Step 4: Degree of freedom = 10 - 1 = 9

Step 5: The table value of t at 5% probability level for 9 degrees of
freedom is 2.262. The calculated value of t (0.958) is lower than the
table value of t (2.262).
You would say that the null hypothesis is accepted and the difference
between the observers is not significant.
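The paired calculation can be sketched from the summary figures that appear in Step 3. The function `paired_t_from_sums` is a hypothetical helper; the inputs are n = 10, the sum of squared differences 67500, and a sum of differences of 250, which is inferred here from the mean difference of 25 used in the calculation.

```python
from math import sqrt


def paired_t_from_sums(n, sum_d, sum_d2):
    """Return (t, degrees of freedom) for dependent samples from summary sums.

    sum_d is the sum of the pairwise differences d = X1 - X2 and
    sum_d2 is the sum of their squares.
    """
    d_bar = sum_d / n
    s = sqrt((sum_d2 - n * d_bar ** 2) / (n - 1))
    return d_bar * sqrt(n) / s, n - 1


t, df = paired_t_from_sums(n=10, sum_d=250, sum_d2=67500)
print(round(t, 3), df)
```

The result, about 0.958 with 9 degrees of freedom, matches the calculated value quoted in Step 5.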
d) To test the significance of the correlation coefficient: Whether a
coefficient of correlation is significant or not may be tested using the
following formula:

$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$

Where r is the coefficient of correlation and n is the number of
observations. Degree of freedom is n - 2. In Box 18.7, we take an
example to test the significance of a correlation.

Box 18.7 Example: Using the Following Data to Test the Significance of
the Correlation
r = 0.45, n = 102
Step 1: Null hypothesis: The coefficient of correlation is not significant.
Alternative hypothesis: The coefficient of correlation is significant.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$
$t = \frac{0.45 \times \sqrt{100}}{\sqrt{1 - 0.45 \times 0.45}}$
$t = \frac{0.45 \times 10}{\sqrt{1 - 0.2025}} = \frac{4.5}{\sqrt{0.7975}} = \frac{4.5}{0.893} = 5.039$
Step 4: Degree of freedom = 102 - 2 = 100
Step 5: The table value of t at 5% probability level for 100 degrees of freedom is
1.984. The calculated value of t (5.039) is higher than the table value of t (1.984).
Thus, you would find that the null hypothesis is rejected and the correlation is
significant.
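The formula for testing a correlation coefficient translates directly into code. This is a minimal sketch with a hypothetical function name, `correlation_t`, applied to the figures of Box 18.7.

```python
from math import sqrt


def correlation_t(r, n):
    """Return (t, degrees of freedom) for testing the significance of r."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return t, n - 2


t, df = correlation_t(r=0.45, n=102)
print(round(t, 3), df)
```

The result is about 5.039 with 100 degrees of freedom, as in the box.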

Reflection and Action 18.3
Of the four tests in Section 18.4, select one test and carry it out with respect to
your own research work. Write it out in detail in your research work report.
18.5 Conclusion
Unit 18 has provided you with a range of ways to draw inferences. A good
number of examples are given for you to try out and to use as models for
your own. Working through as many examples as possible
would help you to master the skills of testing hypotheses and estimating
unknown parameters of the population. You need to keep in mind that
no matter what design you use to test a hypothesis, you would reach
only approximations in terms of probability. The testing of a hypothesis
prepares the ground for generating further hypotheses, and in
this manner scientific knowledge progresses. Initial approximations
put the original hypothesis on a firm basis, and from this you can further
deduce other hypotheses. If you are able to establish links between
propositions, you would have generated scientific knowledge.

Further Reading
Handel, J. D. 1978. Statistics for Sociology. Englewood Cliffs, N.J.
Watson, G. and McGaw, D. 1980. Statistical Inquiry: Elementary Statistics
for the Political Science and Policy Sciences. John Wiley: New York
Unit 19
Correlation and Regression
Contents
19.1 Introduction
19.2 Correlation
19.3 Method of Calculating Correlation of Ungrouped Data
19.4 Method of Calculating Correlation of Grouped Data
19.5 Regression
19.6 Conclusion
Learning Objectives
It is expected that after reading Unit 19 you would be able to
❖ Appreciate the relevance of the analysis of co-variation between
two or more variables
❖ Describe different types of correlation
❖ Elaborate methods of calculating correlation of both ungrouped
and grouped data
❖ Understand the method of regression analysis that helps in
estimating the values of a variable from the knowledge of one or
more variables.

19.1 Introduction
In the concluding section of Unit 18, we mentioned the linkages
between propositions. Let us now discuss the subject of correlation
and regression.
Unit 19 is about correlation, that is, an analysis of co-variation between
two or more variables. You would notice that the statistical tool of
correlation helps to measure and express the quantitative relationship
between two variables. Unit 19 elaborates the ways of applying the tool.
It shows the relevance of the coefficient of correlation, the coefficient of
determination and regression analysis in the social sciences. Further, it
explains regression analysis, which is the method of estimating the values
of a variable from the knowledge of one or more variables. The unit encourages
you to use the statistical tool of correlation without fear or apprehension
that its application is difficult and complex.

19.2 Correlation
Correlation is an analysis of the co-variation between two or more
variables. When the relationship between two variables is quantitative,
the statistical tool for measuring the relationship and expressing it in a
brief formula is known as correlation. If a change in one variable results
in a corresponding change in the other, the two variables are correlated.
Let us look at types of correlation.
Types of correlation
Probing into the types of correlation, we consider two classifications:
A) Positive and negative correlation;
B) Linear and non-linear correlation
A) Positive and negative correlation
If the values of the two variables deviate in the same direction, i.e. if
an increase in the value of one results, on average, in a corresponding
increase in the value of the other, or if a decrease in the value of one
variable results in a decrease in the value of the other, then the correlation
is said to be positive or direct. Some examples of positive
correlation are (i) height and weight, and (ii) land owned and household
income. On the other hand, if the variables deviate in opposite
directions, i.e. if an increase (decrease) in the value of one variable, on
average, results in a decrease (increase) in the value of the other
variable, then the correlation is negative or indirect. Some examples of
negative correlation are (i) physical assets and the level of poverty, and (ii)
muscle strength and age. Figure 19.1 shows the positive and negative
types of correlation.


Figure 19.1 (a) Positive Correlation and (b) Negative Correlation


The values of correlation range from -1 to +1. When r = +1, it means
there is perfect positive correlation between the variables. When r = -1,
there is perfect negative correlation. When r = 0, it means there is no
correlation between the two variables (see Figure 19.2).

Figure 19.2 (a) Perfect Positive Correlation (r = +1) and
(b) Perfect Negative Correlation (r = -1)
B) Linear and non-linear correlation
The correlation between two variables is said to be linear if, corresponding
to a unit change in one variable, there is a constant change in the other
variable over the entire range of the values. Consider the following data
in Figure 19.3.

Figure 19.3 Constant Change Figuring in the Entire Range of Values

In this case, the data in Figure 19.3 can be represented by the relation
Y = 1 + 2X. In general, two variables are said to be linearly related if
there exists a relationship of the form Y = a + bX.
On the other hand, the relationship between the two variables is said to
be non-linear or curvilinear if, corresponding to a unit change in one
variable, the other variable changes not at a constant but at a fluctuating
rate. An example of a non-linear correlation is given by the following data set
in Figure 19.4.

Figure 19.4 Non-linear Correlation

In the example in Figure 19.4, there is a fluctuating (not constant) change
in the value of Y corresponding to a unit change in the value of X, and thus
it represents a non-linear correlation.
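One way to see the distinction is to look at the change in Y for each unit change in X. The sketch below uses the relation Y = 1 + 2X from the text for the linear case; the X values and the curved series Y = X * X are hypothetical data chosen only for illustration, and `first_differences` is a hypothetical helper name.

```python
def first_differences(values):
    """Change in Y between successive unit steps of X."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


xs = [1, 2, 3, 4, 5]                # hypothetical X values, one unit apart
linear = [1 + 2 * x for x in xs]    # Y = 1 + 2X, the relation given in the text
curved = [x * x for x in xs]        # hypothetical non-linear series

print(first_differences(linear))    # constant change: [2, 2, 2, 2]
print(first_differences(curved))    # changing rate:   [3, 5, 7, 9]
```

A constant list of differences indicates a linear relationship; differences that vary indicate a non-linear one.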
You would like to know how to study correlation. Let us briefly discuss the
methods of studying correlation. But before going on to the methods of
studying correlation, let us complete Reflection and Action 19.1.
Reflection and Action 19.1
Relating to your hypothesis, draw the figure of its positive and negative
correlations. Next draw another figure of perfect positive and perfect negative
correlations. In addition, draw two more figures of constant change reflected in
the entire range of values and non-linear correlation. You may take the help of
Figures 19.1 to 19.4 in the text above for drawing your figures.

Methods of studying correlation

The various methods to determine whether there is a correlation between
two variables are (i) Scatter diagram; (ii) Graphic method; (iii) Karl
Pearson's coefficient of correlation; (iv) Rank method; (v) Concurrent
deviation method; and (vi) Method of least squares. Of these, the first
two are based on the knowledge of diagrams and graphs and the rest on
mathematical tools. Of the several mathematical tools used, the most
popular is Karl Pearson's coefficient of correlation (r), and thus we will
focus on this method. The procedure is different for calculating correlation
from ungrouped and grouped data.

19.3 Method of Calculating Correlation of Ungrouped Data
There are various methods for the calculation of the coefficient of
correlation from ungrouped data:
i) Using actual mean
ii) Using assumed mean
iii) Direct method
The use of all these methods is illustrated with the help of the following
example.
Example: Find out the correlation coefficient (Karl Pearson's) between
the age at marriage of husbands and wives using the following data in
Figure 19.5.

Figure 19.5 Correlation Coefficient between the Age at Marriage
of Husbands and Wives

Method of calculating correlation coefficient using the actual mean

You would first learn the method of calculating the correlation coefficient
using the actual mean and then you would actually carry out the calculation
itself.
The formula used for calculating r is:
$r = \frac{\sum xy}{N \sigma_x \sigma_y}$
Where x = (X - $M_x$), in which $M_x$ is the mean of the series of X values;
y = (Y - $M_y$), in which $M_y$ is the mean of the series of Y values;
$\sigma_x$ = standard deviation of series X;
$\sigma_y$ = standard deviation of series Y; and
N = number of pairs of observations.
This formula can also be expressed as:
$r = \frac{\sum xy}{\sqrt{\sum x^2 \times \sum y^2}}$
The following steps elucidate the calculation of the coefficient of correlation.
I. Take deviations of X series from the mean of X and denote them
by x;
II. Square these deviations and obtain the total, i.e. $\sum x^2$;
III. Take deviations of Y series from the mean of Y and denote them
by y;
IV. Square these deviations and obtain the total, i.e. $\sum y^2$;
V. Multiply the deviations x and y and obtain the total $\sum xy$; and
VI. Substitute the values of $\sum x^2$, $\sum y^2$ and $\sum xy$ in the above formula.
Calculation of correlation coefficient using actual mean
After learning the method, let us now make the calculation as reflected
in Figure 19.6.

Figure 19.6 Calculation of Correlation Coefficient using Actual Mean

Method of calculating correlation coefficient using assumed mean

The only difference between this method and the one above is that there
the deviations are taken from the actual mean, whereas here they are taken
from an assumed mean (i.e. by looking at the series of X
and Y, assume means for X and Y and proceed in the same manner).
Calculation of correlation coefficient using assumed mean
You would now calculate as per Figure 19.7.
X      $d_x = X - A_x$   $d_x^2$   Y      $d_y = Y - A_y$   $d_y^2$   $d_x d_y$
28     3                 9         22     0                 0         0
25     0                 0         23     1                 1         0
24     -1                1         21     -1                1         1
29     4                 16        25     3                 9         12
31     6                 36        26     4                 16        24
22     -3                9         20     -2                4         6
21     -4                16        19     -3                9         12
25     0                 0         21     -1                1         0
26     1                 1         21     -1                1         -1
28     3                 9         24     2                 4         6
259    9                 97        222    2                 46        60
Figure 19.7 Calculation of correlation coefficient using assumed mean
$r = \frac{N \sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{N \sum d_x^2 - (\sum d_x)^2} \sqrt{N \sum d_y^2 - (\sum d_y)^2}}$

$r = \frac{10 \times 60 - (9 \times 2)}{\sqrt{10 \times 97 - 9^2} \times \sqrt{10 \times 46 - 2^2}}$

$r = \frac{582}{636.697}$

r = 0.914
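The assumed-mean calculation can be sketched as follows. The X and Y series are the ten pairs read from Figure 19.7, and the assumed means 25 and 22 are the values implied by the deviations in that figure; the function name `pearson_r_assumed_mean` is a hypothetical helper.

```python
from math import sqrt


def pearson_r_assumed_mean(xs, ys, ax, ay):
    """Karl Pearson's r from deviations about assumed means ax and ay."""
    n = len(xs)
    dx = [x - ax for x in xs]
    dy = [y - ay for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dx2 = sum(d * d for d in dx)
    sum_dy2 = sum(d * d for d in dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = sqrt(n * sum_dx2 - sum_dx ** 2) * sqrt(n * sum_dy2 - sum_dy ** 2)
    return numerator / denominator


X = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]
Y = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]
r = pearson_r_assumed_mean(X, Y, ax=25, ay=22)
print(round(r, 3))
```

Any other choice of assumed means gives the same r, because the formula corrects for the difference between the assumed and actual means.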
Direct method of calculating correlation coefficient
The coefficient can also be calculated by taking actual X and Y values,
without taking deviations either from the actual or assumed mean. The
formula for its calculation is as follows.
$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2} \sqrt{N \sum Y^2 - (\sum Y)^2}}$

The direct method gives the same answer as one gets when deviations
are taken from the assumed or actual means. The example demonstrates
this point in Figure 19.8.

Figure 19.8 Calculation of correlation coefficient using direct method
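As a hedged illustration of the direct method, the sketch below works on the raw values of the ten age-at-marriage pairs used elsewhere in this unit (it is an assumption that Figure 19.8 uses the same data). Variable names are illustrative.

```python
import math

X = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]
Y = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 259 222 6797 4974 5808

# No deviations are taken: the raw totals go straight into the formula.
numerator = n * sum_xy - sum_x * sum_y       # 58080 - 57498 = 582
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r_direct = numerator / denominator
print(round(r_direct, 3))  # 0.914
```

The numerator, 582, is the same as in the assumed-mean calculation, which is why the two methods agree.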


Let us now complete Reflection and Action 19.2 and then learn in Section 19.4 the methods of calculating correlation of grouped data.

Reflection and Action 19.2
Select one of the following two calculations and carry it out in relation to your hypothesis. You need not worry about making mistakes in your calculations. At the moment the idea is to learn the procedure. This is not to be a part of your report.
i) Calculation of correlation coefficient using assumed mean
ii) Calculation of correlation coefficient using direct method

19.4 Method of Calculating Correlation of Grouped Data
With a large number of observations, the data are condensed into a two-way frequency distribution called a correlation table. The class intervals of the Y series are written as column headings and those of the X series as row headings. The frequency for each pair of classes is written in the respective cell. The formula for calculating the coefficient of correlation is:
$$r = \frac{\sum f d_x d_y - \frac{\sum f_x d_x \sum f_y d_y}{N}}{\sqrt{\sum f_x d_x^2 - \frac{(\sum f_x d_x)^2}{N}} \times \sqrt{\sum f_y d_y^2 - \frac{(\sum f_y d_y)^2}{N}}}$$

Steps:
i) Take the step deviations of variable X and denote these deviations by $d_x$.
ii) Take the step deviations of variable Y and denote these deviations by $d_y$.
iii) Multiply $d_x$ by $d_y$ and by the respective frequency for each cell, and write the figure obtained in the upper right-hand corner of the cell.
iv) Add together all these values to obtain $\sum f d_x d_y$.
v) Multiply the frequencies of variable X by the deviations of X and obtain the total $\sum f_x d_x$.
vi) Take the squares of the deviations of variable X and multiply them by the respective frequencies to obtain $\sum f_x d_x^2$.
vii) Multiply the frequencies of variable Y by the deviations of Y and obtain the total $\sum f_y d_y$.
viii) Take the squares of the deviations of variable Y and multiply them by the respective frequencies to obtain $\sum f_y d_y^2$.
ix) Substitute the values of $\sum f_x d_x^2$, $\sum f_x d_x$, $\sum f_y d_y^2$, $\sum f_y d_y$ and $\sum f d_x d_y$ in the above formula to get the value of r.
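The steps above can be sketched in Python for a small correlation table. The table, class mid-points, class widths and assumed origins below are hypothetical, chosen only for illustration; they are not the data of Figure 19.9, and the function name `grouped_pearson` is likewise an assumption.

```python
import math

# Hypothetical two-way frequency table (NOT the data of Figure 19.9).
x_mid = [5, 15, 25]    # mid-points of X classes (row headings)
y_mid = [10, 30, 50]   # mid-points of Y classes (column headings)
freq = [
    [4, 2, 0],
    [1, 5, 2],
    [0, 3, 4],
]

def grouped_pearson(x_mid, y_mid, freq, x_origin, y_origin, x_width, y_width):
    """Pearson's r for grouped data using step deviations (steps i to ix)."""
    dx = [(m - x_origin) / x_width for m in x_mid]   # step i
    dy = [(m - y_origin) / y_width for m in y_mid]   # step ii
    fx = [sum(row) for row in freq]                  # row totals
    fy = [sum(col) for col in zip(*freq)]            # column totals
    n = sum(fx)
    sum_fdxdy = sum(freq[i][j] * dx[i] * dy[j]       # steps iii and iv
                    for i in range(len(dx)) for j in range(len(dy)))
    sum_fdx = sum(f * d for f, d in zip(fx, dx))     # step v
    sum_fdx2 = sum(f * d * d for f, d in zip(fx, dx))  # step vi
    sum_fdy = sum(f * d for f, d in zip(fy, dy))     # step vii
    sum_fdy2 = sum(f * d * d for f, d in zip(fy, dy))  # step viii
    numerator = sum_fdxdy - sum_fdx * sum_fdy / n    # step ix
    denominator = (math.sqrt(sum_fdx2 - sum_fdx ** 2 / n)
                   * math.sqrt(sum_fdy2 - sum_fdy ** 2 / n))
    return numerator / denominator

r_grouped = grouped_pearson(x_mid, y_mid, freq, 15, 30, 10, 20)
print(round(r_grouped, 3))
```

Because r is unaffected by a change of origin or scale, the choice of assumed origin and class width only simplifies the arithmetic; it does not alter the result.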
Let us now take an example to calculate Karl Pearson's coefficient of correlation using the data in Figure 19.9.

Figure 19.9 Coefficient Correlation regarding Expenditure on Luxury Items

We can calculate the correlation coefficient in grouped data using the direct method, as seen in Figure 19.10.

Figure 19.10 Calculation of Correlation Coefficient in Grouped Data

Now we can proceed to calculate $f \times d_x \times d_y$ using the direct method, as given in Figure 19.11.

Figure 19.11 Calculation of Correlation Coefficient of Grouped Data


$$\sqrt{\sum f_x d_x^2 - \frac{(\sum f_x d_x)^2}{N}} \times \sqrt{\sum f_y d_y^2 - \frac{(\sum f_y d_y)^2}{N}}$$

Most variables show some kind of relationship. With the help of correlation one can measure the degree of relationship between two or more variables. Correlation, however, does not tell us anything about the cause and effect relationship. Even a high degree of relationship does not necessarily imply that a cause and effect relationship exists. Conversely, however, a cause and effect relationship (or functional relationship) would always result in the expression of correlation.
We will now discuss regression analysis.

19.5 Regression
Regression analysis is the method of estimating the values of a variable from the knowledge of one or more other variables. The variable that the researcher tries to estimate is called the dependent variable (denoted as Y), whereas the variable used for prediction is the independent variable (denoted as X). In a regression equation, there may be one or more independent variables, but there is only one dependent variable. Depending on whether there are one or more independent variables, the regression equation is called simple or multiple. The term 'linear' is added if the relationship between the dependent and the independent variable is linear. Thus a simple linear regression equation is represented as
$$Y = a + bX$$
Where, Y is the dependent variable
X is the independent variable
'a' is the regression constant
'b' is the regression coefficient. It measures the change in Y corresponding to a unit change in X.
Similarly, a multiple linear regression equation is represented as
$$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$
Where, Y is the dependent variable
$X_1, X_2, \dots, X_n$ are the independent variables
'a' is the regression constant
'$b_1, b_2, \dots, b_n$' are the respective regression coefficients.
Like the calculation of the coefficient of correlation, there are various methods of calculating the regression equation:
1. From actual mean values of X and Y.
2. From assumed mean values of X and Y.
Calculation of regression equation using actual mean
The regression equation (of Y on X) can be calculated using the following formula:
$$Y - M_y = b_{yx} (X - M_x)$$
or
$$Y - M_y = r \frac{\sigma_y}{\sigma_x} (X - M_x)$$
As $b_{yx} = r (\sigma_y / \sigma_x) = \sum xy / \sum x^2$, the regression equation may be calculated using the following formula.
$$Y - M_y = \frac{\sum xy}{\sum x^2} (X - M_x)$$
Where, Y and X are the dependent and independent variables respectively;
$M_y$ and $M_x$ are the means of the Y and X variables respectively; and
$y = Y - M_y$ and $x = X - M_x$
The following example illustrates the calculation of the regression
equation.
Example: Calculate the regression equation using the following data, taking age at marriage of husbands as independent variable and that of wives as dependent variable (see Figure 19.11)

Age at marriage   Case 1  Case 2  Case 3  Case 4  Case 5  Case 6  Case 7  Case 8  Case 9  Case 10
Husbands            28      25      24      29      31      22      21      25      26      28
Wives               22      23      21      25      26      20      19      21      21      24

Calculation of regression equation using actual mean (see Figure 19.12)

Age of wives   y = Y - My    y²      Age of husbands   x = X - Mx    x²       xy
Y                                    X
22             -0.2          0.04    28                 2.1           4.41    -0.42
23              0.8          0.64    25                -0.9           0.81    -0.72
21             -1.2          1.44    24                -1.9           3.61     2.28
25              2.8          7.84    29                 3.1           9.61     8.68
26              3.8         14.44    31                 5.1          26.01    19.38
20             -2.2          4.84    22                -3.9          15.21     8.58
19             -3.2         10.24    21                -4.9          24.01    15.68
21             -1.2          1.44    25                -0.9           0.81     1.08
21             -1.2          1.44    26                 0.1           0.01    -0.12
24              1.8          3.24    28                 2.1           4.41     3.78
222             0           45.60    259                0            88.90    58.20
Figure 19.12 Calculation of Regression Equation using Actual Mean
$M_y = 222 / 10 = 22.2$; $M_x = 259 / 10 = 25.9$
$$Y - M_y = \frac{\sum xy}{\sum x^2} (X - M_x)$$
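To see where this formula leads for the example data, a minimal Python sketch follows. It computes the regression coefficient as $\sum xy / \sum x^2$ and then rearranges the equation into the form $Y = a + bX$; the function name `regression_actual_mean` is illustrative only.

```python
X = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]  # husbands (independent)
Y = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]  # wives (dependent)

def regression_actual_mean(xs, ys):
    """Return (a, b) of Y = a + bX using deviations about the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    b = sum_xy / sum_x2          # regression coefficient b_yx
    a = mean_y - b * mean_x      # from Y - M_y = b_yx (X - M_x)
    return a, b

a, b = regression_actual_mean(X, Y)
print(round(b, 3))  # 0.655
print(round(a, 2))  # 5.24
```

With these values the equation reads Y = 5.24 + 0.655X, which is the same line the assumed-mean method yields for these data.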

Calculation of regression equation using assumed mean (see Figure 19.13)
The regression equation (of Y on X) can be calculated using the following formula, taking the assumed mean:
$$Y - M_y = b_{yx} (X - M_x)$$
Where,
$$b_{yx} = \frac{\sum d_x d_y - \frac{\sum d_x \sum d_y}{N}}{\sum d_x^2 - \frac{(\sum d_x)^2}{N}}$$
Y and X are the dependent and independent variables respectively;
$M_y$ and $M_x$ are the means of the Y and X variables respectively;
$d_y = Y - AM_y$ and $d_x = X - AM_x$; and
$AM_y$ and $AM_x$ are the assumed means of the Y and X variables respectively.
Calculation of regression equation using assumed mean
Age of wives   dy = Y - AMy   dy²    Age of husbands   dx = X - AMx   dx²    dx × dy
Y                                    X
22              0              0     28                 3              9       0
23              1              1     25                 0              0       0
21             -1              1     24                -1              1       1
25              3              9     29                 4             16      12
26              4             16     31                 6             36      24
20             -2              4     22                -3              9       6
19             -3              9     21                -4             16      12
21             -1              1     25                 0              0       0
21             -1              1     26                 1              1      -1
24              2              4     28                 3              9       6
222             2             46     259                9             97      60

Figure 19.13 Calculation of Regression Equation using Assumed Mean

$M_y = 222 / 10 = 22.2$; $M_x = 259 / 10 = 25.9$

$$b_{yx} = \frac{\sum d_x d_y - \frac{\sum d_x \sum d_y}{N}}{\sum d_x^2 - \frac{(\sum d_x)^2}{N}}$$
$$b_{yx} = \frac{60 - (9 \times 2) / 10}{97 - (9 \times 9) / 10}$$
$$b_{yx} = 58.2 / 88.9 = 0.655$$
$$Y - M_y = b_{yx} (X - M_x)$$
$$Y - 22.2 = 0.655 (X - 25.9)$$
$$Y - 22.2 = 0.655X - 16.96$$
$$Y = 5.24 + 0.655X$$
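The same calculation can be sketched in Python, which also makes it easy to try other assumed means. The assumed means 25 and 22 are read off the table; the function name `regression_assumed_mean` and the prediction for a husband aged 27 are illustrative additions.

```python
X = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]  # husbands (independent)
Y = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]  # wives (dependent)

def regression_assumed_mean(xs, ys, am_x, am_y):
    """Return (a, b) of Y = a + bX using deviations about assumed means."""
    n = len(xs)
    dx = [x - am_x for x in xs]
    dy = [y - am_y for y in ys]
    sum_dxdy = sum(p * q for p, q in zip(dx, dy))
    sum_dx2 = sum(p * p for p in dx)
    b = ((sum_dxdy - sum(dx) * sum(dy) / n)
         / (sum_dx2 - sum(dx) ** 2 / n))
    # The actual means are still needed to locate the line.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = mean_y - b * mean_x
    return a, b

a, b = regression_assumed_mean(X, Y, 25, 22)
print(round(b, 3))  # 0.655
print(round(a, 2))  # 5.24

# Estimated age at marriage of a wife whose husband married at 27.
predicted = a + b * 27
print(round(predicted, 2))
```

Changing the assumed means changes the intermediate totals but not a or b, which is why the method gives the same equation as the actual-mean method.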
Standard error of estimate: Perfect prediction using a regression equation is not possible (except when the correlation value is -1 or +1). Thus the researcher is interested in finding the accuracy of estimation of a regression equation. The standard error of estimate measures the error involved in using a regression equation as a basis of estimation. It can be calculated using the following equation:
$$SEE_{y.x} = \sqrt{\frac{\sum (Y - Y_c)^2}{N - 2}}$$
Where, $SEE_{y.x}$ is the standard error of estimate
Y is the dependent variable
$Y_c$ is the predicted value of Y
N is the number of observations
It can also be calculated from the following formula:
$$SEE_{y.x} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{N - 2}}$$
Where, $SEE_{y.x}$ is the standard error of estimate
Y is the dependent variable
X is the independent variable
'a' is the regression constant
'b' is the regression coefficient
N is the number of observations
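As a rough check that the two formulas agree, the sketch below applies both to the age-at-marriage example. It uses unrounded values of a and b; if the rounded values 5.24 and 0.655 were used instead, the two results could differ slightly. Variable names are illustrative.

```python
import math

X = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]  # husbands (independent)
Y = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]  # wives (dependent)
n = len(X)

# Least-squares line Y = a + bX, kept unrounded.
mean_x = sum(X) / n
mean_y = sum(Y) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
     / sum((x - mean_x) ** 2 for x in X))
a = mean_y - b * mean_x

# First formula: squared differences between observed and predicted Y.
predicted = [a + b * x for x in X]
see_residual = math.sqrt(
    sum((y - yc) ** 2 for y, yc in zip(Y, predicted)) / (n - 2))

# Second formula: raw totals only.
sum_y = sum(Y)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
see_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(round(see_residual, 3))  # 0.968
print(round(see_shortcut, 3))  # 0.968
```

A standard error of about 0.97 means that estimates of the wife's age from the regression line are typically off by roughly one year.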
Coefficient of determination: The coefficient of determination ($r^2$) is the square of the correlation coefficient (r) and is often used in interpreting the value of the coefficient of correlation. If the value of r were 0.8, then the coefficient of determination, $r^2$, would be 0.64. This would mean that 64% of the variance of one variable (dependent) is explained by the other variable (independent).
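The "variance explained" reading of $r^2$ can be illustrated with the age-at-marriage data from this unit. The sketch below computes $r^2$ as the square of r and, separately, as the share of the variation in Y that the regression line accounts for; it is an illustration under those definitions, and the variable names are assumptions.

```python
X = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]  # husbands (independent)
Y = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]  # wives (dependent)
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sum_x2 = sum((x - mean_x) ** 2 for x in X)
sum_y2 = sum((y - mean_y) ** 2 for y in Y)   # total variation in Y

# r squared, written directly in terms of the three totals.
r_squared = sum_xy ** 2 / (sum_x2 * sum_y2)

# Share of the variation in Y accounted for by the regression line.
b = sum_xy / sum_x2
a = mean_y - b * mean_x
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
explained_share = 1 - unexplained / sum_y2

print(round(r_squared, 3))        # 0.836
print(round(explained_share, 3))  # 0.836
```

Here r is about 0.914, so roughly 84% of the variance in wives' age at marriage is explained by husbands' age at marriage.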

Reflection and Action 19.3
I tried to understand how to make the calculation of regression equation using assumed mean. I could not succeed. Maybe you can explain it to me with an example. Write out on a separate sheet of paper your explanation with one or two examples. Maybe I will then follow it. You will need to send it to the co-ordinator of MSO 002.

Unit 19 is the last unit of Block 5 on Quantitative Methods. All five units of this block have emphasised that quantitative methods should be used in social research when they are necessary and relevant and can provide superior results. Sometimes you can use them in combination with the qualitative methods. You need not avoid the quantitative methods because of lack of information or apprehension that it is difficult to understand them. The five units of Block 5 have provided you appropriate examples wherever possible and necessary to help you understand the tools that are very useful in your research project assignment.

Further Reading
Burns, Robert B. 2000. Introduction to Research Methods. Sage
Publications: London
Cohen, Louis and Michael Holliday 1982. Statistics for Social Research.
Harper and Row: London
