Sampling Methods and Estimation of Sample Size: Known
Learning Objectives
It is expected that after reading Unit 15 you would be able to
• Define what sampling is
• Classify sampling methods
• Calculate sample size.
15.1 Introduction
Unit 15 deals with the procedure of sampling, which helps you arrive at a
subset of the universe of your research. It discusses the various methods
of sampling and tells you how to work out a sample size. You will again
read about sampling in Block 6. This is a subject you will need to master
carefully because, no matter what type of research you wish to carry out, you
will need to apply your skill in the craft of sampling.
15.2 Sampling
A sample is a subset of the population that represents the entire group.
When the population (or universe) is too large for the researcher to
survey all its members because of its cost, the number of personnel to
be employed, or the time constraint, a small carefully chosen sample is
extracted to represent the whole (see Figure 15.1). The sample, as
drawn in Figure 15.1, is expected to reflect the characteristics of the
population.
A well selected sample may provide superior results. For example, in a
[Figure 15.1: Parameters are measures describing population characteristics; statistics are measures describing sample characteristics. Statistics estimate the parameters.]
Reflection and Action 15.1
Work out the relationship between population, parameter, sample and statistics
to reflect the characteristics of the population of the unit of your research
project.
Figure 15.4 Stratified Random Sampling Method
d) Cluster random sampling is useful when the population is dispersed
across a wide geographic region. This method allows one to divide the
population into clusters and then select the clusters at random. Thereafter
one can either study all the members of the selected clusters or again
take random (simple or systematic) samples of these sampled clusters. If
the latter system is followed, it is called multi-stage sampling. This
method, for example, could be effective to study a tribal group or a
community that is dispersed. The villages could be used as clusters and
can be randomly selected. Figure 15.5 shows that five blocks (2, 7, 10
and 14) out of sixteen have been selected by random number. Each
block contains a series of samples, as illustrated.
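The cluster and multi-stage procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sixteen villages, the household labels and the function name `cluster_sample` are hypothetical and do not come from the text.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Pick clusters at random; optionally subsample inside each one.

    per_cluster=None -> study every member of each chosen cluster.
    per_cluster=k    -> multi-stage sampling: draw k members per chosen cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: random clusters
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample[name] = list(members)
        else:
            # stage 2: simple random sample within the chosen cluster
            sample[name] = rng.sample(members, min(per_cluster, len(members)))
    return sample

# Hypothetical population: 16 villages (clusters) of 30 households each.
villages = {
    f"village_{v}": [f"v{v}_h{h}" for h in range(1, 31)] for v in range(1, 17)
}

one_stage = cluster_sample(villages, n_clusters=4, seed=1)
two_stage = cluster_sample(villages, n_clusters=4, per_cluster=5, seed=1)
```

With the same seed both calls choose the same villages; the second call keeps only five households from each, which is the multi-stage variant.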
Reflection and Action 15.3
Suppose in your research project you wish to estimate sample size for calculating
mean and the assumption is that sampling is simple and the population sampled is
infinitely large. Further, you are in the stage of taking the three steps as elaborated
in Case One given in the text. The exercise for you is to work out in detail each
step and write it down in the fashion given just after Box 15.2.
15.5 Conclusion
Unit 15 discussed the important subject of sampling and provided you
with relevant information on different methods of sampling. Further, it
brought to you the skills of calculating the sample size.
You may like to keep in mind what Mitchell (1984: 239) said about
sampling theory in statistics that it "devotes itself to providing numerical
estimates of the likelihood that the population values be within some
defined range of that established from the sample - provided that the
sample has been chosen in such a way as to meet the mathematical
conditions to justify the computation of the probabilities concerned."
Further, he clarified another type of inference that is derived
while using quantitative data to support theoretical interpretation and
said, "The sophistication and elaboration for choosing a 'representative'
sample in this restricted sense has overshadowed the other kind of
inference involved when analytical statements are made from associations
uncovered in a statistical sample. This is the inference that the theoretical
relationship among conceptually defined elements in the sample will also
apply in the parent population. The basis of an inference of this sort is
the cogency of the theoretical argument linking the elements in an
intelligible way rather than the statistical representativeness of the
sample."
Further Reading
Burgess, R.G. (ed.) 1982. Field Research: A Sourcebook and Field Manual.
(Contemporary Social Research 4). George Allen and Unwin: London
(Read page 76 onward for discussions of random and non-random
sampling).
Denzin, N. K. (ed.) 1970. Sociological Methods: A Sourcebook.
Butterworths: London (Read page 81 onward for useful information on
sampling techniques).
Unit 16
Measures of Central Tendency
Contents
16.1 Introduction
16.2 Mean
16.3 Median
16.4 Mode
16.5 Relationship between Mean, Mode and Median
16.6 Choosing a Measure of Central Tendency
16.7 Conclusion
Learning Objectives
It is expected that after reading Unit 16 you would be able to
• Understand the procedure of arriving at measures of central
tendency of the data collected
• Work out the ways of finding out mean, mode and median
measures of central tendency
• Decide which of the three measures is more appropriate in the
case of your data.
16.1 Introduction
After dealing with the skills of sampling techniques for studying large
complex social groups, we would now discuss the matter of measuring
central tendency and its application.
Unit 16 deals with the basic measures of central tendency and their
application for those of you who may lack a strong background in
mathematics. In doing so, complex mathematical derivations of formulae
have been omitted. Apart from a minimal number of essential 'shorthand'
mathematical symbols, familiar examples drawn from social science
data are presented in a non-mathematical form.
16.2 Mean
Mean is the most common and widely used measure of central tendency.
Each observation in a population may be referred to as an $X_i$ (read "X sub
i") value. Thus, one observation might be denoted as $X_1$, another as $X_2$,
a third as $X_3$ and so on. The subscript i might be any integer value up
through N, the total number of $X_i$ values in the population. The mean of
the population is denoted by the Greek letter $\mu$ (lower case mu).
Calculating the mean from ungrouped data
Mean (M) is the most familiar and useful measure used to describe the
central tendency (average) of a distribution of scores for any group of
individuals, objects or events. It is computed by dividing the sum of the
scores by the total number of scores.
$M = \sum X_i / N$    (1)
Where, M is the mean (sample), $X_i$ are the scores, N is the total number
of scores and $\sum$ is 'the sum of'. See Box 16.1 and Box 16.2 for examples
1 and 2.
where, M is the mean, $X_i$ are the midpoints of class intervals, $F_i$ are the
number of cases in the various intervals, and $\sum F_i$ is the total number of scores or
sum of frequencies of the various intervals.
Box 16.2
Example 2: Following is the frequency (8, 9, 12, 9, 7, and 5) of households in a
community owning numbers of chickens, arranged in six groups (1-3, 4-6, 7-9,
10-12, 13-15 and 16-18).
Number of
Chickens
4-6 45
Box 16.3 Example 3: Marital Distance (the distance between the villages
of the spouse)
The marital distance was investigated in a community. Following was the frequency
(88, 93, 72, 97, 79, and 54) when the data were arranged in six groups according
to marital distance (25 - 30, 30 - 35, 35 - 40, 40 - 45, 45 - 50, 50 - 55). Let us find the
mean marital distance.
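As a cross-check on a hand calculation, the mean of the grouped marital distance data can be sketched in Python. The sketch assumes the usual grouped-data formula, mean = sum of (frequency × midpoint) divided by the sum of frequencies; the function name `grouped_mean` is illustrative only.

```python
def grouped_mean(intervals, frequencies):
    """Mean of grouped data: sum(F_i * X_i) / sum(F_i), X_i = class midpoint."""
    midpoints = [(low + high) / 2 for low, high in intervals]
    total = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / total

# Class intervals and frequencies as stated in Example 3.
intervals = [(25, 30), (30, 35), (35, 40), (40, 45), (45, 50), (50, 55)]
frequencies = [88, 93, 72, 97, 79, 54]

mean_distance = grouped_mean(intervals, frequencies)
print(round(mean_distance, 2))  # 39.03
```

Here the sum of frequencies is 483 and the sum of frequency × midpoint is 18852.5, which gives a mean marital distance of about 39.03.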
Step 2: Start at the low end of the frequency distribution and sum the scores
in each interval until the interval containing the median is reached (C.F.).
Step 3: Subtract the sum obtained in step 2 above from the number
necessary (calculated at step 1) to reach the median (N/2 - C.F.).
Step 4: Now calculate the proportion of the median interval that must be
added to its lower limit in order to reach the median score. This is done by
dividing the number obtained in step 3 above by the number of scores
(f) in the median interval and then multiplying by the size of the class
interval (i), i.e. $[(N/2 - C.F.) / f] \times i$.
Step 5: Finally, add the number obtained in step 4 above to the exact
lower limit of the median interval.
$\text{Median} = L + [(N/2 - C.F.) / f] \times i$
Where, L = the exact lower limit of the median interval; N = the total
number of scores; C.F. = the sum of the scores in the intervals below the
median interval; f = the number of scores in the median interval; i = the
size of the class interval.
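The steps above can be sketched in Python as follows. The lower limits and frequencies are hypothetical values chosen only to show the arithmetic, and Step 1 is taken to be the calculation of N/2, as the formula suggests.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median = L + [(N/2 - C.F.) / f] * i for equal-width class intervals."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequencies must contain at least one score")
    half = n / 2          # Step 1 (assumed): scores needed to reach the median
    cumulative = 0        # C.F.: scores in the intervals below the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:                            # Step 2: median interval
            return lower + ((half - cumulative) / f) * width  # Steps 3 to 5
        cumulative += f
    raise ValueError("median interval not found")

# Hypothetical grouped data: intervals 10-20, 20-30, ..., 50-60.
lower_limits = [10, 20, 30, 40, 50]
frequencies = [4, 6, 10, 7, 3]

median_value = grouped_median(lower_limits, frequencies, width=10)
print(median_value)  # 35.0
```

With N = 30, the median needs 15 scores; 10 lie below the interval 30-40, so the median is 30 + (5 / 10) × 10 = 35.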
Graphical representation of calculating the median from grouped data
[Figure: class intervals (10 to 100) shown against cumulative frequency, with the median marked.]
See Box 16.5 for example 6 for finding the median for grouped data.
$\text{Median} = L + [(N/2 - C.F.) / f] \times i$
Reflection and Action 16.1
Following the examples given in the text for calculating the mean and median for
ungrouped and grouped data and the short method of calculating mean of grouped
data, provide your own examples of each of the five calculations in the manner
similar to examples in the text. This exercise would provide you an opportunity
of practicing such calculations. These calculation exercises would come in handy
while you would carry out your own mini research project.
16.4 Mode
Mode of a distribution is simply defined as the most frequent or common
score in the distribution. Mode is the point (or value) of X that
corresponds to the highest point on the distribution. If the highest
frequency is shared by more than one value, the distribution is said to be
multimodal. It is not uncommon to see distributions that are bimodal,
reflecting peaks in scoring at two different points in the distribution.
Figure 16.1 Graphical Representation of Mode in Grouped Data
The sample mode is the best estimate of population mode. When one
samples a symmetrical unimodal population, mode is an unbiased and
consistent estimate of mean and median, but it is relatively inefficient
and should not be so used. As a measure of central tendency, mode is
affected by skewness less than is mean or median, but it is affected by
sampling more than these other two measures. Mode, but neither
median nor mean, may be used for data on nominal, as well as the
ordinal, interval, and ratio scales of measurement. Mode is not used
often in social or biological researches, although it is often interesting to
report the number of modes detected in a population, if there are more
than one. See Box 16.7 for example 8.
Box 16.7 Example 8: Find the Modal Income on the Basis of the Following
Data.
Modal class: 15 - 20, with the maximum frequency $f_m$ = 29
The mode lies in the class (15 - 20), which has the maximum frequency (29).
Lower limit of the modal class (L) = 15
Frequency of the modal class ($f_1$) = 29
Frequency of the class preceding the modal class ($f_0$) = 15
Frequency of the class succeeding the modal class ($f_2$) = 22
Size of the class interval (i) = 5
$\text{Mode} = L + [(f_1 - f_0) / (2 f_1 - f_0 - f_2)] \times i$
$\text{Mode} = 15 + [(29 - 15) / (2 \times 29 - 15 - 22)] \times 5 = 15 + (14 / 21) \times 5 = 15 + 3.33 = 18.33$
The modal income is 18.33 thousand.
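The grouped-data mode formula can be checked with a short Python sketch. It assumes the values that reproduce the stated intermediate step (14 / 21) and the stated result of 18.33: a modal class of 15 - 20 with lower limit 15, modal frequency 29, preceding frequency 15, succeeding frequency 22 and class width 5. The function name `grouped_mode` is illustrative.

```python
def grouped_mode(lower_limit, f_modal, f_prev, f_next, width):
    """Mode = L + [(f1 - f0) / (2*f1 - f0 - f2)] * i for grouped data."""
    return lower_limit + (f_modal - f_prev) / (2 * f_modal - f_prev - f_next) * width

# Values assumed from the worked example (see the lead-in above).
mode = grouped_mode(lower_limit=15, f_modal=29, f_prev=15, f_next=22, width=5)
print(round(mode, 2))  # 18.33
```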
After learning about mean, median and mode, we will discuss in Section
16.5 the relationship among the three measures of central tendency.
But before going on to Section 16.5, let us complete Reflection and
Action 16.2.
Reflection and Action 16.2
Make a graphical representation of mode in grouped data of your choice along
the lines of Figure 16.1. You may then use a similar type of graphic representation
of grouped data in your own mini research project.
There are occasions, however, when taking into account the value of
every score in a distribution can give a distorted picture of the data. For
example, marriage distance (the distance between the places of residence
of the two partners) in five cases is 40, 60, 60, 80 and 810. Without the
very atypical score of 810, the mean score of the group is 60 and the
median, likewise, is 60. The effect of introducing the score of 810 is to
pull the mean in the direction of that extreme value. The mean now
becomes 210, a value that is unrepresentative of the series. The median
remains 60, providing a more realistic description of the distribution
than the mean.
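The effect of the atypical score can be verified with Python's standard `statistics` module. This is a minimal sketch using the five marriage distances quoted above; the variable names are illustrative.

```python
from statistics import mean, median

typical = [40, 60, 60, 80]          # the four typical marriage distances
with_outlier = typical + [810]      # the same cases plus the atypical score

mean_typical = mean(typical)            # 60
median_typical = median(typical)        # 60
mean_all = mean(with_outlier)           # 210: pulled towards the extreme value
median_all = median(with_outlier)       # 60: unaffected by the extreme value
```

The mean moves from 60 to 210 when the score of 810 is added, while the median stays at 60.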
With these observations in mind:
v) When the interval level or ratio level data providing that the
distribution of scores approximates a normal curve.
Reflection and Action 16.3
Provide examples of data that require mean, median and mode type of calculations
for reflecting the central tendency of the data.
16.7 Conclusion
Succinctly, mode would be the appropriate statistic to use as a measure
of the 'most fashionable' or 'most popular' when data are collected
using a nominal scale. Median would generally be associated with the
ordinal level data. Mean will be used with interval level or ratio level data
providing that the distribution of scores approximates a normal curve.
You can take mean to be a mathematical measure and median and mode to
be the positional measures. You can always cluster your observations
around a central value. A central value manifests both the distribution
and the comparison of various distributions. It is always useful for a
researcher to provide measures that indicate the average feature of a
frequency distribution. Unit 16 has discussed the three measures of
central tendency and provided skills of basic statistical tools for application
in your research.
It would have become apparent to you that the three measures of the
central tendency, namely, i) average of all the values in the distribution
Further Reading
Black, Thomas R. 1999. Doing Quantitative Research in the Social
Sciences: An Integrated Approach to Research Design, Measurement and
Statistics.
Nachmias, David and Chava Nachmias 1981. Research Methods in the Social
Sciences. St. Martin's Press: New York.
Unit 17
Measures of Dispersion and Variability
Contents
17.1 Introduction
17.2 The Range
17.3 The Variance
17.4 The Standard Deviation
17.5 Coefficient of Variation
17.6 Conclusion
Learning Objectives
It is expected that after reading Unit 17 you would be able to
• Obtain a measure of dispersion of data
• Explain the meaning of the term 'range' and work out how to
measure the range of one's data
• Discuss the element of variance in one's data and find out the
standard deviation in it
• Work out the coefficient of variation in the data.
17.1 Introduction
In addition to a measure of central tendency, it is generally desirable to
have a measure of dispersion of data. A measure of dispersion (or a
measure of variability, as it is sometimes called) is an indication of the
clustering of measurements around the center of the distribution, or,
conversely, it is an indication of how variable the measurements are.
Sanders (1955) held that you need to measure dispersion to evaluate the
extent to which the average value depicts the data. Another reason for
measuring dispersion is to find out the spread in order to improve or
control the existing variations.
It is possible that the two samples may have the same range, but not the
mean deviation. Mean deviation can also be defined by using the sum of
the absolute deviations from the median rather than from the mean.
17.3 The Variance
Another method of eliminating the signs of deviations from the mean is
to square the deviations. The sum of the squares of deviations from the
mean is called the sum of squares, abbreviated SS, and is defined as
follows:
Sample SS = $\sum (X_i - M)^2$    (3)
Where, M is the mean (sample), $X_i$ are the scores, and $\sum$ is 'the sum of'.
From the sample SS, population SS can be estimated.
Population SS = $\sum (X_i - \mu)^2$    (4)
Where $\mu$ is the mean (population), $X_i$ are the scores, and $\sum$ is 'the sum of'.
The mean sum of squares is called variance (or mean square, the latter
being short for mean squared deviation), and for a population is denoted
by $\sigma^2$ ("sigma squared", using the lowercase Greek letter).
Calculating variance from ungrouped data
Population Variance = $\sigma^2 = \sum (X_i - \mu)^2 / N$    (5)
The best estimate of the population variance, $\sigma^2$, is the sample variance,
$s^2$:
Sample Variance = $s^2 = \sum (X_i - M)^2 / (n - 1)$    (6)
Where M is the mean (sample), $X_i$ are the scores, n is the total number
of scores (sample) and $\sum$ is 'the sum of'.
The replacement of $\mu$ by M and N by n in Equation 5 results in a
quantity which is a biased estimate of $\sigma^2$. Dividing the sample's sum of
squares by n - 1 (called the degrees of freedom, abbreviated DF) rather than
by n, yields an unbiased estimate, and Equation 6 should be used
to calculate the sample variance. If all observations are equal, then there
is no variability and $s^2 = 0$; and $s^2$ becomes increasingly large as the amount
of variability, or dispersion, increases. Since $s^2$ is a mean sum of squares,
it can never be a negative quantity.
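The difference between dividing the sum of squares by N and by n - 1 can be seen in a short Python sketch. The five scores are hypothetical, and the results are compared with `pvariance` and `variance` from Python's standard `statistics` module.

```python
from statistics import pvariance, variance

def sum_of_squares(scores):
    """SS: sum of squared deviations of the scores from their mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)

scores = [4, 8, 6, 5, 7]   # hypothetical scores, mean = 6

ss = sum_of_squares(scores)                      # 10.0
population_variance = ss / len(scores)           # SS / N       -> 2.0
sample_variance = ss / (len(scores) - 1)         # SS / (n - 1) -> 2.5
```

Treating the same five scores as a whole population gives 2.0, while treating them as a sample gives the larger, unbiased estimate 2.5.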
The variance expresses the same type of information as does the mean
deviation, but it has certain important properties relative to probability
and hypothesis testing that make it distinctly superior. Thus, the mean
deviation is very seldom encountered in social or bio-statistical analysis.
The variance has square units. If measurements are in grams, their variance
will be in grams squared, or if the measurements are in cubic centimeters,
their variance will be in terms of cubic centimeters squared, even though
such squared units have no physical interpretation.
The sample variance can be calculated using the following formula
Sample variance = $s^2 = (\sum X_i^2 - (\sum X_i)^2 / n) / (n - 1)$    (7)
The above formula is often called the machine formula, because of its
computational advantages. There are, in fact, two major advantages in
calculating SS by Equation 7 rather than by Equation 6. First, fewer
computational steps are involved, a fact that decreases the chance of
error. On a good desk calculator, the summed quantities, $\sum X_i$ and $\sum X_i^2$,
can both be obtained with only one pass through the data, whereas
Equation 6 requires one pass through the data to calculate M, and at
least one more pass to calculate and sum the squares of the deviations,
$X_i - M$. Second, there may be a good deal of rounding error in calculating
each $X_i - M$, a situation which leads to decreased accuracy in computation,
but which is avoided by the use of Equation 7. See Box 17.3 for example 3.
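The one-pass character of the machine formula (Equation 7) can be illustrated in Python. The scores are hypothetical and the function names are illustrative; the two-pass function follows Equation 6 for comparison.

```python
def machine_variance(scores):
    """Equation 7: one pass accumulating sum(X) and sum(X^2)."""
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in scores:            # single pass through the data
        n += 1
        sum_x += x
        sum_x2 += x * x
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

def two_pass_variance(scores):
    """Equation 6: first pass for M, second pass for squared deviations."""
    n = len(scores)
    m = sum(scores) / n
    return sum((x - m) ** 2 for x in scores) / (n - 1)

scores = [4, 8, 6, 5, 7]    # hypothetical scores
s2_machine = machine_variance(scores)     # (190 - 900/5) / 4 = 2.5
s2_two_pass = two_pass_variance(scores)   # 10 / 4 = 2.5
```

Both routes agree on small data like this. On a computer using floating-point numbers, the subtraction in Equation 7 can itself lose precision when the mean is very large relative to the spread, so comparing the two results is a reasonable check.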
Sample Variance = $\{(\sum f_i d_i^2) / n - (\sum f_i d_i / n)^2\} \times i$
Sample Variance = $\{(417 / 110) - (-27 / 110)^2\} \times 10 = (3.79 - 0.06) \times 10 = 37.3$
The variance in the grouped data can also be calculated using the following
equation (often called machine formula).
Sample variance ($s^2$) = $(\sum f_i X_i^2 - (\sum f_i X_i)^2 / n) / (n - 1)$    (10)
Where $f_i$ is the frequency of observations with magnitude $X_i$.
But with a desk calculator it is often faster to use Equation 7 for each
individual observation, disregarding the class groupings. See Box 17.5
for example 5.
Box 17.5 Example 5
An investigation in a community on the bride price yielded the following data.
Find the variance in bride price.
Bride Price (in Thousand Rs)   Frequency ($f_i$)   Mid-Point of the Interval ($X_i$)   $f_i X_i$   $f_i X_i^2$
10 - 20
20 - 30
30 - 40
40 - 50
50 - 60
$\sum f_i X_i^2 = 82650$, $\sum f_i X_i = 1880$, $(\sum f_i X_i)^2 = (1880)^2 = 3534400$
n = 50
Sample variance ($s^2$) = $(\sum f_i X_i^2 - (\sum f_i X_i)^2 / n) / (n - 1)$
Sample variance ($s^2$) = $(82650 - (3534400 / 50)) / 49 = (82650 - 70688) / 49 = 11962 / 49 = 244.12$
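The arithmetic of Example 5 can be re-traced in Python from the totals given in the box. Only the sums are used here, since the individual frequencies are not reproduced above; the function name is illustrative.

```python
def grouped_variance_from_sums(sum_fx2, sum_fx, n):
    """Equation 10: s^2 = (sum(f*X^2) - (sum(f*X))^2 / n) / (n - 1)."""
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Totals as stated in Box 17.5.
s2_bride_price = grouped_variance_from_sums(sum_fx2=82650, sum_fx=1880, n=50)
print(round(s2_bride_price, 2))  # 244.12
```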
Reflection and Action 17.1
Following the examples in the text, provide your own examples for calculating
variance from ungrouped and grouped data.
Learning Objectives
It is expected that after reading Unit 18 you would be able to
• Draw statistical inferences on the basis of the concept of probability
• Use the tool of statistical inference to test hypotheses
• Apply the tool of statistical inference for estimating the unknown
parameter of the population under research.
18.1 Introduction
Unit 18 deals with statistical inference, which uses the concepts of
probability to explain the element of uncertainty in decision-making.
You would find that though it occupies a lower status among statistical
tests, you would be able to use the chi-square test in a wide variety of
researches. If you have a relatively smaller sample, it would be better to
use Student's t test, which is a parametric test. You would learn in Unit 18
in detail about both the chi-square and Student's t tests. For hypothesis
testing, Unit 18 is going to prove to be most helpful in the mini research
project that you have to complete as a part of your assignment of MSO
002.
Reflection and Action 18.1
Let us say that you are carrying out a research that has both the null and
alternative hypotheses. You need now to set up a suitable significance level to
test the validity of null hypothesis as against alternative hypothesis. For this task
as well as subsequent tasks, follow the procedure given in the text. Next, you
would need to set up a test criterion. For this purpose select an appropriate
probability distribution that can be applied for the particular test. Then carry
out computation of various statistics and their standard errors. Now, based on
sample statistical conclusions, make decisions to reject or accept the null
hypothesis. This would depend on whether the computed value falls in the region
of acceptance or rejection. Work out the steps in concrete terms of your own
research project and incorporate them in your research work report.
18.3 Cases
Case 1: If the hypothesis is being tested at 5% level and the observed
result has a probability of less than 5%, then the difference between the
sample statistics and the population parameter is significant and cannot
be explained by chance alone. Thus the null hypothesis (or $H_0$) is rejected,
and in turn, the alternative hypothesis ($H_A$) is accepted.
Case 2: If the hypothesis is being tested at 5% level and the observed
result has a probability of more than 5%, then the difference between
the sample statistics and the population parameter is not significant and
can be explained by chance variation. Thus the null hypothesis (or $H_0$) is
accepted, and in turn the alternative hypothesis ($H_A$) is rejected.
In hypothesis testing it is important to understand the following:
i) One-tailed and two-tailed test of hypothesis; and
ii) Type I and Type II errors
i) One-tailed and two-tailed test of hypothesis
Depending on the research problem, the null and alternate hypotheses
are defined in such a way that the test is known as one-tailed or two-tailed.
A two-tailed test of hypothesis will reject the null hypothesis if
the sample statistic is significantly higher or lower than the population
parameter. Thus, in a two-tailed test of hypothesis the rejection region is
located on both the tails and the size of the rejection region is .025 on each tail,
whereas the central acceptance region is .95 (Figure 18.2). If the sample
mean falls within $\mu \pm 1.96$ SD (i.e. in the acceptance region), the
hypothesis is accepted. If on the other hand, it falls beyond $\mu \pm 1.96$ SD,
then the hypothesis is rejected, as it will fall in the rejection region.
Let us take an example of the two-tailed hypothesis. Suppose a researcher
is interested in knowing whether there is gender difference in IQ. You can
formulate the following hypotheses.
IQ of Females = IQ of Males (Null hypothesis)
IQ of Females ≠ IQ of Males (Alternative hypothesis) or in other words, IQ
of females may be lower or higher than that of males.
Figure 18.2 One-Tailed and Two-tailed Test of Hypothesis. (A) and (B) are
One-Tailed, whereas (C) is Two-tailed.
In contrast to the two-tailed hypothesis, in one-tailed hypothesis the
rejection region will be located only on one tail (see Figure 18.2). In this
case, the size of the rejection region will be .05, if one is testing the
hypothesis at 5% probability level. If the sample mean falls above $\mu + 1.645$ SD
(see Case A of Figure 18.2) or below $\mu - 1.645$ SD (see Case B of Figure
18.2), then the hypothesis is rejected, as it will fall in the rejection
region.
Let us take an example of the one-tailed hypothesis. Suppose a researcher
is interested in knowing whether the IQ of females is higher than that of
males. In this case, you can formulate the following hypotheses.
IQ of Females = IQ of Males (Null hypothesis)
IQ of Females > IQ of Males (Alternative hypothesis)
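The cut-off values 1.96 (two-tailed) and 1.645 (one-tailed) at the 5% level can be recovered from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper `reject_null` and its `tail` labels are illustrative names, not part of the text.

```python
from statistics import NormalDist

alpha = 0.05                      # 5% probability level
standard_normal = NormalDist()    # mean 0, SD 1

# Two-tailed: .025 in each tail; one-tailed: .05 in a single tail.
two_tailed_cutoff = standard_normal.inv_cdf(1 - alpha / 2)   # about 1.96
one_tailed_cutoff = standard_normal.inv_cdf(1 - alpha)       # about 1.645

def reject_null(z, tail="two"):
    """Return True if a standardised sample mean z falls in the rejection region."""
    if tail == "two":
        return abs(z) > two_tailed_cutoff
    if tail == "upper":
        return z > one_tailed_cutoff
    if tail == "lower":
        return z < -one_tailed_cutoff
    raise ValueError("tail must be 'two', 'upper' or 'lower'")
```

For instance, a standardised value of 1.8 lies in the acceptance region of the two-tailed test but in the rejection region of the upper one-tailed test.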
ii) Type I and Type II errors
A researcher's decision is correct when a true hypothesis is accepted
and a false hypothesis is rejected.
                  Accept $H_0$        Reject $H_0$
$H_0$ is true     Correct decision    Type I error
$H_0$ is false    Type II error       Correct decision
18.4 Tests of Significance
i) Chi-square test ($\chi^2$)
Mode of Transport
Solution:
Step 1: Null hypothesis: There is no significant difference in the choice of the
type of transportation.
Alternative hypothesis: There is significant difference in choice of type of
transportation.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: The expected frequency (20) in each category is based on the assumption
that there is an equal choice of the type of transportation.
Step 4: Calculations:
$\chi^2 = \sum ((O - E)^2 / E)$
$\chi^2 = ((18 - 20)^2 / 20) + ((21 - 20)^2 / 20) + ((19 - 20)^2 / 20) + ((20 - 20)^2 / 20) + ((22 - 20)^2 / 20)$
$\chi^2 = 4/20 + 1/20 + 1/20 + 0 + 4/20 = 10/20 = 0.5$
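The calculation in Step 4 can be repeated in Python. The observed frequencies are the five values used in the calculation above and the expected frequency is 20 in each category; the function name is illustrative, and the degrees of freedom are taken as the number of categories minus one, which is the usual rule for this kind of test.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 21, 19, 20, 22]          # choices of type of transportation
expected = [20] * len(observed)          # equal choice across categories

chi2_transport = chi_square(observed, expected)   # 0.5
degrees_of_freedom = len(observed) - 1            # 4
```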
Populations      Attribute                                Total
                 Category 1   Category 2   Category 3
Population 1     A            B            C             $N_1$
Population 2     D            E            F             $N_2$
Total            $N_3$        $N_4$        $N_5$         N
Box 18.2 Example to Examine if the Bhils and Minas Differ in their Income
Populations      Income groups
                 High     Middle     Low
                 28       41         65
Solution:
Step 1: Null hypothesis: There is no significant difference in income
between Bhils and Minas.
Alternative hypothesis: There is significant difference in income between
Bhils and Minas.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: The expected frequencies are as given below.
Populations      High                   Middle                 Low
                 Observed   Expected    Observed   Expected    Observed   Expected
$\chi^2 = 491727150.1 / 106440097 = 4.620$
Step 4: Degree of freedom = [(No. of rows - 1) × (No. of columns - 1)] =
(2 - 1) × (2 - 1) = 1
Step 5: The table value of chi-square at 5% probability level for 1 degree
of freedom is 3.841. The calculated value of $\chi^2$ (4.620) is higher than
the table value of $\chi^2$ (3.841).
So you can say as a conclusion that the null hypothesis is rejected and
the sex difference between skilled and unskilled laborers is significant.
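The procedure for a table of two populations by several categories (expected frequencies from row and column totals, the chi-square sum, degrees of freedom, and comparison with the table value) can be sketched in Python. The 2 × 2 counts below are hypothetical and are not the data of the example above; only the table value 3.841 for 1 degree of freedom is taken from the text.

```python
def chi_square_contingency(table):
    """Chi-square for a rows x columns table of observed counts.

    Expected count of a cell = (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical counts: two populations (rows) by two categories (columns).
counts = [[30, 20],
          [20, 30]]

chi2_value, df = chi_square_contingency(counts)   # 4.0 with 1 degree of freedom
table_value = 3.841                                # 5% level, 1 degree of freedom
null_rejected = chi2_value > table_value           # True for these counts
```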
iv) Student's t test (t)
Student's t test is a parametric test most suitable for a small sample. It
is probably the most widely used statistical test and certainly the most
widely known. It is simple, straightforward, easy to use, and adaptable
to a broad range of situations. No statistical toolbox should ever be
without it. "Student" (real name: W. S. Gosset) developed the statistical
methods to solve problems stemming from his employment in a brewery.
Like chi-square, the following steps may be followed for the use of the
Student's t test:
Step 1: Define null and alternative hypotheses.
Step 2: Decide probability level.
Step 3: Calculate the value of t using appropriate formula.
Step 4: Determine the degree of freedom.
Step 5: Compare the calculated value of t with the tabulated value of t.
Accept or reject the null hypothesis.
Student's t test is applied in different conditions, such as
a) To test the significance of the mean of a random sample
b) To test the difference between the means of the two independent
samples
c) To test the difference between the means of the two dependent
samples
d) To test the significance of the correlation coefficient.
Let us discuss each of the above conditions.
a) To test the significance of the mean of a random sample: This test
is used when the researcher is interested in examining whether the
mean of a sample from the normal population deviates significantly
from the hypothetical population mean. The following formula is used
for its calculation:
$t = \{(M - \mu) \times \sqrt{n}\} / S$
When using actual mean:
$S = \sqrt{\sum (X - M)^2 / (n - 1)}$
When using assumed mean:
$S = \sqrt{\{\sum d^2 - (\bar{d})^2 \times n\} / (n - 1)}$
Where M and $\mu$ are the means of the sample and population respectively;
n is the sample size;
S is the standard deviation of the sample;
d = X - A, X being the variable;
$\bar{d}$ is the mean of the deviations; and
A is the assumed mean. Let us take an example, in Box 18.4, of testing
the mean nutritional intake.
Box 18.4 Example to Test the Mean Nutritional Intake in the Population
with 2000 Calories.
Nutritional Intake (Calories)
2300 2000 2150 1950 2000 2150 1900 1900 2250 2050
Solution:
Step 1: Null hypothesis: The mean nutritional intake in the population,
from which the sample is drawn, is 2000 Calories.
Alternative hypothesis: The mean nutritional intake in the population,
from which the sample is drawn, is not 2000 Calories.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
Nutritional Intake (Calories) (X)    d = X - A    $d^2$
2300      300     90000
2000        0         0
2150      150     22500
1950      -50      2500
2000        0         0
2150      150     22500
1900     -100     10000
1900     -100     10000
2250      250     62500
2050       50      2500
Total 20650      650    222500
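The one-sample t value for the nutritional intake data can be computed in Python using the formula with the actual mean. The ten intakes are those in the table above and the hypothetical population mean is 2000 Calories; the variable names are illustrative.

```python
from math import sqrt

intakes = [2300, 2000, 2150, 1950, 2000, 2150, 1900, 1900, 2250, 2050]
mu = 2000                                   # hypothetical population mean

n = len(intakes)
m = sum(intakes) / n                        # sample mean M = 2065.0
s = sqrt(sum((x - m) ** 2 for x in intakes) / (n - 1))   # S, actual-mean formula
t_value = (m - mu) * sqrt(n) / s            # t = (M - mu) * sqrt(n) / S
df = n - 1                                  # degrees of freedom = 9

print(round(s, 2), round(t_value, 2))       # 141.52 1.45
```

The calculated t is then compared with the table value of t at the 5% level for 9 degrees of freedom.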
Solution:
Step 1: Null hypothesis: Santhals and Murias do not differ in their marital
distance.
Alternative hypothesis: Santhals and Murias differ in their marital distance.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
Santhals                                   Murias
$X_1$   $d_1 = X_1 - A_1$   $d_1^2$    $X_2$   $d_2 = X_2 - A_2$   $d_2^2$    $A_1$   $A_2$
10      -6      36      22       2       4      16      20
12      -4      16      19      -1       1      16      20
15      -1       1      21       1       1      16      20
17       1       1      23       3       9      16      20
18       2       4      18      -2       4      16      20
17       1       1      21       1       1      16      20
19       3       9      23       3       9      16      20
22       6      36      20       0       0      16      20
22       6      36      19      -1       1      16      20
12      -4      16      21       1       1      16      20
Total 164    4     156     207       7      31
$A_1 = 16$, $A_2 = 20$, $M_1 = 16.4$, $M_2 = 20.7$
$n_1 = 10$, $n_2 = 10$, $\sum d_1^2 = 156$, $\sum d_2^2 = 31$
$S = \sqrt{\{\sum d_1^2 + \sum d_2^2 - n_1 (M_1 - A_1)^2 - n_2 (M_2 - A_2)^2\} / (n_1 + n_2 - 2)}$
$S = \sqrt{\{156 + 31 - 10 (16.4 - 16)^2 - 10 (20.7 - 20)^2\} / (10 + 10 - 2)}$
$S = \sqrt{180.5 / 18} = \sqrt{10.028} = 3.167$
$t = [(M_1 - M_2) \times \sqrt{(n_1 \times n_2) / (n_1 + n_2)}] / S$
$t = [(16.4 - 20.7) \times \sqrt{100 / 20}] / 3.167 = (-4.3 \times 2.236) / 3.167 = -3.036$ (absolute value 3.036)
Step 4: Degree of freedom = 10 + 10 - 2 = 18
Step 5: The table value of t at 5% probability level for 18 degrees of
freedom is 2.101. The calculated value of t (3.036) is higher than the
table value of t (2.101).
You can say that the null hypothesis is rejected and the difference in
marital distance between Santhals and Murias is significant.
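The two-sample calculation can be reproduced in Python directly from the marital distances in the table above. The sketch computes the common standard deviation from deviations about each sample's own mean, which is algebraically the same as the assumed-mean formula; the function name `independent_t` is illustrative.

```python
from math import sqrt

santhals = [10, 12, 15, 17, 18, 17, 19, 22, 22, 12]
murias = [22, 19, 21, 23, 18, 21, 23, 20, 19, 21]

def independent_t(x1, x2):
    """t for two independent samples with a common (pooled) standard deviation."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((x - m1) ** 2 for x in x1)      # sum of squares, sample 1
    ss2 = sum((x - m2) ** 2 for x in x2)      # sum of squares, sample 2
    s = sqrt((ss1 + ss2) / (n1 + n2 - 2))     # common standard deviation S
    t = (m1 - m2) * sqrt(n1 * n2 / (n1 + n2)) / s
    return t, s, n1 + n2 - 2

t_value, s_common, df = independent_t(santhals, murias)
print(round(s_common, 3), round(t_value, 3), df)   # 3.167 -3.036 18
```

The sign of t only reflects which sample mean is larger; its absolute value, 3.036, is compared with the table value 2.101 for 18 degrees of freedom.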
c) To test the difference between the means of the two dependent
samples: This test is used when the researcher is interested in examining
whether the means of two dependent samples differ significantly from
each other. The following formula is used for its calculation:
$t = (\bar{d} \times \sqrt{n}) / S$
$S = \sqrt{\sum (d - \bar{d})^2 / (n - 1)}$ or
$S = \sqrt{(\sum d^2 - (\bar{d})^2 \times n) / (n - 1)}$
Where, $d = X_1 - X_2$;
$\bar{d}$ is the mean of the deviations;
n is the sample size (the number of paired observations); and
S is the common standard deviation. We would take an example in Box
18.6 to find out differences in observations of two researchers.
Observer 1: 2400 1950 2200 1800 2050 2250 2000 1950 2300 2000
Solution:
Step 1: Null hypothesis: The difference in the observation by the two observers is not significant.
Alternative hypothesis: The difference in the observation by the two observers is significant.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
$$S = \sqrt{\frac{\sum d^2 - (\bar{d})^2\, n}{n - 1}}$$
$$S = \sqrt{\frac{67500 - (25)^2 \times 10}{9}}$$
$$S = 82.496$$
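As a rough check, the standard deviation above and the t value that follows from the formula for dependent samples can be computed from the summary figures alone. This is a sketch, assuming the figures used in the calculation (sum of squared differences 67500, mean difference 25, and 10 pairs); the function name is chosen here for illustration.

```python
import math

def paired_t_from_summary(sum_d_sq, mean_d, n):
    """Paired-sample t from the sum of squared differences, the mean
    difference and the number of pairs."""
    s = math.sqrt((sum_d_sq - n * mean_d ** 2) / (n - 1))
    t = mean_d * math.sqrt(n) / s
    return s, t

s_paired, t_paired = paired_t_from_summary(67500, 25, 10)
print(round(s_paired, 3), round(t_paired, 3))
```

The resulting t is then compared with the table value for n - 1 = 9 degrees of freedom at the chosen probability level.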
Box 18.7 Example: Using the Following Data to Test the Significance of the Correlation
$r = 0.45$, $n = 102$
Step 1: Null hypothesis: The coefficient of correlation is not significant.
Alternative hypothesis: The coefficient of correlation is significant.
Step 2: Probability level for the hypothesis testing is 5%.
Step 3: Calculations:
$$t = \frac{r\,\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
$$t = \frac{0.45 \times \sqrt{100}}{\sqrt{1 - 0.45 \times 0.45}} = \frac{0.45 \times 10}{\sqrt{1 - 0.2025}} = \frac{4.5}{\sqrt{0.7975}} = \frac{4.5}{0.893} = 5.039$$
Step 4: Degrees of freedom = 102 - 2 = 100
Step 5: The table value of t at the 5% probability level for 100 degrees of freedom is about 1.98. The calculated value of t (5.039) is higher than the table value of t.
Thus, you would find that the null hypothesis is rejected and the correlation is significant.
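The same test can be written as a small Python function. This is a minimal sketch of the formula in Box 18.7; the function name `correlation_t` is chosen here for illustration.

```python
import math

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing whether a
    correlation coefficient r from n pairs differs from zero."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

t_corr, df_corr = correlation_t(0.45, 102)
print(round(t_corr, 3), df_corr)
```

Notice that for a fixed r the value of t grows with the sample size, so a moderate correlation such as 0.45 is highly significant with 102 pairs but might not be with a very small sample.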
Reflection and Action 18.3
Of the four tests in Section 18.4, select one test and carry it out with respect to your own research work. Write it out in detail in your research work report.
18.5 Conclusion
Unit 18 has provided you with a range of ways to draw inferences. A good number of examples are given for you to try, and you can prepare your own examples as well. Working through as many examples as possible will help you master the skills of testing hypotheses and estimating unknown parameters of the population. You need to keep in mind that no matter what design you use to test a hypothesis, you will reach only approximations in terms of probability. The testing of a hypothesis prepares the ground for generating further hypotheses, and in this manner scientific knowledge progresses. Initial approximations put the original hypothesis on a firm basis, and from this you can deduce other hypotheses. If you are able to establish links between propositions, you will have generated scientific knowledge.
Further Reading
Handel, J. D. 1978. Statistics for Sociology. Englewood Cliffs, N.J.
Watson, G. and D. McGaw 1980. Statistical Inquiry: Elementary Statistics for the Political, Social and Policy Sciences. John Wiley: New York
Unit 19
Correlation and Regression
Contents
19.1 Introduction
19.2 Correlation
19.3 Method of Calculating Correlation of Ungrouped Data
19.4 Method of Calculating Correlation of Grouped Data
19.5 Regression
19.6 Conclusion
Learning Objectives
It is expected that after reading Unit 19 you would be able to
*:* Appreciate the relevance of the analysis of co-variation between two or more variables
*:* Describe different types of correlation
*:* Elaborate methods of calculating correlation of both ungrouped and grouped data
*:* Understand the method of regression analysis that helps in estimating the values of a variable from the knowledge of one or more variables.
19.1 Introduction
In the concluding section of Unit 18, we mentioned the linkages between propositions. Let us now discuss the subject of correlation and regression.
Unit 19 is about correlation, that is, an analysis of co-variation between two or more variables. You will notice that the statistical tool of correlation helps to measure and express the quantitative relationship between two variables. Unit 19 elaborates the ways of applying the tool. It shows the relevance of the coefficient of correlation, the coefficient of determination and regression analysis in the social sciences. Further, it explains regression analysis, which is the method of estimating the values of a variable from the knowledge of one or more variables. The unit encourages you to use the statistical tool of correlation without fear or apprehension that its application is difficult and complex.
19.2 Correlation
Correlation is an analysis of the co-variation between two or more variables. When the relationship between the two variables is quantitative, the statistical tool for measuring the relationship and expressing it in a brief formula is known as correlation. If a change in one variable results in a corresponding change in the other, the two variables are correlated.
Let us look at types of correlation.
Types of correlation
We consider two classifications of correlation:
A) Positive and negative correlation;
B) Linear and non-linear correlation
A) Positive and negative correlation
If the values of the two variables deviate in the same direction, i.e., if an increase in the value of one results, on an average, in a corresponding increase in the value of the other, or if a decrease in the value of one variable results in a decrease in the value of the other, then the correlation is said to be positive or direct. Some examples of positive correlation are (i) height and weight, (ii) land owned and household income. On the other hand, if the variables deviate in opposite directions, i.e., if an increase (decrease) in the value of one variable, on an average, results in a decrease (increase) in the value of the other variable, then the correlation is negative or indirect. Some examples of negative correlation are (i) physical assets and the level of poverty, (ii) muscle strength and age. Figure 19.1 shows the positive and negative types of correlation.
[Figure 19.1 Positive and negative correlation]
In this case, the data in Figure 19.3 can be represented by the relation Y = 1 + 2X. In general, two variables are said to be linearly related if there exists a relationship of the form Y = a + bX.
On the other hand, the relationship between the two variables is said to be non-linear or curvilinear if, corresponding to a unit change in one variable, the other variable changes not at a constant rate but at a fluctuating rate. An example of a non-linear correlation is given by the data set in Figure 19.4.
$$r = \frac{10 \times 60 - (9 \times 2)}{\sqrt{10 \times 97 - (9)^2} \times \sqrt{10 \times 46 - (2)^2}}$$
$$r = \frac{582}{636.697}$$
$$r = 0.914$$
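The substitution above can be reproduced in a few lines of Python. This is a sketch only: the reading of the figures as N = 10, sum of dx·dy = 60, sum of dx = 9, sum of dy = 2, sum of dx squared = 97 and sum of dy squared = 46 is inferred from the calculation itself, and the function name is chosen here for illustration.

```python
import math

def r_from_deviation_sums(n, sum_dxdy, sum_dx, sum_dy, sum_dx_sq, sum_dy_sq):
    """Correlation coefficient from sums of deviations taken from assumed means."""
    numerator = n * sum_dxdy - sum_dx * sum_dy
    denominator = (math.sqrt(n * sum_dx_sq - sum_dx ** 2)
                   * math.sqrt(n * sum_dy_sq - sum_dy ** 2))
    return numerator / denominator

r_assumed = r_from_deviation_sums(10, 60, 9, 2, 97, 46)
print(round(r_assumed, 3))
```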
Direct method of calculating correlation coefficient
The coefficient can also be calculated by taking actual X and Y values, without taking deviations either from the actual or assumed mean. The formula for its calculation is as follows.
$$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{N \sum X^2 - (\sum X)^2} \times \sqrt{N \sum Y^2 - (\sum Y)^2}}$$
The direct method gives the same answer as one gets when deviations are taken from the assumed or actual means. The example demonstrates this point in Figure 19.8.
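To see that the two methods agree, you can run both on the same data. The sketch below uses the age-at-marriage data for ten couples given in Section 19.5 of this unit, with husbands' age as X and wives' age as Y. The assumed means 25 and 22 are an assumption made here; they happen to reproduce the sums 9 and 2 used in the assumed-mean calculation. The function names are illustrative.

```python
import math

# Age at marriage for ten couples (data from Section 19.5)
husbands = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]
wives = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]

def r_direct(x, y):
    """Direct method: uses the actual X and Y values."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x_sq = sum(a * a for a in x)
    sum_y_sq = sum(b * b for b in y)
    return (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_x_sq - sum_x ** 2)
        * math.sqrt(n * sum_y_sq - sum_y ** 2)
    )

def r_assumed_mean(x, y, ax, ay):
    """Assumed-mean method: the same formula applied to deviations."""
    dx = [a - ax for a in x]
    dy = [b - ay for b in y]
    return r_direct(dx, dy)

r1 = r_direct(husbands, wives)
r2 = r_assumed_mean(husbands, wives, 25, 22)
print(round(r1, 3), round(r2, 3))
```

Subtracting a constant from every value shifts the data without changing how the two variables move together, which is why the choice of assumed mean does not affect r.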
Reflection and Action 19.2
Select one of the following two calculations and carry it out in relation to your hypothesis. You need not worry about making mistakes in your calculations. At the moment the idea is to learn the procedure. This is not to be a part of your report.
i) Calculation of correlation coefficient using assumed mean
ii) Calculation of correlation coefficient using direct method
Steps:
i) Take the step deviations of variable X and denote these deviations by $d_x$.
ii) Take the step deviations of variable Y and denote these deviations by $d_y$.
iii) Multiply $d_x$ by $d_y$ and by the respective frequency for each cell, and write the figure obtained in the right-hand upper corner of the cell.
iv) Add together all these values to obtain $\sum f\, d_x d_y$.
v) Multiply all the frequencies of the variable X by the deviations of X and obtain the total $\sum f_x d_x$.
vi) Take the squares of the deviations of the variable X and multiply by the respective frequencies to obtain $\sum f_x d_x^2$.
vii) Multiply all the frequencies of the variable Y by the deviations of Y and obtain the total $\sum f_y d_y$.
viii) Take the squares of the deviations of the variable Y and multiply by the respective frequencies to obtain $\sum f_y d_y^2$.
ix) Substitute the values for $\sum f\, d_x d_y$, $\sum f_x d_x$, $\sum f_x d_x^2$, $\sum f_y d_y$ and $\sum f_y d_y^2$ in the above formula to get the value of r.
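The steps above can be sketched in Python for a small two-way frequency table. The table below is hypothetical and is used only to show the procedure; it is not the data of any figure in this unit. The formula coded here is the usual step-deviation form, r = (N Σf dx dy - Σfx dx · Σfy dy) / (√(N Σfx dx² - (Σfx dx)²) × √(N Σfy dy² - (Σfy dy)²)), which is assumed to be the one the steps refer to.

```python
import math

def grouped_correlation(freq, dx, dy):
    """Correlation for grouped data.
    freq[i][j] is the frequency of the cell in Y-class i and X-class j;
    dx and dy are the step deviations of the X and Y classes."""
    n = sum(sum(row) for row in freq)
    fy = [sum(row) for row in freq]        # row totals (frequencies of Y)
    fx = [sum(col) for col in zip(*freq)]  # column totals (frequencies of X)
    # Steps iii and iv: cell-wise f * dx * dy, then the grand total
    sum_f_dxdy = sum(freq[i][j] * dx[j] * dy[i]
                     for i in range(len(dy)) for j in range(len(dx)))
    # Steps v to viii: totals for X and for Y
    sum_fx_dx = sum(f * d for f, d in zip(fx, dx))
    sum_fx_dx_sq = sum(f * d * d for f, d in zip(fx, dx))
    sum_fy_dy = sum(f * d for f, d in zip(fy, dy))
    sum_fy_dy_sq = sum(f * d * d for f, d in zip(fy, dy))
    # Step ix: substitute in the formula
    numerator = n * sum_f_dxdy - sum_fx_dx * sum_fy_dy
    denominator = (math.sqrt(n * sum_fx_dx_sq - sum_fx_dx ** 2)
                   * math.sqrt(n * sum_fy_dy_sq - sum_fy_dy ** 2))
    return numerator / denominator

# Hypothetical 3 x 3 frequency table (rows: Y classes, columns: X classes)
freq = [[4, 2, 0],
        [1, 5, 1],
        [0, 2, 5]]
dx = [-1, 0, 1]  # step deviations of the X classes
dy = [-1, 0, 1]  # step deviations of the Y classes
r_grouped = grouped_correlation(freq, dx, dy)
print(round(r_grouped, 3))
```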
Let us now take an example to calculate Karl Pearson's coefficient of correlation using the data in Figure 19.9.
Most of the variables show some kind of relationship. With the help of
correlation one can measure the degree of relationship between two or
more variables. Correlation, however, does not tell us anything about
the cause and effect relationship. Even a high degree of relationship
does not necessarily imply that a cause and effect relationship exists.
Conversely, however, a cause and effect relationship (or functional relationship) would always result in the expression of correlation.
We will now discuss regression analysis.
19.5 Regression
Regression analysis is the method of estimating the values of a variable from the knowledge of one or more variables. The variable that the researcher tries to estimate is called the dependent variable (denoted as Y), whereas the variable used for prediction is the independent variable (denoted as X). In a regression equation, there may be one or more independent variables, but there is only one dependent variable. Depending on whether there are one or more independent variables, the regression equation is called simple or multiple. The term 'linear' is added if the relationship between the dependent and the independent variable is linear. Thus a simple linear regression equation is represented as
Y = a + bX
Age at Marriage
Case        1    2    3    4    5    6    7    8    9   10
Husbands   28   25   24   29   31   22   21   25   26   28
Wives      22   23   21   25   26   20   19   21   21   24
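As an illustration, a simple linear regression of the form Y = a + bX can be fitted to this table by the method of least squares. The sketch below assumes that wives' age is the dependent variable Y and husbands' age the independent variable X; the unit does not say which way round the regression is taken here, so treat this choice, and the function name, as assumptions.

```python
husbands = [28, 25, 24, 29, 31, 22, 21, 25, 26, 28]  # X (assumed independent)
wives = [22, 23, 21, 25, 26, 20, 19, 21, 21, 24]     # Y (assumed dependent)

def fit_simple_linear(x, y):
    """Least-squares estimates of a (intercept) and b (slope) in Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_simple_linear(husbands, wives)
print(f"Y = {a:.3f} + {b:.3f} X")

# Estimated age of wife for a husband married at 27
print(round(a + b * 27, 1))
```

The slope b tells you by how much the estimated value of Y changes for a unit change in X; the fitted line always passes through the point given by the two means.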
Reflection and Action 19.3
I tried to understand how to make the calculation of regression equation using assumed mean. I could not succeed. Maybe you can explain it to me with an example. Write out on a separate sheet of paper your explanation with one or two examples. Maybe I will then follow it. You will need to send it to the co-ordinator of MSO 002.
19.6 Conclusion
Unit 19 is the last unit of Block 5 on Quantitative Methods. All five units of this block have emphasised that quantitative methods should be used in social research when they are necessary and relevant and can provide superior results. Sometimes you can use them in combination with the qualitative methods. You need not avoid the quantitative methods because of lack of information or apprehension that it is difficult to understand them. The five units of Block 5 have provided you with appropriate examples wherever possible and necessary to help you understand the tools that are very useful in your research project assignment.
Further Reading
Burns, Robert B. 2000. Introduction to Research Methods. Sage
Publications: London
Cohen, Louis and Michael Holliday 1982. Statistics for Social Research.
Harper and Row: London