New Sample Mathode 1
19. Sufficient Statistics: Selected Contributions, Vasant S. Huzurbazar (edited by Anant M. Kshirsagar)
20. Handbook of Statistical Distributions, Jagdish K. Patel, C. H. Kapadia, and D. B. Owen
21. Case Studies in Sample Design, A. C. Rosander
22. Pocket Book of Statistical Tables, compiled by R. E. Odeh, D. B. Owen, Z. W. Birnbaum, and L. Fisher
23. The Information in Contingency Tables, D. V. Gokhale and Solomon Kullback
24. Statistical Analysis of Reliability and Life-Testing Models: Theory and Methods, Lee J. Bain
25. Elementary Statistical Quality Control, Irving W. Burr
26. An Introduction to Probability and Statistics Using BASIC, Richard A. Groeneveld
27. Basic Applied Statistics, B. L. Raktoe and J. J. Hubert
28. A Primer in Probability, Kathleen Subrahmaniam
29. Random Processes: A First Look, R. Syski
30. Regression Methods: A Tool for Data Analysis, Rudolf J. Freund and Paul D. Minton
31. Randomization Tests, Eugene S. Edgington
32. Tables for Normal Tolerance Limits, Sampling Plans and Screening, Robert E. Odeh and D. B. Owen
33. Statistical Computing, William J. Kennedy, Jr., and James E. Gentle
34. Regression Analysis and Its Application: A Data-Oriented Approach, Richard F. Gunst and Robert L. Mason
35. Scientific Strategies to Save Your Life, I. D. J. Bross
36. Statistics in the Pharmaceutical Industry, edited by C. Ralph Buncher and JiaYeong Tsay
37. Sampling from a Finite Population, J. Hajek
38. Statistical Modeling Techniques, S. S. Shapiro and A. J. Gross
39. Statistical Theory and Inference in Research, T. A. Bancroft and C.-P. Han
40. Handbook of the Normal Distribution, Jagdish K. Patel and Campbell B. Read
41. Recent Advances in Regression Methods, Hrishikesh D. Vinod and Aman Ullah
42. Acceptance Sampling in Quality Control, Edward G. Schilling
43. The Randomized Clinical Trial and Therapeutic Decisions, edited by Niels Tygstrup, John M Lachin, and
Erik Juhl
44. Regression Analysis of Survival Data in Cancer Chemotherapy, Walter H. Carter, Jr., Galen L. Wampler,
and Donald M. Stablein
45. A Course in Linear Models, Anant M. Kshirsagar
46. Clinical Trials: Issues and Approaches, edited by Stanley H. Shapiro and Thomas H. Louis
47. Statistical Analysis of DNA Sequence Data, edited by B. S. Weir
48. Nonlinear Regression Modeling: A Unified Practical Approach, David A. Ratkowsky
49. Attribute Sampling Plans, Tables of Tests and Confidence Limits for Proportions, Robert E. Odeh and D. B.
Owen
50. Experimental Design, Statistical Models, and Genetic Statistics, edited by Klaus Hinkelmann
51. Statistical Methods for Cancer Studies, edited by Richard G. Cornell
52. Practical Statistical Sampling for Auditors, Arthur J. Wilburn
53. Statistical Signal Processing, edited by Edward J. Wegman and James G. Smith
Start of Citation[PU]M. Dekker[/PU][DP]1996[/DP]End of Citation
54. Self-Organizing Methods in Modeling: GMDH Type Algorithms, edited by Stanley J. Farlow
55. Applied Factorial and Fractional Designs, Robert A. McLean and Virgil L. Anderson
56. Design of Experiments: Ranking and Selection, edited by Thomas J. Santner and Ajit C. Tamhane
57. Statistical Methods for Engineers and Scientists: Second Edition, Revised and Expanded, Robert M. Bethea,
Benjamin S. Duran, and Thomas L. Boullion
58. Ensemble Modeling: Inference from Small-Scale Properties to Large-Scale Systems, Alan E. Gelfand and
Crayton C. Walker
59. Computer Modeling for Business and Industry, Bruce L. Bowerman and Richard T. O'Connell
60. Bayesian Analysis of Linear Models, Lyle D. Broemeling
61. Methodological Issues for Health Care Surveys, Brenda Cox and Steven Cohen
62. Applied Regression Analysis and Experimental Design, Richard J. Brook and Gregory C. Arnold
63. Statpal: A Statistical Package for Microcomputers-PC-DOS Version for the IBM PC and Compatibles, Bruce
J. Chalmer and David G. Whitmore
64. Statpal: A Statistical Package for Microcomputers-Apple Version for the II, II+, and IIe, David G. Whitmore
and Bruce J. Chalmer
65. Nonparametric Statistical Inference: Second Edition, Revised and Expanded, Jean Dickinson Gibbons
66. Design and Analysis of Experiments, Roger G. Petersen
67. Statistical Methods for Pharmaceutical Research Planning, Sten W. Bergman and John C. Gittins
68. Goodness-of-Fit Techniques, edited by Ralph B. D'Agostino and Michael A. Stephens
69. Statistical Methods in Discrimination Litigation, edited by D. H. Kaye and Mikel Aickin
70. Truncated and Censored Samples from Normal Populations, Helmut Schneider
71. Robust Inference, M. L. Tiku, W. Y. Tan, and N. Balakrishnan
72. Statistical Image Processing and Graphics, edited by Edward J. Wegman and Douglas J. DePriest
73. Assignment Methods in Combinatorial Data Analysis, Lawrence J. Hubert
74. Econometrics and Structural Change, Lyle D. Broemeling and Hiroki Tsurumi
75. Multivariate Interpretation of Clinical Laboratory Data, Adelin Albert and Eugene K. Harris
76. Statistical Tools for Simulation Practitioners, Jack P. C. Kleijnen
77. Randomization Tests: Second Edition, Eugene S. Edgington
78. A Folio of Distributions: A Collection of Theoretical Quantile-Quantile Plots, Edward B. Fowlkes
79. Applied Categorical Data Analysis, Daniel H. Freeman, Jr.
80. Seemingly Unrelated Regression Equations Models: Estimation and Inference, Virendra K. Srivastava and
David E. A. Giles
81. Response Surfaces: Designs and Analyses, Andre I. Khuri and John A. Cornell
82. Nonlinear Parameter Estimation: An Integrated System in BASIC, John C. Nash and Mary Walker-Smith
83. Cancer Modeling, edited by James R. Thompson and Barry W. Brown
84. Mixture Models: Inference and Applications to Clustering, Geoffrey J. McLachlan and Kaye E. Basford
85. Randomized Response: Theory and Techniques, Arijit Chaudhuri and Rahul Mukerjee
86. Biopharmaceutical Statistics for Drug Development, edited by Karl E. Peace
87. Parts per Million Values for Estimating Quality Levels, Robert E. Odeh and D. B. Owen
88. Lognormal Distributions: Theory and Applications, edited by Edwin L. Crow and Kunio Shimizu
89. Properties of Estimators for the Gamma Distribution, K. O. Bowman and L. R. Shenton
90. Spline Smoothing and Nonparametric Regression, Randall L. Eubank
91. Linear Least Squares Computations, R. W. Farebrother
92. Exploring Statistics, Damaraju Raghavarao
93. Applied Time Series Analysis for Business and Economic Forecasting, Sufi M. Nazem
94. Bayesian Analysis of Time Series and Dynamic Models, edited by James C. Spall
95. The Inverse Gaussian Distribution: Theory, Methodology, and Applications, Raj S. Chhikara and J. Leroy
Folks
96. Parameter Estimation in Reliability and Life Span Models, A. Clifford Cohen and Betty Jones Whitten
97. Pooled Cross-Sectional and Time Series Data Analysis, Terry E. Dielman
98. Random Processes: A First Look, Second Edition, Revised and Expanded, R. Syski
99. Generalized Poisson Distributions: Properties and Applications, P. C. Consul
100. Nonlinear Lp-Norm Estimation, Rene Gonin and Arthur H. Money
101. Model Discrimination for Nonlinear Regression Models, Dale S. Borowiak
102. Applied Regression Analysis in Econometrics, Howard E. Doran
103. Continued Fractions in Statistical Applications, K. O. Bowman and L. R. Shenton
104. Statistical Methodology in the Pharmaceutical Sciences, Donald A. Berry
105. Experimental Design in Biotechnology, Perry D. Haaland
106. Statistical Issues in Drug Research and Development, edited by Karl E. Peace
107. Handbook of Nonlinear Regression Models, David A. Ratkowsky
108. Robust Regression: Analysis and Applications, edited by Kenneth D. Lawrence and Jeffrey L. Arthur
109. Statistical Design and Analysis of Industrial Experiments, edited by Subir Ghosh
110. U-Statistics: Theory and Practice, A. J. Lee
111. A Primer in Probability: Second Edition, Revised and Expanded, Kathleen Subrahmaniam
112. Data Quality Control: Theory and Pragmatics, edited by Gunar E. Liepins and V. R. R. Uppuluri
113. Engineering Quality by Design: Interpreting the Taguchi Approach, Thomas B. Barker
114. Survivorship Analysis for Clinical Studies, Eugene K. Harris and Adelin Albert
115. Statistical Analysis of Reliability and Life-Testing Models: Second Edition, Lee J. Bain and Max
Engelhardt
116. Stochastic Models of Carcinogenesis, Wai-Yuan Tan
117. Statistics and Society: Data Collection and Interpretation: Second Edition, Revised and Expanded, Walter
T. Federer
118. Handbook of Sequential Analysis, B. K. Ghosh and P. K. Sen
119. Truncated and Censored Samples: Theory and Applications, A. Clifford Cohen
120. Survey Sampling Principles, E. K. Foreman
Ranjan K. Som
Consultant
New York, New York
To
KANIKA, SUJATO, and BISHAKH
FOREWORD
A sample survey properly conducted, with an optimal survey design, can supply information of sufficient
accuracy useful for planning and policy purposes with great speed and low cost. It is especially useful in
countries where official statistical systems are not well developed and enough resources are not available for a
complete enumeration. It is also claimed that estimates based on sample data collected by well-trained
investigators under proper supervision, though subject to sampling errors, are more accurate than those based on
complete enumeration, which would be subject to non-sampling errors usually of a larger magnitude than the
sampling errors. It is with these ideas that the methodology of sample surveys was developed during the forties.
A major contribution to survey methodology was made by P.C. Mahalanobis and his colleagues at the Indian
Statistical Institute in designing and instituting a multipurpose National Sample Survey to collect data
periodically, at the national level, on the socio-economic-demographic characteristics of the people. In this
program, Dr. R.K. Som played an important role as a chief executive in charge of designing and conducting
large-scale demographic surveys and analyzing data. Later he became the Chief of the Population Program
Center of the United Nations Economic Commission for Africa and then a Special Technical Adviser on
Population at the United Nations Headquarters: he thus acquired considerable international experience in
demographic data collection and analysis.
Practical Sampling Techniques is the result of his rich and varied experience in conducting large scale sample
surveys combined with his expertise in statistical theory.
A sample survey involves a variety of operations, starting with a statement of objectives, identification of the
target universe, construction of sampling frame, choice of sampling plan, preparation of schedules, and training
of investigators and ending up with the collection and analysis of data, and reporting on survey results. Dr. Som's
manual covers all these aspects with immense detail and lively illustrations. It is, indeed, a valuable and
indispensable guide to the practitioners of sample surveys.
PREFACE
Using only simple mathematics and statistics, the book aims to present the principles of scientific sampling and
to show in a practical manner how to design a sample survey and analyze the resulting data. It seeks a via media
between, on the one hand, monographs on sampling and special chapters in textbooks on statistics that set forth
the basic ideas of sampling, and, on the other hand, advanced sampling texts with detailed proofs and theory. It is
more in the nature of a manual or a handbook which attempts to systematize ideas and practices and could also
be used as a textbook for a one- or two-semester course. For an understanding of the book, the basic requirement
is a knowledge of elementary statistics (including the use of multiple subscripts and summation and product
notations) and algebra, but not of calculus.
The approach is essentially practical and the book is primarily aimed at those who are not sampling specialists
but are interested in the subject, namely undergraduate, graduate and continuing education students of statistics
and professional and research workers from other fields such as sociology, economics, agriculture, forestry,
public administration, public health, family planning, demography, psychology, and public opinion and
marketing research. For them, the book presents the basic knowledge required to allow, at a minimum,
intelligent consultation with professional sampling statisticians, and, at best, the design, implementation, and
analysis of sampling inquiries when such expert services are not easily available. The latter applies particularly
in quite a few developing countries where many professionals or researchers must work in their own subject
fields without the ready benefit of consultation with sampling experts. It is hoped that survey practitioners would
find the book useful and would know when to seek expert advice, e.g., for complex topics, such as varying
probability sampling without replacement, that are not covered here in detail.
For both types of readers, viz., students and practicing professionals, the book tries to be as complete as possible
to permit self-study. The requirements of students have been considered by including proofs of some
ACKNOWLEDGMENTS
My grateful thanks are due to Dr. I.M. Chakravarti, Dr. R.G. Laha, and Dr. J. Roy for permission to quote six
examples/exercises from their book, Handbook of Methods of Applied Statistics, Vol. II, Planning of Surveys and
Experiments (Wiley, 1967); to John Wiley & Sons Inc., New York for permission to quote five
examples/exercises from Sampling Techniques by W.G. Cochran (1977); an extract from the article, "Boundaries
of statistical inference," by W. Edwards Deming in New Developments in Survey Sampling (1969), edited by
Norman L. Johnson and Harry Smith, Jr.; and two exercises from Sampling of Populations: Methods and
Applications by Paul S. Levy and Stanley Lemeshow (1991); to Dover Publications, New York and the Deming
Institute for permission to quote an extract from W. Edwards Deming's Some Theory of Sampling (Wiley, 1950;
Dover, 1966); to John Wiley & Sons Inc., New York, and the Deming Institute for permission to quote an extract
from W. Edwards Deming's Sample Design in Business Research (1960); to the Government of Ethiopia for
permission to use the data of Example 15.1; to the American Statistical Association to quote an extract and Fig.
I.1 from its publication, What Is a Survey? by R. Ferber, P. Sheatsley, A. Turner, and J. Waksberg; to Longman
& Green on behalf of the Literary Executor of the late R.A. Fisher and Frank Yates for permission to reprint two
tables from their book, Statistical Tables for the Biological, Agricultural and Medical Research (6th ed., 1973);
to the Food and Agriculture Organization of the United Nations for permission to reproduce (in an adapted form)
a figure from the book, Quality of Statistical Data, by S.S. Zarkovich; to the Academic Press for permission to
quote an exercise from Probability in Science and Engineering (1967) by J. Hajek and V. Dupac; to the
Scarecrow Press for permission to quote an example and a table from The Mathematical Theory of Sampling
(1956) by W.A. Hendricks; to the German Statistical Office for permission to quote an exercise and an
example from the article, "Organization and functioning of the Micro-Census in the Federal Republic of
Germany," by L. Herbergger in Population Data and Use of Computers with Special Reference to Population
Research (1971), German Foundation for Developing
CONTENTS
Foreword
Preface
Acknowledgments
Guide to Readers
List of Abbreviations
Chapter 1. Basic Concepts of Sampling
Introduction - Complete enumeration or sample - Advantages of sampling -
Limitations of sampling - Relationship between a complete enumeration and a
sample - Probability versus purposive samples - Terms and definitions -
Sampling frames - Types of probability sampling - Universe parameters -
Sample estimators - Criteria of estimators - Mean square error - Principal steps
in a sample survey - Symbols and notations - Further reading - Exercises
Part I: Single-stage Sampling
Chapter 2. Simple Random Sampling
Introduction - Characteristics of the universe - Simple random sampling -
Sampling without replacement - Sampling with replacement - Important results -
Fundamental theorems - Standard errors of estimators and setting of confidence
limits - Estimation of totals, means, ratios and their variances - Use of random
sample numbers - Examples - Estimation for sub-universes - Estimation of
proportion of units - Method of random groups for estimating variances and
covariances - Further reading - Exercises
Topics which may be considered too advanced or too detailed may be deferred
at the first reading. Some advanced topics are mentioned below:
Chapters 1-2 and 4-9; sections 10.1-10.3, 12.1-12.3.6, 12.4, 12.6;
Chapters 13-14; sections 15.1-15.5; sections 17.1-17.2.3; Chapters
18-19; sections 20.1-20.2.9; Chapters 22-23; sections 24.1-24.5;
Chapters 25 and 27; Appendices I, III and IV.
in srswr to obtain the standard error of the mean. In using these software
programs or the scientific calculator, note the definition of the functions
used. The table below shows the functions in four spreadsheet software
programs and Radio Shack scientific calculators:
Table columns: Function/Command; Lotus 1-2-3 (v5/WIN); Quattro Pro (v5/DOS or WIN); Excel (v5/WIN); As-easy-As (v5/DOS); Radio Shack scientific calculator
Sampling techniques
Computer-related terminologies
Journals
Organizations
Programs
1.1 Introduction
2. Major TV networks rely on surveys to tell them how many and what
types of people are watching their programs.
3. Auto manufacturers use surveys to find out how satisfied people are
9. Magazines and trade journals utilize surveys to find out what their
subscribers are reading.
10. Surveys are used to ascertain what sort of people use our national
parks and other recreation facilities.
World-wide, an indication of the scope of sampling in different countries
may be obtained from the following partial listing from the latest
(1982) issue of the United Nations report, Sample Surveys of Current Interest,
relating to surveys conducted during 1977-79: a survey of crime
victimization in Hong Kong; surveys of industries in Ethiopia, India, Indonesia,
and the U.S.A.; family/household budget/income and expenditure
surveys in Belgium, Brunei, Canada, Chile, Hungary, Indonesia, Malawi,
Norway, Sri Lanka, Sweden, the U.S.A., and Zimbabwe; health surveys in
Australia, France, Netherlands, Thailand, and the U.S.A.; fisheries surveys
in Indonesia, Sri Lanka, Tuvalu, the U.S.A., and Zimbabwe;
population/demographic/fertility and family planning surveys in Cyprus, Egypt,
Indonesia, Iran, Morocco, Nepal, Pakistan, Panama, Papua and New Guinea,
Peru, Philippines, Singapore, Spain, Sri Lanka, Thailand, Trinidad and Tobago,
the U.S.A., Yugoslavia, and Zimbabwe.
Leaving aside the role of sampling in personal lives and the design of
experimentation under controlled laboratory or field conditions, there arises in
every country a demand for statistical information in two ways: first, when
the data obtained from routine administrative and other sources have to
be analyzed speedily, and second, and more frequently, when accurate data
are not available from conventional sources. Consider, for example, the
problem of how to conduct an inquiry into family budgets by employing
field interviewers. Three alternatives seem possible for collecting data:
In reality, especially for large parts of the world for even a limited number
of inquiries, and for all the countries for an inquiry into every facet
of community life, an adequate number of competent staff are simply not
available. In practice, therefore, to enumerate fully all the units and still
make use of the best available staff, the enumeration has to be spread over
a long period. But the situation may change over time: we might then be
measuring different things.
In general, therefore, only two alternatives remain, namely, a complete
enumeration of all the units by less efficient enumerators and a sample of
units with more efficient staff. The advantages of sampling (see Section
1.3) are such that the latter method is usually chosen, although errors can
arise as a result of generalizing about the whole from the surveyed part.
There are situations where a complete enumeration would be essential,
and if situations do not permit this, efforts should be made to establish it
as soon as possible. Examples are the counting of population for census
purposes, income tax returns and a voters’ list.
Conversely, sampling is the only recourse in destructive tests such as
testing of the life of electric bulbs, the effectiveness of explosives, and
hematological testing.
In practice, sampling is often used to evaluate or augment the census
process (see section 1.5 below). Post-enumeration sample surveys are one
means of assessing the coverage and content errors of population censuses,
almost universally used in all countries of the world; for the evaluation,
particularly through post-enumeration field surveys, of the 1990 population
censuses in France, see Coeffic (1993), and in the U.S., see Hogan (1993),
and of the 1991 population census of China, see Zongming and Weimin
(1993).
sample returns, and individual comparisons made with the special sample
interviews taken by the survey enumerators soon after the date of the
census enumeration, revealed significant net differences in the labor force
participation rates and the number unemployed between the census and
the sample. The sample indicated about 2½ million more persons in the
labor force than the census and half a million more unemployed. These
differences appeared to occur because the marginal classifications were more
adequately identified in the sample than in the census, mainly because of
the type of enumerators and their training and supervision; the census
enumerators, with their necessarily limited training on labor force problems,
were apparently less effective in identifying marginal labor force groups
(Hansen and Pritzker, 1956). Further studies made at the U.S. Bureau of
the Census have shown that because of the response biases of a substantial
order in difficult items such as occupation, industry, work status, income,
and education, the amount lost by collecting such items for a sample of 25
per cent instead of a complete census was far less than one might assume,
even for very small areas. These considerations, plus those of economies
and timeliness of results, led to the adoption, for the first time, in the 1960
census of a 25 per cent sample of households as the basic procedure for such
information (Hansen and Tepping, 1969).
To be useful, a sample must be scientifically designed and conducted.
A classic example of a biased sample is the ill-fated Literary Digest poll of
the 1936 U.S. presidential election, where a mammoth sample of ten million
individuals was taken. But the sample was obtained from automobile
registration lists, telephone directories and similar sources and only 20 per
cent of the mail ballots were returned; both these factors resulted in a final
sample biased heavily in favor of more literate and affluent individuals, with
political affiliations generally different from the rest: the poll erred by the
huge margin of 19 per cent in predicting votes for Franklin D. Roosevelt,
when he won by a 20 per cent majority. The failure of the 1948 polls for the
U.S. presidential election, in which Harry S. Truman was elected, is mainly
explained by changes in the preferences of the people polled between the
time they were interviewed and the time they voted, but could be partly
attributed to the quota system of sampling that was deficient in organized
labor. In the 1970 British General Election also, pollsters performed very
poorly when they predicted the return of the Labour Party.
The universe (also called population, but we shall not use this term in
order to avoid a possible confusion with the population of human beings) is
the collection of all the units of a specified type defined over a given space
and time. The universe may be divided into a number of sub-universes.
Thus in a country-wide inquiry into family budgets, all the families in the
country, defined in a certain manner, would constitute the universe. The
sub-universe might consist of the totality of families living in rural or urban
areas or of families with incomes below and above a certain level.
The universe may be finite, that is, it may consist of a finite number of
elements. Almost all sample surveys deal with finite universes, but there
are some advantages in considering a sample as having been obtained from
an infinite universe. This is so when the universe is very large in relation
to the sample size, or when the sampling plan is such that a finite universe
is turned into an infinite one: the latter procedure is ensured by sampling
with replacement that will be explained later.
For drawing a sample from the universe, a frame of all the units in the
universe with proper identification should be available. This frame may
consist either of a list of the units or of a map of areas, in case a sample of
areas is to be selected. The frame should be accurate, free from omissions
and duplications, adequate and up-to-date, and the units should be identified
without ambiguity. Supplementary information available for the field
covered by the frame may also be of value in improving the sample design.
Two obvious examples of frames are lists of households (and persons)
enumerated in a population census, and a map of areas of a country showing
boundaries of area units.
The first, however attractive, suffers from two limitations, namely of
possible under- or over-enumeration of households (and population) in censuses
and of changes in the population due to births, deaths, and migration,
unless the sample is conducted simultaneously with the census. There are
remedial methods for these difficulties. One, for example, consists of selecting
a sample of area units, such as villages and urban blocks, from the
frame available from a population census and listing anew the households
residing in these sample areas at the time of survey-enumeration: these
households may then be completely enumerated with regard to the study
variables or a sub-sample of households might be drawn for the inquiry.
If, from the universe, a sample is taken in such a manner that the sample
selected has the same a priori chance of selection as any other sample of the
same size, it is known as simple random sampling. This is the most direct
method of sampling but has most often to be modified in order to make
better approximations to the unknown universe values, to obviate the
lack of accurate and up-to-date frames, or to simplify the selection procedure.
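The definition above can be illustrated with a minimal Python sketch. The frame of twenty serial numbers and the names `simple_random_sample` and `number_of_possible_samples` are hypothetical, introduced here only for illustration; the sketch assumes selection without replacement, so that each of the possible sets of n distinct units has the same chance of being the sample drawn.

```python
import math
import random


def simple_random_sample(frame, n, rng=None):
    """Draw n distinct units so that every set of n units is equally likely.

    frame : list of identifiable universe units (hypothetical example frame)
    rng   : optional random.Random instance, used only for reproducibility
    """
    rng = rng or random.Random()
    # random.Random.sample selects without replacement
    return rng.sample(frame, n)


def number_of_possible_samples(universe_size, n):
    """Number of distinct samples of size n (ignoring order of selection)."""
    return math.comb(universe_size, n)


frame = list(range(1, 21))                      # serial numbers 1..20
sample = simple_random_sample(frame, 5, random.Random(0))
print(sample)
print(number_of_possible_samples(len(frame), 5))
```

Each of the possible samples counted by the second function has the same a priori chance of being the one drawn.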
One variant is systematic sampling. In this procedure, if we are to draw,
say, a five per cent sample from a universe, a random number is first chosen
between 1 and (1/0.05 =) 20; suppose that the selected random number is
12. Then the 12th, (12 + 20 =) 32nd, (32 + 20 =) 52nd, etc., units would
constitute the systematic sample. This method is often used because of its
simplicity.
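The systematic selection just described can be sketched in Python as follows. The function names and the universe of 100 units are assumptions made for illustration; the sketch also assumes that the reciprocal of the sampling fraction is a whole number, as it is for a five per cent sample.

```python
import random


def serials_from_start(start, interval, universe_size):
    """Serial numbers (1-based) selected for a given random start."""
    return list(range(start, universe_size + 1, interval))


def systematic_sample(units, fraction, rng=None):
    """Select a systematic sample with a random start.

    units    : list of universe units in frame order
    fraction : sampling fraction, e.g. 0.05 for a five per cent sample
    rng      : optional random.Random instance (for reproducibility only)
    """
    rng = rng or random.Random()
    interval = round(1 / fraction)        # 20 for a five per cent sample
    start = rng.randint(1, interval)      # random start between 1 and 20
    serials = serials_from_start(start, interval, len(units))
    return start, [units[s - 1] for s in serials]


# The example in the text: random start 12, interval 20, first 100 units
print(serials_from_start(12, 20, 100))    # [12, 32, 52, 72, 92]
```

Only one random number is needed, which is why the method is simple to apply in the field.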
Another method is to use the available information on an ancillary variable
in drawing the sample units. If the sampling frame does not contain
any information on the universe units except their identification, there will
be no reason to prefer one unit to another for inclusion in the sample, i.e.,
all the units would have the same chance of selection; and this is what hap
and other aspects of the survey, e.g., choice and training of enumerators,
tabulation plans, etc.
Let the finite universe comprise $N$ units, and let us denote by $Y_i$ the value of
a study variable for the $i$th unit ($i = 1, 2, \ldots, N$). For example, in a family
budget inquiry, where the unit is the family, one study variable may be the
total family size, when $Y_i$ would denote the size of the $i$th family; another
study variable may be the total family income in a specified reference period
(e.g., the thirty days preceding the date of inquiry), when $Y_i$ would denote
the value of the income of the $i$th family; and so on.
A universe parameter (or simply a parameter) is a function of the frequency
values of the study variable. Some important parameters are described
below.
In general, we would be interested in the universe total and the universe
mean of the values of the study variable. The total of the values $Y_i$ is
denoted by $Y$,
$$Y_1 + Y_2 + \cdots + Y_N = \sum_{i=1}^{N} Y_i = Y \qquad (1.1)$$
and the universe mean is the universe total divided by the number of universe
units and denoted by $\bar{Y}$,
$$\bar{Y} = Y/N \qquad (1.2)$$
In addition, interest centers also on the variability of the units in the
universe, for the variability of any measure computed from the sample data
depends on it. This variability of the values of the study variable in the
universe is measured by the mean of the squared deviations of the values
from the mean, and is called the universe variance per unit; it is denoted
by $\sigma_Y^2$,
$$\sigma_Y^2 = \sum_{i=1}^{N} (Y_i - \bar{Y})^2 / N \qquad (1.3)$$
The positive square root of $\sigma_Y^2$ is termed the universe standard deviation
per unit.
Note: In the notation for the universe variance per unit, $\sigma_Y^2$, the subscript $Y$ is
included to denote that the universe variance refers to a particular study variable;
in later sections, the subscripts are used in a somewhat different way.
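As a small numerical illustration of the universe total, mean, variance per unit and standard deviation per unit, the following Python sketch uses a hypothetical universe of five family sizes; the data and function names are assumptions, not taken from the book. Note that the divisor of the variance is the number of universe units N.

```python
import math


def universe_total(values):
    """Universe total Y: the sum of the values Y_i."""
    return sum(values)


def universe_mean(values):
    """Universe mean: the universe total divided by N."""
    return universe_total(values) / len(values)


def universe_variance(values):
    """Universe variance per unit: mean squared deviation (divisor N)."""
    mean = universe_mean(values)
    return sum((y - mean) ** 2 for y in values) / len(values)


def universe_sd(values):
    """Universe standard deviation per unit: positive square root of the variance."""
    return math.sqrt(universe_variance(values))


family_sizes = [2, 4, 4, 5, 5]             # hypothetical universe, N = 5
print(universe_total(family_sizes))        # 20
print(universe_mean(family_sizes))         # 4.0
print(round(universe_variance(family_sizes), 4))
print(round(universe_sd(family_sizes), 4))
```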
Note that both the universe mean and the universe standard deviation
are in the same unit of measurement; for example, in a family budget
inquiry, if the study variable is the total family income, then the universe
mean and the universe standard deviation (per unit) would be expressed in
the same currency, be it dollars, pounds, pesetas, or rupees.
To obtain a measure of the universe variability independent of the unit
of measurement, the universe standard deviation is divided by the universe
mean; the ratio is called the universe coefficient of variation and denoted
by CV,
$$CV_Y = \sigma_Y / \bar{Y} \qquad (1.4)$$
It is often expressed as a percentage. The square of the CV is called
the relative variance. With the CV it becomes possible to compare the
variability of different items, e.g., the variability of family consumption in
different countries and time.
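The sketch below illustrates, under assumed data, why the coefficient of variation is independent of the unit of measurement: the same hypothetical incomes expressed in dollars and in cents give the same CV, although their standard deviations differ by a factor of 100. It uses `statistics.pstdev`, which divides by N as the universe variance per unit does.

```python
import statistics


def coefficient_of_variation(values):
    """Universe CV: standard deviation per unit (divisor N) over the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def relative_variance(values):
    """Relative variance: the square of the CV."""
    return coefficient_of_variation(values) ** 2


incomes_dollars = [200, 400, 400, 500, 500]          # hypothetical incomes
incomes_cents = [100 * y for y in incomes_dollars]   # same incomes, other unit

print(round(100 * coefficient_of_variation(incomes_dollars), 2), "per cent")
print(round(100 * coefficient_of_variation(incomes_cents), 2), "per cent")
```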
Another universe measure of interest is the ratio of the totals or means
of the values of two study variables, e.g., the proportion of income that is
spent on food in a family budget inquiry; it is denoted by $R$,
$$R = Y/X = \bar{Y}/\bar{X} \qquad (1.5)$$
where $X$ is the universe total and $\bar{X}$ the universe mean of another study
variable, defined similarly to $Y$ and $\bar{Y}$. Let the universe standard deviation
of the second study variable be denoted by $\sigma_X$.
The universe covariance between two study variables is obtained on
taking the mean of the products of deviations from their respective means,
and is denoted by $\sigma_{YX}$,
$$\sigma_{YX} = \sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X}) / N \qquad (1.6)$$
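The universe ratio and the universe covariance can be checked numerically with a short Python sketch. The four families, with income as X and food expenditure as Y, are hypothetical figures chosen only to keep the arithmetic simple.

```python
def universe_ratio(y_values, x_values):
    """Universe ratio R: total of Y divided by total of X."""
    return sum(y_values) / sum(x_values)


def universe_covariance(y_values, x_values):
    """Mean of the products of deviations from the respective means (divisor N)."""
    n_units = len(y_values)
    mean_y = sum(y_values) / n_units
    mean_x = sum(x_values) / n_units
    products = [(y - mean_y) * (x - mean_x) for y, x in zip(y_values, x_values)]
    return sum(products) / n_units


income = [100, 200, 300, 400]   # hypothetical X: family income
food = [60, 100, 120, 120]      # hypothetical Y: expenditure on food

print(universe_ratio(food, income))        # proportion of income spent on food
print(universe_covariance(food, income))
```

The ratio of the totals (400/1000) equals the ratio of the means (100/250), since both totals are divided by the same N.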
$$y = N\bar{y} \qquad (1.10)$$
In general, the estimates obtained from different samples of the same size
and taken from the same universe, following the same survey procedure, will
vary among themselves and will only by accident coincide with the universe
value being estimated, even if the same survey procedure is followed in both
the sample and the complete enumeration. This is simply because a part,
and not the whole, of the universe is covered in a sample. This variability
due to sampling is measured by the sampling variance of the estimator.
For example, the sampling variance of the sample mean in sampling with
replacement is
$$\sigma_{\bar{y}}^2 = \sigma_Y^2 / n \qquad (1.11)$$
In general, of course, $\sigma_Y^2$ will not be known, and the sampling variance
of the sample estimator has to be estimated from the sample data. For
a simple random sample of n units, the sample estimator of the universe
14 Chapter 1
$$s_{\bar{y}}^2 = s_y^2 / n \qquad (1.13)$$
Several criteria have been established for the sample estimators. One is
unbiasedness. A sample estimator is said to be unbiased if the average value
of the sample estimates for all possible samples of the same size is
mathematically identical with the value of the universe parameter; this average
over all possible samples of the same size is also known as the mathematical
expectation, or the (mathematically) expected survey value. In the examples
above, the sample estimators $\bar{y}$ and $\bar{x}$ are unbiased estimators of the
universe parameters $\bar{Y}$ and $\bar{X}$ respectively; this will be shown in
Chapter 2. That a sample estimator is unbiased does not, of course, mean
that the estimate obtained from a particular sample will necessarily be the
same as the universe value.
Basic Concepts of Sampling 15
$$r_i = y_i / x_i \qquad (1.17)$$
except in the trivial case of all the ratios Yi/Xi being the same.
An unbiased estimator is not necessarily consistent, or vice versa; but
in the general cases that we shall consider, an unbiased estimator can be
taken to be consistent. Also, if a consistent estimator is biased, its bias will
decrease with sample size.
The third criterion is that of precision or efficiency. We shall consider
here the concept of relative efficiency and relative precision. Of two
estimators, based on the same sample size, for the same universe parameter,
one is said to be more efficient than the other when its sampling variance is
smaller than the other’s. The precision of the estimator $t$ relative to that
of $t'$ is defined as
$$\mathrm{Precision}(t, t') = \sigma_{t'}^2 / \sigma_t^2 \qquad (1.19)$$
The efficiency of the estimator $t$ relative to that of $t'$ is defined as
$$\mathrm{Efficiency}(t, t') = \mathrm{MSE}_{t'} / \mathrm{MSE}_t \qquad (1.20)$$
$\mathrm{MSE}_t$ denoting the mean square error of $t$, defined in the next section. For
unbiased estimators, precision and efficiency are equivalent.
When, as is common, the universe values of the variances are not available,
the sample estimators are substituted in the above expressions.
It can be shown that in large samples (simple random), the variance of
the median as an estimator of the universe mean $\bar{Y}$ is $\tfrac{1}{2}\pi\sigma_Y^2/n$ (where $\pi$ is
the ratio of the circumference of a circle to its diameter = 3.14159 ...,
or 3.1416, approximately), so that the efficiency of the mean relative to the
median is, from equations (1.11) and (1.20),
$$\frac{\tfrac{1}{2}\pi\sigma_Y^2/n}{\sigma_Y^2/n} = \frac{\pi}{2}$$
or 1.57 approximately, i.e., for simple random samples of the same size,
the sample mean is 57 per cent more efficient than the sample median in
estimating the universe mean.
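The figure of about 1.57 can be checked roughly by simulation. The following is only an illustrative sketch: it assumes a normally distributed universe (the case for which the large-sample variance of the median quoted above applies), and the sample size, number of replications, and seed are arbitrary choices that do not come from the text.

```python
import random
import statistics


def variance_ratio(n=51, reps=4000, seed=12345):
    """Return Var(sample median) / Var(sample mean) over repeated samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        # Assumption: the universe is standard normal.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(medians) / statistics.pvariance(means)


ratio = variance_ratio()
print(f"Var(median) / Var(mean) = {ratio:.2f}")
```

With a moderate sample size the simulated ratio should lie in the neighbourhood of $\pi/2$, though it will not match it exactly because the result is a large-sample one and the simulation itself is subject to sampling error.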
A related concept is that of a minimum variance estimator. A well-
known inequality in mathematical statistics (the Cramér-Rao inequality) states
that the variance of an estimator of $\bar{Y}$ cannot be smaller than $\sigma_Y^2/n$. But
as the sample mean itself has sampling variance $\sigma_Y^2/n$ in simple random
sampling (with replacement), we can say that under this sampling plan,
the sample mean is an unbiased, consistent, minimum variance estimator
of the universe mean.
The principal steps that should be taken in planning and executing a sample
survey and analyzing and publishing the results could be grouped under the
following topics (the broad topics have been illustrated in Figure 1.1, from
the American Statistical Association’s booklet, What Is A Survey?):
6. pre-test and pilot studies - plan, conduct and analyze pre-tests and
pilot studies to determine suitable topics for investigation, respondent
types, suitability of sampling plan and estimation of sampling
estimates and variances to help in finalizing sampling plan.
10. survey data processing - plan and test electronic data processing
procedures and programs; acquire and test equipment; receive and check
in completed questionnaires; data editing and entry; data tabulation.
Figure 1.1: Stages of a sample survey (from Ferber et al.).
The above-mentioned tasks are not all sequential (i.e., they do not
necessarily follow one another): some, such as the preparation of the
questionnaire (and its testing) and the recruitment of enumerators (and their
training) are fully linked; others are partly linked; and several others, such
as the construction of sampling frames and the preparation of training
material for the enumerators, may proceed simultaneously. It would be most
helpful to show these tasks in a Gantt or Program Evaluation and Review
Technique (PERT) Chart so that the tasks could be monitored and
evaluated, and corrective action taken. A project management computer
software program, such as TimeLine (by Symantec Corporation) or Project
(by Microsoft Corporation), should be installed to facilitate this.
The design and conduct of a sample survey require more technical
expertise than that for a census. One principal recommendation of the United
Nations should be heeded (U.N. 1947):
“A sample survey should be carried out only under the technical
guidance of professional statisticians not only with adequate knowledge of
sampling theory but also with actual experience in sampling practice,
and with the help of a properly trained field and computing staff.”
Capital (and sometimes small) letters are used to denote the universe values
of the study and ancillary variables (such as $Y_1, Y_2, \ldots, Y_N$; $X_1, X_2, \ldots, X_N$),
and small letters those in the sample (such as $y_1, y_2, \ldots, y_n$; and $x_1, x_2, \ldots,
x_n$). $N$ is the universe size (i.e., the total number of units in the universe),
and $n$ the sample size. Universe parameters are denoted by capital or Greek
letters (such as $\bar{Y}$, the universe mean; $\sigma^2$, the universe variance per unit; $\rho$,
the universe correlation coefficient of two variables), and the sample
estimators by small letters or with a circumflex or hat (^) on the corresponding
universe parameters (such as $\bar{y}$, the sample mean, as estimator of $\bar{Y}$; $s^2$,
sample estimator for $\sigma^2$; and $\hat{\rho}$, sample estimator for $\rho$).
$\sum$ stands for summation (either for the universe or for the sample), and
the number of terms to be summed is indicated by the letters at the bottom
the last five terms being used when there is no risk of ambiguity.
Double and triple summations may often be required. Thus, consider $N$
(universe) first-stage units, the $i$th first-stage unit ($i = 1, 2, \ldots, N$) containing
$M_i$ second-stage units. Let the value of the study variable for the $j$th
second-stage unit ($j = 1, 2, \ldots, M_i$) in the $i$th first-stage unit be denoted
by $Y_{ij}$. Then the sum of the values of the study variable for any first-stage
unit and for all first-stage units can be represented as follows:
First-stage unit    Sum of the values of the study variable of the second-stage units
1                   $Y_{11} + Y_{12} + \cdots + Y_{1M_1} = \sum_{j=1}^{M_1} Y_{1j}$
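will be termed the corrected sum of squares. The choice of the particular
expression for $SS_{y_i}$ in (1.22) will depend on computational convenience.
Similarly, $\sum_{i=1}^{n} y_i x_i$ will be called the “raw” or “crude” sum of products
of the $y_i$ and $x_i$ values, and the sum of products of deviations from the
respective means
$$SP_{y_i x_i} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^{n} y_i x_i - \Big(\sum_{i=1}^{n} y_i\Big)\Big(\sum_{i=1}^{n} x_i\Big)\Big/ n \qquad (1.23)$$
will be termed the corrected sum of products.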
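The equivalence of the deviation form and the raw-sum form of the corrected sums can be checked numerically. The sketch below is illustrative only: the function names and the three pairs of values are hypothetical and are not taken from the text.

```python
def corrected_sums(y, x):
    """Corrected sum of squares of y and sum of products of y and x, from raw sums."""
    n = len(y)
    raw_ss = sum(v * v for v in y)               # "raw" sum of squares
    raw_sp = sum(a * b for a, b in zip(y, x))    # "raw" sum of products
    ss = raw_ss - sum(y) ** 2 / n
    sp = raw_sp - sum(y) * sum(x) / n
    return ss, sp


def deviation_sums(y, x):
    """The same two quantities computed directly from deviations about the means."""
    n = len(y)
    y_bar, x_bar = sum(y) / n, sum(x) / n
    ss = sum((v - y_bar) ** 2 for v in y)
    sp = sum((a - y_bar) * (b - x_bar) for a, b in zip(y, x))
    return ss, sp


# Hypothetical illustrative values (not from the text).
y = [2, 4, 6]
x = [1, 3, 8]
print(corrected_sums(y, x), deviation_sums(y, x))
```

The raw-sum form needs only running totals, which is why it was often the more convenient choice for hand computation; the deviation form is the definition.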
Further reading
References have been made only to the names of the authors (and to the year of
publication when the author has more than one publication referenced); complete
references are given at the end of the book. Some of the books mentioned in the
first edition, as well as some published since then, are now out of print; however,
I have retained references to the important ones among them, in the hope that
they may still be available in university libraries.
For a history of sampling, see Bellhouse (1988a). For an introduction to
the ideas of scientific sampling, see the books by Kalton (1983); Scheaffer,
Mendenhall, and Ott; Slonim (a popular account); Stuart; Sudman (who gives
in chapter 1 of his book examples of some interesting sample surveys conducted
in the U.S.A.); Warwick and Linninger; and special chapters on sampling in the
book by Moser and Kalton and in textbooks on statistics (with generally the
same mathematical level as this book), such as those by Blalock; Chakravarty,
Laha and Roy; Hajek and Dupac; P.O. Johnson; Snedecor and Cochran; Tippet;
and Yule and Kendall, and an article by Dalenius (1988).
A Short Manual on Sampling (1972) by the U.N., Sampling Lectures (1968) by
the U.S. Bureau of the Census, and Tore Dalenius’s Elements of Survey Sampling
(1985) would also be valuable introductory reading. Barnett’s Sample Survey
Principles and Methods (1991) contains basic sampling techniques and provides
many instructive examples and exercises: it would be a good preliminary read to
this book.
For readers interested in sampling theory, texts by the following authors
are recommended as follow-up:
Chaudhuri and Stenger; Hajek; Hansen et al. (1953), vol. II; Hedayat
and Sinha; Kish (1965); Murthy (1967); Sukhatme et al. (1984); and
Yates.
For readers interested in the general topic of foundations of survey
sampling, including statistical inference in sampling from finite populations,
see articles, starting with Godambe (1955) and other relevant articles, in New
Developments in Survey Sampling, edited by Norman L. Johnson and Harry Smith,
Jr. (1969), Foundations of Statistical Inference, edited by V.P. Godambe and
D.A. Sprott (1971), and Handbook of Statistics, Vol. 6, Sampling, edited by P.R.
Krishnaiah and C.R. Rao (1988) and by T.M.F. Smith (1984). These collections
and Contributions to Survey Sampling and Applied Statistics, edited by H.A.
David (1978), and Survey Sampling and Measurement, edited by K. Namboodiri
(1978), contain a number of important articles on sampling theory.
For relevant issues in statistical designs for research, see Kish (1987).
For definition of statistical terms, see A Dictionary of Statistical Terms by
Kendall and Buckland: I have relied on it heavily.
For readers interested in sampling methods and their application in
specific subject fields, the following are recommended:
Sampling methods (general): Cochran; Hansen et al. (1953),
vol. I (especially relating to sample surveys undertaken by U.S.
Government agencies); Kish (1965).
Agriculture, forestry and fishery: FAO (1989); Hendricks (1956);
Jessen; Panse (1954); Sampford; Singh and Chaudhary; Sukhatme
et al. (1984); Yates.
Auditing and accounting: T.M.F. Smith (1976); Trueblood and
Cyert; Vance; Vance and Neter; Welburn.
Biological and geological studies: S.K. Thompson.
Business research: Deming (1960).
Note: In following the literature on sampling, one difficulty stems from the
differences in concepts and notations followed by different authors. It is suggested
that the reader first gets acquainted with the concepts and notations used in this
book, and then, while consulting other references, notes the differences in concepts
and notations. For example, some authors (including myself) use the y-notation
for the (first) study variable while others use the z-notation. In dealing with
single-stage sampling, all the authors cited use $N$ and $n$ to denote respectively
the number of units in the universe and in the sample; for two-stage sampling,
some authors (including myself) employ $M_i$ and $m_i$ to denote respectively the
universe and the sample number of second-stage units in the $i$th first-stage unit
($i = 1, 2, \ldots, N$ for the universe, and $i = 1, 2, \ldots, n$ for the sample), but some
others denote by $M$ and $m$ the universe and the sample number of first-stage
units and by $N_i$ and $n_i$ the universe and sample numbers of second-stage units
in the $i$th first-stage unit. Again, in two-stage sampling, some authors use the
term “size” to mean the number of second-stage units it (the first-stage unit)
contains, but I and others have used the term “size” in a more general sense of
the value of an ancillary variable for selection of units with probability
proportional to “size” (pps), which can then be used also in single-stage pps sampling; a
special case of “size” is, of course, the number of second-stage (or for that matter
any subsequent-stage) units. Note also that some authors introduce multi-stage
sampling under “cluster sampling.”
Exercises
SINGLE-STAGE
SAMPLING
CHAPTER 2
2.1 Introduction
30 Chapter 2
Household   Size $Y_i$   $Y_i - \bar{Y}$   $(Y_i - \bar{Y})^2$
a           8            3                 9
b           6            1                 1
c           3            -2                4
d           5            0                 0
e           4            -1                1
f           4            -1                1
Total       30           0                 16
Mean        $\bar{Y} = 5$   0              $\sigma^2 = 2.6667$
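The computations are shown in Table 2.1. The standard deviation per unit
of the universe is
$$\sigma = \sqrt{2.6667} = 1.6330$$
and the universe coefficient of variation is
$$\mathrm{CV} = \sigma/\bar{Y} = 1.6330/5 = 0.3266 \text{ or } 32.66 \text{ per cent} \qquad (2.4)$$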
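The universe measures for the six households can be reproduced in a few lines. This is a minimal sketch using the household sizes of Table 2.1; the variable names are illustrative.

```python
import math

# Household sizes of the hypothetical universe (Table 2.1).
sizes = {"a": 8, "b": 6, "c": 3, "d": 5, "e": 4, "f": 4}

N = len(sizes)
mean = sum(sizes.values()) / N                                # universe mean
variance = sum((v - mean) ** 2 for v in sizes.values()) / N   # divisor N, not N - 1
sd = math.sqrt(variance)                                      # standard deviation per unit
cv = sd / mean                                                # coefficient of variation

print(f"mean={mean}, variance={variance:.4f}, sd={sd:.4f}, CV={cv:.4f}")
```

Note that the universe variance uses the divisor $N$; the divisor $n - 1$ appears only in the sample estimator.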
Note: The characteristics of the universe will not be known in general, and the
main objective in sample surveys is to estimate these characteristics.
If the sampling frame consists of an identifiable list of all the units in the
universe, but without any information on the value (or magnitude) of the
Simple Random Sampling 31
variable under study (the “study variable”) or of any ancillary variable, and
if we are sampling only one unit from the universe, then a priori there would
be no reason to choose one unit over another, i.e. all the units should have
the same chance of being selected. Similarly, if we are selecting $n$ ($> 1$)
units from the universe of $N$ units, then each combination of the $n$ units
should have the same chance of selection as every other combination: such
a sample is termed a simple random sample of $n$ units.
Note: In simple random sampling, the probability that a universe unit will be
selected at any given draw is the same as that at the first draw, namely, 1/N,
and the probability that the specific unit is included in the sample of n units is
$n/N$; see Appendix II, section A2.3.2, notes 1 and 2.
If we draw a simple random sample such that no unit occurs more than
once in the sample, sampling is said to be without replacement; if a unit can
occur more than once in the sample, it is said to be with replacement. We
shall illustrate these with the hypothetical universe of six households given
in Table 2.1, the objective being to estimate the average size of household
from samples of different sizes.
Estimator of the universe mean: sample mean $\bar{y} = \sum_{i=1}^{n} y_i / n$   (2.5)
Estimator of variance per unit: $s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$   (2.6)
Table 2.2: All possible samples of size 2 drawn without replacement from
the universe of Table 2.1
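* $y_1$ and $y_2$ are the values of the two units included in the sample.
† $\bar{y} = \tfrac{1}{2}(y_1 + y_2)$ from equation (2.5).
‡ $s_y^2 = \tfrac{1}{2}(y_1 - y_2)^2$ from equation (2.8).
§ $s_{\bar{y}}^2 = (1 - f)s_y^2/n = \tfrac{1}{3} s_y^2$ from equation (2.7), as $n = 2$, $N = 6$, and $f = n/N = \tfrac{1}{3}$.
where $y_1$ and $y_2$ are the values of the two sample units included in
the sample.
$$s_y^2 = \frac{SS_{y_i}}{n-1} = \frac{\sum (y_i - \bar{y})^2}{n-1} = \frac{\sum y_i^2 - (\sum y_i)^2/n}{n-1} = \frac{\sum y_i^2 - n\bar{y}^2}{n-1} \qquad (2.9)$$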
1. The average of the estimates of all possible samples for any sample size
(this is known as the mathematical expectation) is the true universe
value, namely 5. If the mathematical expectation of a sample
estimator is the universe parameter, the estimator is said to be unbiased
for the universe parameter; if not, the estimator is biased. We have
seen that the sample mean, obtained from a simple random sample
without replacement, is an unbiased estimator of the universe mean
(see also section 1.12).
2. As the size of the sample increases, the sample estimates concentrate
around the universe value; thus 73 per cent of the sample estimates
of size two, 90 per cent of size three, and all from four and larger
sizes fall within the range of household size 4-6 (see Figure 2.1). This
characteristic of a sample estimator is known as consistency (see also
section 1.12).
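Both observations can be checked by enumerating every possible sample drawn without replacement from the six households. The sketch below is illustrative; the function and variable names are not from the text, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

universe = [8, 6, 3, 5, 4, 4]   # household sizes a-f, Table 2.1
N = len(universe)


def summarize(n):
    """Enumerate all samples of size n (without replacement) and summarize their means."""
    means = [Fraction(sum(s), n) for s in combinations(universe, n)]
    count = len(means)
    expectation = sum(means) / count
    variance = sum((m - expectation) ** 2 for m in means) / count
    share_4_to_6 = Fraction(sum(1 for m in means if 4 <= m <= 6), count)
    return count, expectation, variance, share_4_to_6


results = {n: summarize(n) for n in range(1, N + 1)}
for n, (count, expectation, variance, share) in results.items():
    print(n, count, expectation, round(float(variance), 4), f"{float(share):.0%}")
```

The expectation is 5 for every sample size (unbiasedness), while the share of sample means falling in the range 4-6 rises with $n$ (consistency).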
Table 2.3: All possible samples of size 3 drawn without replacement from
the universe of Table 2.1
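Sample no.   Sample units   Total size ($\sum y_i$)   Mean size ($\bar{y}$)   $s_y^2$*   $s_{\bar{y}}^2$†
* $s_y^2 = SS_{y_i}/(n - 1) = \tfrac{1}{2} SS_{y_i}$, as $n = 3$.
† $s_{\bar{y}}^2 = s_y^2(1 - f)/n = \tfrac{1}{6} s_y^2$, as $n = 3$, $N = 6$ and $f = n/N = \tfrac{1}{2}$.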
Table 2.4: Estimates of average household size from all possible samples of
different sizes drawn without replacement from the universe of Table 2.1

Average                 Number of samples of size
household size     n=1     n=2     n=3     n=4     n=5     n=6
3.00-3.24           1       -       -       -       -       -
3.25-3.49           -       -       -       -       -       -
3.50-3.74           -       2       1       -       -       -
3.75-3.99           -       -       -       -       -       -
4.00-4.24           2       2       2       1       -       -
4.25-4.49           -       -       3       1       1       -
4.50-4.74           -       3       2       2       -       -
4.75-4.99           -       -       -       2       1       -
5.00-5.24           1       2       4       2       3       1
5.25-5.49           -       -       2       3       1       -
5.50-5.74           -       2       3       2       -       -
5.75-5.99           -       -       -       2       -       -
6.00-6.24           1       2       2       -       -       -
6.25-6.49           -       -       1       -       -       -
6.50-6.74           -       1       -       -       -       -
6.75-6.99           -       -       -       -       -       -
7.00-7.24           -       1       -       -       -       -
7.25-7.49           -       -       -       -       -       -
7.50-7.74           -       -       -       -       -       -
7.75-7.99           -       -       -       -       -       -
8.00-8.24           1       -       -       -       -       -
Number of samples   6       15      20      15      6       1
Mean                5       5       5       5       5       5
Variance of mean    2.6667  1.0667  0.5333  0.2667  0.1067  0
1. The average of $s_y^2$ for all the sample sizes is a constant and is 3.2,
whereas the universe variance is $\sigma^2 = 2.6667$ (Table 2.1). Thus $s_y^2$ is
not an unbiased estimator of the universe variance $\sigma^2$. The expected
value of $s_y^2$ is:
$$E(s_y^2) = \frac{N}{N-1}\,\sigma^2 \qquad (2.10)$$
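Table 2.5: Results of all possible samples of different sizes drawn without
replacement from the universe of Table 2.1

Sample   Number of   Average of     Average    Variance of
size n   samples     sample means   of $s_y^2$   sample mean
1        6           5              3.2        2.6667
2        15          5              3.2        1.0667
3        20          5              3.2        0.5333
4        15          5              3.2        0.2667
5        6           5              3.2        0.1067
6        1           5              3.2        0
In our example,
$$\frac{N\sigma^2}{N-1} = \frac{16}{5} = 3.2$$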
2. The universe variance of the sample mean for srs without replacement
is
$$\sigma_{\bar{y}}^2 = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1} = \frac{S^2}{n}(1 - f) \qquad (2.11)$$
3. The value of the estimator of the variance of the sample average, $s_{\bar{y}}^2$,
decreases with increasing sample size and becomes zero when $n = N$
(see Figure 2.2).
4. Tables 2.4 and 2.5 give the results obtained from the theoretical
probabilities of the different samples. The same results are approximated
when actual samples are drawn a large number of times from a
universe.
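The constant average of 3.2 for $s_y^2$ and the variances of the sample mean can be reproduced by enumeration. This is an illustrative sketch with hypothetical names; sample size 1 is left out of the $s_y^2$ check because the divisor $n - 1$ would be zero.

```python
from fractions import Fraction
from itertools import combinations

universe = [8, 6, 3, 5, 4, 4]   # household sizes, Table 2.1
N = len(universe)
Y_bar = Fraction(sum(universe), N)
sigma2 = sum((v - Y_bar) ** 2 for v in universe) / N   # universe variance, divisor N
S2 = sigma2 * N / (N - 1)                               # N * sigma^2 / (N - 1)


def average_s2(n):
    """Average of the sample variance s_y^2 over all samples of size n (without replacement)."""
    values = []
    for sample in combinations(universe, n):
        y_bar = Fraction(sum(sample), n)
        values.append(sum((v - y_bar) ** 2 for v in sample) / (n - 1))
    return sum(values) / len(values)


averages = {n: average_s2(n) for n in range(2, N + 1)}
# Variance of the sample mean from the formula (S^2 / n)(1 - f), with f = n / N.
variance_of_mean = {n: S2 / n * (1 - Fraction(n, N)) for n in range(1, N + 1)}

print({n: float(a) for n, a in averages.items()})
print({n: round(float(v), 4) for n, v in variance_of_mean.items()})
```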
Estimator of the universe mean: sample mean $\bar{y} = \sum_{i=1}^{n} y_i / n$   (2.12)
Estimator of variance per unit: $s_y^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)$   (2.13)
(2.16)
This decreases with increasing sample size, but does not reach zero
(Figure 2.2).
Table 2.6: All possible samples of size 2 drawn with replacement from the
universe of Table 2.1
Sample no.   Sample units   Total size* ($y_1 + y_2$)   Mean size† ($\bar{y}$)   $s_y^2$‡   $s_{\bar{y}}^2$§
1 aa 16 8.0 0 0
2 ab 14 7.0 2.0 1.00
3 ac 11 5.5 12.5 6.25
4 ad 13 6.5 4.5 2.25
5 ae 12 6.0 8.0 4.00
6 af 12 6.0 8.0 4.00
7 ba 14 7.0 2.0 1.00
8 bb 12 6.0 0 0
9 bc 9 4.5 4.5 2.25
10 bd 11 5.5 0.5 0.25
11 be 10 5.0 2.0 1.00
12 bf 10 5.0 2.0 1.00
13 ca 11 5.5 12.5 6.25
14 cb 9 4.5 4.5 2.25
15 cc 6 3.0 0 0
16 cd 8 4.0 2.0 1.00
17 ce 7 3.5 0.5 0.25
18 cf 7 3.5 0.5 0.25
19 da 13 6.5 4.5 2.25
20 db 11 5.5 0.5 0.25
21 dc 8 4.0 2.0 1.00
22 dd 10 5.0 0 0
23 de 9 4.5 0.5 0.25
24 df 9 4.5 0.5 0.25
25 ea 12 6.0 8.0 4.00
26 eb 10 5.0 2.0 1.00
27 ec 7 3.5 0.5 0.25
28 ed 9 4.5 0.5 0.25
29 ee 8 4.0 0 0
30 ef 8 4.0 0 0
31 fa 12 6.0 8.0 4.00
32 fb 10 5.0 2.0 1.00
33 fc 7 3.5 0.5 0.25
34 fd 9 4.5 0.5 0.25
35 fe 8 4.0 0 0
36 ff 8 4.0 0 0
* $y_1$ and $y_2$ are the values of the two units included in the sample.
† $\bar{y} = \tfrac{1}{2}(y_1 + y_2)$ from equation (2.5).
‡ $s_y^2 = \tfrac{1}{2}(y_1 - y_2)^2$ from equations (2.13) and (2.8).
§ $s_{\bar{y}}^2 = \tfrac{1}{2} s_y^2$ from equation (2.14).
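The 36 ordered samples of Table 2.6 can be generated and summarized directly. The following is an illustrative sketch; the variable names are not from the text.

```python
from fractions import Fraction
from itertools import product

universe = [8, 6, 3, 5, 4, 4]   # household sizes a-f, Table 2.1
N = len(universe)
n = 2

# With replacement, every ordered pair (including repeats such as "aa") is a possible sample.
samples = list(product(universe, repeat=n))
means = [Fraction(sum(s), n) for s in samples]
s2 = [sum((v - m) ** 2 for v in s) / (n - 1) for s, m in zip(samples, means)]

expected_mean = sum(means) / len(means)
variance_of_mean = sum((m - expected_mean) ** 2 for m in means) / len(means)
expected_s2 = sum(s2) / len(s2)

print(len(samples), expected_mean, float(variance_of_mean), float(expected_s2))
```

Under sampling with replacement the average of $s_y^2$ over all samples equals the universe variance $\sigma^2 = 2.6667$ itself, and the variance of the sample mean is $\sigma^2/n$ with no finite-universe factor.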
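$$\sigma_{\bar{y}}^2 = \frac{\sigma^2}{n} \quad \text{in srswr} \qquad (2.17)$$
$$\sigma_{\bar{y}}^2 = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1} = \frac{S^2}{n}(1 - f) \quad \text{in srswor} \qquad (2.18)$$
where
$$S^2 = \frac{N}{N - 1}\,\sigma^2 \qquad (2.19)$$
$$s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
$$s_{\bar{y}}^2 = \frac{s_y^2}{n} \quad \text{in srswr} \qquad (2.20)$$
$$s_{\bar{y}}^2 = \frac{s_y^2}{n}(1 - f) \quad \text{in srswor} \qquad (2.21)$$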
From the numerical examples, these results have been seen to hold for
the universe of six households. Theoretical proofs are given in Appendix II
that they hold for a universe of any size (sections A2.3.2 - A2.3.4).
$$\mathrm{CV}_r^2 = \mathrm{CV}_Y^2 + \mathrm{CV}_X^2 - 2\rho\,\mathrm{CV}_Y\,\mathrm{CV}_X \qquad (2.24)$$
Theoretical proofs for simple random sampling are given in Appendix II,
section A2.3.6. Estimators of $\sigma_r^2$ and $\mathrm{CV}_r$ are obtained on substituting the
sample estimators for the universe parameters in the respective expressions.
Notes
1. In effect, sampling with replacement is equivalent to drawing samples from
an infinitely large universe, and in the estimating formulae for the variance
of the sample mean (and other related measures), the factor
i = (2.28)
(2.30)
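$$s_{\bar{t}\bar{u}} = \frac{\sum_{i=1}^{n} (t_i - \bar{t})(u_i - \bar{u})}{n(n-1)} = \frac{SP_{t_i u_i}}{n(n-1)} \qquad (2.32)$$
$$r = \bar{t}/\bar{u} \qquad (2.33)$$
$$\hat{\rho}_{\bar{t}\bar{u}} = s_{\bar{t}\bar{u}} / (s_{\bar{t}}\, s_{\bar{u}})$$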
Notes
1. The above theorems have a very wide range of application in sampling.
These can be generalized to stratified and multi-stage designs, with sample
units being selected with equal or varying probabilities. The only
requirement is that for the universe parameters, a number of independent and
unbiased estimators be available from the sample. The independence of
the sample estimators is ensured by selecting the sample units with
replacement (at least at the first stage in a multi-stage sample design and in
each stratum separately in a stratified design); in practice, when sampling
is carried out without replacement this is approximated by making the
(first-stage) sampling fraction fairly small (10 per cent or less, preferably 5
per cent or less: see note 2, section 2.6).
2. The estimators $\bar{t}$ and $\bar{u}$ generally refer to the totals of the characteristics;
the advantage of this will be seen later.
(2.36)
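$$\frac{\bar{t} - T}{s_{\bar{t}}} = \frac{(\bar{t} - T)\sqrt{n}}{s_t} \qquad (2.38)$$
Denoting by $t'_{\alpha, n-1}$ the $100\alpha$ percentage point of the $t$-distribution
corresponding to $(n - 1)$ degrees of freedom, we see that the inequality
$$|\bar{t} - T|\sqrt{n}/s_t \le t'_{\alpha, n-1} \qquad (2.39)$$
by the limits
$$\bar{t} \pm (t'_{\alpha, n-1}\, s_t)/\sqrt{n} \qquad (2.40)$$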
(i-T)M (2.41)
2. When the sample size is small, the $t$-distribution may still be used in
setting probability limits to the universe values, but with the
additional assumption that the sample itself comes from a normal
distribution.
$$s_{\bar{y}}^2 = \frac{s_y^2}{n} = \frac{SS_{y_i}}{n(n-1)}$$
Let us see, however, how the above and other results follow from the
fundamental theorems of section 2.7.
With a simple random sample of $n$ units selected with replacement, let
the sample values of the study variable be $y_1, y_2, \ldots, y_n$. Each of these $y_i$
values ($i = 1, 2, \ldots, n$), when multiplied by $N$, the total number of units
in the universe, gives an independent, unbiased estimator of the universe
total $Y$
$$y_i^* = N y_i \qquad (2.42)$$
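An unbiased estimator of the variance of the estimator $y_0^*$ is, from equation
(2.29),
$$s_{y_0^*}^2 = \frac{\sum_{i=1}^{n} (y_i^* - y_0^*)^2}{n(n-1)} = \frac{SS_{y_i^*}}{n(n-1)} = \frac{N^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}{n(n-1)} = \frac{N^2 s_y^2}{n} = \frac{N^2 SS_{y_i}}{n(n-1)} \qquad (2.44)$$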
An unbiased estimator of the universe mean $\bar{Y}$ is obtained on dividing
the unbiased estimator of the universe total by the number of universe units
$N$, namely,
$$y_0^*/N = \bar{y} \qquad (2.45)$$
i.e., the sample mean, and an unbiased estimator of the variance of $\bar{y}$ on
dividing the unbiased estimator of variance of the estimator of the universe
total by $N^2$, namely,
$$s_{y_0^*}^2 / N^2 = s_y^2/n = s_{\bar{y}}^2 \qquad (2.46)$$
From the same sample, unbiased estimators of the total and the mean
of another study variable may be computed from estimation formulae of
the above types. An unbiased estimator of the covariance between the
estimators $y_0^*$ and $x_0^*$ of the two universe totals $Y$ and $X$ is, from equation
(2.32),
$$r = y_0^* / x_0^* \qquad (2.48)$$
$$s_r^2 = \frac{\sum (y_i - \bar{y})^2 + r^2 \sum (x_i - \bar{x})^2 - 2r \sum (y_i - \bar{y})(x_i - \bar{x})}{n(n-1)\bar{x}^2} = \frac{SS_{y_i} + r^2 SS_{x_i} - 2r\, SP_{y_i x_i}}{n(n-1)\bar{x}^2} \qquad (2.49)$$
Here the unbiased estimators of universe totals and means, and the
estimator of the universe ratio of two universe totals (or means) remain
the same as for srs with replacement, but the finite sampling correction,
$(1 - f) = (1 - n/N)$, is applied to the variance and covariance estimators.
Thus the unbiased variance estimator of $\bar{y}$ is
(2.21)
(2.50)
(2.51)
N2SPyiXi
U n(n-l) (2.52)
A variance estimator of $r$ is
$$s_r^2 = \frac{1}{x_0^{*2}} \left( s_{y_0^*}^2 + r^2 s_{x_0^*}^2 - 2r\, s_{y_0^* x_0^*} \right)$$
The first two plates of random sample numbers from Statistical Tables for
Biological, Agricultural and Medical Research by R.A. Fisher and F. Yates
are reproduced in Appendix III, Table 1. The use of these tables in selecting
probability samples will be illustrated with examples. Random number
tables are also provided by Kendall and Smith and by the Rand Corporation.
Notes
2.11 Examples
In Appendix IV are given for thirty villages located in three states, the data
on the current total population and also the current size of the households
and the population obtaining during a census conducted five years ago. For
some sample villages additional information on items such as the monthly
income of the households will also be given. These will constitute our
universe, from which samples will be drawn and the principles of sample
selection and procedures of estimation of universe values illustrated.
Some universe values follow; normally, of course, not all these values
will be known and will have to be estimated from a probability sample:
Example 2.1
From village number 8 in zone 1 (Appendix IV), select a simple random sample
of 5 households without replacement, and on the basis of the data on size and
the monthly income of these 5 sample households, estimate for the 24 households
in the village the total number of persons, total daily income, average household
size, and average monthly income per household and per person, together with
their standard errors.
Of the two plates of random numbers given in Appendix III (Table 1), we
choose one plate at random, say, by tossing a coin; suppose the second plate is
chosen. To obtain a simple random sample of 5 households without replacement
from the present universe of 24 households in the village, we start from the top
left-hand corner of the plate, and read down the first two columns (as the total
number of households is of two digits).
One method of procedure is to select any random number that lies between
01 and 24, and to disregard all numbers between 25 and 99, as also 00; the five
numbers thus selected would constitute our required sample of 5 households. (As
selection is without replacement, if a random number is repeated in the draws we
would reject it, and continue drawing until five different numbers are selected.)
However, this procedure will lead to a large number of rejections of the random
numbers, on an average 100 - 24 = 76 per cent. A modified procedure is to divide
all two-digit random numbers greater than 24 by 24 and take the remainder as
the random number selected (a remainder of zero being read as 24). Obviously,
we should not consider numbers 97 to
99 and also 00 in order to give all the 24 households equal chances of selection.
The random numbers as they appear (and after division by 24 and taking the
remainder, if the random number is larger than 24) are: 53 (remainder 5); 97
(rejected); 35 (remainder 11); 63 (remainder 15); 02; 98 (rejected); 64 (remainder
16).
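The remainder procedure can be sketched as a small function. This is an illustrative sketch only: the function name and its parameters are hypothetical, and the input is the sequence of two-digit random numbers quoted above rather than a reading of the random number table itself.

```python
def select_households(random_numbers, universe_size=24, sample_size=5):
    """Select serial numbers without replacement by the remainder method."""
    # Largest multiple of the universe size not exceeding 99 (here 96);
    # numbers above it, and 00, are rejected so that every serial number
    # corresponds to the same number of two-digit random numbers.
    limit = (99 // universe_size) * universe_size
    selected = []
    for number in random_numbers:
        if number == 0 or number > limit:
            continue                       # 00 and 97-99 are rejected
        serial = number % universe_size or universe_size   # remainder 0 is read as 24
        if serial in selected:
            continue                       # repeats are rejected (without replacement)
        selected.append(serial)
        if len(selected) == sample_size:
            break
    return selected


draws = [53, 97, 35, 63, 2, 98, 64]
sample = select_households(draws)
print(sample)
```

With this rule each of the 24 serial numbers is reached by exactly four of the numbers 01-96, so only 4 of the 100 two-digit numbers are wasted instead of 76.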
The five households that constitute our sample are then households with serial
numbers 5, 11, 15, 2, and 16. The data on the size and daily income of these 5
sample households are given in Table 2.7, along with the required computations.
The household size is denoted by $y_i$ and the daily household income by $x_i$.

Table 2.7: Size and daily income of the simple random sample of 5 households
selected from 24 households in village no. 8 in zone 1

Sample      Household    Size    Daily income
household   serial no.   $y_i$   $x_i$          $y_i^2$   $x_i^2$   $y_i x_i$
1           2            3       33             9         1089      99
2           5            4       40             16        1600      160
3           11           3       34             9         1156      102
4           15           5       68             25        4624      340
5           16           6       61             36        3721      366
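Here $N = 24$; $n = 5$; $f = n/N = 0.2083$; and $1 - f = 0.7917$. The corrected
sums of squares and products are
$$SS_{y_i} = \sum_{i=1}^{n} y_i^2 - \Big(\sum_{i=1}^{n} y_i\Big)^2 \Big/ n$$
$$SS_{x_i} = \sum_{i=1}^{n} x_i^2 - \Big(\sum_{i=1}^{n} x_i\Big)^2 \Big/ n$$
$$SP_{y_i x_i} = \sum_{i=1}^{n} y_i x_i - \Big(\sum_{i=1}^{n} y_i\Big)\Big(\sum_{i=1}^{n} x_i\Big)\Big/ n$$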
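The corrected sums and the household-size estimates can be worked through from the five rows of Table 2.7. The sketch below is illustrative: the variable names are hypothetical, and the standard error uses the without-replacement form $s_{\bar{y}}^2 = (1 - f)\,s_y^2/n$ of equation (2.21).

```python
import math

N = 24                              # households in the village
sizes = [3, 4, 3, 5, 6]             # y_i, Table 2.7
incomes = [33, 40, 34, 68, 61]      # x_i, Table 2.7
n = len(sizes)
f = n / N                           # sampling fraction


def corrected_ss(values):
    """Corrected sum of squares: raw sum of squares minus (sum)^2 / n."""
    return sum(v * v for v in values) - sum(values) ** 2 / n


def corrected_sp(u, v):
    """Corrected sum of products: raw sum of products minus (sum u)(sum v) / n."""
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


ss_y = corrected_ss(sizes)
ss_x = corrected_ss(incomes)
sp_yx = corrected_sp(sizes, incomes)

mean_size = sum(sizes) / n                      # estimated average household size
total_persons = N * mean_size                   # estimated total number of persons
se_mean_size = math.sqrt((1 - f) * ss_y / (n * (n - 1)))
se_total_persons = N * se_mean_size

print(ss_y, ss_x, sp_yx)
print(mean_size, total_persons, round(se_mean_size, 4), round(se_total_persons, 2))
```

The income estimates follow in the same way with `incomes` in place of `sizes`.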
Note: The standard errors of the totals and the means are rather large: the
standard errors could be decreased by increasing the size of the sample and/or
by adopting more appropriate sampling techniques, detailed in later chapters.
Example 2.2
From the list of 600 households residing in 30 villages (Appendix IV), select a
simple random sample of 20 households with replacement, and, on the basis of the
data on the size of these 20 sample households, estimate for the whole universe
the total number of persons and the average household size, together with their
standard errors, and the 95 per cent confidence limits of the universe values.
Before the required sample can be drawn, we have to obtain a serial list of
all the 600 households. The households in the 30 villages need not actually be
re-numbered, but the purpose would be served by taking the cumulative totals of
the number of households from the 30 villages, given in Table 2.8. Starting with
the 3-digit numbers from where we had left off, and rejecting numbers between
601 and 999 as also 000, we get the following 20 acceptable random numbers:
585 348 39 84 70 18 451 433 504 226
317 366 72 101 551 538 518 359 377 29
Random number 585 refers to the first (585 - 584 = 1) household in the ninth
village in state III; random number 348 refers to the seventh (348 - 341 = 7)
household in the eighth village in state II; and so on. The selected sample of 20
households and the data on household size are given in Table 2.9.
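The lookup from a random number to a village and household can be sketched with the cumulative totals. This is illustrative only: the data structure and the `locate` function are hypothetical, while the household counts are those listed in Table 2.8.

```python
from bisect import bisect_left
from itertools import accumulate

# Number of households per village, by state (Table 2.8).
households = {
    "I": [17, 18, 26, 18, 24, 17, 20, 24, 24, 22],
    "II": [15, 22, 17, 19, 20, 19, 19, 23, 25, 23, 18],
    "III": [15, 21, 18, 21, 20, 21, 21, 17, 16],
}

villages = [(state, idx + 1, count)
            for state, counts in households.items()
            for idx, count in enumerate(counts)]
cumulative = list(accumulate(count for _, _, count in villages))


def locate(random_number):
    """Map a random number in 1..600 to (state, village no., household no. within village)."""
    pos = bisect_left(cumulative, random_number)   # first cumulative total >= the number
    state, village, _ = villages[pos]
    previous = cumulative[pos - 1] if pos else 0
    return state, village, random_number - previous


print(locate(585), locate(348))
```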
Here $N = 600$ and $n = 20$. The reader should verify the following sums, sums
of squares and products:
Therefore,
$$\bar{y} = \sum_{i=1}^{n} y_i / n = 102/20 = 5.1$$
The estimated total number of persons in the 600 households in the universe is,
from equation (2.43),
$$y_0^* = N\bar{y} = 600 \times 5.1 = 3060$$
The estimated variance of $y_0^*$ is, from equation (2.44),
State   Village   Number of households   Cumulative total
I 1 17 17
2 18 35
3 26 61
4 18 79
5 24 103
6 17 120
7 20 140
8 24 164
9 24 188
10 22 210
II 1 15 225
2 22 247
3 17 264
4 19 283
5 20 303
6 19 322
7 19 341
8 23 364
9 25 389
10 23 412
11 18 430
III 1 15 445
2 21 466
3 18 484
4 21 505
5 20 525
6 21 546
7 21 567
8 17 584
9 16 600
Table 2.9: Size of the srs of 20 households selected from 600 households in
30 villages
State no.
I II III
The estimated average household size is $\bar{y} = 5.1$ and the estimated standard
error of $\bar{y}$ is
$$s_{\bar{y}} = \sqrt{\frac{SS_{y_i}}{n(n-1)}} = 0.4407$$
The 95 per cent confidence limits of the universe average household size are
5.1 ± 2.09 x 0.4407 or 4.2 and 6.0.
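The limits can be reproduced from the quantities quoted in the example. This is a minimal sketch with illustrative variable names; the multiplier 2.09 is the value used in the text, and the limits for the universe total are obtained by scaling by $N$, since $y_0^* = N\bar{y}$.

```python
N = 600            # households in the universe
mean_size = 5.1    # estimated average household size
se_mean = 0.4407   # estimated standard error of the mean
t_value = 2.09     # multiplier used in the text for 95 per cent limits

lower = mean_size - t_value * se_mean
upper = mean_size + t_value * se_mean

# Limits for the total number of persons: scale the limits for the mean by N.
total_lower, total_upper = N * lower, N * upper

print(round(lower, 1), round(upper, 1))
print(round(total_lower), round(total_upper))
```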
Note: The total number and the serial numbering of households in each village
will not in general be known beforehand; also, in the above sampling scheme,
not all the villages will necessarily be represented. We shall see later how these
limitations can be overcome by using stratified multi-stage designs.
Example 2.3
From the list of 30 villages (Appendix IV), select a simple random sample of
4 villages with replacement, and from the data on the numbers of households
and persons, estimate for the universe the total numbers of persons and
households, the average numbers of households and persons per village, and the average
household size, together with their standard errors.
With the 30 villages in the 3 zones, it is easy to number these villages serially
from 1 to 30. Starting from where we had left off, and using two-digit random
numbers, we get the following: 19, 24, 18, 19, 15. In a “with replacement” sample,
we would have to take the first four indicated villages, with village number 19
occurring twice; however, although the sampling fraction 4/30 = 1/7.5 is over
10 per cent, we shall, for the sake of illustration, reject the repeated random
number 19, and select the next non-repetitive random number, 15. The sample
then becomes “without replacement”, but we shall treat it as if it were selected
with replacement. The required data for these sample villages are given in Table
2.10.
Sample    Village       Number of             Number of
village   serial no.    households ($h_i$)    persons ($y_i$)
1         19            25                    127
2         24            18                    105
3         18            23                    114
4         15            20                    105
Total                   86                    451
Mean                    $\bar{h} = 21.5$      $\bar{y} = 112.75$
Here $N = 30$, $n = 4$. The reader should verify the following sums of squares
and products: $\sum h_i^2 = 1878$; $SS_{h_i} = 29$; $\sum y_i^2 = 51175$; $SS_{y_i} = 324.75$;
$\sum h_i y_i = 9787$; $SP_{h_i y_i} = 90.5$.
From equations of the type (2.42) to (2.49), the estimated total number of
households in the 30 villages is
$$h_0^* = N\bar{h} = 30 \times 21.5 = 645$$
0
= 30 x 1.5055 = 45.16
$$y_0^* = N\bar{y} = 30 \times 112.75 = 3382.5$$
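The sums that the reader is asked to verify, and the ratio estimate of average household size, can be computed from the four sample villages. The sketch below is illustrative, with hypothetical variable names; the variance of the ratio follows the form of equation (2.49), with the number of households playing the role of the denominator variable.

```python
import math

N = 30                                # villages in the universe
households = [25, 18, 23, 20]         # h_i for the sample villages
persons = [127, 105, 114, 105]        # y_i for the sample villages
n = len(households)

h_bar = sum(households) / n
y_bar = sum(persons) / n

# Corrected sums of squares and products.
ss_h = sum(h * h for h in households) - sum(households) ** 2 / n
ss_y = sum(y * y for y in persons) - sum(persons) ** 2 / n
sp_hy = sum(h * y for h, y in zip(households, persons)) - sum(households) * sum(persons) / n

total_households = N * h_bar          # estimated total number of households
total_persons = N * y_bar             # estimated total number of persons

# Estimated average household size as a ratio of the two estimated totals.
r = total_persons / total_households
var_r = (ss_y + r ** 2 * ss_h - 2 * r * sp_hy) / (n * (n - 1) * h_bar ** 2)
se_r = math.sqrt(var_r)

print(ss_h, ss_y, sp_hy)
print(total_households, total_persons, round(r, 4), round(se_r, 4))
```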
Estimates of totals, means, ratios, etc. are often required for the study
variables for sub-divisions of the universe or domains of study, e.g. for
households in different occupation or income groups in household budget
surveys, or for different family structures (such as nuclear and extended
families) in social investigations. The general estimating equations for a
simple random sample apply equally here, the only difference being that
the study variable will take the sample value for units belonging to the
particular sub-universe, and will be considered to have a zero value otherwise.
Thus, the sub-universe total is given by
$$Y' = \sum_{i=1}^{N} Y'_i \qquad (2.54)$$
$$y'^{*}_{0} = \frac{N}{n} \sum_{i=1}^{n} y'_i \qquad (2.55)$$
2 , (2.56)
*Vo n(n — 1)
The covariance and ratio of two sample estimators $y'^{*}_{0}$ and $x'^{*}_{0}$ are
similarly defined, as also is the estimated variance of the ratio $y'^{*}_{0} / x'^{*}_{0}$.
From equation (2.43), the combined unbiased estimator of $Y'$ from a
simple random sample (with replacement) is
= (258)
(2.59)
is
\[ \bar{y}'_0 = y_0'^{*}/N' \qquad (2.61) \]
\[ \bar{y}'_0 = y_0'^{*}/n_0^{*} \qquad (2.62) \]
where \(n_0^*\) is the combined unbiased estimator of N'. For a simple random
sample, \(n_0^*\) is obtained on putting \(y_i = 1\) for the sample units which belong
to the sub-universe, namely,
where n' is the number of sample units that belong to the sub-universe.
From equations (2.57) and (2.64)
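\[ \bar{y}'_0 = \sum_{i=1}^{n'} y_i / n' \]
the sample average (the divisor being n' rather than n).
An estimator of the variance of \(\bar{y}'_0\) is
\[ s^2_{\bar{y}'_0} = \frac{\sum_{i=1}^{n'} (y_i - \bar{y}'_0)^2}{n'(n' - 1)} \qquad (2.65) \]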
For an example, see Example 2.4.
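The zero-fill device described above can be sketched in Python. The function name, the argument names and the small data set are hypothetical, and the variance formulas are the usual with-replacement forms assumed for equations (2.56) and (2.65), so this is an illustration rather than the book's own computation.

```python
def domain_estimates(values, in_domain, N):
    """Estimate a sub-universe total and mean from an srs (with replacement).

    values    : study-variable values for all n sample units
    in_domain : True where the sample unit belongs to the sub-universe
    N         : number of units in the universe
    Assumes at least two sample units fall in the sub-universe.
    """
    n = len(values)
    # Units outside the sub-universe are given the value zero.
    y_prime = [v if d else 0.0 for v, d in zip(values, in_domain)]
    total = N * sum(y_prime) / n
    mean_all = sum(y_prime) / n
    ss_all = sum((v - mean_all) ** 2 for v in y_prime)
    var_total = N ** 2 * ss_all / (n * (n - 1))

    # Domain mean: the divisor is n' (units in the domain), not n.
    n_dom = sum(1 for d in in_domain if d)
    dom_mean = sum(y_prime) / n_dom
    ss_dom = sum((v - dom_mean) ** 2 for v, d in zip(values, in_domain) if d)
    var_mean = ss_dom / (n_dom * (n_dom - 1))
    return total, var_total, dom_mean, var_mean


# Hypothetical sample of 6 units from a universe of 60 units.
values = [4, 6, 3, 5, 7, 2]
in_domain = [True, False, True, True, False, False]
total, var_total, dom_mean, var_mean = domain_estimates(values, in_domain, 60)
print(total, var_total, dom_mean, var_mean)
```

Note that the variance of the estimated total uses all n units (zeros included), while the variance of the domain mean uses only the n' units in the sub-universe.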
Estimates are often required of the total number or the proportion of the
units that possess a certain qualitative characteristic or attribute or fall
into a certain class. Thus, in a household survey, we might wish to know
the proportion that live in houses they own.
If N' is the number of units in the universe possessing the attribute,
then the universe proportion of the number with the attribute is
P = N’/N (2.66)
where N is the total number of units in the universe. If we define our study
variable so that it takes the value
\(Y_i = 1\), if the unit has the attribute; and
\(Y_i = 0\), otherwise,
then the universe total of the \(Y_i\)s is N', and the mean of the \(Y_i\)s is
\[ \bar{Y} = \sum_{i=1}^{N} Y_i/N = N'/N = P \qquad (2.67) \]
(2.70)
\[ \frac{\sigma_p}{P} = \frac{\sigma_{\bar{y}}}{P} = \sqrt{\frac{1 - P}{nP}} \qquad (2.72) \]
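An unbiased estimator of \(\sigma^2\) is
\[ s^2 = \frac{\sum^n (y_i - \bar{y})^2}{n - 1} = \frac{np - np^2}{n - 1} = \frac{np(1 - p)}{n - 1} \qquad (2.73) \]
and an unbiased estimator of the variance of p is
\[ s_p^2 = s_{\bar{y}}^2 = \frac{s^2}{n} = \frac{p(1 - p)}{n - 1} \qquad (2.74) \]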
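An estimator of the universe CV is
\[ \frac{s}{p} = \sqrt{\frac{(1 - p)}{p} \cdot \frac{n}{(n - 1)}} \qquad (2.75) \]
\[ \frac{s_p}{p} = \sqrt{\frac{1 - p}{(n - 1)p}} \qquad (2.76) \]
\[ n_0^* = Np \qquad (2.77) \]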
62 Chapter 2
Table 2.11: Size of sample households with TV satellite dishes in the srs of
20 households selected in Example 2.2
Zone no.
I II III
Village no. 2 5 6 8 6
Household no. 1 5 14 18 13
Size \(y_i\) 6 4 5 5 8
In large samples, the 100(1 - α) per cent probability limits of the universe
proportion P are provided by
\[ p \pm t'_{\alpha} s_p \qquad (2.79) \]
Note: The results of this section could be derived from those of the preceding
section by setting y't = 1, if the sample unit has the attribute, and zero otherwise.
Theoretical proofs are given in Appendix II, section A2.3.5.
Example 2.4
Example 2.5
In the sample of 20 households selected from the 600 households (example 2.2),
the numbers of males and females are given in Table 2.12. Estimate the universe
proportion of males to total population and its standard error.
The procedure of section 2.13 will be inappropriate in this case, for the sample
persons were not selected directly out of the universe of persons. However, we
shall illustrate both the appropriate and the inappropriate procedures.
Assuming that the 102 sample persons were selected directly from the universe
of persons, we apply the procedure of section 2.13, noting that n = 102, and r
(the number of males) = 52. Hence from equation (2.69) the estimated proportion
of males in the universe is p = r/n = 52/102 = 0.5098.
From equation (2.74) the estimated variance of the sample proportion p is
\(s_p^2 = p(1 - p)/(n - 1) = 0.00247430\), so that \(s_p = 0.04974\).
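The calculation under the (inappropriate) assumption of direct selection of persons can be sketched as follows. The function name `proportion_estimate` is an illustrative choice; the formulas are those of equations (2.69) and (2.74), i.e. p = r/n and s_p² = p(1 - p)/(n - 1).

```python
import math


def proportion_estimate(r, n):
    """Sample proportion, its estimated variance and standard error.

    r : number of sample units with the attribute
    n : sample size (srs with replacement assumed)
    """
    p = r / n
    var_p = p * (1 - p) / (n - 1)
    return p, var_p, math.sqrt(var_p)


# 52 males among the 102 sample persons.
p, var_p, se_p = proportion_estimate(52, 102)
print(round(p, 4), var_p, se_p)
```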
Table 2.12: Numbers of males and females in the srs of 20 households selected in
Example 2.2
Sex      Number in each of the 20 sample households     Total
Male     3 2 3 2 2 2 1 5 3 2 2 2 5 4 2 2 4 3 1 2        52
Female   3 2 2 2 2 2 2 4 2 3 3 2 3 3 1 2 3 5 0 4        50
SSy, = 73.8
and the estimated universe proportion of males is, from equation (2.48),
(2.80)
i.e. the mean of the group means.
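The overall mean \(\bar{y}\) is an unbiased estimator of the universe mean \(\bar{Y}\) with
variance \(\sigma^2/n\); so also the group means \(\bar{y}_j\) are each an unbiased estimator
of \(\bar{Y}\), but with variance \(\sigma^2/m\).
An unbiased estimator of \(\sigma^2_{\bar{y}}\) is
\[ s^2_{\bar{y}} = \sum_{j=1}^{k} (\bar{y}_j - \bar{y})^2 / k(k - 1) \qquad (2.82) \]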
Note: The variance and covariance estimators computed from the group means are
based on (k - 1) degrees of freedom, and are, therefore, less efficient than the
variance and covariance estimators obtained respectively from equations (2.44)
and (2.47), which are based on (n - 1) degrees of freedom.
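As a rough illustration of the group-means estimator, the sketch below computes the mean of the group means and its estimated variance with divisor k(k - 1). The function name and the three small groups are hypothetical; equal group sizes are assumed, as in the text.

```python
def group_mean_variance(groups):
    """Mean of the group means and its variance estimate from k groups.

    groups : list of k equal-sized lists of sample values
    Returns (overall_mean, estimated_variance_of_overall_mean).
    """
    k = len(groups)
    means = [sum(g) / len(g) for g in groups]
    overall = sum(means) / k
    # Only the k group means are used, hence (k - 1) degrees of freedom.
    variance = sum((m - overall) ** 2 for m in means) / (k * (k - 1))
    return overall, variance


# Three hypothetical sub-samples of size 3 each.
groups = [[4, 6, 5], [7, 5, 6], [3, 5, 4]]
overall, variance = group_mean_variance(groups)
print(overall, variance)
```

The attraction of this form is computational: only the k group means are needed, at the cost of fewer degrees of freedom.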
Example 2.6
Further reading
Chaudhuri and Stenger, chapter 1; Cochran, chapters 2 and 3; Deming (1950),
chapter 4; Hansen et al. (1953), Vols. I and II, chapter 4; Hedayat and Sinha,
chapters 1 and 2; Kish (1965), chapter 2; Murthy (1967), chapter 2; Pathak
(1988); Sukhatme et al. (1984), chapter 2; Yates, sections 6.1-6.4, 6.9, 7.1-7.4
and 7.8.
Exercises
Ignoring the finite sampling correction, estimate from the sample (a) the
total number of cattle in the province and (b) the average number of cattle
per farm, along with their standard errors and the 95% confidence limits.
(United Nations (1972), Manual, Example 1).
3. From village number 8 in zone 1 (Appendix IV) draw a simple random
sample of 5 households without replacement. Obtain estimates of the same
characteristics (other than income) for the village as a whole as in Example
2.1, and compare the results.
4. From the list of 30 villages (Appendix IV) draw an srs of 4 villages with
replacement and obtain estimates of the same universe characteristics as in
Example 2.3. Compare your results with those in Example 2.3.
5. In a simple random sample of 50 households drawn with replacement from
a total of 250 households in a village, only 8 were found to possess
transistor radios. These households had respectively 3, 5, 3, 4, 7, 4, 4, and 5
members. Estimate the total number of households that possess transistor
radios and the total number of persons in these households, along with
their standard errors (Chakravarti, Laha, and Roy, Example 3.1, modified
to “with replacement” scheme).
6. A survey was organized to determine the incidence of HIV (human
immunodeficiency virus) seroconversion among first-time blood donors at a blood
bank in a large city. During a particular month, 180 first-time donors gave
blood there, of whom 175 were tested seronegative: a simple random sample
of 60 persons was selected without replacement out of the seronegative
first-time donors and were asked to give blood again after a year, when 9 were
found to be seropositive. Estimate the incidence of HIV seroconversion and
its standard error (Levy and Lemeshow, Exercise 7.10, adapted).
7. Show that in a simple random sample, unbiased estimators of the universe
mean and universe total of a study variable have the same coefficient of
variation, and that the CV of the estimated number of units possessing a
certain attribute is the same as that of the estimated proportion of such
units in the universe.
8. Table 2.14 gives the number of persons belonging to 43 kraals which form a
random sample of 325 kraals in the Mondora Reserve in Southern Rhodesia
(now Zimbabwe) and also the numbers of persons absent from these kraals.
(a) Estimate (i) the total number of persons belonging to the reserve,
(ii) the number absent from the reserve, and (iii) the proportion of
persons absent, and their standard errors.
(b) What would be the estimated standard error of the proportion of
persons absent had the sample been a sample of individuals selected
directly from all the persons belonging to the reserve? (Yates,
Examples 6.9b and 7.8b, modified to “with replacement” sampling for
computation of variance; note the differences in notation between
Yates’ book and this book).
y X y X y X y X
95 18 89 7 75 12 159 36
79 14 57 9 69 16 54 26
30 6 132 26 63 9 69 27
45 3 47 7 83 14 61 2
28 5 43 17 124 25 164 69
142 15 116 24 31 3 132 41
125 18 65 16 96 45 82 10
81 9 103 18 42 25 33 8
43 12 52 16 85 35 86 22
53 4 67 27 91 28 51 19
148 31 64 12 73 13
Total 3427 799
9. Table 2.15 shows the number of persons (\(y_i\)), the weekly family income (\(x_i\)),
and the weekly expenditure on food (\(w_i\)) in a simple random sample of 33
families selected from 660 families. Estimate from the sample (a) the total
number of persons, (b) the total weekly family income, (c) the total weekly
expenditure on food, (d) the average number of persons per family, (e)
the average weekly expenditure on food per family, (f) the average weekly
expenditure on food per person, (g) the average weekly income per person,
and (h) the proportion of income that is spent on food, along with their
standard errors (adapted from Cochran, Example, pp. 33-34).
The following raw sums of squares and products are given: \(\sum y_i^2\) =
533; \(\sum x_i^2\) = 177,524; \(\sum w_i^2\) = 28,224; \(\sum y_i x_i\) = 8889; \(\sum y_i w_i\) = 3595.3;
\(\sum x_i w_i\) = 66,678.0.
Table 2.15: Size, weekly income, and food cost of the srs of 33 families
selected from 660 families
3.1 Introduction
If, however, the values of the ancillary variable z are available for all the
units of the universe, then
\[ Z = \sum_{i=1}^{N} z_i \qquad (3.2) \]
the universe total for the ancillary variable is also known, and this
information can be utilized to improve the efficiency of the estimators relating
to the study variable.
72 Chapter 3
the ratio of the estimators of the universe totals of the study and the
ancillary variables is taken,
*
= K + ~ ^Ry/z^z^/Z2
a2 (3.9)
Vr
an estimator of which is
(ry/zsz* ~ sy
*
)/^
z (3.12)
3. The variance estimator of \(y_R^*\) is generally valid for large samples, say when
n > 30.
4. Correction for bias. If the sample is drawn in the form of k independent
sub-samples, each of the same size m (so that n = mk) and giving unbiased
estimators \(y_j^*\) and \(z_j^*\) (j = 1, 2, ..., k) of the universe totals Y and Z
respectively, two estimators of the universe ratio Y/Z are
\[ r = y_0^*/z_0^* \qquad (3.13) \]
and
\[ r' = \sum^{m} r_j/m \qquad (3.14) \]
Again, with the selected sample in the form of m independent and
interpenetrating sub-samples, a ratio estimator of Y corrected for bias (vide
T.J. Rao) is
»r = 1r + (»» - - 1) (3.17)
where
\[ y_R^* = Z \cdot r_{y/z} \qquad (3.18) \]
For a simple random sample of n units out of the universe of N units, the
ratio estimator of the universe total Y, corrected for bias, is
\[ y_R^{*\prime} = y_R^* + (y_0^* - \bar{r} z_0^*)/(n - 1) = y_R^* + N(\bar{y} - \bar{r}\bar{z})/(n - 1) \qquad (3.19) \]
where
\[ \bar{r} = \sum^{n} r_i/n = \sum^{n} (y_i/z_i)/n \qquad (3.20) \]
We shall, however, see later in Parts II-IV that in stratified and multistage
sampling, ratio estimators may be applied at different levels of aggregation.
6. In a large srs, the ratio estimator would be more efficient than the simple
unbiased estimator if the correlation coefficient between the study
variable and the ancillary variable is greater than 1/2(CV of the ancillary
variable/CV of the study variable). For example, if the ancillary variable
represents the value of the study variable at some time past (such as the
population in the previous census while estimating the current population),
the two CVs could be taken as about equal and the ratio estimator would
be superior if the correlation coefficient exceeds 1/2. On the other hand,
suppose that the CV of the ancillary variable is double that of the study
variable: the ratio estimator would then be less precise, for the correlation
coefficient cannot exceed 1.
7. With a negative correlation between the study and the ancillary variables,
the ratio method of estimation will be extremely inefficient: in that case,
another method, the product method of estimation, may be used (see,
Cochran, section 6.21; Murthy (1967), section 10.8; Singh and Chaudhary,
section 6.10; and Sukhatme et al. (1984), section 5.11).
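The efficiency condition in note 6 can be seen from a short comparison of relative variances. The derivation below is a sketch under the usual large-sample approximation for an srs with replacement; the symbols \(\rho\), \(CV_y\) and \(CV_z\) (universe correlation coefficient and coefficients of variation of the study and ancillary variables) are notation assumed here for illustration.

```latex
\begin{aligned}
\frac{V(y_0^*)}{Y^2} &= \frac{CV_y^2}{n}, \\
\frac{V(y_R^*)}{Y^2} &\approx \frac{1}{n}\left(CV_y^2 + CV_z^2 - 2\rho\, CV_y\, CV_z\right), \\
V(y_R^*) < V(y_0^*) &\iff CV_z^2 < 2\rho\, CV_y\, CV_z
  \iff \rho > \frac{1}{2}\,\frac{CV_z}{CV_y}.
\end{aligned}
```

With \(CV_z = CV_y\) the condition is \(\rho > 1/2\); with \(CV_z = 2\,CV_y\) it would require \(\rho > 1\), which is impossible, in agreement with the two cases discussed in note 6.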
Example 3.1
For the same data as for Example 2.3, where a simple random sample of 4
villages was taken from 30 villages, use the information given in Table 3.1 on the
Ratio and Regression Estimators 75
population data of a census, conducted five years previously, to obtain ratio
estimates of the total numbers of households and persons in the 30 villages. The
total population of the 30 villages in the previous census was 2815.
Here N = 30; n = 4; Z = 2815. The reader should verify the following:
\(\sum z_i\) = 419; \(\bar{z}\) = 104.75; \(\sum z_i^2\) = 44,301; \(\sum y_i z_i\) = 47,597; \(\sum h_i z_i\) = 9102;
and the corrected sums of squares and products: \(SSz_i\) = 410.75; \(SPy_i z_i\) = 354.75;
\(SPh_i z_i\) = 93.50. The other required computations have already been made for
Example 2.3.
From equation (3.5), \(r_{y/z} = \sum y_i/\sum z_i = 451/419 = 1.076372\), so that the
ratio estimate of the current total population, using the previous census
population, is, from equation (3.6),
\[ y_R^* = Z \cdot r_{y/z} = 2815 \times 1.076372 = 3029.99 \text{ or } 3030 \]
The estimated variance of \(y_R^*\) is, from equation (3.11), \(30^2 \times 3.079132\), so that
the estimated standard error of \(y_R^*\) is 30 x 1.755 = 52.650, and the estimated CV
of \(y_R^*\) is 52.65/3030 or 1.74 per cent.
Similarly, the ratio estimate of the total number of households is
\(h_R^* = Z \cdot r_{h/z}\) = 2815 x 0.205251 = 577.78 or 578, with estimated standard error of 24.345,
and estimated CV of 4.21 per cent.
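The ratio estimates of this example can be reproduced, to rounding, with the sketch below. It assumes that the variance estimator of equation (3.11) has the usual large-sample form N²(SSy + r²SSz - 2r SPyz)/n(n - 1) for sampling with replacement; the function name `ratio_estimate` is illustrative only.

```python
import math

# Summary figures quoted for Examples 2.3 and 3.1.
N, n, Z = 30, 4, 2815
sum_y, sum_h, sum_z = 451, 86, 419
ss_y, ss_h, ss_z = 324.75, 29.0, 410.75
sp_yz, sp_hz = 354.75, 93.5


def ratio_estimate(sum_u, ss_u, sp_uz):
    """Ratio estimate of a universe total and its estimated standard error.

    Assumed variance form: N^2 (SSu + r^2 SSz - 2 r SPuz) / (n (n - 1)).
    """
    r = sum_u / sum_z                      # sample ratio
    total = Z * r                          # ratio estimate of the total
    variance = N ** 2 * (ss_u + r ** 2 * ss_z - 2 * r * sp_uz) / (n * (n - 1))
    return r, total, math.sqrt(variance)


r_y, y_ratio, se_y = ratio_estimate(sum_y, ss_y, sp_yz)   # persons
r_h, h_ratio, se_h = ratio_estimate(sum_h, ss_h, sp_hz)   # households
print(r_y, y_ratio, se_y)
print(r_h, h_ratio, se_h)
```

Small differences in the last digits against the printed figures are to be expected, since the text rounds intermediate quantities.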
Note the following points:
1. The estimated average household size, using the ratio estimates, is the ratio
estimate of the current total population divided by the ratio estimate of
the total number of households, i.e. 3030 ÷ 578 = 5.24, the same figure as
obtained by using simple unbiased estimates of totals in Example 2.3 (note
5 of section 3.2).
2. The ratio estimates of the total numbers of persons and households have
smaller standard errors than the unbiased estimates (in Example 2.3) and
the ratio estimates are closer to the universe values.
3. Correction for bias. To obtain ratio estimates of totals corrected for bias,
we need the additional computations shown in Table 3.2.
The ratio estimate of the total current population, corrected for bias,
is, from equation (3.19),
Table 3.2: Computation of the ratio estimates, corrected for bias: data of
Example 3.1
(3.27)
n(n — 2)
Notes
1. The regression method of estimation is more general than the ratio method:
the two become equivalent only when the (linear) regression of y on z passes
through the origin (0,0). The ratio method is, however, simpler to compute
and is preferred when the regression line is expected to pass through the
origin, i.e. when y is expected to be proportional to z.
2. Although the regression method is more general than the ratio method, it
is not much used in practice in large-scale surveys for two reasons: first,
the computation of the estimates is more complex, and secondly, the gain
in efficiency is not very marked in many cases as the regression line passes
either through the origin or very close to it.
3. When the regression of y on z is perfectly linear, i.e. when the correlation
coefficient p = 1, the variance of the regression estimator becomes zero;
and when y and z are uncorrelated, the variance is the same as that of the
unbiased estimator.
4. In some textbooks, in the formula for the estimated variance of the
regression estimator, the divisor n(n - 1) is used instead of n(n - 2). Although in
large samples the difference in these two estimators is negligible, the latter
divisor has been suggested here, as in the regression method two
estimators - the y-intercept on z and the (linear) regression coefficient of y on z
- are to be computed from the sample, leading to the loss of two degrees
of freedom; it is also standard in regression theory and is known to give an
unbiased estimator of the error variance in the universe regression equation
if the universe is infinite and the regression is linear (Cochran, section 7.3).
5. In practice, the estimated variance of the regression estimator, namely,
\(s^2_{y^*_{Reg}}\), can be greater than that of the ratio estimator, namely, \(s^2_{y^*_R}\); in
fact, the greatest value the ratio \(s^2_{y^*_{Reg}}/s^2_{y^*_R}\) can take is (n - 1)/(n - 2),
when the y-intercept on z happens to equal zero (that is, the regression
line of y on z passes through the origin), so that \(\beta^* = r_{y/z}\) and
the numerators of the variance formulae (3.11) and (3.27) are identical.
6. If the regression of y on z is non-linear, the regression estimator is subject
to a bias of the order 1/n, so that the ratio of the bias to the standard
error is small for large samples. The bias is equal to -covariance (\(\beta^*\), \(z_0^*\)).
Example 3.2
For the same data as for Example 3.1, obtain the regression estimates of the
total numbers of households and persons, using the data on the previous census
population, along with their standard errors. Here N = 30; n = 4; \(y_0^*\) = 3382.5;
\(h_0^*\) = 645; \(z_0^*\) = 30 x 104.75 = 3142.5; Z = 2815; \(SSy_i\) = 324.75; \(SSh_i\) = 29;
\(SSz_i\) = 410.75; \(SPy_i z_i\) = 354.75; \(SPh_i z_i\) = 93.5.
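A possible way of carrying out the computation is sketched below. It assumes the usual forms of the regression estimator of a total, y0* + b(Z - z0*) with b = SPyz/SSz, and of its variance estimator, N²(SSy - SPyz²/SSz)/n(n - 2), the divisor n(n - 2) being the one discussed in note 4. The function name is illustrative and the printed results are those of this sketch, not figures quoted from the text.

```python
import math

# Summary figures quoted for Example 3.2.
N, n, Z = 30, 4, 2815
y0, h0, z0 = 3382.5, 645.0, 3142.5
ss_y, ss_h, ss_z = 324.75, 29.0, 410.75
sp_yz, sp_hz = 354.75, 93.5


def regression_estimate(u0, ss_u, sp_uz):
    """Regression estimate of a universe total and its standard error.

    Assumed forms: total = u0 + b (Z - z0), b = SPuz / SSz,
    variance = N^2 (SSu - SPuz^2 / SSz) / (n (n - 2)).
    """
    b = sp_uz / ss_z
    total = u0 + b * (Z - z0)
    variance = N ** 2 * (ss_u - sp_uz ** 2 / ss_z) / (n * (n - 2))
    return b, total, math.sqrt(variance)


b_y, y_reg, se_y_reg = regression_estimate(y0, ss_y, sp_yz)   # persons
b_h, h_reg, se_h_reg = regression_estimate(h0, ss_h, sp_hz)   # households
print(b_y, y_reg, se_y_reg)
print(b_h, h_reg, se_h_reg)
```

Because the sample total of z (3142.5) exceeds the known universe total (2815), both regression estimates are adjusted downwards from the simple unbiased estimates.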
Further reading
Cochran, chapters 6 and 7; Deming (1950), chapter 4; Foreman, chapter 4; Hansen
et al. (1953), vol. I, section 4C and 11.2, and vol. II, sections 4.11-4.19; Hedayat
and Sinha, chapter 6; Kish (1965), chapter 2; Murthy (1967), chapters 10 and 11;
Rao, P.S.R.S. (1988); Singh and Chaudhary, chapters 6 and 7; Sukhatme et al.
(1984), chapters 5 and 6; Thompson, S.K., chapters 7 and 8; Yates, sections 6.8,
6.9, 6.12, 7.8 and 7.12.
Exercises
1. A simple random sample of 2055 (= n) farms, drawn from the universe of
75,308 (= N) farms to obtain the total number of cattle, gave the following
data:
Sample total number of cattle,
\(\sum y_i\) = 25,751
Sample total area of the 2055 farms,
\(\sum z_i\) = 62,989 acres
Actual total area of the 75,308 farms,
Z = 2,353,365 acres
The corrected sums of squares and products were
Obtain the ratio estimate of the total number of cattle, along with the
standard error, and compare the results with those of the unbiased estimate (United
Nations Manual, Example 7.i).
Systematic Sampling
4.1 Introduction
82 Chapter 4
(4.1)
(4.2)
Table 4.1: Serial numbering of the universe units showing the systematic
samples of n units from N (= nk) units
Row    Sample 1       Sample 2       ...   Sample r       ...   Sample k
1      1              2              ...   r              ...   k
2      1 + k          2 + k          ...   r + k          ...   2k
...
j      1 + (j - 1)k   2 + (j - 1)k   ...   r + (j - 1)k   ...   jk
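Noting that
\[ \frac{1}{nk}\sum_{r=1}^{k}\sum_{j=1}^{n}(y_{rj} - \bar{Y})^2
   = \frac{1}{k}\sum_{r=1}^{k}(\bar{y}_r - \bar{Y})^2
   + \frac{1}{nk}\sum_{r=1}^{k}\sum_{j=1}^{n}(y_{rj} - \bar{y}_r)^2 \qquad (4.3) \]
\[ \sigma^2 = \sigma_b^2 + \sigma_w^2 \qquad (4.4) \]
where \(\sigma_b^2\) is the between-sample variance and \(\sigma_w^2\) is the within-sample
variance, defined respectively by the two terms on the right-hand side of
equation (4.3), then the sampling variance of the sample mean is
\[ \sigma_{\bar{y}}^2 = \sigma^2 - \sigma_w^2 \qquad (4.5) \]
The sampling variance of the sample mean can also be expressed in
terms of the intraclass correlation coefficient between pairs of sample units
in a column
\[ \rho_c = \frac{\sum_{r=1}^{k}\sum_{j \ne j'}(y_{rj} - \bar{Y})(y_{rj'} - \bar{Y})}{kn(n - 1)\sigma^2} \qquad (4.6) \]
so that
\[ \sigma_{\bar{y}}^2 = \frac{\sigma^2}{n}\,[1 + (n - 1)\rho_c] \qquad (4.7) \]
Since \(\sigma^2 \ge \sigma_w^2\), we note that \(\rho_c\) must lie between -1/(n - 1) and 1.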
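The relations (4.3) to (4.7) can be checked numerically by enumerating all k possible systematic samples of a small universe. The sketch below does this for a hypothetical universe of 12 values with k = 3; the function name and the data are assumptions made for illustration, and N is taken to be exactly nk.

```python
def systematic_decomposition(universe, k):
    """Return (sigma2, sigma_b2, sigma_w2, rho_c) for the k systematic samples."""
    N = len(universe)
    n, remainder = divmod(N, k)
    if remainder:
        raise ValueError("this sketch assumes N = n * k exactly")
    # Sample r (r = 0..k-1) takes units r, r + k, r + 2k, ... (a column of Table 4.1).
    samples = [universe[r::k] for r in range(k)]
    grand_mean = sum(universe) / N
    sample_means = [sum(s) / n for s in samples]

    sigma2 = sum((v - grand_mean) ** 2 for v in universe) / N
    sigma_b2 = sum((m - grand_mean) ** 2 for m in sample_means) / k
    sigma_w2 = sum((v - m) ** 2
                   for s, m in zip(samples, sample_means) for v in s) / N

    # Intraclass correlation between pairs of units in the same sample.
    cross = sum((a - grand_mean) * (b - grand_mean)
                for s in samples
                for i, a in enumerate(s)
                for j, b in enumerate(s) if i != j)
    rho_c = cross / (k * n * (n - 1) * sigma2)
    return sigma2, sigma_b2, sigma_w2, rho_c


universe = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12, 14, 13]   # hypothetical values
k = 3
n = len(universe) // k
sigma2, sigma_b2, sigma_w2, rho_c = systematic_decomposition(universe, k)
print(sigma2, sigma_b2, sigma_w2, rho_c)
```

The between-sample variance returned here is the sampling variance of the systematic sample mean, so it should equal both sigma2 - sigma_w2 and (sigma2/n)[1 + (n - 1)rho_c].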
Systematic Sampling 85
Example 4.1
(a) From village number 8 in state I (Appendix IV), select a circular systematic
sample of 5 households and for the 24 households in the village obtain
estimates of the total number of persons, total monthly income and average
daily income per person; also obtain estimates of standard errors of these
estimates, assuming the data came from a simple random sample.
(b) Draw with a fresh random start a second circular systematic sample of 5
households from the village, and obtain estimates with standard errors for
the same characteristics.
(c) From the two circular systematic samples, obtain combined estimates of
the characteristics along with their standard errors.
Table 4.2: Size and daily income of the circular systematic sample of 5
households from 24 households in village no. 8 in state I (Appendix IV)
(a) The random start (i.e. the random number between 1 and 24) chosen was
14; the sampling interval is 24/5 or 5 to the nearest integer, so that the
five sample households are those with serial numbers 14, (14 + 5) = 19,
(19 + 5) = 24, 5 and (5 + 5) = 10. The data for these five households are
given in Table 4.2.
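The selection in part (a) can be written as a small routine. This is a minimal sketch: the function name `circular_systematic` is hypothetical, and the interval is taken as N/n rounded to the nearest integer, as in the example.

```python
def circular_systematic(N, n, start, interval=None):
    """Serial numbers (1..N) of a circular systematic sample.

    N        : number of units in the universe
    n        : sample size
    start    : random start between 1 and N
    interval : sampling interval; defaults to N/n to the nearest integer
    """
    if interval is None:
        interval = int(N / n + 0.5)        # 24/5 = 4.8 -> 5
    # Count round the list as if it were circular: after unit N comes unit 1.
    return [(start - 1 + j * interval) % N + 1 for j in range(n)]


sample = circular_systematic(N=24, n=5, start=14)
print(sample)   # [14, 19, 24, 5, 10]
```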
Here N = 24 and n = 5 so that f = n/N = 0.2083. Hence we have
\(\sum y_i = 18\); \(\bar{y} = \sum y_i/n = 18/5 = 3.6\); \(\sum x_i = \$193\); \(\bar{x} = \sum x_i/n = \$193/5 = \$38.6\);
\(\sum y_i^2 = 74\); \(SSy_i = 9.2\); \(\sum x_i^2 = 7989\); \(SSx_i = 539.2\);
\(\sum y_i x_i = 764\); \(SPy_i x_i = 69.2\).
The estimated total number of persons in the village is given by
\[ y_0^* = N\bar{y} = 24 \times 3.6 = 86.4 \text{ or } 86 \]
The estimated variance of \(y_0^*\) is, from equation (2.51),
\[ s^2_{y_0^*} = (1 - f)N^2 SSy_i/n(n - 1) = 0.79167 \times 24^2 \times 9.2/20
   = 0.79167 \times 24^2 \times 0.46 = 24^2 \times 0.364168 \]
Table 4.3: Size and daily income of the second circular systematic sample
of 5 households selected from 24 households in village no. 8 in state I
(Appendix IV)
$1084.8/100.8 = $10.76
Exercise
Table 4.4: Area under wheat of the two linear systematic samples of 4
villages each from 24 villages in an area
5.1 Introduction
92 Chapter 5
1 a 8 8/30
2 b 6 6/30
3 c 3 3/30
4 d 5 5/30
5 e 4 4/30
6 f 4 4/30
Total - 30 1
is
\[ y_i^* = y_i/\pi_i \qquad (5.1) \]
If the values yi of all the universe units were known before sampling and
sampling is carried out with probability proportional to yi, i.e.
\[ \pi_i = y_i \Big/ \sum_{i=1}^{N} y_i \qquad (5.2) \]
\[ z_i = \beta y_i \]
remains the same, and would give the same results as probability
proportional to the value of the study variable, i.e. there would be no sampling
error.
The foregoing gives the clue to the determining factor for selection with
pps. We cannot, of course, know the actual “sizes” of the study variable,
but if we can find an ancillary variable, the values of which are known to
be roughly proportional to the values of the study variable, then we may
select the units with probability proportional to the values of the ancillary
variable in order to obtain estimators with greater efficiency than those
obtained from srs. The ancillary variable chosen should be such that its
values are known prior to sampling and the two are linearly related with
the regression line passing through the origin (0,0). If there is a perfect
positive correlation between the study and the ancillary variables but the
regression line does not pass through the origin (0,0), sampling with pps of
the ancillary variable will not necessarily be more efficient than srs.
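\[ y_i^* = \frac{y_i}{\pi_i} = Z\,\frac{y_i}{z_i} \qquad (5.1) \]
and where \(Z = \sum^{N} z_i\).
By the fundamental theorem of section 2.7, a combined unbiased
estimator of Y is
\[ r = y_0^*/x_0^* \qquad (5.9) \]
and an estimator of the variance of r is
\[ s_r^2 = (s^2_{y_0^*} + r^2 s^2_{x_0^*} - 2r s_{y_0^* x_0^*})/x_0^{*2} \]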
(b) Unordered (Murthy) estimator. If \(y_1\) and \(y_2\) are the values of the two
sample units selected with pps and without replacement with
probabilities of selection \(\pi_1\) and \(\pi_2\) respectively, then an unbiased unordered
estimator of Y is
\[ \frac{1}{2 - \pi_1 - \pi_2}\left[(1 - \pi_2)\frac{y_1}{\pi_1} + (1 - \pi_1)\frac{y_2}{\pi_2}\right] \]
With the advent of electronic computers and the use of sample
designs with two sample first-stage units in each stratum (after the
strata have been formed in a desirable manner, to be detailed later),
the unordered estimator is being employed increasingly. For detailed
methods of ppswor, see the references under “Further reading” at the
end of the chapter and consult a sampling statistician.
The unordered estimator is more efficient than the ordered, and
both are more efficient than the unbiased pps “with replacement”
estimator.
4. Note that the selection of the sample units with probability proportional
to “size” is equivalent to the selection with probability proportional to the
ratios of the sizes to their (a) total or (b) average. For in the latter cases,
the selection probabilities are
(a) \(\pi_i = \dfrac{z_i/Z}{\sum^{N}(z_i/Z)} = \dfrac{z_i}{Z}\);  (b) \(\pi_i = \dfrac{z_i/\bar{Z}}{\sum^{N}(z_i/\bar{Z})} = \dfrac{z_i}{Z}\)
Thus in pps sampling it is not necessary to know the “sizes” if the ratios
of these sizes to their total or average are known.
5. pps sampling can be made with a suitable function of the value of an
ancillary variable. For example, in surveys on fruit count, the selection of
branches with probability proportional to the fourth power of the girth
may be more efficient than ppg³, ppg², or ppg (“g” indicating girth) or
srs (Murthy (1967), Section 6.6); in a stratified two-stage design, selection
with probability proportional to the square root of the number of ssu’s will
be reasonably close to the optimum and more efficient than probability
proportional to the number of ssu’s or srs if the costs vary substantially
with both the number of fsu’s and the number of ssu’s per fsu (Hansen et
al. (1953), vol. I, section 8.14, vol. II, section 8.3).
Example 5.1
From village number 8 in state I (Appendix IV), given the sizes of the 24
households in the village, select 5 households with probability proportional to size (and
Varying Probability Sampling 99
Household   Size   Cumulative      Household   Size   Cumulative
no.                size            no.                size
 1          5        5             13          6       63
 2          3        8             14          4       67
 3          3       11             15          5       72
 4          7       18             16          6       78
 5          4       22             17          3       81
 6          4       26             18          5       86
 7          6       32             19          1       87
 8          6       38             20          3       90
 9          4       42             21          5       95
10          5       47             22          6      101
11          3       50             23          4      105
12          7       57             24          4      109
with replacement) and from the data on total monthly income and food cost of
these 5 households, obtain for the whole village estimates of the total daily
income, the total daily food cost, and the proportion of income spent on food with
their standard errors.
In this example, we shall illustrate the selection of the pps sample by
cumulating the sizes. The cumulative sizes of the 24 households are shown in Table
5.2. As the total size of the 24 households is 109, we choose random numbers
between 1 and 981 (the highest three digit number which is a multiple of 109);
if the three-digit random number is greater than 109, it is divided by 109, and
the remainder taken. The first three-digit random number is 213, which leaves
a remainder of 104 when divided by 109; from Table 5.2, this random number is
seen to be greater than the cumulative size up to the 22nd household (101), but
less than the cumulative size up to the 23rd household (105): therefore the 23rd
household is selected.
The procedure of selection of five sample households with pps is shown in
Table 5.3.
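The cumulative-total method just described can be sketched as follows. The function name `select_by_cumulative_total` is hypothetical, and the treatment of a zero remainder (read as 109, the total size) is an assumption made so that every household keeps a chance proportional to its size.

```python
from bisect import bisect_left
from itertools import accumulate

# Sizes of the 24 households (Table 5.2).
sizes = [5, 3, 3, 7, 4, 4, 6, 6, 4, 5, 3, 7,
         6, 4, 5, 6, 3, 5, 1, 3, 5, 6, 4, 4]
cumulative = list(accumulate(sizes))       # last value is the total size, 109


def select_by_cumulative_total(random_number):
    """Return (household no., remainder) for one three-digit random number."""
    total = cumulative[-1]
    remainder = random_number % total or total   # a remainder of 0 is read as 109
    # First household whose cumulative size is not less than the remainder.
    household = bisect_left(cumulative, remainder) + 1
    return household, remainder


random_numbers = [213, 290, 953, 908, 464]
selected = [select_by_cumulative_total(r)[0] for r in random_numbers]
print(selected)   # [23, 15, 17, 8, 7]
```

Random numbers above 981 would be rejected before this step, as explained in the text.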
The data on the selected five households and the required computations are
shown in Table 5.4.
Here N = 24; n = 5. The probability of selection of a household is
\[ \pi_i = y_i \Big/ \sum^{N} y_i = y_i/109 \]
Random number   Remainder on division by 109   Household selected
213             104                            23
290              72                            15
953              81                            17
908              36                             8
464              28                             7
where \(y_i\) is the known size of the ith household. The corrected sums of squares
and products are:
\(SSx_i^* = 109^2 \times 2.451704\)
\(SSw_i^* = 109^2 \times 3.7\)
\(SPx_i^* w_i^* = 109^2 \times 1.91822\)
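An unbiased estimate of the total daily income in the 24 households is, from
estimating equation (5.3), \(x_0^* = \sum^{n} x_i^*/n = \$1210.26\), and that of the total
daily food cost is
\[ w_0^* = \sum^{n} w_i^*/n = 109 \times 5.4 = \$588.6 \]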
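Table 5.4 column headings: household serial no.; household sample no.; size
\(y_i\); selection probability \(\pi_i = y_i/109\); total daily income \(x_i\); total daily
food cost \(w_i\); \(x_i^* = x_i/\pi_i\); \(w_i^* = w_i/\pi_i\); squares and products of \(x_i^*\) and \(w_i^*\).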
\[ r = w_0^*/x_0^* = \$588.6/\$1210.26 = 0.4863 \text{ or } 48.63 \text{ per cent.} \]
The estimated variance, \(s_r^2\), of r is, from equation (5.10),
Example 5.2
For the thirty villages listed in Appendix IV, the population data obtained from
a census conducted five years previously are available and given in Table 5.5.
Select four sample villages with probability proportional to their previous census
population, and on the basis of the current population and number of households
in these four sample villages, obtain for the thirty villages estimates of the current
total population, number of households, and average household size, along with
their standard errors.
We shall follow the second procedure for selecting the pps sample which does
not require cumulation of the sizes. As there are 30 villages and the maximum
previous census population in any village is 122, we take five-digit random
numbers, the first two digits referring to the village serial number, and the last three
digits to the village census population. The first five-digit number is 06 733; the
last three digits, divided by 122, leave a remainder of 1, which is less than the
census population of village serial number 6, namely, 65, so this village is selected.
The second five-digit random number is 65 511; the first two digits, on division
by 30, leave a remainder of 5, and the last three digits, on division by 122, leave
a remainder of 23, which is less than the census population of village serial
number 5, namely 92, so this village too is selected. The third five-digit random
number is 01 932; the last three digits, on division by 122, leave a remainder
of 78, which is greater than the census population of village serial number 1,
namely, 69, so this random number is rejected. We continue in this manner until
four villages have been selected, as shown in Table 5.6.
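The accept/reject procedure described above can be sketched in a few lines. The function name `select_without_cumulation` is hypothetical; two conventions are assumptions of this sketch rather than statements of the text: a zero remainder is read as the divisor itself (30 or 122), and a remainder equal to the village size leads to acceptance.

```python
# Previous census populations of the 30 villages, in serial order (Table 5.5).
populations = [69, 81, 110, 80, 92, 65, 72, 108, 106, 80,
               72, 102, 73, 84, 98, 84, 85, 102, 122, 102, 86,
               78, 112, 97, 117, 106, 115, 110, 104, 103]


def select_without_cumulation(random_pairs, sizes, n):
    """pps selection (with replacement) that needs no cumulation of sizes.

    random_pairs : (first two digits, last three digits) of each random number
    sizes        : size measure of each unit, in serial order
    n            : number of units to select
    """
    N, max_size = len(sizes), max(sizes)
    selected = []
    for first, last in random_pairs:
        unit = first % N or N                  # village serial number, 1..N
        draw = last % max_size or max_size     # number between 1 and max_size
        if draw <= sizes[unit - 1]:            # accept, otherwise reject the pair
            selected.append(unit)
            if len(selected) == n:
                break
    return selected


pairs = [(6, 733), (65, 511), (1, 932), (71, 508), (48, 222)]
villages = select_without_cumulation(pairs, populations, 4)
print(villages)   # [6, 5, 11, 18]
```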
Note: We shall illustrate with this example the procedure of selection of a pps
systematic sample. As the interval I = Z/n = 2815/4 = 703.75 is not an integer,
we first select a random start between 1 and Z (i.e. 2815). Let this be 1938;
State  Village  Serial  Previous census      State  Village  Serial  Previous census
       no.      no.     population                  no.      no.     population
I       1        1       69                  II      6       16       84
        2        2       81                          7       17       85
        3        3      110                          8       18      102
        4        4       80                          9       19      122
        5        5       92                         10       20      102
        6        6       65                         11       21       86
        7        7       72                  III     1       22       78
        8        8      108                          2       23      112
        9        9      106                          3       24       97
       10       10       80                          4       25      117
II      1       11       72                          5       26      106
        2       12      102                          6       27      115
        3       13       73                          7       28      110
        4       14       84                          8       29      104
        5       15       98                          9       30      103
                                             Total                  2815
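First two digits     Last three digits     Census population    Decision
(remainder on        (remainder on         of the village
division by 30)      division by 122)
06                   733 (R 1)              65                  Accept
65 (R 5)             511 (R 23)             92                  Accept
01                   932 (R 78)             69                  Reject
71 (R 11)            508 (R 20)             72                  Accept
48 (R 18)            222 (R 100)           102                  Accept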
then our random numbers are 1938; 1938 + 704 = 2642; (1938 + 2 x 704 =)
3346 - 2815 = 531; and 531 + 704 = 1235. From Table 5.5 the reader may
verify that the cumulative previous census population is 1873 up to village no.
11 in state II and 1951 up to village no. 1 in state III; the random number 1938
therefore corresponds to the latter village, which is selected. Similarly the other
three sample villages would be village no. 8 in state III, village no. 7 in state I,
and village no. 5 in state II.
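The pps systematic selection of this note can be sketched as below. The function name `pps_systematic` is hypothetical, and rounding the interval 703.75 to 704 and wrapping round past Z follow the arithmetic shown in the note.

```python
from bisect import bisect_left
from itertools import accumulate

# Previous census populations of the 30 villages, in serial order (Table 5.5).
populations = [69, 81, 110, 80, 92, 65, 72, 108, 106, 80,
               72, 102, 73, 84, 98, 84, 85, 102, 122, 102, 86,
               78, 112, 97, 117, 106, 115, 110, 104, 103]


def pps_systematic(sizes, n, start, interval=None):
    """Serial numbers of a circular pps systematic sample.

    sizes    : size measure of each unit, in serial order
    n        : sample size
    start    : random start between 1 and Z (the total size)
    interval : sampling interval; defaults to Z/n rounded to an integer
    """
    Z = sum(sizes)
    if interval is None:
        interval = round(Z / n)                # 703.75 -> 704
    cumulative = list(accumulate(sizes))
    chosen = []
    for j in range(n):
        point = (start - 1 + j * interval) % Z + 1    # wrap round past Z
        # First unit whose cumulative size is not less than the point.
        chosen.append(bisect_left(cumulative, point) + 1)
    return chosen


serials = pps_systematic(populations, n=4, start=1938)
print(serials)   # [22, 29, 7, 15]
```

Serial numbers 22, 29, 7 and 15 correspond to village 1 in state III, village 8 in state III, village 7 in state I and village 5 in state II respectively.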
The data on the four sample villages and the required computations are shown
in Table 5.7: the computational procedures are somewhat different from those in
Example 5.1, and are generally to be preferred.
Here N = 30, n = 4, \(Z = \sum^{N} z_i = 2815\). The corrected sums of squares and
products are \(SSy_i^* = 88945.57\); \(SSh_i^* = 16714.07\); and \(SPy_i^* h_i^* = 37524.44\).
An unbiased estimate of the present total population in the thirty villages is,
from equation (5.3),
\[ y_0^* = \frac{1}{n}\sum^{n} y_i^* = 3239.33 \text{ or } 3239 \]
an unbiased estimate of the variance of which is, from equation (5.5),
\[ s^2_{y_0^*} = SSy_i^*/n(n - 1) = 7412.1308 \]
so that the estimated standard error of \(y_0^*\) is
\[ s_{y_0^*} = 86.09 \]
and the estimated CV is 2.66 per cent.
Similarly, an unbiased estimate of the total number of households in the thirty
villages is
\[ h_0^* = \frac{1}{n}\sum^{n} h_i^* = 672.95 \text{ or } 673 \]
an unbiased estimate of the variance of which is \(s^2_{h_0^*} = 1392.8392\), so that the
estimated standard error of \(h_0^*\) is \(s_{h_0^*} = 37.32\) and the estimated CV is 5.55 per
cent.
The estimated average household size is, from equation (5.9),
\[ r = y_0^*/h_0^* = 3239.33/672.95 = 4.8136 \text{ or } 4.81 \]
As an unbiased estimate of the covariance of \(y_0^*\) and \(h_0^*\) is
\[ s_{y_0^* h_0^*} = SPy_i^* h_i^*/n(n - 1) = 3127.0367, \]
the estimated variance of r is, from equation (5.10),
\[ s_r^2 = (s^2_{y_0^*} + r^2 s^2_{h_0^*} - 2r s_{y_0^* h_0^*})/h_0^{*2} = 0.0211556 \]
so that the estimated standard error of r is \(s_r = 0.1454\) and the estimated CV
3.02 per cent.
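Starting from the corrected sums of squares and products quoted above, the variances, standard errors and CVs of this example can be recomputed with the short sketch below. The variable names are illustrative; the formulas are those of equations (5.5) and (5.10) as used in the example.

```python
import math

n = 4
y0, h0 = 3239.33, 672.95                              # estimated totals
ss_y, ss_h, sp_yh = 88945.57, 16714.07, 37524.44      # corrected SS and SP

var_y = ss_y / (n * (n - 1))      # estimated variance of y0*
var_h = ss_h / (n * (n - 1))      # estimated variance of h0*
cov_yh = sp_yh / (n * (n - 1))    # estimated covariance of y0* and h0*

r = y0 / h0                       # estimated average household size
var_r = (var_y + r ** 2 * var_h - 2 * r * cov_yh) / h0 ** 2

se_y, se_h, se_r = math.sqrt(var_y), math.sqrt(var_h), math.sqrt(var_r)
cv_y, cv_h, cv_r = 100 * se_y / y0, 100 * se_h / h0, 100 * se_r / r
print(round(se_y, 2), round(se_h, 2), round(r, 4), round(se_r, 4))
print(round(cv_y, 2), round(cv_h, 2), round(cv_r, 2))
```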
Note that although this particular sample has not provided units with
relatively large sizes, the CV’s are much smaller than those for a simple random
sample (Example 2.3).
Table 5.7: Present population and number of households of 4 sample villages selected with probability proportional to previous
census population and computation of estimates for 30 villages
5.5.1 Introduction
In a survey designed to estimate the total area under any particular crop
and its total yield, sampling of fields (or farms or plots) with probability
proportional to total (geographical) area (ppa) introduces simplifications
in the estimating procedures in addition to possible improvements in the
efficiency of the estimators. If a map of fields (or farms or plots) is available,
selection with ppa may be made by the procedure described in section 5.4.2.
\[ \pi_i = a_i/A \]
where
\[ A = \sum_{i=1}^{N} a_i \]
and where \(a_i\) is the area of the ith field (i = 1, 2, ..., N for the universe, and
i = 1, 2, ..., n for the sample). Let \(y_i\) denote the area under a particular
crop for the ith sample field, then an unbiased estimator of the total area
under the crop in the universe is, from estimating equation (5.3),
\[ y_0^* = \sum^{n} y_i^*/n = \sum^{n} y_i/n\pi_i
   = A\sum^{n} a_i p_i/n a_i = A\sum^{n} p_i/n = A\bar{p} \qquad (5.11) \]
where \(p_i = y_i/a_i\) is the proportion of the area of the ith field under the
particular crop, which varies from 0 to 1.
An unbiased estimator of the variance of \(y_0^*\) is, from equation (5.5),
\[ s^2_{y_0^*} = A^2 s^2_{\bar{p}} = A^2 SSp_i/n(n - 1) \qquad (5.12) \]
\[ s^2_{\bar{p}} = SSp_i/n(n - 1) \qquad (5.14) \]
Note: If the crop is such that it either occupies the whole of a field or no part of
it, or if the fields are small enough for this assumption to hold, such that the
proportion of the total area under the crop (\(p_i\)) is either 1 (whole) or 0 (none),
the estimator for the variance of the total area under the crop in equation (5.12)
reduces to
\[ s^2_{y_0^*} = A^2 \bar{p}(1 - \bar{p})/(n - 1) \qquad (5.15) \]
For let r of the n sample units have \(p_i = 1\), and the rest (n - r) have \(p_i = 0\).
Then \(\sum^{n} p_i = \sum^{r} 1 = r\), also \(\sum^{n} p_i^2 = \sum^{r} 1 = r\), so that \(\bar{p} = \sum^{n} p_i/n = r/n\),
and \(SSp_i = \sum^{n} p_i^2 - n\bar{p}^2 = r - n\bar{p}^2 = n\bar{p} - n\bar{p}^2 = n\bar{p}(1 - \bar{p})\). Substituting this value of
\(SSp_i\) in equation (5.12), we obtain equation (5.15).
Example 5.3
Ten farms selected with probability proportional to total area from the universe
of 100 farms gave the following proportion of area under a crop:
(p,): 0.20; 0.25; 0.10; 0.30; 0.15; 0.25; 0.20; 0.25; 0.10; 0.20
Estimate the total area under the crop for the 100 farms and its CV; the total
geographical area is 16 124 acres.
Here $N = 100$; $n = 10$; $A = 16\,124$; $\sum^n p_i = 2.00$; $\sum^n p_i^2 = 0.4175$; and
$SSp_i = 0.0175$.
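The estimated proportion of area under the crop is, from equation (5.13),
$\bar{p} = \sum^n p_i/n = 2.00/10 = 0.20$
and an unbiased estimate of its variance is, from equation (5.14),
$s^2_{\bar{p}} = 0.0175/90 = 0.00019444$
$\bar{r} = \sum^n r_i/n$ (5.16)
i.e. the simple (unweighted) average of the yields per unit area in the
different fields. Also,
$s^2_{\bar{r}} = SSr_i/n(n-1)$ (5.17)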
An unbiased estimator of the total yield $X$ is (from an equation of the
type 5.11)
$x_0^* = A\bar{r}$ (5.18)
and
$s^2_{x_0^*} = A^2 s^2_{\bar{r}}$ (5.19)
A generally biased but consistent estimator of the average yield per unit
of crop area is
$r = x_0^*/y_0^* = \bar{r}/\bar{p}$ (5.20)
an estimator of the variance of which is
$s_r^2 = (s^2_{x_0^*} + r^2 s^2_{y_0^*} - 2r\,s_{y_0^* x_0^*})/y_0^{*2}$ (5.21)
where $s_{y_0^* x_0^*}$ is an unbiased estimator of the covariance of $y_0^*$ and $x_0^*$, given
by
$s_{y_0^* x_0^*} = A^2\,SPp_i r_i/n(n-1)$ (5.22)
Note: In the far less common situation when the areas under a particular crop
are known for all the fields in the universe, sample fields can be selected with
probability proportional to crop area, and an unbiased estimator of the average
yield per unit of crop area is given by an equation of the type (5.16), namely
$\bar{r} = \sum^n r_i/n$ (5.23)
and an unbiased estimator of the total yield of the crop by $Y\bar{r}$, where $r_i = x_i/y_i$.
Variance estimators of these two estimators are given by estimating equations of
the types (5.17) and (5.19), respectively.
Varying Probability Sampling 109
As with simple random sampling, so also with pps sampling, the ratio
method of estimation may be used to improve the efficiency of estimators.
The principle is the same and will be illustrated with examples.
Note, first, however, that if sampling is with pps, and the ratio method
is used with the help of the sizes themselves, the ratio estimator of the
universe total becomes the same as the unbiased estimator from the pps
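sampling. For if $y$ is the study variable and $z$ the size variable, then with
the usual notations, and with $\pi_i = z_i/Z$,
$y_0^* = \frac{1}{n}\sum^n y_i/\pi_i = \frac{Z}{n}\sum^n y_i/z_i$ (5.3)
$z_0^* = \frac{1}{n}\sum^n z_i/\pi_i = Z$ (5.24)
The ratio estimator of the total $Y$, using the size variable $z$, is therefore,
from an estimating equation of the type (3.6),
$y_R^* = Z y_0^*/z_0^* = y_0^* = Z\bar{r}$ (5.25)
where $r_i = y_i/z_i$ and $\bar{r} = \sum^n r_i/n$. This result has special application in
agricultural crop surveys (see section 5.5).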
Although it is known that for sampling with pps to be more efficient
than simple random sampling, the size variable should have a high, positive
correlation with the study variable, and the linear regression line of the
study variable on the size should pass through the origin, in a multi-subject
inquiry the size variable chosen (as a necessary compromise owing to the
conflicting desiderata of a number of variables) may be such that the above
conditions are not fulfilled in respect of a particular study variable; in
other situations, the required information on the desired size may not be
available at the time of the sample selection (see exercise 3 at the end of
this chapter). In these cases, the ratio method of estimation may be used
in order to improve the efficiency of the estimators obtained from the pps
sampling.
Thus, if $w$ is the ancillary variable used for the ratio estimation and $z$
is the size variable used for pps sampling, then with the usual notations
$y_R^* = W y_0^*/w_0^* = Wr$ (5.27)
where
$W = \sum^N w_i$; $r = y_0^*/w_0^*$ (5.28)
Note: For the ratio method of estimation to be efficient, the selection probability
should be appropriate for both $y$ and $w$.
Example 5.4
A sample of 4 villages, drawn with probability proportional to area from the list
of 30 villages given in Appendix IV, gave the data on the present population
(Table 5.8). Obtain an unbiased estimate of the present total population in the
30 villages. Given the previous census population of the 30 villages in Table 5.6,
obtain the ratio estimate of the present total population and compare the two
estimates. The total area of the 30 villages is 270.0 km² and the total previous
census population, 2815.
Here $N = 30$; $n = 4$; $Z = \sum^N z_i = 270.0$ km²; $W = \sum^N w_i = 2815$.
An unbiased estimate of the present total population of the 30 villages is
$y_0^* = \frac{1}{n}\sum^n y_i^* = 3506.14$ or 3506
An unbiased estimate of the variance of $y_0^*$ is
$s^2_{y_0^*} = SSy_i^*/n(n-1) = 443\,565.2459$
so that the estimated standard error of $y_0^*$ is $s_{y_0^*} = 666.01$ and the estimated CV
is 19.00 per cent.
Table 5.8: Present and previous census population of 4 villages selected with
probability proportional to area and computation of estimates for 30 villages
Columns: (1) village serial no.; (2) sample village no.; (3) area $z_i$ (km²);
(4) $1/\pi_i = Z/z_i$; (5) present population $y_i$; (6) $y_i^* = y_i/\pi_i$;
(7) $y_i^{*2}$; (8) previous census population $w_i$; (9) $w_i^* = w_i/\pi_i$;
(10) $w_i^{*2}$; (11) $y_i^* w_i^*$
(1)  (2)  (3)   (4)      (5)  (6)       (7)            (8)  (9)      (10)           (11)
 7    1    4.5  60.0000   88  5 280.00  27 878 400.00   72  4320.00  18 662 400.00  22 809 600.00
13    2    5.8  46.5517   80  3 724.14  13 869 218.74   73  3398.27  11 548 238.99  12 655 633.24
22    3   10.0  27.0000   83  2 241.00   5 022 081.00   78  2106.00   4 435 236.00   4 719 546.00
30    4   10.2  26.4706  105  2 779.41   7 725 119.95  103  2726.47   7 433 638.66   7 577 977.98
These are the estimates we would obtain if no other information were available.
If, however, the data on the previous census were available, we could use
these to improve our estimates. We first obtain the unbiased estimate of the total
previous census population, from equation (5.26),
$w_0^* = \frac{1}{n}\sum^n w_i^* = 3137.68$ or 3138
An unbiased estimate of the variance of $w_0^*$ is
$s^2_{w_0^*} = SSw_i^*/n(n-1) = 224\,942.3142$
Also,
$s_{y_0^* w_0^*} = SPy_i^* w_i^*/n(n-1) = 313\,175.4731$
As the ratio of the two unbiased estimates $y_0^*$ and $w_0^*$ is
$r = y_0^*/w_0^* = 3506.14/3137.68 = 1.11743$
the ratio estimate of the present total population, using the previous census
population is, from equation (5.27),
$y_R^* = W y_0^*/w_0^* = Wr$
$= 2815 \times 1.11743 = 3145.57$ or 3146
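The arithmetic of Example 5.4 can be reproduced from the village data in Table 5.8. The sketch below is an illustration only: the helper names are hypothetical, and small differences in the last digits relative to the printed figures are to be expected because the text works from rounded intermediate values.

```python
# Data for the 4 sample villages (Table 5.8) and the universe totals.
Z = 270.0                          # total area of the 30 villages (km^2)
W = 2815                           # total previous census population
areas = [4.5, 5.8, 10.0, 10.2]     # z_i
present = [88, 80, 83, 105]        # y_i
previous = [72, 73, 78, 103]       # w_i


def pps_expand(values, sizes, size_total):
    """Return y_i* = y_i / pi_i with pi_i = z_i / Z."""
    return [size_total * v / z for v, z in zip(values, sizes)]


def mean(xs):
    return sum(xs) / len(xs)


def variance_of_mean(xs):
    """SS / n(n - 1), the unbiased variance estimator of the mean of xs."""
    m = mean(xs)
    n = len(xs)
    return sum((x - m) ** 2 for x in xs) / (n * (n - 1))


y_star = pps_expand(present, areas, Z)
w_star = pps_expand(previous, areas, Z)

y0 = mean(y_star)                        # unbiased estimate of present total
w0 = mean(w_star)                        # unbiased estimate of previous total
se_y0 = variance_of_mean(y_star) ** 0.5  # estimated standard error of y0
r = y0 / w0
y_ratio = W * r                          # ratio estimate, equation (5.27)

print(round(y0, 2), round(w0, 2), round(se_y0, 2), round(y_ratio, 2))
```

The ratio estimate lies well below the unbiased pps estimate because the sample over-estimates the known previous census total (about 3138 against 2815), and the ratio adjustment corrects for this.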
Further reading
Brewer and Hanif; Chaudhury and Stenger, chapter 2; Godambe and Thompson;
Hedayat and Sinha, chapters 1-3 and 5; Kish (1965), chapter 7; Murthy (1967),
chapter 6 and section 15.5c; Singh and Chaudhary, chapter 5; Sukhatme et al.
(1984), chapter 3; Yates, sections 3.9, 6.16, 7.15, 8.9 and 8.20.
Exercises
1951 census
population 551 865 2535 3523 8368 7357 5131 4654 1146 1165
Cultivated
area (acres) 4824 924 1948 3013 7678 5506 4051 4060 809 1013
Table 5.10: Area under a crop and the proportion to total area of 5 sample
farms, selected with probability proportional to the total area
Farm No. 3 18 28 34 35
Total area 52 110 300 410 430
Area under a crop 10 24 59 72 103
Proportion of area under the crop 0.1923 0.2182 0.1967 0.1756 0.2385
3. Table 5.11 shows for 1937 the area under wheat in 34 villages in Lucknow
sub-division (India) selected out of 170 villages with probability
proportional to the cultivated area as recorded in 1931.
(a) Given that the total cultivated area in 1931 in the 170 villages was
78,019 acres, obtain an unbiased estimate of the total area under
wheat and its standard error.
(b) After the sample selection and enumeration, data on the area under
wheat in 1936 became available. Using this information, and given
that the total area under wheat in 1936 was 21,288 acres, obtain the
ratio estimate of the area under wheat in 1937 and its standard error
(Sukhatme and Sukhatme, 1970b, Example 4.4).
Table 5.11: Total cultivated area in 1931 and area under wheat in 1936
and 1937 for a sample of 34 villages selected with probability proportional to
cultivated area in Lucknow sub-division, India
Village Total Area under wheat Village Total Area under wheat
serial cultivated in in serial cultivated in in
no. area, 1931 1936 1937 no. area, 1931 1936 1937
(acres) (acres) (acres) (acres) (acres) (acres)
1 401 75 52 18 186 45 27
2 634 163 149 19 1767 564 515
3 1194 326 289 20 604 238 249
4 1770 442 381 21 701 92 85
5 1060 254 278 22 524 247 221
6 827 125 111 23 571 134 133
7 1737 559 634 24 962 131 144
8 1060 254 278 25 407 129 103
9 360 101 112 26 715 192 179
10 946 359 355 27 845 663 330
11 470 109 99 28 1016 236 219
12 1625 481 498 29 184 73 62
13 827 125 111 30 282 62 79
14 96 5 6 31 194 71 69
15 1304 427 399 32 439 137 100
16 377 78 7 33 854 196 141
17 259 78 105 34 824 255 265
6.1 Introduction
the sampling unit - the cluster or its elements - and its size.
In this chapter, we shall consider the criteria for the choice of the sampling
units, and for simplicity, only clusters of equal size (i.e. each with an
equal number of elementary units) selected by simple random sampling.
We shall see that when, as it generally happens, the elementary units
within a cluster tend to be similar in respect of the study variable, then
cluster sampling will be less efficient than a direct (unrestricted) simple
random sample of the elementary units, given the same total number of the
units in the sample. However, cluster sampling reduces the costs and labor
of travel, identification, contact, data collection etc., and may sometimes
also reduce non-sampling errors and biases in data (section 25.7). Apart
from the latter consideration, the question therefore arises of balancing the
general increase in sampling error against the decrease in costs. This leads
to the problem of determining the optimal size of cluster, i.e. the number
of elementary units it should contain. This problem is considered in the
next chapter.
Notes
1. Cluster sampling, as defined, may be used with single- as well as multi-stage
designs, and unstratified as well as stratified designs. A single-stage cluster
sampling may be considered as the case where all the second-stage units
(i.e. the elementary units) in the selected first-stage units (i.e. clusters)
are surveyed. Cluster sampling and two-stage sampling, to be considered
in Part III, have many common considerations. The term “cluster sampling”
is sometimes used to denote any multi-stage design; the difference
in definitions is therefore well worth bearing in mind. The similarity in
formulation of the theory in section 6.2 with that of systematic sampling
in section 4.3 may also be noted.
2. Cluster sampling in general refers to geographical units; and a sample of
households is not generally described as a cluster sample of household
members.
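where
$\sigma_w^2 = \sum_{i=1}^{N}\sum_{j=1}^{M_0}(Y_{ij} - \bar{Y}_i)^2/NM_0$ (6.6)
is the within-cluster variance, and
$\sigma_b^2 = \sum_{i=1}^{N}(\bar{Y}_i - \bar{Y})^2/N$ (6.7)
is the between-cluster variance.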
The between-cluster variance can also be expressed in terms of the
intraclass correlation coefficient $\rho_c$ between pairs of units within the clusters
(or “intra-cluster” correlation coefficient), which is defined by
$\rho_c = \frac{\sum_{i=1}^{N}\sum_{j \ne k}^{M_0}(Y_{ij} - \bar{Y})(Y_{ik} - \bar{Y})}{NM_0(M_0 - 1)\sigma^2}$
$\bar{y}_i = y_i/M_0 = \bar{Y}_i$ (6.11)
The overall mean per unit in the sample is also the mean of the sample
cluster means
$\bar{y}_c = \sum_{i=1}^{n}\sum_{j=1}^{M_0} y_{ij}/nM_0 = \sum_{i=1}^{n}\bar{y}_i/n$ (6.12)
$s_b^2 = \sum_{i=1}^{n}(\bar{y}_i - \bar{y}_c)^2/(n - 1)$ (6.14)
$s^2_{\bar{y}_c} = s_b^2/n$ (6.15)
where
$\sigma^2_{\bar{y}} = \sigma^2/nM_0$ (6.18)
is the sampling variance of the mean of $nM_0$ units selected directly by srs
(with replacement). Equation (6.17) is the fundamental formula in cluster
sampling.
An unbiased estimator of the within-cluster variance $\sigma_w^2$ is the sample
estimator
$s_w^2 = \sum_{i=1}^{n}\sum_{j=1}^{M_0}(y_{ij} - \bar{y}_i)^2/nM_0$ (6.19)
A sample estimator of pc is
This ratio is termed “design effect” or “deff” for short (Kish, 1965).
1. As $\sigma^2 \ge \sigma_b^2$ (from equation (6.5)), from equation (6.20) one sees that
$\rho_c$ lies between $-1/(M_0 - 1)$ (when $\sigma_b^2 = 0$) and 1 (with $\sigma_w^2 = 0$).
(iii) If $\rho_c = 1$ (the maximum value), i.e. if all the units in the cluster
have the same value for the study variable such that the greatest
degree of homogeneity obtains in a cluster, then $\sigma^2_{\bar{y}_c} = M_0\sigma^2_{\bar{y}}$;
i.e. the variance for cluster sampling will be $M_0$ times that for
an srs. Cluster sampling will then be extremely inefficient.
(iv) The factor $(M_0 - 1)\rho_c$ of the “design effect” gives a measure
of the relative change in the sampling variance due to sampling
clusters instead of sampling the elementary units directly; for
example, if clusters of ($M_0 =$) 100 persons each are formed and
$\rho_c = 0.01$, then the design effect is 1.99 or 2 approximately, so
that the variance of cluster sampling will be twice that of an srs
of individuals. Thus, a relatively small value of the intracluster
correlation coefficient, multiplied by the size of the cluster, could
lead to a substantial increase in variance.
(v) $\rho_c$ will be negative for the sex and age composition of members
of a household (see Note 4, and exercises 3 and 4 at the end of
this chapter); cluster sampling of households will then be more
efficient than an srs of persons, and the cost-efficiency of cluster
sampling greater still.
(vi) In general, however, $\rho_c$ is positive, and decreases with the cluster
size $M_0$, but the factor $(M_0 - 1)\rho_c$ increases with increasing
cluster size, so that cluster sampling becomes less efficient than
srs, and increasingly inefficient as the cluster size increases.
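The behaviour of the design effect described in these notes can be tabulated directly. The following is a minimal sketch, assuming the design effect has the form $1 + (M_0 - 1)\rho_c$ implied by the worked figure of 1.99 above; the function name is a hypothetical one used only for illustration.

```python
def design_effect(cluster_size, rho):
    """Ratio of the cluster-sampling variance to the srs variance
    for the same total number of elementary units."""
    return 1 + (cluster_size - 1) * rho


# Figure quoted in the text: clusters of 100 persons, rho_c = 0.01.
deff_example = design_effect(100, 0.01)

# Limiting cases: complete homogeneity, and the lower bound of rho_c.
M0 = 4
deff_max = design_effect(M0, 1.0)             # equals M0
deff_min = design_effect(M0, -1 / (M0 - 1))   # equals 0

# A fixed small rho_c still inflates the variance as clusters grow.
for size in (10, 50, 100, 300):
    print(size, round(design_effect(size, 0.01), 2))
```

A value above 1 means that cluster sampling needs more elementary units than an srs for the same precision; a value below 1, which requires a negative $\rho_c$, means that it needs fewer.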
6.4 Notes
has the attribute, the value zero otherwise, and if in the $i$th sample
cluster, $M_i'$ units possess the attribute, then the cluster proportion of
units possessing the attribute is $M_i'/M_0 = P_i = p_i$, where $P_i$ and $p_i$ denote
respectively the universe proportion and the sample estimator in the
$i$th sample cluster.
An unbiased estimator of the universe proportion $P$ is
$p_c = \sum^n p_i/n$ (6.22)
Further reading
Cochran, chapter 9; Deming (1950), chapter 3C; Hansen et al. (1953), vol. I,
chapter 6A-D, vol. II, chapter 6; Hedayat and Sinha, chapter 7; Kish (1965),
sections 5.1, 5.2, 5.4, 6.1 and 6.2; Murthy (1967), chapter 8; Sukhatme et al.
(1984), chapter 7; Yates, section 2.9.
Exercises
1. From the data and results of exercise 7, Chapter 2, estimate the value of
intra-class correlation coefficient for the proportion of persons absent.
2. A bed of white pine seedlings contained six rows, each 434 ft long. Data
for four types of sampling units into which the bed could be divided are
shown in Table 6.1 along with estimates of cost (in terms of length of a row
that could be covered in 15 minutes). Obtain the optimum sampling units
after comparing the relative cost-efficiencies of the different sampling units
(Cochran, pp. 235-236).
(Hint: First compute the relative cost of measuring one unit, in this
example, in terms of time required to count one unit: $C_u$ = relative size of
unit/length of a row (ft) that can be covered in 15 minutes; then compute
the relative net precision of each unit which is defined as being inversely
proportional to the variance obtained for fixed cost, namely (CUS^) is
the universe variance per unit. For details, see Cochran).
3. (a) Consider households all of size 4, consisting of a couple and two
children, and assume that the sex of a child is binomially distributed with
the proportion of male children being one-half. Show by computing the
between- and within-cluster variances or otherwise that the intraclass
correlation coefficient between sexes of different members of the household is
$-1/6$ and that the efficiency of sampling households (clusters of persons)
to that of sampling persons directly for estimating the sex ratio is 200%
(Sukhatme and Sukhatme (1970b), section 6.3).
(b) Compute separately the values of the intraclass correlation coefficient
between sexes of different members of households with 1 child, 3 children,
and 4 children respectively in addition to a couple. Generalize the results
for a household with $M_0$ members (a couple and the rest of their children) to
show that the intraclass correlation coefficient between the sexes of different
members of the households is $-1/{}^{M_0}C_2$.
[1 C2 - 2(M0 - 2)]/m°C2.
CHAPTER 7
Size of Sample: Cost and Error
7.1 Introduction
Two important questions on the designing of any sample inquiry are the
total cost of the survey and the precision of the main estimates. Both these
are related to the size of the sample, given the variability of the data, the
type of sampling and the method of estimation. Obviously, the larger the
sample, the smaller will be the sampling error, i.e. the greater the precision
of the estimates, but the higher will also be the cost. The survey should
be so designed as to provide estimates with minimum sampling errors (i.e.
with maximum precision) when the total cost is fixed, and to result in
the minimum total cost when the precision is preassigned: a sample size
fulfilling these conditions is called the optimal sample size. We shall see
later in Chapter 25 that other considerations such as the existence of
non-sampling errors and biases in data have also to be taken into account: these
generally increase with sample size beyond a certain point.
The size of a sample will be determined by the objective of the inquiry,
and the permissible margin of error in the estimates. For example, during
the depression of the Thirties in the U.S.A., when it was not known whether
the unemployed numbered five million or fifteen million, the first sample
did not necessarily have to be large to be useful; at present, however, larger
samples with more sensitive measures are required to estimate if the
unemployment rate (the number unemployed as a percentage of the total number
in labor force) increases from say 3 to 3 | per cent (Kish, 1971).
For simplicity, we shall in this chapter consider simple random sampling.
In general, under conditions in which pps sampling and the ratio method
of estimation are applied, these require smaller samples than an srs with
the same efficiency.
which depends only on the universe proportion $P$ and the sample size $n$.
The coefficient of variation of $p$ is
$e = \sigma_p/P = \sqrt{(1-P)/nP}$ (7.2)
from which
$n = P(1-P)/\sigma_p^2 = (1-P)/Pe^2$ (7.3)
This determines the sample size $n$ required for any given coefficient of
variation $e$.
As the CV per unit is
$\sqrt{(1-P)/P}$ (7.4)
equation (7.3) can also be written as
$n = (\text{CV per unit}/e)^2$ (7.5)
The sample size can also be determined such that the universe proportion
$P$ would lie within a given margin of error $d$ on both sides of the sample
estimator with a certain probability $(1 - \alpha)$, which is equivalent to saying
that the acceptable risk that the universe proportion $P$ will lie outside the
limits $p \pm d$ is $\alpha$. For this, we use the assumption (which holds when $n$ is
large and the proportion $P$ is not too small) that the sample proportion is
normally distributed with mean $P$ and standard deviation $\sigma_p$. Then the
$100(1 - \alpha)$ per cent probability limits of the universe proportion $P$ are
$p \pm t_\alpha\sigma_p$ (7.6)
where $t_\alpha$ is the value of the standard normal deviate that cuts off a total area
$\alpha$ from both the tails taken together in the normal curve. Some illustrative
values of $t_\alpha$ are given in Appendix III, Table 2; for example, for $\alpha = 0.05$
(1 in 20), $1 - \alpha = 0.95$ and $t_\alpha = 1.96$ or approximately 2.
Setting the permissible margin of error
$d = t_\alpha\sigma_p = t_\alpha\sqrt{P(1-P)/n}$ (7.7)
and solving for $n$, we get
$n = t_\alpha^2 P(1-P)/d^2$ (7.8)
In Appendix III, Table 3 have been tabulated the values of CV per
unit in the universe for some values of the universe proportion P, and
in Appendix III, Table 4, the required sample sizes (n) corresponding to
given values of the universe CV per unit and the desired CV of the sample
estimator. Given a value of P, the universe CV per unit can be obtained
from equation (7.4) or read off Appendix III, Table 3; and the required
sample size n can then be obtained from equation (7.5) or from Appendix
III, Table 4, given the desired CV of the sample estimator e.
The margin of error d can be expressed as
$d = t_\alpha\sigma_p = t_\alpha Pe$ (7.9)
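Equations (7.7) and (7.8) translate directly into two small functions. This is a sketch under the normal approximation stated above; the function names are hypothetical, and the standard normal deviate is taken from the Python standard library rather than from Appendix III.

```python
from math import sqrt
from statistics import NormalDist


def normal_deviate(alpha):
    """t_alpha: cuts off a total area alpha from both tails together."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def margin_of_error(P, n, alpha=0.05):
    """Equation (7.7): d = t_alpha * sqrt(P(1 - P)/n)."""
    return normal_deviate(alpha) * sqrt(P * (1 - P) / n)


def sample_size_for_margin(P, d, alpha=0.05):
    """Equation (7.8): n = t_alpha^2 * P(1 - P) / d^2."""
    return normal_deviate(alpha) ** 2 * P * (1 - P) / d ** 2


t = normal_deviate(0.05)              # about 1.96
d_1296 = margin_of_error(0.5, 1296)   # about 0.98/36
n_needed = sample_size_for_margin(0.5, 0.03)
print(round(t, 2), round(d_1296, 4), round(n_needed))
```

Taking $P = 1/2$ gives the most conservative sample size, since $P(1-P)$ is then at its maximum.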
Notes
1. As the CV of the estimated number of units possessing an attribute is
the same as that of the estimated proportion of such units in the universe
(exercise 6, Chapter 2), the same formulae for n will hold for the estimated
number with the attribute as for the estimated proportion.
2. As the universe proportion P will not generally be known, an advance
estimate p may be taken and used in the preceding equations.
3. As the value of $p(1 - p)$ increases as $p$ approaches 1/2, a safer estimate of
$n$ is obtained on taking as an advance estimate that value of $p$ which is
nearer to 1/2; for $p = 1/2$, the margin of error takes the maximum value
(from equation 7.7)
$d = t_\alpha\sigma_p = 1.96 \times \sqrt{(1/2 \times 1/2)/n} = 0.98/\sqrt{n} \approx 1/\sqrt{n}$
for $\alpha = 0.05$ (i.e., 95% confidence limits).
This is a good back-of-the-envelope formula to remember. For example,
for a sample of size 1,296 persons, $d = 1/\sqrt{1296} = 1/36$, i.e., about 3 per
cent.
For very small $p$, the advance estimate should not be too rough; in this
case, the Poisson approximation can be used, so that equation (7.8) is
simplified to
$n = t_\alpha^2 P/d^2$ (7.10)
and equation (7.3) to
$n = 1/Pe^2$ (7.11)
(The Poisson distribution can be regarded as the limiting distribution of a
binomial when $P$ becomes indefinitely small and $n$ increases sufficiently to
keep $nP$ finite (say $m$), but not necessarily large; both the mean and the
universe variance per unit are then equal to $m$).
4. When sampling is without replacement, and the sampling fraction $n/N$ is
not negligible, a more satisfactory estimate of the sample size is
$n' = n/(1 + n/N)$ (7.12)
where $n$ is obtained from the previous equations on the basis of the assumption
of sampling with replacement.
5. In practice, estimates of proportion and ratios such as birth and death rates
are seldom obtained from a simple random sample of individuals.
Example 7.1
Example 7.2
The (crude) birth rate is to be estimated in a country from a sample survey with
2.5 per cent CV. What is the sample size required if a rough estimate of the
birth rate places it at 40 per 1000 persons?
Here P = 0.04, and the universe CV per unit is 4.9 (from equation (7.4)). As
the desired CV is e = 0.025, n = 38,400 persons (from equation (7.5)).
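The two steps of Example 7.2 can be written out as follows. This is a sketch only, with hypothetical function names; it assumes the CV per unit for a proportion is $\sqrt{(1-P)/P}$, which reproduces the value 4.9 quoted in the example.

```python
from math import sqrt


def cv_per_unit(P):
    """Universe CV per unit for an attribute with proportion P."""
    return sqrt((1 - P) / P)


def sample_size_for_cv(P, e):
    """Sample size giving the estimated proportion a CV of e."""
    return (cv_per_unit(P) / e) ** 2


cv_unit = cv_per_unit(0.04)             # birth rate of 40 per 1000
n_birth = sample_size_for_cv(0.04, 0.025)
print(round(cv_unit, 1), round(n_birth))
```

Because the CV per unit grows as $P$ falls, rare attributes such as births need very large samples of persons for a given relative precision.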
is
$CV = \sigma/\bar{Y}$ (7.13)
From equation (7.15), the sample size $n$ required to obtain the sample
mean with a given CV $e$ is
$n = (\text{CV per unit}/e)^2$ (7.16)
In Appendix III, Table 4, have been given the required sample sizes (n)
corresponding to some values of the universe CV per unit and the desired
CV of the sample mean (e).
The sample size may also be determined such that the acceptable risk
that the universe mean $\bar{Y}$ will lie outside the limits $\bar{y} \pm d$ is $\alpha$. For this, we
assume that the sample mean $\bar{y}$ is normally distributed with mean $\bar{Y}$ and
standard deviation $\sigma_{\bar{y}} = \sigma/\sqrt{n}$. The $100(1 - \alpha)$ per cent confidence limits of
the universe mean $\bar{Y}$ are
$\bar{y} \pm t_\alpha\sigma_{\bar{y}}$ (7.17)
where $t_\alpha$ is the value of the standard normal deviate that cuts off a total area
$\alpha$ from both the tails taken together in the normal curve. Some illustrative
values of $t_\alpha$ are given in Appendix III, Table 2.
Setting the permissible margin of error
$d = t_\alpha\sigma_{\bar{y}} = t_\alpha\sigma/\sqrt{n}$ (7.18)
and solving for $n$, we get
$n = (t_\alpha\sigma/d)^2 = \sigma^2/V$ (7.19)
where $V$ is the desired variance of the sample mean. This determines the
sample size $n$, given the values of $\sigma$, $d$ and $\alpha$.
Notes
1. If the universe total $Y$ is to be estimated with margin of error $D$, the sample
size required is
$n = (Nt_\alpha\sigma/D)^2 = (N\sigma)^2/V'$ (7.20)
where V' is the desired variance of the sample total. Note that the unbiased
estimators of the universe mean and total have the same CV (exercise 2,
chapter 6), and equations (7.19) and (7.20) give the same value of n.
2. The above formulation assumes a knowledge of universe CV or $\sigma$. The
universe CV remains remarkably stable over time and space and for
characteristics of the same nature as the study variable; the CV for a previous
study in the same or a different area for a related characteristic may
therefore be taken. An advance estimate of $\sigma$ may be taken from a pilot or some
other study.
3. When sampling is without replacement, and the sampling fraction $n/N$ is
not negligible, a more satisfactory estimator of the sample size is
$n' = n/(1 + n/N)$ (7.21)
This applies to both equations (7.19) and (7.20).
4. If a sample of size $n$ gives a CV $e$ for a sample estimator (mean or total),
then to obtain the sample estimator with CV $e'$, the required sample size
is
$n' = n(e/e')^2$ (7.22)
This follows from the relation
$(\text{CV per unit})^2 = ne^2 = n'e'^2$ (7.23)
which in turn follows from equation (7.15).
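The sample-size rules in these notes can be collected into three small functions. This is a sketch with hypothetical function names; the universe size of 1000 used for the finite-population adjustment is an invented figure, while the CVs per unit of 1.10 and 0.44 are those used in Example 7.3 below.

```python
def sample_size_for_mean(cv_unit, e):
    """Equation (7.16): n = (CV per unit / desired CV)^2."""
    return (cv_unit / e) ** 2


def adjust_for_finite_universe(n, N):
    """Equation (7.21): n' = n / (1 + n/N), for sampling without replacement."""
    return n / (1 + n / N)


def rescale_sample_size(n, e, e_new):
    """Equation (7.22): n' = n (e/e')^2."""
    return n * (e / e_new) ** 2


n_workers = sample_size_for_mean(1.10, 0.05)   # hired workers, CV of 5%
n_acreage = sample_size_for_mean(0.44, 0.025)  # acres in oats, CV of 2.5%
n_required = max(n_workers, n_acreage)         # must satisfy both items

# Halving the desired CV quadruples the sample size.
n_halved_cv = rescale_sample_size(n_workers, 0.05, 0.025)

# Hypothetical universe of N = 1000 units.
n_adjusted = adjust_for_finite_universe(n_workers, 1000)
print(round(n_workers), round(n_acreage), round(n_halved_cv), round(n_adjusted))
```

When several items are estimated from one survey, the largest of the individual sample sizes governs, as in Example 7.3.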
Example 7.3
The coefficients of variation per unit (an area of 1 mile square) obtained in a farm
survey in Iowa, U.S.A, are given in Table 7.1. A survey is planned to estimate
acreage items with a CV of 2.5 per cent, and the number of workers (excluding
the unemployed) with a CV of 5 per cent. With simple random sampling, how
many units are needed? How well would this sample be expected to estimate the
number unemployed? (Cochran, exercise 4.5).
The maximum CV of the items (other than the number unemployed) is that
of the number of hired workers, for which the CV is 1.10, the desired CV of which
is 0.05. From equation (7.16),
$n = (\text{CV per unit}/\text{desired CV of sample estimator})^2$
$= (1.10/0.05)^2 = 22^2 = 484$
The acreage items are required with CV of 0.025. Taking the acreage item
with the maximum CV, namely, acres in oats, the required sample size is, from
equation (7.16),
$n = (0.44/0.025)^2 = 17.6^2 \approx 310$
so that with a sample size of 484, required for the number of workers, the desired
CV of the acreage items will also be attained.
For the number unemployed, we get, from equation (7.15), the CV of the
sample estimator from a sample of size 484,
Example 7.4
d' = 10 per cent of the true value at 95 per cent probability level, or
from equation (7.18); i.e. the desired CV for the sample estimator in the second
village is
$e' = \sigma'_{\bar{y}}/\bar{Y} = 0.1/2 = 0.05$
From equation (7.22), the required sample size in the second village is
If estimates are required not only for the universe as a whole, but for
sub-divisions such as geographical area, or sex and age groups of the population,
obviously the sample size, obtained to estimate the overall universe value
with a given precision, must be enlarged if estimators for the sub-divisions
are required with the same precision as that of the overall universe
estimators.
As a rough rule, if estimators with variance $V$ are required for each of
the $k$ universe sub-divisions, the sample size should be
$n' = kn$ (7.24)
where $n$ is the required sample size for the overall universe estimate with the
same variance $V$. The assumptions are that the per unit CVs of the
sub-divisions are about equal, that the sub-divisions are approximately equal
in size, and that the overall sample is large.
If the proportion of the universe units in any sub-division, $p_i$, is known
or could be estimated, the required sample size to estimate the average
from the sub-division with variance $V$ is
$n_i = \sigma_i^2/p_iV$ (7.25)
where $\sigma_i^2$ is the variance per unit in the $i$th sub-division and $np_i$ is large.
We have to take the maximum value of the right-hand side of equation
(7.25) in order that it holds for all sub-divisions,
as the $\sigma_i^2$s will, on an average, be slightly smaller than the universe variance
$\sigma^2$. If the sub-divisions are approximately equal in size, the $p_i$ may be taken
to be equal to $1/k$, and this relation is used in equation (7.25).
Example 7.5
In Example 7.1, what size of sample is required if total cases are wanted separately
for males and females, with the same precision? (Cochran, exercise 4.3).
Here k = 2, and assuming that males and females are about equal in number
in the population, we have, from equation (7.24), n = 2 x 2475 = 4590.
Table 7.2: Rules for estimating the variance from the range (h) depending
on the shape of the distribution
Note: The mathematical relations are not of much use if $h$ is large or cannot be
estimated closely. If $h$ is large, the universe can be stratified (Part II) when, within
a stratum, the shape of the distribution becomes simpler, closer to a rectangle, if
stratification is effective and the mathematical relation between $h$ and $\sigma$ can be
used for each stratum.
Example 7.6
The four-year colleges in the U.S.A, were divided into classes of four different
sizes according to their 1952-3 enrollments. The standard deviations within each
class are given in Table 7.3. If you know the class boundaries but not the value
of $\sigma$, how well can you guess the $\sigma$ values by using simple mathematical figures?
No college has less than 200 students and the largest has about 50,000 students
(Cochran, exercise 4.8).
Assuming a rectangular distribution within each class, the range ($h$) and the
estimates of standard deviation (= 0.29 $h$) are given in Table 7.3. If for Size class
1, we assume the right triangle distribution (II), the estimated s.d. = 0.24 $h$ = 192;
and if for Size class 4, we assume the right triangle distribution (I), estimated s.d.
= 0.24 $h$ = 9600.
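The multipliers 0.29 and 0.24 used in Example 7.6 are rounded values of $1/\sqrt{12}$ and $1/\sqrt{18}$, the standard deviations of a rectangular and a right-triangular distribution on a range of unit length. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the ranges of 800 and 40,000 are the values implied by the arithmetic of the example, not figures read from Table 7.3.

```python
from math import sqrt

# Exact ratio of standard deviation to range for two distribution shapes.
SD_TO_RANGE = {
    "rectangular": 1 / sqrt(12),     # rounds to 0.29
    "right_triangle": 1 / sqrt(18),  # rounds to 0.24
}


def sd_from_range(h, shape):
    """Advance estimate of the standard deviation from the range h."""
    return SD_TO_RANGE[shape] * h


rect_coef = round(SD_TO_RANGE["rectangular"], 2)
tri_coef = round(SD_TO_RANGE["right_triangle"], 2)

# Size classes 1 and 4 with the rounded multiplier, as in the text.
sd_class1 = tri_coef * 800
sd_class4 = tri_coef * 40000

# The same estimate with the unrounded multiplier is slightly smaller.
sd_class1_exact = sd_from_range(800, "right_triangle")
print(rect_coef, tri_coef, sd_class1, sd_class4, round(sd_class1_exact, 2))
```

The gap between the rounded and unrounded figures is small compared with the uncertainty in guessing the shape of the distribution.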
Table 7.3: Actual and estimated standard deviations within each size class
of four-year colleges in the U.S.A., 1952-3
1 2 3 4
We have seen how to obtain the sample size required to provide estimators
with a given precision in simple random sampling. To determine the implied
total cost of the survey, we take the simplest type of cost function
$C = c_0 + nc_1$ (7.27)
where $c_0$ is the overhead cost, and $c_1$ the cost of surveying one sample unit.
On the other hand, if the total cost $C$ is fixed, the sample size is
determined from the above relation, namely
$n = (C - c_0)/c_1$ (7.28)
and the only thing left is to estimate the expected CV of the sample
estimators from a sample of this size if prior information is available on the
variability of data.
With increase in sample size, the cost of the survey increases but the
sampling error, and so the loss involved in basing any decision on the sample
estimators, decreases; it is necessary to express this loss in monetary terms
in order to find a balance between cost and error.
Taking an srs (with replacement) of $n$ units, where the sample mean $\bar{y}$
is used to estimate the universe mean $\bar{Y}$, the error in the estimator $\bar{y}$ is
$b = \bar{y} - \bar{Y}$
(7.29)
(7.30)
n=a (7.31)
where n is the sample size required for the given precision had the ultimate
units been sampled directly.
Example 7.7
In Example 7.2, what would be the sample size required for estimating the birth
rate with 2.5 per cent CV if clusters of 300 persons each are taken and the
intraclass correlation coefficient is estimated at 0.001?
From equation (7.32), $n_c = 38{,}400\,(1 + 300 \times 0.001) = 49{,}920$, an increase of
30 per cent.
The universe variance of the sample cluster mean $\bar{y}_c$ is, from equation (6.17),
$\sigma^2_{\bar{y}_c} = \sigma_b^2/n = \sigma^2[1 + (M_0 - 1)\rho_c]/nM_0$
$= \sigma_b^2(c_1 + M_0c_2)/(C - c_0)$ from equation (7.34)
$= c_1\sigma^2(1 + M_0c_2/c_1)[1 + (M_0 - 1)\rho_c]/M_0(C - c_0)$ (7.35)
If previous information is available from empirical or pilot studies, the
values of the equation (7.35) can be computed and plotted for different
values of Mo, to show the particular value of Mo that minimizes the ex
pression.
Putting
$C - c_0 = C' = nc_1 + nM_0c_2$ (7.36)
the optimal cluster size is
$M_0 = \sqrt{\frac{c_1}{c_2}\cdot\frac{1 - \rho_c}{\rho_c}}$ (7.37)
In practice, instead of dealing with the costs $c_1$ and $c_2$, we might
consider the man-days required to be spent on the different operations. If the
enumerators work singly, $c_1$ may be 2 to 3 man-days, and if 30 to 50
persons can be enumerated per day in a demographic inquiry, $c_2$ will be from
1/30 to 1/50 man-day; the ratio $c_1/c_2$ will then range from 60 to 150. The
optimal size of cluster for some typical values of the ratio $c_1/c_2$ and of $\rho_c$
are given in Table 7.4.
Taking a typical value of the ratio $c_1/c_2$ as 100, and the value of the
intraclass correlation coefficient at 0.001 for the birth rate, the optimal
cluster size is 316 persons; for the death rate, taking the intraclass correlation
coefficient at 0.003, the optimal cluster size is 183 persons. The value of $n$
(number of sample clusters) is determined from equation (7.34), given the
values of $C$, $c_0$, $c_1$, $c_2$ and the optimum value of $M_0$.
Note that the optimum in cluster size is generally very broad (except for very
small values of the intraclass correlation coefficient), so that substantial
deviations from the optimal cluster size would not affect the cost very much.
Example 7.8
For a survey on birth rate, given that the total cost, neglecting overheads, is
fixed at $20,000, and the enumerator cost per month is $300, what is the optimal
size of sample if it is decided to select a cluster of persons, assuming that an
enumerator has to spend, on average, two and a half days in contacting the
clusters and in other preliminary work; that he can enumerate an average of 40
persons a day; and that the intraclass correlation is estimated at 0.001?
Here C = $20,000; 1 man-day = $300/30 = $10; c_1 = 2.5 man-days = $25;
c_2 = 1/40 man-day = $0.25, so that c_1/c_2 = 100; ρ_c = 0.001. From Table 7.4,
the optimal size of cluster is 316 persons. Taking M_0 at 300 persons, the total
cost (neglecting overheads) is, from equation (7.33)
$20,000 = n(c_1 + M_0 c_2) = n($25 + 300 × $0.25) = $100n
or n = 200 sample clusters. The total sample size is nM_0 = 200 × 300 = 60,000
persons.
small sample and with increasing resources build up a fully adequate sample;
the Current Population Survey of the U.S.A., for example, started in
1943 with 68 primary areas which were enlarged to the present 449. Fourth,
it is possible to combine smaller monthly or quarterly estimates into yearly
estimates, and the yearly estimates into estimates covering longer periods,
to provide estimates with acceptable precision. And finally, in the interest
of true accuracy, it may sometimes be better to conduct a smaller sample
with adequate control than try to canvass a much larger sample but with
poor quality data (see also Chapter 25).
Further reading
Cochran, chapter 4 and section 9.6; Deming (1950), chapter 14; Hansen et al.
(1953), vol. I, section 4.11 and chapter 6D, and vol. II, section 4.9; Hedayat
and Sinha, chapter 5; Levy and Lemeshow, section 14.3; Kish (1965), section 2.6;
Murthy (1967), sections 4.6 and 8.3; Sukhatme et al. (1984), section 1.9, 1.10,
and 2.9-2.11; Yates, section 4.32.
Exercises
number of persons in the universe within 10 per cent, apart from a chance
of 1 in 20.
5. For the 30 villages listed in Appendix IV, if no village is assumed to have
less than 60 or more than 150 persons, estimate the s.d.
CHAPTER 8
Self-weighting Designs
8.1 Introduction
For a simple random sample, the design is self-weighting with respect to the
unbiased estimators of universe totals, averages, and ratios. For with the
notations used in Chapter 2, an unbiased estimator of the universe total Y,
obtained from a simple random sample of n units out of the universe total
of N units, is
\[ y_0^* = N \sum_{i=1}^{n} y_i / n \tag{8.1} \]
where \(y_i\) (i = 1, 2, ..., n) is the value of the study variable for the ith sample
unit. The weighting factor for the ith sample unit is, therefore,
\[ w_i = N/n \tag{8.2} \]
which, when multiplied by \(y_i\) and added for the n sample units, provides the
unbiased estimator \(y_0^*\) of the universe total Y. The multiplier is constant
for all the sample units.
Similarly, the unbiased estimator of the universe mean \(\bar{Y}\) is given by the
sample mean
\[ \bar{y} = \sum_{i=1}^{n} y_i / n \tag{8.3} \]
where the multiplier is 1/n, constant for all the sample units.
In the estimation of the ratio of two universe totals, the sample estimator
is
\[ r = \sum y_i / \sum x_i \tag{8.4} \]
Here, because the design is self-weighting with respect to the two totals,
the multiplier does not enter in the ratio; nor does it enter into the
estimating equation for the variance of the ratio.
\[ w_i = 1/(n\pi_i) \tag{8.6} \]
which will not be the same for all the sample units, and the design will not,
therefore, be self-weighting.
However, if \(\pi_i = z_i/Z\), where \(z_i\) is the ‘size’ of the ith universe unit and
\(Z = \sum z_i\) is the total ‘size’ of the universe, then
\[ y_0^* = Z \sum_{i=1}^{n} r_i / n \tag{8.7} \]
where \(r_i = y_i/z_i\).
If the ratios \(r_i = y_i/z_i\) could be observed and recorded easily in the field,
so that the \(r_i\)s could be considered as the values of a newly defined study
variable, then the design will be self-weighting with respect to the unbiased
estimators, for the multipliers for the \(r_i\) values are \(w'_i = Z/n\), which are the
same for all the sample units. This has practical uses in crop surveys, as
we have seen in section 5.5.
Notes
1. A pps sample design can be made self-weighting at the tabulation stage by
selecting a sub-sample of the sample units with probability proportional
to the multipliers, but the sampling variance will be increased (see section
13.5).
2. We shall see later in Chapters 18 and 23 how in a multi-stage design the
sample can be made self-weighting even with pps sampling at some stages.
When the original design is not self-weighting, one of the following
procedures may be adopted to reduce the number of multipliers and thus to
achieve at least partial self-weighting at the tabulation stage.
1. The multipliers \(w_i\) may be replaced by their simple average.
2. The multipliers may be rounded off to some convenient numbers as
the nearest multiples of ten, hundred etc.
3. The multipliers may be rounded off to a small number of weights by
a random process which would retain the unbiased character of the
estimators (Note 3(c) to section 12.3.3).
4. A sub-sample of n' ultimate units may be selected from the original
sample with probability proportional to their multipliers (Note 1 to
section 8.3).
The first two procedures will lead to biased estimates with possible
decrease in variance. However, the bias in the first procedure will be negligible
if in the sample the covariance between the sample values and the
multipliers is small. The third and the fourth procedures give unbiased estimators
STRATIFIED SINGLE-STAGE
SAMPLING
CHAPTER 9
9.1 Introduction
Notes
1. That stratification may lead to a gain in efficiency per unit of cost may be
seen if we consider a universe composed of strata that are, with respect to
the study variable, internally homogeneous, but heterogeneous with respect
to each other: in this case, a very small sample from each stratum would
provide estimates with relatively small sampling variances.
Taking an extreme case, consider the universe of six households with
respective sizes 4, 4, 4, 5, 5, and 5. The reader will verify that the variance
per unit in this universe is \(\sigma^2 = 0.25\). If we were to draw a simple random
sample of 4 households with replacement from this universe, an unbiased
estimator of the total size is \(N\bar{y}\) (where N = 6 and \(\bar{y}\) is the sample mean,
based on n = 4 sample units), with the sampling variance \(N^2\sigma^2/n = 2.25\).
Suppose, however, that the universe is subdivided into two strata, the first
with the three households each of size 4, and the second with the other three
households each of size 5; and an srs with replacement of 2 units is to be
drawn from each stratum. Then, as we shall see later, an unbiased estimator
of the total size is \((N_1\bar{y}_1 + N_2\bar{y}_2)\), where \(N_1 = 3\), \(N_2 = 3\), being respectively
the total number of units in the two strata, and \(\bar{y}_1\) and \(\bar{y}_2\) the respective
stratum means based on \(n_1 = 2\) and \(n_2 = 2\) sample units; the sampling
variance of this unbiased estimator of the total is \(N_1^2\sigma_1^2/n_1 + N_2^2\sigma_2^2/n_2 = 0\),
as the variance per unit in the two strata \(\sigma_1^2 = \sigma_2^2 = 0\).
2. Stratification may be carried out at different stages of sampling. The
most common type of stratification is by administrative and geographical
sub-division, such as by provinces, prefectures, counties, districts, and
rural/urban categories. In household surveys, stratification is often carried
out before the sample households are selected, by listing the households
from which the sample is to be drawn, and recording such other
characteristics as may be readily obtained, e.g. size, and social and economic
classes; the households are then stratified on the basis of the
characteristics recorded and then sampled (with a different sampling fraction) in each
stratum.
3. Strata may be formed of units that are not geographically contiguous. Thus
in a rural socio-economic survey, the villages may be stratified according
to their population as given by the most recent census; or in a crop-yield
survey, the fields may be classified according as they are irrigated or not;
in the demographic inquiry in Mysore (a former State in India), conducted
by the United Nations and the Government of India in 1951-2, three strata
were formed in the rural areas; rural hills area with large-scale anti-malarial
operations, rural hills area without large-scale anti-malarial operations, and
rural plains (tank-irrigated area).
4. The data from a sample design that is not stratified, or is stratified
according to some variable other than that desired, may, under some
circumstances, be treated as if coming from a sample stratified according to the
desirable stratification variable. This technique, known as stratification
after sampling or the technique of post-stratification, is explained in Chapter
10 in connection with stratified simple random sampling.
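The six-household illustration in note 1 can be verified directly. The sketch below computes the two sampling variances from the formulas \(N^2\sigma^2/n\) and \(\sum N_h^2\sigma_h^2/n_h\), and also enumerates every possible with-replacement sample of 4 households as an independent check; the variable names are illustrative only.

```python
from itertools import product
from statistics import mean, pvariance

# Universe of six households (sizes as in note 1)
sizes = [4.0, 4.0, 4.0, 5.0, 5.0, 5.0]
N = len(sizes)
n = 4  # unstratified srs with replacement

sigma2 = pvariance(sizes)                 # variance per unit
var_unstratified = N**2 * sigma2 / n      # N^2 * sigma^2 / n

# Brute-force check: every ordered with-replacement sample of size n
estimates = [N * mean(sample) for sample in product(sizes, repeat=n)]
var_enumerated = pvariance(estimates)

# Two strata (all 4s, all 5s), srs with replacement of 2 units from each
strata = [[4.0, 4.0, 4.0], [5.0, 5.0, 5.0]]
n_h = 2
var_stratified = sum(len(s) ** 2 * pvariance(s) / n_h for s in strata)

print(sigma2, var_unstratified, var_enumerated, var_stratified)
```

The stratified variance is zero only because each stratum here is perfectly homogeneous; with real data the gain depends on how much of the variation lies between strata.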
In the following the fundamental theorems given in section 2.7 are extended
to the case of stratified sample designs.
1. The universe is sub-divided into L mutually exclusive and exhaustive
strata. In the hth stratum (h = 1, 2, ..., L), there are \(n_h\) (≥ 2)
independent and unbiased estimators \(t_{hi}\) (i = 1, 2, ..., \(n_h\)) of the
universe parameter \(T_h\); a combined unbiased estimator of \(T_h\) is, from
estimating equation (2.28), the arithmetic mean
\[ t_h = \sum_{i=1}^{n_h} t_{hi} / n_h \tag{9.1} \]
An unbiased estimator of the variance of \(t_h\) is
\[ s_{t_h}^2 = SSt_{hi} / n_h(n_h - 1) \tag{9.2} \]
where
\[ SSt_{hi} = \sum_{i=1}^{n_h} (t_{hi} - t_h)^2 \tag{9.3} \]
«2 = 52(9-9)
h=l
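where
\[ SPt_{hi}u_{hi} = \sum_{i=1}^{n_h} (t_{hi} - t_h)(u_{hi} - u_h)
   = \sum_{i=1}^{n_h} t_{hi}u_{hi} - \Big(\sum_{i=1}^{n_h} t_{hi}\Big)\Big(\sum_{i=1}^{n_h} u_{hi}\Big) \Big/ n_h \tag{9.11} \]
\[ r_h = t_h / u_h \tag{9.13} \]
with estimated variance (from equation (2.34))
\[ s_{r_h}^2 = (s_{t_h}^2 + r_h^2 s_{u_h}^2 - 2 r_h s_{t_h u_h}) / u_h^2 \tag{9.14} \]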
For all the strata combined, a consistent but generally biased estimator
of the ratio of two universe parameters R = T/U is the ratio of
the respective sample estimators t and u,
\[ r = t/u \tag{9.15} \]
\[ \hat{\rho}_h = s_{t_h u_h} / s_{t_h} s_{u_h} \tag{9.17} \]
and an estimator of the correlation coefficient p of the two study
variables at the overall level is
Notes
1. The question of formation of strata, and the allocation of the total number
of sample units into the different strata will be taken up later in Chapter
12.
2. Estimates are often required for higher levels of aggregation than strata,
in addition to the overall universe estimators. For example, for the Indian
National Sample Survey on Population, Births and Deaths (1958-9), the
strata consisted of tehsils or groups of tehsils (a tehsil being an
administrative unit comprising a number of villages and some small towns); a
sample of villages was selected from each stratum and the population and
the births and deaths during the preceding twelve months recorded. The
country comprises a number of states, each state comprising a number of
tehsils, and estimates were required not only for the country as a whole,
but also for the states separately. For estimates for the whole country,
estimating equations of the types (9.5), (9.6), (9.12), (9.15), and (9.16) were
used; for the estimates for the states, the same types of equations were used
excepting that for any state, the values of only those strata that fell within
the state were considered (see Example 10.3).
3. As noted in section 9.2, the selection probabilities might be different in
different strata.
4. For other notes, e.g. relating to estimators of ratios, see the notes to section
2.7.
We have seen in section 2.8 that for an unstratified single-stage sample, the
(100 − α) per cent confidence limits of a universe parameter T are set with
the sample estimator t and its estimated standard error \(s_t\), thus:
\[ t \pm t'_{\alpha,n-1}\, s_t \tag{2.40} \]
where \(t'_{\alpha,n-1}\) is the 100α percentage point of the t-distribution for (n − 1)
degrees of freedom. It was also noted that for large n, the percentage points
of the normal distribution could be used for those of the t-distribution.
The above holds for each stratum separately. Thus the (100 − α) per cent
confidence limits of the universe parameter \(T_h\) for the hth stratum are
\[ t_h \pm t'_{\alpha,n_h-1}\, s_{t_h} \tag{9.19} \]
the sample stratum estimator \(t_h\) being defined by equation (9.1) and its
estimated standard error by equation (9.2).
For the confidence limits of the overall universe parameter T, the normal
distribution may be used if the estimators of variances of the stratum
estimators \(t_h\), namely \(s_{t_h}^2\), are based on not too few degrees of freedom
and it is assumed that the overall universe estimator t, defined by equation
(9.5), is normally distributed. Then
\[ t \pm t'_{\alpha}\, s_t \]
are the (100 − α) per cent confidence limits of the universe parameter T,
where t is defined by equation (9.5), \(s_t\) is its estimated standard error, and
\(t'_{\alpha}\) is the normal percentage point for probability α.
If, however, the variance estimators in the different strata \(s_{t_h}^2\) are based
on small numbers of degrees of freedom, the normal distribution cannot be
used. The t-distribution can be used by computing the effective number of
degrees of freedom
\[ n' = \frac{\big(\sum_{h=1}^{L} s_{t_h}^2\big)^2}{\sum_{h=1}^{L} \big[ s_{t_h}^4 / (n_h - 1) \big]} \tag{9.21} \]
which will lie between (the smallest of the) \(n_h - 1\) and \(\sum_{h=1}^{L} (n_h - 1)\). The
(100 − α) per cent confidence limits of T are
\[ t \pm t'_{\alpha,n'}\, s_t \tag{9.22} \]
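The effective number of degrees of freedom is easy to compute once the stratum variance estimates are available. The sketch below assumes the usual Satterthwaite-type form, \((\sum s_{t_h}^2)^2 / \sum [s_{t_h}^4/(n_h - 1)]\), for equation (9.21); the function name and the input figures are illustrative assumptions, not values taken from a particular table.

```python
def effective_df(var_estimates, sample_sizes):
    """Effective degrees of freedom for the sum of independent variance estimates.

    var_estimates : estimated variances of the stratum estimators
    sample_sizes  : number of independent estimates n_h in each stratum (>= 2)
    """
    if any(n < 2 for n in sample_sizes):
        raise ValueError("each stratum needs at least two independent estimates")
    numerator = sum(var_estimates) ** 2
    denominator = sum(v ** 2 / (n - 1) for v, n in zip(var_estimates, sample_sizes))
    return numerator / denominator


# Illustrative variance estimates for three strata with two units each
df_example = effective_df([900.0, 3660.25, 5184.0], [2, 2, 2])

# With equal variances and equal n_h the upper bound sum(n_h - 1) is attained
df_equal = effective_df([5.0, 5.0, 5.0], [2, 2, 2])

print(round(df_example, 2), df_equal)
```

The result is generally not an integer; in practice it is rounded down before the t-table is consulted, which errs on the side of wider limits.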
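where \(t_{h1}\) and \(t_{h2}\) are the two unbiased estimators of \(T_h\) from the two
selected sample units.
An unbiased estimator of the variance of \(t_h\) is
\[ s_{t_h}^2 = \tfrac{1}{4}(t_{h1} - t_{h2})^2 \tag{9.24} \]
so that
\[ s_{t_h} = \tfrac{1}{2}\,|t_{h1} - t_{h2}| \]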
Also,
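The overall estimators (for all the strata taken together) are:
\[ t = \sum_{h=1}^{L} t_h = \tfrac{1}{2} \sum_{h=1}^{L} (t_{h1} + t_{h2}) \tag{9.29} \]
\[ s_t^2 = \sum_{h=1}^{L} s_{t_h}^2 = \tfrac{1}{4} \sum_{h=1}^{L} (t_{h1} - t_{h2})^2 \tag{9.30} \]
\[ s_{tu} = \sum_{h=1}^{L} s_{t_h u_h} = \tfrac{1}{4} \sum_{h=1}^{L} (t_{h1} - t_{h2})(u_{h1} - u_{h2}) \tag{9.31} \]
Also,
\[ r = t/u = \sum_{h=1}^{L} (t_{h1} + t_{h2}) \Big/ \sum_{h=1}^{L} (u_{h1} + u_{h2}) \tag{9.32} \]
\[ s_r^2 = \frac{\sum_{h=1}^{L} \big[(t_{h1} - t_{h2}) - r(u_{h1} - u_{h2})\big]^2}{\big[\sum_{h=1}^{L} (u_{h1} + u_{h2})\big]^2} \tag{9.33} \]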
Thus the sums and differences of the two estimates obtained from the
two (first-stage) sample units in the different strata supply the required
data for computing the overall estimators and their variances, not only
for the totals but also for the ratios. This procedure has been followed in
the Indian National Sample Survey since its inception in 1950, and some
standard errors of ratios and ratio estimates were published in 1956.
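The sums-and-differences procedure can be sketched in a few lines. The function below takes, for each stratum, the two estimates of each of two totals and returns the overall estimates, their variances, the ratio and its estimated variance \((s_t^2 + r^2 s_u^2 - 2 r s_{tu})/u^2\). The function name is hypothetical; the single stratum used for illustration holds the State I figures of Table 10.5 (population 820 and 880, households 180 and 220).

```python
import math


def two_unit_estimates(strata):
    """Overall estimates when each stratum supplies two independent estimates.

    strata : list of tuples (t_h1, t_h2, u_h1, u_h2)
    """
    t = 0.5 * sum(t1 + t2 for t1, t2, _, _ in strata)
    u = 0.5 * sum(u1 + u2 for _, _, u1, u2 in strata)
    var_t = 0.25 * sum((t1 - t2) ** 2 for t1, t2, _, _ in strata)
    var_u = 0.25 * sum((u1 - u2) ** 2 for _, _, u1, u2 in strata)
    cov_tu = 0.25 * sum((t1 - t2) * (u1 - u2) for t1, t2, u1, u2 in strata)
    r = t / u
    var_r = (var_t + r ** 2 * var_u - 2 * r * cov_tu) / u ** 2
    return {"t": t, "u": u, "var_t": var_t, "var_u": var_u,
            "cov_tu": cov_tu, "r": r, "var_r": var_r}


# State I of Table 10.5: population (t) and households (u) stratum estimates
result = two_unit_estimates([(820.0, 880.0, 180.0, 220.0)])

print(result["t"], math.sqrt(result["var_t"]))   # total and its standard error
print(result["r"], math.sqrt(result["var_r"]))   # persons per household and its s.e.
```

Further strata are handled by appending more tuples to the list; the sums then run over all strata.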
(9.34)
\h=l h=l /
2 52 52<i,a (9.35)
h=l h=l
These are simpler to compute, but being based on only one degree of freedom will
be less efficient than the estimators defined earlier which are based on an effective
number of degrees of freedom, given by equation (9.21), that will lie between 1
and (Z — 1). Another unbiased estimator of the covariance of t and u can be
defined similarly, and so also the estimated variance of r.
Further reading
Cochran, chapter 5; Hansen et al. (1953), vol. I, chapter 5A; Hedayat and Sinha,
chapter 9; Murthy (1967), sections 7.1-7.3; Sukhatme et al. (1984), chapter 4;
Yates, sections 3.3-3.5.
CHAPTER 10
10.1 Introduction
In this chapter we will consider the estimating methods for totals, means,
ratios of study variables and their variances when a simple random sample
has been drawn in each stratum. The methods of estimation of proportion
of units in a category, the use of ratio estimators and stratification after
sampling will also be considered.
\[ N = \sum_{h=1}^{L} N_h \tag{10.1} \]
and the combined unbiased estimator for the stratum total \(Y_h\), obtained
from the sample units in the hth stratum, is, from equation (2.43) or
(9.1),
\[ y_{h0}^* = \sum_{i=1}^{n_h} y_{hi}^* / n_h = N_h \bar{y}_h \tag{10.4} \]
where \(\bar{y}_h = \sum_{i=1}^{n_h} y_{hi}/n_h\) is the mean of the \(y_{hi}\) values in the hth stratum.
Universe totals, means, and ratios of two totals and their sample esti
mators are defined in Table 10.1. These follow from the results of section
2.9 for any stratum and of section 9.3 for all the strata combined. The
sample estimators of totals and means are unbiased (theoretical proofs are
given in Appendix II, section A2.3.9); those for ratios are consistent, but
generally biased.
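Table 10.1: Some universe parameters and their sample estimators for a
stratified simple random sample

                         For the hth stratum                      For all strata combined
                         Universe           Sample                Universe        Sample
                         parameter (a)      estimator (b)         parameter (c)   estimator (d)
For a study variable:
  Total                  Y_h = N_h Ybar_h   y*_h0 = N_h ybar_h    Y = sum Y_h     y = sum y*_h0    (10.5)
  Mean                   Ybar_h = Y_h/N_h   ybar_h                Ybar = Y/N      y/N              (10.6)
For two study variables:
  Ratio of totals        R_h = Y_h/X_h      r_h = y*_h0/x*_h0     R = Y/X         r = y/x          (10.7)
                                                = ybar_h/xbar_h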
\[ \sigma_h^2 = \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2 / N_h \tag{10.8} \]
and the variance of the sample mean \(\bar{y}_h\) in a simple random sample (with
replacement) of \(n_h\) units is, from equation (2.17),
\[ \sigma_{\bar{y}_h}^2 = \sigma_h^2 / n_h \tag{10.9} \]
so that the variance of the sample estimator \(y_{h0}^*\) of the stratum total is
\[ \sigma_{y_{h0}^*}^2 = N_h^2 \sigma_h^2 / n_h \tag{10.10} \]
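\[ \begin{aligned}
s_{y_{h0}^*}^2 &= \sum_{i=1}^{n_h} (y_{hi}^* - y_{h0}^*)^2 / n_h(n_h - 1) \\
 &= SSy_{hi}^* / n_h(n_h - 1) \\
 &= N_h^2 \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2 / n_h(n_h - 1) \\
 &= N_h^2\, SSy_{hi} / n_h(n_h - 1)
\end{aligned} \tag{10.11} \]
\[ s_{\bar{y}_h}^2 = s_{y_{h0}^*}^2 / N_h^2 \tag{10.12} \]
\[ \begin{aligned}
s_{y_{h0}^* x_{h0}^*} &= SPy_{hi}^* x_{hi}^* / n_h(n_h - 1) \\
 &= N_h^2 \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h) / n_h(n_h - 1) \\
 &= N_h^2\, SPy_{hi}x_{hi} / n_h(n_h - 1)
\end{aligned} \tag{10.13} \]
\[ \begin{aligned}
s_{r_h}^2 &= (s_{y_{h0}^*}^2 + r_h^2 s_{x_{h0}^*}^2 - 2 r_h s_{y_{h0}^* x_{h0}^*}) / x_{h0}^{*2} \\
 &= (SSy_{hi} + r_h^2\, SSx_{hi} - 2 r_h\, SPy_{hi}x_{hi}) / n_h(n_h - 1)\bar{x}_h^2
\end{aligned} \tag{10.14} \]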
10.2.4 Sampling variances and their estimators for all strata combined
The variance of the estimator y of the universe total Y is
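\[ \sigma_y^2 = \sum_{h=1}^{L} \sigma_{y_{h0}^*}^2 = \sum_{h=1}^{L} N_h^2 \sigma_h^2 / n_h \tag{10.15} \]
\[ \sigma_{\bar{y}}^2 = \sigma_y^2 / N^2 \tag{10.17} \]
\[ s_{\bar{y}}^2 = s_y^2 / N^2 \tag{10.18} \]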
Theoretical proofs of the above are given in Appendix II, section A2.3.9.
An unbiased estimator of the covariance of the estimators y and x is,
from equation (9.12),
\[ s_{yx} = \sum_{h=1}^{L} s_{y_{h0}^* x_{h0}^*} \tag{10.19} \]
Notes
1. Sampling without replacement. In srs without replacement, the sample
estimators of a universe total, mean, and ratio of two universe totals remain
the same as for srs with replacement. The sampling variance of \(y = \sum y_{h0}^*\)
is, however, of the form
\[ \sigma_y^2 = \sum_{h=1}^{L} N_h^2 \frac{S_h^2}{n_h} (1 - f_h) \]
where
\[ S_h^2 = \frac{N_h}{N_h - 1}\, \sigma_h^2 \tag{10.22} \]
and
\[ f_h = n_h / N_h \tag{10.23} \]
An unbiased estimator of this variance is
\[ s_y^2 = \sum_{h=1}^{L} N_h^2 \frac{s_h^2}{n_h} (1 - f_h) \tag{10.24} \]
where
\[ s_h^2 = \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2 / (n_h - 1) \tag{10.25} \]
The sampling variance of the estimator of the universe mean and its unbiased
estimator are obtained on dividing the respective expressions for the universe
total by \(N^2\).
Theoretical proofs are given in Appendix II, section A2.3.9.
An unbiased covariance estimator \(s_{yx}\) is similarly defined and so also
the variance estimator of the ratio r = y/x.
2. The estimator of the overall universe mean \(\bar{Y}\) is, from equation (10.6d),
\[ \bar{y} = y/N = \sum_{h=1}^{L} y_{h0}^* / N = \sum_{h=1}^{L} N_h \bar{y}_h / N \tag{10.26} \]
i.e. the weighted average of the sample stratum means (\(\bar{y}_h\)), the weights
being \(N_h/N\), the proportion of the total number of universe units contained
in each stratum. This will be the same as the simple unweighted mean of the
sample values of the study variable (i.e. the mean of the sample stratum
means weighted by \(n_h/n\)),
\[ \sum_{h=1}^{L} \sum_{i=1}^{n_h} y_{hi} / n = \sum_{h=1}^{L} n_h \bar{y}_h / n \tag{10.27} \]
only when
\[ n_h / n = N_h / N \]
i.e. when the sampling fraction is the same in all the strata, in which case
\[ n_h = (n/N) N_h \tag{10.29} \]
Thus the allocation of the number of sample units to the different strata is
in proportion to the total number of units in each stratum: this is known
as proportional allocation. It leads to a self-weighting design for a stratified
simple random sample.
3. For other notes relating for example to estimators of ratios, see the notes
to section 2.7.
Example 10.1
From village 8 in state I (Appendix IV), assume that for a family budget inquiry, a
preliminary listing of all the 24 households has been made along with information
on size. Divide these 24 households into two strata - one with households of
size 1-5 persons and the other with size 6 persons and above. Select a simple
random sample (with replacement) of 3 households from the first stratum and of
2 households from the second stratum, and on the basis of the data on income
and food cost of these sample households, estimate for all the 24 households the
total income, total food cost, income per household and per person, food cost per
household and per person and the proportion of income spent on food.
The details of the selected households in the two strata with the required
information are given in Table 10.2. The required computations are shown in a
summary form in Table 10.3 and the final results in Table 10.4.
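Table 10.2: Size, total daily income, and food cost in the srs of households
in village no. 8, in state I (Appendix IV)

Stratum             Household     Size    Total daily    Daily food
                    serial no.            income ($)     cost ($)
I  (size 1-5)       2             3       33             21
                    9             4       39             24
                    11            3       35             22
II (size 6+)        16            6       62             31
                    22            6       61             29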
Stratified Simple Random Sampling 163
Table 10.3: Computation of the required estimates for a stratified srs: data
of Table 10.2

Row    Item                                        Stratum I      Stratum II     Combined
       sum of x_hi^2                               3835           7565
(9)    s^2(xbar_h) = s^2(x*_h0)/N_h^2              3.1111§        0.25§          1.582222 (s_x^2/N^2)†
(17)   (sum of w_hi)^2/n_h                         1496.3333      1800
(19)   s^2(wbar_h) = s^2(w*_h0)/N_h^2              0.777778§      1§             0.475309
(26)   sum of x_hi w_hi                            2399           3691
(31)   r_h^2 × s^2(x*_h0)                          349.993432§    0.291480§      294.770278
(33)   row (20) + row (31) − row (32)              15.825201§     25.389286§     29.338254
(36)   s_r = sq. root of row (34)                  0.00656§       0.0117§        0.00522 (s_r)†

† Not additive, i.e. the combined estimate is not obtained by adding the
stratum estimates.
Obtained by dividing the value of x by N (= 24).
§ Computations not necessary if separate estimates are not required for each
stratum.
Obtained by dividing the value of w by N (= 24).
Table 10.4: Estimates and standard errors of income and food costs,
computed from the data of Tables 10.2 and 10.3: Stratified srs of households

Item                          Estimate    Standard    Coefficient of
                                          error       variation (%)
1. Daily income
   (i) Total                  $1037       $30.19      2.91
   (ii) Per household         $43.50      $1.26       2.91
   (iii) Per person           $9.51       $0.28       2.91
3. Proportion of income
   spent on food              0.5687      0.00522     0.92
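The main figures of Example 10.1 can be recomputed from the five sample households of Table 10.2. The sketch below is a minimal illustration, not the book's own worksheet. The stratum sizes \(N_1 = 17\) and \(N_2 = 7\) are an assumption: they are not stated in the surviving text and are inferred here from \(N = 24\) and the published variances, so they should be checked against Appendix IV.

```python
import math

# Table 10.2: (household size, total daily income, daily food cost)
sample = {
    1: [(3, 33, 21), (4, 39, 24), (3, 35, 22)],   # stratum I, size 1-5
    2: [(6, 62, 31), (6, 61, 29)],                # stratum II, size 6+
}
N_h = {1: 17, 2: 7}   # ASSUMED stratum sizes (sum to the 24 listed households)
INCOME, FOOD = 1, 2   # column positions in each tuple


def totals_and_cov(col_a, col_b):
    """Estimated totals of two variables and their estimated covariance (srswr)."""
    total_a = total_b = cov = 0.0
    for h, rows in sample.items():
        n = len(rows)
        a = [row[col_a] for row in rows]
        b = [row[col_b] for row in rows]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        sp = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        total_a += N_h[h] * mean_a
        total_b += N_h[h] * mean_b
        cov += N_h[h] ** 2 * sp / (n * (n - 1))
    return total_a, total_b, cov


income, _, var_income = totals_and_cov(INCOME, INCOME)
food, _, var_food = totals_and_cov(FOOD, FOOD)
_, _, cov_food_income = totals_and_cov(FOOD, INCOME)

# Proportion of income spent on food and its estimated variance
r = food / income
var_r = (var_food + r ** 2 * var_income - 2 * r * cov_food_income) / income ** 2

se_income = math.sqrt(var_income)
se_r = math.sqrt(var_r)
print(round(income, 2), round(se_income, 2), round(r, 4), round(se_r, 5))
```

Passing the same column twice to `totals_and_cov` returns the variance estimator, since the covariance of a variable with itself is its variance.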
Example 10.2
From each of the three states (Appendix IV), select a simple random sample
(with replacement) of two villages each, and on the basis of the data on the
number of households and of persons in these sample villages, estimate for three
states separately, and also for the three states combined, the total numbers of
households and of persons, and the average household size with standard errors.
Here the states are the strata. The information for the selected sample villages
is given in Table 10.5. As two sample villages are selected from each state, the
simplified procedures of computation of section 9.5 will be used, where the sums
and the differences of the values of the study variables for the two sample units
will provide the required estimates. The required computations are shown in
Tables 10.5-10.7 and the final results in Table 10.8.
Example 10.3
In the Indian National Sample Survey (1958-9), the inquiry into population, births
and deaths in the rural areas was based on a single-stage stratified design. Rural
India was divided into 218 strata (composed of tehsils or groups of tehsils, a
tehsil being an administrative unit consisting of villages and a few towns), each
stratum containing approximately equal populations as in the Census of 1951. A
total period of survey of twelve months was divided into six sub-rounds each of
two months’ duration. A total of 2616 sample villages were covered, with 436
sample villages in each sub-round. In each stratum, two enumerators collected
the information independently, each surveying a sub-sample of six villages, i.e. one
village each in a sub-round. The allocation of the total sample of 2616 villages to
the different States of India was made on the basis of various factors, including the
total 1951 Census population, and the allocations were rounded to multiples of 12.
In each stratum, the sample villages were selected systematically with a random
start after arranging the tehsils in a particular order so as to increase the efficiency
of the estimators. In the selected villages, all the households were surveyed in
regard to demographic characteristics, including births and deaths during the 365
days preceding the enumeration. For any sub-round, the estimating equations will
take the forms in Example 10.2 as there were two sample villages in each stratum
in a sub-round. (For further details, see Murthy (1967), Chapter 15, and Som,
De, Das, Pillai, Mukherjee, and Sarma, 1961).
Some estimates and their standard errors are given in Table 10.9 for the first
sub-round (July-August 1958). The estimated rate of growth of population was
2 per cent per annum in 1958-9. The registration system for births and deaths is
still defective in India, and this was the first national survey which showed that
population in India was growing at a rate much faster than the annual 1.3 per
cent rate recorded during the 1941-1951 census decade, which was used in the
first two five-year plans covering the period 1951-61. This was later confirmed by
the Census of 1961 and the third five-year plan took account of this accelerated
rate of growth of population.
Table 10.5: Population and number of households in each of the two sample villages in the three
states (Appendix IV) and computation of stratum estimates: stratified srs of villages
State  Number of villages   Village      Sample        Number of households             Population
h      Total     Sample     serial no.   village no.   In sample     Stratum            In sample     Stratum
       N_h       n_h                     i             village       estimate           village       estimate
                                                       x_hi          x*_hi = N_h x_hi   y_hi          y*_hi = N_h y_hi
I      10        2          2            1             18            180                82            820
                            10           2             22            220                88            880
       Total                                                         418                              2189
       Mean (= stratum estimate of total)                            209 (x*_h0)                      1094.5 (y*_h0)
       Difference                                                    22 (d_xh)                        121 (d_yh)
       ½ |Difference| (= estimated standard error)                   11 (s_x*h0)                      60.5 (s_y*h0)
                            3            2             18            162                105           945
All zones combined:
  Households: x = 584.5, s_x = 26.521†, s_x^2 = 703.25
  Population: y = 2961.5, s_y = 98.711†, s_y^2 = 9744.25
† Not additive.
Table 10.7: Computation of the estimated average household size and sam
pling variance: data of Tables 10.5 and 10.6
1. Total population
Note that such a large-scale sample survey (with 2616 sample villages, about
234,000 sample households and 1.2 million sample persons) used the simple design
described in this chapter, the data of which the reader should by now be able to
analyze independently.
Example 10.4
For crop surveys starting from 1937 in the Bengal province of pre-independence
India and in the West Bengal State of India since 1948, the design has been
stratified single-stage. The whole area is cadastrally surveyed (i.e. surveyed for
tax purposes) so that village maps are available showing each plot. A number
of sample grids were selected, the size varying in different years. (A “grid” is a
square mesh on a plane formed by two sets of lines perpendicular to each other,
each line of each set being at a constant interval from the adjacent lines.)
With the help of the village maps, the field enumerators identified the
individual plots wholly or partially included in each grid, indicating against each plot
the fraction of it that was sown with a particular crop. From these data and the
total area of each plot (which was known), the proportion of the total area of the
grid occupied by the crop was calculated. Let this proportion be denoted by \(p_{hi}\)
(h = 1, 2, ..., L; i = 1, 2, ..., \(n_h\)).
If \(A_h\) is the total area of the hth stratum, an unbiased estimator of the area
under the crop in the stratum, obtained from the ith sample grid, is
\[ y_{hi}^* = A_h p_{hi} \]
and the combined unbiased estimator, obtained from all the \(n_h\) sample grids in
the stratum is, from equation (10.5b),
\[ y_{h0}^* = A_h \bar{p}_h \]
where \(\bar{p}_h\) is the simple (unweighted) average of the \(p_{hi}\) values in the hth stratum.
An unbiased variance estimator of \(y_{h0}^*\) is given by equation (10.11).
For the whole universe, an unbiased estimator of the total area under the crop
is
\[ y = \sum_{h=1}^{L} y_{h0}^* \]
an unbiased variance estimator of which is given by equation (10.16).
In surveys for estimating crop areas, the sampling fraction was generally 1 in
140 to 250, but in estimating crop yields it was necessarily much smaller, about 1
in 6 million - for details, see Mahalanobis, 1944, 1946a, and 1968. (For the crop
surveys recently being conducted in India by the National Sample Survey and the
Indian Council of Agricultural Research, the design is stratified multi-stage).
Some results for the crop survey in the State of West Bengal, India (1962-4)
are given in Table 10.10. The total geographical area of West Bengal is about 20
million acres (1 acre = 0.404 hectare), and square grids of size 2.25 acres each
were the sampling units in each stratum.
Table 10.10: Estimated area under Aus Paddy (1962-3); Aman paddy, and
Jute (1963-4): West Bengal
For the estimation of the proportion of the universe units that fall into a
certain class or possess a certain attribute, the extension from the case of
unstratified simple random sampling (dealt with in section 2.13) to
stratified simple random sampling is straightforward.
Let \(N'_h\) be the number of units possessing the attribute, and \(N_h\) the
total number of units in the hth stratum; a simple random sample of \(n_h\)
units from the \(N_h\) units shows that \(n'_h\) of the sample of \(n_h\) units possess the
attribute. We have already seen in section 2.13 that the sample proportion
\[ p_h = n'_h / n_h \tag{10.30} \]
is an unbiased estimator of the universe proportion
\[ P_h = N'_h / N_h \tag{10.31} \]
(10.32)
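The overall (i.e. for all the strata combined) universe proportion is estimated by
\[ p = \sum_{h=1}^{L} N_h p_h / N \tag{10.35} \]
with sampling variance
\[ \sigma_p^2 = \frac{1}{N^2} \sum_{h=1}^{L} \frac{N_h^2 P_h (1 - P_h)}{n_h} \tag{10.36} \]
an unbiased estimator of which is
\[ s_p^2 = \sum_{h=1}^{L} N_h^2 s_{p_h}^2 / N^2
         = \frac{1}{N^2} \sum_{h=1}^{L} \frac{N_h^2 p_h (1 - p_h)}{n_h - 1} \tag{10.37} \]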
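A short sketch of the stratified estimator of a proportion follows, assuming srs with replacement in each stratum and the forms \(p = \sum N_h p_h/N\) and \(s_p^2 = N^{-2}\sum N_h^2 p_h(1-p_h)/(n_h-1)\). The function name and all figures are hypothetical.

```python
import math


def stratified_proportion(N_h, n_h, a_h):
    """Estimate an overall proportion and its variance from a stratified srs.

    N_h : number of universe units in each stratum
    n_h : number of sample units in each stratum (each >= 2)
    a_h : number of sample units possessing the attribute in each stratum
    """
    N = sum(N_h)
    p_h = [a / n for a, n in zip(a_h, n_h)]
    p = sum(Nh * ph for Nh, ph in zip(N_h, p_h)) / N
    var_p = sum(Nh ** 2 * ph * (1 - ph) / (n - 1)
                for Nh, ph, n in zip(N_h, p_h, n_h)) / N ** 2
    return p, var_p


# Hypothetical two-stratum universe with a constant 5 per cent sampling fraction
p, var_p = stratified_proportion(N_h=[400, 600], n_h=[20, 30], a_h=[5, 18])
print(round(p, 4), round(math.sqrt(var_p), 4))
```

Because the sampling fraction is the same in both strata in this illustration, the estimate coincides with the plain sample proportion, 23 out of 50.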
Note: p is a weighted average of the phs, the weights being Nh/N. If a constant
sampling fraction is taken in each stratum, i.e. nh/Nh = n/N, then
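As for unstratified simple random sampling (Chapter 3), so also for
stratified simple random sampling, the use of ancillary information may increase
the efficiency of estimators: we shall illustrate the most common use of
ancillary information, namely, the ratio method of estimation.
Following the methods of section 3.2, and using the same types of
notations as for the study variables, the ratio estimator of \(Y_h\), the stratum total
of the study variable in the hth stratum, using the ancillary information,
is, from equation (3.6),
\[ y_{hR}^* = Z_h r_{1h} \tag{10.39} \]
where
\[ r_{1h} = y_{h0}^* / z_{h0}^* \tag{10.40} \]
The variance estimator of \(y_{hR}^*\) is
\[ s_{y_{hR}^*}^2 = s_{y_{h0}^*}^2 + r_{1h}^2 s_{z_{h0}^*}^2 - 2 r_{1h} s_{y_{h0}^* z_{h0}^*} \tag{10.41} \]
\[ = N_h^2 (SSy_{hi} + r_{1h}^2\, SSz_{hi} - 2 r_{1h}\, SPy_{hi}z_{hi}) / n_h(n_h - 1) \quad \text{in srswr} \tag{10.42} \]
For the ratio estimator of the total of the study variable \(Y = \sum Y_h\),
two types of ratio estimators could be used. The separate ratio estimator is
\[ y_{RS} = \sum_{h=1}^{L} y_{hR}^* = \sum_{h=1}^{L} Z_h r_{1h} = \sum_{h=1}^{L} Z_h \bar{y}_h / \bar{z}_h \tag{10.43} \]
(the additional subscript ‘S’ in \(y_{RS}\) standing for ‘separate’ ratio
estimator).
The variance estimator of \(y_{RS}\) is the sum of the estimated
variances of the stratum ratio estimators; thus
\[ s_{y_{RS}}^2 = \sum_{h=1}^{L} s_{y_{hR}^*}^2 \tag{10.44} \]
The combined ratio estimator is
\[ y_{RC} = Z r \tag{10.45} \]
where Z is the overall total of the ancillary variable,
\[ z = \sum_{h=1}^{L} z_{h0}^* = \sum_{h=1}^{L} N_h \bar{z}_h \tag{10.46} \]
and
\[ r = y/z \tag{10.47} \]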
Notes:
1. The combined ratio estimator \(y_{RC}\) does not require a knowledge of the
stratum totals \(Z_h\) of the ancillary variable, but only of the overall total Z.
2. The separate ratio estimator will be more efficient than the combined ratio
estimator if the sample in each stratum is not too small and the universe
stratum ratios \(R_{1h} = Y_h/Z_h\) vary considerably from stratum to stratum. If
the sample in each stratum is small, the separate ratio estimator will be
subject to a large bias.
3. If stratification is made with respect to an ancillary variable, then the
ratio method of estimation using the same ancillary variable is not likely
to further improve the efficiency of estimators.
4. The ratio method of estimation can be applied at intermediate levels of
aggregation between the individual strata and the overall universe; in the
example given in note 2 of section 9.3, the ratio method could be used not
only at the stratum and country levels, but also at the level of the states.
However, the ratio estimate obtained for the country or for a similar higher
level of aggregation by applying the ratio method of estimation at the lower
levels is likely to contain a bias that may be relatively large in relation to
the standard error.
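The difference between the separate and the combined ratio estimators can be made concrete with a small sketch. It assumes the forms \(\sum Z_h \bar{y}_h/\bar{z}_h\) for the separate estimator and \(Z\,y/z\) for the combined estimator, with \(y = \sum N_h\bar{y}_h\) and \(z = \sum N_h\bar{z}_h\). The function name is hypothetical, and the figures, including the known stratum totals \(Z_h\), are illustrative values only loosely modelled on the tables of this chapter.

```python
from statistics import mean


def ratio_estimates(strata):
    """Return (separate, combined) ratio estimates of the total of y.

    Each stratum is a dict with
      N    : number of universe units in the stratum
      Z    : known stratum total of the ancillary variable
      y, z : sample values of the study and the ancillary variable
    """
    y_separate = 0.0
    y_hat = z_hat = Z_total = 0.0
    for s in strata:
        y_bar, z_bar = mean(s["y"]), mean(s["z"])
        y_separate += s["Z"] * y_bar / z_bar   # stratum-by-stratum ratios
        y_hat += s["N"] * y_bar                # unbiased estimate of Y
        z_hat += s["N"] * z_bar                # unbiased estimate of Z
        Z_total += s["Z"]
    y_combined = Z_total * y_hat / z_hat       # one overall ratio
    return y_separate, y_combined


# Illustrative data: present population y, previous census population z
strata = [
    {"N": 10, "Z": 790.0, "y": [82, 88], "z": [81, 80]},
    {"N": 11, "Z": 1040.0, "y": [94, 105], "z": [98, 86]},
]
y_sep, y_comb = ratio_estimates(strata)
print(round(y_sep, 1), round(y_comb, 1))
```

Note that only the combined estimator could still be computed if the stratum totals \(Z_h\) were unknown and only their sum were available.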
Example 10.5
For the same data as for Example 10.2, given the additional information on the
previous census population in Table 10.11, obtain ratio estimates of the present
total population and the number of households for the three states separately and
also for all the states combined.
The required computations are given in Tables 10.11-10.13 and the final
estimates in Table 10.14. First, unbiased stratum estimates of the previous census
population are computed in Table 10.11. For the ratio estimates of the present
total population for the three states separately, we use equation (10.39); for all the
strata combined, the separate ratio estimate is obtained using equation (10.43),
and the combined ratio estimate by using equation (10.45), and the respective
Example 10.6
For the Demographic Sample Survey in Chad (1964), the universe of population
(excluding nomads) was divided into nine strata - one for the urban areas and
eight for rural areas, where the stratification criterion was chosen on the basis
of the most dominant ethnic group. In each rural stratum, the sub-universe was
classified according to the administrative division (Préfectures, Sous-préfectures
and Cantons) and the size of the village. The rural strata were composed of
a number of Sous-préfectures. Each rural stratum was divided into three
sub-strata according to the size of the village: (a) with up to 100 persons, (b) with
200 to 499 persons, and (c) with 500 or more persons, according to the previous
administrative census. Primary units were constructed in sub-stratum (a) by
grouping small villages, in (b) by considering the villages as the primary units,
and in (c) by grouping the localities of a village so that they contain an average
of 300 persons: a systematic sample of 1 in 20 primary units was chosen for the
survey, and the population, and births and deaths during the preceding 365 days
recorded. The relatively large sampling fraction was chosen in order to provide
reliable estimates at the stratum level.
Note: Sampling in rural sub-stratum (c) was stratified two-stage, to be considered
in Part IV.
The sample comprised 101 thousand persons in the rural areas and 11
thousand persons in the urban: the overall sampling fraction was 5 per cent. Using
the previous administrative census data, the ratio estimates of population were
2365±87.1 thousand in the rural section and 2530±91.1 thousand in the country.
On the basis of these results, the 95 per cent limits were placed at 2268 and 2444
thousand for the rural population and 2439 and 2629 thousand for the total
population of the country (excluding nomads). (For further details, see the report by
Behmoiras).
Table 10.11: Previous census population in each of the two sample villages
selected in Example 10.2 in each of the three states, and computation of
stratum estimates: Stratified srs of villages
I 10 2 2 1 81 810
10 2 80 800
Total 1610
Mean (= stratum estimate of total) 805 ($z^*_{h0}$)
Difference 10 ($d_{zh}$)
½ |Difference| (= estimated standard error) 5 ($s_{z^*_{h0}}$)
II 11 2 5 1 98 1078
11 2 86 946
Total 2024
Mean (= stratum estimate of total) 1012 ($z^*_{h0}$)
Difference 132 ($d_{zh}$)
½ |Difference| (= estimated standard error) 66 ($s_{z^*_{h0}}$)
Total 1881
Mean (= stratum estimate of total) 940.5 ($z^*_{h0}$)
Difference 135 ($d_{zh}$)
½ |Difference| (= estimated standard error) 67.5 ($s_{z^*_{h0}}$)
All states:
Separate ratio
estimate 3022.2 36.91 1.22 598.8 20.84 3.48
Combined ratio
estimate 3023.2 26.98 0.89 596.2 21.28 3.57
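If separate ratio estimators are used for the totals of two study variables,
namely
$$y_{RS} = \sum^{L} y^*_{hR} = \sum^{L} Z_h r_{1h} = \sum^{L} Z_h \bar{y}_h/\bar{z}_h \qquad (10.43)$$
$$x_{RS} = \sum^{L} x^*_{hR} = \sum^{L} Z_h r_{2h} = \sum^{L} Z_h \bar{x}_h/\bar{z}_h \qquad (10.50)$$
where $s^2_{y_{RS}}$ and $s^2_{x_{RS}}$ are given by estimating equations of the type (10.44),
and the estimated covariance is
$$s_{y_{RS} x_{RS}} = \sum^{L} s_{y^*_{hR} x^*_{hR}} \qquad (10.53)$$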
where
*
*hi
(y - rih^t)(r^. - r2hz'hi)
hH XKR (10.54)
Tlh(nh - 1)
(10.55)
n
Example 10.7
For the data of Example 10.2, estimate the gain in the efficiency of the sample
estimate of the total population due to stratification, as compared to that for an
unstratified srs.
The required computations are given in Table 10.15. The value of the expression
(10.55) is
(30 × 297,072.5 − 2,961.5² + 9,744.25)/6 = 25,239.5
The efficiency of the estimate from the stratified sample as compared with
that from the unstratified is thus
25,239.5/9,744.25 = 2.59 or 259%
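The arithmetic of Example 10.7 can be checked in a few lines of Python. This is only a numerical check of the figures quoted above, reading the middle term of the numerator as the square of 2,961.5; the variable names are illustrative and do not come from the text.

```python
# Numerical check of Example 10.7, using only the figures quoted in the text.
numerator = 30 * 297_072.5 - 2_961.5 ** 2 + 9_744.25
unstratified_variance = numerator / 6   # value of expression (10.55)
stratified_variance = 9_744.25          # variance from the stratified sample

# Efficiency of the stratified sample relative to an unstratified srs.
efficiency = unstratified_variance / stratified_variance

print(unstratified_variance)   # 25239.5
print(round(efficiency, 2))    # 2.59
```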
Stratified Simple Random Sampling 181
where
$P_h = N_h/N$ (10.58)
are the “weights”, i.e. the proportions of the total number of units in
the different post-strata, the values of being known or taken from
previous inquiries. This is a biased estimator.
A variance estimator of y is
Notes
1. If the sample is reasonably large, say over twenty in each stratum, and the
errors in the weights negligible, the method of post-stratification can give
results almost as efficient as proportional stratified sampling (see Chapter
12). This is also obvious from the fact that for large samples, the sample
units are likely to be distributed in proportion to $N_h$.
2. The method can also be used when the sample is already stratified according
to some variable other than that desired for the study in hand.
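As an illustration of post-stratification, the following minimal Python sketch weights the sample mean of each post-stratum by a known proportion $N_h/N$. It assumes the usual form of the post-stratified mean (the sum over post-strata of weight times sample mean); the function name and the data are hypothetical, not taken from the text.

```python
from collections import defaultdict

def post_stratified_mean(values, strata, weights):
    """Post-stratified mean: sum over post-strata of weight * sample mean.

    values  : observed values of the study variable
    strata  : post-stratum label of each sample unit (assigned after selection)
    weights : dict mapping each label to its known proportion N_h / N
    """
    groups = defaultdict(list)
    for y, h in zip(values, strata):
        groups[h].append(y)
    empty = set(weights) - set(groups)
    if empty:
        # A post-stratum with no sample units cannot be estimated this way.
        raise ValueError(f"no sample units in post-strata: {sorted(empty)}")
    return sum(weights[h] * sum(ys) / len(ys) for h, ys in groups.items())

# Hypothetical sample of four units falling into two post-strata.
estimate = post_stratified_mean(
    values=[2, 4, 6, 8],
    strata=["a", "a", "b", "b"],
    weights={"a": 0.25, "b": 0.75},
)
print(estimate)  # 6.0
```

The check for empty post-strata reflects the advice in note 1: the method relies on a reasonably large sample in every post-stratum.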
Further reading
Cochran, sections 5.1-5.4, 5.6, 5.10, 5A.8, 6.10-6.12; Hansen et al. (1953), vol. I,
chapters 5A, 5B, and 5D, and vol. II, sections 5.1-5.3, 5.5, 5.6, 5.12, and 5.13;
Hedayat and Sinha, chapter 9; Murthy (1967), sections 7.1-7.3, 7.5-7.7, 7.14, and
10.6; Sukhatme et al. (1984), sections 4.1-4.3, 4.5, and 4.6; Yates, sections 3.3-
3.5, 6.5-6.7, 6.10, 6.11, 6.13, 6.14, 7.6, 7.6.1, 7.6.2, 7.9-7.10, 7.13, 8.3-8.5, 8.5.1,
and 8.5.2.
Exercises
3. For the data of Example 10.5, estimate the average household size from
the separate ratio estimates of numbers of persons and households, and its
standard error.
4. A simple random sample of 125 farms out of a total of 2496 farms in
Hertfordshire, 1939, gave the data on (a) total acreage and (b) acreage of
wheat, classified by districts after selection (Table 10.18). Estimate the
total area of wheat and the number of farms growing wheat (i) directly
from the sample, and (ii) by stratification by size, given the total number
of farms in the specific size-groups in col. 2 in Table 10.2 to the answer
(Yates, Examples 6.6 and 7.6.b, modified to “with replacement” sampling).
(a) The Federal Statistical Office uses the following estimators: the unbiased
estimator of $Y$ is $y = 100 \sum n_h \bar{y}_h$, and the unbiased variance
estimator of $y$ is $0.99 \times 10^4 \sum n_h SSy_{hi}/(n_h - 1)$. Verify these,
noting that the finite multiplier is not ignored for the variance estimator
(Herberger, 1971).
(b) Also verify that
—2r SPyhiTht
(Hint: (a) Use respectively equations (10.5.d) and (10.24), noting that
$f_h = n_h/N_h = 0.01$.)
(a) Estimate the total number and proportion of AIDS admissions for the
34 county area;
(b) How would you estimate the number of AIDS admissions by utilizing
the data on the total number of hospital beds for all the hospitals in
each stratum (given also in the bottom half of Table 10.19) (Levy and
Lemeshow, Exercise 12.2, adapted)?
Table 10.16: Total number of cattle in the srs of farms and the raw sum of
squares in different strata
I 1411 43 84 98 0 10 44 0 124 13 0
II 4705 50 147 62 87 84 158 170 104 56 160
III 2558 228 262 110 232 139 178 334 0 63 220
IV 14 997 17 34 25 34 36 0 25 7 15 31
Table 10.18: Total acreage (a) and acreage in wheat (b) in the srs of farms
in Hertfordshire, U.K., 1939
1 1 72 20
2 87 49
2 1 99 38
2 48 23
3 1 99 38
2 131 78
4 1 42 7
2 38 28
5 1 42 26
2 34 9
6 1 30 18
2 76 20
Hospitals Beds
1 4 11 588
2 4 9 421
3 4 8 776
4 4 6 196
5 4 4 175
6 4 8 375
11.1 Introduction
$$N = \sum_{h=1}^{L} N_h \qquad (11.1)$$
190 Chapter 11
Table 11.1: Some universe parameters and their sample estimators for a
stratified varying probability sample
For a study
variable: Yh = y*ho = Y =
Total yhJn* Z2h=l yh0 (11-6)
Mean Yh = *hy o =
Yh/Nh yho/Nh Y = Y/N y/N (H.7)
(11.9)
$$s^2_{y^*_{h0}} = \sum_{i=1}^{n_h} (y^*_{hi} - y^*_{h0})^2 / n_h(n_h - 1) = SSy^*_{hi}/n_h(n_h - 1) \qquad (11.10)$$
The sampling variance of the sample estimator of the hth stratum mean
is
(11.11)
(11.12)
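An unbiased estimator of the covariance of $y^*_{h0}$ and $x^*_{h0}$ ($x^*_{h0}$ being
defined similarly for another study variable) is, from equation (5.8) or (9.10),
$$s_{y^*_{h0} x^*_{h0}} = \sum_{i=1}^{n_h} (y^*_{hi} - y^*_{h0})(x^*_{hi} - x^*_{h0}) / n_h(n_h - 1) = SPy^*_{hi} x^*_{hi}/n_h(n_h - 1)$$
An estimator of the variance of $r_h$ is, from equation (5.10) or (9.14),
$$s^2_{r_h} = (s^2_{y^*_{h0}} + r_h^2 s^2_{x^*_{h0}} - 2 r_h s_{y^*_{h0} x^*_{h0}}) / x^{*2}_{h0}$$
$$\phantom{s^2_{r_h}} = (SSy^*_{hi} + r_h^2 SSx^*_{hi} - 2 r_h SPy^*_{hi} x^*_{hi}) / n_h(n_h - 1)\, x^{*2}_{h0} \qquad (11.13)$$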
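The stratum-level computations can be sketched in Python as follows. The sketch assumes pps sampling with replacement, with $y^*_{hi} = y_{hi}/\pi_{hi}$ and $\pi_{hi} = z_{hi}/Z_h$, and takes SS and SP to be corrected sums of squares and products; the function name and the sample figures are hypothetical.

```python
from math import sqrt

def pps_stratum_estimates(y, x, z, Z_h):
    """Totals, ratio and standard errors for one stratum (pps, with replacement).

    y, x : values of two study variables for the sample units
    z    : "size" values of the same units; Z_h is the stratum total "size"
    """
    n = len(y)
    # y*_hi = y_hi / pi_hi, with pi_hi = z_hi / Z_h
    y_star = [Z_h * yi / zi for yi, zi in zip(y, z)]
    x_star = [Z_h * xi / zi for xi, zi in zip(x, z)]
    y0 = sum(y_star) / n
    x0 = sum(x_star) / n
    d = n * (n - 1)
    var_y = sum((v - y0) ** 2 for v in y_star) / d
    var_x = sum((v - x0) ** 2 for v in x_star) / d
    cov_yx = sum((a - y0) * (b - x0) for a, b in zip(y_star, x_star)) / d
    r = y0 / x0
    var_r = (var_y + r * r * var_x - 2 * r * cov_yx) / x0 ** 2
    return {
        "y0": y0, "x0": x0, "r": r,
        "se_y": sqrt(var_y), "se_x": sqrt(var_x),
        "se_r": sqrt(max(var_r, 0.0)),  # guard against tiny negative rounding error
    }

# Hypothetical stratum with two sample units (illustrative numbers only).
est = pps_stratum_estimates(y=[10, 20], x=[2, 5], z=[5, 10], Z_h=100)
print(est["y0"], est["x0"], est["se_x"])  # 200.0 45.0 5.0
```

With two sample units the standard error of a total equals half the absolute difference of the two unit estimates, which is the simplified procedure of section 9.5; here the unit estimates of x are 40 and 50, giving 5.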
11.2.4 Sampling variances and their estimators for all strata combined
The sampling variance of the estimator y of the universe total Y is
L
(11.14)
$\sigma^2_y/N^2$ (11.16)
$s^2_y/N^2$ (11.17)
$s^2_r = (s^2_y + r^2 s^2_x - 2 r s_{yx})/x^2$ (11.19)
For the notes relating to the estimation of ratios, see the notes to section
2.7.
Example 11.1
From each of the three states (Appendix IV), select with probability proportional
to the previous census population and with replacement a sample of two villages
each, and on the basis of the data on the current number of households and of
persons in these sample villages, estimate for the three states separately and also
for the three states combined, the total numbers of households and of persons
and average household size, with standard errors.
Here the states are the strata. The sample villages are selected with probability
proportional to the previous census population by adopting the procedure
of cumulation of the previous census population in each state (section 5.4). The
information on the selected sample villages is given in Table 11.2; the $Z_h$ values
(the total previous census population in the states) are 863, 1010, and 942
respectively. As two sample villages are selected from each state, the simplified
procedure of section 9.5 for estimation will be followed, as was done in Example
10.2 relating to a stratified srs. The required computations are shown in Tables
11.2-11.4 and the final results in Table 11.5.
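The cumulation procedure for pps selection with replacement can be sketched in Python as below. This is a minimal illustration under the assumption that sizes are positive integers and that a random number between 1 and the cumulative total selects the unit whose cumulative range contains it; the function names and the sizes are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def unit_for_random_number(cumulative, r):
    """Index of the unit whose cumulative range contains r (1 <= r <= total)."""
    return bisect_left(cumulative, r)

def pps_with_replacement(sizes, n, rng):
    """Select n units with probability proportional to size, with replacement."""
    cumulative = list(accumulate(sizes))
    total = cumulative[-1]
    return [unit_for_random_number(cumulative, rng.randint(1, total))
            for _ in range(n)]

# Hypothetical sizes: cumulative totals are 10, 30, 60.
sizes = [10, 20, 30]
cumulative = list(accumulate(sizes))
print(unit_for_random_number(cumulative, 10))  # 0 (first unit covers 1-10)
print(unit_for_random_number(cumulative, 11))  # 1 (second unit covers 11-30)
print(unit_for_random_number(cumulative, 60))  # 2 (third unit covers 31-60)

sample = pps_with_replacement(sizes, n=2, rng=random.Random(1))
```

Because selection is with replacement, the same unit may appear more than once in `sample`.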
Example 11.2
For the Indian National Sample Survey (1957-8) in the rural section, the total
number of 2522 tehsils were grouped into a number of geographical strata. In
each stratum, two villages were selected from the total number of villages with
probability proportional to the 1951 Census population; in the sample villages
Stratified Varying Probability Sampling 193
Table 11.2: Population and number of households in each of the two sample
villages selected with probability proportional to previous census population
in each of the three states (Appendix IV): Stratified pps sample of
villages
* Not additive
Table 11.4: Computation of the estimated average household size and sampling
variance: data of Tables 11.2 and 11.3
Columns: (1) State; (2) average household size $r_h = y^*_{h0}/x^*_{h0}$;
(3) $s_{y^*_{h0} x^*_{h0}} = \frac{1}{4} d_{yh} d_{xh}$;
(4) $s^2_{y^*_{h0}} + r_h^2 s^2_{x^*_{h0}} - 2 r_h s_{y^*_{h0} x^*_{h0}}$;
(5) $s^2_{r_h}$ = col. (4)/$x^{*2}_{h0}$
Table 11.5: Estimates and standard errors computed from the data
of Table 11.2: Stratified pps sample of villages
                                    State I   State II   State III   All states
1. Total population
(a) Estimate                         1052.7     1117.8      1028.2       3198.7
(b) Standard error                     2.09      10.99       10.46        15.32
(c) Coefficients of variation (%)      0.20       0.98        1.02         0.48
thus selected, the existing households were listed for the purpose of sub-sampling
of households for socio-economic inquiries, the sampling design for which was thus
stratified two-stage (see section 1.9 and Chapter 14). However, at the time the
list of households in the sample villages was constructed, information on the total
number of births during the preceding 365 days in the premises occupied by the
households was collected, along with the household size from all the households.
The sample design for this inquiry on births was thus stratified single-stage pps.
The estimating equations for the total population, births and the birth rate and
their estimated variances will thus take the forms given in section 11.2.
Estimates of the birth rates and estimated standard errors for rural India as
a whole as also for the five zones are given in Table 11.6. The zones comprised
one or more states, each state containing several strata: for the zonal estimates,
the same types of formulae as for the overall estimates for the country as a whole
were used, excepting that only those strata constituting a zone were considered
(for further details, see Som et al., 1961). There were 924 villages, over 135,000
households, and over 680,000 persons in the sample.
By definition, births occurring in hospitals and other institutions were not
recorded, nor, for households occupying the premises for less than a year, the
births occurring in the previous residence. Even with the limited definition, it
appears likely that a number of births were not reported because of the lack of
emphasis and of probes and other associate items and of cross-checks on this
item in the “listing schedule”. The methods of obtaining adjusted estimates from
demographic data are described briefly in section 25.6.
Table 11.6: Estimated birth rate (per 1000 persons): Indian National Sample
Survey, rural sector, 1957-8
Example 11.3
The universe of 112 villages was divided into three strata with 51, 37, and 24
villages respectively. From the first stratum, a simple random sample of 6 villages
was selected without replacement, from the second stratum a sample of 5 villages
with probability proportional to the cultivated area (the total cultivated area in
the stratum was 26,912 acres) and with replacement, and from the third stratum
two linear systematic samples of 4 villages each. For each selected sample village,
the total area under wheat was observed: this information is given in Table 11.7,
along with the total areas of the sample villages of stratum II. Estimate the total
area under wheat in each stratum separately and also for all strata combined,
along with the standard errors (Murthy (1967), Problem 7.3, adapted).
Table 11.7: Area under wheat ($y_{hi}$) for all the sample villages and cultivated
area ($x_{hi}$) for the sample villages of stratum II
For stratum I, the methods of Example 2.1 are followed. The estimated area
under wheat is 3247 acres with estimated variance 435,540 (acres)².
For stratum II, we follow the methods of section 11.2. From equations (11.5b)
and (11.10), the estimated area under wheat is 10,507 acres with estimated
variance 153,506 (acres)².
For stratum III, the methods of Example 4.1 are followed. The estimated
area under wheat is 9831 acres with variance 59,049 (acres)².
The estimated total area under wheat in the three strata separately and
combined are shown in Table 11.8, along with their estimated variance, standard
errors, and coefficients of variation.
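The combination of the three stratum results can be sketched as follows, using only the stratum estimates and variances quoted in this example. The sketch relies on the strata being sampled independently, so that totals and variances both add; the variable names are illustrative.

```python
from math import sqrt

# Stratum estimates (acres) and estimated variances (acres squared)
# as quoted in Example 11.3 for strata I, II and III.
estimates = [3247, 10_507, 9831]
variances = [435_540, 153_506, 59_049]

total = sum(estimates)              # combined estimate of the wheat area
variance = sum(variances)           # variances add for independent strata
standard_error = sqrt(variance)
cv_percent = 100 * standard_error / total

print(total)     # 23585
print(variance)  # 648095
print(round(standard_error, 2), round(cv_percent, 2))
```

The same pattern applies to any stratified design: each stratum may use a different selection method, provided an estimate and a variance estimate are available for each.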
11.4.1 Introduction
As for unstratified pps sampling (section 5.5), so also for stratified sampling,
the sampling of fields (or farms or plots) with probability proportional to
total (geographical) area (ppa) simplifies the estimating procedures, in addition
to possibly improving the efficiency of the estimators.
Table 11.8: Estimated wheat acreage, obtained from the data of Table 11.7
$$\pi_{hi} = a_{hi} \Big/ \sum_{i=1}^{N_h} a_{hi} = a_{hi}/A_h$$
where $a_{hi}$ is the area of the $i$th field ($i = 1, 2, \ldots, N_h$ for the universe, and
$i = 1, 2, \ldots, n_h$ for the sample) and $\sum_{i=1}^{N_h} a_{hi} = A_h$, the total area of all
the $N_h$ fields in the stratum, and if for the $i$th sample field, $y_{hi}$ denotes
the area under a particular crop, then an unbiased estimator of the average
proportion under the crop in the stratum is, from equation (5.13),
$$\bar{p}_h = \sum_{i=1}^{n_h} p_{hi}/n_h \qquad (11.20)$$
An unbiased estimator of the total area under the crop in the hth stratum
is, from equation (5.11),
For all the strata combined, the unbiased estimator of the total area
under the crop Y is, from equation (11.6.d),
L L
4=ss7X2 (11.27)
Note: If the crop is such that it either occupies the whole of a field or no part of it,
or if the fields are small enough for this assumption to hold, i.e. the proportion
of the total area under the crop $p_{hi}$ is either 1 (whole) or 0 (none), then the
estimator of the variance of the total area under the crop in equation (11.25)
reduces to
$$s^2_y = \sum^{L} A_h^2\, \bar{p}_h (1 - \bar{p}_h)/(n_h - 1) \qquad (11.28)$$
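The reduction stated in the note can be checked numerically. The sketch below assumes the usual with-replacement variance estimator for a stratum, $A_h^2 \sum (p_{hi} - \bar{p}_h)^2 / n_h(n_h - 1)$, which may differ in presentation from equation (11.25); the function name and the data are hypothetical.

```python
def area_under_crop(A_h, p):
    """Estimated crop area in one stratum from a ppa sample of fields.

    A_h : total geographical area of the stratum
    p   : proportion of each sample field under the crop (p_hi = y_hi / a_hi)
    Returns (mean proportion, estimated total crop area, estimated variance).
    """
    n = len(p)
    p_bar = sum(p) / n
    total = A_h * p_bar
    variance = A_h ** 2 * sum((pi - p_bar) ** 2 for pi in p) / (n * (n - 1))
    return p_bar, total, variance

# Hypothetical stratum of 1000 acres; each sample field is wholly under
# the crop (1) or not at all (0).
p_bar, total, variance = area_under_crop(1000, [1, 0, 1, 1])

# Reduced form for 0/1 proportions: A_h^2 * p_bar * (1 - p_bar) / (n_h - 1)
reduced = 1000 ** 2 * p_bar * (1 - p_bar) / (4 - 1)

print(p_bar, total)       # 0.75 750.0
print(variance, reduced)  # 62500.0 62500.0
```

The two variance figures agree because, for 0/1 values, the corrected sum of squares equals $n_h \bar{p}_h (1 - \bar{p}_h)$.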
X= (11.33)
with an unbiased variance estimator
4 = ¿4-
r Z_-< rhO (11.34)
An unbiased estimator of the average yield per unit area in the universe
is
$$r = x/A \qquad (11.35)$$
with an unbiased variance estimator
$$s^2_r = s^2_x/A^2 \qquad (11.36)$$
A generally biased but consistent estimator of the average yield per
unit of crop area in the hth stratum is
$$r_h = x^*_{h0}/y^*_{h0} = \bar{r}_h/\bar{p}_h \qquad (11.37)$$
with a variance estimator
As for unstratified pps sampling (section 5.6), so also for stratified pps sampling
the ratio method of estimation may be used to improve the efficiency
of estimators; and as for stratified srs (section 10.4) for the whole universe,
“combined” and “separate” ratio estimators may be used.
If $w$ is the ancillary variable used for the ratio estimation, then an
unbiased estimator of the universe total $W_h$ in the hth stratum from a pps
(the “size” variable being $z$) sample of units is, from equation (11.6.b),
$$w^*_{h0} = \sum_{i=1}^{n_h} w^*_{hi}/n_h = \sum_{i=1}^{n_h} w_{hi}/(\pi_{hi} n_h) = Z_h \sum_{i=1}^{n_h} w_{hi}/(z_{hi} n_h) \qquad (11.42)$$
where $Z_h = \sum_{i=1}^{N_h} z_{hi}$, the total of the values of the size variable in the hth
stratum.
The ratio estimator of the total Yh in the stratum is (from equation
(5.27))
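where
$$W_h = \sum_{i=1}^{N_h} w_{hi} \quad \text{and} \quad r_{1h} = y^*_{h0}/w^*_{h0} \qquad (11.44)$$
(11.45)
where
$$s^2_{y^*_{h0}} = SSy^*_{hi}/n_h(n_h - 1)$$
$$s^2_{w^*_{h0}} = SSw^*_{hi}/n_h(n_h - 1)$$
$$s_{y^*_{h0} w^*_{h0}} = SPy^*_{hi} w^*_{hi}/n_h(n_h - 1) \qquad (11.46)$$
For the ratio estimator of the total of the study variable $Y = \sum^{L} Y_h$,
two types of ratio estimators could be used.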
r = y/w (11.50)
where
L
h=l
L I" nh
13 S Vhiwhi~ (11.52)
h=l
Note: For the ratio method of estimation to be efficient, the probability of
selection should be appropriate for both y and w. See also the notes in section
10.4.
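The difference between the two types of ratio estimator can be illustrated with a short Python sketch. It assumes that the separate estimator sums $W_h y^*_{h0}/w^*_{h0}$ over strata and that the combined estimator applies the overall ratio $r = y/w$ to the universe total $W = \sum W_h$; the function names and the figures are hypothetical.

```python
def separate_ratio_total(y0, w0, W):
    """Separate ratio estimator: apply each stratum's own ratio to its W_h."""
    return sum(W_h * y_h / w_h for y_h, w_h, W_h in zip(y0, w0, W))

def combined_ratio_total(y0, w0, W):
    """Combined ratio estimator: one overall ratio y/w applied to W."""
    return sum(W) * sum(y0) / sum(w0)

# Hypothetical unbiased stratum estimates y*_h0 and w*_h0, and known totals W_h.
y0 = [100.0, 200.0]
w0 = [50.0, 80.0]
W = [60.0, 90.0]

print(separate_ratio_total(y0, w0, W))            # 345.0
print(round(combined_ratio_total(y0, w0, W), 2))  # 346.15
```

The two estimates coincide when the stratum ratios $y^*_{h0}/w^*_{h0}$ are equal, and diverge as the ratios differ between strata.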
For two or more study variables, the ratio method of estimation may be
applied, using the same ancillary variable. However, in computing the ratio
of the totals of two study variables, the ratio method is equivalent to using
the unbiased estimators of the totals for the stratum levels, as also for all
the strata combined if combined ratio estimators for totals are used.
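If separate ratio estimators are used for the totals of the two study
variables, namely
$$y_{RS} = \sum^{L} y^*_{hR} = \sum^{L} W_h y^*_{h0}/w^*_{h0} = \sum^{L} W_h r_{1h} \qquad (11.47)$$
$$x_{RS} = \sum^{L} x^*_{hR} = \sum^{L} W_h x^*_{h0}/w^*_{h0} = \sum^{L} W_h r_{2h} \qquad (11.53)$$
where $s^2_{y_{RS}}$ and $s^2_{x_{RS}}$ are given by estimating equations of the type (11.48),
and the estimated covariance by
$$s_{y_{RS} x_{RS}} = \sum^{L} s_{y^*_{hR} x^*_{hR}} \qquad (11.57)$$
where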
As for stratified srs (section 10.6), so also for stratified pps samples it is
possible to examine the gain, if any, due to stratification as compared to
unstratified pps sampling.
We define
$$\pi_{h0} = \sum_{i=1}^{N_h} \pi_{hi} = \sum_{i=1}^{N_h} z_{hi}/Z$$
as the sum of the probabilities of selection of the units of an unstratified
pps sample, the summation being over the $N_h$ units in the hth stratum.
The gain due to stratification in the variance estimator of the total of
the study variable is
^(I/httao - l/n^)«2-o + -—
n
E£(l/7Tho ~ l)Sy»o
(11.59)
n
, N£.L NkfÎM-ÿr
^L(N/Nh - l)sj
____________________________ J ft 0
(11.60)
n
Note: The condition under which stratification would lead to a gain will be
discussed in Chapter 12.
Further reading
Cochran, sections 9A.13, 11.5, and 11.16; Hedayat and Sinha, chapter 9; Murthy
(1967), sections 7.8, 7.8b, and 10.6; Sukhatme et al. (1984), sections 4.17-4.19,
5.13, and 6.8; Yates, sections 3.10, 6.17, 7.5, and 7.16.
Exercise
1. Table 11.9 gives the acreage of wheat ($y_{hi}$) in sample parishes, selected
with probability proportional to size (the total acreage of crops and grass,
$x_{hi}$) in four districts of Hertfordshire, U.K. Estimate the total acreage of
wheat in these four districts taken together with its standard error, given
the total sizes of the districts at the bottom of the Table (Yates, Examples
6.17 and 7.16, modified).
Table 11.9: Acreage of wheat ($y_{hi}$) in sample parishes selected with probability
proportional to size (the total acreage of crops and grass, $x_{hi}$) in four
districts of Hertfordshire, U.K.
Size of Sample
and Allocation to Different Strata
12.1 Introduction
In this chapter we will consider first the problems of the allocation of the
total sample size into different strata and of the formation of strata, and
second, the determination of the total size of the sample.
Given a total sample of size n, its allocation to the different strata would
be based on the same principle as in Chapter 7, namely, that for a specified
total cost of surveying the sample, the sampling variance should be the
minimum or vice versa.
The sampling variance of the sample estimator y of the universe total
208 Chapter 12
where $V_h$ is the variance per unit in the hth stratum, and $n_h$ is the sample
size in the hth stratum. Thus in stratified srs (with replacement),
$$V_h = \sigma^2_h = \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2 / N_h \qquad (12.2)$$
(see also equation (10.8)) and in stratified srs (with replacement) for
proportions
hi--
Vh = YyYhd
* >"l
Ytf
* N'ik (12.4)
1=1
where $\pi_{hi} = z_{hi}/Z_h$ is the probability of selection of the $i$th universe unit
($i = 1, 2, \ldots, N_h$) in the hth stratum, $z_{hi}$ is the “size” of the $hi$th universe
unit, and $Z_h$ is the total “size” of the hth stratum from equation (5.4).
A simple cost function in stratified sampling is
$$C = c_0 + \sum_{h=1}^{L} n_h c_h \qquad (12.5)$$
where $c_0$ is the overhead cost, and $c_h$ the average cost of taking a sample unit
in the hth stratum, which may vary from stratum to stratum, depending
on field conditions.
n ^\/(u/c/i)
$$n_h = (C - c_0)\, \frac{N_h \sqrt{V_h/c_h}}{\sum^{L} N_h \sqrt{V_h c_h}} \qquad (12.6)$$
These are the optimum allocations of the total sample size n into different
strata and show that a large sample should be taken in a stratum which
is large ($N_h$ large), is more variable internally ($V_h$ large), and sampling is
inexpensive ($c_h$ small).
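A minimal Python sketch of the optimum allocation for a fixed total cost is given below. It assumes the standard result that $n_h$ is proportional to $N_h \sqrt{V_h/c_h}$, scaled so that the field cost $\sum n_h c_h$ equals $C - c_0$; the function name and the figures are hypothetical, and the resulting $n_h$ would be rounded in practice.

```python
from math import sqrt

def optimum_allocation(N, V, c, C, c0):
    """Optimum n_h for a fixed total cost C with overhead c0.

    N : stratum sizes N_h
    V : variances per unit V_h
    c : costs per sample unit c_h
    """
    denominator = sum(N_h * sqrt(V_h * c_h) for N_h, V_h, c_h in zip(N, V, c))
    return [(C - c0) * N_h * sqrt(V_h / c_h) / denominator
            for N_h, V_h, c_h in zip(N, V, c)]

# Hypothetical two-stratum universe (illustrative numbers only).
N, V, c = [100, 200], [4.0, 9.0], [1.0, 4.0]
n_h = optimum_allocation(N, V, c, C=1100.0, c0=100.0)

field_cost = sum(n * cost for n, cost in zip(n_h, c))
print([round(n, 1) for n in n_h])  # [142.9, 214.3]
print(round(field_cost, 6))        # 1000.0
```

The second stratum is twice as large and more variable, but its unit cost is four times higher, so its share of the sample is smaller than size and variability alone would suggest.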
Notes
1. Theoretical proofs are given in Appendix II, section A2.3.10.
2. The optimum $n_h$ values would have been the same had we considered the
estimation of the universe mean $Y/N$, rather than that of the total $Y$. In
comparing expression (12.7) relating to the total with equivalent expressions
for the mean in other textbooks, note that the variance for the total
is $N^2$ times the variance for the mean.
3. For stratified srs without replacement, the optimum allocation is similarly
obtained by taking the variance function (10.21) and the cost function
(12.5). The optimum $n_h$ is proportional to $N_h S_h/\sqrt{c_h}$ (where $S_h$ is defined
by equation (10.22)), and is given by the relation
$$\frac{n_h}{n} = \frac{N_h S_h/\sqrt{c_h}}{\sum^{L} N_h S_h/\sqrt{c_h}}$$
(a) A few items may be selected for optimization and an average taken
of the optimum allocations for these items
(Cochran, section 5A.3).
where $n_h$ is the actual and $n'_h$ the optimum sample size in stratum h.
This amounts to choosing
Tlh — n
N
where $n_{jh}$ is the optimum sample size in stratum h for variable j
(Cochran, section 5A.4).
(c) Other options have been suggested by Yates (1981); see also Cochran,
section 5A.4.
$$C = c_0 + nc \qquad (12.8)$$
(see also equation (7.27)) which determines the total sample size, given the
total cost C, thus
$$n = (C - c_0)/c \qquad (12.9)$$
(see also equation (7.28)). The optimum allocation for a fixed total cost is
then (from equation (12.6))
$$n_h = \frac{n N_h \sqrt{V_h}}{\sum^{L} N_h \sqrt{V_h}} \qquad (12.10)$$
of the study variable ($R_h$) may be substituted for $\sqrt{V_h}$ in the Neyman
allocation (also see section 7.3), i.e.
$$n_h = \frac{n N_h R_h}{\sum^{L} N_h R_h} \qquad (12.11)$$
Notes
1. The Neyman allocation is sometimes called the optimum allocation. For
theoretical proof, see Appendix II, section A2.3.10.
2. In the Neyman allocation, the stratum sampling fraction $n_h/N_h$ is proportional
to $\sqrt{V_h}$ (i.e., proportional to $\sigma_h$ in srs).
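The Neyman and proportional allocations can be compared with the following Python sketch. It assumes the Neyman allocation in the form $n_h = n N_h \sqrt{V_h} / \sum N_h \sqrt{V_h}$ and proportional allocation as $n_h = n N_h / N$; the function names and the figures are hypothetical.

```python
from math import sqrt

def proportional_allocation(N, n):
    """n_h proportional to the stratum size N_h."""
    total = sum(N)
    return [n * N_h / total for N_h in N]

def neyman_allocation(N, V, n):
    """n_h proportional to N_h * sqrt(V_h), for equal unit costs in all strata."""
    weights = [N_h * sqrt(V_h) for N_h, V_h in zip(N, V)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical universe: a small but highly variable stratum and a large,
# homogeneous one (illustrative numbers only).
N, V = [100, 300], [16.0, 1.0]

print(proportional_allocation(N, 70))  # [17.5, 52.5]
print(neyman_allocation(N, V, 70))     # [40.0, 30.0]
```

The sampling fractions under the Neyman allocation are 0.4 and 0.1, in the ratio 4 to 1 of the stratum standard deviations, as stated in note 2.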
$$n_h = nZ_h/Z \qquad (12.14)$$
In such a case, stratified pps sampling is always more efficient than
unstratified pps sampling. The wider the variation of the stratum proportions
$Y_h/Z_h$, the greater will be the efficiency.
Note: If, further, $n_h$ is a constant, then the design can be made self-weighting
by selecting a fixed number of sample units with pps in each stratum, the “size”
being the stratum total $Y_h$ (see section 13.3).
where $A_h$ is the total geographical area of the hth stratum, $n_h$ the number
of sample units (fields, farms, or plots) in the hth stratum, and $p_{hi}$ is the
proportion of the total area under the crop in the $hi$th sample unit.
If the allocation of the sample units is made proportional to the total
areas of the strata, i.e. if $n_h$ is proportional to $A_h$ or
$$n_h = nA_h/A \qquad (12.16)$$
where $A = \sum^{L} A_h$ is the total area of the universe, then the design becomes
self-weighting (section 13.3).
Notes
1. Similar considerations apply to yield surveys of crops if the sample units
are selected with probability proportional to area (section 11.4.3).
2. A special case of the above is when the crop is such that it either occupies
the whole of a plot ($p_{hi} = 1$) or not at all ($p_{hi} = 0$), when the sampling
variance of the estimated total area under the crop is
$$\sum^{L} A_h^2 P_h (1 - P_h)/n_h \qquad (12.17)$$
where $P_h$ is the universe proportion of the area under the crop in the hth
stratum.
With nh proportional to Ah, the variance formula (12.17) becomes
(12.19)
52 Ahx/[Ph(! - Ph)]
deviations; in stratified pps sampling, on the other hand, the units within
each stratum should be homogeneous with respect to $Y_{hi}/z_{hi}$; in practice,
the values of the study variable relating to an earlier survey or those of a
related ancillary variable would have to be used. The variable on the basis of
which stratification is made is called the stratification variable.
Note: For estimates of ratios (such as birth rate, death rate etc., in a vital
rate survey, or average income per person in a household budget inquiry),
the stratum sizes may be made equal with respect to a measure of the
size that is highly correlated with the denominator of the ratios (total
population in our examples).
different strata. A very good rule in this case is to form the strata
by equalizing the cumulatives of $\sqrt{f(y)}$, where $f(y)$ is the frequency
distribution of the study variable. This makes $N_h \sigma_h$ approximately
constant in srs, so that the Neyman allocation gives a constant sample
size $n_0 = n/L$ in all strata. As the optimum is generally flat with
respect to variations in $n_h$, the use of the Dalenius-Hodges rule would
be highly efficient.
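The rule of equalizing the cumulatives of $\sqrt{f(y)}$ can be sketched in Python as follows. The sketch assumes a frequency distribution with classes of equal width and picks, for each cut point, the class boundary whose cumulative $\sqrt{f}$ is nearest the target; the function name and the frequencies are hypothetical.

```python
from itertools import accumulate
from math import sqrt

def cum_sqrt_f_boundaries(upper_edges, frequencies, L):
    """Stratum boundaries by the cumulative square-root-of-frequency rule.

    upper_edges : upper boundary of each class of the frequency distribution
    frequencies : number of units in each class (equal class widths assumed)
    L           : number of strata to form
    Returns the L - 1 class boundaries that separate the strata.
    """
    cumulative = list(accumulate(sqrt(f) for f in frequencies))
    step = cumulative[-1] / L
    boundaries = []
    for k in range(1, L):
        target = k * step
        # Choose the class boundary whose cumulative value is nearest the target.
        nearest = min(range(len(cumulative)),
                      key=lambda i: abs(cumulative[i] - target))
        boundaries.append(upper_edges[nearest])
    return boundaries

# Hypothetical skewed distribution: sqrt(f) = 4, 3, 2, 1, 1, 1; total 12.
edges = [10, 20, 30, 40, 50, 60]
freqs = [16, 9, 4, 1, 1, 1]
print(cum_sqrt_f_boundaries(edges, freqs, L=2))  # [20]
```

With two strata the target cumulative value is 6, and the nearest class boundary (cumulative value 7) is the upper edge 20, so the long upper tail forms one stratum even though it contains few units.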
(C'-c.iEA'tVtW
Note: For stratified srs without replacement, the value of n is obtained on
replacing $\sqrt{V_h}$ by $S_h$ in equation (12.20) for optimum allocation.
$$V = \sigma^2_y = \sum^{L} (N_h^2 \sigma_h^2 / w_h)/n \qquad (12.22)$$
$$n = \Big(\sum^{L} N_h \sigma_h\Big)^2 \Big/ V \qquad (12.24)$$
and for proportional allocation, i.e. when $w_h = N_h/N$, the required total
sample size is
$$n = N \sum^{L} N_h \sigma_h^2 / V \qquad (12.25)$$
Notes
1. As $\sigma_h$ will not in general be known, it has to be estimated from a pilot
inquiry or from other available information, such as the range.
2. Estimation of the universe mean $Y/N$ requires the same sample size as that
for the total $Y$.
3. If the margin of error d is specified (see section 7.2.2), then
$$V = (d/t)^2 \qquad (12.26)$$
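The sample size needed for a specified variance of the estimated total can be sketched in Python as below. The sketch assumes stratified srs with replacement, for which the variance under the Neyman allocation is $(\sum N_h \sigma_h)^2/n$ and under proportional allocation is $N \sum N_h \sigma_h^2 / n$; the function names and the figures are hypothetical.

```python
def variance_from_margin(d, t=1.96):
    """Permissible variance V for a margin of error d at normal deviate t."""
    return (d / t) ** 2

def n_neyman(N, sigma, V):
    """Total sample size under the Neyman allocation."""
    return sum(N_h * s_h for N_h, s_h in zip(N, sigma)) ** 2 / V

def n_proportional(N, sigma, V):
    """Total sample size under proportional allocation."""
    return sum(N) * sum(N_h * s_h ** 2 for N_h, s_h in zip(N, sigma)) / V

# Hypothetical universe (illustrative numbers only).
N, sigma = [100, 300], [4.0, 1.0]
V = 10_000.0  # desired variance of the estimated total

print(n_neyman(N, sigma, V))        # 49.0
print(n_proportional(N, sigma, V))  # 76.0
```

The Neyman allocation never needs a larger sample than proportional allocation for the same variance; the two coincide when all the $\sigma_h$ are equal.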
n' V + ^NhSl
n = '£SNi'p*
Q ’''>/NV (12.28)
Notes
1. In practice, an estimate ph of Ph will have to be used in formulae (12.27)
and (12.28).
2. Estimation of the total number of universe units in a certain category,
namely, NP, requires the same sample size as that for P.
12.7 Examples
Example 12.1
The data on the number of cattle obtained from a recent census are given in
Table 12.1, in the 5 strata according to the total acreage of the farms, along with
the present total number of farms in these strata. The problem is to estimate
the present total number of cattle in the universe and its variance, by taking a
sample of 500 farms.
(a) Determine the allocations of the sample in the different strata according to
the following principles: (i) Neyman allocation, (ii) proportional allocation,
and (iii) allocation proportional to the total number of cattle in the different
strata;
(b) Also compute the expected variance for each of these (United Nations
Manual, Process 4, Example 4, adapted).
(a) The required computations are shown in Table 12.2. Formula (12.10)
is used for the Neyman allocation, and formula (12.12) for the proportional
allocation. For allocation proportional to the stratum totals, the stratum total
$Y'_h = N'_h \bar{y}_h$ for the previous census is used, and the formula (12.13), modified as
$n_h = nY'_h/Y'$, is applied ($Y' = \sum Y'_h$).
Table 12.1: Number of cattle, obtained from a previous census, and the
present total number of farms in each stratum
$$n_h = \frac{500\, N_h s'_h}{\sum N_h s'_h} \qquad (12.10)$$
$$n_h = 500\, N_h/N \qquad (12.12)$$
$$n_h = 500\, Y'_h \Big/ \sum Y'_h \qquad (12.13)$$
(b) The expected variances are computed in Table 12.3. The relevant formula
for any stratum is
where $s'^2_h$ are the variances per unit obtained from the previous census, and $N_h$
and $n_h$ respectively the total and sample number in the present inquiry.
Note that in this case, allocation proportional to the stratum totals will be
almost as efficient as the Neyman allocation. The results of an actual sample,
using proportional allocation, are given in exercise 1, Chapter 10; the actual
computed variance of the estimated total number of cattle was 575,597, as compared
with the expected value of 640,997.
Example 12.2
‘ Nb’2h/nh
Table 12.5: Computation of the optimum allocations from the data of Table
12.4: Stratified srs of families
The required computations are shown in Table 12.5. With Vh given by equation
(12.3), the optimum allocations are given by equation (12.6). There are 6850
families in the first city and 4587 in the second, with a total of 11,437 sample
families. (As a check on the computations, it can be verified that the total cost
= 6850 × Eth. $2.25 + 4587 × Eth. $1.00 = $19,999 or Eth. $20,000).
Further reading
Cochran, sections 5.5-5.12, and chapter 5A; Deming (1950), chapter 6, and
(1960), chapter 20; Hansen et al. (1953), vols. I and II, chapter 5; Hedayat and
Sinha, chapter 9; Kish (1965), chapter 3; Murthy (1967), chapter 7; Sukhatme et
al. (1984), sections 4.4-4.8, and 4.11-4.14.
Exercises
1. In a survey, using a stratified srs with five strata, rough estimates of the
universe units ($N_h$), the standard deviation ($\sigma_h$) and the cost of surveying
one unit ($c_h$, in a certain unit) in the different strata are given in Table
12.6. The total cost is fixed at 10,000 and the overhead cost at 500. Determine
the optimum total sample size and its allocations to different strata
(Chakravarti et al., Illustrative Example 4.1).
2. A survey is designed to estimate the proportion of illiterate persons in
three communities. Rough estimates of the total number of persons and
the proportion illiterate are given in Table 12.7. Assuming a stratified srs,
with the communities as the strata, how would you allocate a total sample
of 2000 persons in the strata so as to estimate the overall proportion of
illiterates?
Table 12.6: Rough estimates of the total number of units, standard deviation,
and cost of surveying one unit in the different strata
Table 12.7: Rough estimates of the total number of persons, and proportion
illiterate in three communities
Community   Total number of persons   Proportion illiterate
I           60,000                    0.4
II          10,000                    0.2
III         30,000                    0.6
CHAPTER 13
Self-weighting Designs
in Stratified Single-stage Sampling
13.1 Introduction
This chapter deals with the problems of making a stratified design
self-weighting. We first consider stratified srs and then stratified varying
probability sampling: the method of making any stratified design self-weighting
at the tabulation stage will also be outlined.
In stratified srs (as also in stratified circular systematic sampling with one
sample), an unbiased estimator y of the universe total Y is
$$y = \sum_{h=1}^{L} N_h \sum_{i=1}^{n_h} y_{hi}/n_h \qquad (13.1)$$
(see also (10.5.d)) where $N_h$ and $n_h$ are the total number of units and the
number of sample units respectively in the hth stratum, and $y_{hi}$ the value
of the study variable for the $i$th selected sample unit in the hth stratum.
The weighting factor (or multiplier) for the $hi$th sample unit is
$$n_h = \frac{n}{N} N_h = \frac{N_h}{w_0} \qquad (13.4)$$
This is the proportional allocation case (see section 10.2, note 2, and
section 12.3.4).
Thus, a stratified srs will be self-weighting only with proportional
allocation. Although the optimum and the Neyman allocations result in more
efficient estimators, these require a prior knowledge of the variability in the
different strata, which may not often be available. In such situations, there
may be some advantage in the proportional allocation.
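The self-weighting property under proportional allocation can be verified with a small Python sketch. It assumes the stratified srs estimator of the total, $\sum N_h \bar{y}_h$, and a constant multiplier $w_0 = N/n$; the function names and the data are hypothetical.

```python
def stratified_total(samples, N):
    """Stratified srs estimator of the total: sum over strata of N_h * mean."""
    return sum(N_h * sum(ys) / len(ys) for ys, N_h in zip(samples, N))

def self_weighting_total(samples, w0):
    """Self-weighting form: one constant multiplier times the sample total."""
    return w0 * sum(sum(ys) for ys in samples)

# Hypothetical design with proportional allocation: N_h = 100 and 200,
# n_h = 2 and 4, so that N_h / n_h = 50 in both strata.
N = [100, 200]
samples = [[1, 2], [3, 4, 5, 6]]
w0 = sum(N) / sum(len(ys) for ys in samples)  # N / n = 50.0

print(stratified_total(samples, N))       # 1050.0
print(self_weighting_total(samples, w0))  # 1050.0
```

With any other allocation the two figures would differ, which is why the simple sample total can be used for tabulation only when the multipliers are equal in all strata.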
With a self-weighting stratified srs (i.e. a stratified srs with proportional
allocations), some unbiased estimators defined in section 10.2 take
the following form:
L L rih
and
L
(13.8)
The estimators of the ratio of the two universe totals (defined in esti
mating equations (10.7.b) and (10.7.d)) are
(13.9)
and
(13.10)
(13.11)
(see also (11.6.d)) where $\pi_{hi}$ is the (initial) probability of selection of the
$i$th universe unit ($i = 1, 2, \ldots, N_h$) in the hth stratum.
In pps sampling $\pi_{hi} = z_{hi}/Z_h$, where $z_{hi}$ is the value of the “size”
variable of the $hi$th universe unit, and $Z_h = \sum_{i=1}^{N_h} z_{hi}$ is the total “size” of
the hth stratum.
The multiplier for the $hi$th sample unit is
If the ratio $y_{hi}/z_{hi}$ can be observed and recorded easily in the field, then
the design will be self-weighting when the ratios $Z_h/n_h$ are the same in all
the strata and equal to $Z/n$. This will be so if $n_h$ is proportional to $Z_h$, i.e.
$$n_h = \frac{n}{Z} Z_h \qquad (13.13)$$
A special case of this is when the strata are made equal with respect to the
total “sizes” and an equal number of sample units is selected with pps from
each stratum (sections 12.3.7 and 12.4).
The above rule is especially helpful in acreage and yield surveys of crops,
when a sample of fields (or farms or plots) is selected with probability
proportional to total (geographical) area in each stratum. If the number of
sample units allocated to a stratum is made proportional to the total
geographical area of the stratum, the design becomes self-weighting as $y_{hi}/z_{hi}$
(the proportion of the area under the crop or the crop-yield per unit area)
can be observed easily (sections 11.4 and 12.3.8).
= (1314)
7=1
$y'_j$ being the value of the study variable of the $j$th unit in the sub-sample
($j = 1, 2, \ldots, n'$).
An unbiased estimator of the variance of y^ is
~ ~ 1) (13.16)
Estimators of ratios and their variance estimators follow from the
fundamental theorems of section 2.7.
Further reading
Murthy (1967), sections 12.2 and 12.5
PART III
MULTI-STAGE
SAMPLING
CHAPTER 14
14.1 Introduction
1. Sampling frames may not be available for all the ultimate observational
units in the universe, and it is extremely laborious and expensive
to prepare such a complete frame. Here, multi-stage sampling is
the only practical method. For example, in a rural household sample
survey, conducted at intercensal periods, the households in rural
areas could be reached after selecting a sample of villages (first-stage
units), after which a list of households within the selected villages only
is prepared, and then selecting a sample of households (second-stage
units) in the selected villages. In a crop survey, villages may be first
selected (first-stage units); and next a list of fields prepared within
the selected villages and a sample of the fields taken (second-stage
units); and finally, a list of plots prepared within the selected fields
and a sample of these plots taken (third-stage units). In this way,
great savings are achieved as sampling frames need be constructed
only for the selected sampling units and not for all the sampling units
in the universe. Moreover in multi-stage sampling, ancillary information
collected on the sampling units while listing the units at each
stage could help in improving the efficiency of sample designs, either
by stratification of the units, or by selecting the sample with probability
proportional to size when the ancillary information is available
before sample selection at that stage, or by using the ratio or regression
method of estimation.
2. Even when suitable sampling frames for the ultimate units are avail
able for the universe, a multi-stage sampling plan may be more con
venient than a single-stage sample of the ultimate units, as the cost
of surveying and supervising such a sample in large-scale surveys can
be very high due to travel, identification, contact, etc. This point
is closely related to the consideration of cluster sampling (Chapter
6). For instance, in a large-scale agricultural survey conducted in a
developed country, although an up-to-date list of farms may be read
ily available from which a simple random sample of farms can be
drawn, the cost of travelling and supervision of work on the widely
scattered farms may be extremely high. Therefore, the procedure to
be adopted would be to try to confine the sample of farms to certain
area segments.
3. Multi-stage sampling can be a convenient means of reducing response
errors and improving sampling efficiency by reducing the intra-class
correlation coefficient observed in natural sampling units, such as
households or villages. Thus, in opinion and marketing research, it
becomes necessary to select only one individual in a sample house
hold in order to avoid conditioned response, and also to spread the
sample over a greater number of sample households because of the
general homogeneity of responses of individuals in a household, even
if the “true” responses of all the members of the household could be
obtained.
Multi-stage Sampling: Introduction 231
Note: Multi-stage sampling is cheaper and operationally easier than srs, but
not more so than single-stage cluster sampling. In terms of sampling
variability, however, multi-stage sampling is generally less efficient than srs
but more efficient than cluster sampling, on the assumption that the total
sample size is fixed. Some of the lost efficiency may be regained by using
ancillary information.
Note: In such a plan, if all the $N$ villages are included in the survey and a
sample of $m_i$ households ($i = 1, 2, \ldots, N$) selected in each village, the
design becomes stratified single-stage (with total number of sample households,
$\sum_{i=1}^{N} m_i$), and the methods of Part II of the book will apply. If in the
sample of $n$ villages, all the $M_i$ households are included in the survey, the
design becomes single-stage cluster sampling (with total number of sample
households, $\sum_{i=1}^{n} M_i$), and the methods of Part I of the book will apply.
Of course, if all the $N$ villages are included in the survey and all the $M_i$
households in all these villages surveyed, the survey is one of complete
enumeration without any sampling (with total number of households,
$\sum_{i=1}^{N} M_i$).
    $u = \sum_{i=1}^{n} u_i / n$    (14.3)
    $r = t/u$    (14.6)
In a multi-stage sample design with $u$ stages ($u > 1$), at the $t$th stage
($t = 1, 2, \ldots, u$), let $n_{12\ldots(t-1)}$ sample units be selected with
replacement out of the total $N_{12\ldots(t-1)}$ units and let the (initial)
probability of selection of the $(12 \ldots t)$th unit be $\pi_{12\ldots t}$. We define
    $f_t = n_{12\ldots(t-1)} \pi_{12\ldots t}$
The general estimating equation for the universe total $Y$ is then
    $y = \sum w_{12\ldots u} y_{12\ldots u}$    (14.10)
where $y_{12\ldots u}$ is the value of the study variable in the $(12 \ldots u)$th
ultimate-stage sample unit, and the factors
    $w_{12\ldots u} = 1/(f_1 f_2 \cdots f_u)$    (14.11)
are the weighting-factors (or the multipliers) for the $(12 \ldots u)$th
ultimate-stage sample units; the sum of the products of these multipliers and
the values of the study variable for all the ultimate-stage sample units
provides the unbiased linear estimator $y$ of the universe total $Y$ in equation
(14.10).
An unbiased estimator of the variance of the sample estimator $y$ in the
general estimating equation (14.10) is
    $s_y^2 = \sum_{i=1}^{n} (y_i^* - y)^2 / n(n-1)$    (14.12)
where
    $y_i^* = n \sum_{2,3,\ldots,u} w_{12\ldots u} y_{12\ldots u}$    (14.13)
the summation being over the second- and subsequent-stage sampling units.
The combined unbiased estimator of the universe total $Y$ from all the
$n$ sample first-stage units is, as we have seen from equation (14.1), the
arithmetic mean
    $y_0^* = \sum_{i=1}^{n} y_i^* / n$    (14.14)
Notes
1. Estimating equation (14.10) applies to single- as well as multi-stage
designs, sample selection being done with equal probability with or without
replacement, or with varying probability with replacement, or systematic
sampling with equal or varying probability.
2. For srs, the $f$s are the sampling fractions at different stages.
3. The above formulae, which will be illustrated in later chapters for two- and
three-stage designs, are adequate for obtaining estimates of universe totals,
means, ratios, etc. of study variables, and also estimates of their variances.
4. The variance of a universe estimator is built up of the variances at
different stages of a multi-stage design. The equations for estimating
variances given above do not show these components separately, and these are
not required for estimating universe totals, means, ratios, etc., with their
respective standard errors. However, the decomposition of the total variance
into the component stage-variances is essential in improving the design of
subsequent sample surveys (Chapter 17).
5. For the estimation of the variance of $y$ by equation (14.12), it is not
necessary that the number of sample units at stages other than the first be
two or more; they must, however, be so if estimates of stage-variances are
required.
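The same results as for single-stage sampling given in section 2.8 apply also
for multi-stage sampling. The $(100 - \alpha)$ per cent confidence limits of the
universe parameter $T$ are set with the sample estimator $t$ and its estimated
standard error $s_t$, thus
    $t \pm t_{\alpha, n-1} s_t$    (14.16)
where $t_{\alpha, n-1}$ is the $\alpha$ per cent point of the $t$-distribution with
$n - 1$ degrees of freedom.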
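As a rough illustration of equation (14.16), the sketch below sets confidence limits for the ratio estimated in Example 15.1 (estimate 47.08 years, standard error 2.66 years, $n = 3$ sample first-stage units). The helper name `confidence_limits` is hypothetical, and the tabulated value 4.303 (the two-sided 5 per cent point of Student's $t$ with 2 degrees of freedom) is supplied by hand from standard tables rather than computed.

```python
def confidence_limits(estimate, std_error, t_value):
    """Return (lower, upper) limits: estimate -/+ t_value * std_error."""
    half_width = t_value * std_error
    return estimate - half_width, estimate + half_width

# Figures from Example 15.1; t_value is an assumption taken from standard
# t-tables (two-sided 5 per cent point, n - 1 = 2 degrees of freedom).
lower, upper = confidence_limits(47.08, 2.66, t_value=4.303)
print(round(lower, 2), round(upper, 2))  # 35.63 58.53
```

The wide interval reflects the small number of sample first-stage units: with only 2 degrees of freedom the $t$ multiplier is more than twice the familiar normal value of 1.96.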
Further reading
Murthy, sections 9.1 and 9.2.
CHAPTER 15
15.1 Introduction
This chapter presents the estimating methods for totals, means, and ratios of
study variables and their variances in a multi-stage sample with simple random
sampling at each stage. The methods for two- and three-stage designs are
illustrated in some detail, and the general procedure for a multi-stage design
is indicated. The estimation of the proportion of units in a category and the
use of ratio estimators are also considered.
15.2 Two-stage srs: Estimation of totals, means, ratios and their variances
(15.1)
(15.2)
Table 15.1: Sampling plan for a two-stage simple random sample with
replacement
Grand total:
NM, N N
(15.3)
N
(15.5)
    $y_i^* = N y_i$    (15.6)
The combined unbiased estimator of Y, obtained from all the n sample
fsu’s, is the simple arithmetic mean (from equation (2.43))
If, however, all the $M_i$ ssu’s in the $i$th sample fsu are not completely
surveyed, but a sample of $m_i$ ssu’s is taken, then the value of the study
variable in the sample fsu, $Y_i$ (or $y_i$ above), has to be estimated on the
basis of these $m_i$ sample ssu’s. Let the value of the study variable in the
$j$th sample ssu ($j = 1, 2, \ldots, m_i$) in the $i$th sample fsu be $y_{ij}$. Then an
unbiased estimator of the total $Y_i$ (or $y_i$ above) from the $ij$th sample ssu
is, from equation (2.42),
J=1 J=1
where
(15.12)
(15.13)
J=1
To obtain the unbiased estimator of the total $Y_i$ in the $i$th sample fsu in
an srs, we multiply this average by the actual total number of ssu’s, $M_i$, to
give
    $y_{i0}^* = M_i \bar{y}_i$    (15.10)
An unbiased estimator of the universe total $Y$ obtained from the $i$th sample
fsu is given in an srs on multiplication of $y_{i0}^*$ by $N$, the total number of
fsu’s, namely
    $y_i^* = N y_{i0}^* = N M_i \bar{y}_i$    (15.12)
and the combined unbiased estimator of $Y$ from all the $n$ sample villages
is the simple arithmetic mean
Multi-stage Simple Random Sampling 241
    $y_0^* = \sum_{i=1}^{n} y_i^*/n = (N/n) \sum_{i=1}^{n} M_i \bar{y}_i$    (15.13)
where $y_{ij}$ is the value of the study variable in the $j$th sample ssu in the
$i$th sample fsu, and
    $w_{ij} = 1/(f_1 f_2) = (N/n)(M_i/m_i)$    (15.17)
Putting this value of $w_{ij}$ in equation (15.16), we get
    $y = (N/n) \sum_{i=1}^{n} (M_i/m_i) \sum_{j=1}^{m_i} y_{ij}$    (15.18)
    $y_i^* = n \sum_{j=1}^{m_i} w_{ij} y_{ij} = (nN M_i/(n m_i)) \sum_{j=1}^{m_i} y_{ij} = (N M_i/m_i) \sum_{j=1}^{m_i} y_{ij}$    (15.19)
Notes
1. If all the first-stage units are known to have the same total number of
ssu’s, i.e. if $M_i = M_0$ say, then the combined unbiased estimator $y_0^*$ in
equation (15.13) becomes
    $y_0^* = (N M_0/n) \sum_{i=1}^{n} \bar{y}_i$    (15.20)
2. If, in addition, the sample number of second-stage units in each sample fsu
is fixed, $m_i = m_0$ say, then the design becomes self-weighting and
    $y_0^* = (N M_0/(n m_0)) \sum_{i=1}^{n} \sum_{j=1}^{m_0} y_{ij}$    (15.21)
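where
    $\sigma_b^2 = \sum_{i=1}^{N} (Y_i - \bar{Y})^2 / N$
is the variance between the fsu’s, and
    $\sigma_{wi}^2 = \sum_{j=1}^{M_i} (Y_{ij} - \bar{Y}_i)^2 / M_i$
is the variance between the ssu’s within the $i$th fsu. $\sigma_{y_0^*}^2$ is
estimated unbiasedly by $s_{y_0^*}^2$, defined by equation (15.14).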
Notes
1. If sampling is simple random without replacement at both stages, $y_0^*$,
defined in equation (15.13), still remains the (combined) unbiased estimator
of $Y$. The sampling variance and its unbiased estimator are given in section
17.2.4.
2. If sampling is simple random with replacement at the first stage and simple
random without replacement at the second stage, the estimator $y_0^*$, defined
in equation (15.13), remains unbiased for $Y$. Its sampling variance is given
in section 17.2.4, note 4, and the unbiased variance estimator is $s_{y_0^*}^2$,
defined in equation (15.14).
(15.25)
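where $x_{ij}$ is the value of the other study variable in the $j$th sample ssu in
the $i$th sample fsu; and the combined unbiased estimator of $X$ from all the
$n$ sample fsu’s is, as in equation (15.13),
    $x_0^* = \sum_{i=1}^{n} x_i^*/n$    (15.26)
    $r = y_0^*/x_0^*$    (15.27)
    $s_r^2 = (s_{y_0^*}^2 + r^2 s_{x_0^*}^2 - 2 r s_{y_0^* x_0^*})/x_0^{*2}$    (15.28)
where
    $s_{y_0^* x_0^*} = \sum_{i=1}^{n} (y_i^* - y_0^*)(x_i^* - x_0^*)/n(n-1)$    (15.29)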
Notes
1. If $M_i = M_0$, then from equations of the type (15.20), the estimator of the
universe ratio $R = Y/X$ becomes
(15.31)
(15.32)
i.e. the average of the unbiased estimators of the fsu totals $Y_i$.
Note the similarity of the structures of the universe value and its
estimator in srs.
An unbiased estimator of the variance of $\bar{y}$ is
(15.33)
(b) Estimation of the mean per second-stage unit. Two situations may
arise: (i) the total number of ssu’s $M_i$ is known for all the $N$ fsu’s, the
total being $\sum_{i=1}^{N} M_i$; (ii) the total number of ssu’s is known only for
the $n$ sample fsu’s, after these $n$ first-stage units are enumerated for the
list of ssu’s.
(i) Unbiased estimator. When the total number of ssu’s in the universe
is known, namely, $\sum_{i=1}^{N} M_i$, then the unbiased estimator of the universe
mean per ssu is
    $\bar{y} = y_0^* / \sum_{i=1}^{N} M_i$    (15.34)
Notes
1. If all the fsu’s are known to have the same number of ssu’s, i.e. if
$M_i = M_0$, then $\bar{y}$ becomes (from equations (15.20) and (15.34)),
    $\bar{y} = y_0^*/(N M_0) = (1/n) \sum_{i=1}^{n} (1/m_i) \sum_{j=1}^{m_i} y_{ij}$    (15.36)
i.e. the simple mean of means (see (iii) of this section).
2. If, in addition, a fixed number $m_0$ of ssu’s is selected in the sample
fsu’s, then from equations (15.21) and (15.34) the design becomes
self-weighting and $\bar{y}$ becomes
Similarly, if $x_{ij}$ denotes the yield of a crop in the $ij$th sample field, then
the average yield per field is estimated unbiasedly by an equation of the type
(15.34) and its variance by an equation of the type (15.35). Similar
considerations apply when the total number of fields in the universe
($\sum_{i=1}^{N} M_i$) is known.
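(ii) Ratio estimator. When the total number of ssu’s is known only
for the $n$ sample fsu’s, the unbiased estimator of the universe number of
ssu’s, $\sum_{i=1}^{N} M_i$, is obtained from the sample. In two-stage srs, the
average number of ssu’s in the sample fsu’s, $\sum_{i=1}^{n} M_i/n$, multiplied by
$N$ gives this estimator:
    $m_0^* = N \sum_{i=1}^{n} M_i/n$    (15.43)
This can also be derived from the estimating equations of section 15.2.3
by putting $y_{ij} = 1$ for the selected sample units. Thus putting $y_{ij} = 1$ in
equation (15.12), we get as the unbiased estimator of the total number of
ssu’s, obtained from the $i$th sample fsu,
    $m_i^* = N M_i m_i/m_i = N M_i$
An unbiased variance estimator of $m_0^*$ is
    $s_{m_0^*}^2 = \sum_{i=1}^{n} (m_i^* - m_0^*)^2/n(n-1) = N^2 \sum_{i=1}^{n} [M_i - (\sum_{i=1}^{n} M_i/n)]^2 / n(n-1)$    (15.44)
The ratio estimator of the mean per ssu is
    $r = y_0^*/m_0^*$    (15.45)
with variance estimator
    $s_r^2 = (s_{y_0^*}^2 + r^2 s_{m_0^*}^2 - 2 r s_{y_0^* m_0^*})/m_0^{*2}$    (15.46)
where
    $s_{y_0^* m_0^*} = \sum_{i=1}^{n} (y_i^* - y_0^*)(m_i^* - m_0^*)/n(n-1)$    (15.47)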
Note: In crop surveys, even when the total geographical area or the total number
of farms in the universe is known, in estimating the proportion of area under a
particular crop, or the average yield per field, estimating equations of the ratio
type, namely (15.27) and (15.45) respectively, may be used (see part (iv) of this
section).
This is a biased estimator, unless all the $M_i$s are the same (see note 1 to
(i) of this section) or the design is self-weighting.
As for single-stage srs (section 2.12), the unbiased estimator of the total of
the values of the study variable in the sub-universe Y' is obtained from the
estimating equation (15.13) by defining
The proportion of such ssu’s with the attribute in the $i$th fsu in the
universe is
    $P_i = M_i'/M_i$    (15.50)
In a two-stage srs (as for single-stage srs, see section 2.13), the estimated
total number of ssu’s in the universe with the attribute and its estimated
variance could be obtained respectively from estimating equations (15.13)
and (15.14) by putting $y_{ij} = 1$ if the sample unit has the attribute, and 0
otherwise.
Thus in an srs if $m_i'$ of the $m_i$ sample ssu’s in the selected $i$th fsu have
the attribute, then the sample proportion
    $p_i = m_i'/m_i$    (15.52)
is an unbiased estimator of $P_i$, and the unbiased estimator of the total
number of ssu’s with the attribute in the universe, obtained from the $i$th
sample fsu, is
    $m_i'^* = N M_i p_i$    (15.53)
If the Mi values are known for all the N fsu’s, then the unbiased esti
mator of the universe proportion P is
(15.56)
(15.57)
    $p = m_0'^*/m_0^* = \sum_{i=1}^{n} M_i p_i / \sum_{i=1}^{n} M_i$    (15.58)
Notes
1. Even if $\sum_{i=1}^{N} M_i$ is known, the use of the estimator given by equation
(15.58) may be preferred to that given in equation (15.56) as the former is
likely to be more efficient.
2. If all the $M_i$ values are known for the universe and are the same, $M_0$,
then $\sum_{i=1}^{N} M_i = N M_0$, and the estimator in equation (15.56) becomes
    $m_0'^*/(N M_0) = \sum_{i=1}^{n} p_i/n = \bar{p}$    (15.59)
i.e. the sample number of units possessing the attribute divided by the
total number of sample units.
    $s_{\bar{p}}^2 = \sum_{i=1}^{n} (p_i - \bar{p})^2/n(n-1)$    (15.61)
is also known, where $z_{ij}$ is the value of the ancillary variable of the $ij$th
universe unit. The unbiased estimator of $Z$ in a two-stage srs is (from
equation (15.13))
    $z_0^* = \sum_{i=1}^{n} z_i^*/n$    (15.64)
where
    $z_i^* = N M_i \sum_{j=1}^{m_i} z_{ij}/m_i$
The ratio estimator of the universe total $Y$, using the ancillary
information, is (from equation (3.6))
where
H = Ho/Zo (15.67)
Example 15.1
For the Rural Household Budget Survey in the Shoa province of Ethiopia, 1966-7,
ten geographical strata were formed comprising a number of sub-divisions;
in each stratum, three sub-divisions were selected with equal probability out
of the total number of sub-divisions; in each selected sub-division, of the
total number of households listed by the enumerators, twelve were selected with
equal probability for the inquiry. Although the sample was not specifically
designed to provide estimates of demographic parameters, the example shows the
method of computation for estimating the average age of the household heads in
one particular stratum and its standard error. Table 15.2 gives the required
data for the stratum, which had a total of eighty sub-divisions.
The required computations are shown in Table 15.3, denoting by $y_{ij}$ the age
of the head of the $ij$th sample household ($i$ = 1, 2, 3 for the sample
sub-division; $j$ = 1, 2, ..., 12 for the sample households); $N$ denotes the
total number of sub-divisions in the stratum (= 80), and $M_i$ and $m_i$ (= 12)
respectively the total and sample number of households in the $i$th sample
sub-division.
From equation (15.13), the combined unbiased estimate of the total of the
ages of the household heads in the stratum is
    $y_0^* = (N/n) \sum_{i=1}^{n} (M_i/m_i) \sum_{j=1}^{m_i} y_{ij}$ = 80 x 18,472.89 years
(shown at the bottom of column 5), and from equation (15.43), the combined
unbiased estimate of the total number of households in the stratum is
    $m_0^* = N \sum_{i=1}^{n} M_i/n$ = 80 x 392.33 households
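Table 15.2: Total and sample number of households and total ages of the
sample household heads in one stratum: Rural Household Budget Survey,
Shoa Province, Ethiopia, 1966-7

Sample         Total number     Number of sample   Total ages of sample
sub-division   of households    households         household heads
$i$            $M_i$            $m_i$              $\sum_j y_{ij}$
1              222              12                 474
2              42               12                 503
3              913              12                 590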
Since
    $s_{y_0^*}^2 = SSy_i^*/n(n-1)$ = 80^2 x 1,071,290,895/6
    $s_{m_0^*}^2 = SSm_i^*/n(n-1)$ = 80^2 x [884,617 - 461,776]/6
                 = 80^2 x 422,841/6,
Table 15.3: Computation of the average age of household heads and its
sampling variance: data of Table 15.2

(1)     (2)      (3)      (4)               (5)                         (6)
$i$     $M_i$    $m_i$    $\sum_j y_{ij}$   $y_i^* = N M_i \bar{y}_i$   $y_i^{*2}$
                                            (80 x)                      (80^2 x)
1       222      12       474               8,769.00                    76,895,361
2       42       12       503               1,760.50                    3,099,360
3       913      12       590               44,889.17                   2,015,040,277
Total                                       55,418.67                   2,095,034,998
Mean ($y_0^*$)                              18,472.89

(1)     (7)                (8)             (9)
$i$     $m_i^* = N M_i$    $m_i^{*2}$      $y_i^* m_i^*$
        (80 x)             (80^2 x)        (80^2 x)
1       222                49,284          1,946,718
2       42                 1,764           73,941
3       913                833,569         40,983,809
Total   1,177              884,617         43,004,468
Mean ($m_0^*$)   392.33
and
    $s_{y_0^* m_0^*} = SPy_i^* m_i^*/n(n-1)$
                     = 80^2 [43,004,468 - 21,742,590]/6
                     = 80^2 x 21,261,878/6.
Noting that $r$ = 47.08, $2r$ = 94.16, and $r^2$ = 2216.5264, we get for $s_r^2$ (from
equation (15.46))
    $s_r^2$ = 80^2 [(1,071,290,895 + 937,238,070 - 2,002,018,432)/6] / [80^2 x (392.33)^2]
           = (6,510,533/6)/153,925.44 = 7.0494 years^2
so that the standard error of estimate of $r$ is $s_r$ = 2.66 years, and the
estimated CV of $r$ is 2.66/47.08 = 0.0565 or 5.65 per cent.
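The computations of Example 15.1 can be retraced with a short script. This is only a sketch under the stated data of Table 15.2; the variable names and the helper `cov` are illustrative choices, not part of the original method description. It forms $y_i^* = N M_i \bar{y}_i$ and $m_i^* = N M_i$ for each sample sub-division, takes their means, and applies the ratio-variance formula of the type (15.46).

```python
from math import sqrt

N = 80                        # total number of sub-divisions in the stratum
M = [222, 42, 913]            # households listed in each sample sub-division
m = [12, 12, 12]              # sample households in each sample sub-division
age_sums = [474, 503, 590]    # total ages of the sample household heads
n = len(M)

# Unbiased estimates from each sample fsu.
y_star = [N * M_i * s / m_i for M_i, s, m_i in zip(M, age_sums, m)]
m_star = [N * M_i for M_i in M]

y0 = sum(y_star) / n          # estimated total of ages
m0 = sum(m_star) / n          # estimated total number of households
r = y0 / m0                   # estimated average age per household head

def cov(a, b):
    """Estimated covariance of the two means: sum of products / n(n-1)."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    k = len(a)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / (k * (k - 1))

var_r = (cov(y_star, y_star) + r ** 2 * cov(m_star, m_star)
         - 2 * r * cov(y_star, m_star)) / m0 ** 2
se_r = sqrt(var_r)

print(round(r, 2))  # 47.08
print(round(se_r, 2), round(100 * se_r / r, 2))
```

Carrying full precision throughout, the standard error comes out at about 2.65 years; the small difference from the hand computation arises from rounding $r$ before squaring, since the three terms in the numerator nearly cancel.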
Example 15.2
In each of the simple random sample of 4 villages selected in Example 2.3 from
the list of 30 villages (Appendix IV), select a simple random sample (with
replacement) of 4 households from the total number of households, and on the
basis of the collected data on the household size, the total daily income, and
the number of adults (over 18 years of age) of the sample households, estimate
for the 30 villages the total number of persons, the total daily income, the
per capita daily income, and the total number and proportion of adults, along
with their standard errors.
The sample data are shown in Table 15.4 and the required computations in
Table 15.5, denoting by $y_{ij}$ the household size, by $x_{ij}$ the daily income, and
by $y'_{ij}$ the number of adults, in the $ij$th sample household ($i$ = 1, 2, 3, 4
for sample villages numbered 15, 18, 19, and 24; $j$ = 1, 2, 3, 4 for the sample
households in the selected villages); the respective means in the $i$th sample
village are denoted by $\bar{y}_i$, $\bar{x}_i$, and $\bar{y}'_i$ (defined by an equation
of the type (15.11)), $N$ = 30 is the total number of villages, and $M_i$ and
$m_i$ (= 4) denote respectively the total and the sample numbers of households
in the $i$th sample village.
From equation (15.13), the combined unbiased estimate of the total number
of persons in the 30 villages is
    $y_0^* = (N/n) \sum_{i=1}^{n} (M_i/m_i) \sum_{j=1}^{m_i} y_{ij}$ = 30 x 108.5 = 3255 persons;
Table 15.4: Size, total daily income, number of adults (18 years or over)
and the agreement by the selected adult to an increase in state taxes for
education in sample households in each of the four sample villages of Example
2.3: two-stage simple random sample

Sample    Household    Size       Total daily    Number of    Agreement of
village   serial no.   $y_{ij}$   income         adults       selected adult
$i$                               $x_{ij}$       $y'_{ij}$    (1 = yes, 0 = no)
1         1            8          $92            4            0
          2            4          41             2            1
          5            6          55             3            0
          14           3          36             2            1
2         13           4          39             2            0
          15           5          58             3            1
          16           6          61             3            0
          21           7          63             4            0
3         5            5          58             2            0
          11           5          50             3            0
          23           3          33             2            0
          24           5          47             2            1
4         3            4          48             2            0
          7            6          70             4            0
          14           6          49             2            1
          15           4          45             2            0
    $s_{y_0^*}^2 = SSy_i^*/n(n-1)$
                 = 30^2 x [47,783.5 - 47,089]/12
                 = 30^2 x 694.5/12
    $y_0'^* = \sum_{i=1}^{n} y_i'^*/n$ = 30 x 56.3125 = 1689
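The two estimates quoted for Example 15.2 can be reproduced from the household data of Table 15.4. The sketch below is illustrative only: the function name `two_stage_total` is hypothetical, and the village totals $M_i$ = 20, 23, 25, 18 are read from the computation table given later for Example 15.3.

```python
N = 30                       # total number of villages
M = [20, 23, 25, 18]         # total households in the four sample villages
sizes = [[8, 4, 6, 3], [4, 5, 6, 7], [5, 5, 3, 5], [4, 6, 6, 4]]   # y_ij
adults = [[4, 2, 3, 2], [2, 3, 3, 4], [2, 3, 2, 2], [2, 4, 2, 2]]  # y'_ij

def two_stage_total(values, M, N):
    """Two-stage srs (with replacement): combined estimate of the total
    and its unbiased variance estimate from the n first-stage estimates."""
    n = len(values)
    # y_i* = N * M_i * (mean of the sample ssu values in the i-th fsu)
    y_star = [N * M_i * sum(v) / len(v) for M_i, v in zip(M, values)]
    y0 = sum(y_star) / n
    var = sum((y - y0) ** 2 for y in y_star) / (n * (n - 1))
    return y0, var

persons, var_persons = two_stage_total(sizes, M, N)
adults_total, var_adults = two_stage_total(adults, M, N)

print(persons, var_persons)   # 3255.0 52087.5
print(adults_total)           # 1689.375
```

The variance 52,087.5 equals 30^2 x 694.5/12, so the estimated standard error of the total number of persons is about 228.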
1=1 j = l t=l
N M, N
(15.73)
Table 15.6: Sampling plan for a three-stage simple random sample with
replacement
Extending the structure of a two-stage srs given in section 15.2.2, the design
will be three-stage srswr if in the $ij$th sample ssu ($i = 1, 2, \ldots, n$;
$j = 1, 2, \ldots, m_i$), $q_{ij}$ sample tsu’s are selected as an srs (and with
replacement) out of the total $Q_{ij}$ tsu’s; the total number of tsu’s in the
sample, i.e. the total sample size, is $\sum_{i=1}^{n} \sum_{j=1}^{m_i} q_{ij}$.
An example of a three-stage design for a crop survey has been given in
section 14.2. Extending the example of a two-stage design for a household
survey given in the same section, the design will be three-stage srs if a
sample of persons (tsu’s) is selected in the sample households (ssu’s) in the
sample villages (fsu’s), sampling at the three stages being simple random.
Or, to take another example, in an urban household inquiry, where all the
towns constitute the urban area, an srs of towns (fsu’s) may be first selected,
then a sample of blocks (ssu’s) in the selected towns, and finally a sample
of households (tsu’s) in the selected town-blocks. The sampling plan of a
three-stage srs is shown in summary form in Table 15.6.
Let $y_{ijk}$ denote the value of the study variable in the $k$th selected tsu
($k = 1, 2, \ldots, q_{ij}$) in the $ij$th sample ssu.
15.6.3 Estimation of the universe total Y and the variance of the sample
estimator
Extending the methods of section 15.2.3 for a two-stage srs, it can be seen
that an unbiased estimator of the universe total $Y$, obtained from the $i$th
sample fsu ($i = 1, 2, \ldots, n$) is
    $y_i^* = (N M_i/m_i) \sum_{j=1}^{m_i} Q_{ij} \bar{y}_{ij}$    (15.77)
where
    $\bar{y}_{ij} = (1/q_{ij}) \sum_{k=1}^{q_{ij}} y_{ijk}$    (15.78)
    $x_i^* = (N M_i/m_i) \sum_{j=1}^{m_i} (Q_{ij}/q_{ij}) \sum_{k=1}^{q_{ij}} x_{ijk} = (N M_i/m_i) \sum_{j=1}^{m_i} Q_{ij} \bar{x}_{ij}$    (15.81)
where $x_{ijk}$ is the value of the study variable in the $ijk$th sample
third-stage unit, and
1 n
(15.83)
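    $r = y_0^*/x_0^*$    (15.84)
    $s_r^2 = (s_{y_0^*}^2 + r^2 s_{x_0^*}^2 - 2 r s_{y_0^* x_0^*})/x_0^{*2}$    (15.85)
where
    $s_{y_0^* x_0^*} = \sum_{i=1}^{n} (y_i^* - y_0^*)(x_i^* - x_0^*)/n(n-1)$
is the unbiased estimator of the covariance of $y_0^*$ and $x_0^*$, from equation
(14.5).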
(15.89)
    $q_0^* = \sum_{i=1}^{n} q_i^*/n$    (15.90)
(15.92)
(15.93)
where
n
where
This estimator is both biased and inconsistent, unless the design is
self-weighting (see note 3).
Notes
1. Comparison of the estimators. Although biased, the ratio-type estimator is
likely to have a smaller sampling variance than the unbiased estimator: the
latter is, of course, applicable only when the numbers of second- and
third-stage units in the universe are known. The unweighted mean of means will
be both biased and inconsistent, unless the design is self-weighting, but the
observation made in section 15.2.6(b-iv) may also be noted.
2. In crop surveys, when the total geographical area ($A$) or the total numbers
of fields ($\sum_{i=1}^{N} M_i$) and plots ($\sum_{i=1}^{N} \sum_{j=1}^{M_i} Q_{ij}$) are
known in the universe, then in a three-stage srs with villages as the
first-stage, fields as the second-stage, and plots as the third-stage units,
the proportion of area under a particular crop in the universe is estimated
unbiasedly by
    $y_0^*/A$    (15.97)
where $y_0^*$ is the unbiased estimator of the area under the crop in the
universe, given by equation (15.79), $y_{ijk}$ denoting the area under the crop in
the $ijk$th sample plot. An unbiased variance estimator of $y_0^*/A$ is
    $s_{y_0^*}^2/A^2$    (15.98)
$s_{y_0^*}^2$, the unbiased variance estimator of $y_0^*$, being defined in equation
(15.80). The average yield per unit area can be estimated similarly; and the
average yield per plot is estimated unbiasedly by equation (15.87), and its
unbiased variance estimator by equation (15.88), $y_{ijk}$ denoting the yield of
the $ijk$th sample plot.
As for two-stage srs, estimators for sub-universes are obtained from the
general results for the universe by defining
    $y'_{ijk} = y_{ijk}$ if the sample unit belongs to the sub-universe; and
    $y'_{ijk} = 0$ otherwise.
As for two-stage srs, the estimated number of tsu’s with the attribute
could be obtained by putting $y_{ijk} = 1$, if the sample unit has the attribute,
and 0 otherwise.
Thus if $q'_{ij}$ of the $q_{ij}$ sample third-stage units in the selected $ij$th ssu
have the attribute, then the unbiased estimator of the total number of tsu’s in
the universe possessing the attribute ($\sum_{i=1}^{N} \sum_{j=1}^{M_i} Q'_{ij}$) from the
$i$th sample fsu is
    $q_i'^* = (N M_i/m_i) \sum_{j=1}^{m_i} Q_{ij} q'_{ij}/q_{ij} = (N M_i/m_i) \sum_{j=1}^{m_i} Q_{ij} p_{ij}$    (15.101)
= - 0 (15.104)
A consistent but generally biased estimator of the universe proportion
$P$ is (from the general estimating equation (14.6))
    $p = q_0'^*/q_0^*$    (15.105)
where $q_0^*$ is defined by equation (15.90).
A variance estimator of $p$ is given by the general estimating equation
(14.7).
Note: If all the $Q_{ij}$ values for the universe are known, then an unbiased
estimator of the universe proportion $P$ is
    $q_0'^* / \sum_{i=1}^{N} \sum_{j=1}^{M_i} Q_{ij}$    (15.106)
However, this estimator is likely to have a much larger sampling variance than
the estimator defined in estimating equation (15.105).
As for two-stage srs (section 15.5), so also for three-stage srs the ratio
method of estimation may be used to increase the efficiency of the
estimators. The estimating procedures are the same as in section 15.5 for
two-stage srs, except that for the unbiased estimators of the study and
the ancillary variables and their estimated variances, formulae of the type
(15.79) and (15.80) would be used.
(1)       (2)          (3)          (4)                        (5)         (6)               (7)           (8)
Sample    Total        Sample       $\sum_j Q_{ij} q'_{ij}$    Col.(4)     $q_i'^*$ =        $q_i'^{*2}$   $q_i'^* y_i'^*$
village   households   households                              / $m_i$     $N M_i$ x
$i$       $M_i$        $m_i$                                               col.(5)
                                                                           (30 x)            (30^2 x)      (30^2 x)
1         20           4            4                          1.00        20.00             400.0000      1100.000
2         23           4            3                          0.75        17.25             297.5625      1190.250
3         25           4            2                          0.50        12.50             156.2500      703.125
4         18           4            2                          0.50        9.00              81.0000       405.000
Total                                                                      58.75             934.8125      3398.375
Mean ($q_0'^*$)                                                            14.6875
Example 15.3
Example 15.2 gives the number of adults in each of the sample households
in the srs of 4 villages out of the total 30 villages. In each of these sample
households, one adult member was further selected at random from the total
number of adults in the household, and asked if he/she agrees to an increase in
state taxes for education. This information is given also in Table 15.4.
Estimate the total number of adults who agree to an increase in state taxes for
education and the proportion they constitute of the total number of adults.
Extending the notation of Example 15.2, let $Q_{ij}$ (= $y'_{ij}$ in Example 15.2)
denote the total number of adults, and $q_{ij}$ (= 1) the number selected for
interview in the $ij$th sample household ($i$ = 1, 2, 3, 4; $j$ = 1, 2, 3, 4). As
only one sample third-stage unit is selected, we can dispense with the subscript
$k$ in $y_{ijk}$ by which we had denoted the value of the study variable for the
$ijk$th third-stage sample unit.
Following the method of section 15.8, we put $q'_{ij}$ = 1, if the selected adult
in the $ij$th sample household agrees to an increase in state taxes for
education, and 0, otherwise. The required computations are shown in Table 15.7.
From estimating equation (15.102), the unbiased estimate of the total number
of adults who agree to an increase in state taxes for education is
    $q_0'^* = \sum_{i=1}^{n} q_i'^*/n$ = 30 x 14.6875 = 441,
with an unbiased variance estimate
    $s_{q_0'^*}^2 = SSq_i'^*/n(n-1)$
                  = 30^2 [934.8125 - 862.8906]/12
                  = 30^2 x 71.9219/12
The estimated proportion of adults who agree is
    $p = q_0'^*/q_0^*$ = (30 x 14.6875)/(30 x 56.3125)
       = 0.2608 or 26.08 per cent
As
    $s_{q_0'^* q_0^*} = SPq_i'^* q_i^*/n(n-1)$ = 30^2 x 90.0156/12
the estimated variance of $p$ is (from equation (14.7))
    $s_p^2 = (s_{q_0'^*}^2 + p^2 s_{q_0^*}^2 - 2p s_{q_0'^* q_0^*})/q_0^{*2}$
           = (30^2/12) [71.9219 + (0.2608)^2 x 290.6719 - 2 x 0.2608 x 90.0156] / (30 x 56.3125)^2
           = 44.7403/(12 x 3171.0977) = 0.0011757
so that the estimated standard error of $p$ is $s_p$ = 0.0343, with estimated CV
of 13.15 per cent.
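The point estimates of Example 15.3 follow directly from the household rows of Table 15.4. The sketch below is an illustration under those data, with hypothetical helper names; it applies the three-stage multiplier $N (M_i/m_i)(Q_{ij}/q_{ij})$ with $q_{ij} = 1$, so that each household contributes $Q_{ij} q'_{ij}$.

```python
N = 30
M = [20, 23, 25, 18]                                   # households per sample village
Q = [[4, 2, 3, 2], [2, 3, 3, 4], [2, 3, 2, 2], [2, 4, 2, 2]]        # adults, Q_ij
agrees = [[0, 1, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]   # q'_ij (q_ij = 1)
n = len(M)

def fsu_estimate(M_i, household_totals):
    """N * M_i * (mean of the estimated household totals in the i-th village)."""
    return N * M_i * sum(household_totals) / len(household_totals)

# Estimated household totals: Q_ij * q'_ij / q_ij for those agreeing, Q_ij for all adults.
q_agree = [fsu_estimate(M_i, [Qij * a for Qij, a in zip(Qs, As)])
           for M_i, Qs, As in zip(M, Q, agrees)]
q_all = [fsu_estimate(M_i, Qs) for M_i, Qs in zip(M, Q)]

total_agree = sum(q_agree) / n     # estimated adults agreeing
total_adults = sum(q_all) / n      # estimated adults
p = total_agree / total_adults

print(total_agree, total_adults)   # 440.625 1689.375
print(round(p, 4))                 # 0.2608
```

Because only one adult is interviewed per household, the answer of that adult is in effect weighted by the number of adults in the household.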
Further reading
Cochran, chapters 10 and 11; Deming (1950), chapter 5A; Hansen et al. (1953),
vols. I and II, chapters 6, 8, 9, and 10; Hedayat and Sinha, sections 7.3 and
7.7; Kish (1965), section 5.3; Murthy (1967), sections 9.3 and 9.10; Singh and
Chaudhury, sections 9.1, 9.2, 9.4, and 9.6; Sukhatme et al. (1984), chapter 8;
Yates, sections 3.8, 6.18, 6.19, and 7.17.
Exercises
1. A hospital has received 1000 bottles of 100 tablets each. A simple random
sample of 6 bottles is taken and from each sample bottle a sample of 20
tablets is taken at random. Given the data in Table 15.8, estimate (a) the
average weight per tablet, (b) the proportion of sub-standard tablets, and
(c) the ratio between the composition of two active substances A and B in
the tablet, with their standard errors (Weber, Example 4.4).
2. Of 53 communes in an area, 14 were selected at random; from each of the
selected communes, of the total number of farms, a sample of farms (1 in
4) was taken also at random. Table 15.9 gives the required information on
the sample, including the total number of cattle in the sample farms in the
selected communes.
(a) Estimate the total number of cattle in the area;
(b) Estimate the average number of cattle per farm, by (i) using the ratio
of the unbiased estimators of the total number of cattle and the total
number of farms; (ii) using the unweighted mean of means; and (iii)
using the additional information that the average number of farms
per commune in the universe is 39.09.
Table 15.8: Total weight, number of sub-standard tablets, and the percentage
composition of two active substances in 20 sample tablets selected from
100 tablets in 6 out of 1000 bottles received by a hospital
Substance A Substance B
1 46 11 88 8.0000 368.0000
2 39 10 114 11.4000 444.6000
3 25 6 96 16.0000 400.0000
4 23 5 82 16.4000 377.2000
5 32 8 83 10.3750 332.0000
6 31 8 207 25.8750 802.1250
7 60 15 208 13.8667 832.0020
8 28 7 73 10.4286 292.0008
9 59 14 195 13.9286 821.7874
10 24 6 73 12.1667 292.0008
11 84 21 191 9.0952 763.9968
12 30 7 79 11.2857 338.5710
13 64 16 226 14.1250 904.0000
14 66 17 166 9.7647 644.4702
Cell 1 2 3 4 5 6 7 8 9 10
Estimate 1 61 38 25 0 71 95 32 50 10 0
Estimate 2 29 81 32 45 46 69 26 39 24 5
Total 90 119 57 45 117 164 58 89 34 5
16.1 Introduction
272 Chapter 16
Table 16.1: Sampling plan for a two-stage pps sample design with replacement
where $w_{ij}$ is the value of another ancillary variable and
$W_i = \sum_{j=1}^{M_i} w_{ij}$ is known.
The universe totals and means are defined as for a two-stage srs in
section 15.2.1.
If $y_{ij}$ is the value of the study variable in the $j$th selected ssu
($j = 1, 2, \ldots, m_i$) in the $i$th selected fsu ($i = 1, 2, \ldots, n$), then from the
general estimating equation (14.13), or an extension from single-stage pps to
two-stage pps in the same manner as was done for srs in section 15.2.3(a), it
will be seen that an unbiased estimator of the universe total $Y$ of the study
variable, obtained from the $i$th sample fsu, is
    $y_i^* = (1/(\pi_i m_i)) \sum_{j=1}^{m_i} y_{ij}/\pi_{ij}$    (16.3)
where $\pi_i$ and $\pi_{ij}$ are the (initial) probabilities of selection of the
$i$th fsu and of the $j$th ssu within it, and the combined unbiased estimator of
the universe total $Y$ from all the $n$ sample fsu’s is
    $y_0^* = \sum_{i=1}^{n} y_i^*/n$    (16.4)
Estimators of the means of study variables, and the ratio of two totals
follow directly from the fundamental theorems of section 14.4.
Multi-stage Varying Probability Sampling 273
Notes
1. Sampling variance of the estimator $y_0^*$. The sampling variance of $y_0^*$ in
ppswr at both stages is
(16.6)
(16.7)
(16.8)
3. If, in the above case, a fixed number mo of ssu’s is selected in each sample
fsu, then estimating equation (16.7) becomes
(16.9)
Table 16.2: Sampling plan for a three-stage pps sample design with replace
ment
    $y_0^* = (1/n) \sum_{i=1}^{n} y_i^*$    (16.13)
Estimators of the means of study variables and the ratio of two totals
follow directly from the fundamental theorems of section 14.4.
Note: As for two-stage pps sampling, so also for three-stage sampling, great
simplifications result if the same ancillary (“size”) variable is used for
selection in all three stages, i.e. if
    $\pi_i = z_i/Z$;   $\pi_{ij} = z_{ij}/z_i$;   $\pi_{ijk} = z_{ijk}/z_{ij}$    (16.15)
(In our notation, $z_i = Z_i$ and $z_{ij} = Z_{ij}$.) This is particularly useful in
crop surveys, as will be seen in section 16.5.
Table 16.3: Sampling plan for a two-stage sample design with pps sampling
at the first-stage and srs at the second-stage
    $y_i^* = M_i \bar{y}_i/\pi_i$    (16.16)
where
    $\bar{y}_i = \sum_{j=1}^{m_i} y_{ij}/m_i$    (16.17)
is the unbiased estimator of the mean $\bar{Y}_i = Y_i/M_i$ in the $i$th sample fsu
(as sampling is simple random), and $\pi_i = z_i/Z$ is the (initial) probability
of selection of the $i$th fsu.
The combined unbiased estimator of $Y$ from all the $n$ sample fsu’s is
the arithmetic mean
    $y_0^* = \sum_{i=1}^{n} y_i^*/n$    (16.18)
Estimators of the ratio of the totals of two study variables, their means,
etc., as also of their variances, follow from the fundamental theorems in
section 14.4, and will be illustrated by an example.
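The estimating steps for this design (pps at the first stage, srs at the second) can be sketched as follows. All figures here are hypothetical and chosen only for easy checking, and the function name `pps_srs_total` is illustrative. Each sample fsu yields $y_i^* = M_i \bar{y}_i/\pi_i$ with $\pi_i = z_i/Z$, and the variance is estimated from the spread of the $y_i^*$ about their mean, as for the other with-replacement designs.

```python
from fractions import Fraction

def pps_srs_total(z, Z, M, samples):
    """Estimate a universe total when fsu's are drawn with probability
    proportional to size z_i (with replacement) and ssu's by srs."""
    n = len(z)
    y_star = []
    for z_i, M_i, y in zip(z, M, samples):
        pi_i = Fraction(z_i, Z)            # initial selection probability of the fsu
        y_bar = Fraction(sum(y), len(y))   # sample mean per ssu in the fsu
        y_star.append(M_i * y_bar / pi_i)  # unbiased estimate of Y from this fsu
    y0 = sum(y_star) / n
    var = sum((y - y0) ** 2 for y in y_star) / (n * (n - 1))
    return y0, var

# Hypothetical data: three sample fsu's, two sample ssu's in each.
total, variance = pps_srs_total(
    z=[100, 250, 50],                  # size measures of the sample fsu's
    Z=1000,                            # total size of the universe
    M=[20, 40, 10],                    # number of ssu's in each sample fsu
    samples=[[5, 7], [6, 6], [4, 8]],  # observed y_ij
)
print(float(total), float(variance))   # 1120.0 6400.0
```

When $M_i$ is roughly proportional to $z_i$, as in these figures, the first-stage estimates $y_i^*$ vary mainly through the sample means $\bar{y}_i$, which is the usual motivation for selecting the first-stage units with pps.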
Example 16.1
In each of the four sample villages selected out of 30 villages with probability
proportional to the previous census population in Example 5.2, select a simple
random sample of 4 households each from the total number of households in the
villages, and on the basis of the collected data on household size, the total daily
income and the number of adults (over 18 years of age) of these sample households
(given in Table 16.4), estimate for the 30 villages the total number of persons,
the total and per capita daily income, and the proportion of adults, along with
their standard errors.
The required computations are shown in Table 16.5, denoting by y_{ij} the
household size, by z_{ij} the daily income, and by y'_{ij} the number of adults in the
jth sample household of the ith sample village (i = 1,2,3,4 for sample villages
numbered 5, 6, 11, and 18 respectively; j = 1,2,3,4). The means in the ith
sample villages are denoted by \bar{y}_i, \bar{z}_i, and \bar{y}'_i respectively, defined by equations of
the type (16.17).
N = 30 is the total number of villages, and M_i and m_i (= 4) denote respectively
the total and the sample numbers of households in the ith sample village.
From equation (16.18), the combined unbiased estimate of the total number
of persons in the 30 villages is y_0^* = \sum_{i=1}^{n} y_i^* / n = 3,349 (shown at the bottom of
column 7 of Table 16.5). An unbiased variance estimate of y_0^* is (from equation
(16.19))
s^2_{y_0^*} = (45,056,019.02 - 44,870,538.61)/12 = 15,456.70
Similarly, the combined unbiased estimate of the total daily income is
z_0^* = \sum_{i=1}^{n} z_i^* / n = $33,548 (shown at the bottom of column 11 of Table 16.5), and the
combined unbiased estimate of the total number of adults is
y_0'^{*} = \sum_{i=1}^{n} y_i'^{*} / n = 1,930
(shown at the bottom of column 15 of Table 16.5) with an unbiased variance
estimate 11,916.32.
The estimated daily income per capita is, from equation (14.6), the estimated
total daily income divided by the estimated total number of persons, or
r = z_0^* / y_0^* = $10.02
As
s_{y_0^* z_0^*} = (450,668,357 - 449,446,581)/12 = 101,814.67
the estimated variance of r is (from equation (14.7))
Table 16.4: Size, total monthly income, number of adults (18 years or
over) and agreement by the selected adult to an increase of state taxes for
education in the srs of the sample of households in each of the four sample
villages of Example 5.2 selected with probability proportional to previous
census population
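Sample    Household    Size     Income    Number of       Agreement q'_{ij}
village   serial no.   y_{ij}   z_{ij}    adults y'_{ij}  (1 = yes, 0 = no)
1         10           4        $37       3               0
1         14           5        53        3               0
1         20           4        40        2               1
1         23           5        48        3               0
2         4            6        52        3               0
2         5            5        42        2               1
2         15           3        35        2               0
2         17           6        60        4               0
3         2            4        45        2               1
3         4            6        69        3               0
3         10           5        46        3               0
3         13           6        57        3               0
4         3            5        60        3               0
4         12           5        43        3               1
4         18           5        52        3               0
4         20           6        65        4               0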
Table 16.5: Computations of the estimated total number of persons, total
and per capita monthly income, number and proportion of adults for a
two-stage design: data of Table 16.4
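y_0^* = \frac{1}{n} \sum_{i=1}^{n} M \bar{y}_i = \frac{M}{n} \sum_{i=1}^{n} \bar{y}_i = M \bar{y} (16.21)
where
\bar{y} = \sum_{i=1}^{n} \bar{y}_i / n (16.22)
which is the simple (unweighted) mean of the \bar{y}_i values, and an unbiased
estimator of the variance of \bar{y} is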
n
(16.24)
Example 16.2
In an area there are 315 schools with a total of 27,215 students. Eight schools
were first selected with probability proportional to the number of students, and
in each selected school, 50 students were selected at random. Table 16.6 gives for
each school the number of students with trachoma and the number with multiple
scars. Estimate (a) the proportion of student with trachoma, and (b) among
those with trachoma, the proportion with multiple scars (Weber, Example 4.5).
Here N = 315 schools, and M = 27,215 students. As the sample of 8 schools
is selected with probability proportional to the number of students (M_i) and in
each selected school, a sample of students is taken at random, the method of
section 16.3.2 will apply here. The estimating equations are further simplified as
a fixed number m_0 = 50 of students is selected in the sample schools.
(a) Defining y_{ij} = 1, if the jth sample student in the ith sample school has
trachoma, and y_{ij} = 0, otherwise (i = 1,2,..., 8; j = 1,2,..., 50), an unbiased
estimate of the proportion of students with trachoma is (from equation (16.23))
\bar{y} = \sum_{i=1}^{n} \bar{y}_i / n = 325/(8 \times 50) = 0.8125
so that the estimated standard error of \bar{y} is s_{\bar{y}} = 0.03945 with CV 4.86 per cent.
(b) Similarly, defining y'_{ij} = 1, if the ijth sample student has trachoma with
multiple scars, and y'_{ij} = 0, otherwise, an unbiased estimate of the total number
of students with multiple scars is (from equation (16.23))
n
Table 16.6: Number of students with trachoma and number with multiple
scars among samples of 50 students each selected at random from each of
the 8 schools selected with probability proportional to the total number of
students
School                 Number with trachoma    Number with multiple scars
1                      40                      3
2                      31                      0
3                      47                      16
4                      41                      8
5                      43                      8
6                      36                      5
7                      39                      2
8                      48                      13
Total                  325                     55
Raw sum of squares     13,421                  591
Raw sum of products    2,426
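The figures quoted in part (a) of Example 16.2 can be checked directly from Table 16.6. The sketch below treats the variance estimator of the unweighted mean as \sum (\bar{y}_i - \bar{y})^2 / n(n - 1), which reproduces the standard error and CV given in the text. For part (b) it uses the usual ratio-variance form (s_{y'}^2 + r^2 s_y^2 - 2 r s_{yy'}) / \bar{y}^2; that this is the form intended by equation (14.7) is an assumption here, and the resulting standard error is not checked against the source.

```python
from math import sqrt

# Table 16.6: counts per school in samples of m0 = 50 students
trachoma = [40, 31, 47, 41, 43, 36, 39, 48]
scars = [3, 0, 16, 8, 8, 5, 2, 13]
m0 = 50
n = len(trachoma)

# (a) proportion with trachoma: unweighted mean of the school proportions
p_i = [t / m0 for t in trachoma]
p_bar = sum(p_i) / n
var_p = sum((p - p_bar) ** 2 for p in p_i) / (n * (n - 1))
se_p = sqrt(var_p)
cv_p = 100 * se_p / p_bar

# (b) among those with trachoma, proportion with multiple scars (a ratio)
q_i = [s / m0 for s in scars]
q_bar = sum(q_i) / n
var_q = sum((q - q_bar) ** 2 for q in q_i) / (n * (n - 1))
cov_pq = sum((p - p_bar) * (q - q_bar) for p, q in zip(p_i, q_i)) / (n * (n - 1))
r = q_bar / p_bar
var_r = (var_q + r ** 2 * var_p - 2 * r * cov_pq) / p_bar ** 2  # assumed form
se_r = sqrt(var_r)

print(round(p_bar, 4), round(se_p, 5), round(cv_p, 2))
print(round(r, 4), round(se_r, 4))
```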
Table 16.7: Sampling plan for a three-stage sample design with pps sampling
at the first-stage and srs at the second- and third-stages
y_i^* = \frac{1}{\pi_i} \frac{M_i}{m_i} \sum_{j=1}^{m_i} \frac{Q_{ij}}{q_{ij}} \sum_{k=1}^{q_{ij}} y_{ijk} (16.25)
where y_{ijk} is the value of the study variable in the kth sample tsu in the
jth sample ssu in the ith sample fsu (i = 1,2,...,n; j = 1,2,...,m_i;
k = 1,2,...,q_{ij}) and \pi_i is the (initial) probability of selection of the ith
first-stage unit in the universe (i = 1,2,...,N).
The combined unbiased estimator of Y from all the n sample fsu’s is
the arithmetic mean
(16.26)
Estimators of the ratio of the totals of the values of two study variables,
their means etc. as also their variances follow from the fundamental
theorems in section 14.4 and will be illustrated with an example.
Example 16.3
In Example 16.1 are given the data on the number of adults in the srs of 4
households in each of the 4 sample villages, selected with probability proportional
to the previous census population. In each of these sample households, one adult
member was further selected at random from the total number of adults in the
household, and asked if he/she agrees to an increase in state taxes for education.
This information is given also in Table 16.4. Estimate the total number of adults
who agree to an increase in state taxes for education and the proportion they
constitute of the total number of adults.
Extending the notation of Example 16.1, let Q_{ij} (= y'_{ij} in Example 16.1)
denote the total number of adults, and q_{ij} (= 1) the number selected for interview
in the ijth sample household (i = 1,2,3,4; j = 1,2,3,4). As only one sample
third-stage unit is selected, we can dispense with the subscript k in y_{ijk} by which
we had denoted the value of the study variable for the ijkth third-stage sample
unit.
We define q'_{ij} = 1, if the selected adult in the ijth sample household agrees
to an increase in state taxes for education, and 0 otherwise.
The computations are shown in Table 16.8.
From equation (16.26), the unbiased estimate of the total number of adults
who agree to an increase in state taxes for education is
q_0'^{*} = \sum_{i=1}^{n} q_i'^{*} / n = 376
(shown at the bottom of column 4 of Table 16.8), where
q_i'^{*} = \frac{M_i}{\pi_i m_i} \sum_{j=1}^{m_i} Q_{ij} q'_{ij}
The estimated proportion of adults who agree is
p' = q_0'^{*} / y_0'^{*} = 376.15/1,929.88 = 0.1949
since
s_{y_0'^{*} q_0'^{*}} = (2,914,705.81 - 2,903,602.92)/12 = 925.24
the estimated variance of p' is
so that the estimated standard error of p' is 0.02013, with estimated CV of 10.33
per cent.
16.5.1 Introduction
As observed in the notes to sections 16.2.1 and 16.2.2, great simplifications
result if the same “size” variable is used in all the stages of a multi-stage
pps sample. This is particularly useful in crop surveys when ancillary information
is available on each of the universe units.
(16.28)
where p_{ij} = y_{ij}/a_{ij} is the proportion of the area under the crop in the ijth
sample field and A the total area in the universe
\bar{p}_i = \sum_{j=1}^{m_0} p_{ij} / m_0 (16.30)
is an unbiased estimator of the proportion of the area under the crop in the
ith sample village; and
\bar{p} = \sum_{i=1}^{n} \bar{p}_i / n = \sum_{i=1}^{n} \sum_{j=1}^{m_0} p_{ij} / n m_0 (16.31)
is the simple (unweighted) average of the sample p_{ij} values and an unbiased
estimator of the proportion of the area under the crop in the universe.
An unbiased variance estimator of p is
n
(16.32)
(16.33)
286 Chapter 16
x_0^* = A \bar{r} (16.34)
where
\bar{r} = \sum_{i=1}^{n} \sum_{j=1}^{m_0} r_{ij} / n m_0 (16.35)
is the simple average of the sample r_{ij} values and an unbiased estimator of
the average yield per unit area in the universe. Unbiased variance estimators
of x_0^* and \bar{r} are given respectively by estimating equations of the types
(16.33) and (16.32).
A consistent but generally biased estimator of the yield rate per unit of
crop area is the ratio
x_0^* / y_0^* = \sum_{i=1}^{n} \sum_{j=1}^{m_0} r_{ij} / \sum_{i=1}^{n} \sum_{j=1}^{m_0} p_{ij} (16.36)
a variance estimator of which is given by the general estimating equation
(14.7).
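A small numerical sketch may make the self-weighting crop estimators concrete. It assumes n sample villages with a fixed number m_0 of sample fields each, selected with probability proportional to area at both stages, so that the estimators reduce to simple averages of the field-level values p_{ij} and r_{ij}. The variance of \bar{p} is taken to be \sum (\bar{p}_i - \bar{p})^2 / n(n - 1), which is an assumption about the form of equation (16.32). The function name and all the figures are hypothetical.

```python
from math import sqrt

def crop_estimates(A, p, r):
    """Self-weighting crop-survey estimates from field-level values.

    A : total geographical area of the universe
    p : p[i][j] = proportion of the area of field j (village i) under the crop
    r : r[i][j] = yield per unit area of field j (village i)
    """
    n, m0 = len(p), len(p[0])
    p_i = [sum(row) / m0 for row in p]       # village means of p_ij
    r_i = [sum(row) / m0 for row in r]       # village means of r_ij
    p_bar = sum(p_i) / n
    r_bar = sum(r_i) / n
    var_p = sum((v - p_bar) ** 2 for v in p_i) / (n * (n - 1))  # assumed form
    return {
        "crop_area": A * p_bar,              # estimated area under the crop
        "total_yield": A * r_bar,            # estimated total yield
        "yield_per_crop_area": r_bar / p_bar,
        "se_p_bar": sqrt(var_p),
    }

# Hypothetical data: 3 sample villages, 2 sample fields each.
p = [[0.5, 0.3], [0.6, 0.4], [0.2, 0.4]]
r = [[1.0, 0.6], [1.2, 0.8], [0.4, 0.8]]
est = crop_estimates(1000.0, p, r)
print(est)
```

The ratio in the third entry is the same as the ratio of the two double sums in (16.36), since the factors A and n m_0 cancel.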
Notes
1. If one crop cut of prescribed size and shape is located in the selected fields
at random and the yield rate obtained from it, then although the design
becomes three-stage, the same estimating procedure as above will apply.
2. In the far less common situation when the crop areas of all the fields in the
universe are known, the villages and fields may be selected with probability
proportional to their respective crop areas, and the total crop yield in the
universe will be estimated by
x_0^* = Y \bar{r}' (16.37)
where
\bar{r}' = \sum_{i=1}^{n} \sum_{j=1}^{m_0} r'_{ij} / n m_0
is an unbiased estimator of the average yield per unit of crop area; r'_{ij} =
x_{ij}/y_{ij} is the yield per unit of crop area in the ijth sample field; and Y is
the total crop area in the universe. Unbiased variance estimators of x_0^* and
\bar{r}' are given respectively by equations of the types (16.33) and (16.32).
Further reading
Cochran, section 11.9; Hansen et al. (1953), vol. I, sections 8.6 and 8.14, vol.
II, section 8.1; Hedayat and Sinha, sections 7.1-7.4 and 7.7; Murthy (1967), sections
9.4, 9.5, 9.8c, 9.9, and 9.10; Singh and Chaudhary, sections 9.8 and 9.10;
Sukhatme et al. (1984), 9.1, 9.2, 9.5-9.7, 9.9, and 9.10; Yates, sections 6.19, 7.17,
and 8.11.
Exercises
1 0.0026 19 5 14
2 0.0098 23 5 82
3 0.0146 31 8 207
4 0.0167 40 10 124
5 0.0187 54 13 113
6 0.0187 54 13 113
7 0.0220 39 10 114
8 0.0249 55 14 242
9 0.0258 46 12 203
10 0.0298 83 20 256
11 0.0362 74 19 272
12 0.0370 70 17 131
13 0.0465 60 15 208
14 0.0465 60 15 208
1 13 3 30
2 15 3 58
3 19 5 14
4 28 7 73
5 39 10 162
6 41 11 88
7 46 12 102
8 46 12 102
9 48 12 203
10 51 13 134
11 59 14 195
12 74 19 272
13 83 20 242
14 83 20 242
Size of Sample
and Allocation to Different Stages
17.1 Introduction
290 Chapter 17
(17.1)
with an unbiased variance estimator
s^2_{y_0^*} = \sum_{i=1}^{n} (y_i^* - y_0^*)^2 / n(n - 1) (17.2)
It is also seen from equations (15.22) and (16.6) that the sampling variance
of y_0^* in both srs and pps sampling (with replacement at both stages)
is of the form
V = \frac{V_1}{n} + \frac{V_2}{n m_0} (17.3)
The value of the study variable of the second-stage unit is considered to
be the sum of two independent parts. One term, associated with the fsu’s,
has the same value for all the ssu’s in an fsu, and varies from one fsu to
another with variance
2 - V)2
Vi №cr62 = N in srs (17.4)
N
» V2
in ppswr (17.5)
where
m,
(17.6)
7=1
V2 =
N __ \2
= in pps (17.9)
Size of Sample and Allocation to Different Stages 291
where \sigma_{wi}^2 is the variance between the ssu’s within the ith fsu, and \bar{Y}_i the
mean for the ith fsu.
Thus, the sample as a whole consists of n independent values of the first
term, and n m_0 independent values of the second term.
Note: The variance estimator of y_0^* in estimating equation (17.2) does not explicitly
take account of the stage-variances; the estimation of the latter is shown in
section 17.2.4.
C = n c_1 + n m_0 c_2 (17.10)
where c_1 is the average cost per fsu and c_2 that per ssu. The overhead cost,
not shown in the above cost function, may later be added to C. In general,
the cost per fsu (c_1) will be greater than that per ssu (c_2).
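m_0 = \sqrt{\frac{c_1 V_2}{c_2 V_1}} (17.11)
and
n = \frac{C \sqrt{V_1/c_1}}{\sqrt{V_1 c_1} + \sqrt{V_2 c_2}} (17.12)
if the cost is preassigned (from equation (17.10)); or
n = \frac{\sqrt{V_1/c_1} \left( \sqrt{V_1 c_1} + \sqrt{V_2 c_2} \right)}{V} (17.13)
if the variance is preassigned (from equation (17.3)).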
The optimum number of ssu’s can also be expressed in terms of the
intra-class correlation coefficient \rho between the second-stage units
m_0 = \sqrt{\frac{c_1}{c_2} \cdot \frac{1 - \rho}{\rho}} (17.14)
Note that this is of the same form as that of (7.37) for the optimal cluster
size.
Equations (17.11) and (17.14) show that a larger sub-sample should be
taken when sub-sampling is relatively inexpensive (c_1 large in relation to
c_2) and the variability between the ssu’s within the fsu’s (V_2) is larger than
that between the fsu’s (V_1), i.e. the fsu’s are internally heterogeneous.
These equations often lead to the same value of m_0 for a wide range of
ratios c_1/c_2. Thus a choice of the sub-sampling fraction can often be made
even when information on relative costs is not too definite.
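The allocation rule can be sketched numerically. The code below assumes the variance function V = V_1/n + V_2/(n m_0) of equation (17.3) and a cost function of the form C = n c_1 + n m_0 c_2 without overheads, and applies the standard optimum m_0 = \sqrt{c_1 V_2 / c_2 V_1}. The stage variances 45 and 409 are those that appear in the arithmetic of Example 17.1; the unit costs and the budget are hypothetical.

```python
from math import sqrt

def variance(V1, V2, n, m0):
    """Sampling variance of the form V1/n + V2/(n*m0)."""
    return V1 / n + V2 / (n * m0)

def optimum_two_stage(V1, V2, c1, c2, C):
    """Optimum m0 and n for a preassigned cost C = n*c1 + n*m0*c2."""
    m0 = sqrt((c1 * V2) / (c2 * V1))
    n = C / (c1 + c2 * m0)
    return m0, n, variance(V1, V2, n, m0)

V1, V2 = 45.0, 409.0            # stage variances as used in Example 17.1
c1, c2, C = 10.0, 1.0, 1000.0   # hypothetical costs and budget
m0, n, V = optimum_two_stage(V1, V2, c1, c2, C)
print(round(m0, 2), round(n, 2), round(V, 4))

# The allocation n = 50, m0 = 7 of Example 17.1 gives 0.9 + 1.1686
print(round(variance(V1, V2, 50, 7), 4))
```

In practice m_0 and n are rounded to whole numbers; as the text notes, the variance is usually flat near the optimum, so rounding costs little.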
1 M' _
J=1
Yii is the value of study variable of the ij'th universe unit (i = 1,2,..., JV;
Y = Y/NM = £M.yi/TVM
5=1
where y_{ij} is the value of the study variable of the ijth sample unit (i =
1,2,..., n; j = 1,2,..., m_i) and \bar{y}_i = \sum_{j=1}^{m_i} y_{ij} / m_i.
Denoting by
2 1 =y
(17.19)
-M- (17.20)
n \MJ mi
An unbiased estimator of V(z/q) in (17.15) is
(1-/1) s2 ■
bwi
(17.22)
n(n — 1)
where y_i^* = N M_i \bar{y}_i is the unbiased estimator of Y from the ith sample fsu
(estimating equation (15.12)).
Notes
1. For the estimator of the mean, the stage-variances are obtained by dividing
the components in equations (17.15), (17.21), and (17.22) by N2M2.
2. When M_i = M_0, a constant, and m_i = m_0 is fixed, the unbiased variance
estimator (equation (17.21)) for the total becomes
= (17.24)
+(17.25)
n nl nmo
4. When sampling is simple random with replacement at the first stage and
simple random without replacement at the second stage, the sampling vari
ance of yo, the (combined) unbiased estimator of Y, is
N
<r2- = N2M2 â + —V M2(l- f2.) (17.32)
n n ¿-J m,
and being defined by equation (17.16) and (17.17). <7^. is estimated
unbiasedly by s2., defined in equation (17.2).
Example 17.1
so that the components of the variance of the mean are (from equation (17.3))
(b) From equation (17.11) the optimum number of ssu’s per sample fsu is
V = \frac{V_1}{n} + \frac{V_2}{n m_0} = \frac{45}{50} + \frac{409}{350} = 0.9 + 1.1686 = 2.0686
Example 17.2
For Example 15.2 relating to a two-stage srs, and assuming sampling without
replacement, estimate the variance of the estimated total number of persons in
the form of expression (17.22). Also obtain an unbiased estimate of S_b^2.
The required computations are shown in summary form in Table 17.2. Taking
expression (17.22), the first term is, from the data of Example 15.2,
\frac{(1 - f_1)}{n(n - 1)} \sum_{i=1}^{n} (y_i^* - y_0^*)^2 = \frac{(1 - 4/30)}{4 \times 3} \times 625,050 = 45,142.67,
the estimated standard error was 228.2, and the estimated CV 7.01 per cent.
Table 17.2: Computation of the stage-variances of the estimated total number of persons in Example 15.2,
assuming sampling without replacement
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
s_b^2 = \frac{625,050}{N^2 M^2 (n - 1)} = \frac{625,050}{416,025 \times 3} = 0.500811
The above procedures can be extended to sample designs with three and
higher stages. Those for a three-stage design are given below; the procedures
for higher-stage designs follow from symmetry.
Assume that the sample at the three stages is drawn with varying probabilities
and with replacement, the respective numbers being n, m_0 and
q_0. Then the universe variance of the estimator y_0^* (estimating equation
(16.13)) is
V = \frac{V_1}{n} + \frac{V_2}{n m_0} + \frac{V_3}{n m_0 q_0} (17.34)
where V_1, V_2 and V_3 are respectively the variations (1) between the fsu’s,
(2) between the ssu’s within the fsu’s, and (3) between the third-stage
units within the ssu’s, for an estimator based on any of the n m_0 q_0 possible
third-stage sample units.
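Assuming the simple cost function (neglecting the overhead costs)
C = n c_1 + n m_0 c_2 + n m_0 q_0 c_3 (17.35)
where c_1 is the average cost per fsu, c_2 that per ssu, and c_3 that per third-stage
unit, the optimum values of m_0 and q_0 are obtained as
m_0 = \sqrt{\frac{c_1 V_2}{c_2 V_1}}, q_0 = \sqrt{\frac{c_2 V_3}{c_3 V_2}} (17.36)
and, for a preassigned total cost C,
n = \frac{C \sqrt{V_1/c_1}}{\sqrt{V_1 c_1} + \sqrt{V_2 c_2} + \sqrt{V_3 c_3}} (17.37)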
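The three-stage allocation can be sketched in the same way as the two-stage case. The code assumes the variance function (17.34) and a linear cost function C = n c_1 + n m_0 c_2 + n m_0 q_0 c_3; the expressions for m_0, q_0 and n are the standard Lagrangian solution for that pair, which is taken here, as an assumption, to be what (17.36) and (17.37) state. All numerical values are hypothetical.

```python
from math import sqrt

def optimum_three_stage(V1, V2, V3, c1, c2, c3, C):
    """Optimum m0, q0 and n for a preassigned cost
    C = n*c1 + n*m0*c2 + n*m0*q0*c3, minimising
    V = V1/n + V2/(n*m0) + V3/(n*m0*q0)."""
    m0 = sqrt((c1 * V2) / (c2 * V1))
    q0 = sqrt((c2 * V3) / (c3 * V2))
    n = C / (c1 + c2 * m0 + c3 * m0 * q0)
    V = V1 / n + V2 / (n * m0) + V3 / (n * m0 * q0)
    return m0, q0, n, V

# Hypothetical stage variances, unit costs and budget
m0, q0, n, V = optimum_three_stage(4.0, 16.0, 64.0, 16.0, 4.0, 1.0, 1000.0)
print(m0, q0, round(n, 3), round(V, 4))
```

With these inputs each stage contributes equally to both the cost per fsu and the variance, which is a convenient check on the formulae.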
Further reading
Cochran, sections 10.6, 10.7, 11.13, and 11.14; Deming (1950), chapter 5A;
Hansen et al. (1953), Vols. I and II, chapter 6; Hedayat and Sinha, sections
7.5 and 7.6; Hendricks, chapter 7; Kish (1965), section 5.6; Murthy (1973), sections
9.3-9.10; Singh and Chaudhary, sections 9.3, 9.4, and 9.7; Sukhatme et al.
(1984), sections 8.3, 8.7, 9.3, and 9.4; Yates, sections 7.17, 8.10.1, and 8.11-8.21.
Exercises
1. In a sample survey for the study of the economic conditions of the population,
it is proposed to take a two-stage sample with villages as the first-stage
units and m_0 households per village as second-stage units, sampling
at both stages to be with equal probability and with replacement.
The variance of an estimate is V = \sigma_b^2/n + \sigma_w^2/n m_0, where \sigma_b^2 = 5.2 and
\sigma_w^2 = 11.2. The cost of the survey is given by C = c_0 + n c_1 + n m_0 c_2, with
c_0 = Rs 600, c_1 = Rs 160, and c_2 = Rs 4. Determine the optimum values
of n and m_0 for C = Rs 10,000 (Chakravarti et al., Illustrative Example
4.2).
2. For Example 15.2, obtain estimates of the stage-variances for the total daily
income, assuming sampling without replacement.
3. To develop the sampling technique for the determination of the sugar percentage
in field experiments on sugar beets, ten beets were chosen from each
of 100 plots in a uniformity trial, the plots being the first-stage units. The
sugar percentage was obtained separately for each beet. Table 17.3 gives
the analysis of variance between plots and between beets within plots.
(a) Estimate directly the estimated variance of the mean, and also sepa
rately its components.
(b) How reliable are the treatment means with 6 replications and 5 beets
per plot? (Snedecor and Cochran, pp. 529-531).
CHAPTER 18
Self-weighting Designs
in Multi-stage Sampling
18.1 Introduction
y = \sum_{\text{sample}} \left( \prod_{t=1}^{u} \frac{1}{f_t} \right) y_{12...u} = \sum_{\text{sample}} w_{12...u} y_{12...u} (18.1)
(see also equation (14.13)) where y_{12...u} is the value of the study variable
in the (12...u)th ultimate stage sample unit, the factor
\prod_{t=1}^{u} \frac{1}{f_t} = w_{12...u} (18.2)
(see also (14.14)) is the multiplier (or the weighting-factor) for the
(12...u)th ultimate-stage unit; and f_t = n_{12...(t-1)} \pi_{12...t}, there being
n_{12...(t-1)} sample units each selected with probability \pi_{12...t} out of the total
N_{12...(t-1)} units at the tth stage.
302 Chapter 18
For example, in a two-stage srs, where f_1 = n/N and f_{2i} = m_i/M_i, the
multiplier for the sample ssu’s is
\frac{1}{f_1 f_{2i}} = \frac{N}{n} \cdot \frac{M_i}{m_i}
and this will be a constant w_0, when
m_i = \frac{N M_i}{w_0 n}
In this case, the sampling fraction at the second stage should be the same
(= N/w_0 n) in all the n sample fsu’s.
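The condition can be illustrated with a short sketch. Given the desired constant multiplier w_0, the second-stage sample sizes are set to m_i = N M_i / (w_0 n), and the resulting multipliers (N/n)(M_i/m_i) are then all equal to w_0. The function names and the numbers are hypothetical, and fractional sample sizes are left unrounded.

```python
def srs_second_stage_sizes(N, n, M, w0):
    """Second-stage sample sizes m_i = N*M_i/(w0*n) for a two-stage srs."""
    return [N * M_i / (w0 * n) for M_i in M]

def srs_multipliers(N, n, M, m):
    """Multipliers 1/(f_1*f_2i) = (N/n)*(M_i/m_i) of the sample ssu's."""
    return [(N / n) * (M_i / m_i) for M_i, m_i in zip(M, m)]

# Hypothetical design: 4 sample fsu's out of 30, desired multiplier 30
N, n, w0 = 30, 4, 30
M = [20, 24, 16, 28]                  # ssu's in the sample fsu's
m = srs_second_stage_sizes(N, n, M, w0)
w = srs_multipliers(N, n, M, m)
print(m, w)
```

Here every sample fsu is sub-sampled at the same rate N/(w_0 n) = 1/4, as the text requires.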
In general, a multi-stage srs will be self-weighting if, in each stage, the
sampling fraction is a constant for selecting the next-stage sample units.
That is, n_{12...t}/N_{12...t} should be a constant; e.g. the second-stage sampling
fraction n_{12}/N_{12} should be a constant, the third-stage sampling fraction
n_{123}/N_{123} another constant, and so on.
We shall further consider the requirement of having a fixed number, say
n_0, of the ultimate stage sample units in each of the selected penultimate
stage units. If these n_0 units are selected with equal probability, then
\pi_{12...(u-1)} \propto N_{12...(u-1)} (18.4)
i.e. the penultimate stage units should be selected with probability proportional
to the total number of ultimate stage units contained in each, and
the sample number of penultimate stage units is given by
(18.5)
where the numerator refers to the “size” of the particular units to be se
lected and the denominator to the total size of all the units, and the design
will be self-weighting when:
Self-weighting Designs in Multi-stage Sampling 303
1. The variate value is the ratio of the characteristic under study y_{12...u}
to the size in the ultimate stage units z_{12...u}; and
(18.7)
Notes
1. We have already seen in section 16.5 how two- and three-stage pps designs
for crop surveys can be made self-weighting by using the same size variable,
such as area, in all stages of sample selection.
2. For fractional values of the sample sizes at different stages, see section 13.4.
3. For making a multi-stage design self-weighting at the tabulating stage,
the same principles as for a stratified sample, given in section 13.5, will
apply, namely, selecting a sub-sample with probability proportional to the
multipliers.
Let us consider the sampling plan of section 16.3, with n fsu’s selected with
pps (\pi_i) out of a total N, and m_i ssu’s selected with equal probability out
of a total M_i units in each selected fsu. Here f_1 = n \pi_i and f_{2i} = m_i/M_i.
The multiplier for the ijth sample ssu (i = 1,2,..., n; j = 1,2,..., m_i) is
1/(f_1 f_{2i}) = M_i / (n \pi_i m_i), which is a constant w_0 when
m_i = M_i / (w_0 n \pi_i) (18.9)
2. Find the overall sampling fraction which is the ratio of the number of
households to be sampled to the expected total number of households
in the universe. The reciprocal of this will be the constant multiplier
w0.
5. Give to the enumerators the values m'_0 and M'_i (the number of households
in the village at the time of the census) with instruction to
select at random (or systematically) m_i = m'_0 M_i / M'_i households out
of the M_i households listed in the ith village.
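The field rule in step 5 can be checked with a short sketch. It assumes that the villages were selected with probability proportional to their census numbers of households, \pi_i = M'_i / M', where M' is the census total for the universe; under that assumption m_i = M_i / (w_0 n \pi_i) reduces to m_i = m'_0 M_i / M'_i with the constant m'_0 = M' / (w_0 n). The function names and all the figures are hypothetical.

```python
def field_sample_sizes(w0, n, census_total, census_hh, listed_hh):
    """Return m'_0 and the sample sizes m_i = m'_0 * M_i / M'_i.

    census_hh[i] : M'_i, households in village i at the census
    listed_hh[i] : M_i, households listed in village i at the survey
    Assumes selection probability pi_i = M'_i / census_total.
    """
    m0_prime = census_total / (w0 * n)
    sizes = [m0_prime * M_i / Mc_i for M_i, Mc_i in zip(listed_hh, census_hh)]
    return m0_prime, sizes

def pps_multipliers(n, census_total, census_hh, listed_hh, sizes):
    """Multipliers M_i / (n * pi_i * m_i) of the sample households."""
    return [M_i / (n * (Mc_i / census_total) * m_i)
            for M_i, Mc_i, m_i in zip(listed_hh, census_hh, sizes)]

# Hypothetical universe of 6000 census households, 4 sample villages
w0, n, census_total = 50, 4, 6000
census_hh = [100, 150, 200, 120]
listed_hh = [110, 150, 190, 132]
m0_prime, sizes = field_sample_sizes(w0, n, census_total, census_hh, listed_hh)
weights = pps_multipliers(n, census_total, census_hh, listed_hh, sizes)
print(m0_prime, sizes, weights)
```

The sample size in a village moves in proportion to the growth or decline in its number of households since the census, which is what keeps the multiplier constant.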
I//1/2/3 = (18.15)
m_i = Q_i / (w_0 n \pi_i) (18.16)
Further reading
Hansen et al. (1953), Vol. I, sections 7.12 and 9.11; Murthy (1967), section 12.3;
Singh and Chaudhary, section 9.9; Som (1958-59) and (1959); Sukhatme et al.
(1984), section 9.8.
PART IV
STRATIFIED MULTI-STAGE
SAMPLING
CHAPTER 19
19.1 Introduction
310 Chapter 19
t_h = \sum_{i=1}^{n_h} t_{hi} / n_h (19.1)
We shall see later in this section how to compute the t_{hi} values for a
stratified multi-stage design.
For another study variable, similarly defined, an unbiased estimator
of the universe parameter
U = \sum_{h=1}^{L} U_h (19.6)
is the sum of the L unbiased stratum estimators u_h of U_h
u = \sum_{h=1}^{L} u_h (19.7)
= (19.8)
s_{tu} = \sum_{h=1}^{L} s_{t_h u_h} (19.10)
r_h = t_h / u_h (19.11)
For all the strata combined, a consistent but generally biased estimator
of the ratio of two universe parameters R = T/U is the ratio of
the respective sample estimators t and u,
r = t/u (19.13)
s_r^2 = (s_t^2 + r^2 s_u^2 - 2 r s_{tu}) / u^2 (19.14)
\hat{\rho}_h = s_{t_h u_h} / s_{t_h} s_{u_h} (19.15)
\hat{\rho} = s_{tu} / s_t s_u (19.16)
Stratified Multi-stage Sampling: Introduction 313
y_h = \sum_{\text{stratum sample}} \left( \prod_{t=1}^{u} \frac{1}{f_{ht}} \right) y_{h12...u}
= \sum_{\text{stratum sample}} w_{h12...u} y_{h12...u} (19.18)
\prod_{t=1}^{u} \frac{1}{f_{ht}} = w_{h12...u} (19.19)
is the multiplier (or the weighting-factor) for the (h12...u)th ultimate
stage sample units; the sum of products of these multipliers and the
values of the study variable for all the ultimate-stage sample units
in the stratum provides the unbiased linear estimator of the universe
total Y_h in equation (19.18).
An unbiased estimator of the variance of the sample estimator
in the general estimating equation (19.18) is
2 _
s Vh. ~
nh(nh ~ 1)
(19.20)
!/M=S (1922)
Notes
1. Setting of confidence limits. The results for stratified single-stage sampling,
given in section 9.4, also hold for stratified multi-stage sampling.
2. Selection of two first-stage units from each stratum. As for stratified single-stage
sampling (section 9.5), so also for stratified multi-stage sampling, the
computation of estimates is simplified considerably when in each stratum
two first-stage units are selected with replacement out of the total number
of units. The results of section 9.5 also hold here, t_{h1} and t_{h2} denoting
the unbiased estimators of the universe parameter T_h in the hth stratum,
obtained from the two sample first-stage units; and similarly for u_{h1} and
u_{h2}.
3. For other notes, see section 9.3, notes (2)-(4) and section 14.4, notes (1)-(5).
Further reading
Foreman, section 8.6.
CHAPTER 20
20.1 Introduction
In this chapter we will consider the estimating method for totals, means,
and ratios of the values of study variables and their variances in a stratified
multi-stage sample with simple random sampling at each stage: the methods
for two- and three-stage designs are illustrated in some detail, and the
procedures for four and higher stages indicated. The methods of estimation
of proportion of units in a category and the use of ratio estimators will also
be considered.
318 Chapter 20
Mh,
(20.1)
;=1
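\bar{Y}_h = Y_h / N_h = \sum_{i=1}^{N_h} M_{hi} \bar{Y}_{hi} / N_h (20.4)
(20.5)
Y = \sum_{h=1}^{L} Y_h = \sum_{h=1}^{L} \sum_{i=1}^{N_h} Y_{hi} = \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} Y_{hij} (20.6)
\bar{Y} = Y/N (20.7)
\bar{\bar{Y}} = Y / \left( \sum_{h=1}^{L} \sum_{i=1}^{N_h} M_{hi} \right) (20.8)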
Stratified Multi-stage Simple Random Sampling 319
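Table 20.1: Sampling plan for a stratified two-stage simple random sample
with replacement. In the hth stratum (h = 1, 2,..., L)
y_{hi}^* = N_h \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} y_{hij} = N_h M_{hi} \bar{y}_{hi} (20.9)
where
\bar{y}_{hi} = \sum_{j=1}^{m_{hi}} y_{hij} / m_{hi} (20.10)
The combined unbiased estimator of Y_h from all the n_h sample fsu’s is (from
equation (15.13))
y_{h0}^* = \sum_{i=1}^{n_h} y_{hi}^* / n_h (20.11)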
For all the strata combined, the unbiased estimator of the total Y is
(from equation (19.4)) the sum of unbiased estimators of the stratum totals
Yh,
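y = \sum_{h=1}^{L} y_{h0}^* = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{i=1}^{n_h} \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} y_{hij} (20.12)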
E mhi
j=l
is the variance between the ssu’s within the ith fsu in the hth stratum.
The sampling variance of y = \sum_{h=1}^{L} y_{h0}^* is
L
а у2
Note: The sampling variances and their unbiased estimators in srs, without replacement
at both stages and with replacement at the first stage and without
replacement at the second stage, follow from section 17.2.4 for any stratum; for
the sampling variance of y for all the strata taken together, the stratum sampling
variances are summed up and an unbiased estimator is obtained on summing up
the unbiased stratum variance estimators.
s^2_{y_{h0}^*} = SSy_{hi}^* / n_h(n_h - 1) (20.17)
(20.20)
s^2_{x_{h0}^*}, an unbiased estimator of the variance of x_{h0}^*, is defined as for s^2_{y_{h0}^*},
given in equation (20.17).
For the whole universe, the unbiased estimator of X is (from equation
(20.12))
x = \sum_{h=1}^{L} x_{h0}^* (20.21)
and s_x^2, the unbiased estimator of the variance of x, is defined as for s_y^2,
given in equation (20.18).
A consistent but generally biased estimator of the ratio of two universe
totals at the stratum level R_h = Y_h/X_h is the ratio of the sample estimators,
r_h = y_{h0}^* / x_{h0}^* (20.22)
with a variance estimator (from equation (19.12))
where
(20.25)
where
L
(20.27)
(a) Estimation of the mean per first-stage unit. The unbiased estimator
of the overall mean per fsu is
\bar{y} = y/N (20.28)
s^2_{\bar{y}} = s_y^2 / N^2 (20.29)
(b) Estimation of the mean per second-stage unit. Two situations will
arise: (i) M_{hi} (the total number of ssu’s) is known for all the fsu’s;
and (ii) M_{hi} is known for the n_h sample fsu’s only.
(i) Unbiased estimator. When Mhi is known for all the fsu’s, the
unbiased estimator of mean per ssu is
z / l Nh \
(20.30)
(20.31)
and
m_{hi}^* = N_h M_{hi} (20.35)
where
s_{ym} = \sum_{h=1}^{L} s_{y_{h0}^* m_{h0}^*}
= \sum_{h=1}^{L} \sum_{i=1}^{n_h} (y_{hi}^* - y_{h0}^*)(m_{hi}^* - m_{h0}^*) / n_h(n_h - 1) (20.38)
is an unbiased covariance estimator of y and m.
See notes to section 15.2.6(b-ii).
(20.41)
The proportion of such ssu’s with the attribute in the ith fsu in the hth
stratum is
P_{hi} = (20.42)
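m' = \sum_{h=1}^{L} m_{h0}'^{*} (20.45)
where
m_{h0}'^{*} = \sum_{i=1}^{n_h} m_{hi}'^{*} / n_h (20.46)
p = m'/m (20.52)
where
s_{mm'} = \sum_{h=1}^{L} s_{m_{h0}^* m_{h0}'^{*}}
= \sum_{h=1}^{L} \sum_{i=1}^{n_h} (m_{hi}^* - m_{h0}^*)(m_{hi}'^{*} - m_{h0}'^{*}) / n_h(n_h - 1) (20.54)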
Example 20.1
In each of the simple random samples of two villages, each from the three states
selected in Example 10.2 from the list of villages in Appendix IV, select five
households at random, given the listing of households in the sample villages. On
the basis of collected data for the sample households on the size, number of adults,
and possession of TV satellite dishes, estimate the total number of persons, the
average household size, the proportion of adults, and the proportion of households
with TV satellite dishes for the three states separately and also combined, along
with their standard errors. The sample data are given in Table 20.2. The use of
the last column will be illustrated in Example 20.2.
As there are two sample first-stage units, namely villages, in each stratum, we
follow the simplified procedure mentioned in note 2 of section 19.3, illustrated also
in Example 10.2 for a stratified single-stage srs. We denote the household’s size
by y_{hij}, the number of adults by x_{hij}, the possession of TV satellite dishes by y'_{hij}
(= 1 for yes; and = 0 for no) in the sample households (h = 1,2,3 for the states;
i = 1,2 for the sample villages; and j = 1,2,3,4,5 for the sample households).
The respective means per sample household in the sample villages are denoted by
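State   Sample    Household    Size      Number of       TV satellite dish     Last column
        village   serial no.   y_{hij}   adults x_{hij}  y'_{hij} (1 = yes,    (see Example 20.2)
                                                         0 = no)
I       1         7            6         4               1                     0
I       1         11           4         2               0                     0
I       1         12           4         2               0                     0
I       1         15           3         1               0                     1
I       1         18           4         2               1                     1
I       2         1            4         1               0                     0
I       2         4            5         2               0                     0
I       2         6            4         2               1                     1
I       2         9            3         2               1                     0
I       2         17           4         3               0                     0
II      1         5            6         3               1                     0
II      1         7            7         4               1                     0
II      1         8            5         3               0                     1
II      1         12           6         4               0                     0
II      1         16           6         2               1                     0
II      2         4            4         2               0                     1
II      2         7            5         3               0                     0
II      2         9            5         3               1                     0
II      2         12           7         3               1                     0
II      2         16           5         3               0                     0
III     1         7            6         3               0                     0
III     1         10           5         3               0                     0
III     1         17           5         2               1                     0
III     1         19           5         3               0                     0
III     1         20           7         4               1                     1
III     2         6            6         4               0                     0
III     2         10           6         3               0                     1
III     2         11           6         4               0                     0
III     2         16           5         3               0                     0
III     2         18           4         2               1                     0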
\bar{y}_{hi}, \bar{x}_{hi}, and \bar{y}'_{hi}, defined by equations of the type (20.10). Unbiased estimates of
the total number of households in the three states separately and combined have
already been obtained in Example 10.2. These, along with the other required
computations, are shown in Tables 20.3-20.5.
The final estimates are given in Table 20.6. The coefficients of variation of
the estimated proportions of households with TV satellite dishes are rather high.
To reduce standard errors, information on items such as this, that do not require
detailed probing, could be collected for all the households in the sample villages
when a list of households is prepared before a sample of households is
drawn for collecting other information. The present computations are merely
illustrative.
where
=M. — (20.57)
The ratio estimator of the total Yjj, using the ancillary information, is,
from equation (15.66),
y_{hR}^* = Z_h y_{h0}^* / z_{h0}^* = Z_h r_h (20.58)
where
State   N_h   n_h   Sample village   M_{hi}   m_{hi}   m_{hi}^* = N_h M_{hi}
I       10    2     1                18       5        180
                    2                22       5        220
        Total                                          400
        Mean (= stratum estimate of total)             200 (m_{h0}^*)
        Difference                                     -40 (d_{mh})
        ½ |Difference| (= estimated standard error)    20
II      11    2     1                20       5        220
                    2                18       5        198
        Total                                          418
        Mean (= stratum estimate of total)             209 (m_{h0}^*)
        Difference                                     22 (d_{mh})
        ½ |Difference| (= estimated standard error)    11
III     9     2     1                21       5        189
                    2                18       5        162
        Total                                          351
        Mean (= stratum estimate of total)             175.5 (m_{h0}^*)
        Difference                                     27 (d_{mh})
        ½ |Difference| (= estimated standard error)    13.5
State   Sample village   \sum_j y_{hij}   \bar{y}_{hi}   y_{hi}^* = N_h M_{hi} \bar{y}_{hi}
I       1                21               4.2            756
        2                20               4.0            880
        Total                                            1636
        Mean (= stratum estimate of total)               818 (y_{h0}^*)
        Difference                                       -156 (d_{yh})
        ½ |Difference| (= estimated standard error)      78 (s_{y_{h0}^*})
II      1                30               6.0            1320
        2                26               5.2            1029.6
        Total                                            2349.6
        Mean (= stratum estimate of total)               1174.8 (y_{h0}^*)
        Difference                                       290.4 (d_{yh})
        ½ |Difference| (= estimated standard error)      145.2 (s_{y_{h0}^*})
III     1                28               5.6            1058.4
        2                27               5.4            874.8
        Total                                            1933.2
        Mean (= stratum estimate of total)               966.6 (y_{h0}^*)
        Difference                                       183.6 (d_{yh})
        ½ |Difference| (= estimated standard error)      91.8 (s_{y_{h0}^*})
State   Sample village   \sum_j x_{hij}   \bar{x}_{hi}   x_{hi}^* = N_h M_{hi} \bar{x}_{hi}
I       1                11               2.2            396
        2                10               2.0            440
        Total                                            836
        Mean (= stratum estimate of total)               418 (x_{h0}^*)
        Difference                                       -44 (d_{xh})
        ½ |Difference| (= estimated standard error)      22 (s_{x_{h0}^*})
II      1                16               3.2            704
        2                14               2.8            554.4
        Total                                            1258.4
        Mean (= stratum estimate of total)               629.2 (x_{h0}^*)
        Difference                                       149.6 (d_{xh})
        ½ |Difference| (= estimated standard error)      74.8 (s_{x_{h0}^*})
III     1                15               3.0            567
        2                16               3.2            518.4
        Total                                            1085.4
        Mean (= stratum estimate of total)               542.7 (x_{h0}^*)
        Difference                                       48.6 (d_{xh})
        ½ |Difference| (= estimated standard error)      24.3 (s_{x_{h0}^*})
State   Sample village   \sum_j y'_{hij}   \bar{y}'_{hi}   m_{hi}'^{*} = N_h M_{hi} \bar{y}'_{hi}
I       1                1                 0.2             36
        2                2                 0.4             88
        Total                                              124
        Mean (= stratum estimate of total)                 62 (m_{h0}'^{*})
        Difference                                         -52 (d_{m'h})
        ½ |Difference| (= estimated standard error)        26 (s_{m_{h0}'^{*}})
II      1                3                 0.6             132
        2                2                 0.4             79.2
        Total                                              211.2
        Mean (= stratum estimate of total)                 105.6 (m_{h0}'^{*})
        Difference                                         52.8 (d_{m'h})
        ½ |Difference| (= estimated standard error)        26.4 (s_{m_{h0}'^{*}})
III     1                2                 0.4             75.6
        2                1                 0.2             32.4
        Total                                              108.0
        Mean (= stratum estimate of total)                 54.0 (m_{h0}'^{*})
        Difference                                         43.2 (d_{m'h})
        ½ |Difference| (= estimated standard error)        21.6 (s_{m_{h0}'^{*}})
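All states combined                   Estimate   Standard error   Variance
Number of households                  584.5      26.52            703.25
Number of persons                     2959.4     188.66           35594.28
Number of adults                      1589.9     81.67            6669.53
Households with TV satellite dishes   221.6      42.89            1839.52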
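The simplified procedure for two sample fsu's per stratum can be traced in a few lines. With n_h = 2, the variance estimator \sum (t_{hi} - t_h)^2 / n_h(n_h - 1) reduces to (difference)^2 / 4, so the standard error is half the absolute difference. The sketch below applies this to the village-level estimates of the number of households in Table 20.3 and reproduces the "All states combined" figures for households; the function name is hypothetical.

```python
from math import sqrt

def stratum_from_two_fsus(t1, t2):
    """Stratum estimate and standard error from two fsu-level estimates."""
    estimate = (t1 + t2) / 2
    std_error = abs(t1 - t2) / 2      # since s^2 = (t1 - t2)^2 / 4 when n_h = 2
    return estimate, std_error

# Village-level estimates N_h * M_hi of the number of households (Table 20.3)
households = {"I": (180, 220), "II": (220, 198), "III": (189, 162)}

results = {state: stratum_from_two_fsus(*pair) for state, pair in households.items()}
total = sum(est for est, _ in results.values())
variance = sum(se ** 2 for _, se in results.values())   # strata are independent
se_total = sqrt(variance)

print(results)
print(total, variance, round(se_total, 2))
```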
Table 20.6: Estimates and standard errors computed from the data of stratified
two-stage srs in Table 20.2
                                   State I   State II   State III   All states
3. Proportion of adults
(a) Estimate                       0.5110    0.5356     0.5615      0.5372
(b) Standard error                 0.02183   0.00253    0.02819     0.01086
(c) Coefficient of variation (%)   4.27      0.47       5.02        2.02
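The proportion-of-adults rows can be reproduced from the stratum estimates and differences printed in the computation tables for persons and adults. The sketch uses the ratio-variance form of equation (19.14), s_r^2 = (s_t^2 + r^2 s_u^2 - 2 r s_{tu}) / u^2, and, for two sample fsu's per stratum, takes variances as (difference)^2 / 4 and covariances as (product of differences) / 4. The differences are entered as printed; the function name is hypothetical.

```python
from math import sqrt

def ratio_and_se(num, den, var_num, var_den, cov):
    """Ratio num/den and its estimated standard error (form of eq. 19.14)."""
    r = num / den
    var_r = (var_num + r ** 2 * var_den - 2 * r * cov) / den ** 2
    return r, sqrt(var_r)

# (stratum estimate, difference between the two fsu estimates), as printed
persons = {"I": (818.0, -156.0), "II": (1174.8, 290.4), "III": (966.6, 183.6)}
adults = {"I": (418.0, -44.0), "II": (629.2, 149.6), "III": (542.7, 48.6)}

by_state = {}
tot = {"y": 0.0, "x": 0.0, "var_y": 0.0, "var_x": 0.0, "cov": 0.0}
for h in persons:
    (y, dy), (x, dx) = persons[h], adults[h]
    var_y, var_x, cov = dy ** 2 / 4, dx ** 2 / 4, dy * dx / 4
    by_state[h] = ratio_and_se(x, y, var_x, var_y, cov)
    for key, val in zip(tot, (y, x, var_y, var_x, cov)):
        tot[key] += val          # stratum totals, variances, covariances add up

overall = ratio_and_se(tot["x"], tot["y"], tot["var_x"], tot["var_y"], tot["cov"])
for h, (r, se) in by_state.items():
    print(h, round(r, 4), round(se, 5))
print("All", round(overall[0], 4), round(overall[1], 5))
```

The very small standard error for State II arises because, with only two fsu's, the numerator of the variance is (d_x - r d_y)^2 / 4, and the two differences happen to be nearly proportional there.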
the stratum ratio estimators y_{hR}^* being given by equation (20.58), the additional
subscript S in y_{RS} standing for separate ratio estimator.
The variance estimator of y_{RS} is the sum of the stratum variance estimators
s^2_{y_{hR}^*}, given by equation (20.60):
s^2_{y_{RS}} = \sum_{h=1}^{L} s^2_{y_{hR}^*} (20.63)
(ii) Combined ratio estimator. The ratio method is applied to the estimators
of the overall totals, thus
and
Example 20.2
For the data of Example 20.1, given the additional information on the previous
census population of all the villages as in Example 10.5, obtain ratio estimates
of the total population for the three states separately and also combined, along
with their standard errors.
Here the ratio method of estimation is applied at the level of the first-stage
units. Unbiased estimates of the previous census population in the three states
along with their standard errors have already been obtained in Example 10.5.
The rest of the computations are shown in Table 20.7 and the final results given
in Table 20.8.
Notes
1. The ratio estimates of the total numbers of households have already been
obtained in Example 10.2.
2. The ratio estimates of population are more efficient than the unbiased estimates, as shown by Table 20.6. No marked difference appears between the separate and combined ratio estimates for the whole universe.
(b) Ratio of ratio estimators of totals of two study variables. Ratio estimators of totals of two study variables may be used in estimating their ratios. However, as noted in section 10.5 for stratified single-stage sampling, for any stratum (and also for all strata if the combined ratio estimators are used), the ratio of the ratio estimators of totals of two study variables becomes the same as the ratio of the corresponding unbiased estimators. For the separate ratio estimators of totals of two study variables, the ratio of the estimators and its variance estimators have been defined in section 10.5 for stratified single-stage srs. The same methods will apply for stratified multi-stage designs, noting that the appropriate estimating formulae have to be used. For an example, see exercise 3 of this chapter.
Table 20.8: Ratio estimates of population and their standard errors computed from the data of the stratified two-stage srs in Table 20.2, using the previous census population of the sample villages (data of Table 20.7)
All states: Estimate, Standard error, CV (%)
Separate ratio estimate 3017.7 110.4 3.66
Combined ratio estimate 3021.1 113.4 3.75
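\bar{Y}_{hi} = Y_{hi}/M_{hi}    (20.72)
Total for the hth stratum:
Y_h = \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} \sum_{k=1}^{Q_{hij}} Y_{hijk} = \sum_{i=1}^{N_h} Y_{hi} = \sum_{i=1}^{N_h} M_{hi} \bar{Y}_{hi}    (20.73)
\bar{Y}_h = Y_h/N_h = \sum_{i=1}^{N_h} M_{hi} \bar{Y}_{hi}/N_h    (20.74)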
Yh
(20.76)
Table 20.9: Sampling plan for a stratified three-stage simple random sample with replacement. In the hth stratum (h = 1, 2, ..., L)
Y = Y/N (20.78)
- / L Nh \
Y = Y// (EE^j (20.79)
Y = Y (20 80)
* _ ni V V
where
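\bar{y}_{hij} = \sum_{k=1}^{q_{hij}} y_{hijk}/q_{hij}    (20.82)
is the simple arithmetic mean of the y_hijk values in the hijth ssu.
A combined unbiased estimator of Y_h is the arithmetic mean
y^*_{h0} = \frac{1}{n_h} \sum_{i=1}^{n_h} y^*_{hi} = \frac{N_h}{n_h} \sum_{i=1}^{n_h} \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} \frac{Q_{hij}}{q_{hij}} \sum_{k=1}^{q_{hij}} y_{hijk}    (20.83)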
For all the strata combined, an unbiased estimator of the total Y is the
sum of the estimators of the stratum totals
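y = \sum_{h=1}^{L} y^*_{h0} = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{i=1}^{n_h} \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} \frac{Q_{hij}}{q_{hij}} \sum_{k=1}^{q_{hij}} y_{hijk}    (20.84)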
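The nested weighting in the three-stage estimator can be illustrated with a short Python sketch. It assumes the estimator has the form N_h (M_hi/m_hi) times the sum over sample ssu's of Q_hij times the sample mean of the tsu values, averaged over the sample fsu's. The function name, the data layout, and all numbers are hypothetical.

```python
def three_stage_stratum_estimate(N_h, fsus):
    """Unbiased estimate of a stratum total under three-stage srs (sketch).

    N_h  : number of fsu's in the stratum
    fsus : list of (M_hi, ssus) for each sample fsu, where ssus is a list of
           (Q_hij, [y_hijk, ...]) for each sample ssu in that fsu
    """
    fsu_estimates = []
    for M_hi, ssus in fsus:
        m_hi = len(ssus)
        # Q_hij times the sample mean estimates the total of the hij-th ssu.
        ssu_totals = sum(Q_hij * sum(ys) / len(ys) for Q_hij, ys in ssus)
        fsu_estimates.append(N_h * M_hi / m_hi * ssu_totals)
    # The combined estimator is the arithmetic mean of the fsu-level estimates.
    return sum(fsu_estimates) / len(fsu_estimates)


# Hypothetical stratum: 10 fsu's, two sampled, five ssu's each, one tsu per ssu.
sample = [
    (18, [(2, [1]), (3, [0]), (1, [1]), (2, [0]), (4, [0])]),
    (22, [(2, [1]), (3, [0]), (2, [0]), (1, [0]), (2, [0])]),
]
print(three_stage_stratum_estimate(10, sample))
```

With 0/1 values for y_hijk, the same function estimates the number of units possessing an attribute.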
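r_h = y^*_{h0}/x^*_{h0}    (20.87)
(20.88)
where
s_{y^*_{h0} x^*_{h0}} = \sum_{i=1}^{n_h} (y^*_{hi} - y^*_{h0})(x^*_{hi} - x^*_{h0})/n_h(n_h - 1)    (20.89)
r = y/x    (20.90)
where
s_{yx} = \sum_{h=1}^{L} s_{y^*_{h0} x^*_{h0}}    (20.92)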
Two situations will arise: (i) the total number of the tsu’s Qhij is known
for all the ssu’s in the universe; and (ii) Qhij is known only for the sample
ssu’s.
(i) Unbiased estimator. When Qhij is known for all the ssu’s in the
universe, an unbiased estimator of the universe mean per tsu Y is
y / \left( \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} Q_{hij} \right)    (20.93)
(ii) Ratio estimator. When Q_hij is known only for the sample ssu’s, an unbiased estimator of the total number of tsu’s in the universe, namely, \sum_h \sum_i \sum_j Q_{hij}, is given by an estimating equation of the type (20.84) by putting y_hijk = 1 for the sample tsu’s, i.e. by
q = \sum_{h=1}^{L} q^*_{h0}    (20.95)
where
q^*_{h0} = \sum_{i=1}^{n_h} q^*_{hi}/n_h
(20.97)
and
s^2_{q^*_{h0}} = \sum_{i=1}^{n_h} (q^*_{hi} - q^*_{h0})^2/n_h(n_h - 1)    (20.99)
r = y/q    (20.100)
where
s_{yq} = \sum_{h=1}^{L} s_{y^*_{h0} q^*_{h0}} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} (y^*_{hi} - y^*_{h0})(q^*_{hi} - q^*_{h0})/n_h(n_h - 1)    (20.102)
where yhij is the sample mean in the hijth ssu (equation (20.82)).
This estimator will be both biased and inconsistent: it will be unbiased
and consistent when the design is self-weighting.
For a comparison of the estimators and other observations, see notes to
section 15.6.5.
The proportion of such tsu’s with the attribute in the hith fsu is
P_{hi} = \sum_{j=1}^{M_{hi}} Q'_{hij} / \sum_{j=1}^{M_{hi}} Q_{hij}    (20.107)
= 52 qh0 (20.110)
q'_hij being the number of tsu’s possessing the attribute out of the sample number q_hij of tsu’s in the jth ssu in the ith fsu in the hth stratum.
If all the Q_hij values are known for the universe, then the unbiased estimator of the universe proportion P (equation (20.109)) is
q' / \left( \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} Q_{hij} \right)    (20.114)
with an unbiased variance estimator
s^2_{q'} / \left( \sum_{h=1}^{L} \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} Q_{hij} \right)^2    (20.115)
Example 20.3
In Table 20.2 are given the number of adults in the sample households in each of
the two sample villages, selected from the three states. In each selected household,
one adult member (18 years or over) was further selected at random from the total
number of adults in the household, and asked if he/she agrees to an increase in
state taxes for education. The information is given also in Table 20.2. Estimate
the total number of adults who agree to an increase in state taxes for education
and the proportion they constitute of the total number of adults in each of the
three states and also in all the states combined, along with their standard errors.
Extending the notation of Example 20.1, let Q_hij (= x_hij in Example 20.1) denote the total number of adults in the ijth sample household in the hth stratum. The number of adults selected for the interview (i.e. the number of third-stage sample units) is q_hij = 1. We can therefore dispense with the subscript k in y_hijk by which we had denoted the value of the study variable for the ijkth third-stage sample unit in the hth stratum.
Following the method of section 20.3.8, we put q'_hij = 1 if the selected adult in the ijth sample household in the hth stratum agrees to an increase in state taxes for education, and q'_hij = 0 otherwise.
From estimating equation (20.113), we have an unbiased estimator of the stratum total of the number of adults who agree to an increase in state taxes for education
q'^*_{hi} = N_h \frac{M_{hi}}{m_{hi}} \sum_{j=1}^{m_{hi}} Q_{hij} q'_{hij}
from the ith first-stage unit (here a village) in the hth stratum. The required computations follow the methods of section 20.3.8, and are shown in Tables 20.10-20.12, and the final results in Table 20.13.
As for a stratified two-stage srs (section 20.2.10) so also for a stratified three-stage srs, the ratio method of estimation may, under favorable conditions, improve the efficiency of estimators.
The procedures are the same as for stratified two-stage srs, given in
section 20.2.10, except that the unbiased estimator of the stratum total Zh
of the ancillary variable will be computed on the basis of the appropriate
estimating equation, namely, an equation of the type (20.83), thus:
(20.119)
I 10 2 1 3 0.6 108
2 2 0.4 88
Total 196
Mean (= stratum estimate of total) 98 (q'*_h0)
Difference 20 (d_q'h)
½ |Difference| (= estimated standard error) 10 (s_q'*h0)
II 11 2 1 3 0.6 132
2 2 0.4 79.2
Total 211.2
Mean (= stratum estimate of total) 105.6 (q'*_h0)
Difference 52.8 (d_q'h)
½ |Difference| (= estimated standard error) 26.4 (s_q'*h0)
Total 248.4
Mean (= stratum estimate of total) 124.2 (q'*_h0)
Difference 54.0 (d_q'h)
½ |Difference| (= estimated standard error) 27.0 (s_q'*h0)
Column headings: State h; p_h = q'*_h0/x*_h0; s_x*h0q'*h0 = ¼ d_xh d_q'h; s²_ph = (s²_q'*h0 + p_h² s²_x*h0 - 2 p_h s_x*h0q'*h0)/x*_h0²
Adults who agree to an increase State I State II State III All states
in state taxes for education combined
(a) Number
Estimate 98.0 105.6 124.2 327.8
Standard error 10.0 26.4 27.0 39.05
Coefficient of variation (%) 10.20 25.00 21.74 11.91
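The "all states combined" column can be checked with a short Python sketch. It assumes, as stated earlier for stratified designs, that strata are sampled independently, so the combined variance is the sum of the stratum variances. The function name is hypothetical.

```python
import math


def combine_strata(estimates, std_errors):
    """Return (combined total, standard error, CV in percent) over independent strata."""
    total = sum(estimates)
    std_error = math.sqrt(sum(s * s for s in std_errors))
    cv_percent = 100.0 * std_error / total
    return total, std_error, cv_percent


# Stratum estimates and standard errors from the table above.
total, se, cv = combine_strata([98.0, 105.6, 124.2], [10.0, 26.4, 27.0])
print(round(total, 1), round(se, 2), round(cv, 2))
```

The standard error and CV agree with the tabulated 39.05 and 11.91 up to rounding in the last digit.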
(20.120)
where z_hijk is the value of the ancillary variable in the hijkth selected third-stage unit.
An unbiased variance estimator of z*_h0 will be given by an estimating equation of the type (20.85).
Example 20.4
For the data of Example 20.3, given the additional information on the previous
census population of all the villages as in Example 20.2, obtain ratio estimates of
the number of adults who agree to an increase in state taxes for education in the
three states separately and also combined, along with their standard errors.
Here the ratio method is applied, as in Example 20.2, at the level of the
first-stage units. The unbiased estimates of the previous census population in
the three states have already been obtained in Example 10.2, along with their
standard errors.
We define
r_{3h} = q'^*_{h0}/z^*_{h0}
as the ratio of the combined unbiased estimator of the stratum number of adults who agree to an increase in state taxes for education, \sum_{i=1}^{N_h} \sum_{j=1}^{M_{hi}} Q'_{hij}, to that of the stratum estimator of the previous census population Z_h. The computations follow the methods of sections 20.3.10 and 20.2.11 and are shown in Table 20.14. The final results are given in Table 20.15.
Note: The ratio estimates obtained are more efficient than the unbiased estimates
of the number of adults who agree to an increase in state taxes for education in
Table 20.13. There is no marked difference between the separate and combined
ratio estimates for the whole universe.
Further reading
Foreman, section 8.6; Hansen et al. (1953), Vols. I and II, chapters 7-10; Kish
(1965), sections 5.6, 6.4, and 6.5; Singh and Chaudhary, section 9.5; Sukhatme et
al. (1984), sections 8.11 and 8.13.
All states:
Separate ratio estimate 334.8 28.45 8.50
Combined ratio estimate 327.7 30.35 9.26
Exercises
1. For a survey on the yield of corn in a district, the villages were grouped
into 10 strata, and from each stratum two villages were selected at random.
From each selected village, two fields were selected at random from all fields
on which corn was grown. In each selected field, a rectangular plot of area
1/160 acre was located at random, and the yield of corn in the plot was
measured in ounces. The sample data are given in Table 20.16, which also
shows for each stratum the area under corn. Estimate for the district the
total yield of corn and its standard error (Chakravarti et al., Exercise 3.6,
adapted).
2. For the same data of Example 20.1, estimate for all the states combined
the average number of adults per household and its standard error.
3. For the data of Example 20.2, estimate the average household size from
the separate ratio estimates of number of persons and households and its
standard error.
CHAPTER 21
21.1 Introduction
21.2.1 General
The general estimating equations for totals of study variables and the ratio
of two totals, as well as those for their respective variances, follow from the
general estimating equations in section 19.3. The cases of stratified two-
and three-stage pps designs are mentioned briefly.
Table 21.1: Sampling plan for a stratified two-stage pps sample design with
replacement. In the hth stratum (h = 1,2,... , L)
The universe totals and means are defined as for a stratified two-stage
srs in section 20.2.1.
If y_hij is the value of the study variable in the jth selected ssu in the ith selected fsu in the hth stratum (i = 1,2,...,n_h; j = 1,2,...,m_hi), then from equation (16.3) or (19.21), an unbiased estimator of the stratum total Y_h (as defined by equation (20.3)) from the ith selected fsu is
y^*_{hi} = \frac{1}{\pi_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} \frac{y_{hij}}{\pi_{hij}} = \frac{Z_h}{z_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} \frac{W_{hi}}{w_{hij}} y_{hij}    (21.3)
and a combined unbiased estimator of Y_h from all the fsu’s is the arithmetic mean
y^*_{h0} = \frac{1}{n_h} \sum_{i=1}^{n_h} y^*_{hi}    (21.4)
Stratified Multi-stage Varying Probability Sampling 357
(21.7)
Estimators of the ratios of totals of two study variables and their variances follow directly from the fundamental theorems of section 19.3.
Notes
1. Sampling variance of the estimator y. From equation (16.6) the sampling
variance in ppswr at both stages is
Table 21.2: Sampling plan for a stratified three-stage pps sample design with replacement. In the hth stratum (h = 1,2,...,L)
j = 1,2,...,m_hi; k = 1,2,...,q_hij), then from the general estimating equation (19.21) or from equation (16.12), an unbiased estimator of the stratum total Y_h, obtained from the ith selected fsu is
y^*_{hi} = \frac{1}{\pi_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} \frac{1}{\pi_{hij} q_{hij}} \sum_{k=1}^{q_{hij}} \frac{y_{hijk}}{\pi_{hijk}} = \frac{Z_h}{z_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} \frac{W_{hi}}{w_{hij} q_{hij}} \sum_{k=1}^{q_{hij}} \frac{V_{hij}}{v_{hijk}} y_{hijk}    (21.10)
(21.11)
with an unbiased variance estimator
where Z_h = \sum_{i=1}^{N_h} z_{hi}, W_hi = \sum_{j=1}^{M_{hi}} w_{hij}, and V_hij = \sum_{k=1}^{Q_{hij}} v_{hijk}.
For the whole universe, an unbiased estimator of the total Y is
y = \sum_{h=1}^{L} y^*_{h0}
with an unbiased variance estimator
(21.14)
Estimators of the ratios of totals of two study variables and their variances follow directly from the fundamental theorems of section 19.3.
Note: As noted in section 16.2.2, great simplifications result if the same size
variable is used for pps sampling in all the three stages.
Table 21.3: Sampling plan for a stratified two-stage design with pps sampling at the first stage and srs at the second stage. In the hth stratum
where y_hij is the value of the study variable in the jth selected ssu (household or field) of the ith selected fsu (village) in the hth stratum, and
\bar{y}_{hi} = \sum_{j=1}^{m_{hi}} y_{hij}/m_{hi}    (21.16)
If M_hi, the number of ssu’s, is known beforehand for all the fsu’s, and the n_h fsu’s in the hth stratum are selected with probability proportional to M_hi, then, as shown in section 16.3.2 for one stratum, the estimating procedure becomes simpler.
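A minimal Python sketch of the computation for this design follows. It assumes the fsu-level estimator has the form M_hi times the sample mean per ssu, divided by the first-stage selection probability π_hi, and that the stratum variance estimator is the sum of squared deviations of the fsu-level estimates divided by n_h(n_h - 1). The function names and the inputs to the first call are hypothetical.

```python
import math


def pps_srs_fsu_estimate(M_hi, pi_hi, y_values):
    """Estimate of the stratum total from one fsu: pps at stage 1, srs at stage 2."""
    y_bar = sum(y_values) / len(y_values)  # sample mean per ssu in the fsu
    return M_hi * y_bar / pi_hi


def stratum_estimate(fsu_estimates):
    """Mean of the fsu-level estimates and its estimated standard error."""
    n_h = len(fsu_estimates)
    y_h0 = sum(fsu_estimates) / n_h
    variance = sum((e - y_h0) ** 2 for e in fsu_estimates) / (n_h * (n_h - 1))
    return y_h0, math.sqrt(variance)


# Hypothetical village: 20 households, selection probability 0.1, five sampled.
print(pps_srs_fsu_estimate(20, 0.1, [2, 2, 3, 2, 2]))
# Two fsu-level estimates of a stratum total.
print(stratum_estimate([495.29, 575.33]))
```

If π_hi is proportional to M_hi, the factor M_hi/π_hi is the same for every fsu in the stratum, which is the simplification the paragraph above refers to.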
Example 21.1
In each of the two sample villages selected in each of the three states in Example 11.1 from the list of villages in Appendix IV with probability proportional to previous census population and with replacement, select five households at random, given the listing of the households in the sample villages. On the basis of the collected data for the sample households on the size and number of adults (persons aged eighteen years or over), estimate the total number of persons, the number and the proportion of adults for the three states separately and also combined, along with their standard errors. The sample data are given in Table 21.4. The use of the last column will be illustrated in Example 21.3.
As there are two sample first-stage units, i.e. villages, in each stratum, we follow, as in Example 20.1, the simplified procedures mentioned in note 2 of section 19.3, illustrated also in Example 11.1 for the stratified single-stage pps sampling. As in Example 20.1, we denote by y_hij the household size and by x_hij the number of adults in the sample households (h = 1,2,3 for the states; i = 1,2 for the sample villages; and j = 1,2,3,4,5 for the sample households): note that in Example 11.1 the total number of households in a sample village was denoted by x_hi, which we denote here by M_hi, and the sample number by m_hi. Unbiased estimates of the total number of households in the three states separately and combined have already been obtained in Example 11.1, along with their standard errors. These, along with the other required computations, are shown in Tables 21.5-21.7, and the final estimates in Table 21.8.
Table 21.4: Size, number of adults (18 years or over), and agreement of the
selected adult to an increase in state taxes for education
1 5 2 0
6 5 2 0
16 4 2 1
18 6 3 0
22 3 2 0
4 5 3 1
8 4 2 0
13 5 3 0
16 4 2 0
18 4 2 0
3 5 3 0
5 6 3 0
7 3 2 0
14 5 2 1
15 3 2 0
4 3 1 0
12 6 4 0
14 7 4 0
16 5 3 1
18 5 2 0
3 5 3 0
5 6 3 0
11 6 4 1
14 5 2 0
15 8 5 0
7 5 2 0
8 6 3 1
12 5 3 0
13 7 3 0
15 8 5 0
† These are the estimated numbers of households in Table 11.2, denoted there by x*_hi. In the present example, x*_hi denotes the estimated numbers of adults, and m*_hi those of households.
Last column: (M_hi/π_hi) x̄_hi
I 1 11 2.2 495.29
2 12 2.4 575.33
Total 1070.62
Mean (= stratum estimate of total) 535.31 (x*_h0)
Difference -80.04 (d_xh)
½ |Difference| (= estimated standard error) 40.02 (s_x*h0)
II 1 12 2.4 564.49
2 14 2.8 620.26
Total 1184.75
Mean (= stratum estimate of total) 592.38 (x*_h0)
Difference -55.77 (d_xh)
½ |Difference| (= estimated standard error) 27.885 (s_x*h0)
Total 1141.57
Mean (= stratum estimate of total) 570.78 (x*_h0)
Difference 59.47 (d_xh)
½ |Difference| (= estimated standard error) 29.735 (s_x*h0)
Table 21.6: Computation of estimated numbers of households, persons and adults: data of Tables
21.4 and 21.5
All states combined 635.77 9.473 89.7460 3203.65 70.443 4962.1779 1698.47 57.126 3263.3438
(m) (s_m) (s²_m) (y) (s_y) (s²_y) (x) (s_x) (s²_x)
Column headings: State h; r_h = y*_h0/m*_h0; s_y*h0m*h0 = ¼ d_yh d_mh; s²_rh = (s²_y*h0 + r_h² s²_m*h0 - 2 r_h s_y*h0m*h0)/m*_h0²; p_h = x*_h0/y*_h0; s_y*h0x*h0 = ¼ d_yh d_xh; s²_ph = (s²_x*h0 + p_h² s²_y*h0 - 2 p_h s_y*h0x*h0)/y*_h0²
Table 21.8: Estimates and standard errors computed from the data of stratified two-stage (pps and srs) design in Table 21.4
1. Number of households†
(a) Estimate 232.4 230.5 172.8 635.8
(b) Standard error 7.30 4.72 3.78 9.47
(c) Coefficient of variation (%) 3.14 2.05 2.18 1.49
2. Number of persons
(a) Estimate 1045.2 1104.4 1054.0 3203.6
(b) Standard error 9.59 69.54 5.86 70.44
(c) Coefficient of variation (%) 0.92 6.30 0.56 2.20
3. Number of adults
(a) Estimate 535.3 592.4 570.8 1698.5
(b) Standard error 40.02 27.88 29.74 57.13
(c) Coefficient of variation (%) 7.48 4.71 5.21 3.36
5. Proportion of adults
(a) Estimate 0.5122 0.5363 0.5416 0.5302
(b) Standard error 0.0336 0.0407 0.0252 0.0195
(c) Coefficient of variation (%) 6.56 7.59 4.65 3.69
Example 21.2
For estimating the total yield of paddy in a district, a stratified two-stage sample
design was adopted, where four villages were selected from each stratum, with
ppswr, the “size” being the geographical area, and four plots were selected from
each sample village circular systematically for ascertaining the yield of paddy.
Using the data given in Table 21.9, estimate unbiasedly the total yield of paddy
in the district and its standard error (Murthy (1967), Problem 9.7).
The required computations are given in Tables 21.10 and 21.11, denoting by y_hij the yield of paddy (in kilograms) in the jth sample plot of the ith sample village of the hth stratum (h = 1,2,3 for the strata, i = 1,2,3,4 for the sample villages, and j = 1,2,3,4 for the sample plots).
The estimated total yield of paddy in the district is 3976 tonnes (1 tonne = 1000 kg). The estimated standard error of estimate of this total is √(118,352) = 344 tonnes, i.e. a CV of 8.65 per cent.
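y^*_{hi} = \frac{M_{hi}}{\pi_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} \frac{Q_{hij}}{q_{hij}} \sum_{k=1}^{q_{hij}} y_{hijk} = \frac{M_{hi}}{\pi_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} Q_{hij} \bar{y}_{hij}    (21.19)
where y_hijk is the value of the study variable in the kth selected tsu of the jth selected ssu of the ith selected fsu in the hth stratum (h = 1,2,...,L;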
Total 5588
Mean (= stratum estimate of total) 1397 (y*_h0)
Total 2700
Mean (= stratum estimate of total) 675 (y*_h0)
Total 7614
Mean (= stratum estimate of total) 1903.5 (y*_h0)
Stratum h0
*
V ssy
* h, = A =
h col. (3) - SSy
* hJ\2
col. (4)
Table 21.12: Sampling plan for a three-stage design with pps sampling at the first stage, and srs at the second and third stages. In the hth stratum (h = 1,2,...,L)
1 First-stage (village) n_h ppswr
2 Second-stage (household or field) m_hi srswr Equal = 1/M_hi
3 Third-stage (person or plot) Q_hij q_hij srswr Equal = 1/Q_hij Q_hij/q_hij
\bar{y}_{hij} = \sum_{k=1}^{q_{hij}} y_{hijk}/q_{hij}    (21.20)
Example 21.3
In Table 21.4 are given the number of adults in the sample households, randomly selected out of the total number of households in each of the two sample villages, selected with pps from the three states. In each selected household, one adult member (eighteen years or over) was further selected at random from the total number of adults in the household and asked if he/she agrees to an increase in state taxes for education. The information is given also in Table 21.4. Estimate the total number of adults who agree to an increase in state taxes for education and also the proportion they constitute of the total number of adults in each of the three states and in all the states combined, along with their standard errors.
Extending the notation of Example 21.1, let Q_hij (= x_hij in Example 21.1) denote the total number of adults in the ijth sample household in the hth stratum. The number of adults selected for interview (i.e. the number of third-stage sample units) is q_hij = 1. We can therefore dispense with the subscript k in y_hijk by which we had denoted the value of the study variable for the ijkth third-stage sample unit in the hth stratum.
Similar to the procedures adopted in Examples 16.2 and 20.3, we put q'_hij = 1 if the selected adult in the ijth sample household in the hth stratum agrees to an increase in state taxes for education, and 0 otherwise.
From estimating equation (21.19), we have an unbiased estimator of the stratum total Q'_h of the number of adults who agree to an increase in state taxes for education
q'^*_{hi} = \frac{M_{hi}}{\pi_{hi} m_{hi}} \sum_{j=1}^{m_{hi}} Q_{hij} q'_{hij}
from the ith first-stage unit (i.e. village) in the hth stratum. The required computations are shown in Tables 21.13-21.15, and the final results in Table 21.16.
Example 21.4
For the household inquiries in the Indian National Sample Survey (1953-4) in the
rural sector, the design was stratified three-stage. The total number of 2522 tehsils
were grouped into 240 strata on the basis of consumer expenditure (as estimated
in earlier surveys) and geographical continuity, such that each stratum contained
approximately equal population, as in the census of 1951. In each stratum, two
tehsils and in each selected tehsil, two sample villages, were selected, sampling at
both stages being with probability proportional to the 1951 Census population
or area and with replacement; within each selected village, nine households on
an average were selected by the enumerators systematically with a random start
from the lists of households in the village, which they had prepared on reaching
the villages. The sample comprised 8235 households and 49,177 persons.
With the previous notations, the reader will see that an unbiased estimator of the stratum total Y_h in the hth stratum (h = 1,2,...,240), obtained from the ith (i = 1,2) first-stage unit is (from equation (21.19))
y^*_{hi} = \frac{1}{2 \pi_{hi}} \sum_{j=1}^{2} \frac{Q_{hij}}{\pi_{hij} q_{hij}} \sum_{k=1}^{q_{hij}} y_{hijk}
where y_hijk is the value of the study variable in the kth selected third-stage unit of the jth selected ssu of the ith selected fsu in the hth stratum. A combined unbiased estimator of Y_h is
y^*_{h0} = \tfrac{1}{2} (y^*_{h1} + y^*_{h2})
h=l
Columns: (1) State h; (2) sample village i; (3) \sum_j Q_{hij} q'_{hij}; (4) \sum_j Q_{hij} q'_{hij}/m_{hi}; (5) q'*_hi = (M_hi/π_hi) × col. (4)
I 1 2 0.4 90.05
2 3 0.6 143.83
Total 233.88
Mean (= Stratum estimate of total) 116.94 (q'*_h0)
Difference -53.78
½ |Difference| (= Estimated standard error) 26.89 (s_q'*h0)
II 1 2 0.4 94.08
2 3 0.6 135.46
Total 229.54
Mean (= Stratum estimate of total) 114.77 (q'*_h0)
Difference -41.38
½ |Difference| (= Estimated standard error) 20.69 (s_q'*h0)
Total 242.75
Mean (= Stratum estimate of total) 121.38 (q'*_h0)
Difference 39.85
½ |Difference| (= Estimated standard error) 19.925 (s_q'*h0)
Column headings: State h; p_h = q'*_h0/x*_h0; s_x*h0q'*h0 = ¼ d_xh d_q'h; s²_ph = (s²_q'*h0 + p_h² s²_x*h0 - 2 p_h s_x*h0q'*h0)/x*_h0²
1. Number
(a) Estimate 116.9 114.8 121.4 353.1
(b) Standard error 26.9 20.7 19.9 39.35
(c) Coefficient of variation (%) 22.99 18.03 16.42 11.14
Table 21.17: Estimates of vital rates per 1000 persons: Indian National
Sample Survey, rural sector, 1953-4
Estimate Standard CV
error (%)
results of a later survey, see Example 10.3. The other point to note is that the
reported birth, death, and sickness rates were obvious under-estimates, as could
be seen from available external evidence. One method of obtaining adjusted birth
and death rates from such defective data is described briefly in section 25.6.5.
21.5.1 Introduction
The estimating procedures in crop surveys are greatly simplified if the sampling units at each stage in a multi-stage design are selected with probability proportional to area (section 16.5). The extension to stratified multi-stage design is straightforward and will be illustrated with a stratified two-stage design.
x^*_{h0} = A_h \bar{r}_h    (21.26)
where
\bar{r}_h = \sum_{i=1}^{n_0} \sum_{j=1}^{m_0} r_{hij}/n_0 m_0    (21.27)
is an unbiased estimator of the average yield per unit area in the stratum.
An unbiased estimator of the total yield in the whole universe is
x = \sum_{h=1}^{L} x^*_{h0}    (21.28)
Further reading
Foreman, section 8.6; Hansen et al. (1953), Vols I and II, chapters 7-10; Kish
(1965), chapter 7; Sukhatme et al. (1984), sections 9.7, 9.9-9.10.
Exercises
Table 21.18: Summary data on household size and per household consumption of cereals in sample villages
22.1 Introduction
The procedures for the allocation of the total sample to the different strata and stages in a stratified multi-stage design follow from those for a stratified design (Chapter 12) and a multi-stage design (Chapter 17). The total sample size is fixed in general by the availability of resources, especially the available number of trained enumerators and the number of sample units that can be surveyed by the enumerators during the survey period.
Extending the notations in sections 12.3 and 17.2, for a stratified two-stage design (and sampling with replacement), and a fixed number m_0h of sample ssu’s in each selected fsu, the variance of the unbiased estimator of a universe total can be expressed as
V = \sum_h \frac{1}{n_h} \left( V_{1h} + \frac{V_{2h}}{m_{0h}} \right)    (22.1)
where V_1h is the variability between the fsu’s, V_2h that between the ssu’s within the fsu’s, and n_h the sample number of fsu’s in the hth stratum (h = 1,2,...,L).
If the cost of travel between the fsu’s within a stratum is small and is not taken into account, the following cost function may be adopted:
C = \sum_h n_h c_{1h} + \sum_h n_h m_{0h} c_{2h}    (22.2)
where c_1h is the cost per fsu and c_2h the cost per ssu in the hth stratum.
m_{0h} = \sqrt{V_{2h} c_{1h}/(V_{1h} c_{2h})}    (22.3)
n_h \propto N_h \sqrt{V_{1h}/c_{1h}}    (22.4)
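The trade-off behind these expressions can be sketched numerically. The Python below is illustrative only: it assumes the variance has the form of the sum over strata of (V_1h + V_2h/m_0h)/n_h and the cost function is the sum of n_h c_1h + n_h m_0h c_2h, and it evaluates the square-root rule for m_0h. All function names and numbers are hypothetical.

```python
import math


def optimum_ssus_per_fsu(V1h, V2h, c1h, c2h):
    """Square-root rule for the number of sample ssu's per selected fsu."""
    return math.sqrt(V2h * c1h / (V1h * c2h))


def total_cost(n, m0, c1, c2):
    """Field cost summed over strata: n_h*c_1h + n_h*m_0h*c_2h."""
    return sum(nh * c1h + nh * m0h * c2h
               for nh, m0h, c1h, c2h in zip(n, m0, c1, c2))


def variance_of_total(n, m0, V1, V2):
    """Assumed variance form: sum over strata of (V_1h + V_2h/m_0h) / n_h."""
    return sum((V1h + V2h / m0h) / nh
               for nh, m0h, V1h, V2h in zip(n, m0, V1, V2))


# Hypothetical stratum: within-fsu variability 25 times the between-fsu
# variability, and an fsu costing 16 times as much as an ssu.
print(optimum_ssus_per_fsu(V1h=4.0, V2h=100.0, c1h=16.0, c2h=1.0))
print(total_cost([2], [20], [16.0], [1.0]))
print(variance_of_total([2], [20], [4.0], [100.0]))
```

The rule takes more ssu's per fsu when fsu's are expensive to reach or when most of the variability lies within fsu's.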
It can be shown that a stratified two-stage srs will be more efficient than
an unstratified two-stage srs, with the same total sample size, when the
variation of the stratum means is large.
To combine the efficiencies of both stratification and multi-stage sampling, the strata have to be made heterogeneous with respect to each other, with the fsu’s within a stratum internally homogeneous (see sections 12.3.2 and 17.2.3). A small number of sample fsu’s in each stratum will then provide an efficient sample. When strata are formed to contain an approximately equal sum of the sizes of the fsu’s in each, two additional advantages ensue: first, an equal allocation of sample size per stratum, and second, achievement of a self-weighting design.
Further reading
Cochran, section 10.10; Foreman, section 8.6; Hansen et al. (1953), Vols. I and
II, chapters 7 and 9; Kish (1965), chapter 8; Singh and Chaudhary, section 9.5;
Sukhatme et al. (1984), section 8.12.
CHAPTER 23
Self-weighting Designs in
Stratified Multi-stage Sampling
23.1 Introduction
(23.1)
sample 'i-1
(see also equation (19.18)) where y_h12...u is the value of the study variable in the (12...u)th ultimate stage unit in the hth stratum, and the factor
(23.2)
(see also (19.19)) is the multiplier for the (12...u)th ultimate stage sample unit in the hth stratum, and f_ht = n_h12...(t-1) π_h12...t, there being n_h12...(t-1) sample units each selected with probability π_h12...t out of the total N_h12...(t-1) units at the tth stage.
The procedure for an unstratified two-stage design with pps selection at the
first stage and srs (or systematic selection) at the second stage, given in
section 18.3, can be extended readily to a stratified design. The multiplier
for the ijth sample ssu in the hth stratum is
w_{hij} = \frac{1}{f_{h1} f_{h2}} = \frac{M_{hi}}{n_h \pi_{hi} m_{hi}}    (23.4)
and the design will be self-weighting with a constant multiplier w_0, when
m_{hi} = M_{hi}/(w_0 n_h \pi_{hi})    (23.5)
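The self-weighting condition can be checked with a small Python sketch. It assumes the multiplier has the form M_hi/(n_h π_hi m_hi); the function names and numbers are hypothetical. Choosing m_hi from the condition above should return the target multiplier w_0 for every sample fsu.

```python
def self_weighting_ssu_count(M_hi, n_h, pi_hi, w0):
    """Number of ssu's to sample in an fsu so that the multiplier equals w0."""
    return M_hi / (w0 * n_h * pi_hi)


def multiplier(M_hi, n_h, pi_hi, m_hi):
    """Multiplier of a sample ssu: pps at the first stage, srs at the second."""
    return M_hi / (n_h * pi_hi * m_hi)


# Hypothetical village: 120 households, selected with probability 0.05,
# two sample villages in the stratum, target multiplier 100.
m_hi = self_weighting_ssu_count(120, 2, 0.05, 100)
print(m_hi, multiplier(120, 2, 0.05, m_hi))
```

In practice m_hi must be rounded to a whole number, so the multiplier is only approximately constant.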
In a household inquiry, if the number of households to be sampled in each village is to remain a constant m_0, then with a self-weighting design,
m_0 = M_{hi}/(n_h \pi_{hi} w_0)
or
n_h = M_h/(w_0 m_0)    (23.8)
w_0 = M/(n m_0)
= total number of households/number of sample households
= 1/sampling fraction    (23.9)
2. Divide the whole universe into L strata by using some suitable criteria.
6. Give to the enumerators the values m_0 and M'_hi (the number of households in the villages in the census) with instruction to select at random (or systematically) m_hi = m_0 M_hi/M'_hi households out of the M_hi households listed in the ith village in the hth stratum.
Extending the procedures of section 18.4, it can be seen that with the selection of one individual from a household, the sampling design can be made self-weighting if the households are selected with probability proportional to the respective sizes.
Expanding the example given in section 18.4, if households are selected with probability proportional to size (number of adults in this case), the multiplier for the adult in the ijth sample household in the hth stratum becomes
Q_{hi}/(n_h \pi_{hi} m_{hi})    (23.10)
With simple random sampling, and assuming that M_hi = M_h0, and m_hi = m_h0, it can be shown that the design will be self-weighting when n_h m_h0/(N_h M_h0) = f_0 is a constant in all strata. On the other hand, the optimum allocation gives
While the cost per ssu (c_2h) may be approximately equal in fsu’s of different sizes (i.e. with different total number of ssu’s), the variation among the ssu’s (as measured by V_2h) may be greater in the larger fsu’s. A self-weighting design will not thus be necessarily optimum. But as the theoretically optimum design is not generally attainable while the optimum values are generally broad, a self-weighting design will often be as efficient as the optimum.
As the above formulation assumes the same field cost function for both the optimum and the self-weighting designs, while the latter would entail considerably less cost for tabulation, there is an added justification in trying to achieve a self-weighting design.
Further reading
Cochran, section 10.10; Foreman, section 8.6; Hansen et al. (1953), Vol. I,
sections 7.12 and 9.11; Murthy (1967), section 12.3; Singh and Chaudhary, section
9.9; Som, 1958-59, 1959; Sukhatme et al. (1984), section 9.8.
PART V
MISCELLANEOUS
TOPICS
CHAPTER 24
24.1 Introduction
Notes
1. For selecting the second-phase sample, the first-phase sample serves as the
universe, but since the first-phase information is based on a sample, it is
itself subject to sampling errors, and this must be taken into account in
the estimating procedure.
2. Multi-phase sampling is structurally different from multi-stage sampling:
in the former the same sampling units are used throughout, but in the latter
a hierarchy of sampling units is used. The two may of course be combined.
3. An example of double sampling is found in the survey of level of living
among rural Africans in Cameroon, 1961-5. For the demographic sample
survey, which was conducted first, the design was stratified single-stage, a
systematic sample of villages (fsu’s) being taken in each stratum. For the
socio-economic inquiry, a sub-sample of the original sample fsu’s was taken with
probability proportional to the existing number of households, and in each
such selected unit a fixed number of households (second-stage units) were
selected with equal probability and without replacement. For obtaining
estimates of totals and means, the ratio method of estimation was used.
Miscellaneous Sampling Topics 391
\bar{y}' = \sum_{i=1}^{n'} y'_i/n'    (24.1)
\sigma^2_{\bar{y}'} = \sigma^2_y/n'    (24.2)
n1
(24.3)
where x̄ and x̄' are the sample means of the ancillary variable, obtained respectively from the first phase of n units and the second phase of n' units, and β' is an estimator of β, the regression coefficient of y on x, based on the second phase sample.
The variance of this regression estimator is approximately
\sigma^2_y \rho^2/n + \sigma^2_y (1 - \rho^2)/n'    (24.5)
where ρ is the universe correlation coefficient between the study and the ancillary variables.
Clearly, the variance of the regression estimator will be less than that of the unbiased estimator based on an srs of n' units (namely, σ²_y/n'), unless
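where $c_1$ and $c_2$ are the respective costs per unit of the original sample and
the sub-sample. Then for a fixed cost C, the optimum values of n and n'
are respectively
$n = C\rho / \left[ c_1 \rho + \sqrt{c_1 c_2 (1 - \rho^2)} \right]$ (24.7)
and
$n' = (C - n c_1)/c_2 = C \sqrt{1 - \rho^2} / \left[ c_2 \sqrt{1 - \rho^2} + \rho \sqrt{c_1 c_2} \right]$ (24.8)
where $\rho > 0$.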
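The allocation can be checked numerically. The following is a minimal sketch, not the book's own computation: it assumes the linear cost function $C = c_1 n + c_2 n'$ and the approximate variance $\rho^2 \sigma^2_y / n + (1 - \rho^2) \sigma^2_y / n'$, and minimizes the latter subject to the former. The function names are hypothetical. Under the same assumptions, equating the optimum variance to that of a single-phase sample of size $C/c_2$ gives the break-even correlation $\rho = 2\sqrt{c_1 c_2}/(c_1 + c_2)$, which the last function computes.

```python
import math


def optimum_double_sample(C, c1, c2, rho):
    """Return (n, n_prime) minimising rho^2/n + (1 - rho^2)/n_prime
    subject to the assumed cost constraint c1*n + c2*n_prime = C."""
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    root = math.sqrt(c1 * c2 * (1 - rho ** 2))
    n = C * rho / (c1 * rho + root)   # first-phase size
    n_prime = (C - n * c1) / c2       # second-phase size from the remaining budget
    return n, n_prime


def relative_variance(n, n_prime, rho):
    """Variance of the regression estimator in units of sigma_y^2."""
    return rho ** 2 / n + (1 - rho ** 2) / n_prime


def break_even_rho(c1, c2):
    """Correlation at which optimum double sampling and a single-phase
    sample of size C/c2 have the same variance (same assumptions)."""
    return 2 * math.sqrt(c1 * c2) / (c1 + c2)


if __name__ == "__main__":
    n, n_prime = optimum_double_sample(C=1000, c1=1, c2=10, rho=0.9)
    print(n, n_prime, relative_variance(n, n_prime, 0.9))
    print(break_even_rho(1, 10))
```

With a cost ratio $c_2/c_1 = 10$ the break-even correlation under these assumptions is about 0.575; below that value the single-phase sample is the more precise design for the same budget.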
In the comparison of single-phase and double sampling, it may be noted
that if a single-phase sample has to provide the same information as a
double sample, the cost per unit in the former would also be c2 , so that
with the given total cost C, the size of such a single-phase sample is
$C/c_2 = n c_1/c_2 + n'$ (24.10)
and the value of $\rho$ at which the variance of the optimum double sample
becomes equal to that of an equivalent single-phase sample is
(24.11)
$\rho$ must be considerably large for the survey design to benefit from a double
sample. For example, if $c_2/c_1 = 10$, i.e. the unit cost of collecting data
on y is ten times that for observing x, then the two variances are equal
$\beta' = SP_{y'x'} / SS_{x'}$
and a variance estimator of $\bar{y}'_{Reg}$ is
(24.13)
where $s_{y'}$, $s_{x'}$, and $\rho' = \beta' s_{x'} / s_{y'}$ are computed from the sub-sample.
Example 24.1
Estimate the average number of cattle per farm and its standard error using
the second-phase sample data (United Nations, Manual, Example 30; also see
Exercise 3 in Chapter 3 of this book).
Here $\bar{y}'$ = 25,751/2,055 = 12.5309; $\bar{x}'$ = 62,989/2,055 = 30.6516 acres;
$s_{y'}$ = 11.551; $s_{x'}$ = 22.143; and $\beta'$ = 0.354551.
The regression estimate of the average number of cattle per farm in double
sampling is (from equation (24.4)) $\bar{y}'_{Reg}$ = 12.7431.
As $\rho' = \beta' s_{x'} / s_{y'}$ = 0.679666, the variance estimator of $\bar{y}'_{Reg}$ is (from equation
(24.13)) 0.03573, so that the estimated standard error of $\bar{y}'_{Reg}$ is 0.1891, with
estimated CV of 1.48 percent.
If no account is taken of the information in the original sample, then the
unbiased estimate of the average number of cattle per farm is $\bar{y}'$ = 12.5309, with
an estimated standard error of 0.2548 and CV of 2.03 percent.
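The arithmetic of Example 24.1 can be re-traced with a few lines of code. This is only a checking sketch: the first-phase mean and sample size are not quoted in the example, so the regression estimate itself and the variance 0.03573 are taken as given rather than recomputed, and the variable names are hypothetical.

```python
import math

n_sub = 2055                      # second-phase (sub-sample) size
y_bar = 25751 / n_sub             # mean number of cattle per farm
x_bar = 62989 / n_sub             # mean acreage per farm
s_y, s_x = 11.551, 22.143         # sub-sample standard deviations
beta = 0.354551                   # regression coefficient from the sub-sample

# Correlation implied by the regression coefficient
rho = beta * s_x / s_y

# Unbiased estimator that ignores the first-phase information
se_unbiased = s_y / math.sqrt(n_sub)
cv_unbiased = 100 * se_unbiased / y_bar

# Regression estimator: variance and estimate quoted in the example
var_reg = 0.03573
y_reg = 12.7431
se_reg = math.sqrt(var_reg)
cv_reg = 100 * se_reg / y_reg

print(round(rho, 6), round(se_unbiased, 4), round(cv_unbiased, 2))
print(round(se_reg, 4), round(cv_reg, 2))
```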
Note: In double sampling, the ratio method of estimation may also be used, i.e.
$\bar{y}'_R = (\bar{y}' / \bar{x}') \bar{x}$
but this estimator does not have any general superiority over the regression estimator,
except that it is simpler to compute; it may however be better than the
unbiased estimator.
See also the notes to sections 3.2 and 3.3.
Further reading
24.3.1 Introduction
2. the average of the mean values for all the occasions combined
The methods of estimation for the above will be illustrated for sampling
on two occasions with srs. Owing to the generally high positive correlation
between the values on two occasions, it is better to retain the same sample
to estimate the change in the mean values, and to draw a new sample to
estimate the average for both the occasions combined. For estimating the
mean for the second occasion, the same initial sample or a new sample of
the same size would give equally precise estimators, but a more efficient
scheme would be to replace a part of the sample on the second occasion
and to use the double sampling method with the regression estimator.
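The recommendations above can be illustrated numerically. The sketch below is not taken from the text: it assumes the same variance $\sigma^2$ on both occasions, samples of n units on each occasion drawn with replacement, and a correlation $\rho$ between the two occasion values of the same unit. The function name is hypothetical.

```python
def two_occasion_variances(sigma2, n, rho):
    """Sampling variances of the change and of the average over two
    occasions, for independent samples and for the same sample retained.
    Assumes equal variance sigma2 on both occasions and srs with replacement."""
    return {
        # Var(y2_bar - y1_bar)
        "change_independent": 2 * sigma2 / n,
        "change_same_sample": 2 * sigma2 * (1 - rho) / n,
        # Var((y1_bar + y2_bar) / 2)
        "average_independent": sigma2 / (2 * n),
        "average_same_sample": sigma2 * (1 + rho) / (2 * n),
    }


if __name__ == "__main__":
    for name, value in two_occasion_variances(sigma2=100, n=50, rho=0.8).items():
        print(name, value)
```

With a positive correlation, retaining the sample reduces the variance of the estimated change by the factor $(1 - \rho)$, while for the combined average it inflates the variance by $(1 + \rho)$, which is why a fresh sample is preferred there.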
$\bar{y}_2 - \bar{y}_1$ (24.15)
If the samples are independent, then the sampling variance of the estimator
$(\bar{y}_2 - \bar{y}_1)$ in equation (24.15) is
If, however, the same sample is used on both the occasions, the sampling
variance of the estimator in equation (24.15) is
However, for the matched part, a more efficient estimator than the unbiased
estimator $\bar{y}_{2m}$ is obtained by using the double sampling method with
the regression estimator, i.e. from equation (24.4),
Notes
Panel surveys by Kasprzyk et al.; Kish (1965), section 12.5; Moser and Kalton,
section 6.5; and Sudman, sections 9.11-9.17.
$N^* = N_1 n / n_1$ (24.26)
$y^*_R = Z y / z$ (24.27)
If $y_i = 1$ for all i and $z_i = 1$ for the marked units and $z_i = 0$ for the
other units, then Y = N; y = n; Z = $N_1$; z = $n_1$, and from formula (24.27)
we obtain as an estimator of the total number of units in the universe
$N^* = Z n / z = N_1 n / n_1$
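As a small illustration of this estimator, the sketch below computes Zn/z from the number of marked units in the universe (Z), the size of the later sample (n), and the number of marked units found in that sample (z). The function name and the figures are hypothetical.

```python
def capture_recapture_total(marked_in_universe, sample_size, marked_in_sample):
    """Estimate the universe size as Z * n / z.

    marked_in_universe: Z, units marked and released earlier
    sample_size:        n, units observed in the later sample
    marked_in_sample:   z, marked units among the n observed
    """
    if marked_in_sample <= 0:
        raise ValueError("at least one marked unit must be recaptured")
    if marked_in_sample > sample_size:
        raise ValueError("marked units in the sample cannot exceed the sample size")
    return marked_in_universe * sample_size / marked_in_sample


if __name__ == "__main__":
    # Hypothetical figures: 200 units marked, 150 observed later, 30 of them marked
    print(capture_recapture_total(200, 150, 30))
```

The estimate is undefined when no marked unit is recaptured, which is why the sketch rejects z = 0 rather than returning a value.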
A variance estimator of $N^*$ is
Z \Z / n—1
(24.29)
Further reading
Bailey; Boswell et al.; Hájek and Dupač, section VIII.8; Hammersley; Leslie; Pollock
et al.; Ramsey et al.; Sen (1971, 1972, 1973 and 1991); Thompson, Chapter
18; United Nations (1993b).
Exercise
(24.30)
where n is the total sample size with the n' required units.
An unbiased variance estimator of p is
$s_p^2 = \frac{p(1 - p)}{n - 2} \left( 1 - \frac{n - 1}{N} \right)$ (24.31)
When N is large, the variance estimator of p can be taken as
$s_p^2 = \frac{p(1 - p)}{n - 2}$ (24.32)
Further, with a small p, the sample size n will be fairly large, so that
the variance estimator $s_p^2$ in equation (24.32) could be approximated by
p(1 - p)/(n - 1), which is the form of the unbiased variance estimator of
p = n'/n in srs with replacement (equation (2.74)).
The estimation equations in inverse sampling are attributed to J.B.S.
Haldane.
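A minimal sketch of these computations follows. It assumes that the estimator of equation (24.30), which is not reproduced above, is Haldane's $p = (n' - 1)/(n - 1)$, and it uses the large-N variance estimator p(1 - p)/(n - 2). The function name and the figures are hypothetical.

```python
def inverse_sampling_proportion(n_required, n_total):
    """Proportion estimate and its variance estimator in inverse sampling.

    n_required: n', the pre-assigned number of units with the attribute
    n_total:    n, the number of units that had to be drawn to find them
    Assumes Haldane's estimator and a large universe (N large).
    """
    if n_required < 2:
        raise ValueError("n_required must be at least 2")
    if n_total < max(n_required, 3):
        raise ValueError("n_total must be at least n_required and exceed 2")
    p = (n_required - 1) / (n_total - 1)
    variance = p * (1 - p) / (n_total - 2)
    return p, variance


if __name__ == "__main__":
    # Hypothetical survey: 101 units drawn before the 5th case was found
    p, var = inverse_sampling_proportion(5, 101)
    print(p, var, var ** 0.5)
```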
where
$y^*_i = N y_i$ in srs (24.34)
$y^*_i = y_i / \pi_i$ in pps (24.35)
where N is the total number of units, $y_i$ the value of the study variable of
the ith selected unit (i = 1, 2, ..., n), and $\pi_i$ the probability of selection of the
ith unit (i = 1, 2, ..., N for the universe) in pps sampling.
An unbiased variance estimator of $y^*_0$ is, as before,
$s^2_{y^*_0} = \sum_{i=1}^{n} (y^*_i - y^*_0)^2 / n(n - 1)$ (24.36)
2
a Vo 7Ti in ppSWr (24.38)
(see also equations (2.17) and (5.4)). The sampling variance of $y^*_0$ in inverse
sampling is approximately
where $\sigma^2_{y^*_0}$ is given by equation (24.37) for srswr and by equation (24.38)
for ppswr, and f = n/N is the sampling fraction.
The sampling variance in inverse sampling is thus less than $\sigma^2_{y^*_0}$.
Further reading
Cochran, section 4.5; Hedayat and Sinha, section 12.1; Levy and Lemeshow,
sections 14.5 and 14.6; Sampford, sections 7.10 and 12.5; Sukhatme et al. (1984),
section 2.13.
Rare events, such as the incidence of a disease that affects, say, less than
10 per cent of a population, require a rather large sample to provide reliable
estimates. Three options are available in such cases: the use of the
“capture-recapture” techniques; inverse sampling of proportions; and network
sampling.
The first two options have been outlined in the preceding two sections.
We describe network sampling briefly with reference to a relatively rare disease.
If a patient (an “element”) with such a disease is treated at more than
one health center (one “enumeration unit”), then “the counting rule that
allows an element to be linked to more than one enumeration is called Multiplicity
counting rule” and “sampling designs that use multiplicity counting
are called network sampling” (Levy and Lemeshow, 1991, section 14.5). For
a fuller introduction to network sampling, see Levy and Lemeshow, op. cit.
Further reading
Kalton and Anderson (1986); Levy and Lemeshow, sections 14.5 and 14.6; Sud-
man, Chapter 9; Sudman, Sirken, and Cowan; U.N., 1993b.
When two or more sub-samples are taken from the same universe by the
same sampling plan so that each sub-sample covers the universe and provides
estimators of the universe parameters on application of the same
estimating procedures, the sub-samples are known as interpenetrating networks
of samples. The sub-samples may or may not be drawn independently
and there may be different levels of interpenetration corresponding to different
stages in a multi-stage sample scheme. The sub-samples may also be
distinguished by differences in the survey procedures or in processing features.
These are sometimes known as replicated samples. The technique
was developed by P.C. Mahalanobis in 1936; a variant of this is sometimes
called the Tukey plan (Deming, 1960, pp. 186-7, footnote).
This technique enables one to:
(a) examine the factors causing variation, e.g. enumerators, field schedules,
different methods of data collection and processing;
(b) compute the sampling error from the first-stage units if these comprise
one level of interpenetration (both by the standard method as also by
a non-parametric test);
(c) provide control in data collection and processing;
(d) supply advanced estimates on the basis of one or more sub-samples
and provide estimates based on one or more sub-samples when the
total sample cannot be covered due to some emergency;
(e) provide the basis of analytical studies by the method of fractile graphical
analysis.
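Use (b) can be sketched as follows, under the assumption that the k sub-samples are drawn independently and each yields a valid estimate of the same universe value. The combined estimate is the simple average of the sub-sample estimates and its variance estimator is based on their spread. The non-parametric statement is that the range of k independent estimates covers the median of their common (continuous) distribution with probability $1 - (1/2)^{k-1}$. The function name and figures are hypothetical.

```python
def combine_subsamples(estimates):
    """Combine k independent interpenetrating sub-sample estimates.

    Returns (combined estimate, variance estimator of the combined
    estimate, probability that the range of the k estimates covers
    the median of their sampling distribution).
    """
    k = len(estimates)
    if k < 2:
        raise ValueError("at least two sub-samples are needed")
    combined = sum(estimates) / k
    variance = sum((t - combined) ** 2 for t in estimates) / (k * (k - 1))
    coverage = 1 - 0.5 ** (k - 1)
    return combined, variance, coverage


if __name__ == "__main__":
    # Hypothetical estimates of a total from four sub-samples
    print(combine_subsamples([100, 110, 90, 100]))
```

With only two sub-samples the range statement holds with probability one-half, so several sub-samples are needed before the non-parametric interval becomes informative.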
The technique may be incorporated as an integral part of the standard
sample designs and has been used in a number of sample surveys in India
(including the Indian National Sample Survey), Peru, Zimbabwe, the
Philippines, and the U.S.A.
The first three uses of the technique are illustrated in section 26.6.5. In
this section we shall briefly deal with the other two uses of the technique.
Further reading
Cochran, sections 13.15 and 13.16; Deming (1960, Chapter 11; and 1963); Kish
(1965), section 4.4; Koop (1960 and 1988); Lahiri; Mahalanobis (1946a, 1958, and
1961); Murthy (1967), sections 2.12 and 13.10g; Rao, C.R. (1993); Som (1965);
Sukhatme et al. (1984), section 11.10; U.N. (1964a, section V18, and 1972, Part
I, section 11); Yates, sections 3.17, 5.4-5.6, 7.24-7.26; Zarkovich (1963, section
10.8).
As has been described before, the standard method of estimating the sampling
variance of the estimator of a universe total is to base it on the
estimators of the total obtained from each of the sample first-stage units
selected with replacement. If a fixed number of sample fsu’s is selected in
each stratum, the computation of variance is simplified, the more so if two
sample fsu’s are selected in each stratum (section 9.5 and note 2 of section
19.3). It has also been noted earlier that the computation of estimates
and their variances become very simple if the design is made self-weighting,
either at the field or at the tabulating stage. The use of interpenetrating
networks of samples in providing variance estimators will be discussed in
section 25.6.5.
Note: Whenever a sub-sample is drawn from the original sample for computing
the variance estimate of an estimate obtained from the large, original sample,
caution must be exercised in using the appropriate estimating equations. Suppose,
for example, that from an srs with replacement of n out of N units, the unbiased
estimator $y^*_0$ of Y is computed by the method of section 2.9, i.e., $y^*_0 = N \sum_{i=1}^{n} y_i / n$.
If a sub-sample of n' units is selected at random from the n units, then an unbiased
estimator of $\sigma^2$, computed from the sub-sample of n' units, will be
$s'^2 = \sum_{i=1}^{n'} (y'_i - \bar{y}')^2 / (n' - 1)$ (24.40)
where $\bar{y}' = \sum_{i=1}^{n'} y'_i / n'$, and $y'_i$ (i = 1, 2, ..., n') are the values of the study variable
in the sub-sample. To estimate the sampling variance of $y^*_0$, defined above, this
value of $s'^2$ is substituted for $s^2$ in estimating equation (2.44), which then becomes
Further reading
Hansen et al. (1953), Vol. I, section 10.16, Vol. II, section 10.4; Kish (1965),
sections 8.6C and 14.3.
In the last two decades, a number of methods have emerged that are applicable
to non-linear functions of the sample data (and, of course, as a
special case, to linear functions) and do not involve the assumption of normality.
We discuss here two such methods - the bootstrap method and the
jackknife method: both require the re-sampling of the sample data.
24.9.1 Bootstrap
The bootstrap method was developed by Efron (1979) to provide estimates
of sampling variances for complex statistics.
Suppose, in a universe comprising N units, interest centers on a universe
parameter R which is a non-linear function of the universe variate(s) and
that a sample of size n has been drawn according to a specified sample
design. In the bootstrap method, the sample of size n is considered as the
universe, from which repeated samples are drawn (with replacement), each
of size n. Suppose, for example, that the sample size is n = 10. The sample
is re-sampled (with replacement), using a computer, to provide an artificial
data set (a construct) of n = 10 values. M (100 or more) such constructs
are obtained from the computer, in each of which some of the original ten numbers
would, by chance, be selected more than once and some might not be selected at all. If there
are M re-samples, for each of which a copy r is made of universe parameter
where
$\bar{r} = \sum_{j=1}^{M} r_j / M$ (24.43)
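The re-sampling procedure described above can be sketched as follows for a simple random sample. The statistic used, a ratio of two sample totals, is only an illustration of a non-linear function; the function names, the seed, and the divisor M - 1 in the variance are assumptions of this sketch rather than the book's equations.

```python
import random


def bootstrap_variance(sample, statistic, M=200, seed=0):
    """Bootstrap mean and variance of `statistic` over M re-samples,
    each of the same size as `sample` and drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(M):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    r_bar = sum(replicates) / M                      # mean of the M copies
    variance = sum((r - r_bar) ** 2 for r in replicates) / (M - 1)
    return r_bar, variance


def ratio(pairs):
    """Ratio of totals for a list of (y, x) pairs."""
    return sum(y for y, _ in pairs) / sum(x for _, x in pairs)


if __name__ == "__main__":
    # Hypothetical (y, x) values for ten sample units
    data = [(12, 30), (8, 25), (15, 41), (9, 22), (11, 35),
            (14, 38), (7, 19), (10, 27), (13, 33), (6, 21)]
    print(bootstrap_variance(data, ratio, M=500))
```

For a complex design the re-sampling would have to respect the strata and stages of the original sample; the sketch does not attempt that.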
Further reading
The New York Times of 9 November 1988 and the London Economist of 19 November
1988 gave popular accounts of the bootstrap method. For a detailed introduction
to the method, see J.N.K. Rao’s article on variance estimation (1988). For a
full treatment of the topic, see the books by Efron and Tibshirani (1993) and
Mammen (1992), and articles by Efron (1979, 1982, and 1989), Efron and Gong
(1983), and J.N.K. Rao and Wu (1988), and the papers for the 1990 Conference
on Exploring the Limits of Bootstrap, edited by LePage and Billard.
24.9.2 Jackknife
A method proposed by Quenouille (1956) for reducing the bias in ratio
estimators is used, in a modified form suggested by Tukey (1958), in variance
estimation of non-linear functions; Durbin (1959) and Schucany, Gray
and Owen (1971), among others, have further developed the method. It
$\bar{r} = \sum_{j=1}^{n} r_j / n$ (24.46)
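A delete-one version of the jackknife for a simple random sample can be sketched as below. It is an assumption of this sketch that each $r_j$ is the statistic recomputed with the jth unit omitted; the bias-corrected estimate is then $n r - (n - 1)\bar{r}$ and the variance estimator $(n - 1) \sum (r_j - \bar{r})^2 / n$. The function name is hypothetical.

```python
def jackknife(sample, statistic):
    """Delete-one jackknife: bias-corrected estimate and variance estimator."""
    sample = list(sample)
    n = len(sample)
    if n < 2:
        raise ValueError("at least two units are needed")
    full = statistic(sample)
    # Statistic recomputed with each unit left out in turn
    leave_one_out = [statistic(sample[:j] + sample[j + 1:]) for j in range(n)]
    r_bar = sum(leave_one_out) / n
    bias_corrected = n * full - (n - 1) * r_bar
    variance = (n - 1) / n * sum((r - r_bar) ** 2 for r in leave_one_out)
    return bias_corrected, variance


if __name__ == "__main__":
    mean = lambda values: sum(values) / len(values)
    print(jackknife([1, 2, 3, 4, 5], mean))
```

For the sample mean the jackknife variance reduces to the usual $s^2/n$, which gives a quick check on the implementation.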
Further reading
For a more detailed introduction to the jackknife method, see J.N.K. Rao’s (1988)
article on variance estimation. For full details, see the books by Efron (1982),
and Gray and Schucany (1972), and articles by R.G. Miller (1974), and Schucany,
Gray, and Owen (1971).
the standard method. In such a situation, the strata may be combined into
a smaller number of groups such that each group contains at least two first-
stage sample units. Estimators of variances can then be computed from
the sample fsu’s in the newly formed groups. This is known as the method
of collapsed strata.
The variance estimator, computed from such collapsed strata, over-estimates
the sampling variance. A very satisfactory approximation may,
however, be obtained when the number of strata is large and the strata are
combined into groups such that for each group the strata are about equal in
size (total Y). On the other hand, the variance estimator will be a serious
under-estimate if the groups are constructed by seeing the sample results
and making them differ as little as possible.
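A sketch of the computation is given below. It assumes one first-stage unit per original stratum, each providing an estimate of its stratum total, and uses the collapsed-strata form given by Cochran (section 5A.12), $\sum_g \frac{L_g}{L_g - 1} \sum_h (t_{gh} - \bar{t}_g)^2$, where $L_g$ is the number of strata in group g. The function name and figures are hypothetical.

```python
def collapsed_strata_variance(groups):
    """Variance estimator of an estimated total from collapsed strata.

    groups: list of groups, each a list of estimated stratum totals
            (one per original stratum, i.e. one sample fsu per stratum).
    """
    total = 0.0
    for stratum_totals in groups:
        L = len(stratum_totals)
        if L < 2:
            raise ValueError("each group must contain at least two strata")
        group_mean = sum(stratum_totals) / L
        total += L / (L - 1) * sum((t - group_mean) ** 2 for t in stratum_totals)
    return total


if __name__ == "__main__":
    # Hypothetical stratum estimates, collapsed into one pair and one triple
    print(collapsed_strata_variance([[10, 14], [20, 20, 26]]))
```

For a group of two strata the contribution reduces to the squared difference of the two stratum estimates. The grouping must be fixed before the sample results are seen, for the reason given above.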
Further reading
Cochran, sections 5A.12 and 8.12; Hansen et al. (1953), Vols I and II, Chapter 10;
Kish, sections 4.3, 7.4D, and 8.6B; Singh and Chaudhary, section 3.12; Sukhatme
et al. (1984), section 4.15.
Further reading
Cochran, section 5A.5; Chaudhuri and Stenger, section 1.1.4; Foreman, section
11.8; Kish (1965), section 12.8; Singh and Chaudhary, section 3.15; Sukhatme et
al. (1984), section 4.16.
from equation (24.48). Also, from equation (24.48), the probability of occurrence
of $H_j$, given A, is
$P(H_j | A) = \frac{P(AH_j)}{P(A)} = \frac{P(A | H_j) P(H_j)}{\sum_{i=1}^{k} P(A | H_i) P(H_i)}$ (24.50)
That is, the probability of Hj, given A, is proportional to the probability
of A, given Hj, multiplied by the probability of Hj . This is Bayes’ theorem
or rule for the probability of causes. P(Hj | A) is called the posterior
probability, P(A | Hj) the likelihood, and P(Hj) the prior probability;
$H_1, H_2, \ldots, H_k$ are called the hypotheses or causes of A.
The theorem has many statistical applications but requires a knowledge
of the prior probabilities. Suppose a human population is divided into k
strata on the criteria of race, and let $p_i$ (i = 1, 2, ..., k) be the probability
that an individual chosen at random belongs to stratum $H_i$. Let the event
A denote the possession of blood group A, and let $p'_j$ denote the conditional
probability $P(A | H_j)$ that a person belonging to race $H_j$ also has blood
group A. The probability that an individual chosen at random has blood
group A is, from equation (24.49), $\sum_{i=1}^{k} p_i p'_i$. Given that a person has blood
group A, what is the probability of his belonging to race $H_j$?
From Bayes’ Rule (equation 24.50), the answer is
$P(H_j | A) = p_j p'_j / \sum_{i=1}^{k} p_i p'_i$
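The blood-group illustration can be worked through numerically. In the sketch below the prior stratum probabilities and the conditional probabilities of blood group A are purely hypothetical figures chosen for the arithmetic, and the function name is likewise an assumption.

```python
def posterior_probabilities(priors, likelihoods):
    """Bayes' rule: posterior P(H_j | A) for each hypothesis H_j.

    priors:      P(H_j), summing to 1
    likelihoods: P(A | H_j), in the same order
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                # P(A), the total probability of A
    if total == 0:
        raise ValueError("event A has zero probability under every hypothesis")
    return [j / total for j in joint]


if __name__ == "__main__":
    # Hypothetical strata shares and blood-group-A rates within each stratum
    priors = [0.5, 0.3, 0.2]
    likelihoods = [0.4, 0.2, 0.1]
    print(posterior_probabilities(priors, likelihoods))
```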
Further reading
Chaudhuri and Stenger, section 3.2.1; Ericson (1969) and other relevant articles in
Johnson and Smith (1969); Ericson (1988); Feller, sections V.l and V.2; Lindley
(1965 and 1972); Schmitt.
has found wide applications in consumer market research and, more recently,
in political and opinion research. Focussed group interviews should
be “conceived of as an ecumenical research tool for use in every domain of
social life; it is not at all confined to market research as many evidently
assume is the case. For sound research results, it must be supplemented by
more systematic and typically quantitative research” (personal communication
from Robert K. Merton).
Group interviews have been used for over 35 years in the fields of psychotherapy,
public opinion surveys and market research (Slavson, 1979, and
Higginbotham and Cox, 1979, cited by Scherr, 1980). Following the pioneering
work of Merton and his colleagues in the 1940s at the Columbia
Office of Radio Research during World War II and later at the Columbia
Bureau of Applied Social Research (into which the Office of Radio Research
had evolved), providing a theoretical underpinning of the basic concept of
focus (Merton and Kendall, 1946, Merton, Fiske, and Kendall, 1956), focus
groups have gained a huge currency, particularly in market research, all over
the world. In the United States alone, 100,000 focus groups were conducted
in consumer market research in 1992, with more than 1 million participants
(Impulse Research Corp.). Focus groups are now increasingly being used in
fields such as social security administration, population censuses and family
planning surveys.
In focus group interviews, a number of respondents or “participants”
are brought together for half an hour to two hours under the direction of a
“group leader” or “moderator” ; the size of the group varies according to the
topic of discussion and the type of participants, generally ranging from five
to twelve. In contrast to individual interviews, the prime concern in focus
group research is group interaction, and thus the suggested criterion for the
success of a focus group: in a productive focus group, the participants talk
to each other more than they talk to the moderator.
It has been argued that unlike individual interviews that are based on
structured questionnaires, a focus group discussion allows greater in-depth
probing of a topic and is, therefore, expected to provide more representative
data from the public point of view. The contrary argument is that groups
can inhibit individual articulation: private interviews are used, e.g. in KAP
studies, to encourage responses that are normally inhibited by the presence
of others (Stycos). As Merton has indicated, group and private focussed
interviews are complementary.
Focus group sessions are commonly tape recorded, sometimes videotaped,
and are usually conducted in facilities equipped with one-way glass
to view participants without inhibiting them.
Focus groups have been found to be an appropriate tool for developing
insights and hypotheses and for exploring the range of pertinent attitudes,
Further reading
For the use of focus groups and other methods of qualitative and motivation research
in general, consult the books by Merton, Fiske and Kendall (1990) and
by Morgan (1988) and articles by Merton and Kendall (1946), Merton (1987)
and Forsyth and Lessler (1991). For the use of focus groups in consumer market
research, consult articles by Calder (1977), Cox et al. (1976), Fern (1982),
Gage (1978), Grik, Parker, and Hetegikamana (1987), McDaniel (1979), Reynolds
and Johnson (1978), Sampson (1986), and Szybillo and Berger (1978); for its use
in population, family planning, and community health studies, consult the references
cited in the text, the special issue on focus group research of Studies in
Family Planning (12(12), 1981), and articles by Grik et al. (1987), Knodel (1990),
and Sikes (1993); for the U.S. Census Bureau’s use of focus groups, see Market
Dynamics’s report, Alder (1985), Bush (1985), Freeman (1985) and U.S. Census
Bureau (1993a).
$p^* = \frac{p_1 - p_2 (1 - \theta)}{\theta}$
where $p_1 = n_1/n$.
An unbiased estimator of the variance of $p^*$ is given by
var($p^*$) = (24.52)
The estimated standard error of this estimate is, from equation (24.52),
0.0374, and the percentage error is 37.44 per cent.
The technique sometimes returns inadmissible estimates - proportions
that are negative or greater than 1.
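How an inadmissible estimate can arise is easy to see numerically. The sketch below assumes an unrelated-question design, which may differ in detail from the design described in this section: each respondent answers the sensitive question with a known probability theta and otherwise an innocuous question whose proportion of “yes” answers, p2, is known. The symbols theta and p2 and the function name are assumptions of the sketch.

```python
def randomized_response_estimate(yes_answers, n, p2, theta):
    """Estimate the proportion with the sensitive attribute.

    yes_answers: number of "yes" answers among the n respondents
    p2:          known "yes" proportion for the innocuous question
    theta:       probability that a respondent answers the sensitive question
    Returns (estimate, admissible), where admissible is False when the
    estimate falls outside the interval [0, 1].
    """
    if not 0 < theta <= 1:
        raise ValueError("theta must lie in (0, 1]")
    p1 = yes_answers / n                       # observed proportion of "yes"
    estimate = (p1 - p2 * (1 - theta)) / theta
    return estimate, 0.0 <= estimate <= 1.0


if __name__ == "__main__":
    print(randomized_response_estimate(30, 100, p2=0.5, theta=0.7))
    # Few "yes" answers relative to p2 * (1 - theta) give a negative estimate
    print(randomized_response_estimate(10, 100, p2=0.5, theta=0.7))
```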
The putative advantage of RRT is that it is impossible to tell which
question the individual answered and therefore it is impossible to tell their
standing on the controversial item, so that the respondent can feel anonymous.
But respondents are often suspicious that they have been tricked
and will lose their confidentiality, so they still might not answer honestly.
Randomized response technique has been used in studies, among others,
of induced abortion in urban North Carolina (Abernathy, Greenberg, and
Horvitz, 1970), and outcomes of pregnancies in Taiwan (I-Cheng, Chow,
and Rider, 1972). RRT is also being used in AIDS research (Discover, July
1987, p. 12, cited by Scheaffer et al.).
In addition to the limitations of the randomized response technique
mentioned earlier, another reason why the technique is not being used more
is that cross-tabulations of the sensitive topic responses by independent
variables are not possible. The above considerations call for a most judicious
application of this technique.
Further reading
For the admissibility criteria of RRT estimates, see Bourke and Dalenius (1974).
For a full treatment of the subject, see the references cited in the text and the
book by Chaudhuri and Mukherjee; consult also the papers by Horvitz, Greenberg
and Abernathy (1975). Droitcour et al. (1991) have discussed a number of
randomized response techniques. They also reviewed an experiment using the
item count method, due to J. Miller (1984), where each respondent is provided
with a short list of items describing behaviors and asked to count and report the
total number (not the names) of behaviors engaged in. In the simplest case, a
probability sub-sample (Sub-sample A) of respondents are shown a four-item list
that includes the socially disapproved behavior item; the remaining respondents
(Sub-sample B) are shown a list of three items (the original four-item list minus
the disapproved item). By comparing responses from the two sub-samples, an
estimate is obtained.
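A minimal sketch of the comparison is given below, assuming that the estimate of the proportion engaging in the disapproved behavior is the difference between the mean counts reported by the two sub-samples. The function name and the counts are hypothetical.

```python
def item_count_estimate(counts_with_item, counts_without_item):
    """Difference of mean reported counts between Sub-sample A (list
    including the sensitive item) and Sub-sample B (list without it)."""
    if not counts_with_item or not counts_without_item:
        raise ValueError("both sub-samples must be non-empty")
    mean_a = sum(counts_with_item) / len(counts_with_item)
    mean_b = sum(counts_without_item) / len(counts_without_item)
    return mean_a - mean_b


if __name__ == "__main__":
    # Hypothetical reported counts (A: four-item list, B: three-item list)
    sub_sample_a = [2, 1, 3, 2]
    sub_sample_b = [1, 1, 2, 2]
    print(item_count_estimate(sub_sample_a, sub_sample_b))
```

Like the randomized response estimate, this difference can fall outside the range 0 to 1 in small samples.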
In line sampling, lines are drawn across a geographical area and all universe
units falling on the line, or intersected by it, are included in the
sample. If the lines are straight, parallel, and equally spaced across the area, the
sampling becomes one variant of systematic sampling (Chapter 4).
Further reading
For details of the Health and Activity Survey in Canada in 1983-84, see Statistics
Canada (1986) and for the 1991 survey, see Statistics Canada (1991 and 1994);
for details of the Netherlands Health Interview Survey in 1981-91, see Netherlands,
Central Bureau of Statistics (1992); for the U.S. Health Interview Survey
in 1984-88, see Ries and Brown (1991); and for the 1983-85 Survey, see LaPlante
(1988). Also see Chamie (1989) for survey design strategies; for case studies
of the development of disability statistics in Egypt, Iraq, Jordan, Lebanon, and
Syria, see U.N. (1986) and for development of statistical concepts and methods
for household surveys, see U.N. (1988a).
The U.N. World Summit for Children, held in September 1990, established,
in line with the Rights of the Child, a series of specific goals to be achieved
by the year 2000: the major goals being reducing infant and child mortality
by one-third or to 50 and 70 per 1000 live births respectively, whichever
is less; halving of the maternal mortality rate; halving the prevalence of
severe and moderate malnutrition among children; providing universal access
to safe drinking water and to sanitary means of excreta disposal; and
improving protection of children in especially difficult circumstances.
One of the instruments for achieving these goals is the monitoring of the
nutritional status of children, for child malnutrition is an accepted indicator
of social development that has been seen to be endemic not only in developing
countries, but also in a number of developed countries, particularly
among the urban poor and the “inner cities” in the United States. The
three recommended nutritional status indicators bounding the early childhood
period are: birth weight, weight-for-age, and height-for-age (WHO,
1962; FAO, 1984; Carlson, 1987).
The principal objective of a nutritional study of children in this context
is to assess the frequency and degree of underweight (indicator: weight-
for-age), stunting or chronic malnutrition (indicator: height-for-age), and
wasting or acute malnutrition (indicator: weight-for-height).
A sample survey is the only feasible data source for such information.
A linked nutritional status survey module that combines a survey on the
nutritional status of children with another compatible survey could be the
proper strategy for a field study (U.N., 1990d; Carlson and Jaworski, 1994).
In such a linked survey, the nutritional module is self-standing but with
links to the relevant identification and characteristics of the sample households,
so that studies of interrelationships of the nutritional status of chil
live in rural areas (with exposure to cattle) but commute daily to the cities
for commercial sex activities (Bhattacharya).
Further reading
In addition to the references cited in the text, see the articles in Public Health
Reports, Vol. 105, No. 2, March-April 1990, “Special Section: The Sentinel HIV
Seroprevalence Surveys.”
CHAPTER 25
25.1 Introduction
In the previous chapters, we had assumed implicitly that the basic data
collected either through a complete enumeration or a sample are free from
any error or bias and that the (unbiased) estimates obtained from a sample
are subject only to sampling errors. This is not so in actual survey conditions.
From the stage of collection to that of preparation of final tables,
the data are generally subject to different types of errors and biases. This
chapter reviews the various types of errors and biases in data
and in estimates derived from them and the methods of measurement and
control of these errors and biases.
The importance of giving proper attention to errors and biases in data
and estimates can best be illustrated by the telling words of W. Edwards
Deming (1950, page 25):
“For what profiteth a statistician to design a beautiful sample when
the questionnaire will not elicit the information desired, or if the
universe has not been satisfactorily defined, or the field-force is so
badly organized that the results will not be worth tabulating? And
again, what is accomplished if a well-designed questionnaire and well-
disciplined field-force are used with a biased sampling procedure?”
424 Chapter 25
some households, persons, etc. are missed, i.e. not enumerated at all or
enumerated more than once, that is a coverage error: complete villages
in remote areas being missed in some African censuses are an extreme case
in point. If a particular unit is covered in the survey, but there is a mistake
in recording its relevant characteristics (the age of a person, for example,
who may not, for whatever reasons, know or report his exact age), that is
a content (or classification) error. There may be some balancing out of
these two types of errors.
The different types of errors in data and estimates can be classified first
according to the source: (a) errors having their origin in sampling and (b)
errors which are common to both censuses and samples.
(a) Sampling errors. A measure of the degree to which the sample estimate
differs from the expected survey value (which is obtained on
repeated applications of the same survey procedure) is given by the
standard error, the square of which is called sampling variance.
be subject to much smaller biases than the estimates of the numerators and
denominators, may be obtained for the less important items. A comparison
should in any case be made of the estimates based on proper weights and
simplified weights for some important items in order to obtain an idea of the
magnitude and direction of the bias in simpler estimates. Results from a
number of demographic surveys in India have shown that such a procedure
could be used with proper caution (Som, 1973, section 1.3.2(ii)b 1).
Second, biased estimates will result when estimates of variances and
covariances are computed from results at higher levels of aggregation. Results
(such as birth rate and proportion of childless women) are not generally
published for the ultimate strata but only for the higher levels of
aggregation, such as states and provinces: the components of these results
(estimated numbers of births and total population for the birth rate and
estimated numbers of childless women and total women for the proportion
of childless women) are almost never published. Yet, in a study of
a demographic survey in Zaire in 1955-1957, the correlation coefficient between
(the estimated) birth rate and (the estimated) proportion of childless
women was computed from the published results of birth rate and proportion
of childless women for the 28 districts in the country: this is a wrong
method and the desired estimates should be obtained with a design-based
procedure by building up estimates of variances, covariances etc. from the
ultimate strata in the sample.
As a last point, biases will arise when the sample is treated as if it
had come from a stratified design when the design is either unstratified or
stratified with respect to another variable (see section 10.7).
census screening questions and those who answered “no”: the proportion of
persons with some disability was estimated at 13 per cent, of whom “false
negatives” - i.e. the census screening missed - constituted 40 per cent and
the ratio of “false positives” - i.e. those falsely identified by the census
screening as having some disability when they had none - to the number
with actual disability was 1/16 (U.N., 1994; for details, see Statistics
Canada, 1992).
This inventory and description of errors and biases should not lead one
to suppose that all inquiries are worthless because all have errors. The er
rors are of varying types and degrees, and can occasionally be measured and
subtracted out; this could be the main aim in research on errors and biases
(Deming, 1950). On the other hand, a survey in which the magnitude and
direction of different types of errors are not evaluated, or at least indicated,
may give a false sense of accuracy in the collected data and constructed
estimates.
$b_i = y_i - z_i$
25.4.2 Census
$y_i = z_i + b_i$ (25.1)
Errors and Biases in Data and Estimates 431
The relation between the survey value and the true value of the universe
total is given by summing both sides of equation (25.1),
$\sum_{i=1}^{N} y_i = \sum_{i=1}^{N} z_i + \sum_{i=1}^{N} b_i$
or
$Y = Z + B$ (25.2)
and that between the means by dividing both sides of equation (25.2) by
N
$\bar{Y} = \bar{Z} + \bar{B}$ (25.3)
where B is the error of the survey total (Y) and $\bar{B}$ that of the survey
mean ($\bar{Y}$).
When Y > Z, i.e. B, the error in the survey total, has a positive sign,
Y gives an overestimate of Z; and when Y < Z, i.e. B has a negative sign,
Y gives an underestimate of Z. And similarly for the means.
From equation (25.1) the variance of an individual survey value is given
by
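$\sigma_y^2 = \sigma_z^2 + \sigma_b^2 + 2\sigma_{zb}$ (25.4)
where $\sigma_y^2$, $\sigma_z^2$, and $\sigma_b^2$ are the variances (per unit) of y, z, and b respectively
and $\sigma_{zb}$ is the covariance of z and b, defined respectively by
$\sigma_y^2 = \sum_{i=1}^{N} (y_i - \bar{Y})^2 / N$; $\sigma_z^2 = \sum_{i=1}^{N} (z_i - \bar{Z})^2 / N$;
$\sigma_b^2 = \sum_{i=1}^{N} (b_i - \bar{B})^2 / N$; $\sigma_{zb} = \sum_{i=1}^{N} (z_i - \bar{Z})(b_i - \bar{B}) / N$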
(ii) Individual errors present, but total error absent. Here $b_i \neq 0$, but
B = 0, so that
Thus in this case, the survey total or mean would be free from error,
but the variance will be affected: even if $\sigma_{zb} = 0$, i.e. even if there is
no correlation between z and b, $\sigma_y^2 > \sigma_z^2$.
The above results show that even in a census, the results (total, mean
etc.) may be misleading unless some idea is obtained about the errors of
the data.
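The variance decomposition and the special case of compensating individual errors can be verified on a small artificial universe. The true values and errors below are hypothetical; the errors are chosen to sum to zero so that the survey total is free from error while the variance is not.

```python
def pop_variance(values):
    """Per-unit (universe) variance with divisor N."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pop_covariance(u, v):
    """Per-unit (universe) covariance with divisor N."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / len(u)


# Hypothetical universe of N = 5 units
z = [10, 12, 15, 11, 12]          # true values
b = [1, -2, 0, 2, -1]             # individual errors, summing to zero
y = [zi + bi for zi, bi in zip(z, b)]   # survey values, y_i = z_i + b_i

var_y = pop_variance(y)
decomposed = pop_variance(z) + pop_variance(b) + 2 * pop_covariance(z, b)

if __name__ == "__main__":
    print(sum(y), sum(z))                 # totals agree: B = 0
    print(var_y, decomposed)              # identical up to rounding
    print(pop_variance(z))                # true variance differs from var_y
```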
25.4.3 Sample
Here we assume that a simple random sample of n units is drawn with
replacement from the total of N units. The individual survey values are
Vi = Zi+ bi (25.6)
and that between the means (dividing both sides of equation (25.7) by $n$)
is
    $\bar{y} = \bar{z} + \bar{b}$    (25.8)
where $\bar{y}$ is the survey mean, $\bar{z}$ the true mean and $\bar{b}$ the error of the survey
mean, for the $n$ sample units.
When $\bar{y} > \bar{z}$, i.e. $\bar{b}$ has a positive sign, $\bar{y}$ will give an overestimate of $\bar{z}$;
and when $\bar{y} < \bar{z}$, i.e. $\bar{b}$ has a negative sign, $\bar{y}$ will give an underestimate of
$\bar{z}$.
The expected value of the survey mean $\bar{y}$ is obtained on taking the
mean of all the possible values of $\bar{y}$ from different samples all of size $n$
under the same survey procedure and is given by
Figure 25.1: Distribution of true and survey individual values and of survey
mean (adapted from Zarkovich (1963), Figure 1).
Figure 25.2: Distribution of the sample estimate ($\bar{y}$) in relation to the true
value ($\bar{Z}$). Panels: (a) Bias heavy, sampling error large; (b) Bias heavy, sampling
error small; (c) Bias small, sampling error large; (d) Bias small, sampling error small.
434 Chapter 25
In Figure 25.1, the distribution (I) of the true values $z_i$ around the true
mean $\bar{Z}$ has been shown by Curve I; the variance of the distribution is $\sigma_z^2$.
The distribution (II) of the survey values $y_i$ around the survey mean $\bar{Y}$ is
shown by Curve II; the variance of the distribution is $\sigma_y^2$. If a sample of $n$
units is taken from the distribution II, the estimates of the mean $\bar{y}$ will
have the distribution shown by Curve III around the expected survey mean
$\bar{Y}$; a measure of this variation is given by $\sigma_{\bar{y}}^2$, as defined by equation (25.10).
This does not, however, give any indication of the expected behavior of the
distribution of the sample estimate $\bar{y}$ around the true mean $\bar{Z}$, which is
the basic aim in any survey. That is measured by the mean square error
defined below, which, in addition to the sampling error, takes account of
the expected value of the bias, measured by $\bar{B} = \bar{Y} - \bar{Z}$.
Consider also the four special cases given in Figure 25.2(a)-(d). It is
clear that in the two cases (a) and (b), where the bias is heavy, the values
of both the estimate and its (estimated) variance would be misleading in
setting any confidence limits to the universe value: the ideal situation is, of
course, given by case (d) with both the bias and the sampling error small
(see also Deming, 1950, Fig. 1).
Mean square error. The variability of the survey mean around the true
value is measured by the mean square error, which is defined as the
expectation of the square of the difference of the survey mean and the true
mean:
    $\mathrm{MSE}(\bar{y}) = E(\bar{y} - \bar{Z})^2$
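The definition can be illustrated by simulation. The following is a minimal sketch, assuming a hypothetical universe of true values and a constant individual error added to every response (both are illustrative choices, not taken from the text). It draws repeated simple random samples with replacement and splits the mean square error of the survey mean into its sampling variance and the square of its bias.

```python
import random
import statistics


def simulate_survey_means(true_values, bias, n, replications, seed=0):
    """Draw srs-with-replacement samples of size n; each response is z_i + bias."""
    rng = random.Random(seed)
    means = []
    for _ in range(replications):
        sample = [rng.choice(true_values) + bias for _ in range(n)]
        means.append(sum(sample) / n)
    return means


def mse_decomposition(estimates, true_mean):
    """Return (variance, bias, mse) of the estimates around the true mean."""
    expected = statistics.fmean(estimates)
    variance = statistics.pvariance(estimates, mu=expected)
    bias = expected - true_mean
    mse = statistics.fmean((e - true_mean) ** 2 for e in estimates)
    return variance, bias, mse


# Hypothetical universe and constant response bias (illustrative only).
universe = list(range(1, 101))
true_mean = statistics.fmean(universe)
survey_means = simulate_survey_means(universe, bias=5.0, n=20, replications=2000)
variance, bias, mse = mse_decomposition(survey_means, true_mean)
print(f"variance={variance:.3f}  bias={bias:.3f}  mse={mse:.3f}")
```

Over the simulated samples the mean square error equals the variance plus the squared bias, so a small sampling variance on its own says little about accuracy when the bias is large.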
Table 25.1: Effects of individual and total errors on the expected value, the
sampling variance, and the mean square error of the survey mean
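Special cases:
  $b_i$                  $\bar{B}$       $E(\bar{y})$        Sampling variance    Mean square error
  $= 0$                  $= 0$        $\bar{Z}$           $\sigma_z^2/n$            $\sigma_z^2/n$
      Estimate, sampling variance, mean square error unaffected. (Note the classical formula for
      the sampling variance of the mean for simple random samples.)
  $= B_0$ (constant)     $\neq 0$     $\bar{Z} + B_0$     $\sigma_z^2/n$            $\sigma_z^2/n + B_0^2$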
sampling error of the estimates. Response biases are the other type of
response errors that do not cancel out, and affect the estimates themselves,
while the variability of the estimates may or may not be affected (consider
in the preceding subsection the cases where $B \neq 0$; the sampling errors
will not be affected if $b_i$ is a constant.) Thus, in a census or a sample,
if the differential effects of the enumerators cancel out, but there is a net
bias common to all the enumerators, these may be stated to constitute a
response bias.
sample surveys make it possible to exercise better control over the collection
and processing of data by employing superior enumerators, giving them
intensive training, and requiring interviews in depth (see section 1.3).
The aim in an inquiry, be it a census or a sample, should be to obtain
estimates with the greatest possible accuracy by controlling the total error
(as measured by the root mean square error) and not merely the sampling
error.
More complex models of errors in data and estimates have been developed
by Cochran (1977); Fellegi (1964); Hansen and his colleagues (1951; 1953,
Vol. II, Chapter 12; 1961), and Sukhatme and Seth (1952), among others;
more recently developed models are reviewed in Measurement Errors in
Surveys (edited by Biemer et al., 1991), in particular by Biemer and Stokes
(1991). Following in general the formulation by Cochran, and elaborating
the model in section 25.4, we express $y_{ij}$, the value obtained on the $i$th unit
($i = 1, 2, \ldots, n$) in the $j$th independent repetition, as
Some types of response errors may be immediately evident in the data. The
misreporting of age is a well-known case in point, reflected in the heaping
of ages at some individual years (ending mostly in zeroes and fives) and
the corresponding deficiency at some others (mostly ending in odd digits
other than 5): very high or very low birth and death rates reported in some
demographic surveys and civil registration constitute another case.
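Heaping of this kind can be seen by tabulating the terminal digits of reported ages. The sketch below is one simple way of doing so; the function name and the sample ages are hypothetical. With accurate reporting over a wide age range each digit would be expected to take a share of roughly one-tenth, so shares for 0 and 5 well above that suggest heaping.

```python
from collections import Counter


def terminal_digit_shares(ages):
    """Return the share of reported ages ending in each digit 0-9."""
    if not ages:
        raise ValueError("at least one age is required")
    counts = Counter(age % 10 for age in ages)
    total = len(ages)
    return {digit: counts.get(digit, 0) / total for digit in range(10)}


# Hypothetical reported ages (illustrative only).
reported_ages = [20, 25, 30, 30, 31, 35, 40, 40, 42, 45, 50, 50, 55, 60, 67]
shares = terminal_digit_shares(reported_ages)
for digit, share in shares.items():
    print(digit, f"{share:.2f}")
```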
Some methods of measuring response errors will be considered: external
record checks; re-surveys; interpenetrating networks of sub-samples; internal
consistency checks; and analysis by recall periods.
Such checks can be made only if accurate external data are available for each
unit of the universe. An example is the “Reverse Record Checks” made for
coverage errors in the 1960 Census of Population and Housing in the U.S.,
where probability samples of persons were drawn from different sources
of records, namely persons enumerated in the previous census, registered
aliens, children born in the intervening period, etc., or of special groups,
such as the aged social security beneficiaries and students enrolled in colleges
and universities: these were checked against the census returns (U.S.
Bureau of the Census, 1963b). In countries with good hospital records,
data on hospitalization as reported in household interview surveys can be
compared with hospital records: this was done in the U.K. and the U.S.
In a large number of countries of the world, this method obviously either
cannot be used or would be of very limited value.
Where feasible, a related procedure is to monitor responses which are
objective in nature. This is illustrated by the checking of the validity of the
telephone coincidental method for determining the in-home radio station
rating. In this method, telephone calls are made at random times during
each quarter hour period of the day to a random sample of individuals in
listed telephone households; the selected person is asked if the radio is on,
and if it is, to identify the station that was tuned in. In a survey conducted
in the late 1960s in the New York Metropolitan area, this method was tested
as follows. Each interviewer who did the calling had at her disposal an
electronic device whereby she could transmit, over the telephone she was
using, the broadcast currently coming from any of the leading twenty A.M.
stations in the area by pressing any of the twenty buttons. In the telephone
interview, when the radio was reportedly on and the program was identified
by the respondent, the interviewer pressed the test button corresponding
to the station reported. If then the respondent reported that what was
coming over the telephone was the same as what was on his radio, the
original response was considered to be correct. Of the total 854 responses
validated in the test, 91 per cent of the responses were found correct by
this procedure (Frankel, 1969). Groves et al. (1989) have described more
25.6.2 Re-survey
A standard method of measuring and adjusting for response errors in a
survey, whether a census or a sample, is the re-survey of a sub-sample
of units in the original survey, preferably using a more detailed schedule,
and, in a personal interview, with better staff (either the supervisors or
the better set of interviewers). For population and housing censuses, the
ultimate sampling unit in re-surveys should be compact areal units. In
the form of a post-enumeration check, such re-surveys are now almost
universal with population censuses.
A re-survey can form an integrated part of the original survey, as in
the Demographic Sample Survey of Guinea, in 1954-5, where in addition,
whenever infant deaths were considered to be grossly under-reported in
a village, a medical team was sent to re-interview the females (Ministère
de la France d’Outre-Mer, 1956). Unless conducted simultaneously with
the original survey or immediately after it, the re-survey introduces some
operational and technical difficulties.
The conducting of a re-survey, in the form of a post-enumeration
check, i.e. a Post-Enumeration Survey (PES), to evaluate a census is
a relatively recent phenomenon. For four countries with a long tradition
of census-taking, the year when the first population census was conducted
and the year when a PES was first introduced in the census are: the U.S.
- 1790 and 1950; the U.K. - 1801 and 1961; Canada - 1871 and 1951; and
India - 1872 and 1951.
Unitary checks with one-to-one matching (e.g. of persons in a population
census and the PES) are considered the essential, integral part of
post-enumerative checks of censuses to check both gross and net errors;
however, the difficulty of such matching with the available resources may be
so enormous in some countries that checks at the aggregate levels (e.g. of
the total number of persons in the selected areal units) may be considered
practical to check only the net errors.
The statistical model in a re-survey is
    $E(\bar{y}') = \bar{Z}$
on the assumption that the bias element $\bar{B}' = 0$, the letters with primes
denoting the value obtained from the re-survey. The difference $(\bar{y} - \bar{y}')$
therefore gives an estimator of the bias element $\bar{B}$ in the original survey.
Adjustments may also be made by regression and ratio estimators.
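The difference estimator of the bias can be sketched as follows. The function names and the figures are hypothetical, and the sketch assumes that the re-survey values are free of bias and are matched unit by unit to the original values for the same sub-sample.

```python
def estimate_bias(original_values, resurvey_values):
    """Estimate the bias as (mean of original) - (mean of re-survey) on matched units."""
    if not original_values or len(original_values) != len(resurvey_values):
        raise ValueError("need two non-empty, matched lists of equal length")
    n = len(original_values)
    return sum(original_values) / n - sum(resurvey_values) / n


def adjust_mean(original_mean, estimated_bias):
    """Subtract the estimated bias from the original survey mean."""
    return original_mean - estimated_bias


# Hypothetical matched sub-sample (illustrative only).
original = [5.0, 7.0, 9.0, 6.0]
resurvey = [4.0, 6.5, 8.0, 5.5]
bias_hat = estimate_bias(original, resurvey)
print("estimated bias:", bias_hat)
print("adjusted mean:", adjust_mean(6.8, bias_hat))
```

This is the simplest form of adjustment; the regression and ratio estimators mentioned in the text would use the relationship between the two sets of values rather than their plain difference.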
Some estimates of the under-enumeration of the total population (i.e.
the coverage error) in a number of censuses are as follows: 1/2 per cent in
the U.S.S.R. in 1959; 1.1 per cent in Canada in 1956; 1.1 per cent (and 3.5
per cent by analytical methods) in the U.S. in 1950; 3.5 per cent in Sierra
Leone in 1963; and 8.0 per cent in Swaziland in 1956. For the U.K., an
estimate at census date in 1951 exceeded the final count by 0.3 per cent;
and in 1961 a careful retrawl of sample areas and interviews with a small
sub-sample of households, carried out immediately after the census, gave
no evidence of significant under- or over-enumeration (Benjamin, 1968).
Note that the use of demographic-analytical techniques generally provides
a higher degree of under-count in the census, e.g. 3.6 percent (Coale, 1955)
as compared to 1.4 percent given by the PES in the 1950 population census
of the U.S.
In agricultural inquiries too, under-enumeration can be sizable. In an
experiment conducted in Greece, the farmers were found to have under-
reported by 36 per cent the number of parcels operated by them, and in
the 1979 Census of Agriculture in the U.S., the total number of farms was
under-counted by 8.4 per cent and the area of farms in lands by 6 per cent,
whereas the coefficients of variation of the estimated totals were of the order
of 1 per cent only (Sukhatme et al. (1984), section 11.19).
In crop surveys, whenever eye estimates are used, the results should be
calibrated by comparison with the physical measurements of a sub-sample:
this assumes a high positive correlation between the two sets of figures
(Yates, section 4.25, examples 6.12b and 7.15b, and sections 6.15 and 7.17).
Note: The net error is the resultant of positive and negative errors, obtained on
a one-to-one matching of the original survey, e.g. a census, and the re-survey; and
the gross error is the absolute total of both positive and negative errors, ignoring
their sign. The net error can be considered to be a measure of the non-sampling
bias in the original survey and the gross error as a measure of the non-sampling
variation, i.e., the response variance (U.S. Bureau of the Census, 1963a).
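The two measures in the note can be computed directly from matched records. This is a minimal sketch with hypothetical names and data, assuming each unit has one value from the original survey and one from the re-survey.

```python
def net_and_gross_error(original_values, resurvey_values):
    """Return (net error, gross error) from one-to-one matched values."""
    if len(original_values) != len(resurvey_values):
        raise ValueError("the two lists must be matched one-to-one")
    differences = [o - r for o, r in zip(original_values, resurvey_values)]
    net = sum(differences)                    # signed errors offset each other
    gross = sum(abs(d) for d in differences)  # signs ignored
    return net, gross


# Hypothetical household sizes reported in a census and in the re-survey.
census = [4, 5, 3, 6, 2]
resurvey = [5, 5, 2, 6, 3]
net, gross = net_and_gross_error(census, resurvey)
print("net error:", net, " gross error:", gross)
```

A small net error alongside a large gross error indicates errors that largely cancel in the total but still inflate the response variance.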
and research tool (see Hansen, Hurwitz, and Bershad, 1961, and Hansen,
Hurwitz and Pritzker, 1964). Tentative consideration given to the possibility
of adjustment of the results of the original survey on the basis of
the re-interview caused some doubt that it would be worthwhile, because,
on the whole, the differences were not large. It could not be done currently,
month by month, with the present sampling design, because the
re-interview sample was too small. The introduction of the re-interview
results by a double sampling procedure would, on the other hand, call for a
considerable change in sampling design. None of these seemed to have sufficient
merit to be justified but would have caused additional work and cost
(Personal communication in 1961 from Morris H. Hansen, then Assistant
Director for Research and Development, U.S. Bureau of the Census).
In the Indian National Sample Survey also, although there is on an average
one supervisor to four enumerators, supervision is not arranged on a
probability basis, primarily because there is reason to believe that the quality
of work of the primary enumerators in the sub-sample for supervision
would be different from that in the rest of the sample, because in actual
practice, it is extremely difficult, if not impossible, to keep the enumerators
ignorant of the supervisor’s sub-sample: the scheme itself then might encourage
negligence in the sample falling outside the supervisor’s sub-sample.
Moreover, the supervisor himself might be interested in demonstrating that
the enumerators working under him are doing a good job, otherwise his
competence as a supervisor might be called into question (Personal observations
made to the author in 1962 by D.B. Lahiri, then Advisor to the
National Sample Survey Department, Indian Statistical Institute).
Note: As mentioned earlier, this technique can be used to test the differential
bias of the primary enumerators versus that of the supervisors, or, in a health
survey, the lay enumerators versus medical or para-medical staff, or to test the
differences of estimates obtained by different types of schedules (one with direct
questions and another with fully detailed probes), or by different methods of
inquiry (retrospective versus periodic observations), and these different sources of
variation could be taken up in combination in different sub-samples. As it is very
unlikely that there will be a constant bias running through all these combinations,
one could meet one of the commonest objections to the use of this technique
(Mahalanobis, 1958, Preface; Som, 1965, Appendix 2).
From the sample data, analysis of variance could be computed for the
variations due to “between enumerators (sub-samples)” and “within enumerators”
with respective mean squares
    $s_b^2 = m \sum_{i=1}^{k} (\bar{y}_i - \bar{y})^2 / (k - 1)$
    $s_w^2 = \sum_{i=1}^{k} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2 / k(m - 1)$
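The two mean squares can be computed as in a one-way analysis of variance. The sketch below assumes a balanced layout with $k$ enumerators (sub-samples) and $m$ units each; the function name and the data are hypothetical.

```python
def enumerator_mean_squares(groups):
    """Return (between, within) mean squares for k equal-sized enumerator groups."""
    k = len(groups)
    if k < 2:
        raise ValueError("at least two enumerators are needed")
    m = len(groups[0])
    if m < 2 or any(len(g) != m for g in groups):
        raise ValueError("each enumerator needs the same number (>= 2) of units")
    group_means = [sum(g) / m for g in groups]
    grand_mean = sum(group_means) / k
    between = m * sum((gm - grand_mean) ** 2 for gm in group_means) / (k - 1)
    within = sum(
        (y - gm) ** 2 for g, gm in zip(groups, group_means) for y in g
    ) / (k * (m - 1))
    return between, within


# Hypothetical values collected by three enumerators (illustrative only).
data = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
s_b2, s_w2 = enumerator_mean_squares(data)
print("between:", s_b2, " within:", s_w2, " ratio:", s_b2 / s_w2)
```

A ratio well above 1 suggests that differences between enumerators contribute to the variation over and above what the within-enumerator variation would explain.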
Krotki, 1966 and 1978). For the experiences of “Dual Record System”,
operated by the Laboratories for Population Statistics, University of North
Carolina at Chapel Hill, see Adlaka et al. (1977) and Myers (1976).
Assuming that the under-reporting in the survey and under-registration
operate independently, the adjustment (multiplying) factor for the total
number of events reported by either or both the agencies is $1/(1 - p_1 p_2)$,
where $p_1$ is the probability of an event not being reported in the survey
and $p_2$ that of an event not being registered, estimated by
    $p_1 = y_{\bar{s}r} / (y_{sr} + y_{\bar{s}r})$
    $p_2 = y_{s\bar{r}} / (y_{sr} + y_{s\bar{r}})$
the subscripts $s$ and $\bar{s}$ denoting whether the event was reported in the
survey or not; and similarly for $r$ and $\bar{r}$. For the total number of events,
the estimator (“the Chandra Sekar - Deming formula”) is
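A minimal sketch of the dual-record adjustment follows, assuming independence of the two sources so that the chance of an event being missed by both is the product of the two omission probabilities. The function name, argument names and counts are hypothetical.

```python
def chandra_sekar_deming(both, survey_only, registration_only):
    """Return (p1, p2, estimated total events) from matched dual-record counts.

    both              -- events found in the survey and in registration
    survey_only       -- events found in the survey but not registered
    registration_only -- events registered but missed by the survey
    """
    if both <= 0:
        raise ValueError("the matched count must be positive")
    p1 = registration_only / (both + registration_only)  # missed by the survey
    p2 = survey_only / (both + survey_only)               # missed by registration
    reported = both + survey_only + registration_only
    estimate = reported / (1 - p1 * p2)
    return p1, p2, estimate


# Hypothetical counts (illustrative only).
p1, p2, total = chandra_sekar_deming(both=80, survey_only=10, registration_only=20)
print(f"p1={p1:.3f}  p2={p2:.3f}  estimated total={total:.1f}")
```

Algebraically the same estimate is obtained as (events in the survey) × (events registered) / (events in both), which here is 90 × 100 / 80.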
Response errors can be controlled by the proper selection, training, and supervision
of enumerators and the control of enumeration. From expression
(25.14) it might appear that, since an increase in the number of enumerators
would decrease the intra-enumerator correlation, the contribution to
the total variance due to the variability between enumerators would also
decrease. But when a very large number of enumerators have to be employed,
one has to accept a lower level of staff, training and supervision,
resulting in a change in survey conditions that would increase the response
errors.
For a given total size of a sample, the sampling error can be controlled and
evaluated in a suitably designed survey with considerations of optimization
of stratification, allocation of the total sample size into different strata and
stages, probabilities of selection, etc., and by using appropriate formulae
for estimation, relating all these to the important variables to be studied.
The sampling error can be reduced by increasing the sample, but doing so may
introduce additional non-sampling errors in the estimates unless the survey
conditions remain the same. Some requirements of the reduction of
sampling errors may, however, come in conflict with those of the response
errors, e.g. while sampling consideration might call for a widespread sampling
with little clustering, a concern for the response errors might lead the
survey designer to confine the sample to a sample of large clusters in which
between errors and the constant bias will remain the same; i.e., and
might well increase with sample size. In large samples, the mean square
error will be dominated by these terms and the ordinary sampling variance
will be a poor guide to the accuracy of the results, unless the non-sampling
errors are controlled substantially.
To try to reduce sampling errors while a bias several times as large
is allowed to creep in is not only pointless but also a waste of resources
(Sukhatme and Sukhatme (1960), p. 543).
Reference has been made in section 1.3 to the study made at the U.S.
Bureau of the Census, which showed that for many of the more difficult
items in a census, such as occupation, industry, work status, income and
education, the enumerator variability is approximately the same as the sampling
variance for a 25 per cent sample of households; the census results
were further seen to be subject to a bias which varied from one item to another
and was 6 per cent on average. The census and the sample will have
approximately the same bias if the census enumerators collect the data for
the sample of 25 per cent of households as part of the regular census. Under
these assumptions, the root mean square errors that can be expected for
the complete enumeration, and the sample for various percentages of units
possessing a certain attribute, are given in Table 25.2 for areas with different
population sizes. While the root mean square errors for the complete
enumeration and the sample are appreciably different in areas with small
populations, they converge with increasing population size and become almost
identical for areas with 50,000 persons. Thus, when the major census
results are published for areas with 50,000 or more persons, it is more advantageous
to take a sample for these particular items. This procedure has
been followed since the 1960 census of population in the United States (U.S.
Bureau of the Census, 1960).
We assume that the universe of $N$ units can be divided into two strata, the
first consisting of $N_1$ units for which measurements would be obtained and
the second of $N_2$ for which no measurement would be obtained either in a
census or a sample. Let $\bar{Y}_1$ and $\bar{Y}_2$ denote the two stratum-means. When
the field work is completed for an srs of $n$ units, consisting of $n_1$ units from
the first stratum and $n_2$ units from the second stratum, we would have
measurement only from the $n_1$ sample units, giving the sample mean
Assuming that $n_1$ and $n_2$ are random samples from the two strata, the
(i) Selecting random substitutes from the responding units. When the
proportion of non-response is low and the difference between the two
stratum-means is believed not to be substantial, a simple practical method
is to replace the non-responding units by a random sub-sample of the responding
units. This is relevant particularly in self-weighting designs in
order to keep the multiplier a constant.
whence
    $n = n'[1 + (k - 1)(1 - P)]$    (25.16)
Note that although we have assumed that all the $n_2$ units respond at the
second attempt, they may not in fact do so. The process of call-backs could
be continued until all but a very insignificant proportion of the units respond:
the Hansen-Hurwitz method has been extended by El Badry (1956).
Successive calls may help to diminish the bias and a graph (Clausen and
Ford, 1947) or a regression curve fitted to the estimates and the cumulated
rates of responses at successive attempts when extrapolated to the hypothetical
case of 100 per cent response could give a better approximation
to the true value (Hendricks, 1949; 1956, Chapter XI). Some results of the
inquiry of fruit growers in North Carolina, U.S., are given in Table 25.3;
the initial response gave a very high average number of fruit trees per farm
(456) in North Carolina because a farmer’s interest in a fruit survey could
be expected to be positively associated with his scale of operation, but a
regression curve of the form $y = ax^b$, where $y$ represents the total number
of trees picked by a sample when the (cumulative) percentage return is $x$
(and $a$ and $b$ are estimated from the data), when extrapolated to $x = 100$
per cent, gave an estimate of 344 trees per farm, as against the true value
of 329, which was known (Hendricks, 1956).
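A curve of the form $y = ax^b$ can be fitted by ordinary least squares on the logarithms, since $\log y = \log a + b \log x$. The sketch below shows one way of fitting and extrapolating such a curve; it is not a reproduction of Hendricks's computation, and the function names and data points are hypothetical (the figures of Table 25.3 are not used).

```python
import math


def fit_power_curve(x_values, y_values):
    """Least-squares fit of log y = log a + b log x; returns (a, b)."""
    if len(x_values) != len(y_values) or len(x_values) < 2:
        raise ValueError("need at least two matched (x, y) points")
    if any(v <= 0 for v in list(x_values) + list(y_values)):
        raise ValueError("x and y must be positive to take logarithms")
    log_x = [math.log(v) for v in x_values]
    log_y = [math.log(v) for v in y_values]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, x):
    """Value of the fitted curve at x."""
    return a * x ** b


# Hypothetical cumulative response rates (per cent) and cumulative totals.
rates = [10, 20, 40]
totals = [2 * r ** 1.5 for r in rates]
a, b = fit_power_curve(rates, totals)
print(f"a={a:.3f}  b={b:.3f}  value at 100 per cent={predict(a, b, 100):.1f}")
```

Extrapolating well beyond the observed response rates is sensitive to the assumed form of the curve, so the result is best treated as an approximation, as the text indicates.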
In the 1946 Family Census of the U.K., the initial response gave a
very high birth rate due to the fact that the majority of the initial
non-respondents were women with few or no children: of the 230,000 initial
non-respondents (who constituted 17 per cent of the total sample), 50,000
responded to the follow-up appeal and the replies of the first 12,000 of them,
when combined with the remainder of the sample with a weight of 230/12,
gave an adjusted birth rate which corresponded to that already known from
other sources (Glass and Grebenik, 1954; cited by Yates, section 5.22, and
by Moser & Kalton, section 7.4).
    $\dfrac{\sum_{r=0} 6 n_r \bar{y}_r / (r + 1)}{\sum_{r=0} 6 n_r / (r + 1)}$
Further reading
Two books - Nonsampling Error in Surveys by Lessler and Kalsbeek (1992) and
Measurement Errors in Surveys (edited by Biemer et al., 1991) - describe recent
developments on the topics.
Exercises
surviving infants, under one year of age, which would normally equal the
number of total population under one year of age at the time of the survey.
The latter number is given as 790, higher than the total births during
the year. The normal relationship may, however, be upset in the unusual
situation of a very heavy immigration of infants to the area selected for the
survey, who are enumerated in the survey but whose births are not covered
in the survey owing to adopted definitions (Som, 1973, Exercise 1.10.4).
CHAPTER 26
Planning, Execution
and Analysis of Surveys
26.1 Introduction
W. Edwards Deming (1969, page 666) has stated the importance of the
preparation of survey procedures in the following forceful words:
460 Chapter 26
and forced labor in some countries, and, second, to explain the reasons for
the various questions to be asked. The publicity media may include films,
articles for the press, booklets, talks for radio, television and general press
release.
On the other hand, in some surveys, such as the Social Survey in the
U.K., publicity was never used for it was believed better to take the respondents
by surprise: if they had had time to think out the matter they may
have decided to refuse or to present some socially acceptable response. But
faced with an actual human appeal from an interviewer on the doorstep, it
may be more difficult to take this decision.
Table 26.1: Budget for demographic sample survey, West Cameroon, 1965:
Sample of 170,000 with single-round team-enumeration
Expatriate personnel
(b) In France
Preparation of mission:
1 Survey Director for 15 days at Fr.F. 2,400 per month        60
1 Field Organizer for 15 days at Fr.F. 1,600 per month        40
Social Security contributions 40%                             40
Planning, Execution and Analysis of Surveys 463
2 Local Personnel
5 Miscellaneous:
1 CFA Franc = French Franc 0.02 (or US $0.004 or UK £0.0017), at the then
rates of exchange.
Source: U.N. Economic Commission for Africa (1971).
The final responsibility for survey planning should rest with the executive
agency. The permanence of a survey organization, within the framework
of the statistical system of a country, has definite advantages in furthering
the development of specialized and experienced personnel who could
undertake the continuing research needed for evolving increasingly efficient
survey designs, and also for improving the quality of survey data by assessing
and controlling non-sampling errors and biases. A permanent survey
organization can also undertake urgent ad hoc surveys at short notice.
In surveys using areal sampling units at some (or all) stages, the detailed
subdivision of the universe into identifiable areal units is one of the basic
important operations in a survey. Censuses of population, housing and
At the very early stages of survey planning, the tabulation program should
be drawn up so that the procedures and costs involved may be investigated
thoroughly and the questionnaire tested to indicate whether it is possible
to gather the required information.
2. by mail inquiry;
3. by telephone;
4. from registration;
5. from transcription of records.
of focus groups”).
The objectives of the survey will determine the choice of sample design and
the type of survey. The following types of surveys have been distinguished
by the Sub-Commission on Statistical Sampling of the United Nations Statistical
Commission (U.N., 1947):
(iii) Continuing surveys. The most usual example of these surveys occurs
where a permanent sampling staff conducts a series of repetitive surveys
which frequently include questions on the same topics in order to
provide continuous series deemed of special importance to a country.
Questions on the continued topics can frequently be supplemented by
questions on other topics, depending upon the needs of the country.
(iv) Ad hoc survey. This is a survey without any plan for repetition.
Note: The following method of estimating the number of common units in two
lists is due to Deming and Glasser (1959):
Suppose there are two lists, one having M units and the other N units with
an unknown number D common to both. Samples of m and n units are selected
with srswor from the two lists, and d units are found to be common to the two
samples. Then an unbiased estimator of $D$ is $MNd/mn$, and its unbiased variance
estimator is
    $(MN/mn)[(MN/mn) - 1]\,d$
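The estimator in the note is straightforward to compute. The sketch below reads the variance estimator as $(MN/mn)[(MN/mn) - 1]d$; the function name and the list sizes are hypothetical.

```python
def common_units_estimate(M, N, m, n, d):
    """Estimate the number of units common to two lists, with a variance estimate.

    M, N -- sizes of the two lists
    m, n -- sizes of the simple random samples drawn from each list
    d    -- number of units found in both samples
    """
    if min(M, N, m, n) <= 0 or m > M or n > N:
        raise ValueError("sample sizes must be positive and not exceed list sizes")
    factor = (M * N) / (m * n)
    estimate = factor * d
    variance = factor * (factor - 1) * d
    return estimate, variance


# Hypothetical lists and samples (illustrative only).
estimate, variance = common_units_estimate(M=1000, N=800, m=100, n=80, d=4)
print("estimated common units:", estimate)
print("estimated standard error:", variance ** 0.5)
```

With small sampling fractions the factor $MN/mn$ is large, so a handful of matched units carries a wide margin of error; in this illustration the standard error is about half the estimate.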
nical staff, and, second, the supervisors in turn will go to their field offices
and train the enumerators under them. It would be beneficial to organize
refresher courses for both the supervisors and the enumerators, not only
for the field procedures but also on the objectives and methodology of the
survey. The training method would normally comprise classroom teaching,
provision of manuals, demonstration enumeration, field work under supervision,
scrutiny of filled-in schedules, and oral and written tests. The manual
for the enumerators may contain the following topics: the survey objectives
and program; preliminary work in the sample areas; how and whom
to interview; detailed instructions on the questions, including concepts,
definitions and examples; checking completed questionnaires; preparation of
summary results; and return of documents and equipment.
26.16 Supervision
Accurate field work can only result from thorough training of efficient
enumerators. Nevertheless, adequate supervision must be an integral part of
the field work.
The ratio of supervisors to enumerators is 1:6 in the U.S. Current Population
Survey and 1:4 in the Indian National Sample Survey; in the 1969
National Survey of Family Income and Expenditure in Japan it was 1:4. A
supervisor should undertake some field work, either independently or as a
check on the work of the enumerators, and this information can be used to
adjust the data collected by the enumerators.
It is preferable to have a copy of the filled-in forms made before they are
passed to the central offices for tabulation and analysis. The enumerators
should themselves carry out simple numerical calculations and consistency
checks of the data. When a copy of the enumerators’ filled-in schedules is
retained in the supervisor’s office, a percentage of both the original and the
copy should be scrutinized for copying and other mistakes; and, whenever
possible, gross mistakes should be referred back to the enumerators.
Any correction made by the supervisor on the enumerators’ schedules
should not be done by erasing the enumerator’s entry but should be made
in such a way that both can be used for future checks and analysis.
The enumerators should also be required to keep records of the times
spent on the different aspects of the field work such as journey, identification
and contact of sampling units, listing, enumeration, etc., which may be
useful for designing subsequent inquiries.
or the sampling fraction should be small when they are selected without
replacement.
Checks should be introduced at every stage of processing. For data
entry, for example, some dummy entries might be made which would ensure
the checking of the punching and the verification work.
The computations required for a large-scale survey are made on a different
basis from those illustrated in this book. For each ultimate sample
unit, the multiplier or the weighting factor is computed and the product
of the multiplier and the value of the study variable for the sample unit is
obtained. When these are summed up for all the sample units by the machine,
the required estimate is obtained and a suitable procedure is adopted
for obtaining its standard error.
Suppose in a household inquiry (as in Example 21.1), the sample is
stratified two-stage, with villages, the first-stage units, selected with pps
and with replacement, and households, the second-stage units, selected at
random or systematically. For estimating a stratum total, the multiplier of
each household is given by
    (Reciprocal of the probability of selection of the sample village)
    / (Number of sample villages in the stratum)
    × (Total number of households in the sample village)
    / (Number of sample households in the sample village)
This multiplier will be computed for each sample household. To estimate
the total consumption of food in a stratum, for example, the food
consumption in a sample household will be multiplied by its multiplier, and
the sum of the products for each stratum would give the estimated total
food consumption in it. These stratum estimates when summed up for all
the strata will provide an unbiased estimate of the total food consumption
in the universe. When the ultimate units in the sample penultimate
units are selected with equal probability (as in this example), the multiplier
will be the same for the sample ultimate units in a particular selected
penultimate unit.
Of course, in a self-weighting design, for estimating the totals the multiplier
will be applied only once at the overall level, and it will not be required
for estimating the ratio of two totals and its variance estimator.
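The multiplier approach for the stratified two-stage example can be sketched as follows. The record layout, function names and figures are hypothetical; the sketch assumes villages selected with pps and with replacement and households selected with equal probability within each sample village.

```python
def household_multiplier(selection_probability, sample_villages,
                         total_households, sample_households):
    """Multiplier for every sample household in one sample village."""
    return ((1.0 / selection_probability) / sample_villages
            * (total_households / sample_households))


def estimate_stratum_total(records):
    """Sum of multiplier x value over all sample households in a stratum.

    Each record is (multiplier, value of the study variable).
    """
    return sum(multiplier * value for multiplier, value in records)


# Hypothetical stratum with two sample villages (illustrative only).
w1 = household_multiplier(0.05, sample_villages=2, total_households=120,
                          sample_households=12)
w2 = household_multiplier(0.10, sample_villages=2, total_households=200,
                          sample_households=10)
records = [(w1, 30.0), (w1, 42.0), (w2, 25.0), (w2, 35.0)]
print("multipliers:", w1, w2)
print("estimated stratum total:", estimate_stratum_total(records))
```

Keeping the multiplier on each household record means that any total, and the numerator and denominator of any ratio, can be produced by the same weighted sum.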
The advent of electronic computers in recent years has been in many
ways a great help in obtaining the required estimates speedily and accurately:
for a sample survey, standard errors of a host of estimates can be
computed with their aid. These can also be programmed to tasks of editing
with consequent improvement of consistency of the primary data and
possibly of its accuracy. They make possible some forms of estimation on
a scale which would have been impractical with manual and mechanical
tabulations. For example, in some cases the use of the regression method
of estimation on previous surveys of a series with overlapping samples can
substantially improve the efficiency of the estimators. The use of computers
makes possible changes in the allocation of resources between the
collection of data and processing. Computers can also be used in analysis
and studies for interrelations of various factors by multivariate analysis
(United Nations, 1964b), for example. “Off-the-shelf” software packages
and programs, such as the Integrated Microcomputer Processing Systems
(IMPS) of the U.S. Bureau of the Census, are now available for use in personal
computers for obtaining not only estimates of totals, means, ratios,
etc. of study variables but also their standard errors (more fully discussed
in the next chapter, “The Use of Computers in Survey Sampling”).
However, an efficient use of computers would be predicated upon the
availability of trained programmers, especially the local staff, continuous
service facilities, and intensive utilization of computer time.
A record of time, kept on the different processing operations, would help
to improve future operations.
6. numerical results;
7. main findings;
10. assessment - the extent to which the purposes of the survey were
fulfilled;
Exercise
27.1 Introduction
Computers have now become an indispensable tool for processing all types
of data in public and private sector activities, such as censuses and surveys,
financial transactions, telephone listing, payroll preparation, educational
testing, and space programs.
In a typical census, there is a fairly large amount of data input, but the
data structure and statistical tabulations are relatively simple, comprising
first-level tabulations and second- and third-level cross-tabulations; in contrast,
sample surveys require far less data input (5 per cent or less of
that of a corresponding census), but the statistical operations, such as the
estimation of universe totals and their variances, are complex.
Many countries with populations up to 10 million (such as Honduras
with 5.1 million and Niger with 7.7 million around 1990) have been increasingly
using powerful Personal Computers (PCs), with shared (local or
remote) area network systems to process their entire census data (Toro and
Chamberlain, 1989). For example, of the 25 African countries that had undertaken
the processing of their 1985-91 population censuses, 16 had used
only microcomputers, 9 only minicomputers and 3 a combination of both
(Dekker, 1991).
And, almost invariably, sample survey data are now being processed
on PCs. For example, the data of 48 Demographic and Health Surveys,
conducted in different countries of the world during 1985 through 1992,
were all processed on PCs, using an integrated software package especially
developed for these surveys and the first sets of tabulations were completed
within a fortnight of the receipt of data (see Appendix VI, section A6.3.3,
and Croft, 1992).
486 Chapter 27
The general trend in the developed world has been toward an increasing
use of PCs and, for the field, powerful portables, with facilities for wireless
or telephone communication. Perhaps countries in the third world could go
directly to using PCs (desktops) in the office with a Local Area Network
(LAN) and portables in the field without going through the Mainframe →
Minicomputer → Desktop → Portable route. Given resources and access
to electricity for recharging batteries, computer-assisted personal interviewing
(CAPI) could be adopted in these countries right now.
There is no escaping the computer revolution. Like Alice in Through
the Looking Glass, if you just run, you will remain in the same place: to
advance, you must run faster.
It has been argued that the experiences of developed countries may not
be appropriate to the developing countries for several reasons: significant
differences in the computerization and introduction of microcomputers; the
need for the developing countries to have stand-alone PCs, as compared to
linking micros to existing mainframe systems; and PC use in developing
countries not having immediate relevance to developed country PC applications,
such as in finance and law enforcement in the government sector
in the U.S.
For the developing countries, there are two important lessons. First, in
developing computers and their software packages, initial emphasis should
fall on computer engineering and program development. Later on, emphasis
should shift to end-users, both present and potential. Otherwise, computers
will continue to be considered esoteric. The time-frame of the history of
computers from the first electronic computing machine, the ENIAC (the
Electronic Numerical Integrator and Computer) in the mid-1940s and
the early mainframes (that needed highly skilled mathematicians to be
programmers) to the current crop of PCs that are being used by almost
everybody (most of whom are not programmers) has to be telescoped in
the developing countries. Second, experiences of the developed countries
will have to be internalized in the developing countries.
The speed and effectiveness of the transfer of microcomputer technology,
or for that matter of any technology, depend on the receptivity of the
population - receptivity that is embedded in the social and cultural milieu.
In cultures where hands-on machine use is still considered infra dig., the
PCs provided could become mere status symbols and office adornments.
The provision of equipment does not guarantee its use, as some potential
users could continue to be passive toward the technology (Sanwal, 1988).
For selecting the sample, sampling frames are required. Such frames are
normally available at the central sample survey office, at least up to the
first few stages (in a multi-stage survey) in the form of a data base, from
which a computer program can be written for selecting the (first few stages’)
sample units. For example, a master sample may contain information on
small areal units (either census blocks or census enumeration areas) in the
form of a data base. If in a demographic survey, all the extant households
in the selected small areal units are to be enumerated, then these units
become the ultimate-stage sample units, that is, no further sampling need
be done in the field. Households and individuals then become observation
or recording units. If, however, a sample of households is to be selected
from the selected small areal units, sampling at the ultimate stage must be
carried out by the enumerators in the field.
For selecting the sample, the sampling frames are usually put in a data
base format, such as dBase III or IV, Fox Pro or Paradox, and the sample
drawn following the sample design.
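As an illustration of drawing first-stage units from a computerized frame, the sketch below selects units with probability proportional to size and with replacement, using cumulative size totals. The frame layout (a list of `(unit_id, size)` pairs with positive integer sizes) and the function name `select_pps_wr` are assumptions made for this example; an actual frame would be read from the survey office's database.

```python
import bisect
import itertools
import random


def select_pps_wr(frame, n, rng):
    """Draw n units with pps and with replacement.

    frame : list of (unit_id, size) pairs, sizes being positive integers
    rng   : a random.Random instance (seeded for a reproducible draw)
    Returns a list of (unit_id, selection_probability) pairs, one per draw.
    """
    sizes = [size for _, size in frame]
    cumulative = list(itertools.accumulate(sizes))
    total = cumulative[-1]
    sample = []
    for _ in range(n):
        r = rng.randint(1, total)                 # random number in 1..total
        k = bisect.bisect_left(cumulative, r)     # first unit whose cumulative size >= r
        sample.append((frame[k][0], sizes[k] / total))
    return sample


# Hypothetical frame of three census enumeration areas with household counts.
frame = [("EA-01", 10), ("EA-02", 30), ("EA-03", 60)]
sample = select_pps_wr(frame, 5, random.Random(42))
```

The selection probability returned with each unit is what a later estimation step would need for computing multipliers.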
The availability of computer power right at the point where data are entered
now makes real-time error checking possible. The data entry software
packages can be programmed to refuse to accept an entry or to flash a warning
should the operator try to register a wrong code or if variables appear
to be inconsistent with each other; also, fields that are known to remain
unchanged from one record to the next may be duplicated automatically
(U.N. ESCAP, 1988).
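The three behaviours just described (refusing a wrong code, warning on an inconsistency, and duplicating an unchanged field) might look roughly as follows. The field names, the code list and the age threshold are hypothetical and chosen only for illustration; they are not taken from any of the packages named in this chapter.

```python
VALID_SEX_CODES = {1, 2}     # hypothetical code list
MIN_AGE_MARRIED = 12         # hypothetical consistency threshold


def check_entry(record, previous=None):
    """Return (record, errors, warnings) for one keyed record."""
    record = dict(record)
    errors, warnings = [], []

    # Duplicate a field that normally stays unchanged between records.
    if record.get("village") is None and previous is not None:
        record["village"] = previous["village"]

    # Refuse a wrong code.
    if record.get("sex") not in VALID_SEX_CODES:
        errors.append("sex: invalid code")

    # Warn when two variables look inconsistent with each other.
    if record.get("married") and record.get("age", 0) < MIN_AGE_MARRIED:
        warnings.append("age/married: inconsistent")

    return record, errors, warnings


checked, errors, warnings = check_entry(
    {"village": None, "sex": 3, "age": 5, "married": True},
    previous={"village": "V01"},
)
```

A data entry screen would typically reject the record while `errors` is non-empty and ask the operator to confirm when only `warnings` are present.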
Among the data processing operations of a census or a survey, data
entry is the most constricting bottleneck such that for a population census
it can take a year and often more. Data entry software packages have been
designed to relieve this constriction.
The Use of Personal Computers in Survey Sampling 489
Two data entry software packages that are in most common use in censuses
and surveys in developing countries are CENTRY, developed by the
U.S. Bureau of the Census, and PC-EDIT, developed by the U.N.; both
include editing, verification, data modification, and statistics on operator
performance. CONCOR, a software package also developed by the U.S.
Census Bureau for data scrutiny and editing, can be run during data entry
with CENTRY.
Frankel, et al., 1992; Olsen, 1992; Speizer and Dougherty, 1991). CAPI
has thus become a common feature in many surveys in countries such as
the U.S. [for example, to collect supplementary data, in the U.S. National
Health Interview Survey (Rice et al., 1989) and in a survey to assess nutritional
needs of the population (Rotschild and Wilson, 1989)], Canada,
the Netherlands (in the labor force survey, van Bastelaar et al., 1988), Sweden,
the U.K. and Australia. CATI is being used, among others, in the
U.S. National Crime Victimization Survey and has been adopted by literally
hundreds of survey organizations as the preferred method of data
collection (Lyberg and Kasprzyk, 1991).
An experiment was conducted in Guatemala for the Demographic and
Health Survey there (see Appendix VI, section A6.3.3) by interviewing 300
women once by PAPI (paper-and-pencil interviewing) and then by CAPI; it
showed a saving of about 25 per cent in interviewing time with CAPI
(Cantor and Rojas, 1992).
There are three major data tabulation software packages in use in many
developing countries - CENTS and QUICKTAB, developed by the U.S.
Bureau of the Census, and XTable, developed by the U.N. Statistical Division.
27.7.1 CENTS
CENTS has been used successfully in many censuses, surveys and other activities
requiring the production of statistical tables. It can produce tables
in virtually any format. The user “draws” the tables, then writes instructions
on the relationship between the data and table cells. Once tables are
defined, they can be produced automatically for up to five geographical
levels. CENTS is available for PCs only.
27.7.2 QUICKTAB
QUICKTAB is a menu-driven tabulation package for frequency distributions
and cross-tabulations. It produces cross-tabulations of up to three
dimensions and processes at a speed of 80,000 records per minute on an
average PC.
Tables produced by QUICKTAB are in a pre-specified format. That is,
QUICKTAB, unlike CENTS, does not allow for the flexibility of “drawing”
tables in any format. QUICKTAB is simple to use and can be learned in a
few minutes.
27.7.3 XTable
XTable is an easy-to-use package that produces frequencies and summary
tabulations of censuses, surveys, vital and civil registration records, population
statistical databases, or any other administrative data.
With the census data and tables being put in databases, data dissemination,
particularly between government agencies, has been facilitated by putting
the databases on-line. In Japan, the Statistical Bureau and the Statistics
Center are jointly operating a statistical database called SISMAC
(Statistical Information System of Management and Coordinating
Agency; Ide and Kawasaki, 1991). In Latin America, on-line access to
census data by external users is not widespread except in Brazil; for the region
as a whole, micro data of the population censuses will be disseminated
using magnetic tapes and CD-ROM diskettes (Ellis, 1991).
Under integrated systems, we shall classify those software packages that are
designed for the specific purpose of covering all, or the major, data processing
requirements of a survey to avoid the interfaces between separate
packages that can do only part of the job. Integrated systems are especially
useful for relatively unskilled users working on PCs. We shall review
census offices and outside (e.g. businesses, press, universities, and research
institutions), such that data users (even those with no previous computer
experience) could examine all tables and accompanying text stored in a
computer readable format - whether published or unpublished - Carlos Ellis
of the U.N. Statistical Division developed a menu-driven software package
that makes it easy to select, retrieve, display and print statistical tables
from either the CENTS tabulation component of IMPS or from text files.
The Bureau of the Census elaborated the TRS system to form an integral
part of IMPS.
Operational Control. CENTRACK (CENsus TRACKing) is a management
and control package to help census managers monitor, control, and
track the various operations necessary between receipt of questionnaires
from the field and data entry.
Variance Calculation. CENVAR (Variance Calculation System) is
a statistical package for analyzing data from stratified, multi-stage sample
surveys. It provides estimates of totals, means, ratios, etc., and also
estimates of their sampling variances. Sub-population estimates can be
obtained by specifying classification variables, which can be crossed with
each other. In addition, CENVAR output presents for each estimate the
coefficient of variation, the number of observations and the “design effect.”
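For orientation, the two summary measures named above can be computed from an estimate and its variance as in the sketch below. It uses the usual textbook definitions (coefficient of variation as standard error divided by the estimate, and design effect as the ratio of the variance under the actual design to the variance under simple random sampling of the same size). It is not a description of how CENVAR computes them internally, and the function name and figures are hypothetical.

```python
from math import sqrt


def precision_summary(estimate, design_variance, srs_variance):
    """Standard error, coefficient of variation and design effect of an estimate."""
    se = sqrt(design_variance)
    return {
        "standard_error": se,
        "cv": se / estimate,                      # relative standard error
        "deff": design_variance / srs_variance,   # design variance relative to srs
    }


# Hypothetical figures for one estimate.
summary = precision_summary(estimate=500.0, design_variance=400.0, srs_variance=250.0)
```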
For demographic analysis, the Center for International Research of the
U.S. Bureau of the Census has developed a software package called “Population
Analysis Spreadsheets” for analyzing demographic data and
preparing projections. The accompanying manual, “Population Analysis
Using Microcomputers,” presents many useful and accepted methods of
demographic analysis.
IMPS and its modules are available free of charge to eligible, non-commercial
users; these can also be downloaded from the Internet. For information,
contact International Systems Team-CASIC, U.S. Bureau of the
Census, Department of Commerce, Washington, D.C. 20233-3102, U.S.A.;
phone: 301-457-1453; fax: 301-457-3033; Telex: 62761615; Internet:
[email protected].
Automated coding
Interactive corrections
Batch editing
Imputation
Weighting
Tabulation
Management:
Survey/census design
Authoring
Payroll
Management information system
Some of the processes are still being evolved and many of them will not
be applicable to most developed countries at their present level of technology
and infrastructure. CAPI, CATI and CADE - three components of
CASIC - have been dealt with earlier in section 27.6. Among the newer
technologies that are emerging, automated voice recognition, paperless fax
image reporting, optical character recognition, and touch-tone data entry
are briefly described below.
Automated voice recognition. A study is underway for possible
use in the year 2000 census project with the goal of designing and testing
the feasibility of a spoken language system, over the telephone, which will
interact with a caller to elicit specific information about the caller (Cole
and Novick, 1993).
Paperless fax image reporting system and optical character
recognition. In the paperless fax image reporting system (PFIRS), using
optical character recognition (OCR), being evaluated currently at the U.S.
Census Bureau for possible use in its data collection and processing, the
process is: respondents fax questionnaires to the Bureau → the Census
Bureau fax-server receives digital image (not paper copy) and the system
automatically identifies the questionnaire → OCR software converts survey
phone (that are present in almost every household in the United States),
an interview is conducted from the central computer to the home computer
of the respondent; the respondent answers all the questions on the personal
computer and sends the results back to the central computer.
Although not formally a part of the CASIC system, an interesting development
for reducing costs and for obtaining better quality of data is the
use of an interactive multimedia kiosk.
Interactive multimedia kiosk. Among the variety of technologies assessed
by the U.S. Census Bureau in 1993 having the potential of reducing
both costs and differential under-count in the Year 2000 Census is the prototype
interactive multimedia kiosk that offers a menu, in both English and
Spanish, to reach the public with the Census’ message, answer questions,
vend information products, and collect questionnaire data. This technology,
housed in a publicly convenient kiosk (or through in-home access via a
computer and modem), is deployed or being considered in many U.S. federal
agencies, such as the U.S. Postal Service, the Internal Revenue Service,
the Veterans Administration, and the Social Security Administration.
For statistical analysis - for example for fitting regression and studying
correlation - commercially available software programs and packages, such
as BMDP, SAS, SPSS (SPSS/PC + for PCs), and SYSTAT, may be
used.
The census and survey tables can be stored in tables created with a spreadsheet
software package. Recent versions of the commercially available
spreadsheet software packages such as Excel, Lotus 1-2-3 and Quattro Pro
include considerable graphic facilities for preparing charts and graphs -
bar diagrams, pie charts, and graphs depicting, e.g., the fertility levels in
different socioeconomic strata of population - mostly without recourse to
a graphics software package.
Mapinfo (Mapinfo Corporation), and InfoShare for the city of New York
(by the Queens College of the City University of New York).
Atlas GIS (for Windows) can access data from dBase, ASCII, Lotus
1-2-3, Microsoft Excel, and industry-specific marketing information files,
from single U.S. city blocks up through the entire world on diskettes or
CD-ROM, street maps of the entire U.S. (based on TIGER, see later in
this section), Atlas demographic data from the 1990 Census, projected to
1998, business and consumer data from exclusive and syndicated sources,
and a library of available marketing information and competitive business
locations and data. Atlas GIS is produced by Strategic Mapping Corporate
Headquarters, 3135 Kifer Road, Santa Clara, California 95051; phone: 408-
970-9800; fax: 408-970-9999.
Mapinfo for Windows, version 3.0 includes satellite raster image
support to see natural terrain on-screen and chart new roads or other details
on old maps; remotely imported data can be displayed in thematic maps
or charts in a broad range of colors.
GIS is being used in a rapidly growing number of disciplines. In anthropology,
economics, history, sociology, and environmental and urban
planning, for example, GIS is providing new insights into geographically-linked
data and enabling the spatial display, analysis, manipulation and
querying of data at increasing levels of detail; in other applications,
three-dimensional conceptual maps, such as response- or cost-surface, are being
created with GIS.
A National Center for Geographic Information and Analysis
has been established by the U.S. National Science Foundation to carry out
fundamental research into GIS technology and issues central to its effective
use. It has covered topics such as utility management in urban areas and
earthquake modeling and data keeping.
Censuses of population, agriculture, industrial establishments and others
provide a veritable “demographic treasure-trove” for preparing digital
mapping for use in war and peace. In the 1991 Gulf War, GIS systems
of the American Defense Department helped guide soldiers and missiles to
targets (Dataquest, 1991).
The U.S. Census Bureau sells computerized city-block maps of America
for around $10,000. For the first time, the 1990 census maps are generated
from a computerized data base, the TIGER (Topologically
Integrated Geographic Encoding and Referencing) system.
GIS systems, combining information from censuses and other governmental
information, have great potential for developing countries in their
national development planning. A thematic mapping software package allows
the production, at specified geographical levels, of thematic maps, e.g.,
maps showing demographic levels, such as population density, fertility and
27.14.1 PopMap
PopMap is an integrated software package, developed by the U.N. Statistical
Division in collaboration with the Institute of Computer Science in
Viet Nam, which offers graphics and mapping capabilities with a spreadsheet
for collecting, storing, tabulating and mapping information. This
software package was prepared to support planning and administration of
population activities with important geographical or logistic context or to
facilitate geographic or graphic expression of population indicators or related
data. The system comes with data base modules for declaring the
structure and content of the geographical and statistical data base, a map
editor for entering map outlines, boundaries, borders, rivers, routes and
individual facility locations, and a system for selective retrieval of statistical
and facilities data, preparing maps, graphs and spreadsheet tables for
on-line study and analysis, and for printing and publication.
For further information, contact Project Coordinator, Computer Software
and Support for Population Activities Project, Statistical Division,
U.N., Room DC2-1570, New York, NY 10017, U.S.A.; phone: 212-963-4118;
fax: 212-963-4116.
Special purpose software packages, such as Aldus Page Maker, have expanded
microcomputer use to the business of publishing: they make it
possible to produce, right on the PCs, camera-ready reports for the photo-offset
process, thus bypassing the regular production process. Many reports,
studies, and books are now being produced, at reduced costs and
more speedily, using desktop publishing. This has been of particular benefit
to developing countries, where census and survey reports may languish
for years waiting their turn at the government printing offices.
Some word processing software packages, such as WordPerfect version
6.0 and Word for Windows, integrate page layout capabilities with
word processing, and are capable of producing all but the most design-intensive
projects (Lichty, 1994; Parker, 1994a and 1994b).
Further reading
In a rapidly evolving field such as that of microcomputers, any list of references
runs the risk of being out of date soon: that caveat should be sounded in consulting
studies prepared before the late 1980s.
For a review of statistical software programs and packages for processing
censuses and surveys, see U.N. ESCAP (1988). On the use of PCs in processing
census data, see Toro and Chamberlain (1988) and U.N. (1989 and 1990c) and on
the use of computers in processing household survey data, see U.N. (1982a). For
GIS, see Arbis and Coli (1991), and Maguire, Goodchild, and Rhind (1991); on
other topics, see the references cited in the text. For the role of microcomputers
in national development, particularly in India, see Sanwal (1988) and Som (1993).
APPENDIX I
A2.1 Introduction
In this appendix we present first the elements of probability and then the
theorems themselves required for proving some theorems in sampling.
As f < t,
506 Appendix II
Example
A six-faced die is thrown. What is the probability that (a) the number 4 turns
up? (b) an even number turns up?
The six possible cases are 1, 2, 3, 4, 5, 6, and these are exhaustive and
mutually exclusive; they may also be considered equally likely if the die is perfect
in shape and homogeneous, and is thrown such that no one face gets any particular
preference over others. That is, $t = 6$.
For (a), there is only one favorable case, namely, the figure 4, so that the
probability of 4 turning up is $f/t = 1/6$, and this holds for each of the other
figures.
For (b), the number of favorable cases is $f = 3$ (namely, the even numbers
2, 4, and 6), and therefore the probability that an even number will turn up is
$f/t = 3/6 = 1/2$.
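The counting in this example can be checked with a small Python sketch that uses exact fractions. The helper name `probability` is an assumption made for the illustration.

```python
from fractions import Fraction


def probability(is_favorable, cases):
    """Classical probability f/t over equally likely, mutually exclusive cases."""
    f = sum(1 for case in cases if is_favorable(case))
    t = len(cases)
    return Fraction(f, t)


faces = [1, 2, 3, 4, 5, 6]
p_four = probability(lambda face: face == 4, faces)      # case (a)
p_even = probability(lambda face: face % 2 == 0, faces)  # case (b)
```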
If in addition the events are exhaustive, i.e. at least one of the events must
occur, then
$$P(A_1 + A_2 + \cdots + A_n) = \sum_{i=1}^{n} P(A_i) = 1$$ (A2.4)
(b) Events not mutually exclusive. If the $n$ events $A_1, A_2, \ldots, A_n$ are
not necessarily mutually exclusive, then
Elements of Probability and Proofs 507
$$P(A_1 + A_2 + \cdots + A_n) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \cdots - (-1)^n P(A_1 A_2 \cdots A_n)$$ (A2.6)
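The rule for events that are not mutually exclusive can be verified numerically on the die. In the sketch below the three events and the function name are assumptions chosen for the illustration; the alternating sum is compared with a direct count of the union.

```python
from fractions import Fraction
from itertools import combinations


def union_probability(events, t):
    """P(A1 + ... + An) by the alternating sum over joint occurrences."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r - 1)
        for group in combinations(events, r):
            joint = set.intersection(*group)   # cases in which all events of the group occur
            total += sign * Fraction(len(joint), t)
    return total


# Hypothetical events on one throw of a perfect die (t = 6 equally likely cases).
even = {2, 4, 6}
above_three = {4, 5, 6}
multiple_of_three = {3, 6}
events = [even, above_three, multiple_of_three]

by_formula = union_probability(events, 6)
by_counting = Fraction(len(even | above_three | multiple_of_three), 6)
```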
where the summation is taken over all these possible values of Yi.
For example, the mathematical expectation of the number being thrown
up by the tossing of a perfect die is
$$1 \times \frac{1}{6} + 2 \times \frac{1}{6} + \cdots + 6 \times \frac{1}{6} = 3.5$$
$$E(Y) = \sum_{i=1}^{N} a P_i = a \sum_{i=1}^{N} P_i = a$$ (A2.10)
$$= \sum_{i=1}^{N} (Y_i - EY_i)^2 P_i = E(Y_i^2) - (EY_i)^2$$ (A2.16)
from the definition of expected values, given in equation (A2.9), $EY_i$ denoting
for brevity $E(Y_i)$; $Y_1, Y_2, \ldots, Y_N$ are all the possible values of $Y_i$ and
$P_1, P_2, \ldots, P_N$ are the probabilities associated with these possible values.
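As a quick check of the two forms of the variance, the sketch below computes the expectation and the variance of the number thrown up by a perfect die, first from the squared deviations and then as $E(Y^2) - (EY)^2$. The variable names are assumptions made for the illustration.

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]          # possible values Y_i
probs = [Fraction(1, 6)] * 6         # associated probabilities P_i

EY = sum(y * p for y, p in zip(values, probs))
EY2 = sum(y * y * p for y, p in zip(values, probs))

var_from_deviations = sum((y - EY) ** 2 * p for y, p in zip(values, probs))
var_from_moments = EY2 - EY ** 2
```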
When the variables are independent, the covariance terms are zero and
$$V(X_i + Y_i + Z_i + \cdots) = V(X_i) + V(Y_i) + V(Z_i) + \cdots$$ (A2.24)
Corollary
$$V(X_i - Y_j) = V(X_i) + V(Y_j) - 2\,\mathrm{Cov}(X_i, Y_j)$$ (A2.25)
$$= V(X_i) + V(Y_j)$$
when $X_i$ and $Y_j$ are independent (A2.26)
Let us denote by $y_i$ ($i = 1, 2, \ldots, n$) the value of the variable for the $i$th
sample unit (i.e. the unit selected at the $i$th draw). The sample mean is,
by definition,
$$\bar{y} = \frac{1}{n}(y_1 + y_2 + \cdots + y_n) = \frac{1}{n}\sum_{i=1}^{n} y_i$$ (A2.29)
We shall obtain the expectation and variance of the sample mean separately
for sampling with and without replacement.
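Hence
$$E\Big[\frac{1}{n}\sum_{i=1}^{n}(y_i - Ey_i)\Big]^2 = \frac{1}{n^2}\sum_{i=1}^{n} E(y_i - Ey_i)^2 + \frac{1}{n^2}\sum_{i \ne j} E[(y_i - Ey_i)(y_j - Ey_j)]$$
$$= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(y_i) + \frac{1}{n^2}\sum_{i \ne j}\mathrm{Cov}(y_i, y_j)$$ (A2.32)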
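Now
$$\mathrm{Var}(y_i) = E(y_i - \bar{Y})^2 = \sum_{i'=1}^{N} (Y_{i'} - \bar{Y})^2\, P(y_i = Y_{i'})$$
$$\mathrm{Cov}(y_i, y_j) = \frac{1}{N^2}\sum_{i,j}(Y_i - \bar{Y})(Y_j - \bar{Y}) = \frac{1}{N^2}\sum_{i}(Y_i - \bar{Y})\sum_{j}(Y_j - \bar{Y}) = 0$$ (A2.35)
for each $i, j$ ($i \ne j$), since $\sum^{N}(Y_i - \bar{Y}) = \sum^{N}(Y_j - \bar{Y}) = 0$, being the sum
of the deviations from the mean.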
From equation (A2.32), using equations (A2.33) and (A2.35), we obtain
$$\mathrm{Var}(\bar{y}) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$
Note: A shorter proof would be as follows.
since the $y_i$'s are independent (from equation A2.24). Using equation (A2.33)
the final result is obtained.
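The result for sampling with replacement, namely that the sample mean has expectation $\bar{Y}$ and variance $\sigma^2/n$, can be confirmed by enumerating every possible ordered sample from a small universe. The universe values below are hypothetical, and exact fractions are used so that the comparison is free of rounding.

```python
from fractions import Fraction
from itertools import product

universe = [1, 4, 7, 10]       # hypothetical universe values Y_1, ..., Y_N
N, n = len(universe), 2

Ybar = Fraction(sum(universe), N)
sigma2 = sum((y - Ybar) ** 2 for y in universe) / N

# All N**n ordered samples are equally likely in srs with replacement.
means = [Fraction(sum(s), n) for s in product(universe, repeat=n)]
E_mean = sum(means) / len(means)
var_mean = sum((m - E_mean) ** 2 for m in means) / len(means)
```

Here `E_mean` agrees with `Ybar` and `var_mean` with `sigma2 / n`.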
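$$\sigma_{\bar{y}}^2 = \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} = \frac{S^2}{n}(1 - f)$$ (A2.37)
where
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{N}{N-1}\,\sigma^2$$ (A2.38)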
Proof. As before
In sampling without replacement also, $y_i$ can take any one of the values
$Y_1, Y_2, \ldots, Y_N$ with the same probability $1/N$ (see note 1 below). Therefore
$E(\bar{y}) = \bar{Y}$.
To prove (A2.37), we proceed up to (A2.34) as for srs with replacement.
In srs without replacement, however,
$$P(y_i = Y_i,\ y_j = Y_j) = P(y_i = Y_i)\,P(y_j = Y_j \mid y_i = Y_i) = \frac{1}{N} \cdot \frac{1}{N-1} \quad \text{if } i \ne j$$
since $y_j$ can take any value excepting $Y_i$ (the value which is already known
to have been assumed by $y_i$) with probability $1/(N-1)$. So
$$\sum_{i \ne j}\mathrm{Cov}(y_i, y_j) = -\frac{n(n-1)\sigma^2}{N-1}$$
that a universe unit will be selected at any given draw is the same as that
on the first draw, namely $1/N$ (see note 1). The above proofs have been derived
from this characteristic and the theorems on probability. Alternative
proofs which are derived from the definition of simple random sampling, if
required, will be found in the advanced theoretical textbooks.
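The without-replacement results can be checked in the same way by listing every ordered sample of distinct units. The sketch below, with hypothetical universe values, confirms the variance of the sample mean in both of its forms and also the covariance between the values obtained at two different draws, which works out to $-\sigma^2/(N-1)$.

```python
from fractions import Fraction
from itertools import permutations

universe = [1, 4, 7, 10]       # hypothetical universe values
N, n = len(universe), 2
f = Fraction(n, N)             # sampling fraction

Ybar = Fraction(sum(universe), N)
sigma2 = sum((y - Ybar) ** 2 for y in universe) / N
S2 = sigma2 * N / (N - 1)

# Ordered samples of distinct units are equally likely in srs without replacement.
samples = list(permutations(universe, n))
means = [Fraction(sum(s), n) for s in samples]
E_mean = sum(means) / len(means)
var_mean = sum((m - E_mean) ** 2 for m in means) / len(means)

# Covariance between the values obtained at the first and second draws.
cov_12 = sum((s[0] - Ybar) * (s[1] - Ybar) for s in samples) / len(samples)
```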
Corollaries
N
(A2.40)
$$\mathrm{Var}(N\bar{y}) = N^2\,\frac{\sigma^2}{n}$$ (A2.41)
in srs with replacement.
$$\mathrm{Var}(N\bar{y}) = N^2\,\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$$ (A2.42)
in srs without replacement.
(A2.43)
to prove that
(A2.45)
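$$= \sum_{i=1}^{n}(y_i - \bar{Y})^2 + n(\bar{y} - \bar{Y})^2 - 2n(\bar{y} - \bar{Y})^2 = \sum_{i=1}^{n}(y_i - \bar{Y})^2 - n(\bar{y} - \bar{Y})^2$$
$$E\Big[\sum_{i=1}^{n}(y_i - \bar{Y})^2\Big] = \sum_{i=1}^{n} E(y_i - \bar{Y})^2 = \sum_{i=1}^{n}\mathrm{Var}(y_i) \text{ by definition} = n\sigma^2$$ (A2.46)
from equation (A2.33).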
In sampling with replacement, from equation (A2.31)
$$E[n(\bar{y} - \bar{Y})^2] = nE(\bar{y} - \bar{Y})^2 = n\,\mathrm{Var}(\bar{y}) = n\sigma^2/n = \sigma^2$$ (A2.47)
Therefore, in sampling with replacement, taking the expectation of both
sides of equation (A2.43) and using equations (A2.46) and (A2.47), we
obtain
$$E(s^2) = \frac{1}{n-1}(n\sigma^2 - \sigma^2) = \sigma^2$$
In sampling without replacement, from equation (A2.37)
$$E[n(\bar{y} - \bar{Y})^2] = nE(\bar{y} - \bar{Y})^2 = n\,\mathrm{Var}(\bar{y}) = n\,\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} = \sigma^2\,\frac{N-n}{N-1}$$ (A2.48)
Therefore, in sampling without replacement, taking the expectation of
both sides of equation (A2.43) and using equations (A2.46) and (A2.48),
we obtain
$$E(s^2) = \frac{1}{n-1}\Big(n\sigma^2 - \sigma^2\,\frac{N-n}{N-1}\Big) = \frac{\sigma^2}{n-1} \cdot \frac{N(n-1)}{N-1} = \frac{N\sigma^2}{N-1} = S^2$$
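Both expectations of $s^2$ can be verified by exact enumeration on a small universe. In the sketch below the universe values and the helper names `s2` and `expected_s2` are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import permutations, product


def s2(sample):
    """Sample variance with divisor n - 1."""
    n = len(sample)
    ybar = Fraction(sum(sample), n)
    return sum((y - ybar) ** 2 for y in sample) / (n - 1)


def expected_s2(samples):
    """Average of s2 over a collection of equally likely samples."""
    samples = list(samples)
    return sum(s2(s) for s in samples) / len(samples)


universe = [1, 4, 7, 10]       # hypothetical universe values
N, n = len(universe), 3
Ybar = Fraction(sum(universe), N)
sigma2 = sum((y - Ybar) ** 2 for y in universe) / N
S2 = sigma2 * N / (N - 1)

E_s2_wr = expected_s2(product(universe, repeat=n))    # with replacement
E_s2_wor = expected_s2(permutations(universe, n))     # without replacement
```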
(A2.49)
(1-/) (A2.50)
E Çn2 = Var(Aj/)
(A2.51)
The results follow immediately from equations (A2.49) and (A2.50) on multiplying
both sides by $N^2$.
where $n'$ is the number of sample units possessing the attribute, and
$n$ the total number of sample units; $y_i = 1$ if the $i$th sample unit
has the attribute and $y_i = 0$ otherwise.
Taking the expectation of both sides of equation (A2.54), we obtain
$$E(p) = E(\bar{y}) = \bar{Y} = P$$
from equation (2.67).
2. Sampling variance of p. We have seen in equation (2.68) that the
universe variance of $Y_i$ is
$$\sigma^2 = P(1 - P)$$ (A2.55)
An unbiased estimator of $\sigma^2$ has been defined in equation (2.73),
namely
n
(A2.56)
(A2.57)
in srswor (A2.58)
n N- 1
$$= \frac{p(1 - p)}{n - 1}\,(1 - f)$$ in srs without replacement (A2.60)
Proof. Since
s2 =
₽)
£(
* = E = ^~= in srswr
\n) n n
and
N—n
= A- 1
P(1 - P) N -n
= (T? in srswor
n N— 1
2<^1 ~ ^yx)
a? = -^2 (<7y + R(i)
(A2.61)
If we assume that $|e'| < 1$, i.e. $x$ lies between 0 and $2X$, which is likely
to happen when the sample size is large, the term $(1 + e')^{-1}$ in expression
(A2.62) may be expanded by Taylor’s theorem. With this assumption and
taking the expectation of both sides of (A2.62), we get
(ii) Bias of r. If we assume that terms involving third and higher powers
of $e$ and $e'$ are negligible, we have
$$E(r) - R = R\,(CV_x^2 - \rho\,CV_y\,CV_x) = \frac{1}{X^2}\,[R\,\mathrm{Var}(x) - \mathrm{Cov}(y, x)]$$ (A2.65)
where $\rho$ is the correlation coefficient between the two variables, and $CV_y$
and $CV_x$ are the coefficients of variation of $y$ and $x$ respectively.
To the order of approximation assumed, the bias will be zero, if
$$R = \mathrm{Cov}(y, x)/\mathrm{Var}(x)$$
and this will occur when the regression line of $y$ on $x$ is a straight line
passing through the origin (0,0).
The bias of the estimator $r$, given in expression (A2.65), can be estimated
by replacing $X$, $R$, $\mathrm{Var}(x)$ and $\mathrm{Cov}(y, x)$ by their respective sample
estimators. The estimator of the bias is itself subject to bias. If the distributions
are not skewed, the coefficients of variation will not be very large,
and $|\rho| < 1$, in which case the bias, given by the above expression, will be
small. An upper bound of the magnitude of the relative bias $|E(r) - R|/R$ is
given by $CV_y\,CV_x$.
using equation (A2.62) and simplifying. If we assume as before that |e'| < 1
and that terms involving second and higher powers of e and e' are negligible,
then r is considered to be unbiased to this order of approximation, and the
variance and the mean square error become the same. In this case,
Var(t/o) (A2.68)
$$s^2_{y_0^*} = \sum_{i=1}^{n}(y_i^* - y_0^*)^2 / n(n-1)$$
$$E(s^2_{y_0^*}) = \sigma^2_{y_0^*}$$ (A2.69)
*W) = *
ft)=£^ =y
so that
$$E(y_0^*) = \frac{1}{n}\sum_{i=1}^{n} E(y_i^*) = \frac{1}{n}\,nY = Y$$
To prove (A2.68),
= - y)2^ = *
Y?/
< -y 2 (A2-7°)
(A271)
= ~ (A2.74)
= [l + (^o-l)pc] (6.9)
+ 2(yi-y)£(yij-yi)
j J
» j i
(Yi - Y)
2 _ 1
ab ~ NM2 E Dy->-y)
i j
0 L * j *
= + NM0(M0 - l)a2pc]
(A2.75)
(b)
L
2 V“"*' 2
y ¿—J 2/ho
(A2.76)
where
S2
— — (1 — fh) in srswor (A2.78)
rih
and
$$f_h = n_h/N_h$$
(c)
= = (A279>
Proof. (a) From equation (A2.40), we know that $y^*_{h0}$ is an unbiased estimator
of $Y_h$ in every stratum, so that
$$E(y) = \sum_{h=1}^{L} E(y^*_{h0}) = \sum_{h=1}^{L} Y_h = Y$$
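(b) For each stratum, from equation (A2.41) in sampling with replacement,
we have
$$\sigma^2_{y^*_{h0}} = \frac{N_h^2\,\sigma_h^2}{n_h}$$
Then,
$$\sigma_y^2 = \mathrm{Var}(y) = \mathrm{Var}\Big(\sum_{h=1}^{L} y^*_{h0}\Big) = \sum_{h=1}^{L}\mathrm{Var}(y^*_{h0}) = \sum_{h=1}^{L}\sigma^2_{y^*_{h0}}$$
as the covariance terms vanish because samples are drawn independently
in the different strata.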
The proof for sampling without replacement follows similarly, noting
that for each stratum from equation (A2.42)
1 n\.
s«» = n7(nt-- 1) ~ ÿ»o)2
№ nh
= (A2'81)
4/№ (A2.82)
Note: The theorems for stratified varying probability sampling can be proved on
similar lines as for stratified srs.
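For stratified simple random sampling, the estimate of the universe total and the estimate of its variance are obtained by adding up the stratum contributions, as in the sketch below. The data layout (a list of `(N_h, sample_values)` pairs), the function name and the figures are assumptions made for the illustration; the factor $(1 - f_h)$ is applied only when sampling is without replacement.

```python
def stratified_total(strata, with_replacement=True):
    """Estimated universe total and its estimated variance for stratified srs.

    strata : list of (N_h, sample_values) pairs, one per stratum
    """
    total = 0.0
    variance = 0.0
    for N_h, sample in strata:
        n_h = len(sample)
        ybar_h = sum(sample) / n_h
        s2_h = sum((y - ybar_h) ** 2 for y in sample) / (n_h - 1)
        v_h = N_h ** 2 * s2_h / n_h            # variance estimate of N_h * ybar_h
        if not with_replacement:
            v_h *= 1 - n_h / N_h               # finite population correction
        total += N_h * ybar_h                  # stratum estimates are added
        variance += v_h                        # independent strata: variances add
    return total, variance


# Hypothetical data: two strata with N_h = 10 and 20 units.
strata = [(10, [2, 4, 6]), (20, [1, 3])]
total_wr, var_wr = stratified_total(strata)
total_wor, var_wor = stratified_total(strata, with_replacement=False)
```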
(M (M * (£ (A2.83)
(E^m) (£ (A2.84)
Elements of Probability and Proofs 527
(£>M
)
* > [^AW'ilvslF (A2.85)
=ArAhx/(I4/ch) (A2.87)
i.e. $n_h$ should be proportional to $N_h\sqrt{V_h/c_h}$.
Summing both sides of (A2.87), we get
nAhy(K7^) (A2.89)
The above expression for $n_h$ presumes the value of $n$, which would depend
on whether the total cost or the variance is specified. If the total
cost is fixed at $C'$, then we substitute the values obtained from equation
(A2.89) in the cost function (12.5), and solve for $n$, which gives
Notes
1. See notes 2 and 3 to section 12.3.2.
2. The Neyman allocation follows from the above formulation as indicated in
section 12.3.3. A direct proof may also be derived noting that $c_h = c$, a
constant, and $n = (C - c_0)/c$.
3. The proofs for stratified srs without replacement (note 3 to section 12.3.2,
note to section 12.6.1 and note 4 to section 12.6.2) can be derived similarly.
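The allocation rule, with $n_h$ proportional to $N_h\sqrt{V_h/c_h}$, can be sketched for the fixed-cost case as follows. The sketch assumes a linear cost function of the form $C = c_0 + \sum c_h n_h$, which is the usual form but is not reproduced in this appendix; the function name, the argument layout and the figures are likewise assumptions made for the illustration.

```python
from math import sqrt


def optimum_allocation(strata, total_cost, overhead_cost=0.0):
    """Stratum sample sizes for a fixed total cost.

    strata : list of (N_h, V_h, c_h) triples (size, per-unit variance, per-unit cost)
    Assumes the cost function C = c_0 + sum(c_h * n_h).
    """
    weights = [N_h * sqrt(V_h / c_h) for N_h, V_h, c_h in strata]
    # Cost of one "unit of weight": sum of N_h * sqrt(V_h * c_h).
    cost_per_weight = sum(w * c_h for w, (_, _, c_h) in zip(weights, strata))
    k = (total_cost - overhead_cost) / cost_per_weight
    return [k * w for w in weights]


# Hypothetical strata: (N_h, V_h, c_h).
strata = [(100, 4.0, 1.0), (200, 9.0, 4.0)]
n_h = optimum_allocation(strata, total_cost=240.0, overhead_cost=100.0)
```

With equal unit costs in all strata the same function gives the Neyman allocation; in practice the resulting sizes would be rounded to whole numbers.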
APPENDIX III
Statistical Tables
Table A3.1: Random numbers
03 47 43 73 86 36 96 47 36 61 46 98 63 71 62 33 26 16 80 45 60 11 14 10 95
97 74 24 67 62 42 81 14 57 20 42 53 32 37 32 27 07 36 07 51 24 51 79 89 73
16 76 62 27 66 56 50 26 71 07 32 90 79 78 53 13 55 38 58 59 88 97 54 14 10
12 56 85 99 26 96 96 68 27 31 05 03 72 93 15 57 12 10 14 21 88 26 49 81 76
55 59 56 35 64 38 54 82 46 22 31 62 43 09 90 06 18 44 32 53 23 83 01 30 30
16 22 77 94 39 49 54 43 54 82 17 37 93 23 78 87 35 20 96 43 84 26 34 91 64
84 42 17 53 31 57 24 55 06 88 77 04 74 47 67 21 76 33 50 25 83 92 12 06 76
63 01 63 78 59 16 95 55 67 19 98 10 50 71 75 12 86 73 58 07 44 39 52 38 79
33 21 12 34 29 78 64 56 07 82 52 42 07 44 38 15 51 00 13 42 99 66 02 79 54
57 60 86 32 44 09 47 27 96 54 49 17 46 09 62 90 52 84 77 27 08 02 73 43 28
18 18 07 92 46 44 17 16 58 09 79 83 86 19 62 06 76 50 03 10 55 23 64 05 05
26 62 38 97 75 84 16 07 44 99 83 11 46 32 24 20 14 85 88 45 10 93 72 88 71
23 42 40 64 74 82 97 77 77 81 07 45 32 14 08 32 98 94 07 72 93 85 79 10 75
52 36 28 19 95 50 92 26 11 97 00 56 76 31 38 80 22 02 53 53 86 60 42 04 53
37 85 94 35 12 83 39 50 08 30 42 34 07 96 88 54 42 06 87 98 35 85 29 48 39
70 29 17 12 13 40 33 20 38 26 13 89 51 03 74 17 76 37 13 04 07 74 21 19 30
56 62 18 37 35 96 83 50 87 75 97 12 25 93 47 70 33 24 03 54 97 77 46 44 80
99 49 57 22 77 88 42 95 45 72 16 64 36 16 00 04 43 18 66 79 94 77 24 21 90
16 08 15 04 72 33 27 14 34 09 45 59 34 68 49 12 72 07 34 45 99 27 72 95 14
31 16 93 32 43 50 27 89 87 19 20 15 37 00 49 52 85 66 60 44 38 68 88 11 80
68 34 30 13 70 55 74 30 77 40 44 22 78 84 26 04 33 46 09 52 68 07 97 06 57
74 57 25 65 76 59 29 97 68 60 71 91 38 67 54 13 58 18 24 76 15 54 55 95 52
27 42 37 86 53 48 55 90 65 72 96 57 69 36 10 96 46 92 42 45 97 60 49 04 91
00 39 68 29 61 66 37 32 20 30 77 84 57 03 29 10 45 65 04 26 11 04 96 67 24
29 94 98 94 24 68 49 69 10 82 53 75 91 93 30 34 25 20 57 27 40 48 73 51 92
16 90 82 66 59 83 62 64 11 12 67 19 00 71 74 60 47 21 29 68 02 02 37 03 31
11 27 94 75 06 06 09 19 74 66 02 94 37 34 02 76 70 90 30 86 38 45 94 30 38
35 24 10 16 20 33 32 51 26 38 79 78 45 04 91 16 92 53 56 16 02 75 50 95 98
38 23 16 86 38 42 38 97 01 50 87 75 66 81 41 40 01 74 91 62 48 51 84 08 32
31 96 25 91 47 96 44 33 49 13 34 86 82 53 91 00 52 43 48 85 27 55 26 89 62
56 67 40 67 14 64 05 71 95 86 11 05 65 09 68 76 83 20 37 90 57 16 00 11 66
14 90 84 45 11 75 73 88 05 90 52 27 41 14 86 22 98 12 22 08 07 52 74 95 80
68 05 51 18 00 33 96 02 75 19 07 60 62 93 55 59 33 82 43 90 49 37 38 44 59
20 46 78 73 90 97 51 40 14 02 04 02 33 31 08 39 54 16 49 36 47 95 93 13 30
64 19 58 97 79 15 06 15 93 20 01 90 10 75 06 40 78 78 89 62 02 67 74 17 33
05 26 93 70 60 22 35 85 15 13 92 03 51 59 77 59 56 78 06 83 52 91 05 70 74
07 97 10 88 23 09 98 42 99 64 61 71 62 99 15 06 51 29 16 93 58 05 77 09 51
68 71 86 85 85 54 87 66 47 54 73 32 08 11 12 44 95 92 63 16 29 56 24 29 48
26 99 61 65 53 58 37 78 80 70 42 10 50 67 42 32 17 55 85 74 94 44 67 16 94
14 65 52 68 75 87 59 36 22 41 26 78 63 06 55 13 08 27 01 50 15 29 39 39 43
17 53 77 58 71 71 41 61 50 72 12 41 94 96 26 44 95 27 36 99 02 96 74 30 83
90 26 59 21 19 23 52 23 33 12 96 93 02 18 39 07 02 18 36 07 25 99 32 70 23
41 23 52 55 99 31 04 49 69 96 10 47 48 45 88 13 41 43 89 20 97 17 14 49 17
60 20 50 81 69 31 99 73 68 68 35 81 33 03 76 24 30 12 48 60 18 99 10 72 34
91 25 38 05 90 94 58 28 41 36 45 37 59 03 09 90 35 57 29 12 82 62 54 65 60
34 50 57 74 37 98 80 33 00 91 09 77 93 19 82 74 94 80 04 04 45 07 31 66 49
85 22 04 39 43 73 81 53 94 79 33 62 46 86 28 08 31 54 46 31 53 94 13 38 47
09 79 13 77 48 73 82 97 22 21 05 03 27 24 83 72 89 44 05 60 35 80 39 94 88
88 75 80 18 14 22 95 75 42 49 39 32 82 22 49 02 48 07 70 37 16 04 61 67 87
90 96 23 70 00 39 00 03 06 90 55 85 78 38 36 94 37 30 69 32 90 89 00 76 33
Table A3.1 continued
53 74 23 99 07 61 32 28 69 84 94 62 67 86 24 98 33 41 19 95 47 53 53 38 09
63 38 06 86 54 99 00 65 26 94 02 82 90 23 07 79 62 67 80 60 75 91 12 81 19
35 30 58 21 46 06 72 17 10 94 25 21 31 75 96 49 28 24 00 49 55 65 79 78 07
63 43 36 82 69 65 51 18 37 88 61 38 44 12 45 32 92 85 88 65 54 34 81 85 35
98 25 37 55 26 01 91 82 81 46 74 71 12 94 97 24 02 71 37 07 03 92 18 66 75
02 63 21 17 69 71 50 80 89 56 38 15 70 11 48 43 40 45 86 98 00 83 26 91 03
64 55 22 21 82 48 22 28 06 00 61 54 13 43 91 82 78 12 23 29 06 66 24 12 27
85 07 26 13 89 01 10 07 82 04 59 63 69 36 03 69 11 15 83 80 13 29 54 19 28
58 54 16 24 15 51 54 44 82 00 62 61 65 04 69 38 18 65 18 97 85 72 13 49 21
34 85 27 84 87 61 48 64 56 26 90 18 48 13 26 37 70 15 42 57 65 65 80 39 07
03 92 18 27 46 57 99 16 96 56 30 33 72 85 22 84 64 38 56 98 99 01 30 98 64
62 95 30 27 59 37 75 41 66 48 86 97 80 61 45 21 53 04 01 63 45 76 08 64 27
08 45 93 15 22 60 21 75 46 91 98 77 27 85 42 28 88 61 08 84 69 62 03 42 73
07 08 55 18 40 45 44 75 13 90 24 94 96 61 02 57 55 66 83 15 73 42 37 11 61
01 85 89 95 66 51 10 19 34 88 15 84 97 19 75 12 76 39 43 78 64 63 91 08 25
72 84 71 14 35 19 11 58 49 26 50 11 17 17 76 86 31 57 20 18 95 60 78 46 75
88 78 28 16 84 13 52 53 94 53 75 45 69 30 96 73 89 65 70 31 99 17 43 48 76
45 17 75 65 57 28 40 19 72 12 25 12 74 75 67 60 40 60 81 19 24 62 01 61 16
96 76 28 12 54 22 01 11 94 25 71 96 16 16 88 68 64 36 74 45 19 59 50 88 92
43 31 67 72 30 24 02 94 08 63 38 32 36 66 02 69 36 38 25 39 48 03 45 15 22
50 44 66 44 21 66 06 58 05 62 68 15 54 35 02 42 35 48 96 32 14 52 41 52 48
22 66 22 15 86 26 63 75 41 99 58 42 36 72 24 58 37 52 18 51 03 37 18 39 11
96 24 40 14 51 23 22 30 88 57 95 67 47 29 83 94 69 40 06 07 18 16 36 78 86
31 73 91 61 19 60 20 72 93 48 98 57 07 23 69 65 95 39 69 58 56 80 30 19 44
78 60 73 99 84 43 89 94 36 45 56 69 47 07 41 90 22 91 07 12 78 35 34 08 72
84 37 90 61 56 70 10 23 98 05 85 11 34 76 60 76 48 45 34 60 01 64 18 39 96
36 67 10 08 23 98 93 35 08 86 99 29 76 29 81 33 34 91 58 93 63 14 52 32 52
07 28 59 07 48 89 64 58 89 75 83 85 62 27 89 30 14 78 56 27 86 63 50 80 02
10 15 83 87 60 79 24 31 66 56 21 48 24 06 93 91 98 94 05 49 01 47 59 38 00
55 19 68 97 65 03 73 52 16 56 00 53 55 90 27 33 42 29 38 87 22 13 88 83 34
53 81 29 13 39 35 01 20 71 34 62 33 74 82 14 53 73 19 09 03 56 54 29 56 93
51 86 32 68 92 33 98 74 66 99 40 14 71 94 58 45 94 19 38 81 14 44 99 81 07
35 91 70 29 13 80 03 54 07 27 96 94 78 32 66 50 95 52 74 33 13 80 55 62 54
37 71 67 95 13 20 02 44 95 94 64 85 04 05 72 01 32 90 76 14 53 89 74 60 41
93 66 13 83 27 92 79 64 64 72 28 54 96 53 84 48 14 52 98 94 56 07 93 89 30
02 96 08 45 65 13 05 00 41 84 93 07 54 72 59 21 45 57 09 77 19 48 56 27 44
49 83 43 48 35 82 88 33 69 96 72 36 04 19 76 47 45 15 18 60 82 11 08 95 97
84 60 71 62 46 40 80 81 30 37 34 39 23 05 38 25 15 35 71 30 88 12 57 21 77
18 17 30 88 71 44 91 14 88 47 89 23 30 63 15 56 34 20 47 89 99 82 93 24 98
79 69 10 61 78 71 32 76 95 62 87 00 22 58 40 92 54 01 75 25 43 11 71 99 31
75 93 36 57 83 56 20 14 82 11 74 21 97 90 65 96 42 68 63 86 74 54 13 26 94
38 30 92 29 03 06 28 81 39 38 62 25 06 84 63 61 29 08 93 67 04 32 92 08 00
51 29 50 10 34 31 57 75 95 80 51 97 02 74 77 76 15 48 49 44 18 55 63 77 09
21 31 38 86 24 37 79 81 53 74 73 24 16 10 33 52 83 90 94 96 70 47 14 54 36
29 01 23 87 88 58 02 39 37 67 42 10 14 20 92 16 55 23 42 45 54 76 09 11 06
95 33 95 22 00 18 74 72 00 18 38 79 58 69 32 81 76 80 26 92 82 80 84 25 39
90 84 60 79 80 24 36 59 87 38 82 07 53 89 35 96 35 23 79 18 05 98 90 07 35
46 40 62 98 82 54 97 20 56 95 15 74 80 08 32 16 46 70 50 80 67 72 16 42 79
20 31 89 03 43 38 46 82 68 72 32 14 82 99 70 80 60 47 18 97 63 49 30 21 30
71 59 73 05 50 08 22 23 71 77 91 01 93 20 49 82 96 59 26 91 66 39 67 08 60
Table A3.3: Values of coefficient of variation (CV) per unit in the universe
for different values of the universe proportion P
$$\mathrm{CV} = \sqrt{(1 - P)/P}$$
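The formula in the caption can be evaluated directly. The following is a minimal sketch, assuming the attribute is scored 0/1 so that the universe mean is P and the standard deviation is the square root of P(1 - P); the function name `cv_per_unit` is illustrative and does not come from the text.

```python
import math


def cv_per_unit(p):
    """CV per unit of a 0/1 attribute with universe proportion p.

    Assumes mean = p and standard deviation = sqrt(p * (1 - p)),
    so that CV = sqrt((1 - p) / p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return math.sqrt((1 - p) / p)


# Print the CV (as a percentage) for a few illustrative proportions.
for p in (0.1, 0.25, 0.5, 0.9):
    print(f"P = {p:.2f}  CV = {100 * cv_per_unit(p):.1f}%")
```

Rare attributes (small P) give a large CV per unit, which is why they need larger samples for the same precision.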
Table A3.4: Sample size (n) required to ensure desired coefficient of variation
of sample estimator (e) in sampling with replacement
Desired CV of
sample estimator    Value of universe CV per unit (%)
(%)                 5 10 20 30 40 50 60 70 80 90 100 150 200
25 1 1 1 1 3 4 6 8 10 13 16 36 64
20 1 1 1 2 4 6 9 12 16 20 25 56 100
15 1 1 2 4 7 11 16 22 28 36 44 100 178
10 1 1 4 9 16 25 36 49 64 81 100 225 400
5 1 4 16 36 64 100 144 196 256 324 400 900 1600
4 2 5 25 56 100 156 225 306 400 506 625 1406 2500
3 3 11 44 108 178 278 400 544 711 900 1111 2500 4444
2.5 4 16 64 144 256 400 576 784 1024 1296 1600 3600 6400
2 5 25 100 225 400 625 900 1225 1600 2025 2500 5625 10000
1 25 100 400 900 1600 2500 3600 4900 6400 8100 10000 22500 40000
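The entries of Table A3.4 appear to follow from the relation between the CV of a sample mean and the CV per unit in sampling with replacement, namely e = CV / sqrt(n), which gives n = (CV / e)^2. The sketch below is a hedged illustration of that relation; the function name `required_sample_size`, the rounding to the nearest integer, and the floor of one unit are assumptions made here, not rules stated in the text.

```python
def required_sample_size(universe_cv, desired_cv):
    """Sample size n such that CV of the estimator is about desired_cv.

    Uses n = (universe_cv / desired_cv) ** 2 for sampling with replacement.
    Rounding to the nearest integer with a minimum of 1 is an assumption.
    Both arguments must be in the same units (e.g. both in per cent).
    """
    if universe_cv <= 0 or desired_cv <= 0:
        raise ValueError("both CVs must be positive")
    return max(1, round((universe_cv / desired_cv) ** 2))


# Example: universe CV of 100% per unit, desired CV of 10% for the estimator.
print(required_sample_size(100, 10))
```

Halving the desired CV quadruples the required sample size, which is the pattern visible down each column of the table.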
Table A4.1
I 1 8.7 69 17 7 5 5 4 6 2 3 5 5 6 76
5 4 4 4 5 3 3
2 10.6 82 18 6 5 4 5 4 5 6 5 3 5 82
4 4 5 3 3 5 6 4
3 15.0 110 26 6 6 3 5 3 4 5 5 4 4 116
4 3 7 5 4 6 2 5 5 6
1 5 5 4 6 3
4 6.2 80 18 6 3 6 3 6 3 4 5 4 4 76
4 5 6 3 5 1 3 5
5 9.6 92 24 5 4 6 5 4 5 6 5 4 4 112
7 6 6 5 4 4 5 6 3 4
3 3 5 3
6 7.3 65 17 3 4 4 6 5 7 3 5 4 5 77
4 6 4 5 3 3 6
7 4.5 72 20 6 4 4 5 4 5 6 4 3 5 88
4 6 5 5 2 4 5 4 3 4
8 10.6 108 24 5 3 3 7 4 4 6 6 4 5 109
3 7 6 4 5 6 3 5 1 3
5 6 4 4
9 5.4 106 24 5 3 5 3 4 6 5 4 6 5 111
6 3 6 5 6 6 3 5 4 4
5 4 6 2
10 3.5 80 22 4 4 5 5 4 4 5 4 3 5 88
5 3 4 5 4 3 4 3 5 4
4 1
(continued)
536 Appendix IV
II 1 5.8 72 15 8 4 4 6 5 5 5 7 3 5 78
6 6 6 4 4
2 11.4 102 22 9 6 5 5 4 3 5 5 3 10 112
4 4 6 5 4 7 4 5 3 6
4 5
3 5.8 73 17 6 5 5 3 6 5 3 7 3 5 80
5 6 6 5 3 4 3
4 7.8 84 19 6 4 6 4 5 4 4 5 7 6 93
4 6 4 5 6 5 3 4 5
5 6.5 98 20 8 4 3 5 6 3 7 5 5 5 105
6 6 7 3 6 6 7 2 5 6
6 9.0 84 19 4 3 5 6 5 6 5 5 5 3 93
5 4 5 5 3 9 4 5 6
7 7.3 85 19 7 5 5 3 5 4 4 5 8 4 95
4 6 5 7 4 5 5 5 4
8 7.0 102 23 4 5 5 4 6 5 5 3 4 4 114
8 5 4 4 5 6 4 5 4 6
7 5 6
9 10.5 122 25 8 4 4 5 5 4 7 5 6 4 127
5 5 8 5 3 6 3 4 6 5
4 7 3 5 6
10 11.1 102 23 7 4 5 6 5 4 6 4 4 7 113
5 2 5 4 4 9 6 5 6 2
4 4 5
11 6.3 86 18 4 5 4 4 7 5 5 7 5 9 94
6 5 7 4 3 5 4 5
III 1 10.0 78 15 8 6 7 4 6 5 5 2 7 7 83
7 5 6 4 4
2 14.2 112 21 9 4 5 8 6 3 6 5 5 5 121
6 2 8 5 8 4 5 7 5 7
8
3 8.2 97 18 8 8 4 8 3 6 6 6 4 6 105
6 8 7 6 4 5 6 4
4 12.5 117 21 7 5 6 7 8 4 5 6 6 6 129
6 5 7 5 9 8 6 6 6 4
7
5 6.5 106 20 7 3 5 6 6 7 5 6 5 5 114
4 10 7 6 5 8 6 4 6 3
(continued)
Hypothetical Universe 537
Case Study:
Indian National Sample Survey, 1964-5
A5.1 Introduction
Along with the Current Population Survey in the U.S. (with probability
sampling starting in 1943) and the Family Expenditure Survey in the U.K.
(continuing since 1957), the National Sample Survey in India has a long
tradition of undertaking probability-based household surveys on a
continuing basis.
As a case study, the main features of the planning, execution and
analysis of the Indian National Sample Survey (NSS) for the period July
1964-June 1965 are given in this chapter, based on Technical Paper on
Sample Design, The National Sample Survey, Nineteenth Round, July
1964-June 1965 by A.S. Roy and A. Bhattacharyya (1968).
At the instance of P.C. Mahalanobis, Honorary Statistical Adviser to
the Cabinet of the Government of India, the National Sample Survey was
started in 1950 with the object of obtaining comprehensive and continuing
information on economic, social, demographic, and agricultural
characteristics through sample surveys on a country-wide basis. The
information collected is utilized for planning, research and other
purposes by the Central and State Governments, the Planning Commission
and other interested organizations. The NSS is a continuing,
multi-subject, integrated survey and is conducted in the form of
successive “rounds”; each round covers several topics of current interest
in a specific survey period. The scope, period, sample design and program
of each round are fixed by taking into account the requirements of its
users and the resources available for that period. Since 1958-9, the
survey period has been one complete year, coinciding approximately with
the agricultural year.
540 Appendix V
The survey for 1964-5 was designed to provide information on the following
topics: population, current and historical fertility and current mortality;
employment and unemployment; indebtedness of rural labor households;
land utilization, acreage and production of the major cereal crops; rural
retail prices; and integrated socioeconomic activities of the households.
The survey covered the whole of India, but excluded a few specific areas,
the latter accounting for less than 0.5 per cent of the total estimated
population. Hotel residents and persons in boarding houses were included,
but inmates of hospitals, nursing homes, jails etc. were not covered by
the survey: persons without fixed abode were included only for the
demographic inquiry. The results were required for the nineteen states and
union territories and separately for the rural and the urban sectors, but
the crop survey was designed to provide estimates for all the major cereal
crops taken together for rural India as a whole. The total cost of the
field survey was about Rs. 8 million ($1 US = Rs. 7.50 and £1 sterling =
Rs. 18.00 at the then official exchange rates) and an equivalent sum is
estimated to have been spent on the tabulation, analysis and preparation
of reports.
From the early planning stages, preliminary budgets were prepared on the
basis of the experiences of a few of the preceding rounds and were examined
on the basis of the time records of the staff engaged in field enumeration
and processing.
by an enumerator during the whole survey period. With the work load of
an enumerator being more or less fixed, the demand for a big sample in
order to provide estimates at the state and lower levels had to be met by
increasing the number of enumerators. The total number of enumerators in
the nineteenth round was 752 for the central sample (with a reserve of 10
per cent), of whom 706 were to survey both the rural and the urban areas
and 46 in the urban areas only. The saving arising from the multi-subject
nature of the survey was also utilized for having a large number of sample
fsu’s. The total number of first-stage units for the central sample was set
at 8,472 villages and 4,572 urban blocks.
For the reasons mentioned in section 26.13.4, the survey period was
the agricultural year, divided into six equal periods of two months each,
called sub-rounds, and one-sixth of the sample was surveyed in each
sub-round. This ensured firstly the employment of a smaller number of
skilled and well-trained enumerators, and secondly the representation of
all the four seasons so that the seasonal fluctuations were taken into
account.
For the household inquiries, sampling was the most intensive for the
population schedule, with overall sampling fractions of 0.2 per cent in the
rural sector and 0.4 per cent in the urban for the central sample, and
least intensive for the integrated schedules 17 and 17(suppl.), with overall
sampling fractions of 0.01 per cent in the rural sector and 0.02 per cent in
the urban sector.
The time required for enumerating the different schedules was obtained
on the basis of the records in the previous rounds and, for the schedules
which were canvassed for the first time, on the basis of a try-out. In
large-scale surveys, journey time accounts for an appreciable portion of
the total time and depends on the size of the area an enumerator covers,
the number of sample first-stage units and the general transport
facilities. In the present survey, an enumerator’s area of operation in
the rural sector was the stratum in which he was posted. Most of these
strata were less than 4,000 square miles in area and this was considered
to be a manageable size for an enumerator. When a stratum exceeded 4,000
square miles in area, that stratum was sub-divided into two or more parts
known as investigation zones and the sample villages to be surveyed by an
enumerator were selected from one of these zones. An enumerator was
allotted twelve villages in his stratum and a similar procedure was
followed for the urban blocks, which achieved some savings in the journey
time, but only marginal savings could be achieved in the enumeration time.
For the socioeconomic inquiries, the preparation of sampling frames for
selecting households was simplified and a schedule was canvassed in a
sub-sample of households selected for another schedule. For crop surveys,
a sample of clusters of plots instead of a direct sample of individual
plots was chosen: this was because
the time taken for identifying the plots is comparatively large in relation
to the time taken for recording the land utilization of such plots. Also, the
same sample of clusters of plots was surveyed in all the crop seasons instead
of surveying fresh clusters during each season. Finally, when the sample
village or the urban block was big, the enumerators were allowed to
confine the survey only to part of a sample village or the urban block so
that their individual work load remained within limits. For household
inquiries, each sample village and urban block was visited and surveyed
only once during the whole round. The crop survey was conducted in each of
the four seasons - autumn, winter, spring and summer. The crop-area survey
was conducted in all the twenty-four sample villages of the rural sector
while the crop-yield survey was taken up only in one-fourth of the
villages. The price inquiry was conducted in a fixed set of sample villages
which has been continuing since the sixteenth round, July 1960-June 1961.
These sample villages were 491 in number. The time standards for
enumerating different schedules and the average work load of an enumerator
within a sample village or an urban block are shown in Table A5.1.
Table A5.1: Average time requirements for different schedules and journey
between the average work-load within a sample village/urban block: Indian
National Sample Survey, 19th round, July 1964-June 1965
Notes:
1. Schedules 5.0, 5.1, 5.2, and 17 (S) were taken up in each of the four crop
seasons.
2. For Schedule 5.1, the sample size may go up to 12 cuts, depending upon
the nature of sowing of pure and mixed crops.
3. “hh” denotes household(s); “S” denotes Supplementary.
in any other population size class was taken as the ratio of the average
population of the class to the average population of villages in the first
class, rounded off to a suitable integer. The total size of a tehsil was
obtained by cumulating the sizes of all the villages contained in that
tehsil. Similarly, the total size of a region was obtained by cumulating
the sizes of all tehsils contained in that region (there were forty-eight
regions, formed by grouping contiguous districts within a state mainly on
the basis of information on topography, crop pattern, and population
density: the ultimate rural strata were formed within these regions).
The all-India central sample of 8,472 villages was allocated to the
different states (and union territories) on a joint consideration of the
rural populations, area under food crops, and the enumerator strength. This
allocation was modified to ensure a minimum sample size of 360 villages in
6. The original size (say Z') of a stratum (i.e. sum of the sizes of villages
in it) should not differ by more than 10 per cent from the planned
average figure Z, because otherwise the large adjustment to be done
in order to equalize this Z' to Z would lower the sampling efficiency
of the pps design.
In all, 353 compact strata were formed by grouping contiguous tehsils
such that the first, second, and sixth conditions were completely
satisfied and conditions three, four and five were satisfied to the extent
possible. Finally, the original stratum size Z' was made equal to Z by
slightly increasing or decreasing the sizes of the villages belonging to
the stratum.
In each stratum, four independent sub-samples of 12 villages each were
selected circular systematically with probability proportional to the size,
the size being defined earlier. Out of these 12 villages in a sub-sample,
the six villages with odd orders of selection constituted the central sample
and the rest, with even orders of selection, the state sample. The linking
of central and state samples ensured a better spread of samples over the
whole stratum and consequently a better estimate when the central and
the state sample data are pooled together.
The interval for systematic selection for either the central or the state
sample was obtained as I = Z/6; the interval I, an integer, was the same
for all the strata in a state since stratum sizes were equalized during the
formation of the strata. Four independent random starts were used for
selecting the four sub-samples of the 12 villages each.
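The selection scheme described above can be sketched in code. This is a minimal illustration, assuming integer village sizes whose total is an exact multiple of the number of draws (as the equalized stratum sizes imply); the function names and the toy sizes are hypothetical and are not taken from the survey documents. With 12 draws per sub-sample the step between successive draws is Z/12, so the six odd-order draws are spaced Z/6 apart, matching the interval I quoted for the central sample.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def circular_systematic_pps(sizes, n, rng):
    """Return n unit indices (in order of selection) drawn circular
    systematically with probability proportional to size."""
    cumulative = list(accumulate(sizes))
    total = cumulative[-1]
    if total % n != 0:
        raise ValueError("total size must be a multiple of n")
    step = total // n
    start = rng.randint(1, total)  # random start on the circle 1..total
    selected = []
    for k in range(n):
        # Wrap around the end of the frame (the "circular" part).
        point = (start - 1 + k * step) % total + 1
        # The unit whose cumulative-size range contains the point.
        selected.append(bisect_left(cumulative, point))
    return selected


def split_central_state(selected):
    """Odd orders of selection form the central sample, even orders the
    state sample (orders counted from 1)."""
    return selected[0::2], selected[1::2]


# Hypothetical stratum of 8 villages with sizes summing to 120.
sizes = [10, 20, 5, 25, 15, 15, 20, 10]
rng = random.Random(2024)
sub_sample = circular_systematic_pps(sizes, 12, rng)
central, state = split_central_state(sub_sample)
print("central:", central)
print("state:  ", state)
```

Repeating the call with four independent random starts would give the four independent sub-samples of 12 villages each.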
Because of pps selection, larger villages occurred more frequently in the
sample. This resulted in too heavy a work load for some enumerators. The
planned work load involved the listing of 120 households per sample village
on an average, i.e. a total of 1,440 households per year per enumerator. To
reduce the work load of household listing in large villages, the hamlets were
grouped in such a manner as to contain approximately the same population
and the survey confined to one hamlet group, selected at random with equal
probability. The number of hamlet groups to be formed in a village was
specified by the Indian Statistical Institute, but the formation and selection
of the hamlet groups were done by the enumerators on reaching the sample
villages.
For selection of households for the socioeconomic inquiries, the
households were not stratified according to the available information,
because this would have complicated the computation of the estimates and
would have made it difficult to keep the design self-weighting. The same
objectives of stratification were to a large extent achieved by arranging
the households in the manner described in section 4.3.3. From the sampling
frame thus prepared, a sample of 22 households on an average was selected
linear systematically with the interval and a random start prescribed for
the village was meant for just acreage survey or both acreage and
yield-rate surveys. All the plots possessed by these sample households
within a 5 mile radius of the sample village were taken up for survey.
where
$$w_{hij} = \frac{Z_0}{6} \cdot \frac{D_{hi} I_{hi}}{Z_{hi}}$$
is the multiplier for $y_{hij}$. The objective is to make $w_{hij}$ a
constant, $w_0$, within a state. Now $Z_h = \sum_i Z_{hi} = Z_0$ is the same
for all the strata within a state, and the values of $Z_{hi}$ and $D_{hi}$
are already determined by the village populations. So the only item that
can be chosen so as to equalize all the $w_{hij}$ values is $I_{hi}$, i.e.
$$I_{hi} = \frac{6 w_0}{Z_0} \cdot \frac{Z_{hi}}{D_{hi}} \qquad \text{(A5.2)}$$
which determines the interval for selecting the combined sample in a sample
village, when the value of the multiplier wo is fixed.
The unbiased estimator y then becomes
$$y = w_0 \sum_{h=1}^{L} \sum_{i=1}^{6} \sum_{j=1}^{m_{hi}} y_{hij} \qquad \text{(A5.3)}$$
L 6
m = ^2 ^mhi
= wq x number of sample hh per ss in the state (A5.4)
First, $w_0$ is to be obtained from equation (A5.6) for each state, and then
the values of $I_{hi}$ are determined from equation (A5.2) for a sample
village, noting that the factor $6w_0/Z_0$ is the same for all the sample
villages in a state.
When, as in most cases, Ihi as computed was a fraction, it was rounded
off in the randomized manner, as described in note 3 to section 12.3. The
rounded off (integer) values of Ihi were given to the enumerators.
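The computation of the household sampling interval and its randomized rounding can be sketched as follows. This is an illustration under stated assumptions: the interval formula is the one in equation (A5.2); the rounding rule used here (round up with probability equal to the fractional part, so that the expected value equals the unrounded interval) is the usual form of randomized rounding and is assumed, since note 3 to section 12.3 is not reproduced in this appendix. The function names and the numerical inputs are hypothetical.

```python
import math
import random


def household_interval(w0, z0, z_hi, d_hi):
    """Interval I_hi = (6 * w0 / Z0) * (Z_hi / D_hi), as in (A5.2)."""
    return 6 * w0 * z_hi / (z0 * d_hi)


def randomized_round(value, rng):
    """Round to an adjacent integer so that the expectation equals value.

    Assumed rule: round up with probability equal to the fractional part.
    """
    lower = math.floor(value)
    fraction = value - lower
    return lower + (1 if rng.random() < fraction else 0)


# Hypothetical state-level constants and one sample village.
w0, z0 = 100, 1200
z_hi, d_hi = 37, 2
rng = random.Random(7)

interval = household_interval(w0, z0, z_hi, d_hi)
print("unrounded interval:", interval)
print("interval given to enumerator:", randomized_round(interval, rng))
```

Because the factor 6 w0 / Z0 is constant within a state, only Z_hi and D_hi vary from village to village in practice.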
The intervals obtained by equation (A5.2) ensured that the total sample
size (number of households) in the state would be near the desired value
$n m_0 = 132L$, but they do not ensure anything for the individual sample
villages. Some enumerators were allotted too many big villages, and as
each enumerator had to survey 12 sample villages, the total sample size
for them was much larger than the planned figure of 12 x 22 households.
In such cases, to provide relief to the enumerators, they were allowed to
survey a smaller number of sample households than was strictly required
for a self-weighting design: the shortfall was made good at the scrutiny
stage by repeating some of the filled-in schedules.
Once the design was made self-weighting for the combined sample, it
became automatically so for individual household inquiries, for constant
fractions of the combined sample households were surveyed for each of these
inquiries in all the sample villages.
(A5.7)
where $I_0 = Z_0/6$.
Schedules 10.1, 12, 16, and 17:
L 6 rn, j
y = yhii (A5.8)
h i j
where n' is the number of reporting villages in the state and the sub-sample
considered.
In the urban sector, as there were 2 strata in each state (i.e. L = 2),
the unbiased estimators took the following form:
Schedule 0.2:
= (A5.10)
/1=1 1=1 hl
where Ih = Zh/nh.
Schedules 10, 12, 16, and 17:
$$y = \sum_{h=1}^{2} w_h \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} y_{hij} \qquad \text{(A5.11)}$$
where
_ t ^hi M'hj /A r .
aj — 7 / >Q/t»j (A5.13)
^Zhi mhi
planned for survey, and ahij is the area under the specified cereal crop for
the jth sample plot or in all plots of the jth sample household in the ith
sample village.
An estimator of the yield rate (Schedule 5.1) for a cereal crop in a
season from sub-sample 1 or 2 is given by
$$y_0 = \tfrac{1}{4}(y_1 + y_2 + y_3 + y_4) = \tfrac{1}{4} \sum_{h=1}^{L} (y_{h1} + y_{h2} + y_{h3} + y_{h4}) \qquad \text{(A5.15)}$$
where $y_{hl}$ is the lth sub-sample estimator of the hth stratum total.
The ratio R = Y/X was estimated by $r_l = y_l/x_l$ from the lth sub-sample
and the combined estimator of R was
$$r = \frac{y_0}{x_0} = \frac{y_1 + y_2 + y_3 + y_4}{x_1 + x_2 + x_3 + x_4} \qquad \text{(A5.16)}$$
The combined estimator of the area under a crop was similarly the mean
of the four sub-sample estimates. The combined production estimator $y_0$
for a season is the mean of the two production estimators $y_1$ and $y_2$.
Whenever season-wise total crop estimates were obtained, the estimate for
the year was the sum of the season-wise estimates.
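The way independent sub-sample estimates are combined can be illustrated with a short sketch. The combined total (mean of the sub-sample estimates) and the combined ratio (ratio of sums) follow the text; the variance formula used here, the sum of squared deviations of the sub-sample estimates from their mean divided by k(k - 1), is the standard estimator for k independent replicates and is an assumption about what the survey used at the state level. All function names and figures are hypothetical.

```python
def combined_total(sub_estimates):
    """Mean of the independent sub-sample estimates of a total."""
    return sum(sub_estimates) / len(sub_estimates)


def combined_ratio(y_estimates, x_estimates):
    """Ratio of the summed numerators to the summed denominators."""
    return sum(y_estimates) / sum(x_estimates)


def variance_of_combined_total(sub_estimates):
    """Estimated variance of the combined total from k independent
    sub-samples: sum((y_l - y_0)^2) / (k * (k - 1))."""
    k = len(sub_estimates)
    if k < 2:
        raise ValueError("need at least two sub-samples")
    y0 = combined_total(sub_estimates)
    return sum((y - y0) ** 2 for y in sub_estimates) / (k * (k - 1))


# Hypothetical estimates from four sub-samples.
y = [10.0, 12.0, 14.0, 16.0]
x = [5.0, 6.0, 7.0, 8.0]

y0 = combined_total(y)
var_y0 = variance_of_combined_total(y)
print("combined total:", y0)
print("combined ratio:", combined_ratio(y, x))
print("CV of combined total (%):", 100 * var_y0 ** 0.5 / y0)
```

The attraction of this design is that the variance calculation needs only the sub-sample estimates themselves, whatever the complexity of the design within each sub-sample.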
$$= \frac{1}{12} \sum_{h=1}^{L} \sum_{l=1}^{4} (y_{hl} - \bar{y}_h)^2 \qquad \text{(A5.17)}$$
and
Syo = ^2 ~ (A5.18)
where
$$\bar{y}_h = \frac{1}{4} \sum_{l=1}^{4} y_{hl}$$
The previous 18 rounds, covering the period 1950-64, acted as pilot
inquiries, the results of which were used to design the sample for the 19th
round effectively, including information on the variability and the cost. In
addition, special tryouts were organized for new schedules introduced in
this survey.
The type of enumerators has been mentioned in paragraph A5.4. The
training of enumerators was done in two phases, the supervisors being
trained by the technical staff and in their turn training the enumerators
(section 26.15).
The schedules were first scrutinized by the field supervisors and then by
the technical staff of the Indian Statistical Institute. Queries of a technical
nature from the field were answered by the technical staff.
The edited information was put into punch cards and the final tabulation
done by mechanical processing. Electronic processing is also resorted to for
special computations and analysis.
Some results of the survey obtained from the central sample are given in
Table A5.2. The coefficients of variation of the estimates are seen to be
reasonably small.
Further reading
Full details of the survey methods of the Indian National Sample Survey, 1964-
65, are given by Roy and Bhattacharyya; Murthy (1967) gives in Chapter 15 the
survey methods of the Indian National Sample Survey, 1958-9, and in Chapter
16 those of the Family Living Survey in urban areas of India, 1958-8, the
latter from Chinappa (1963).
The following are references to some readily available case studies:
Agricultural surveys: (a) in India conducted by the Indian Statistical
Institute: Mahalanobis (1940, 1944, 1946a, 1968); (b) in India conducted by
the Indian Council of Agricultural Research: Panse and Sukhatme, and
Sukhatme and Panse; (c) Surveys of Fertilizer Practice in England and
Wales: Yates.
of Catholic Americans, the Gallup Poll, Public Perception of the Illinois
Legislature, the National Labor Relations Board (NLRB) Election Study, the
Detroit Area Study, and Unfunded Doctoral Dissertation Research.
Table A5.2: Some estimates from the Indian National Sample Survey, 19th
round, July 1964-June 1965
A6.1 Introduction
560 Appendix VI
could, for example, relate to the current rate of population growth,
agricultural production (yields of different crops), employment and
unemployment rates, etc.
To attain the second objective, the survey programs should be so
designed as to build up national capacities for the different aspects of
survey sampling, of which training is an important element; and training,
by its very nature, takes a little time to produce results.
Some survey programs tend to emphasize the attainment of one objective
over the other. For example, in one developing country that participated
in the National Household Survey Capability Program (NHSCP, sponsored
by the U.N. and its agencies), two U.N. agencies funded two separate
surveys, conducted one after the other. The donor agency for the second
survey declined to continue the data processing expert recruited by the
donor agency for the first survey but fielded its own expert; moreover,
the computer equipment brought in by the second donor was incompatible
with the previous set. Thus, the input of one agency did not mesh with that
of the other, resulting in a less significant contribution towards the
enhancement of the data processing capabilities of the recipient country
(deGraft-Johnson, 1993, para. 17).
In other international survey programs too, the objective of augmenting
national survey capability is often ignored when it comes into conflict with
the competing objective of “getting out the data”; “for example, a number
of countries had their survey results [of inter-country survey programs other
than NHSCP] processed outside the country because of technical
difficulties within the countries” (op. cit., para. 18). In the World
Fertility Survey (WFS), country reports (including the first results and
the sampling errors of major items) were prepared in the participating
countries, but detailed country analysis and inter-country comparative
analysis, based on computer data tapes of the survey results, were made by
the WFS secretariat, the U.N. Secretariat, and a number of universities and
research institutions, mostly in developed countries: such a procedure
does not fully contribute to the capacity-building in the developing
countries.
It also became clear that unless multi-country survey programs are made
product-oriented (in terms of obtaining results and analysis in a
standardized time-frame), the objective of strengthening the process (of
surveys themselves) cannot be achieved easily. Capacity-building will not
flourish unless data are delivered in a timely manner; a survey that is
well designed and well executed and which furnishes the required data and
analysis as input to national planning can well be self-sustaining.
Nothing succeeds like success.
Technical cooperation with United Nations and its agencies and other
bilateral and multilateral organizations has resulted in the transfer of
Multi-country Survey Programs 561
A6.2.2 Funding
External funding for the survey program for a country follows two routes:
Absent a guarantee from donor agencies for national surveys, the second
approach inevitably results in delays and other problems in implementing
survey programs. The problems are compounded by the data needs of
a particular country not meshing with a donor organization’s mandate,
priority and interests.
A6.2.3 Management
Different survey programs adopt different systems of in-country project
management. In the NHSCP, it was observed that in general, projects with
externally recruited Chief Technical Advisers did much better than those
because they are not using “modern” toothbrush and toothpaste that they
are not taking care of oral hygiene. To cite another example, some surveys
include a topic on pre-natal care of mothers, specifically asking questions
on whether any vitamin or milk was taken; when the answers are in the
negative, the conclusion is drawn that there was no pre-natal care. But in
several segments of population in Afghanistan, Bangladesh, Ethiopia, and
India (and no doubt in many other countries), pregnant women in rural
and small urban areas follow traditional practices of ingesting items such
as dried yogurt, roasted chick peas (kurut and nakhod respectively in the
local language in Afghanistan), and special red clay that are rich in calcium
or protein: the body knows what is good for it. But questions on such indigenous
pre-natal care are seldom asked in multi-country surveys with set
questionnaires.
The suggestion is that while maintaining a standard set of questions on
any topic, local variations should be allowed to supplement the set questions.
It may also be observed, as an aside, that the tradition of indigenous
pre-natal and post-natal care of mothers in developing countries is being
swamped by the spreading waves of urbanization and industrialization (and
extended family norms giving way to nuclear family norms), without an
increase, pari passu, in attendant community health care - a price that has
perforce to be paid for progress until such time as a health network
covers the entire population. The argument, raised in connection with
the environmental damages resulting from some capital-intensive projects
for which the World Bank provides loans, therefore, remains moot: “the
fundamental problem of modernity may be that development pursued as
an absolute goal is nihilistic” (Rich, 1994).
(United Nations, 1982b) and has recently published another study on the
computation of sampling variance (U.N., 1993a); however, the UN-supported
Pan Arab Child Health Program has plans to calculate sampling errors for
the major items (see Section A6.3.8 later in the appendix).
To their credit, WFS and DHS with their strong central management
had made mandatory the procedures for computing sampling variance (for
the major items) and indications of non-sampling errors in surveys conducted
under their auspices. Some studies of the sampling and non-sampling
errors have also been undertaken for the Contraceptive Prevalence Surveys
(see Section A6.3.2).
However, of the 20 European countries that had executed their own
WFS-type of fertility surveys, without financial or technical assistance from
the WFS secretariat, a number used non-probability samples, and even
among those that had adopted probability sampling, none computed sampling
errors (Kish, 1994).
households from whom household information was obtained and 5,000 “eligible”
women for whom detailed fertility particulars were recorded.
For processing and analyzing survey data, WFS adapted available computer
software and developed new programs, including “CLUSTERS”, for computing
sampling variances. In cases where there was only one first-stage
unit (fsu) in a stratum, as in the Kenyan Fertility Survey, some strata were
collapsed to provide two or more fsu’s in the re-constituted stratum (see
section 24.8 of this book).
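The collapsing of strata can be illustrated with a minimal sketch. It is not the CLUSTERS program, whose internals are not described here; it assumes the usual textbook collapsed-strata variance estimator, in which each re-constituted stratum holds two or more original strata, each represented by one expanded fsu total. The function name and the figures are hypothetical.

```python
def collapsed_strata_variance(groups):
    """Variance estimate of an estimated total when each original stratum
    has a single fsu and strata are collapsed into groups.

    groups: list of lists; each inner list holds the expanded (stratum-level)
    total estimates of the original strata placed in one collapsed group.
    """
    variance = 0.0
    for totals in groups:
        g = len(totals)
        if g < 2:
            raise ValueError("each collapsed group needs at least two strata")
        mean = sum(totals) / g
        # G/(G-1) times the sum of squared deviations from the group mean
        variance += g / (g - 1) * sum((t - mean) ** 2 for t in totals)
    return variance


# Hypothetical expanded totals: one collapsed pair and one collapsed triple.
example_groups = [[10.0, 14.0], [20.0, 26.0, 23.0]]
example_variance = collapsed_strata_variance(example_groups)
```

For a collapsed pair the contribution reduces to the squared difference of the two stratum totals. Because genuine differences between the collapsed strata enter the deviations, this estimator tends to overstate the variance.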
For additional information on WFS, copies of publications and data
files, contact International Statistical Institute, P.O. Box 950, 2270 AZ
Voorburg, Netherlands.
sion of issues, see DHS Sampling Manual and the articles by Aliaga and
Verma (1991) and by Thanh Le (1993)).
For data processing, including computing sampling variances, DHS developed
the software package “ISSA” (see Section 27.10.3 of this book).
An experiment was conducted in the DHS of Guatemala for interactive
processing of data (Ochoa et al.).
For copies of DHS publications and data files, contact Macro International/Demographic
and Health Surveys, 11785 Beltsville Drive, Suite 300,
Calverton, Maryland, 20705-3119, U.S.A.; phone: 01-301-572-0000; fax: 01-301-572-0999.
errors in household surveys had been published in 1982 (UN, 1982b) and
another study made on the computation of sampling errors (UN, 1993a),
but computation of sampling errors and indications of non-sampling errors
are not mandatory for the program.
For copies of publications and other information, contact Director, Statistical
Division, United Nations, New York, New York, 10017, U.S.A.;
phone: 01-212-963-4996; fax: 01-212-963-9851.
The survey data are processed using either commercial software, such
as SAS-PC and SPSS, or software specifically developed, such as IMPS or
ISSA (see Chapter 27).
Since the inception of the SDA program, 30 countries have received,
or have requested, assistance to establish national SDA programs, and of
these, 20 now have on-going programs. For copies of publications and other
information, contact World Bank, 1818 H Street N.W., Washington D.C.,
20233, U.S.A.
way, in three other Arab countries. The average external assistance for a
PAPCHILD survey was US$380,000.
A PAPCHILD survey is carried out using a stratified two-stage sample
of households. The prototypical plan contains 300 first-stage units (fsu’s)
and 6,000 households. The 300 fsu’s are “standard segments” derived from
census enumeration districts, which serve as the sampling frame. In some countries,
resources permitting, the second stage of sampling may be expanded to 18,000
households. Within a selected fsu, a sample of 20 households (second-stage
units, ssu’s) will be selected for interviewing (60 households for countries
opting for 18,000 total sample households). Individual interviews will relate
to all ever-married women aged under 54 and all children (under 5),
irrespective of whether their mothers were household members.
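The sample sizes quoted for the prototypical plan follow from the number of fsu’s and the fixed take of households per fsu. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def two_stage_household_total(n_fsu, households_per_fsu):
    """Total sample households in a two-stage design with an equal take per fsu."""
    return n_fsu * households_per_fsu


# Figures from the prototypical PAPCHILD plan described above.
standard_total = two_stage_household_total(300, 20)
expanded_total = two_stage_household_total(300, 60)
```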
Survey data are processed and analyzed, adapting available computer
software, such as ISSA (see Section 27.10.3). Sampling errors have been,
or are being, computed for major estimates.
For copies of publications and information, contact: Project Manager,
Pan-Arab Project for Child Development, League of Arab States, 22A Taha
Hussein Street, Cairo, Egypt; phone 202-340-4306; fax: 202-340-1422.
Table A6.1: Estimates and standard errors from some multi-country survey
programs
Children ever born Kenya 1977-8 WFS 8,100 3.896 0.060 1.3
(per woman)
Nepal 1981 CPS 5,880 3.28 0.072 2.2
ically closest to the starting household. This procedure of visiting the next
closest households and collecting information on all eligible individuals in
those households continued until the required seven subjects were studied
(for details, see Lemeshow and Stroh, 1988).
About 4,500 EPI surveys had been conducted in 122 countries throughout
the world between 1978 and 1992.
For obtaining estimates, relating both to children and adults, and their
sampling variances from EPI surveys, a software package, COSAS (COverage
Survey Analysis System), which is now in version 4.3, was developed (Desve,
Havreng, and Brenner, 1991).
Absent the total number of children and the number of persons in the
clusters, EPI surveys cannot be said to have adopted strictly probability
samples in selecting the second-stage sample of 7 children in each of the 30
selected clusters. However, a study with computer simulation models has led
to the conclusion that “the method appears quite useful when used for the
target areas as a whole, but could provide highly unacceptable estimates if
used for particular clusters or subgroups” (Levy and Lemeshow, ibid.).
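How an overall coverage estimate and its standard error might be obtained from a 30 x 7 cluster sample can be sketched as follows. This is not the COSAS software; it assumes an equal take of 7 subjects in every cluster and estimates the variance from the variation between cluster proportions. The function name and the data are hypothetical.

```python
import math


def epi_coverage_estimate(covered_per_cluster, take=7):
    """Overall proportion covered and its standard error from equal-take clusters.

    covered_per_cluster: number of covered (e.g. immunized) subjects found
    among the `take` subjects studied in each cluster.
    """
    k = len(covered_per_cluster)
    proportions = [c / take for c in covered_per_cluster]
    p = sum(proportions) / k
    # Between-cluster variance of the mean of the cluster proportions
    variance = sum((pi - p) ** 2 for pi in proportions) / (k * (k - 1))
    return p, math.sqrt(variance)


# Hypothetical extreme case: 15 clusters fully covered, 15 with no coverage.
p_hat, se_hat = epi_coverage_estimate([7] * 15 + [0] * 15)
```

With 30 clusters the variance has only 29 degrees of freedom, which is one reason estimates for individual clusters or small subgroups are unreliable, as the quoted conclusion warns.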
For further information and copies of publications, contact: EPI Project,
World Health Organization, CH 1211 Geneva-27, Switzerland.
Further reading
For the World Fertility Survey, see its Final Report and national reports. For the
World Bank sponsored Social Dimensions of Adjustment Surveys in Sub-Saharan
Africa, see Grootaert and Marchant (1991) and Delaine et al. (1992). For other
survey programs, see the references cited in the text. On the medicinal properties
of the “neem” (Melia Azadirachta) tree, referred to in Section A6.2.4, see “From
the Ancient Neem Tree, a New Insecticide,” New York Times, 5 June 1994, p. 49.
References
This list comprises publications listed under “Further Reading” at the end of
each chapter and others cited in the text. For the abbreviations of the names of
journals, organizations and programs, see the List of Abbreviations.
GAGE, T.J. (1978). “Theories differ on use of focus groups,” Advertising Age,
S-19, 20-22.
GALLUP, GEORGE (1948). A Guide to Public Opinion Polls. Princeton University
Press, Princeton.
GANGULY, AMALENDU N. and RANJAN K. SOM (1958). “A Note on the
Estimated Variances of Vital Rates in the National Sample Survey, Seventh
Round,” (Unpublished), Indian Statistical Institute, Calcutta (mimeographed).
GERLAND, PATRICK (1992). “Software development: Past, present and future
trends and tools.” IUSSP/NIDI Expert Group Meeting on Demographic
Software and Micro-Computing, The Hague, 29 June-3 July 1992.
GINI, C. and L. GALVANI (1929). “Di una applicazione del metodo rappresentativo
all’ultimo censimento italiano della popolazione,” Annali di Statistica,
6, 1-107.
GLASS, DAVID V. and EUGENE GREBENIK (1954). The Trend and Pattern
of Fertility in Great Britain: A Report on the Family Census of 1946, Part
I (Report). H.M.S.O., London.
GODAMBE, V.P. (1955). “A unified theory of sampling from finite populations,”
JRSS(B), 17, 269-278.
GODAMBE, V.P. and D.A. SPROTT (eds.) (1971). Foundations of Statistical
Inference. Holt, Rinehart and Winston, Toronto.
GODAMBE, V.P. and M.E. THOMPSON (1988). “On single stage unequal
probability sampling.” In Handbook of Statistics, Vol. 6, Sampling, edited
by P.R. Krishnaiah and C.R. Rao. Elsevier, Amsterdam, 111-124.
GOLDSTEIN, RICHARD (1990). “A review of Resampling software for MS-
DOS computers,” Proceedings of the Statistical Computing Section, ASA,
42, 86.
GOODMAN, R. and L. KISH (1950). “Controlled selection - a technique in probability
sampling,” JASA, 45, 350-372.
GOURIEROUX, C., A.M. DUSSAIX, J.C. DEVILLE and J.M.
GROSBARS (1987). Contribution dans Les Sondages par Jean-Jacques
Droesbeke, Bernhard Fichet and Philippe Tassi (éditeurs). Economica,
Paris.
GRAY, HENRY L. and W.R. SCHUCANY (1972). The Generalized Jackknife
Statistics. Marcel Dekker, New York.
GRAY, P.G. and T. CORLETT (1950). “Sampling for the Social Survey.”
JRSS(A), 113, 150-206.
GRIK, D.C., K. PARKER and G.M.B. HATEGIKAMANA (1987). “Integrating
quantitative and qualitative survey techniques.” Community Health Education,
7(3), 181-200.
HORVITZ, D.G., B.V. SHAH and W.R. SIMMONS (1967). “The unrelated
question randomized response model,” Proceedings of the ASA, Social Statistics
Section, 65-72.
HOSNI, E. (1975). “Analyse de l’effet de rétrospection (Méthode SOM),” As-Soukan:
étude de centre de recherches et d’études démographiques. Direction
de la statistiques, Rabat, Maroc, 3, 17-27.
I-CHENG, C., L.P. CHOW and R.V. RIDER (1972). “The randomized response
technique as used in the Taiwan outcome of pregnancy study,” SFP, 3, 265.
IDE, MITSURU and SHIGERU KAWASAKI (1991). “Some new attempts in
the data processing of the 1990 round of population censuses of Japan,”
ISI/IASS Booklet, Invited Papers, Cairo, 9-17 September 1991, 424-434.
ILO (1990). Surveys of Economically Active Population, Employment, Unemployment
and Underemployment. ILO Manual on Concepts and Methods.
Geneva.
IMPULSE RESEARCH CORP. (1993). Impulse Survey of Focus Facilities. Los
Angeles.
INDIA: OFFICE OF THE REGISTRAR-GENERAL (1955). Sample Census of
Births and Deaths in 1953-54, Uttar Pradesh. Ministry of Home Affairs,
New Delhi.
INDIA: OFFICE OF THE REGISTRAR-GENERAL (1970). Sample Registration
of Births and Deaths in India: Rural 1965-68. Ministry of Home
Affairs, New Delhi.
ISI (1964). “Proceedings of the 34th Session,” BISI, 40(1).
IUSSP (1959). Problems in African Demography: A Colloquium. Paris.
LEWONTIN, R.C. (1995). “Sex, Lies, and Social Sciences,” New York Review
of Books, 42(7), 24-29. (Review of Sex in the Bedroom... by Bullough,
The Social Organization of Sexuality... by Laumann et al., and Sex in
America... by Michael et al.)
LICHTY, TOM (1994). Desktop Publishing with Word for Windows (2nd ed.).
Ventana Press, Chapel Hill, North Carolina.
LINCOLN, F.C. (1930). “Calculating waterfowl abundance on the basis of banding
returns,” Circ. U.S. Department of Agriculture, No. 118.
LINDLEY, D.V. (1965). Introduction to Probability and Statistics from a Bayesian
Viewpoint, Pts. 1 and 2. Cambridge University Press, Cambridge, U.K.
LINDLEY, D.V. (1972). Bayesian Statistics: A review. Society for Industrial
and Applied Mathematics, Philadelphia.
LORD, M.G. (1994). “What that survey didn’t say,” Op-Ed, New York Times,
October 25, 1994.
LYBERG, LARS and DANIEL KASPRZYK (1991). “Data collection methods
and measurement error: An overview.” In Measurement Error in Surveys,
edited by P. Biemer et al., Wiley, New York, 237-258.
Preface, Sankhya, 20, 1-68, and (1961), Asia Publishing House, London,
and Statistical Publishing Society, Calcutta.
MAHALANOBIS, P.C. (1960). “A method of fractile graphical analysis.”
Econometrica, 28, 325-351. Reprinted in Sankhya, A, 23, 325-358.
MAHALANOBIS, P.C. (1966). “Some concepts of sample surveys in demographic
investigations,” World Population Conference, Vol. III. United Nations,
Sales No. 66.XIII.7, 246-250.
MAHALANOBIS, P.C. (1968). Sample Census of Area Under Jute in Bengal.
Statistical Publishing Society, Calcutta.
MAHALANOBIS, P.C. and J.M. SENGUPTA (1951). “On the size of sample
cuts in crop cutting experiments in India,” BISI, 33(2), 359-403.
MAMMEN, E. (1992). When Does Bootstrap Work? Asymptotic Results
and Simulations. Springer-Verlag, New York.
MARKET DYNAMICS (undated). Respondent Motivation: Los Angeles Focus
Groups: Final Report.
MARKS, ELI S., WILLIAM SELTZER and KAROL J. KROTKI (1974). Population
Growth Estimation: A Handbook of Vital Statistics Measurement.
Population Council, New York.
MARTIN, ELIZABETH (1993). “Response errors in survey measurements of
facts,” ISI/IASS Booklet, Invited Papers, Florence, 25 August-2 September
1993, 17-34.
MASTERS, WILLIAM H. and VIRGINIA E. JOHNSON (1966). Human Sexual
Response. Little, Brown: Boston.
MAURITANIA: NATIONAL STATISTICAL OFFICE and LEAGUE OF ARAB
STATES (1992). Mauritania: Maternal and Child Health Survey (1990-
1991), Principal Report. Nouakchott and Cairo.
McDANIEL, C. (1979). “Focus groups - their role in the marketing research
process,” Akron Business and Economic Review, 10(4), 14-19.
MERTON, ROBERT K. (1987). “The focussed interview and focus groups:
Continuities and discontinuities,” POQ, 51, 550-566.
MERTON, ROBERT K. and PATRICIA L. KENDALL (1946). “The focused
interview,” American Journal of Sociology, 51, 541-557.
MERTON, ROBERT K., MARJORIE FISKE and PATRICIA L. KENDALL
(1956). The Focussed Interview: A Manual of Problems and Procedures
(2nd ed., 1990). Free Press, New York and Collier Macmillan, London.
MICHAEL, ROBERT T., JOHN H. GAGNON, EDWARD O. LAUMANN and
GINA KOLATA (1994). Sex in America: A Definitive Study. Little Brown,
Boston.
MILLER, J. (1984). “A new survey technique for studying deviant behavior.”
Ph.D. Dissertation, Sociology Department, The George Washington University,
Washington, D.C.
MILLER, R.G. (1974). “The Jackknife - a review.” Biometrika, 61, 1-17.
POLITZ, ALFRED and SIMMONS, WILLARD (1949 and 1950). “An attempt
to get the ‘not at home’ into the sample without callbacks,” JASA, 44, 9-31,
and 45, 136-137.
POLLOCK, K.H., J.D. NICHOLS, J.E. HINES and C. BROWNIE (1990). “Statistical
inference for capture-recapture experiments,” Wildlife Monographs,
107, 1-97.
PONIKOWSKI, C.H., K.R. COPELAND and S.A. MEILY (1989). “Applications
for Touchtone Recognition Technology in Establishment Surveys,”
Proceedings of the Fourth Annual Research Conference, 1988, U.S. Bureau
of the Census. Washington, D.C.
POPULATION COUNCIL (1970). A Manual for Surveys of Fertility and Family
Planning: Knowledge, Attitudes, and Practice. New York.
POPULATION REPORTS (1981). “Special topics: Contraceptive Prevalence
Surveys: a new source of family planning data,” Population Reports, Series
M, No. 5. Johns Hopkins University, Population Information Program.
POPULATION REPORTS (1985). “Special topics: fertility and family planning
surveys: an update,” Population Reports, Series M, No. 8. Johns Hopkins
University, Population Information Program.
POPULATION REPORTS (1992). “Special topics: the reproductive revolution:
new survey findings,” Population Reports, Series M, No. 11. Johns Hopkins
University, Population Information Program.
PUBLIC HEALTH REPORTS (1990). Special issue, 105 (2).
SEN, A.R. (1971). “Some recent developments in waterfowl sample survey techniques,”
JRSS(C), 20, 139-147.
SEN, A.R. (1972). “Some nonsampling errors in the Canadian Waterfowl Mail
Survey,” Journal of Wildlife Management, 36, 951-954.
SEN, A.R. (1973). “Response errors in Canadian Waterfowl Surveys,” Journal
of Wildlife Management, 37, 485-491.
SEN, A.R. (1991). “Review of sampling techniques for estimation of marine fish
catch and effort in North America with particular reference to Hawaii,”
ISI/IASS Booklet, Invited Papers, Cairo, 9-17 September 1991, 506-529.
SIKES, O.J. (1993). “Appropriate action to narrow the KAP-gap.” In Family
Planning: Meeting Challenges: Promoting Choices, the Proceedings of the
IPPF Family Planning Congress, New Delhi, November 1962, edited by
Pramila Senanyake and Ronald L. Kleinman. Parthenon Publishing, Pearl
River, New York, U.S.A.
SINGH, DAROGA and F.S. CHAUDHARY (1986). Theory and Analysis of
Sample Survey Designs. Wiley Eastern, New Delhi.
SLAVSON, S.R. (1979). Dynamics of Group Psychotherapy. Jason Aronson,
New York.
SLONIM, MORRIS J. (1967). Sampling (3rd paperback printing). Simon and
Schuster, New York.
SMITH, T.M.F. (1976). Statistical Sampling for Accountants. Haymarket Publishing,
London.
SMITH, T.M.F. (1984). “Sample surveys, present position and potential developments:
Some personal views (with discussion),” JRSS(A), 147, 208-221.
SNEDECOR, GEORGE W. and WILLIAM G. COCHRAN (1967). Statistical
Methods (3rd ed.). Iowa State University Press, Ames, Iowa.
SOGUNRO, B.O. (1991). “Nonresponse in developing countries - evolving an
adequate approach to handle nonresponse,” ISI/IASS Booklet, Invited Papers,
Cairo, 9-17 September 1991, 242-262.
SOM, RANJAN K. (1958-9). “On sample design in opinion and marketing
research,” POQ, 32, 564-566.
SOM, RANJAN K. (1959). “Self-weighting sample design with an equal number
of ultimate stage units in each of the selected penultimate stage units,”
Calcutta Statistical Association Bulletin, 9, 59-66.
SOM, RANJAN K. (1965). “Use of interpenetrating samples in demographic
studies,” Sankhya (B), 27(3&4), 329-342.
SOM, RANJAN K. (1973). Recall Lapse in Demographic Enquiries. Asia Publishing
House, Bombay.
SOM, RANJAN K. (1993). “Use of microcomputers in national development.”
In Probability and Statistics, edited by S.K. Basu and B.K. Sinha, Narosa
Publishing, New Delhi, 287-295.
SOM, RANJAN K. (1994). “Role of Prof. Mahalanobis in the United Nations
and the International Statistical Institute.” In a volume under publication
by the Indian Statistical Institute.
SOM, RANJAN K., AJOY K. DE and NITAI C. DAS (1961). Report on Morbidity.
NSS Report No. 49. The Cabinet Secretariat, Government of India,
New Delhi.
SOM, RANJAN K., AJOY K. DE, NITAI C. DAS, B. TRIVIKRAMAN PILLAI,
HIRALAL MUKHERJEE and S.M. UMAKANTA SARMA (1961).
Preliminary Estimates of Birth and Death Rates and of the Rate of Growth
of Population, Fourteenth Round, July 1958-July 1959. NSS Report No.
48. The Cabinet Secretariat, Government of India, New Delhi.
SPEIZER, HOWARD and DOUG DOUGHERTY (1991). “Automating data
transmission and case management functions for a nationwide CAPI study.”
In Proceedings of the 1991 Annual Research Conference. U.S. Bureau of
the Census, Washington, D.C., 389-397.
STATISTICS CANADA (1986). Report of the Canadian Health and Disability
Survey 1983-1984. Social Trends Analyses Directorate, Ottawa, Canada.
STATISTICS CANADA (1991). Health and Activity Limitation Survey - 1991:
User’s Guide. Ottawa, Canada.
STATISTICS CANADA (1992). Report of the International Workshop on the
Development and Dissemination of Statistics on Persons with Disabilities,
October 13-16, 1992, Ottawa, Canada. Statistics Canada, Ottawa, and
the UN Statistical Division, New York.
STATISTICS CANADA (1994). Health and Activity Limitation Survey - 1991.
Making Disability Statistics Accessible: A Workshop on the 1991 Health
and Activity Limitation Survey (HALS). Ottawa, Canada.
STEPHAN, F.F. and McCARTHY, P.J. (1958). Sampling Opinions. Wiley,
New York.
STUART, ALAN (1987). Ideas of Sampling. Edwin Arnold, London (3rd ed.).
STYCOS, MAYONE J. (1981). “A critique of focus group and survey research:
the machismo case.” SFP, 12 (12), 450-456.
SUDMAN, SEYMOUR (1976). Applied Sampling. Academic, San Diego.
SUDMAN, SEYMOUR and NORMAN M. BRADBURN (1974). Response Effects
in Surveys: A Review and Synthesis. Aldine, Chicago.
SUDMAN, S. and NORMAN M. BRADBURN (1982). Asking Questions: A
Practical Guide to Questionnaire Design. Jossey-Bass Publishers, San
Francisco.
SUDMAN, SEYMOUR, MONROE J. SIRKEN and CHARLES D. COWAN
(1988). “Sampling rare and elusive populations,” Science, 240, 991-996.
SUKHATME, P.V. (1946a). “Bias in the use of small-size plots in sample surveys
for yield,” Current Science, 15, 119-120; Nature, 157, 630.
SUKHATME, P.V. (1946b). “Size of sampling unit in yield surveys,” Nature,
158, 345.
SUKHATME, P.V. (1947a). “Use of small size plots in yield surveys,” Nature,
160, 542.
SUKHATME, P.V. (1947b). “The problem of plot size in large-scale yield surveys,”
JASA, 42, 297-310, 460.
SUKHATME, P.V. and V.G. PANSE (1951). “Crop surveys in India II,” Journal
of the Indian Society of Agricultural Statistics, 3, 96-168.
SUKHATME, P.V. and G.R. SETH (1952). “Non-sampling errors in surveys,”
Journal of the Indian Society of Agricultural Statistics, 4, 5-41.
SUKHATME, PANDURANG V. and BALKRISHNA V. SUKHATME
(1970b). “On some methodological aspects of sample surveys of agriculture
in developing countries.” In New Developments in Survey Sampling, edited
by Norman L. Johnson and Harry Smith, Jr. Wiley, New York.
SUKHATME, PANDURANG V., BALKRISHNA V. SUKHATME, SHASHIKALA
SUKHATME and C. ASOK (1984). Sampling Theory of Surveys
with Applications (3rd ed.). Iowa State University Press, Ames, Iowa, and
Indian Society of Agricultural Statistics, New Delhi; first edition (1953) by
Pandurang V. Sukhatme, FAO, Rome, and Indian Society of Agricultural
Statistics, New Delhi; second edition (1970a) by Pandurang V. Sukhatme
and Balkrishna V. Sukhatme, FAO, Rome, and Asia Publishing House,
Bombay.
SULLIVAN, K.M., R.R. FICHNER, J. GORSTEIN and A.G. DEAN (1990).
“The use and availability of Anthropometry software.” Food and Nutrition
Bulletin, 12(2), 116-119. UN University, Tokyo.
SULLIVAN, K.M. and J. GORSTEIN (1990). ANTHRO Version 1.01: Software
for Calculating Pediatric Anthropometry. Division of Nutrition, Centers
for Disease Control, Atlanta, Georgia, U.S.A., and Nutrition Unit, WHO,
Geneva.
SURVEY RESEARCH CENTER (1983). General Interview Techniques: A Self-
Instructional Workbook for Telephone and Personal Interview Training.
University of Michigan, Ann Arbor.
SUYONO, HARYONO, NANCY PIET, FARQUHAR STIRLING and JOHN
ROSS (1981). “Family planning attitudes in urban Indonesia: Findings
from focus group research,” SFP, 12 (12), 433-442.
SZAMETIAT, K. and SCHAFFER, K.A. (1964). “Imperfect frames in statistics
and the consequences for their use in sampling,” BISI, 40, 517-538.
SZYBILLO, G.J. and R. BERGER (1979). “What advertising agencies think of
focus groups,” Journal of Advertising Research, 19(3), 29-33.
TAVRIS, CAROL and SUSAN SADD (1975). The Redbook Report on Female
Sexuality. Delacorte, New York.
THANH LE (1993). “Sampling Practice in the DHS,” Proceedings of the 49th
session of the ISI, Florence, Italy, 1993. (Abstract).
THIONET, P. (1953). Applications des Methodes de Sondage. INSEE, Paris.
THIONET, P. (1958). La Théorie des sondages. INSEE, Paris.
THOMPSON, STEVEN K. (1992). Sampling. Wiley, New York.
TIPPETT, L.H.C. (1952). The Methods of Statistics (4th ed.). Williams and
Norgate, London.
VANCE, L.L. and J. NETER (1950). Statistical Sampling for Auditors and
Accountants. Wiley, New York.
VELU, R. and G.M. NAIDU (1988). “A review of current survey sampling methods
in marketing research (Telephone, Mail Intercept, and Panel Surveys).”
In Handbook of Statistics, Vol. 6, Sampling, edited by P.R. Krishnaiah and
C.R. Rao. Elsevier, Amsterdam, 533-554.
VERMA, V. (1992). “Household surveys in Europe: Some issues in comparative
methodologies.” Seminar: International Comparison of Survey Methodologies,
Athens.
YATES, FRANK (1981). Sampling Methods for Censuses and Surveys (4th ed.).
Edwin Arnold, London and Oxford University Press, New York.
YERUSHALMY, J. and J. NEYMAN (1947). PHR, 67.
YULE, G. UDNY and MAURICE G. KENDALL (1950). An Introduction to
the Theory of Statistics (14th ed.). Griffin, London, and Hafner, New York.
1. Following Example 2.1, ȳ = 2.91 ± 0.2451 acres per plot; y*_0 = 291.0 ± 24.5
acres; CV for both, 8.42%.
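The coefficient of variation quoted here is the estimated standard error expressed as a percentage of the estimate. A minimal sketch, using the estimated total and its standard error from the answer above; the function name is illustrative.

```python
def cv_percent(estimate, standard_error):
    """Coefficient of variation: standard error as a percentage of the estimate."""
    return 100.0 * standard_error / estimate


# Estimated total of 291.0 acres with standard error 24.5 acres.
cv_total = cv_percent(291.0, 24.5)
```

Because the total is a constant multiple of the mean per plot, the mean and the total share the same CV.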
2. Following Example 2.3, (a) 940,597 ± 19,128; the 95% probability limits are
903,106 and 978,088; (b) 12.49±0.254 cattle per farm; the 95% probability
limits are 11.99 and 12.99.
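The 95% probability limits are the estimate plus and minus 1.96 times its estimated standard error, assuming approximate normality of the estimator. A short sketch with an illustrative function name:

```python
def probability_limits(estimate, standard_error, z=1.96):
    """Lower and upper limits: estimate -/+ z times the standard error."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


total_limits = probability_limits(940_597, 19_128)   # part (a)
mean_limits = probability_limits(12.49, 0.254)       # part (b)
```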
5. Using the method of section 2.13, the number of households possessing
radios is 40 ± 13; the number of persons in these households is 175 ± 60.
6. Following the methods of section 2.13, and noting that the finite sampling
correction is (1 − f), where f = n/N (N = 175 and n = 60), the estimated
incidence of HIV seroconversion is 0.15 with standard error of 0.038.
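The standard error can be checked with a short sketch. It assumes the variance estimator (1 − f) p(1 − p)/(n − 1) for a proportion from a simple random sample drawn without replacement; that form is an assumption about section 2.13, adopted because it reproduces the quoted 0.038. The function name is hypothetical.

```python
import math


def proportion_se_fpc(p, n, N):
    """Standard error of a sample proportion with the finite sampling correction."""
    f = n / N  # sampling fraction
    return math.sqrt((1 - f) * p * (1 - p) / (n - 1))


se_incidence = proportion_se_fpc(0.15, 60, 175)
```

Here f = 60/175 is about 0.34, so ignoring the correction would overstate the standard error noticeably.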
8. For (a) using the methods of section 2.9 and considering the sample as
having been drawn with replacement, (i) 25,902 ± 1797; (ii) 6093 ± 650;
(iii) 0.2331 ± 0.01872. Here n = 43, N = 325. For (b) use equation (2.74)
of section 2.13, which gives the standard error of the proportion absent as
0.007224. Here n has to be taken as 3427.
9. Considering the sample as with replacement and following Example 2.2, (a)
2460 ± 175; (b) $47,880 ± 1215; (c) $18,143 ± 1164; (d) 3.73 ± 0.2657; (e)
$27.49 ± $1.76; (f) $7.38 ± $0.534; (g) $19.46 ± $1.476; (h) 0.3789 ± 0.0758.
1. y*_R = 962,096 ± 14,218, CV 1.48%; y*_0 = 943,609 ± 19,188, CV 2.03%.
2. 961,348 ± 12,012, CV 1.25%.
3. 959,620 ± 14,081, CV 1.47%.
4. 957,579 ± 11,349, CV 1.19%.
1 The sign ± after an estimate indicates the estimated standard error of the estimate.
606 Answers to Exercises
1. Following the method of Example 4.1, the estimated total area under wheat
is 9855 ± 238 acres, CV 2.4%.
2. Using equation (7.8), and noting that the permissible margin of error d =
0.1P, where P is the universe proportion, the number of wells ranges from 44 for
P = 0.9 to 400 for P = 0.5. For details, see the reference.
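The two sample sizes can be reproduced with a short sketch. It assumes that equation (7.8) has the familiar form n = z²P(1 − P)/d² and that the normal deviate is taken as z = 2; both are assumptions, adopted because they reproduce the quoted figures. The function name is illustrative.

```python
def sample_size_for_proportion(P, relative_error=0.1, z=2.0):
    """Sample size needed to estimate a proportion P within d = relative_error * P."""
    d = relative_error * P
    return z ** 2 * P * (1 - P) / d ** 2


n_at_half = sample_size_for_proportion(0.5)
n_at_ninety = sample_size_for_proportion(0.9)
```

With the margin of error fixed relative to P, the required sample size falls as P rises, because (1 − P)/P decreases.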
3. See Exercise 2, using P = 0.5, n = 400 persons.
4. In Example 2.2, the CV of the estimator obtained from 20 sample households
was 0.0863. Using equation (7.22), the required sample size is 60
households.
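The scaling behind this answer can be sketched under two assumptions that are not stated in this section: that the CV of the estimator varies inversely with the square root of the sample size, and that the desired CV is 5% (a hypothetical target chosen here because it reproduces the quoted 60 households). The function name is illustrative.

```python
import math


def required_sample_size(pilot_n, pilot_cv, target_cv):
    """Scale a pilot sample size so that the CV falls to the target value,
    assuming the CV is proportional to 1/sqrt(n)."""
    return math.ceil(pilot_n * (pilot_cv / target_cv) ** 2)


n_required = required_sample_size(20, 0.0863, 0.05)  # 0.05 is an assumed target
```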
5. Assuming normality and using Table 7.2 (last line), the estimated s.d. is
15, whereas the true s.d. is 16.1.
1. As students are often asked to analyze the data of a stratified srs in the
above form, the required computations are given in Table 1. The estimated
standard errors of the stratum means will be obtained on dividing column
(14) by N_h.
2. For (a) use equations (10.5(d)) and (10.15); y = 1,353,572 households, CV
9.12%. For (b) use equations (10.55) and (10.15); gain 407%.
3. From equation (10.51), average household size = 3022/598.8 = 4.95; from
equation (10.52), its standard error is 0.1306 and CV 2.64%. Note that in
this case, the use of separate ratio estimates for the estimate of a ratio has
not led to an improvement.
Table 1: Computations for Chapter 10, Exercise 1
(%)
(1) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15)
I 4.05 2572 383,161 2504 3075 23,256 0.132210 53,310 231 9.0
II 10.31 5878 2,024,929 14,673 9580 18,906 0.506696 164,626 406 6.9
III 15.29 7263 3,090,564 26,874 7208 13,110 0.549773 124,043 352 4.8
IV 23.16 7017 2,859,481 39,171 12,248 5256 2.330295 213,942 462 6.6
V 28.71 2565 363,609 17,315 990 420 2.357833 18,676 137 5.2
t Not additive.
Stratum  N_h  n_h  Σy_hi  Σy_hi²  r_h  p_h
(1)  (2)  (3)  (4)  (5)  (6)  (7)
1-5  435  25  0  0  0  0
6-20  519  26  3  9  1  0.0385
21-50  357  16  6  36  1  0.0625
51-150  519  17  159  3969  8  0.4706
151-300  400  26  762  38,510  20  0.7692
300  266  15  1371  164,737  15  1
t Not additive.
4. For convenience and checking, Table 2 is provided with additional columns
(4)-(7); y_hi is the wheat acreage of the ith farm (i = 1, 2, ..., n_h) in the
hth stratum (h = 1, 2, ..., 6), r_h is the number and p_h the proportion of
farms growing wheat in the hth stratum.
For (i), using the formulae of sections 2.9 and 2.12, the estimated total area
of wheat = (2496 × 2301/125) = 45,946 ± 8142 acres, CV 17.7%. Estimated
number of farms growing wheat Np = 2496 × 45/125 = 899, with estimated
standard error of N√[p(1 − p)/(n − 1)] = 109, CV 12.1%.
For (ii), using the formulae of section 10.7(1), y = Σ N_h ȳ_h = 41,106 ± 4444
acres, CV 10.8%; using the results of section 10.3, the estimated number of
farms growing wheat Np = Σ N_h p_h = 860, with estimated standard error
√[Σ N_h² p_h(1 − p_h)/(n_h − 1)] = 79, CV 10.2%. Note the gain in efficiency
in stratification after sampling.
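Part of this arithmetic can be checked with a short sketch. The stratum figures are those of Table 2 (number of farms N_h, sample farms n_h, total sample wheat acreage, and number of sample farms growing wheat). The variable names are illustrative, and the stratified standard error assumes the form √[Σ N_h² p_h(1 − p_h)/(n_h − 1)].

```python
import math

# (N_h, n_h, sum of sample wheat acreage, sample farms growing wheat) per stratum
strata = [
    (435, 25, 0, 0),
    (519, 26, 3, 1),
    (357, 16, 6, 1),
    (519, 17, 159, 8),
    (400, 26, 762, 20),
    (266, 15, 1371, 15),
]

N = sum(s[0] for s in strata)      # total farms
n = sum(s[1] for s in strata)      # total sample farms
y_sum = sum(s[2] for s in strata)  # total sample wheat acreage
r = sum(s[3] for s in strata)      # sample farms growing wheat

# (i) Treating the sample as an unstratified simple random sample
total_unstratified = N * y_sum / n
farms_unstratified = N * r / n

# (ii) Stratified estimate of the number of farms growing wheat
farms_stratified = sum(Nh * rh / nh for Nh, nh, _, rh in strata)
var_farms = sum(
    Nh ** 2 * (rh / nh) * (1 - rh / nh) / (nh - 1) for Nh, nh, _, rh in strata
)
se_farms_stratified = math.sqrt(var_farms)
```

Strata in which no sample farm, or every sample farm, grows wheat contribute nothing to the estimated variance of the stratified count.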
6. Follow the methods of sections 10.4 and 10.5, noting that the finite sampling
corrections (1 − f_h) = 0.5, where f_h = n_h/N_h = 0.5, are not negligible and have,
therefore, to be applied in estimating the variances. Note, further, that since one
hospital is selected in a sample county, there is no need to assign
a subscript for the sample hospital. Let x denote hospital beds and y
AIDS admissions; then x*_hi = N_h M_hi x_hi and y*_hi = N_h M_hi y_hi, where M_hi
is the number of hospitals in the ith selected county and x_hi and y_hi are
respectively the number of beds and the number of AIDS admissions in the
one selected hospital in the ith selected county. Arrange the computation
as in the tables in Example 10.2. The following are the results: (a) unbiased
estimate of the total number of AIDS admissions: 1096, s.e. 308; proportion
of AIDS admissions: 0.4769, s.e. 0.06299; (b) separate ratio estimate of
AIDS admissions: 1170, s.e. 506; combined ratio estimate: 1192, s.e. 157.
1. Use equations (12.6) and (12.27); the results are given in Table 12.6.
2. Using the Neyman allocation (equation (12.10)), with S_h² = P_h(1 − P_h) and
n = 2000, the allocations are 1222, 167 and 611 persons.
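Neyman allocation for a proportion can be sketched as follows. The exercise data are not reproduced in this section, so the figures below are hypothetical and do not recompute the allocations of 1222, 167 and 611; the sketch assumes that the stratum standard deviation is taken as √[P_h(1 − P_h)] and that n_h is proportional to N_h S_h.

```python
import math


def neyman_allocation(n, stratum_sizes, stratum_proportions):
    """Allocate a total sample n to strata in proportion to N_h * S_h,
    with S_h = sqrt(P_h * (1 - P_h)) for a proportion."""
    weights = [
        Nh * math.sqrt(Ph * (1 - Ph))
        for Nh, Ph in zip(stratum_sizes, stratum_proportions)
    ]
    total_weight = sum(weights)
    return [n * w / total_weight for w in weights]


# Hypothetical strata of equal size with proportions 0.5 and 0.1.
example_allocation = neyman_allocation(800, [1000, 1000], [0.5, 0.1])
```

In practice the fractional allocations are rounded so that they add up to n. Strata with proportions near 0.5 receive relatively more of the sample than strata of the same size with proportions near 0 or 1.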
1. Following the method of section 16.3.1, the estimated total number of cattle
is 28,421 ± 2899, with CV 10.20%.
2. Following section 16.3.2, the estimated total number of cattle is 24,188
with CV 9.79%. For details of computation, see the reference.
2. Following the same method as for Example 17.2, the estimated variance of
the total income = 3,180,480 + 653,049 = 3,833,529.
3. (a) Following the same methods as for Example 17.1, V_1 = 0.0788, V_2 =
2.1374, and the unbiased variance estimate of the mean = 0.000788 +
0.0021374 = 0.0029254.
(b) From equation (17.3), expected variance = 0.0843766.
A
Abernathy, J.R., 415
Ackoff, R.L., 23
Acsádi, Georges, 480, 557
Adlaka, Arjun L., 448
Afghanistan, 563
African Development Bank, 569
AGFUND, 570
Alder, James E., 413
Algeria, 561
Aliaga, Alfredo, 568
Anderson, Dallas W., 401
Anderson, J.E., 567
Appel, Martin V., 4, 96
Arbis, G., 502
American Statistical Association, 1, 18
Ardilly, Pascal, 23
Asok, C., (in Sukhatme et al., 1984), 22
Australia, 2, 486, 496
Azorin, F., 23
B
Bailar, Barbara, 428, 447, 455
Bailey, N.T.J., 399
Baker, Reginald P. (in Bradburn et al.), 490
Balaban, V., 429
Bangalore (India), 446
Bangladesh, 412, 419, 563
Barnett, Vic, 22
Bayes, T., 409
Behmoiras, J.P., 176
Bellhouse, D.R., 21, 89
Benjamin, Bernard, 427, 441
Berger, R., 413
Bershad, M.A., 438, 442
Bethlehem, Jelke G., 489, 495
Bhattacharyya, A., 539, 546, 556
Bhattacharya, Bhairab C., 422
Bhutan, 560
Biemer, Paul P. (also in Groves et al.), 23, 428, 438, 439, 455
Billard, Lynne, 406
Binder, A., 397
Blalock, Hubert M., 22
Blanc, Robert, 482
Blankenship, Albert B., 23
Bochtler, Erwin, 23
Bombay (India), 421
Booth, Heather (in Adlaka et al.), 448
Boswell, M.T., 399
Bourke, Patrick D., 415
Bowley, A.L., 211
Bradburn, Norman M., 482, 490
Brazil, 492, 566, 567
Brenner, Eric, 572
Brewer, K.R.W., 112
Brick, J.M., 406
Brookes, B.C., 427
Brown, S., 418
Brownie, C. (in Pollock et al.), 399
Brunei, 2
Buckland, William R., 22, 427
Bullough, Vern L., 482
Burdon, J. (in Cox et al.), 413
Burnham, K.P. (in Boswell et al.), 399
Start of Citation[PU]M. Dekker[/PU][DP]1996[/DP]End of Citation
Bush, Carolee, 412, 413
C
Cable, Gloria, 498
Calder, Bobby J., 413
Cameroon, 390, 397, 461-464, 573
Canada, 2, 416, 426, 440, 449, 486, 489, 496, 557, 568
Cantor, David C., 491
Cantril, H., 23
Carlson, Beverley A., 418, 419
Casley, D.J., 482
Caspar, Robert A. (in Droitcour et al.), 415
Chad, 447
Chakravarti, I.M., 22, 67, 221, 269, 270, 299, 353, 376-377, 380, 385
Chamberlain, Kathleen, 485, 502
Chamie, Mary, 417, 418
Chandra Sekar, C., 447
Chatterjee, S., 210
Chaudhary, F.S., 22, 74, 78, 112, 267, 287, 299, 306, 351, 380, 385, 394, 397, 408, 409, 455
Chaudhuri, Arijit, 22, 66, 112, 394, 397, 409, 410, 415, 455
Chinappa, Nanjamma, 73, 556
Chintakananda, A., 448
Chow, L.P., 415
Clausen, J.A., 453
Clayton, R.L., 498
Coale, Ansley J., 441
Cochran, William G., 22, 24, 66, 68-69, 74, 77, 78, 89, 123, 128, 130-131, 133, 134, 139, 155, 183, 204, 209,
210, 221, 267, 287, 299, 300, 380, 385, 394, 397, 401, 403, 408, 409, 437, 438, 441, 444, 451, 455, 483
Coëffic, Nicole, 3
Cole, Robert A., 497
Coli, M., 502
Collins, Martin, 23, 572
Columbia Office of Radio Research, 411
Columbia Bureau of Applied Social Research, 411
Copeland, K.R., 498
Corlett, T., 23, 557
Cowan, Charles D., 401
Cox, Keith K., 411, 413
Cramér, H., 16
Creighton, K., 496
Croft, Trevor, 485
Curran, James W. (in Pappaioanou, et al. (1990a,b)), 421
Cyert, R.M., 22, 556
Cyprus, 2
D
Dalenius, Toré, 22, 214, 415, 557
Das, N.C., 167, 449, 557
Dasgupta, Ajit, 428
Dataquest, 500
David, H.A., 22
De, Ajoy K., 167, 557
Dean, A.G., 419
DeGraft-Johnson, K.T., 560, 562
Dekker, Arij L., 485, 489
Delaine, Ghislaine, 572
De La Macorra, Luis (in Folch-Lyon, et al.), 412
Demery, Lionel (in Delaine, et al.), 572
Deming, W. Edwards, 22, 24, 25, 66, 78, 123, 124, 133, 139, 221, 267, 299, 402, 403, 423, 429, 434, 446, 447,
451, 457, 474, 556, 557
Deroo, Marc, 23
Desabie, M.J., 23
Desvé, Gilles, 572
Deville, J.C. (in Gourieroux et al.), 23
Dick, W.F.L., 427
Dillman, Don A., 482, 557
Discover, 415
Dondero, Timothy J. Jr. (in Pappaioanou, et al. (1990a and b)), 421
Dougherty, Doug, 491
Droitcour, Judith, 415
Duncan, G., 397
Dubois, Jean-Luc (in Delaine et al.), 572
Dupač, Václav, 22, 398, 399
Durbin, J., 406
Dussaix, Anne-Marie, 23
Dutka, Salomon, 23
Duza, M. Badrud, 412
E
Economist, 406
Efron, Bradley, 405, 406
Egypt, 2, 397, 416, 418
Ekman, G., 214
El-Badry, M., 453
Ellis, Carlos, 489, 490, 492
El Salvador, 567
England and Wales, 556
ENIAC, 487
Ericson, W.A., 410
Espana, Eduardo Garcia, 482
F
FAO, 23, 418, 482
Fellegi, Ivan P., 438, 489
Feller, W., 410
Ferber, Robert, 1, 18
Fern, Edward F., 413
Fichner, R.R. (in Sullivan, et al.), 419
Fienberg, S.E., 23, 482
Finkner, A.L., 482
Fisher, R.A., 50, 531, 532
Fiske, Marjorie, 411
Fitzpatrick, T.B., 408
Folch-Lyon, Evelyn, 412
Ford, R.N., 453
Foreman, E.K., 78, 315, 351, 376, 380, 385, 394, 397, 399, 409, 455, 482
Forsyth, Barbara H., 413
Fowler, F., 482
France, 2, 489, 565
France: Ministère de la France d'outre-mer, 440
Frankel, Lester R., 23, 439
Frankel, Martin R., 490
Freeman, Michael, 412, 413, 557
Frey, James H., 482, 557
Frost, John F., 412
Furrie, Adele D., 417
G
Gage, T.J., 413
Gagnon, John H. (in Laumann et al. and Michael et al.), 483
Gantt, William, 465
Gallup, George, 23
Galvani, L., 25
Ganguly, Amalendu N., 374
Gates, C.E., 399
Gebhard, Paul H. (in Kinsey et al. (1953)), 482
George, J. Richard (in Pappaioanou, et al. (1990b)), 421
Gerland, Patrick, 486
Germany, 397, 450
Ghana, 560
Gini, C., 25
Glass, David V., 454
Glasser, G.J., 474
Godambe, V.P., 22, 23, 112
Goldstein, Richard, 406
Gong, G., 406
Goodchild, M.F., 502
Goodman, R., 408
Gorstein, J., 419
Gourieroux, C., 23
Gray, Henry L., 406, 407
Gray, P.G., 23, 557
Grady, George F. (in Pappaioanou et al. (1990b)), 421
Grdjic, Branko (in Delaine, et al.), 570
Grebenik, Eugene, 454
Greece, 404, 556, 557
Gregoire, Timothy G. (in Schreuder et al.), 482
Greenberg, B.G., 415
Grik, D.C., 413
Grootaert, Christiaan, 572
Grosbras, Jean-Marie, 23
Groves, Robert M., 23, 438, 455
Guatemala, 568
Guerney, Margaret, 214
Gulf Co-operation Council States (GCCS), 419, 570
H
Hájek, Jaroslav, 22, 23, 398, 399
Haldane, J.B.S., 400
Hammersley, J., 399
Hanif, M., 112
Hannox, W. Harry (in Pappaioanou et al. (1990b)), 421
Hansen, Morris H., 5, 22, 24, 66, 78, 89, 95, 96, 123, 139, 155, 183, 214, 221, 267, 287, 299, 306, 310, 351, 376,
380, 385, 394, 397, 404, 405, 408, 438, 442, 450, 455
Harewood, John, 556
Harris, F.F., 557
Hartley, H.O., 454
Hategikamana, G.M.B. (in Grik et al.), 413
Havreng, Jean François, 572
Hedayat, A.S., 22, 66, 78, 112, 123, 139, 155, 183, 204, 221, 267, 287, 299, 394, 401
Hendricks, Walter A., 22, 295, 299, 453, 454
Herberger, L., 184
Hess, Irene, 408
Higginbotham, J.R., 411, 413
Hill, Christopher (in Delaine, et al.), 572
Hines, J.E. (in Pollock et al.), 399
Hite, Shere, 483
I
I-Cheng, C., 415
Ide, Mitsuru, 492
Illinois Board of Higher Education, 2
ILO, 420
Impulse Research Corp., 411
India, 2, 24, 310, 394, 402, 416, 440, 443, 446, 447, 464, 474, 476, 477, 539-558, 563, 568
India, Central Statistical Organization, 540
India, Government, 149, 428, 446, 539, 540
India, NSS Organization, 310, 429, 457
India, Office of the Registrar-General, 541, 556
Indian Council of Agricultural Research, 171, 556
Indian Statistical Institute, 442, 540, 542, 556, 558
Indonesia, 2
Institute for Resource Development, 493
International Development Research Council, Canada, 565
International Statistical Institute, 481, 563
IUSSP, 25
J
Jabine, T.J., 483
Janus, Cynthia L., 483
Janus, Samuel S., 483
Japan, 496, 565
Japan, Statistical Center, 492
Japan, Statistical Bureau, 492
Jaworski, John, 418, 419
Jemiai, Hedi, 419
Jessen, R.J., 22
Johnson, D.R., 413
Johnson, Norman L., 22, 410
Johnson, P.O., 22
Johnson, Virginia E., 483
Jones, D.C., 23
Jordan, 418, 572
K
Kalton, Graham, 21, 22, 23, 310, 397, 401, 456, 457, 557
Kalsbeek, William D., 457
Kasprzyk, D., 397, 491, 498
Kawasaki, Shigeru, 492
Kendall, Maurice G., 22, 50, 427
Kendall, Patricia L., 411, 413
Kenya, 566, 572
Kersemajers, F. (in van Bastelaer et al.), 491
Khalifa, Atef, 419
Kinsey, Alfred C., 483
Kinshasa (Zaire), 458
Kiser, C.V., 427
Kish, Leslie J., 22, 23, 24, 66, 78, 89, 112, 119, 123, 125, 138, 139, 221, 267, 299, 351, 376, 380, 394, 397, 403,
405, 408, 409, 455, 556, 564, 571
Klar, J., 571
Klinger, Andreas, 482, 557
Knodel, John, 413
Knoop, H., 456
Kolata, Gina (in Michael et al.), 483
Koop, J.C., 403, 445
Korea, South, 566, 572
Krishnaiah, P.R., 22
Krotki, Karol J. (also in Marks et al.), 22, 448
Krupka, K., 557
L
Laha, R.G. (in Chakravarti et al.), 22, 67
Lahiri, D.B., 97, 403
Landman, C. (in Creighton et al.), 496
Laplante, M., 418
Laumann, Edward O., 483, 557
Lauriat, P., 448
Lavrakas, Paul J., 482
League of Arab States, 570, 571
Lemeshow, Stanley, 23, 67, 139, 184, 187, 401, 482, 571, 572
Lepage, Raoul, 406
Leslie, P.H., 399
Lessler, Judith T., 413, 455
Levy, Paul S., 23, 67, 139, 184, 187, 401, 482, 571, 572
M
Macro International, 495, 557, 567, 568
Macura, M., 429
Madow, William G. (also in Hansen et al. (1953)), 22, 457
Maguire, D.J., 502
Mahalanobis, P.C., 171, 214, 402, 403, 428, 443, 444, 455, 482, 539, 556
Malawi, 2, 448
Mammon, E., 406
MapInfo Co., 500
Marchant, Timothy (also in Delaine et al.), 572
Market Dynamics, 413
Marks, Eli S. (also in Hansen et al. (1951)), 22, 438, 448
Martin, Clyde E. (in Kinsey et al. (1948) and (1953)), 483
Martin, Elizabeth, 455
Masters, William S., 483
Matchett, S. (in Creighton et al.), 496
Mathiowetz, Nancy A. (in Biemer et al.), 438
Mauldin, W. Parker (also in Hansen et al. (1951)), 438
Mauritania, 572
McCarthy, P.J., 23
McDaniel, C., 413
McKay, Andrew (in Delaine et al.), 572
Meily, S.A., 498
Mendenhall, William (in Scheaffer et al.), 22
Merton, Robert K., 411, 413
Messley, James T. (in Groves et al.), 23, 439
Mexico, 561
Michael, Robert T. (also in Laumann et al.), 483, 557
Michaels, Stuart (also in Laumann et al. and Michael et al.), 483
Microsoft Corp., 465, 562
Miller, J., 415
Miller, R.G., 407
Monroe, John, 482
Morgan, David L., 413
Morin, Hervé, 23
Morocco, 2, 447
Moser, Claus A., 22, 23, 310, 397, 447, 454, 455, 557
Mosteller, F., 483
Mukherjee, Rahul, 415
Mukherjee, Hiralal, 167
Murthy, M.N., ix, 22, 24, 66, 73, 74, 78, 89, 95, 96, 112-113, 123, 131, 139, 155, 167, 183, 185, 197, 204, 221,
226, 235, 267, 287, 299, 306, 366-367, 385, 394, 397, 403, 455, 482, 556
Myers, Robert J., 448
Mysore (India), 149, 446
N
Nadot, Robert, 447
Nag, Moni, 412
Naidu, G.M., 23
Namboodiri, Krishnan, 22
Nanjamma, N.S., 73
National Opinion Research Center (NORC), University of Chicago, 483
Nepal, 2, 572
Neter, J., 22, 457
Netherlands, 2
Netherlands, Central Bureau of Statistics, 416, 417, 418, 486, 491, 565
New York City, 421, 439
New York Times, 406, 483, 572
Neyman, J., 210, 426
Nichols, J. D. (in Pollock et al.), 399
Nichols, William T. II (in Groves et al.), 23, 439
Nisselson, H., 443, 449
Nizeto, B. (in Krupka et al.), 557
North Carolina, 453-454
Novello, Antonia (in Pappaioanou et al. (1990b)), 421
Novick, David G., 497
O
Ochoa, Luis Hernandez, 568
Olkin, I. (in Madow et al.), 457
Olsen, Randall J., 491
Onorato, Ida M. (in Pappaioanou et al. (1990a)), 421
P
Pakistan, 2, 448
Pakistan, Institute of Development Economics, 556
Panama, 2
Panse, V.G., 22, 482, 556
Pappaioanou, Marguerite, 421
Papua and New Guinea, 2
Paraguay, 567
Parker, K. (in Grik et al.), 413
Parker, Roger C., 502
Parsley, Teressa L. (in Droitcour et al.), 415
Parten, M.B., 23
Pathak, P.K., 66
Patil, G.P. (in Boswell et al.), 399
Payne, S.L., 23
Pergamit, Michael R. (in Bradburn et al.), 491
Peru, 2, 402
Petersen, Lyle R. (in Pappaioanou et al. (1990a)), 421
Piet, Nancy, 412
Petkunas, Tom (in Suyono et al.), 498
Philippines, 2
Pillai, B. Trivikraman, 167
Politz, Alfred, 454
Pollock, K.H., 399
Pomeroy, Wardell B. (in Kinsey et al. (1948) and (1953)), 483
Ponikowski, C.H., 498
Population Council, 412, 482
Population Reports, 566, 567, 571
Pritzker, L., 5, 450
Public Health Reports, 422
Q
Queens College, New York City University, 480
Quenouille, M.H., 73, 406
R
Raj, Des, 95, 557
Ramachandran, K.V., 443
Ramsey, F.L., 399
Rand Corporation, 50
Rao, C.R., ix, 22, 25, 403
Rao, J.N.K., 406
Rao, P.S.R.S., 78
Rao, T.J., 73
Reinhards, J.N. (in Krupka et al.), 483
Research Triangle Institute, 492
Reynolds, F.D., 413
Rhind, D.W., 502
Rice, S.C., 491
Rider, R.V., 416
Rich, Bruce, 561
Riedel, D.C., 408
Ries, P., 418
Robinson, David, 571
Robinson, Paul, 484
Rojas, Guillermo, 491
Roosevelt, F.D., 5
Ross, John (in Suyono et al.), 412
Rotschild, B.B., 491
Rosander, A.C., 23, 557
Round, Jeffery (in Delaine et al.), 572
Rowe, B., 491
Rowe, Errol, 498
Roy, A.S., 482, 539, 546, 556
Roy, J. (in Chakravarti et al.), 22, 67
Royer, Jacques, 482
Rubin, Donald R., 457
S
Sadd, Susan, 483
Salant, Priscilla, 482
Sampford, M.R., 22, 112-113, 401
Sampson, Peter, 413
Sanchez, Carolyn D. (in Pappaioanou et al. (1990a)), 421
Sanchez-Crespo, J.L., 23
San Francisco, 421
Sanderson, F.H., 482
Sanwal, Mukul, 487, 502
Saris, W.E., 498
Sarma, S.M. Umakanta, 167
Schaffer, K.A., 483
Scheaffer, Richard L., 21, 414
Schearer, R. Bruce, 412
Scherr, Marvin G., 411, 412
Schuerhoff, Maarten, 489
Schmitt, S.A., 410
Schreuder, Hans T., 482
Schucany, W. R., 406, 407
Scopp, Thomas S., 489
Scott, C., 122, 572
Seltzer, William (in Marks et al.), 22, 448
Sen, A.R., 399
Sengupta, J.M., 428, 556
Seth, G.R., 438
Seychelles, 24
T
Taillie, C., 399
Tanur, J.M., 23, 482
Tavris, Carol, 483
Tepping, Benjamin J., 5
Thailand, 2, 448
Thanh Le, 568
Thionet, P., 23
Thompson, M.E., 23, 112
Thompson, Steven K., 22, 78, 399, 416
Tibshirani, Robert J., 406
Tippett, L.H.C., 22
Toro, Vivian, 465, 502
Trinidad & Tobago, 556
Trowbridge, F.L. (in Fichner et al.), 419
Trueblood, R.M., 22, 556
Truman, Harry S., 5
Tuvalu, 2
Tukey, J. W., 402, 406, 483
Turkey, 448
Turner, Anthony, 1
U
Uganda, 561
U.K., 310, 439, 448, 453, 456, 486, 539, 565
UN, 2, 19, 22, 23, 67, 78-79, 149, 183, 185, 189, 218, 269, 270, 287-288, 393, 399, 399, 401, 403, 418, 419,
420, 427, 429, 446, 447, 456, 461, 469, 471, 479, 481, 482, 486, 496, 502, 559, 560, 562, 563, 564, 565, 568,
569
UNDP, 569, 570
UN Economic Commission for Africa, 464
UN Economic Commission for Asia and the Pacific, 488, 502
UNFPA, 496, 565, 568, 570
UNICEF, 412, 570, 571
V
Van Bastelaer, A., 491
Vance, L.L., 22
Velu, R., 23
Venezuela, 417
Verma, Vijay, 562, 568
Viet Nam, 561
Viet Nam Institute of Computer Science, 501
Visscher, Wendy (in Droitcour et al.), 415
W
Waksberg, J., 1, 23, 455
Warner, S., 413
Warwick, Donald P., 22
Way, Peter O., 421
Weber, A.A., 268, 280-281
Weimin, Zhang, 3
Welburn, Arthur J., 22
Werking, G.S., 498
Westinghouse Public Applied System, 566
WHO (World Health Organization), 416, 418, 420, 482, 493, 501, 570, 571, 572
WHO Eastern Mediterranean Region/South-east Asian Region, 418
Willard, Joseph L. (in Ochoa et al.), 568
Willeman, Thomas R., 406
Willoughby, Anne D. (in Pappaioanou et al. (1990b)), 421
Wilson, John K., 483
Wilson, L.B., 491
Wood, Geoffrey, 482
Woolsey, T.D., 443, 449
World Bank, 493, 561, 568, 569, 570, 571, 572
Wright, Audrey (in Pappaioanou et al. (1990b)), 421
Wright, R.A. (in Rice et al.), 491
Wu, C.F.J., 406
Y
Yates, F., 22, 23, 24, 25, 50, 66, 67-68, 78, 89, 112, 123, 139, 155, 183, 186, 204, 205, 210, 267, 287, 299, 394,
397, 403, 404, 416, 427, 428, 441, 454, 456, 482, 502, 556
Yerushalmy, J., 428
Yugoslavia, 2
Yule, G. Udny, 22
Z
Zaire, 425, 456
Zambia, 310
Zarkovich, S.S., 403, 433, 471, 482
Zimbabwe, 2, 67, 402, 446
Zongming, Shao, 3
INDEX OF SUBJECTS
A
Accounting surveys, 22
Accuracy of individual survey values, 430 (see also Response errors)
Ad hoc survey:
definition, 470
by permanent survey organizations, 465
Administrative organization of surveys, 465
Adult education survey, 2
Advance estimates from IPNS, 402
African Population Census Programme, 459
Agricultural surveys, 22
(see also Crop surveys)
errors and biases in surveys:
in Greece, 441
in Hertfordshire, U.K., 183, 186, 204-205
in Peru, 402
in United States, 404
use of multi-stage sampling in, 230
AIDS (Acquired Immunodeficiency Syndrome) surveys, 420-422
Allocation of total sample size:
to different stages in multi-stage sampling, 289-300
to different strata and stages in stratified multi-stage sampling, 379-380
to different strata in stratified single-stage sampling, 207-222
Analysis of variance, 294-296
Ancillary variables:
(see also pps sampling, Ratio method of estimation, Regression method of estimation, Stratified sampling,
Systematic sampling)
definition, 8
use of in srs, 71-80
ANTHRO software, 419
Areal units:
complete enumeration of, 449-450
as ultimate-stage sample units, 397
Atlas GIS, 499-500
Auditing and accounting surveys, 22
Australia, health surveys, 2
Automatic voice recognition, 499
Auto-owners' satisfaction survey, 1-2
Average (see Mean)
B
Bayes' theorem, use in sampling, 409-419
Belgium, family-building surveys, 2
Bengal crop survey, 171-172, 402
Bias, effect of (see also Errors and biases in data and estimates):
in combined ratio estimator, 175
on errors of estimates, 437
in estimate of ratios, 46
of an estimator, 16-17
in estimator of a ratio, 45
of individual survey values, survey means and survey totals, 430
in ratio method of estimation, 73-74, 517-519
in regression method of estimation, 77
in separate ratio estimator, 175
in stratification after sampling, 181-182
Biased estimator, definition, 33
Biological studies, 22
Binomial distribution (see Proportions of units)
Blaise system, 495
BMDP software, 499
C
CAC (computer-assisted coding), 489, 496
CADE (computer-assisted data entry and editing), 489
CADI (computer-assisted data input), 489-499
Cameroon:
demographic surveys, 122, 390, 461-463, 572-573
survey of level of living, 290
Canada:
disability surveys (incl. HALS), 416-418, 426-427
labor force surveys, 397
population census, 416, 440
Canvassing method of data collection, 467
CAPI (computer-assisted personal interview), 489-491, 496-499
Capture-recapture method in estimating mobile populations, 398, 447-448
Cartographic work, 465-466
Case study: Indian National Sample survey, 539-558
CASI (computer-assisted self interview), 498
CASIC (computer-assisted survey information system), 496-499
CATI (computer-assisted telephone interview), 496-497
Cauchy-Schwarz inequality, 526
CBSView software, 495
CENTRACK (census tracking software), 494
CENTRY (census entry software), 489, 493-494
CENTS (census tabulation system software), 491
CENVAR software for variance estimation, 492, 494
Chad, demographic surveys, 176, 447
Chandra Sekhar-Deming formula, 447-448
Children's nutritional status surveys, 418-419
in Bangladesh, 418
as part of PAPCHILD, 418-419
Chile, death rates in, 429
Circular systematic sampling, 82-83
Cluster sampling, 24, 115-124, 451-452, 523-524
comparison with srs, 115-120
in controlling non-sampling errors, 451-452
optimal size, 136-138
for proportions, 120-121
sample estimators, 118-119
sample size, 136
universe values, 116-117, 521-522
variance function, 122-123
Cluster size, optimum, 138-140
Coding, 477
Coefficient of variation:
of an estimator, 46, 128-133
of mean in srs, 43, 46-47, 533
of proportion in srs, 61, 126-128, 532
of total, 67, 130, 533
Coefficient of variation per unit:
definition, 12
and sample size, 125-134, 533
for different shapes of distribution, 133
for different values of universe proportion, 533
Collapsed strata, 407-408
Combined ratio estimator:
in stratified pps sampling, 201-203
in stratified srs, 173-180
in stratified two-stage srs, 336
Complete enumeration vs. sample, 2, 6
Commuting/travel habits survey, 2
Compound probability, theorem of, 507
Computer-assisted coding (CAC), 489, 496
Computer-assisted data entry and editing (CADE), 499
Computer-assisted data input (CADI), 489-499
Computer-assisted personal interview (CAPI), 489-491, 496-499
comparison with PAPI, 489
Computer-assisted self interview (CASI), 498
Computer-assisted survey information system (CASIC), 496-499
D
Dalenius-Guerney rule for formation of strata, 214
Dalenius-Hodges rule for formation of strata, 214-215
Data base software programs, 488
Data collection, methods of, 466-467
Data processing, 477-479, 485-502
Death rates, in Chile and Portugal, 429
Deep stratification, 215
deff (design effect), 119, 123-124, 492
Degrees of freedom, 46-47, 152-153, 155
Demarcation of strata, 214-215
equalization of:
cumulative [f(y)]^(1/2) (Dalenius-Hodges rule), 214-215
NhRh (Dalenius-Guerney rule), 214
Nha Vh (Ekman rule), 214
stratum totals (Mahalanobis-Hansen-Hurwitz-Madow rule), 214
Demographic and health surveys (DHS), 22, 485, 489, 492, 561, 562, 564, 567-568, 572-573
in Cameroon, 122, 390, 461-464
in Cyprus, 2
in Chad, 176, 447
in Germany, 184
in Guinea, 440
in India, 138-139, 152, 167, 170, 195-196, 214, 374, 425, 446, 447, 448
in Liberia, 448
in Malawi, 448
in Morocco, 447
in Mysore (India), 149, 446
in Pakistan, 443, 447
repeated visits in, 443
in Thailand, 447
in Turkey, 443, 447
in the United States (see Current Population Survey, U.S.)
in Upper Volta (Burkina Faso), 447
in Zaire, 425
in Zimbabwe, 446
Design effect (deff), 119, 123-124, 492
Digital mapping, 497
Disability surveys, 416-419
in Canada, 416-418
in Egypt, 416, 418
in India, 416
in Iraq, 418
in Jordan, 418
in Lebanon, 418
in Syria, 418
in the Netherlands, 416-418
in U.S., 416-418
Double sampling, 391-394
optimum sizes, 392
universe values and sample estimators, 391-393
E
EDI (Electronic data interchange), 497
Editing, 486-489, 491
Efficiency of estimators, 15-16
Egypt, labor force surveys, 397
Ekman rule for demarcation of strata, 214
Election polls, in U.K. and U.S., 5
Electronic data interchange (EDI), 497
Elementary units, definition, 8
Employment surveys:
in Ghana, 122
F
False negatives/positives, 416, 427
Family budget surveys, European Community, 562
(see also Household budget/expenditure surveys) objectives, 460
Family census, U.K., 454
Japan, family income and expenditure survey, 474
Family planning surveys (see also Demographic surveys, Fertility surveys, Focus groups, and WFS):
in Hungary, 482, 557
in India, 442
references, 557
Fellegi-Sunter algorithm, 499
Fertility surveys:
adjustment for non-response, 454
in England and Wales, 428
Finite multiplier, finite population correction, and finite sampling correction, 43
First-stage unit (fsu), selection with probability proportional to the number of ssu's:
in stratified two-stage design, 360
in two-stage design, 279-280
Fisheries surveys, 2
Focus groups, 7, 410-413
Food stamp survey, 1
Formation of strata, 213-215
demarcation of strata, 213-215
number of strata, 215
Fractile graphical analysis, 403
Fractional values in stratified sampling, 213, 225-226
Fruit growers inquiry, 453-454
fsu (see First-stage unit)
Fundamental theorems:
in multi-stage sampling, 231-235
in single-stage sampling, 44-46
in stratified multi-stage sampling, 310-315
in stratified single-stage sampling, 149-152
G
Gain due to stratification:
in stratified multi-stage sampling, 350, 380
in stratified single-stage sampling, 180, 204
Gantt chart, 19, 465
GCHS (Gulf Child Health Survey), 419, 563, 570
Geographical Information System (GIS), 499-501
Germany, micro-census, 184, 397, 450
Ghana, employment survey, 122
H
HALS (Health and Activity Limitation Survey), Canada, 416-418, 426-427
Hansen-Hurwitz method of estimation in pps sampling, 95
Hansen-Hurwitz method of adjustment for non-response, 453
Hartley-Politz-Simmons method of adjustment for non-response, 454-455
Health surveys in the U.S., 2, 557
Health and Activity Limitation Survey (HALS), Canada, 416-418, 426-427
Hertfordshire, U.K., agricultural survey, 183, 186, 204-205
HIV/AIDS infection surveys, 420-422
Homogeneity, measures of (see Intra-class correlation)
Hong Kong, crime victim survey, 2
Household budget/expenditure survey and family budget surveys, 84
in Ethiopia, 251
use of multi-phase sampling in, 389
response errors in, 456
Household inquiries, 471
Householder method for data collection, 467
Hungary, fertility and family planning study, 491, 557
Hypothetical universe, 50-51, 533-534
I
Image capture technique, 490
Imaging, 497
Inadequate preparation errors, 425-426
Incomplete samples (see Non-response, adjustment for)
Independent events, definition, 507
India:
agricultural surveys, 554
demographic surveys, 556
morbidity survey, 449, 557
National Sample Survey (see Indian National Sample Survey)
population census, 440
socio-economic inquiries, 86
Indian Council of Agricultural Research, 171
Indian National Sample Survey, 138, 167, 170-171, 214, 310, 371, 394, 402-403, 442, 447, 449, 471, 476,
539-568
Individual survey values, 429-436
Induced abortion survey, North Carolina, 415
InfoShare, 500
In-home radio station rating, 439-440
Integrated microcomputer processing system (IMPS), 493-495, 571
Integrated survey, definition, 470-471
Integrated System of Survey Analysis (ISSA) software, 419, 492, 495, 568, 570, 571
Internal consistency checks, 442-443, 449
Interpenetrating network of subsamples (IPNS), 402-403, 443-446, 471, 477
advanced estimators from sub-samples, 402, 477
estimating variances and margins of uncertainty, 444-446
fractile graphical analysis, 403
testing enumerator differences, 444-446
use in pilot inquiries, 475
Interpretation errors, 429
Interviewer (see Enumerator)
Intra-class correlation coefficient, 90
in cluster sampling, 118-124, 137-138
in response errors, 438, 444-445
in two-stage sampling, 291-292
Inverse sampling, 399-401
of continuous data, 400-401
for proportions, 399-400
Israel, labor force surveys, 397
Item count method, 415
J
Jackknife method of variance estimation, 406-407, 495
Japan, national survey of family income and expenditure, 476
statistical database (SISMAC), 490
K
KAP (knowledge, attitude and practice of family planning) studies, 412-413, 482 (see also CPS/FPS, Family
planning surveys, DHS, WFS)
L
Labor force surveys:
in Canada, 397
in Egypt, 397
in Israel, 397
in U.S. (see Current employment survey and Current population survey)
Lahiri's method of pps selection, 97
Lattice sampling, 415-416
Legal basis of surveys, 460
Liberia, demographic survey, 448
Lincoln index, 398, 447
Linear systematic sampling, 81-84
Line sampling, 415-416
Literary Digest poll (1930), 5
Living Standards Measurement Study (LSMS), 561, 569
Longitudinal survey, 24
Loss due to errors in estimation, 135
Loss function, 135
Lotus 1-2-3 spreadsheet, 499
LSMS (Living Standards Measurement Study), 561, 569
M
Mahalanobis-Hansen-Hurwitz-Madow rule for demarcation of strata, 214
Mail inquiry, 452-454, 464
adjustment for non-response, 452-454
Malawi, demographic survey, 448
MapInfo, 500
Mapping work, 465
Market research (see Opinion and marketing research)
Matching fraction, in sampling on successive occasions, 395-396
Mathematical expectation, 14, 33, 507-508
Mauritania, PAPCHILD, 572-573
Maximum likelihood estimator, 398, 448
Mean square error:
control of, 449-451
definition, 16-17, 434-437
effect of sample size on, 449-450
Mean, universe values and sample estimators of
(see also Double sampling, Non-response, Ratio method of estimation, Regression method of estimation,
Sampling on successive occasions, Statistical model, Systematic sampling)
in single-stage pps sampling, 94-95
in single-stage srswor, 11-14, 29-30, 31-38, 49-50, 508-521
in single-stage srswr, 11-14, 29-30, 31-38, 47-49, 508-521
in stratified single-stage pps sampling, 190-192
in stratified single-stage srs, 159-161, 524-526
in stratified three-stage srs, 342-344
in stratified two-stage srs, 322-324
in three stage srs, 262-264
in two-stage srs, 235-243, 244-247
Measures of homogeneity (see Intra-class correlation)
Micro-census sample survey, Germany, 184, 397, 450
Minimum variance estimator, 16
Mobile population, estimation, 398-399
Morbidity surveys, 449, 557
in Canada, 416-418, 426-427
in India, 416, 449, 557
in U.K., 439
in U.S., 427, 429, 439, 557
Morocco:
demographic survey, 2, 447, 557
trachoma survey, 557
Multi-country survey programs, 559-573
contraceptive prevalence/family planning surveys (CPS/FPS), 566-567, 572-573
coordination among multi-country survey programs, 564-565
demographic and health surveys (DHS), 567-568, 572-573
expanded program on immunization (EPI) surveys, 571-572
funding, 561-562
Gulf Child Health Survey (GCHS), 570
inter-country data comparability, 562-563
Living Standards Measurement Study, 569
major issues, 559-565
management, 561
National Household Survey Capability Programme (NHSCP), 568-569
O
Objectives of surveys, 460
Observational errors (see Response errors)
OCR (see Optical character recognition)
OMR (see Optical mark reader)
Opinion and marketing research:
focus groups, 7, 410-413
self-weighting design in multi-stage sampling, 301-306
self-weighting design in stratified multi-stage sampling, 384
multi-stage sampling in, 230, 306
in U.K., 23, 555
in U.S., 555
Optical character recognition (OCR), 485, 497-498
Optical mark reader (OMR), 490
Optimal cluster size, 136-138
Optimal sample size, 125
Optimum allocations:
in stratified multi-stage sampling, 379-380
P
Pakistan, demographic survey, 2, 443, 447, 556
Pan Arab Project for Child Development (PAPCHILD), 419, 564, 570-573
Paperless fax image reporting system (PFIRS), 495-496
PAPI (see Pencil-and-paper interview)
Parameters, definition, 11-13
Parks, use of, survey, 2
PC-EDIT, 489, 496
PDE (Prepared data entry), 498-499
Pencil-and-paper interview, 470, 489, 491
comparison with CAPI, 489
Periodicity:
of surveys, 472
of units in systematic sampling, 80
PERT (see Project evaluation and review technique)
Peru:
agricultural census, 402
demographic survey, 2
PES (see Post-enumeration surveys)
PGE (see Population growth estimation studies)
Pilot inquiries, 472-473
Planning, execution, and analysis of surveys, 459-483
administrative organization, 465
budget and cost control, 461-464
cartographic work, 465
coordination with other inquiries, 465
data collection methods, 466-467
data processing, 477-479, 485-502
enumerators, selection/training, 475
legal basis, 453
objectives of surveys, 460
pilot inquiries and pretests, 474-475
project management, 465
publicity and cooperation of respondents, 460-461
questionnaire preparation, 467-468
report preparation, 479-482
supervision, 476
survey design, 469-481
tabulation programs, 466
Poisson distribution, for determining sample size, 127-128
PopMap, 501
Population, 8 (see also Universe)
Population analysis spreadsheet software, 494
Population census:
in Canada, 416, 440-441
in India, 440
in Sierra Leone, 441
in Swaziland, 441
in U.K., 440, 441
in U.S., 440, 441
in U.S.S.R., 441
in Yugoslavia, 429
Population growth estimation (PGE) studies, 447-448
in Liberia, Malawi, Pakistan, Thailand, and Turkey, 447
Population sampling (see Demographic sample surveys)
Portugal, death rate in, 429
Post-enumeration surveys (PES) of population censuses (see also Resurveys)
in Canada, 3, 441
in China, 3
in France, 3
in India, 440
in U.K., 440
in U.S., 3, 440
in U.S.S.R., 441
Post-stratification (see Stratification after sampling)
pps (see Probability proportional to size)
Precision of estimators, 15-16
Pregnancy history, data collection, 443
Pregnancy outcome, data collection, 415
Prepared data entry (PDE), 498-499
Pretests, 7, 474-475
Principal steps in a survey, 17-19, 460-482
Q
Quality control techniques in survey operations, 449
Quattro Pro spreadsheet, 499
Quenouille-Murthy-Nanjamma method of almost unbiased ratio estimator, 73
Questionnaire and schedule definition, 467-468
precoded, 475
preparation, 467-469
QUICKTAB software, 490, 493
Quota sample, 7
R
Random groups, in estimating variances and covariances, 64-65, 404
Random sample numbers, 50
tables of, 528-529
Random sampling, 9
Random variable, definition, 8, 505
Range, estimating variance from, 133
Rare events, estimation of, 401
Ratio method of estimation, 71-76
correction for bias, 73-74, 517-519
in double sampling, 394
in pps sampling, 109-110
in single-stage srs, 424
in srs, 71-76
in stratified pps sampling, 201-203
in stratified srs, 173-180
in stratified three-stage srs, 347-350
in stratified two-stage srs, 328-336
in survey design, 470-471
in three-stage srs, 265
in two-stage srs, 250-251
Ratio of random variables, 42
in srs, 517-519
Ratio of ratio estimators of two totals in stratified single-stage pps sampling, 203
in stratified single-stage srs, 176-180
in stratified two-stage srs, 337
Ratio of two totals, universe values and sample estimators of:
in unstratified multi-stage sampling, 231-235
in single-stage pps sampling, 95
in single-stage srswor, 12, 14, 49-50, 517-519
in single-stage srswr, 12, 14, 47-49, 517-519
in stratified multi-stage sampling, 312
in stratified single-stage pps sampling, 191-192
in stratified single-stage sampling, 151, 154
in stratified single-stage srs, 158-160
in stratified three-stage srs, 342
in stratified two-stage srs, 321-322
in three-stage srs, 261-262
in two-stage srs, 243-244
Readership preference surveys, 466
Recall analysis/lapse, 428, 446-447
Recording unit, 9
Regression method of estimation:
in double sampling, 391-392
in general sample designs, 471
in sampling on successive occasions, 395-397, 476-477
in srs, 76-78
S
Sample design:
choice of, 470-471
definition, 10
Sample estimate (see Estimate)
Sample estimator (see Estimator)
Sample size:
effect on errors, 449-451
estimation of variance for determining, 133
in multi-stage designs, 289-300
required to ensure desired CV in srs, 531
in single-stage sampling, 136-140
in stratified multi-stage sampling, 379-380
in stratified sampling, 207-222
for sub-division of universe, 132-133
Sample mean, distribution of, in srs, 33-38, 432-434
Sample surveys (see Surveys)
Sampling, advantages, 4-5
Sampling:
basic concepts, 1-26
examples, 1-2
history, 21
limitations, 5-6
relationship with complete enumeration, 6
versus complete enumeration, 2
Sampling, applications:
agriculture, forestry and fisheries, 22, 182-183
auditing and accounting, 22
biological and geological studies, 22
business research, 22
child health, nutrition and immunization, 418-419, 570-572
demography, 22, 480, 567-568
disability, 416-418
fertility and family planning, 483, 565-567
health and morbidity, 23, 583, 567-568
HIV/AIDS infection, 420-422
household and social survey, 23, 483, 568-569
living standards, 569
opinion survey and market research, 23
sexuality, 483
social research, 23
telephone survey, 23
traffic, 23
women's status, 419-420
Sampling biases and errors, 424-425
Sampling fraction:
in multi-stage srs, 235
in single-stage srs, 42-43
Sampling frames, 9, 471-472
Sampling method, definition, 1
Sampling on successive occasions, 394-397, 468
estimating change in mean values, 394-395
estimating mean of two occasions, 395
estimating mean on subsequent occasion, 395-396
rotation or partial replacement of units, 396-397
use of computers in applying regression method of estimation, 476-477
Sampling plan, definition, 10
Sampling unit, choice of:
in controlling non-sampling errors, 449-450
in sampling on successive occasions, 397
in single-stage sampling, 115-124
Sampling variance (see Variance, universe values and sample estimators in)
Sampling versus non-sampling errors, evaluation and control, 449-451
Sampling with replacement (see Multi-stage sampling, pps sampling, Simple random sampling, Single-stage sampling, and Stratified multi-stage sampling)
Sampling with varying probabilities (see pps sampling)
T
Table Retrieval System software, 493
Tabulation programs, 464
U
United Kingdom:
election polls, 5
family census, 454
morbidity surveys, 439
population census, 440
social survey, 459
Unbiased estimators, definition, 14, 39
Unit of analysis, 8
Unitary checks, 440
Universe, finite and infinite, definitions, 8
Unordered estimator in pps sampling without replacement, 95-96
Unrestricted srs, 29 (see also Single-stage srs)
Unweighted estimates, use of, 424-425
Universe, 8
Unordered (Raj) estimator in pps sampling without replacement, 95-96
Upper Volta, demographic survey, 447
United States:
agricultural surveys, 404, 443
current population survey, 1, 4-5, 139, 310
income and expenditure surveys, 2
election polls, 5
morbidity surveys, 427, 439
National Health Survey, 557
population and housing censuses, 440, 441
tuberculosis survey, 426
U.S.S.R., population census, 441
V
Vanity effect, 427
Variable, definition, 8
Variance of random variables, 507-508
Variance, universe values and sample estimators:
in multi-stage sampling, 231-235
in single-stage sampling, 44-48
in single-stage srswor and srswr, 12-14, 29-42, 510-517
in stratified multi-stage sampling, 310-315
in stratified single-stage pps sampling, 190-192, 207-210
in stratified single-stage sampling, 149-151
in stratified single-stage srs, 159-161, 172-180, 524-526
in stratified three-stage pps, 357-358
in stratified three-stage srs, 338-351
in stratified two-stage pps design, 355-357
in stratified two-stage srs, 317-337
in three-stage design with pps, srs, and srs, 282-283
in three-stage pps, 273-275, 298-299
in three-stage srs, 258-265, 298-299
in two-stage design with pps and srs, 275-280
in two-stage pps design, 271-273, 289-292
in two-stage srs, 237-251, 289-298
Variance computations:
simplified methods of, 403-405
method of random groups, 404
selection of a random pair of units from each stratum, 404
use of error graphs, 404
use of computers, 419, 476-477, 490, 492, 493, 566, 568, 569
Variance function:
in cluster sampling, 116-119, 521-522
in double sampling, 393
in sampling on successive occasions, 395-396
in single-stage sampling, 126-128
in stratified single-stage sampling, 207-208, 216-217, 522-524
in stratified two-stage sampling, 329-380
in three-stage design, 298-299
in two-stage design, 292-294
Variate, definition, 8
Varying probability sampling (see pps sampling)
Viet Nam, 19
Institute of Computer Science, 501
W
Weighting factors:
in multi-stage sampling, 234-235, 301-306
in single-stage sampling, 142
in stratified multi-stage sampling, 301-383
WFS (see World Fertility Survey)
WISTAT (UN data base on women), 419
Women's status, survey, 419-420
World Fertility Survey (WFS), 22, 560-561, 564-566, 571-573
X
XTable, 492
Y
Yield surveys (see Crop surveys)
Yugoslavia, population census, 429
Z
Zambia, sample survey on goods traffic movement, 310
Zimbabwe:
demographic survey, 3
family budget survey, 2
fisheries survey, 2