Chapter 11
PROC SURVEYSELECT
   Sampling Methods: simple random sampling; unrestricted random sampling (with replacement); systematic; sequential; selection probability proportional to size (PPS) with and without replacement; PPS systematic; PPS for two units per stratum; sequential PPS with minimum replacement

PROC SURVEYMEANS
   Design Accommodated: stratification; clustering; unequal weighting
   Available Statistics: population total; population mean; proportion; standard error; confidence limit; t test

PROC SURVEYREG
   Design Accommodated: stratification; clustering; unequal weighting
   Available Analysis: fit linear regression model; regression coefficients; covariance matrix; significance tests; estimable functions; contrasts
Survey Sampling
The SURVEYSELECT procedure provides a variety of methods for selecting probability-based random samples. The procedure can select a simple random sample or a sample according to a complex multistage sample design that includes stratification, clustering, and unequal probabilities of selection. With probability sampling, each unit in the survey population has a known, positive probability of selection. This property of probability sampling avoids selection bias and enables you to use statistical theory to make valid inferences from the sample to the survey population.
PROC SURVEYSELECT provides methods for both equal probability sampling and sampling with probability proportional to size (PPS). In PPS sampling, a unit's selection probability is proportional to its size measure. PPS sampling is often used in cluster sampling, where you select clusters (groups of sampling units) of varying size in the first stage of selection. Available PPS methods include without replacement, with replacement, systematic, and sequential with minimum replacement. The procedure can apply these methods for stratified and replicated sample designs. See Chapter 63, "The SURVEYSELECT Procedure," for more information.
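As an illustration of PPS selection, the following sketch selects two clusters with probability proportional to a size measure. The data set SchoolFrame, its variables SchoolId and Enrollment, the sample size, and the seed are all hypothetical and are not part of this chapter's example.

```sas
/* Hypothetical frame of clusters (schools) with a size measure */
data SchoolFrame;
   input SchoolId Enrollment;
   datalines;
1 120
2 450
3 300
4 80
5 610
6 240
;

/* Select 2 schools with probability proportional to Enrollment, */
/* without replacement (METHOD=PPS). The SEED= value is arbitrary */
/* and only makes the selection reproducible.                     */
proc surveyselect data=SchoolFrame out=SchoolSample
   method=pps n=2 seed=12345;
   size Enrollment;
run;
```

The SIZE statement names the size measure. Larger schools receive proportionally larger selection probabilities and, correspondingly, smaller sampling weights.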
PROC SURVEYREG fits linear models for survey data and computes regression coefficients and their variance-covariance matrix. The procedure also provides significance tests for the model effects and for any specified estimable linear functions of the model parameters.

PROC SURVEYMEANS presently does not perform domain analysis (subgroup analysis). However, you can produce a domain analysis with PROC SURVEYREG (see Example 62.7). This capability will be available in a future release of the SURVEYMEANS procedure.
Variance Estimation
The SURVEYMEANS and SURVEYREG procedures use the Taylor expansion method to estimate sampling errors of estimators based on complex sample designs. This method obtains a linear approximation for the estimator and then uses the variance estimate for this approximation to estimate the variance of the estimate itself (Woodruff 1971, Fuller 1975). When there are clusters, or primary sampling units (PSUs), in the sample design, the procedures estimate the variance from the variation among the PSUs. When the design is stratified, the procedures pool stratum variance estimates to compute the overall variance estimate.

For a multistage sample design, the variance estimation method depends only on the first stage of the sample design. Thus, the required input includes only first-stage cluster (PSU) and first-stage stratum identification. You do not need to input design information about any additional stages of sampling.
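The first-stage design information described above might be supplied as in the following sketch. The data set ClusterSample and the variables Stratum, PSU, y, and SamplingWeight are hypothetical names and values chosen for illustration only.

```sas
/* Hypothetical sample: 2 first-stage strata, 2 PSUs per stratum, */
/* 2 sampled units per PSU                                         */
data ClusterSample;
   input Stratum PSU y SamplingWeight;
   datalines;
1 1 10 25
1 1 12 25
1 2 15 25
1 2 11 25
2 1 20 40
2 1 22 40
2 2 18 40
2 2 25 40
;

/* Only the first-stage strata and clusters are identified; */
/* later stages of sampling are not described to the procedure. */
proc surveymeans data=ClusterSample mean;
   strata Stratum;
   cluster PSU;
   var y;
   weight SamplingWeight;
run;
```

With the CLUSTER statement present, the variance is estimated from the variation among PSUs within each stratum rather than among individual units.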
SAS OnlineDoc: Version 8
Population refers to the target population or group of individuals of interest for study. Often, the primary objective is to estimate certain characteristics of this population, called population values. A sampling unit is an element or an individual in the target population. A sample is a subset of the population that is selected for the study. Before you use the survey procedures, you should have a well-defined target population, sampling units, and an appropriate sample design. In order to select a sample according to your sample design, you need to have a list of sampling units in the population. This is called a sampling frame. PROC SURVEYSELECT selects a sample using this sampling frame.
Stratified sampling involves selecting samples independently within strata, which are nonoverlapping subgroups of the survey population. Stratification controls the distribution of the sample size in the strata. It is widely used in practice to meet a variety of survey objectives. For example, with stratification you can ensure adequate sample sizes for subgroups of interest, including small subgroups, or you can use stratification to improve the precision of overall estimates. To improve precision, units within strata should be as homogeneous as possible for the characteristics of interest.
Clustering
Cluster sampling involves selecting clusters, which are groups of sampling units. For example, clusters may be schools, hospitals, or geographical areas, and sampling units may be students, patients, or citizens. Cluster sampling can provide efficiency in frame construction and other survey operations. However, it can also result in a loss in precision of your estimates, compared to a nonclustered sample of the same size. To minimize this effect, units within clusters should be as heterogeneous as possible for the characteristics of interest.
Multistage Sampling
In multistage sampling, you select an initial or first-stage sample based on groups of elements in the population, called primary sampling units or PSUs. Then you create a second-stage sample by drawing a subsample from each selected PSU in the first-stage sample. By repeating this operation, you can select a higher-stage sample. If you include all the elements from each selected primary sampling unit, then the two-stage sampling reduces to cluster sampling.
Sampling Weights
Sampling weights, or survey weights, are positive values associated with each unit in your sample. Ideally, the weight of a sampling unit should be the frequency that the sampling unit represents in the target population. Therefore, the sum of the weights over the sample should estimate the population size N. If you normalize the weights such that the sum of the weights over the sample equals the population size N, then the weighted sum of a characteristic y estimates the population total value Y.

Often, sampling weights are the reciprocals of the selection probabilities for the sampling units. When you use PROC SURVEYSELECT, the procedure generates the sampling weight component for each stage of the design, and you can multiply these sampling weight components to obtain the final sampling weights. Sometimes, sampling weights also include nonresponse adjustments, post-sampling stratification, or regression adjustments using supplemental information.

When the sampling units have unequal weights, you must provide the weights to the survey analysis procedures. If you do not specify sampling weights, the procedures use equal weights in the analysis.
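Multiplying the stage-level weight components can be sketched with a DATA step like the following. The data set TwoStage and the variables Stage1Weight, Stage2Weight, and FinalWeight are hypothetical names with made-up values; the actual names depend on how each stage's output was stored.

```sas
/* Hypothetical two-stage sample: each record carries the weight */
/* component from each stage of selection                         */
data TwoStage;
   input Id Stage1Weight Stage2Weight;
   datalines;
1 10 4
2 10 4
3 25 2
4 25 2
;

/* The final sampling weight is the product of the components */
data Weighted;
   set TwoStage;
   FinalWeight = Stage1Weight * Stage2Weight;
run;
```

For example, a unit in a PSU selected with probability 1/10 and then subsampled with probability 1/4 receives a final weight of 40, so it represents 40 units in the target population.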
The ratio of the sample size (the number of sampling units in the sample) n and the population size (the total number of sampling units in the target population) N is written as

   f = n / N

This ratio is called the sampling rate or the sampling fraction. If you select a sample without replacement, the extra efficiency compared to selecting a sample with replacement can be measured by the finite population correction (fpc) factor, 1 - f.

If your analysis should include a finite population correction factor, you can input either the sampling rate or the population total. Otherwise, the procedures do not use the fpc when computing variance estimates. For fairly small sampling fractions, it is appropriate to ignore this correction. Refer to Cochran (1977) and Kish (1965).

As stated in the section "Variance Estimation," for a multistage sample design, the variance estimation method depends only on the first stage of the sample design. Therefore, if you are specifying the sampling rate, you should input the first-stage sampling rate, which is the ratio of the number of PSUs in the sample to the total number of PSUs in the target population.
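One way to request the finite population correction is the RATE= option in the PROC SURVEYMEANS statement, as in the sketch below. The value 0.05 is an assumed sampling rate used only for illustration, and the data set Sample and its variables are those of the example later in this chapter.

```sas
/* RATE=0.05 is a hypothetical sampling rate applied to all strata. */
/* With it, variance estimates include the fpc factor 1 - f.        */
proc surveymeans data=Sample rate=0.05 sum clm;
   var Income Expense;
   strata State Region;
   weight Weight;
run;
```

The TOTAL= option is the alternative when you know the population totals rather than the sampling rate. If neither option is specified, the fpc is not used.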
Sample Selection
To select a sample with PROC SURVEYSELECT, you input a SAS data set that contains the sampling frame, or list of units from which the sample is to be selected. You also specify the selection method, the desired sample size or sampling rate, and other selection parameters.
In this example, the sample design is a stratified simple random sampling design, with households as the sampling units. The sampling frame (the list of households) is stratified by State and Region. Within strata, households are selected by simple random sampling. Using this design, the following PROC SURVEYSELECT statements select a probability sample of households from the HHSample data set.
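A minimal sketch of statements that match this design is shown here. The data set and variable names come from the text, but the stratum sample size in the N= option and the SEED= value are illustrative assumptions, not the values of the original example.

```sas
/* HHSample is the sampling frame described in the text.         */
/* N=1 (one household per stratum) is an assumed value; a list   */
/* of per-stratum sizes can be given instead.                    */
/* SEED= is arbitrary and only makes the selection reproducible. */
proc surveyselect data=HHSample out=Sample
   method=srs n=1 seed=12345;
   strata State Region;
run;
```

The input data set must be sorted by the STRATA variables before the procedure is run.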
The STRATA statement names the stratification variables State and Region. In the PROC SURVEYSELECT statement, the DATA= option names the SAS data set HHSample as the input data set (the sampling frame) from which to select the sample. The OUT= option stores the sample in the SAS data set named Sample. The METHOD=SRS option specifies simple random sampling as the sample selection method. The N= option specifies the stratum sample sizes.

The SURVEYSELECT procedure then selects a stratified random sample of households and produces the output data set Sample, which contains the selected households together with their selection probabilities and sampling weights. The data set Sample also contains the sampling unit identification variable Id and the stratification variables State and Region from the data set HHSample.
Survey Data Analysis
You can use the SURVEYMEANS and SURVEYREG procedures to estimate population values and to perform regression analyses for survey data. The following example briefly shows the capabilities of each procedure. See Chapter 61, "The SURVEYMEANS Procedure," and Chapter 62, "The SURVEYREG Procedure," for more detailed information.
To estimate the total income and expenditure in the population from the sample, you specify the input data set containing the sample, the statistics to be computed, the variables to be analyzed, and any stratification variables. The statements to compute the descriptive statistics are as follows:
   proc surveymeans data=Sample sum clm;
      var Income Expense;
      strata State Region;
      weight Weight;
   run;
The PROC SURVEYMEANS statement invokes the procedure, specifies the input data set, and requests estimates of population totals and their standard deviations for the analysis variables (SUM), and confidence limits for the estimates (CLM). The VAR statement specifies the two analysis variables, Income and Expense. The STRATA statement identifies State and Region as the stratification variables in the sample design. The WEIGHT statement specifies the sampling weight variable Weight.

You can also use the SURVEYREG procedure to perform regression analysis for sample survey data. Suppose that, in order to explore the relationship between the total income and the total basic living expenses of a household in the survey population, you choose the following linear model to describe the relationship:

   Expense = α + β × Income + error
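Statements along the following lines fit this model. This is a sketch assembled from the statement descriptions in this section, using the data set Sample and the variables named in the text.

```sas
/* Regress Expense on Income, accounting for the stratified */
/* design and the unequal sampling weights                  */
proc surveyreg data=Sample;
   strata State Region;
   model Expense = Income;
   weight Weight;
run;
```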
In the PROC SURVEYREG statement, the DATA= option specifies the input sample survey data as Sample. The STRATA statement identifies the stratification variables as State and Region. The MODEL statement specifies the model, with Expense as the dependent variable and Income as the independent variable. The WEIGHT statement specifies the sampling weight variable Weight.
References
Cochran, W. G. (1977), Sampling Techniques, Third Edition, New York: John Wiley & Sons, Inc.

Fuller, W. A. (1975), "Regression Analysis for Sample Survey," Sankhya, 37 (3), Series C, 117-132.

Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953), Sample Survey Methods and Theory, Volumes I and II, New York: John Wiley & Sons, Inc.

Kalton, G. (1983), Introduction to Survey Sampling, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-035, Beverly Hills and London: Sage Publications, Inc.

Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons, Inc.

Lee, E. S., Forthofer, R. N., and Lorimor, R. J. (1989), Analyzing Complex Survey Data, Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-071, Beverly Hills and London: Sage Publications, Inc.

Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag Inc.

Wolter, K. M. (1985), Introduction to Variance Estimation, New York: Springer-Verlag Inc.

Woodruff, R. S. (1971), "A Simple Method for Approximating the Variance of a Complicated Estimate," Journal of the American Statistical Association, 66, 411-414.
The correct bibliographic citation for this manual is as follows: SAS Institute Inc., SAS/STAT User's Guide, Version 8, Cary, NC: SAS Institute Inc., 1999.

SAS/STAT User's Guide, Version 8
Copyright 1999 by SAS Institute Inc., Cary, NC, USA.
ISBN 1-58025-494-2
All rights reserved. Produced in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of the software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19 Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st printing, October 1999

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

The Institute is a private company devoted to the support and further development of its software and related services.