Statistics Individuals Sample Population Randomly Probability
Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process, and each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals [1]. This process and technique is known as simple random sampling, and should not be confused with systematic random sampling. Simple random sampling is a basic type of sampling, since it can be a component of other more complex sampling methods.

The principle of simple random sampling is that every object has the same probability of being chosen. For example, suppose N college students want a ticket for a basketball game, but there are only X tickets (fewer than N), so they decide on a fair way to see who gets to go. Everybody is given a number (0 to N-1), and random numbers are generated, either electronically or from a table of random numbers. Numbers that do not correspond to any student are ignored, as are any numbers previously selected. The first X valid numbers identify the lucky ticket winners.

In small populations and often in large ones, such sampling is typically done "without replacement", i.e., one deliberately avoids choosing any member of the population more than once. Although simple random sampling can be conducted with replacement instead, this is less common and would normally be described more fully as simple random sampling with replacement. Draws made without replacement are no longer independent, but they still satisfy exchangeability, hence many results still hold. Further, for a small sample from a large population, sampling without replacement is approximately the same as sampling with replacement, since the odds of choosing the same individual twice are low.

An unbiased random selection of individuals is important so that, in the long run, the sample represents the population. However, this does not guarantee that a particular sample is a perfect representation of the population.
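The ticket example can be sketched in a few lines of Python. This is only an illustration: the function name, the population of 50 students, the 5 tickets and the seed are all assumptions made for the sketch. The standard-library call `random.sample` draws without replacement, so it plays the role of "ignore any number previously selected".

```python
import random

def draw_ticket_winners(n_students, n_tickets, seed=None):
    """Return n_tickets distinct student numbers drawn from 0..n_students-1.

    Every subset of n_tickets students is equally likely to be returned,
    which is the defining property of a simple random sample.
    """
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    return rng.sample(range(n_students), n_tickets)

# Hypothetical numbers: 50 students competing for 5 tickets.
winners = draw_ticket_winners(n_students=50, n_tickets=5, seed=1)
print(sorted(winners))
```

Leaving `seed` as `None` gives a fresh draw each time, which is what a real lottery would want.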
Simple random sampling merely allows one to draw externally valid conclusions about the entire population based on the sample. Conceptually, simple random sampling is the simplest of the probability sampling techniques. It requires a complete sampling frame, which may not be available or feasible to construct for large populations. Even if a complete frame is available, more efficient approaches may be possible if other useful information is available about the units in the population. Advantages are that it is free of classification error, and it requires minimum advance knowledge of the population other than the frame. Its simplicity also makes it relatively easy to interpret data collected via SRS. For these reasons, simple random sampling best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items, or where the cost of sampling is small enough to make efficiency less important than simplicity. If these conditions are not true, stratified sampling or cluster sampling may be a better choice.
Distinction between a systematic random sample and a simple random sample
2. In the case that any selected person is returned to the selection pool, i.e., can be picked more than once (geometric distribution):
This means that every student in the school has in any case approximately a 1 in 10 chance of being selected using this method. Further, all combinations of 100 students have the same probability of selection. If a systematic pattern is introduced into random sampling, it is referred to as "systematic (random) sampling". For instance, suppose the students in our school had numbers attached to their names ranging from 0001 to 1000, and we chose a random starting point, e.g. 0533, and then picked every 10th name thereafter to give us our sample of 100 (starting over with 0003 after reaching 0993). In this sense, this technique is similar to cluster sampling, since the choice of the first unit will determine the remainder. This is no longer simple random sampling, because some combinations of 100 students have a larger selection probability than others: for instance, {3, 13, 23, ..., 993} has a 1/10 chance of selection, while {1, 2, 3, ..., 100} cannot be selected under this method.

Sampling a dichotomous population
If the members of the population come in two kinds, say "red" and "black", one can consider the distribution of the number of red elements in a sample of a given size. That distribution depends on the numbers of red and black elements in the full population. For a simple random sample with replacement, the distribution is a binomial distribution. For a simple random sample without replacement, one obtains a hypergeometric distribution.
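To see how close the two distributions are, the probabilities can be computed directly. The sketch below uses only `math.comb` from the Python standard library; the population of 100 elements with 30 red ones and the sample size of 10 are made-up numbers for illustration.

```python
from math import comb

def hypergeom_pmf(k, n, reds, total):
    """P(exactly k red) in a sample of n drawn WITHOUT replacement."""
    if k < 0 or k > n or k > reds or n - k > total - reds:
        return 0.0
    return comb(reds, k) * comb(total - reds, n - k) / comb(total, n)

def binom_pmf(k, n, p):
    """P(exactly k red) in a sample of n drawn WITH replacement."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical population: 30 red and 70 black elements, sample of 10.
reds, total, n = 30, 100, 10
for k in range(n + 1):
    without = hypergeom_pmf(k, n, reds, total)
    with_repl = binom_pmf(k, n, reds / total)
    print(f"{k:2d}  without replacement: {without:.4f}  with replacement: {with_repl:.4f}")
```

Increasing `total` while keeping the proportion of red elements fixed brings the two columns closer together, which matches the earlier remark that the two schemes are nearly the same for a small sample from a large population.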
Both the most common methods of calculating p values and confidence limits and the output from most statistics computer programmes assume simple random sampling.
Regardless of what form your data are in, the important characteristic of simple random sampling is that the person doing the selecting has NO CONTROL over which households are selected. The selection is entirely random, and the selection of each household is not dependent on the selection of other households.

Example of simple random sampling of 10 households from a list of 40 households

We have a list of 40 heads of households. Each has a unique number, 1 through 40. We want to select 10 households randomly from this list. Using a random number table, we select consecutive 2-digit numbers starting from the upper left. If a random number matches a household's number, that household is added to the list of selected households. If a random number does not match a household's number (for example, if it is greater than 40), then it does not select a household. After each random number is used, it is crossed out so that it is never used again. We continue to select households until we have 10.
Note that even though the selected households appear somewhat clustered, if the random number table is truly random, the selected households have been randomly selected.
ideal for
hard to achieve in practice
requires an accurate list of the whole population
expensive to conduct as those sampled may be scattered over a wide area
If we were doing market research and wanted to sample two houses from a street containing houses numbered 1 to 48, we would read off the digits in pairs:

36 80 22 31 88 46 54 18 04 98 52 45 70 71 25 97

and take the first two pairs that were no greater than 48, which gives house numbers 36 and 22. If we wanted to sample two houses from a much longer road with 140 houses in it, we would need to read the digits off in groups of three:

368 022 318 846 541 804 985 245 707 125 97

and the houses to visit would be the first two groups that fall between 1 and 140: 22 and 125. Houses in a road usually have numbers attached, which is convenient (except where there is no number 13). In many cases, however, one has first to give each member of the population a number. For a group of 10 people we could number them as:

0 Appleyard
1 Banyard
2 Croft
3 Durran
4 Entwhistle
5 Francis
6 Gray
7 Hibbert
8 Jones
9 Lillywhite

By numbering them from 0 to 9 you need only use single digits from the random number table:

36802231884654180498524570712597

In this case the first digit is 3 and so Durran is chosen.
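The table-reading procedure used in these three examples can be written as one small routine. The sketch below is an illustration, not a standard tool: the function name and its parameters are invented here. It reads the digit string in fixed-size groups, discards groups outside the valid range or already chosen, and stops once enough numbers are found.

```python
def pick_from_digits(digits, group_size, lowest, highest, how_many):
    """Read `digits` in consecutive groups; keep the first valid, unused numbers."""
    chosen = []
    # Step through complete groups only; a short leftover group is dropped.
    for start in range(0, len(digits) - group_size + 1, group_size):
        number = int(digits[start:start + group_size])
        if lowest <= number <= highest and number not in chosen:
            chosen.append(number)
            if len(chosen) == how_many:
                break
    return chosen

DIGITS = "36802231884654180498524570712597"

print(pick_from_digits(DIGITS, 2, 1, 48, 2))   # street of 48 houses  -> [36, 22]
print(pick_from_digits(DIGITS, 3, 1, 140, 2))  # road of 140 houses   -> [22, 125]
print(pick_from_digits(DIGITS, 1, 0, 9, 1))    # ten people, 0 to 9   -> [3]
```

The result is a list of numbers, so mapping `3` back to Durran is left to whatever numbered list the population was given.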
The most common sampling design in vegetation science is simple random sampling. Simple random sampling is a type of probability sampling where each sampling location is equally likely to be selected, and the selection of one location does not influence which is selected next. In statistical terms, the sampling locations are independent and identically distributed.

Consider an example of simple random sampling (SRS) of canopy forest trees. You have determined that there are 24 canopy trees in the sampling universe of interest, and you want to take measurements from a subset of this group of 24, using simple random sampling. One way to do this is to number each tree (1-24), put numbers in a hat, and pick one. The tree corresponding to the number is now part of your sampling subset. Each number (that is, each tree) is equally likely to get picked and picking one number doesn't change the probability that another number will get picked next time.

There are two versions of random sampling: sampling with replacement and sampling without replacement. In the example of tree numbers in a hat, if you return the selected number to the hat, the corresponding tree has another chance to get selected. (And if selected, you repeat your measurements on the tree.) That is sampling with replacement. If instead you discard a number once it is selected (sampling without replacement), a tree can be selected only once. In vegetation science, SRS without replacement is much more common than SRS with replacement.

Picking numbers out of a hat is perfectly valid, if done correctly, but there are better ways to select random numbers. Even if you are familiar with using random number tables and random number generators in calculators, review the section of the course called How to use random number tables and generators.
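For readers who prefer a random number generator to a hat, the two versions of sampling can be sketched with Python's standard `random` module. The sample size of 6 and the seed are arbitrary choices for this illustration; only the 24 numbered trees come from the example above.

```python
import random

rng = random.Random(42)       # seed chosen only so the illustration is repeatable
trees = list(range(1, 25))    # the 24 canopy trees, numbered 1 to 24

# Without replacement: a number leaves the hat once drawn, so no tree repeats.
without_replacement = rng.sample(trees, k=6)

# With replacement: the number goes back in the hat, so a tree may repeat.
with_replacement = rng.choices(trees, k=6)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

`random.sample` can never return more items than there are trees, whereas `random.choices` can return any number of draws.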
The general sequence for conducting simple random sampling studies in the field
The general procedures for any simple random sampling study in vegetation science are about the same. First, as has been emphasized in the course, you must determine your ecological objectives. For example, you might wish to know the stand basal area of a community. Then decide on the sampling scheme, such as sampling by individual or sampling by area, as with quadrats. Then pick individuals or locations at random (this is the simple random sampling part), and take your measurements. Finally, you use the data you collected to make inferences about the whole sampling universe, coming up with statements like "the stand basal area is 49 m²/ha." In this section of the course you will learn the procedures for locating random samples in the field and the formulas for analyzing data collected from simple random samples.
Sampling by individual
The first step is to number all the individuals in your sampling universe. In simple random sampling, each of these individuals has an equal chance of being selected. This step is a lot trickier than it might seem. For one thing, you must use an unambiguous definition of what constitutes an individual. Plants that spread vegetatively are notoriously difficult to separate into individuals. You must also enumerate all individuals in your sampling universe; if you don't, you violate the "equal chance of being selected" tenet of simple random sampling. Perhaps the most common use of sampling by individuals is with mature trees, where separate trunks define individuals and it is feasible to number all individuals. Sampling rhizomatous grasses, mosses, and much of the rest of the plant world by individual is usually not feasible. Once your individuals are numbered, the next step is to select among those numbers at random, using a random number table or random number generator. You will make your measurements on the group of selected individuals. It can be inefficient to pick a random number, take measurements on that individual, pick another random number, take measurements on that individual, and so forth. Much better is to pick the numbers for all the individuals to be measured ahead of time. Then you can plot a short path that visits each selected individual, and save yourself a lot of time.
The figure on the right shows how this works. You have selected four trees at random. Don't go to the first tree you selected (marked as 1), make measurements, then traipse to the second tree. Rather, pick an efficient path, as from tree 3 to tree 1 to tree 4 to tree 2.
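One simple way to plan such a path ahead of time is a greedy "go to the nearest unvisited tree" rule. The sketch below is an assumption-laden illustration: the tree coordinates and the starting point are invented, and the greedy rule is a convenient heuristic, not a guarantee of the shortest possible route.

```python
from math import dist

def visiting_order(start, positions):
    """Greedy nearest-neighbour order for visiting the selected individuals.

    positions maps a label (the order in which the tree was drawn) to its
    (x, y) location in metres.
    """
    remaining = dict(positions)
    here = start
    order = []
    while remaining:
        # Walk next to whichever selected tree is closest to where we stand.
        label = min(remaining, key=lambda name: dist(here, remaining[name]))
        order.append(label)
        here = remaining.pop(label)
    return order

# Hypothetical map positions of four randomly selected trees.
trees = {1: (40.0, 30.0), 2: (90.0, 80.0), 3: (10.0, 5.0), 4: (60.0, 55.0)}
print(visiting_order((0.0, 0.0), trees))  # -> [3, 1, 4, 2]
```

Reordering the visits changes nothing about which trees were selected, so the sample remains random.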
In the illustration, using this flawed technique for selected trees would produce misleading data. Because of crowding, trees within the clumps tend to be stunted and trees on the edge of clumps larger. In the illustration, taking measurements from trees that were selected because they are closest to random points will strongly overestimate tree abundance, because you are more likely to select trees on the edges of clumps.
random location within your study area. The figure shows where your quadrat would be located if you picked as your random pair of numbers X = 60.7 and Y = 36.2
OK, but finding your quadrat in the field is not as easy as finding it on a diagram. The most efficient process is to create one axis of this coordinate system by placing a meter tape along one side of the study area, with the zero end of the tape at one corner. To locate your plot, go to the point on this axis corresponding to the first number in your random number pair. Then run a second tape out at right angles for a distance corresponding to the second number in your random number pair. Do not round the coordinates in the field. Repeat this process for each quadrat location.

As usual, it is more efficient to select the series of random numbers first, even in the lab well before going to the field. That way you can rearrange the sequence of quadrats into an efficient order.

Once you have your random location for the quadrat, you need a system for actually placing the quadrat on the ground. You want a system that doesn't harm the vegetation and a system that is statistically valid. See the section on 'Hints for dealing with reality' for my advice.
An important note about resolution

The axes in the coordinate system represent continuous numbers from 0 to the end of the axis. When picking random numbers, however, you have to determine how many digits of resolution to use in locating your quadrats. The example used a resolution of whole meters (0 digits). That means that quadrats could not be located at 61.4 m or 60.9 m. What resolution is acceptable? Use a resolution that is at least as fine as your quadrat size. For example, if you are using a 0.5 m by 0.5 m quadrat, use a resolution of at least 0.5 m. If you use a 1-m resolution, as in the example, then 3/4 of your study area will not be available for sampling! (Do you see why?) Because most quadrats in vegetation science are in the range of 0.2 m to 1.0 m, I recommend using a resolution of 0.01 m or sometimes 0.1 m.
Using a resolution of 0.01 m instead of a resolution of 1 m takes no more work with the random number table, except that you read two additional digits. It also takes no more work in the field. If you are using a standard tape in herbs and low shrubs, measuring to the nearest centimeter is just as fast as measuring to the nearest decimeter (0.1 m) or meter. You still find your position along the tape in the same way. So go with the finest resolution on your measuring device feasible in the vegetation you are studying. That's usually 0.01 m.
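If you generate the coordinates by computer, the 0.01 m resolution can be built in by drawing whole centimetres. The following is a minimal sketch under assumed conditions: a rectangular study area of 100 m by 50 m, five quadrats, and a function name invented for this example.

```python
import random

def random_quadrat_locations(width_m, height_m, how_many, rng):
    """Return distinct (x, y) positions in metres at 0.01 m resolution."""
    locations = set()
    while len(locations) < how_many:
        x_cm = rng.randint(0, width_m * 100)    # whole centimetres along the baseline tape
        y_cm = rng.randint(0, height_m * 100)   # whole centimetres along the second tape
        locations.add((x_cm / 100, y_cm / 100))
    # Sorting by x lets you work along the baseline tape in one pass.
    return sorted(locations)

# Hypothetical 100 m by 50 m study area, five quadrats.
locations = random_quadrat_locations(100, 50, 5, random.Random(7))
for x, y in locations:
    print(f"X = {x:.2f} m, Y = {y:.2f} m")
```

Sorting the list is one way of rearranging the quadrats into an efficient order before heading to the field.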
To actually find these quadrat locations in the field, use the procedure described for the coordinate system. (Now is a good chance to visit How to use random number tables and generators, if you haven't already. This section explains some nuances about using random numbers in the coordinate and grid systems.)
Many studies in vegetation science do not have the luxury of rectangular study areas. You can still use the coordinate system, but there is some extra work involved. Basically, you pick random coordinates as before but discard any locations that fall outside your sampling universe. This process is a lot easier if you have a map of the area boundary so you can select random locations in the lab.
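The "pick, then discard if outside" rule is easy to automate once the boundary is available as a list of map coordinates. The sketch below is one possible way to do it, using a standard ray-casting test for whether a point lies inside a polygon; the boundary coordinates, function names and seed are all invented for illustration.

```python
import random

def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # this edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_points_in_polygon(vertices, how_many, rng):
    """Pick coordinates in the bounding rectangle; discard those outside."""
    xs = [vx for vx, _ in vertices]
    ys = [vy for _, vy in vertices]
    points = []
    while len(points) < how_many:
        x = round(rng.uniform(min(xs), max(xs)), 2)  # 0.01 m resolution
        y = round(rng.uniform(min(ys), max(ys)), 2)
        if point_in_polygon(x, y, vertices):
            points.append((x, y))
    return points

# Hypothetical study-area boundary, in metres.
boundary = [(0, 0), (80, 0), (100, 40), (50, 70), (0, 50)]
points = random_points_in_polygon(boundary, 5, random.Random(3))
print(points)
```

Because rejected locations are simply redrawn, every accepted location is still equally likely anywhere inside the boundary.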
The grid system for selecting sample locations does not work well for non-rectangular study areas because the study area usually cannot be broken up into equal-sized rectangles.
Use a GPS unit with a high precision (at least within 5 m). Be wary of GPS units that drift rapidly, that is, the readings change before your eyes. Drift occurs because the unit has not locked onto enough satellites or does not have good enough software. If you have any drift, create a rule for knowing when you have reached your location, otherwise your subjective judgment will creep in and the sampling will no longer be random. A good rule is to stop the first time the GPS unit says that you have reached your destination. That is, do not wait for the unit to "settle down." Once you have stopped, have a rule for locating the plot itself. I like to use the mid-point between the toes of my boots.

If you use your GPS unit to enter the boundary of your study area as a polygon, you can use some GIS systems to help in sampling. For example, many GIS systems will select random coordinates from within a polygon. This automatic process is much faster than the coordinate method when your study area is highly irregular in shape.
If your next random sample location is filled with gopher holes or tire ruts, what do you do? If a deer trail runs through it or if the field crew had lunch at that spot, what do you do? There are two important questions involved. First, what is the cause of the damage? Second, what is your sampling universe? If the damage was caused by the process of sampling, skip that location and select another spot (using your procedure for randomization). If the damage was by another agent, like gophers, you then need to decide if locations disturbed by gophers are part of your sampling universe. It is legitimate to exclude damaged locations from your study, but only if you exclude them from your inferences. To see why this is important, think back to our familiar sword-fern example. Originally, the objective was to estimate sword-fern production in the entire tract. If you decide to exclude from your sample locations damaged by skid trails and gravel pits, you must state explicitly that your inferences are only for parts of the forest undamaged by skid trails and gravel pits. After all, if 50% of your forest is damaged by skid trails and gravel pits, extrapolating your measurements from pristine samples to the whole forest would be misleading and wrong.
Avoid self-inflicted damage
It is unavoidable. You have to walk through your study area as you establish its boundaries, as you find your sampling locations, and as you shift from side to side as you collect data. If a plot ends up where your boots have ripped up the vegetation, what do you do? (See the previous paragraph.) Best minimize the damage that you and your crew-mates inflict on your study area. Walk on animal trails when you can.
Know where your future plots will be, so you can avoid walking through those locations. Eat your lunch outside the study area.
Warnings and technicalities about the coordinate system.
When using the coordinate system, you need to decide if the selected coordinates designate the center of the quadrat or one of its corners. You also need to pick a plot orientation. Just pick a system (like "put the plot center at the selected coordinate and orient the long dimension of the quadrat north to south") and stick with it. The point of the system is to eliminate any subconscious bias in placing the quadrat frame. For example, in my experience, folks tend to move the frame away from poison oak but toward pretty flowers! Having a system protects your data from your subconscious biases.

The coordinate system has problems along the boundaries. Let's say you are using coordinates to locate the center of a 1-m by 1-m quadrat. Then a coordinate of 0.3 m would place part of the quadrat outside the sampling area. In this system, any coordinate value less than half the length of the quadrat will put part of the quadrat outside the study area. (The same goes for coordinates near the far end of the axis.) This is no good! You could just skip quadrat locations that extend beyond the edge of the study area. Or you could shrink the size of the quadrat, cutting it off at the boundary of the study area. Or you could fugedaboudit and ignore any quadrat that extends beyond your study area. None of these solutions is completely satisfactory because in different ways they violate the rules of simple random sampling. But the techniques for dealing with this problem correctly are much too difficult to use in the field.

What to do?! I recommend that you skip quadrat locations that put the quadrat beyond the edge of the study area. Realize that this isn't quite legitimate, but it is probably the least bad. Besides, in practice, quadrats are much smaller than the study area and these difficulties with boundaries have little impact. But I thought you should know.
Overlapping quadrats
Sometimes the selection of random locations leads to quadrats that overlap each other. This is statistically acceptable and goes by the technical name of "sampling with replacement." But overlapping quadrats are hardly ever used in vegetation science. For one thing, the vegetation around the previous quadrat is usually disturbed by the process of sampling. The second, overlapping quadrat would then be damaged and give false data. (See above.) The standard procedure in vegetation science is to drop any random locations that would produce an overlap with a previous quadrat.
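Both screening rules (skip a location whose quadrat would extend past the edge, and drop a location whose quadrat would overlap an earlier one) can be applied in the lab when the random coordinates are drawn. The sketch below assumes a rectangular study area, square quadrats that all share the same orientation, and coordinates that mark the quadrat center; the function name and the dimensions are made up for the example.

```python
import random

def place_quadrats(width_m, height_m, side_m, how_many, rng, max_tries=10_000):
    """Draw quadrat centers, skipping any that overhang the edge or overlap."""
    half = side_m / 2
    centres = []
    for _ in range(max_tries):
        if len(centres) == how_many:
            break
        x = rng.randint(0, width_m * 100) / 100   # 0.01 m resolution
        y = rng.randint(0, height_m * 100) / 100
        # Part of the quadrat would fall outside the study area.
        overhangs = (x < half or y < half or
                     x > width_m - half or y > height_m - half)
        # Two equal, identically oriented squares overlap when their
        # centers are closer than one side length on both axes.
        overlaps = any(abs(x - cx) < side_m and abs(y - cy) < side_m
                       for cx, cy in centres)
        if not overhangs and not overlaps:
            centres.append((x, y))
    return centres

# Hypothetical 100 m by 50 m study area, ten 1-m by 1-m quadrats.
centres = place_quadrats(100, 50, 1.0, 10, random.Random(11))
print(centres)
```

The `max_tries` limit is only a safeguard so the loop ends even if the requested number of quadrats cannot fit.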
An important purpose of these guidelines for locating samples is to take the process out of our subjective hands and into an objective set of procedures. So it is important to follow the objective procedure precisely. But it is also important to recognize which parts of your procedures are crucial for maintaining objective, representative, and independent observations, and which parts are not.

Imagine yourself at the end of a hard morning of sampling, when you discover that all your quadrat locations are off by half a meter because the tape establishing one Cartesian axis wasn't pulled quite tight enough. Do you throw away your data from the morning and start over? Not if you're on my crew! As long as the mistake didn't push a location outside your study boundary, everything is OK. The mistake was unintentional, so it couldn't impose a subjective choice on the location of quadrats. The locations are still random and independent of each other. Therefore the data collected from those locations are completely valid. Note the corrected locations, and get ready for the afternoon.
With simple random sampling without replacement, the best estimate of the population mean ($\mu$) is usually the sample mean $\bar{x}$, the mean of your n measurements $x_1, \dots, x_n$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The best estimate of the population variability is usually the standard deviation of your data:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

There are separate formulas for $\bar{x}$ and $s^2$ for other sampling designs, like stratified random sampling and cluster sampling. Refer to the course references for details. Be sure to keep in mind your scientific objective: You want to make statements about the population mean and about your confidence in that mean. That is, you need to know the variability of your estimate of the mean, not the variability of the data. Lucky for us, statistical theory provides a way to convert from describing data to describing the behavior of your estimates of the mean:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}\,\sqrt{1 - \frac{n}{N}}$$

where n is the size of your sample, N is the size of the entire population, and $s_{\bar{x}}$ is the amount you expect your estimates of the mean to vary. $s_{\bar{x}}$ is often called the standard error.
But what about the factor on the right in the equation? This factor is called the finite population correction, or fpc. It is necessary because statistical distributions describe infinite populations, but sampling is from a carefully delimited (finite) population. (Reminder: The step of defining your study area / statistical population / sampling universe is the step that makes the sampling population finite.) You can see the effect of sample size on fpc at the extremes. When N is very large and n very small, fpc approaches 1 and the formula reduces to that of the familiar standard error.
When n = N, fpc = 0, which makes the estimate of variability = 0! But this makes sense because you have measured every member of the population and you now have a census, not a sample. Because you know the whole population, you know the mean exactly and there is no sampling error.

Most studies in vegetation science ignore the finite population correction. Although technically incorrect, in practice it has little effect because sampling intensity in vegetation science is typically very low. For example, the sampling intensity of a study using 20 1-m² quadrats per hectare is only 20/10000, so the fpc is

$$\sqrt{1 - \frac{n}{N}} = \sqrt{1 - \frac{20}{10000}} \approx 0.999$$

which is very close to 1.0. For the rest of the course, we will follow this grand tradition and usually not bother with the finite population correction factor unless sampling intensity goes above 10%.
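The calculations in this section can be checked with Python's standard `statistics` module. This is a sketch under two assumptions: the finite population correction takes the common form sqrt(1 - n/N), and the eight percent-cover readings are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values, population_size=None):
    """Standard error of the mean, optionally with the finite population correction."""
    n = len(values)
    se = stdev(values) / sqrt(n)          # stdev uses the n - 1 denominator
    if population_size is not None:
        se *= sqrt(1 - n / population_size)   # assumed form of the fpc
    return se

# Hypothetical percent-cover readings from eight quadrats.
cover = [22, 28, 31, 19, 25, 27, 24, 30]

print("sample mean:        ", mean(cover))
print("standard deviation: ", round(stdev(cover), 3))
print("standard error:     ", round(standard_error(cover), 3))
print("with fpc, N = 10000:", round(standard_error(cover, 10_000), 3))

# The fpc for 20 quadrats out of 10,000 possible 1-m² locations.
fpc = sqrt(1 - 20 / 10_000)
print("fpc:", round(fpc, 4))
```

With so small a sampling intensity the corrected and uncorrected standard errors agree to about three decimal places, which is why the correction is usually ignored.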
Confidence and confidence interval
The next step is to convert your estimates of the population mean and its variability into confidence intervals. The statistical formula for the confidence interval with simple random sampling is the same as the standard formula (see the Statistical Background chapter and the Confidence Interval primer): $\bar{x} - t\,s_{\bar{x}}$ to $\bar{x} + t\,s_{\bar{x}}$.

As before, $\bar{x}$ is usually the best estimate of the population mean; t, the t-statistic, reflects both the number of samples and the level of confidence you have set (like 90%); and $s_{\bar{x}}$, the standard error, reflects the variability in the data.
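As a worked illustration of the interval, the sketch below takes the t-statistic as an input that you look up in a t table for n - 1 degrees of freedom; the Python standard library has no t distribution, so nothing here computes it. The eight basal-area values are hypothetical, and 1.895 is the tabled two-sided 90% value for 7 degrees of freedom. The finite population correction is left out, following the convention described above.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval(values, t_value):
    """Return (lower, upper) limits: mean minus and plus t times the standard error."""
    centre = mean(values)
    half_width = t_value * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width

# Hypothetical stand basal area estimates (m²/ha) from eight sample locations.
basal_area = [46.0, 52.5, 49.0, 44.5, 51.0, 47.5, 53.0, 48.5]

# t for 90% confidence with n - 1 = 7 degrees of freedom, taken from a t table.
low, high = confidence_interval(basal_area, t_value=1.895)
print(f"90% confidence interval: {low:.1f} to {high:.1f} m²/ha")
```

Collecting more samples narrows the interval in two ways: the standard error shrinks and the tabled t value gets smaller.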
More on precision
Before you use your carefully calculated values of central tendency and variability, pause a while to reflect on what contributes to the variability you measure.

If your technique of vegetation measurement varied from one time to another (and you know it did), then this measurement variability contributes to overall variability.
If the vegetation itself varied from one sample location to another (and it always does), then this spatial heterogeneity or sampling variability contributes to overall variability.
Here's the important part. You can reduce the effect of sampling variability just by collecting more statistically valid samples. But the only way to reduce measurement variability is to get better at conducting the measurements themselves. That is what a lot of Chapter 3 was about, and what Chapter 9 will state again.