Module 1 - Section 3 - Data Collection

The document discusses different sampling methods researchers can use to gather data from a population, including simple random sampling, stratified random sampling, systematic random sampling, and cluster random sampling. It provides examples and explanations of each sampling method.

Uploaded by

jejeprm0

Module 1 Gathering Data (Assigned readings)

To determine a population parameter (a numerical summary of a population) exactly, researchers would need to conduct a census.

What is a Census?

- A census is a special sample that includes everyone and “samples” the entire population.

- There are problems with taking a census:

o Too expensive

o Undercoverage (may not actually include everyone)

o Too time-consuming

Because of these problems, researchers prefer to collect data from a sample instead. Summaries computed from sample data are called sample statistics. These sample statistics can be used as estimates of a population parameter. There are two types of conclusions (or inferences) that a researcher can make with sample statistics.

Two types of Inferences:

a) Population Inference

- Results from the sample can be generalized to an entire population (as estimates)

b) Causal (cause-and-effect) Inference

- When comparing the results from two treatment groups, the difference in the responses can be attributed to the difference in treatments.

Population Inference

We should only make population inferences when we have random sampling (in other words, when the individuals in the sample are selected at random from the population).


- Randomizing helps to eliminate the effect of unknown extraneous factors, even ones that we may not have thought about.

o Randomizing makes sure that, on the average, the sample looks like the rest of the population.

- Non-random sampling leads to biased results (results that tend to over- or under-emphasize some characteristics of the population).

o There is usually no way to fix a biased sample and no way to salvage useful information from it.

o The best way to avoid bias is to select individuals for the sample at random.

- When we do not have random sampling from a population, conclusions should be restricted to the sample. That is, we should not generalize the sample results to anyone else.

Random Sampling Methods

1) Simple Random Samples (SRS)

- SRS of size n: each sample of size n in the population has the same chance of being selected.

o Ex: put all the names of the individuals in the population in a box and draw names to complete the sample.

- Samples drawn at random generally differ from one another.

o Each draw of random numbers selects different people for our sample.

o These differences lead to different values for the variables we measure.

o We call these sample-to-sample differences sampling variability.
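Sampling variability can be seen in a short simulation. The sketch below is only an illustration: the population of 1,000 test scores is made up, and the helper name `simple_random_sample` is hypothetical. It draws three simple random samples of size 50 and compares their means.

```python
import random
import statistics

def simple_random_sample(population, n, rng):
    """Draw an SRS of size n: every subset of n individuals is equally likely."""
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical population: 1,000 made-up test scores (not real data).
rng = random.Random(1)
population = [rng.gauss(70, 10) for _ in range(1000)]

# Draw three independent samples of size 50 and compute each sample mean.
sample_means = [
    statistics.mean(simple_random_sample(population, 50, rng))
    for _ in range(3)
]

print("population mean:", round(statistics.mean(population), 2))
print("sample means:   ", [round(m, 2) for m in sample_means])
```

Each sample mean is a statistic that estimates the same population mean (the parameter), yet the three values differ from one another. That difference is the sampling variability described above.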

Example:

Suppose a local school district decides to randomly test high school students for their overall school experience. There are three high schools in the district, each with grades 9-12. The school board pools all of the students together and randomly samples 250 students. Is this a simple random sample?

A. Yes, because each student is equally likely to be chosen.

B. Yes, because they could have chosen any sample of 250 students from throughout the district.

C. No, because we can’t guarantee that there are students from each school in the sample.

D. No, because we can’t guarantee that there are students from each grade in the sample.

E. There is not enough info to know whether it is a simple random sample.


Example: A random sample of 2000 Canadians was asked to name their favorite fast-food restaurant in 2019. ABC Restaurant had the highest percentage, with 10% of Canadians ranking it as their favorite restaurant. Which is TRUE?

I. The population of interest is all Canadians.

II. 10% is a statistic and not a population parameter for the percentage of all Canadians who would rank this restaurant as their favorite.

III. This sampling design should provide a reasonably representative estimate of the actual percentage of all Canadians who would rank this restaurant as their favorite.

A. I only

B. II only

C. III only

D. I, II, and III

2) Stratified Random Sampling

- The population is first divided into homogeneous groups, called strata; an SRS is then taken within each stratum, and the results are combined.

- Stratified random sampling can reduce bias.

- Stratifying can also reduce the variability of our results.

- Ex: Suppose you want to estimate the proportion of Canadians that support federal party X based on an appropriate representation from each province.

o You could break up the population by province (strata) and select an SRS from each province.
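The stratified procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: the three strata "A", "B", and "C" and their members are invented, and `stratified_sample` is a hypothetical helper name.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Take an SRS of n_per_stratum individuals inside every stratum, then combine."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

# Hypothetical strata: three made-up "provinces" with labelled residents.
strata = {
    "A": [f"A{i}" for i in range(500)],
    "B": [f"B{i}" for i in range(300)],
    "C": [f"C{i}" for i in range(200)],
}

rng = random.Random(2)
sample = stratified_sample(strata, 5, rng)
print(sample)
```

Every stratum is guaranteed to appear in the sample, which an SRS of the same total size cannot promise. Taking the same number from each stratum is only one possible choice; sampling each stratum in proportion to its size is another.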

Example: Suppose a store owner decides to randomly test the hygiene of his staff. There are 10 departments in his store, and each department has 20 staff members. He plans to test 40 of his staff members by randomly choosing 4 staff members from each department. Is this a simple random sample?

A. Yes, because the staff members were chosen at random.

B. Yes, because each staff member is equally likely to be chosen.

C. Yes, because stratified samples are a type of simple random sample.

D. No, because not all possible groups of 40 staff members could have been the sample. This is a stratified sample.

E. No, because a random sample of departments was not first chosen.
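The reasoning behind option D can be checked by counting. The sketch below uses only the numbers in the example (10 departments, 20 staff each, 4 chosen per department); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Number of different groups of 40 that an SRS from all 200 staff could produce.
srs_samples = comb(200, 40)

# Number of different groups the owner's plan can produce:
# choose 4 of 20 in each of the 10 departments.
stratified_samples = comb(20, 4) ** 10

print(comb(20, 4))                       # 4845 ways to pick 4 staff in one department
print(stratified_samples < srs_samples)  # True

# Chance that a given staff member is selected under each plan.
p_stratified = Fraction(4, 20)
p_srs = Fraction(40, 200)
print(p_stratified == p_srs)             # True: both equal 1/5
```

Each staff member has the same 1-in-5 chance of being tested under either plan, yet the owner's plan can never produce, say, a group with 5 people from one department. Equal individual chances are therefore not enough to make a sample an SRS; every group of 40 must be possible and equally likely.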


3) Systematic Random Sampling

- Start from a randomly selected individual, then sample every kth person.

- When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample.

- Systematic sampling can be much less expensive than true random sampling.

- Ex: Suppose you want to estimate the proportion of individuals that support federal party X in your area. You can set up a booth in your area and ask every 50th person your question.
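A systematic sample from a list can be sketched as follows. The roster of 100 numbered individuals and the helper name `systematic_sample` are assumptions made for illustration.

```python
import random

def systematic_sample(items, k, rng):
    """Pick a random start among the first k items, then take every k-th item."""
    start = rng.randrange(k)  # random starting position: 0, 1, ..., k - 1
    return items[start::k]

# Hypothetical roster of 100 individuals, numbered 1 to 100.
roster = list(range(1, 101))
sample = systematic_sample(roster, 20, random.Random(3))
print(sample)
```

Only the starting point is random; after that the sample is fully determined, which is why the order of the list must be unrelated to the responses.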

4) Cluster Random Sampling

- The population is split into similar groups (or clusters); one or a few clusters are selected at random, and a census is performed within each of them.

o This sampling design is called cluster sampling.

o If each cluster fairly represents the full population, cluster sampling will give us an unbiased sample.

- Cluster sampling is not the same as stratified sampling. Consider how.
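One way to see the contrast with stratified sampling is to write cluster sampling out. In this sketch the ten clusters of 20 members each are made up, and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random, then take a census of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(clusters[name])  # everyone in the chosen cluster is included
    return chosen, sample

# Hypothetical population: 10 clusters with 20 members each.
clusters = {
    f"cluster{c}": [f"cluster{c}-member{i}" for i in range(20)]
    for c in range(10)
}

chosen, sample = cluster_sample(clusters, 2, random.Random(4))
print(chosen, len(sample))
```

Stratified sampling draws a few individuals from every group, whereas cluster sampling takes every individual from a few groups.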

Example: Suppose a store owner decides to randomly test the hygiene of his staff. There are 10 departments in his store, and each department has 20 staff members. He plans to test 40 of his staff members by randomly choosing two departments and checking everyone in these two departments. Is this cluster sampling?

A. Yes, because the staff members were chosen at random.

B. No, because each staff member is equally likely to be chosen.

C. Yes, because cluster samples are a type of simple random sample.

D. No, because not all possible groups of 40 staff members could have been the sample.

E. Yes, because each department is a cluster, and everyone within the two randomly chosen departments is the sample.

Example: A statistics teacher wants to know how her students feel about an introductory statistics course. She decides to administer a survey to a random sample of students taking the course. She has several sampling plans to choose from. Name the sampling strategy in each.

a. There are four levels of students taking the class: 1st year, 2nd year, 3rd year, and 4th year. Randomly select 15 students from each level.

⮚ Stratified Random Sample

b. Divide the class into 4 similar groups, and randomly select one of the groups and survey every student in that group.

⮚ Cluster Random Sample

c. Each student has a seven-digit student number. Randomly choose 60 numbers.

⮚ Simple Random Sample


d. Using the class roster, select every fifth student from the list.

⮚ Systematic Random Sample

Example: You want to determine the proportion of university students that have “jobs” while attending school. What kind of sample can you get if:

1) you have a complete list of students?

⮚ SRS (Simple random sample)

2) you have a complete list of students and want to make sure each faculty is appropriately represented?

⮚ Stratified Random Sample. Select an SRS from each faculty.

3) you do not have a complete list of students but believe that, at any given time, the group of students in each classroom across campus forms a representative sample of the entire student body?

⮚ Cluster Sample. Select an SRS of classrooms, then go to each selected classroom and sample everyone.

4) you do not have a complete list of students, but you believe that students walking in front of the Registrar’s building throughout the day are a good representation of the entire student body?

⮚ Systematic Random Sample. Because of the massive number of students walking past your booth, you can’t sample everyone, but you sample every 20th person that walks by.

Recall: Bias is the tendency for a sample to differ from the corresponding population in some systematic way.

Sources of Bias:

1) Selection Bias (Undercoverage): when some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population.

- Usually the people that are not covered differ from the rest of the population, so bias exists.

o Ex: a sample survey of households will miss persons with no fixed address and prison inmates.


o Ex: an opinion poll conducted by telephone will miss the households without residential phones.

2) Response Bias: refers to anything in the survey design that influences the responses.

- Respondents may lie, especially if asked about illegal or unpopular behavior.

o Ex: Did you lie to your friends last week?

3) Voluntary Response Bias: occurs when individuals can choose on their own whether to participate in the sample.

o Ex: an internet poll asks people how they feel about the healthcare system. People can choose whether they want to participate.

4) Nonresponse Bias: occurs when a large proportion of those sampled fail to respond.

o Ex: a telephone survey is conducted to study the eating habits of office workers. Those who are selected but happen to be away on vacation can’t respond to the survey.

o Ex: a large number of magazine subscribers did not respond to a survey made by this magazine.

Example:

Name and describe the kind of bias that might be present if a statistics teacher decides that, instead of randomly selecting students to survey on how they feel about the course, she just asks students to volunteer for the survey.

⮚ Voluntary response bias: the bias would probably be towards those students who say they enjoy the course.

Example:

A chemistry professor who teaches a large lecture class surveys the students who attend his class on how he can make the class more interesting to get more students to attend. This survey method suffers from what?

A. nonresponse bias

B. response bias

C. undercoverage

D. none of the above

Example:


A question posted on a Canadian website asked visitors to the site whether they think that a particular Canadian Government bill should pass.

⮚ Population – all Canadian adults

⮚ Parameter – proportion that feels the bill should pass

⮚ Sample – those visiting the website who responded

⮚ Method – voluntary response (no randomization employed)

⮚ Bias – voluntary response bias; those who visit the website and respond may be predisposed to a particular answer.

Example:

In order to determine the proportion of the voting population that supports a new government policy, a local news organization carried out a survey. The results showed that only 32% of the people answering the survey support the policy.

Consider where they obtained their evidence (data) and answer these questions:

Can these results be generalized to the entire voting population? Comment on any problems with the ways in which the data were collected.

i. An online survey was conducted. They asked individuals to log onto their website and offer their opinion.

⮚ No random selection; there is voluntary response bias, so generalizing to any population is not possible.

⮚ Selection/Undercoverage Bias: People without a computer (or TV) can’t respond!

ii. They randomly selected individuals at the local mall and asked for their opinion.

⮚ Even though a random sample was conducted, it was not a random sample from the population of interest.

⮚ We could perhaps generalize the result to the population of shoppers at the local mall. We should also take into account the time the survey was conducted.

⮚ Selection/Undercoverage Bias: You have to be at the mall to respond.

⮚ Nonresponse Bias: People who are selected may refuse to respond.

iii. They randomly selected phone numbers and called people.

⮚ So far, the best method, but still…

⮚ Selection/Undercoverage Bias: People without a phone cannot be selected.

⮚ Nonresponse Bias: People who are called may hang up!


Causal (cause-and-effect) Inference

- We should only make cause-and-effect (causal) inferences when we have random allocation.

o When there is no random allocation, the difference in responses could have been caused by lurking variables.

o Lurking variables are variables that are related to both group membership and to the response. These are other variables that could possibly explain the result.

Example:

After many dogs and cats suffered health problems caused by contaminated foods, a researcher is trying to find out whether a newly formulated pet food is safe. The experiment will feed some dogs the new food and some cats a food known to be safe, and a veterinarian will check the response.

Why would it be a bad design to feed the test food to some dogs and the safe food to some cats?

⮚ There are lurking variables not accounted for. We would not be able to tell whether any differences in animals’ health were attributable to the food they had eaten or to the differences in how the two species responded.

Example:

For children between the ages of 6 and 10, the larger their shoe size, the better they do on a particular math test. Does this mean that larger feet cause students to do better on the test?

⮚ Probably not. Age is a plausible lurking (confounding) variable: it is a more reasonable explanatory variable for test performance, and it also explains shoe size.
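The shoe-size example can be illustrated with a small simulation. Everything in this sketch is invented for illustration: the units, the noise levels, and the assumption that shoe size and test score each depend on age only and not on each other. The helper `pearson` computes the usual correlation coefficient.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(5)
children = []
for _ in range(2000):
    age = rng.randint(6, 10)
    shoe = age + rng.gauss(0, 1)          # assumption: shoe size depends on age only
    score = 10 * age + rng.gauss(0, 10)   # assumption: score depends on age only
    children.append((age, shoe, score))

# Correlation between shoe size and score over all children.
overall = pearson([c[1] for c in children], [c[2] for c in children])

# The same correlation computed separately within each age group.
within = {
    age: pearson(
        [c[1] for c in children if c[0] == age],
        [c[2] for c in children if c[0] == age],
    )
    for age in range(6, 11)
}

print("overall:", round(overall, 2))
print("within age groups:", {a: round(r, 2) for a, r in within.items()})
```

Across all children the association is clearly positive, but among children of the same age it largely disappears, which is what we would expect if age, not foot size, drives the test scores.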

Study Designs

There are two main types of study designs:

1) Observational Studies

2) Randomized Experiments


Observational Study

- In an observational study, the investigator observes individuals and measures variables of interest but does NOT attempt to influence the responses.

- Observational studies are valuable for discovering trends and possible relationships.

- However, it is not possible for observational studies to demonstrate a causal relationship.

- There are two types of observational study: retrospective and prospective.

- Example: Study the impact of the distance from home to a power plant on cancer incidence.

o Since the researchers did not assign households to live at a certain distance from the plant (i.e., they did not influence the distance) and simply observed randomly selected households, it was an observational study.

o Because researchers in this example first identified the distance of the subjects’ homes to the power plant and then collected data on whether they had cancer, this was a retrospective study.

▪ Historical data is useful when an outcome is rare.

▪ Historical data, however, may contain many types of observation error.

- Example (revised):

o Suppose instead that the researchers identify the distance of the subjects’ homes to the power plant and then keep checking on these subjects over the next twenty years to collect data on whether they develop cancer.

o Because the researchers identify subjects in advance and collect data as events unfold, the study would be a prospective study.

▪ Observe explanatory and response variables.

▪ More costly than a retrospective study, but the design can better avoid the many types of observation error.

Example: There are more cats being diagnosed with organ enlargement than in the past. A researcher identified 50 kittens not already diagnosed with organ enlargement and followed the cats for several years to see if any developed organ enlargement. This is a(n)

A. Randomized experiment

B. Survey

C. Prospective study

D. Retrospective study


Randomized, Comparative Experiments

- An experiment is a study design that allows us to establish a cause-and-effect relationship.

- An experiment:

a. Manipulates factor levels to create treatments.

b. Randomly assigns subjects to these treatment levels.

c. Compares the responses of the subject groups across treatment levels.

- Example: Study the impact of exercise on blood pressure.

We can enroll individuals in the study and assign exercise schedules to them.

Summary:

1. Random Selection of Individuals (Random Sampling):

- the individuals in the sample are selected randomly from the population

=> population inferences are allowed

2. Random Allocation (Random Assignment):

- the individuals are randomly assigned to different treatment groups

=> causal inferences are allowed

NOTE: We cannot make causal inferences from observational studies.
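Random allocation can be carried out by shuffling the subjects and dealing them into groups. The following is a minimal sketch: the subject labels, the group names "drug" and "placebo", and the helper `randomly_allocate` are all hypothetical.

```python
import random

def randomly_allocate(subjects, groups, rng):
    """Shuffle the subjects, then deal them into the groups in turn."""
    shuffled = subjects[:]  # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    allocation = {group: [] for group in groups}
    for i, subject in enumerate(shuffled):
        allocation[groups[i % len(groups)]].append(subject)
    return allocation

# Hypothetical study with 20 enrolled subjects and two treatment groups.
subjects = [f"subject{i}" for i in range(1, 21)]
allocation = randomly_allocate(subjects, ["drug", "placebo"], random.Random(6))

for group, members in allocation.items():
    print(group, members)
```

Because chance alone decides who receives which treatment, lurking variables tend to balance out across the groups on average. Note that this step is separate from random sampling: it says nothing about how the 20 subjects were chosen from the population.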

Example:

According to a recent national article, the female average wage grew faster than the male average wage at ABC Bank. These results are based on a random sample of females and males working at ABC Bank across Canada in 1971.

i. Can we conclude that, in general, the female average wage grew faster than the male average wage at any bank?

⮚ No. We do not have a random sample from ALL banks.

ii. Can we conclude that, in general, the female average wage grew faster than the male average wage at any branch of this bank (across Canada)?

⮚ Yes. We have a random sample from THIS bank.

iii. Is there evidence of sex discrimination in this bank? In other words, do these results imply that the female average wage grew faster than the male average wage because the employees are female?

⮚ Because this was an observational study, NO causal inference should be made.


Example:

We propose 2 designs to test the effectiveness of a new medication in relieving migraines:

Design 1:

In order to test the effectiveness of a new medication, a random sample of individuals was chosen from a particular population. The drug was administered to the individuals the instant they began to experience pain. After a fixed period of time elapsed, they were asked to rate the effectiveness of the drug (in terms of pain reduction) on a 10-point scale, from 1 (the drug was ineffective) to 10 (no pain). The results showed the average rating was 6.6.

The study was repeated.

Design 2:

This time, each individual was randomly assigned to one of two treatment groups.

One group took the new drug and the other took a placebo. The results showed that the average rating for the drug group was 8.6 and the average rating for the placebo group was 6.7.

i. What do these results imply about the whole population? Can we generalize these results to everyone in the population?

⮚ Yes, because we have random selection, we can generalize these results to the population of interest.

ii. Can we conclude that the drug is effective in relieving migraines? Did the drug cause a decrease in pain?

⮚ Not in Design 1: There is no random allocation.

⮚ Yes in Design 2: The individuals were randomly assigned to treatment groups.

iii. Consider the variable that is being measured. Does a rating of 8 mean the same thing for all individuals?

⮚ It is hard to tell, but most likely not. As a result, there may be response bias.
