
STAT 1181 Chapter 3


Contents

3 Producing Data
3.1 Sources of Data
3.2 Designing Samples

Chapter 3

Producing Data

We need reliable data to make decisions, so the data should be collected carefully. In previous chapters, we learned to use some basic tools to analyze data. The validity of the conclusions of an exploratory data analysis depends both on the use of the best methods and on the quality of the data.
In this chapter, we begin with a short overview of the sources of data and then discuss:

• Designed Samples
• Designed Experiments

3.1 Sources of Data

In this section, you will learn:

• Anecdotal evidence

• Available data

• Observation versus experiment

• Population versus sample

• Census or sample

• Advantages of sample surveys

Beware of drawing conclusions from our own experience or hearsay.

Anecdotal evidence is based on haphazardly selected individual cases, which we tend to remember because they are unusual in some way. These cases may not be representative of any larger group of cases.

Available data are data that were produced in the past for some other purpose but that may help answer a present question inexpensively. This is a better choice than anecdotal evidence. The library and the Internet are sources of available data.

Government statistical offices are the primary source for demographic, economic, and social data.

Some questions require data produced specifically to answer them. This leads to designing observational or experimental studies.


Census or sample

A sample survey collects data from a sample of cases that represent a larger population of cases.

We often wish to find information or answer a question about a group or population. For example, a business might want to know something about all American consumers.

Census: An attempt is made to contact every individual in a population in order to answer some question(s) of interest.

Usually, we prefer not to study all the individuals of interest (i.e., to measure variables for every individual of interest). Why?

Sampling: A subset of the members of the population is selected in order to gain information about the population of interest.

Advantages of sample surveys

• Lower cost: Sampling a smaller group is less expensive than gathering information from an entire population.

• Timeliness: Gathering information from an entire population can take a lot of time, while sample information can be gathered relatively quickly.

• Accuracy: The accuracy of information gathered on a smaller scale can be greater than that of information gathered on a larger group.

Example 3.1.1 The Langara Students’ Union conducted a Health & Dental Plan - Student Satisfaction Survey. They sent a questionnaire to 250 students selected at random from a list of students who use the plan, and 230 questionnaires were returned.

a) What is the population in this study?

b) What is the sample in this study?

Observational studies and experiments

Observational study: Record data on individuals without attempting to influence the responses. You simply observe.

Example: Watch the behavior of consumers looking at store displays or the interaction between managers and employees.

Experimental study: Treatment is deliberately imposed (intervention) on individuals and their responses are recorded.

Influential factors can be controlled.

Example: In order to answer the question “Which TV ad will sell more toothpaste?” each ad is shown to a separate group of consumers. The number of consumers who buy the toothpaste is recorded for each group.

Note: Sample surveys are a type of observational study. Experimental studies can also be performed on the sample group.

3.2 Designing Samples

In this section, you will learn:

• Population versus sample

• Sampling methods

• Simple random samples

• Stratified samples

• Cluster samples

• Caution about sampling surveys

Population (or target population): The entire group of individuals in which we are interested but which we usually can’t assess directly. The size of the population is denoted by N.

Examples: All humans, all working-age people in California, all crickets

Sample: The part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design. The size of the sample is denoted by n.

The method used to select a sample from the desired population is called the sample design.

A Sampling Frame or Frame is a list of the individuals from which the sample will be selected. The frame should be close to the target population to avoid selection bias or coverage error.

A representative sample is a proper reflection of the entire population.

Bias

The sample design is biased if it systematically favors a particular outcome.



Non-Sampling Errors

The main sources of bias are summarized below:

• Selection bias (also called coverage error or undercoverage) happens when some cases in the population have less chance of being selected in the sample than others, or when parts of the population are left out in the process of choosing the sample. It occurs when the frame does not match the target population perfectly and therefore does not include all the cases from the population.

Note: Increasing the sample size does not necessarily reduce or eliminate selection bias.

• Non-response bias occurs when certain questions are not answered in a meaningful way. For example, some demographic groups or specific types of people do not respond to some or all of the questions.

Note: If non-response does not cause bias (i.e., the respondents are not systematically different from the non-respondents), we do not consider it a problem. Follow-up procedures should be used to reduce non-response.

• Response bias occurs when we do not collect the true answer because of social desirability, fear or embarrassment, misunderstanding, difficult questions, interviewer effects, memory problems, the phrasing of questions, and more.

Note: Online opinion polls are particularly vulnerable to bias because the people who respond are not representative of the population at large. Why?

We select a sample in order to get information about a larger population. The process of drawing conclusions about a population on the basis of sample data is called inference.

Inferential statistics make estimates, decisions, predictions, or other generalizations about a population based on the selected sample.

Therefore, applying proper sampling methods is important in order to obtain “good” estimates of the population characteristics.

Since statistical inference is a generalization about a population based on sample data, selecting another sample may change the conclusion. Therefore, we try to eliminate or reduce biases and errors.

Important Terminology

• A Parameter is a true numerical value or other measurable characteristic of a population, which is usually unknown and which we wish to estimate (e.g., population mean, µ, population standard deviation, σ, or population proportion, p). The values of parameters are always constant or fixed. Why?

• A Statistic or Sample Statistic is a numerical value used as a summary measure for a sample. It is calculated from sample data, so it is always known or can be calculated (e.g., sample mean, x̄, sample standard deviation, s, or sample proportion, p̂). The values of statistics are not constant or fixed; they vary from sample to sample.

Question: When are the parameters of a population known?

Sources of Error

Sampling Error occurs whenever we study a sample rather than the entire population. Sampling error, which always occurs in sampling, usually decreases when the sample size is increased, but it does not disappear unless a census is conducted instead.

The sample design and the variability within the population can affect the sampling error. As a result, the calculated sample statistic will differ from the population parameter (overestimation or underestimation). This is called the error of estimation, which can be controlled and measured by probability sampling designs.

Note: Sampling error does not invalidate a sample since it is inevitable.

Poor sampling designs

How can we identify whether a sample represents the population?

Sampling designs play an important role in selecting a representative sample.

Non-Random Sampling or Non-Probability Sampling Methods are not based on random selection. The samples are selected based on subjective judgment, convenience, or haphazard contact.

Voluntary response sampling:

Individuals choose to participate, so they select themselves by responding to a general appeal. These samples are very susceptible to bias because different people are motivated to respond or not. Those who respond usually have strong opinions, especially negative opinions.

Such surveys are often called “public opinion polls,” but they are not considered valid or scientific. Online polls use voluntary response samples.

Ann Landers summarizing responses of readers:

Seventy percent of the 10,000 parents who wrote in said that having kids was not worth it: if they had to do it over again, they wouldn’t.

Bias: Most letters to newspapers are written by disgruntled people. A random sample showed that 91% of parents WOULD have kids again.

Convenience sampling: Simply ask those who are around or the cases that are easiest to reach.

Example: “Man on the street” survey (cheap, convenient, often quite opinionated or emotional, and now very popular with TV “journalism”).

Which men, and on which street?

Ask about corporate regulation or unemployment insurance “on the street” in New York or in some small town in Idaho, and you would probably get completely different answers.

Even within an area, answers would probably differ if you did the survey outside a high school or a restaurant.

Bias: Opinions limited to individuals present.

Disadvantage: It is impossible to determine how representative of the population the sample is, and there is usually selection bias in non-probability sampling, in particular in voluntary response and convenience sampling.



Probability Sampling Methods or Random Sampling Methods

Randomly selecting cases eliminates bias by giving all cases an equal chance to be selected in the sample. The simplest way is to draw a sample of names of individuals out of a hat, leaving the choice to chance!

Individuals are randomly selected. No one group should be represented more than the others.

Random samples rely on the absolute objectivity of random numbers. There are tables and books of random digits available for random sampling.

Statistical software can generate random numbers (e.g., Excel “=RAND()”).

1- Simple random sample (SRS)

A simple random sample (SRS) is made of randomly selected cases. Each case in the population has the same probability of being in the sample. All possible samples of size n have the same chance of being drawn.

This is similar to placing names in a hat (the population) and drawing out a handful (the sample).

Randomization

One way to randomize is to rely on random digits to make choices in a neutral way. We can use a table of random digits (like Table B) or the random sampling function of a statistical software package.

• We first label each of the N individuals with a number (typically from 1 to N , or 0 to N − 1).

• A list of random digits is parsed into numbers with the same number of digits as N (if N = 233, then its length is 3; if N = 18, its length is 2).

• The parsed list is read in sequence, and the first n numbers corresponding to a label in our group of N are selected.

• The n individuals with these labels constitute our selection.
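The labeling and parsing steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `srs_from_digits` and the digit string are hypothetical (the string is not taken from Table B), labels are assumed to run from 1 to N, and repeated labels are skipped.

```python
def srs_from_digits(digits, N, n):
    """Pick n distinct labels in 1..N by reading a digit string in fixed-width chunks."""
    width = len(str(N))  # N = 233 -> width 3, N = 18 -> width 2
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        # Skip numbers that are not valid labels and labels already selected.
        if 1 <= label <= N and label not in chosen:
            chosen.append(label)
            if len(chosen) == n:
                break
    return chosen


# Hypothetical row of random digits (an assumption, not a real table line).
row = "1709991307020517"
sample = srs_from_digits(row, N=20, n=5)
print(sample)  # [17, 9, 13, 7, 2]
```

Numbers such as 99 are simply passed over because no individual carries that label.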

Using Table B

A table of random digits is a long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties:

1- Each entry in the table is equally likely to be any of these 10 digits.



2- The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part.

We need to randomly select five employees from a business with 20 employees.

• List and number all members of the population, which is the workforce of 20.

• The number 20 is two digits long.

• Parse the list of random digits into numbers that are two digits long. Here we chose to start with line 103 for no particular reason.

• Randomly choose five employees by reading through the list of two-digit random numbers, starting with line 103 and continuing on.

• The first five random numbers that match numbers assigned to employees make up our selection.

The first individual selected is Ramon (17), then Henry (09). That’s all we can get from line 103.

We then move on to line 104. The next three to be selected are Moe, George, and Amy (13, 07, and 02).

Remember that 1 is 01, 2 is 02, etc.

If you were to hit 17 again before getting five people, don’t sample Ramon twice—just keep going.

Selecting a random sample of n from an Excel spreadsheet

1- Create a column, named “Random”, as your last column.

2- Use the “RAND” function to generate a random number in the first cell, then double-click the lower right corner of that cell to generate random numbers for each row.

3- Create a new column, named “Random Numbers”. Copy the values of the “Random” column and use the “Paste Values” option to paste them into the “Random Numbers” column.

4- Select all the data → Data → Sort.

5- Sort by “Random Numbers”. → ok.

6- Select the top n rows as your random sample.
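The same idea (attach a random number to every row, sort by it, keep the top n rows) can be sketched outside Excel. The Python version below is a minimal illustration, not a replacement for the spreadsheet steps; the function name `sample_by_random_sort` and the employee labels are hypothetical.

```python
import random


def sample_by_random_sort(rows, n, seed=None):
    """Select n rows by sorting on a random key, mirroring the spreadsheet steps."""
    rng = random.Random(seed)
    # Steps 1-3: give every row one fixed random number.
    keyed = [(rng.random(), row) for row in rows]
    # Steps 4-5: sort all rows by that random number.
    keyed.sort(key=lambda pair: pair[0])
    # Step 6: keep the top n rows.
    return [row for _, row in keyed[:n]]


# Hypothetical workforce of 20 employees.
employees = [f"employee_{i:02d}" for i in range(1, 21)]
picked = sample_by_random_sort(employees, n=5, seed=1)
print(picked)
```

Fixing the random values before sorting plays the role of “Paste Values”: the keys must not change while the rows are being sorted.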

Note:

• Every possible sample of size n using the SRS method has the same chance of being selected.

• Every individual of the population has the same chance of being chosen. (Calculate the chance.)
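One way to check the chance mentioned in the note is to count samples: an SRS of size n is one of C(N, n) equally likely samples, and C(N − 1, n − 1) of them contain a given individual. The sketch below assumes N = 20 and n = 5 (the employee example) and uses a hypothetical helper name, `inclusion_probability`.

```python
from fractions import Fraction
from math import comb


def inclusion_probability(N, n):
    """Chance that one given individual is in an SRS of size n from N."""
    samples_with_individual = comb(N - 1, n - 1)  # choose the other n - 1 members
    all_samples = comb(N, n)                      # all equally likely samples
    return Fraction(samples_with_individual, all_samples)


p = inclusion_probability(20, 5)
print(p)            # 1/4
print(comb(20, 5))  # 15504 possible samples of size 5
```

The ratio simplifies to n/N, so each of the 20 employees has a 5/20 = 25% chance of being chosen.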

2- Stratified samples

There is a slightly more complex form of random sampling:

A stratified random sample is essentially a series of SRSs performed on non-overlapping subgroups, called strata, of a given population. The strata contain similar (homogeneous) cases: each stratum is chosen to contain all the individuals with a certain characteristic.

Therefore, the full sample is the combination of the SRSs selected within each stratum.

For example:

• Divide the shoppers at a mall into males and females.



• Divide the population of California by major ethnic group.

• Divide the counties in America as either urban or rural based on criteria of population density.

The SRS taken within each group in a stratified random sample need not be of the same size. For example:

• A stratified random sample of 100 male and 150 female shoppers

• A stratified random sample of a total of 100 Californians, representing proportionately the major ethnic groups

• Advantage: Reduces the error of estimation and produces more precise information than a simple random sample of the same size, in particular when the individuals within each stratum are similar.

• Disadvantage: More difficult to conduct.
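A stratified random sample can be sketched as one SRS per stratum, combined at the end. The example below is a minimal illustration under assumptions: the function name `stratified_sample`, the made-up list of 1,000 shoppers, and the stratum sizes of 100 and 150 (taken from the shopper example above) are for demonstration only.

```python
import random


def stratified_sample(population, stratum_of, sizes, seed=None):
    """Take an SRS of sizes[name] cases inside each stratum and combine them."""
    rng = random.Random(seed)
    strata = {}
    for case in population:
        # Group the cases into non-overlapping strata.
        strata.setdefault(stratum_of(case), []).append(case)
    sample = []
    for name, size in sizes.items():
        # random.sample draws without replacement, i.e. an SRS within the stratum.
        sample.extend(rng.sample(strata[name], size))
    return sample


# Hypothetical mall population: 400 male and 600 female shoppers.
shoppers = [("M", i) for i in range(400)] + [("F", i) for i in range(600)]
chosen = stratified_sample(shoppers, lambda s: s[0], {"M": 100, "F": 150}, seed=2)
print(len(chosen))  # 250
```

The stratum sizes are fixed in advance here, so the sample can never end up with too few cases from one group, which is one reason the error of estimation tends to be smaller than for an SRS of the same size.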

Cluster Sampling

In a cluster sampling method, the cases in the population are first divided into separate (non-overlapping) groups called clusters. A simple random sample of the clusters is then taken. All cases within each sampled cluster form the final sample.

The division in the first step is done primarily by geographical factors, such as province, city, community, or block.

Cluster sampling tends to provide the best results when the individuals within the clusters are not alike. Therefore, in cluster sampling, we prefer to have heterogeneous groups, compared to stratified sampling, in which we have homogeneous groups.

In cluster sampling:

• The cost is reduced.

• Sample size is larger.

• Only a list of all clusters is needed, as opposed to a list of all individuals in the population.
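The two stages of cluster sampling (an SRS of clusters, then every case inside the chosen clusters) can be sketched as follows. The function name `cluster_sample` and the city blocks with ten households each are hypothetical and serve only as an illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Draw an SRS of cluster names, then keep every case in the chosen clusters."""
    rng = random.Random(seed)
    # Only the list of cluster names is needed for the random draw.
    picked = rng.sample(sorted(clusters), n_clusters)
    sample = [case for name in picked for case in clusters[name]]
    return picked, sample


# Hypothetical frame: 8 city blocks, each containing 10 households.
blocks = {
    f"block_{b}": [f"household_{b}_{h}" for h in range(10)]
    for b in range(1, 9)
}
picked_blocks, households = cluster_sample(blocks, n_clusters=3, seed=3)
print(picked_blocks)
print(len(households))  # 30
```

Unlike the stratified design, no random selection happens inside a cluster: once a block is chosen, all of its households are surveyed.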

Systematic Sampling

If a sample of size n is desired from a population containing N individuals, we might sample one individual out of every k individuals in the population (called a 1-in-k sample), with

$k = \frac{N}{n}$, where k is the sampling interval or stepper.

Note: k is a whole number or integer. In a case where the value is not a whole number, it will always be rounded down to the nearest whole number.

We randomly select one of the first k individuals from the population list; this is called the starting point and is denoted by r. Then, we select every kth element that follows in the population list.

The selected individuals, which form the sample, are: r, r + k, r + 2k, r + 3k, ..., r + (n − 1)k.

This method behaves much like a simple random sample, especially if the list of the population individuals has a random ordering (no pattern).
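The 1-in-k procedure can be sketched in a few lines of Python. This is a minimal illustration assuming that individuals are labeled 1 to N in list order; the function name `systematic_sample` is hypothetical.

```python
import random


def systematic_sample(N, n, seed=None):
    """Return the labels r, r + k, ..., r + (n - 1)k of a 1-in-k sample."""
    rng = random.Random(seed)
    k = N // n             # sampling interval, rounded down to a whole number
    r = rng.randint(1, k)  # random starting point among the first k individuals
    return [r + i * k for i in range(n)]


labels = systematic_sample(N=20, n=5, seed=4)
print(labels)
```

Only one random choice is made (the starting point r); every other label follows from it, which is why the method is easy to carry out in the field.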



Advantages: The sample will usually be easier to identify than it would be if simple random sampling were used. Since the systematic selection is spread evenly over the population, it often provides more information than a simple random sample.

Disadvantages: We need a list of the population to conduct systematic sampling. If the list of the population has some pattern or order, we might systematically miss a group of elements when the pattern in the population coincides with the sampling interval, k.

Example 3.2.1 Suppose we wish to use the systematic sampling method to randomly select a sample of 5 students from 20 students.

a) List all the possible samples.

b) Can we say a systematic sample gives all individuals the same chance to be chosen, similar to an SRS? Why?

$k = \frac{20}{5} = 4$, where 4 is the sampling interval or stepper.

Sample 1: If we randomly select r = 1:

1, 5, 9, 13, 17

Sample 2: If we randomly select r = 2:

2, 6, 10, 14, 18

Sample 3: If we randomly select r = 3:

3, 7, 11, 15, 19

Sample 4: If we randomly select r = 4:

4, 8, 12, 16, 20

Each sample has a 1/4 or 25% chance of being selected. (Why?)

Each student belongs to only one sample.
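The four samples listed above can also be generated mechanically and compared with the number of samples an SRS could produce. The sketch below assumes the same labels 1 to 20; the variable names are illustrative only.

```python
from math import comb

N, n = 20, 5
k = N // n  # sampling interval: 4

# One systematic sample for each possible starting point r = 1, ..., k.
systematic = [[r + i * k for i in range(n)] for r in range(1, k + 1)]
for r, members in enumerate(systematic, start=1):
    print(f"r = {r}: {members}")

# An SRS of size 5 could be any of C(20, 5) equally likely samples.
n_srs = comb(N, n)
print(len(systematic), n_srs)  # 4 15504
```

Each student appears in exactly one of the 4 systematic samples, so every student has a 1/4 chance of being chosen, the same as under an SRS (5/20). The difference is that only 4 of the 15,504 possible groups of five can ever be drawn, so not every sample of size 5 is equally likely.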

In summary:

Stratified sampling has a smaller error and provides greater precision but is more costly compared to SRS.

Cluster sampling has a larger error and provides less precision but is less costly compared to SRS.

Systematic sampling requires the list of the population individuals or at least the population size. It is easy to conduct and usually provides more information, since a systematic sample is spread evenly over the population. However, systematic sampling might not provide a representative sample for populations that have patterns. If the population list has no order or pattern, systematic sampling and SRS provide similar results.

Caution about sampling surveys

• Nonresponse: People who feel they have something to hide or who don’t like their privacy being invaded probably won’t answer, yet they are part of the population.



• Response bias: Fancy term for lying when you think you should not tell the truth. This is particularly important when the questions are very personal (e.g., “How much do you drink?”) or related to the past (e.g., “How many texts did you send last year?”). People may also simply not remember the correct answer.

• Wording effects: Questions worded like “Do you agree that it is awful that . . . ” are prompting you to give a particular response. Confusing wording can also affect results.

Exercise

1- To assess the opinions of shoppers at a mall on parking lot safety, a reporter interviews 15 shoppers he meets walking in the mall in the evening who are willing to give their opinion. What is the sample here? What is the population?

• All those shoppers walking in the mall in the evening

• All shoppers at malls with parking lot safety issues

• The 15 shoppers interviewed

• All shoppers approached by the reporter

Acknowledgement

The core content of the slides is from the textbook of this course:

The Practice of Statistics for Business and Economics (5th Edition)

by

Layth C. Alwan, Bruce A. Craig, George P. McCabe
