01 Data Collection
1. DATA COLLECTION
When faced with a research problem, you need to collect, analyze and interpret data to answer your research questions. Examples of research
questions that could require you to gather data include how many people will vote for a candidate, what is the best product mix to use and how
useful is a drug in curing a disease. The research problem you explore informs the type of data you’ll collect and the data collection method you’ll
use. In this article, we will explore various types of data, methods of data collection and advantages and disadvantages of each. After reading
our review, you will have an excellent understanding of when to use each of the data collection methods we discuss.
Data that is expressed in numbers and summarized using statistics to give meaningful information is referred to as quantitative data. Examples
of quantitative data we could collect are heights, weights, or ages of students. If we obtain the mean of each set of measurements, we have
meaningful information about the average value for each of those student characteristics.
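As a minimal sketch of this kind of summary, the Python snippet below computes the mean of each set of measurements. The student values are made-up illustrative numbers, not data from any study, and the variable names are assumptions chosen for this example.

```python
from statistics import mean

# Hypothetical measurements for five students (illustrative values only).
students = {
    "height_cm": [150, 160, 155, 165, 170],
    "weight_kg": [45, 50, 55, 60, 65],
    "age_years": [14, 15, 15, 16, 15],
}

# The mean of each list summarizes that characteristic for the group.
averages = {name: mean(values) for name, values in students.items()}

for name, value in averages.items():
    print(f"mean {name}: {value}")
```

Each mean condenses a whole column of measurements into one number that describes the typical student in the group.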
When we use data for description without measurement, we call it qualitative data. Examples of qualitative data are student attitudes towards
school, attitudes towards exam cheating and friendliness of students to teachers. Such data cannot be easily summarized using statistics.
Primary Data
When we obtain data directly from individuals, objects or processes, we refer to it as primary data. Quantitative or qualitative data can be collected
using this approach. Such data is usually collected solely for the research problem you will study. Primary data has several advantages. First,
we tailor it to our specific research question, so there are no customizations needed to make the data usable. Second, primary data is reliable
because you control how the data is collected and can monitor its quality. Third, by collecting primary data, you spend your resources in collecting
only required data. Finally, primary data is proprietary, so you enjoy advantages over those who cannot access the data.
Despite its advantages, primary data also has disadvantages of which you need to be aware. The first problem with primary data is that it is
costlier to acquire as compared to secondary data. Obtaining primary data also requires more time as compared to gathering secondary data.
Secondary Data
When you collect data after another researcher or agency that initially gathered it makes it available, you are gathering secondary data. Examples
of secondary data are census data published by the US Census Bureau, stock prices data published by CNN and salaries data published by the
Bureau of Labor Statistics.
One advantage to using secondary data is that it will save you time and money, although some data sets require you to pay for access. A second
advantage is the relative ease with which you can obtain it. You can easily access secondary data from publications, government agencies, data
aggregation websites and blogs. A third advantage is that it eliminates effort duplication since you can identify existing data that matches your
needs instead of gathering new data.
Despite the benefits it offers, secondary data has its shortcomings. One limitation is that secondary data may not be complete. For it to meet your
research needs, you may need to enrich it with data from other sources. A second shortcoming is that you cannot verify the accuracy of secondary
data, or the data may be outdated. A third challenge you face when using secondary data is that documentation may be incomplete or missing.
Therefore, you may not be aware of any problems that happened in data collection which would otherwise influence its interpretation. Another
challenge you may face when you decide to use secondary data is that there may be copyright restrictions.
Now that we’ve explained the various types of data you can collect when conducting research, we will proceed to look at methods used to collect
primary and secondary data.
When you decide to conduct original research, the data you gather can be quantitative or qualitative. Generally, you collect quantitative data
through sample surveys, experiments and observational studies. You obtain qualitative data through focus groups, in-depth interviews and case
studies. We will discuss each of these data collection methods below and examine their advantages and disadvantages.
Sample Surveys
A survey is a data collection method where you select a sample of respondents from a large population in order to gather information about that
population. The process of identifying the individuals from the population whom you will interview is known as sampling.
To gather data through a survey, you construct a questionnaire to prompt information from selected respondents. When creating a questionnaire,
you should keep in mind several key considerations. First, make sure the questions and choices are unambiguous. Second, make sure the
questionnaire will be completed within a reasonable amount of time. Finally, make sure there are no typographical errors. To check if there are
any problems with your questionnaire, use it to interview a few people before administering it to all respondents in your sample. We refer to this
process as pretesting.
Using a survey to collect data offers you several advantages. The main benefit is time and cost savings because you only interview a sample,
not the large population. Another benefit is that when you select your sample correctly, you will obtain information of acceptable accuracy.
Additionally, surveys are adaptable and can be used to collect data for governments, health care institutions, businesses and any other
environment where data is needed.
A major shortcoming of surveys occurs when you fail to select a sample correctly; without an appropriate sample, the results will not accurately
generalize to the population.
Once you have selected your sample and developed your questionnaire, there are several ways you can interview participants. Each approach
has its advantages and disadvantages.
In-person Interviewing
When you use this method, you meet with the respondents face to face and ask questions. In-person interviewing offers several advantages.
This technique has excellent response rates and enables you to conduct interviews that take a longer amount of time. Another benefit is you can
ask follow-up questions to responses that are not clear.
In-person interviews do have disadvantages of which you need to be aware. First, this method is expensive and takes more time because of
interviewer training, transport, and remuneration. A second disadvantage is that some areas of a population, such as neighborhoods prone to
crime, cannot be accessed which may result in bias.
Telephone Interviewing
Using this technique, you call respondents over the phone and interview them. This method offers the advantage of quickly collecting data,
especially when used with computer-assisted telephone interviewing. Another advantage is that collecting data via telephone is cheaper than
in-person interviewing.
One of the main limitations of telephone interviewing is that it is hard to gain the trust of respondents. For this reason, you may not get responses
or may introduce bias. Since phone interviews are generally kept short to reduce the possibility of upsetting respondents, this method may also
limit the amount of data you can collect.
Online Interviewing
With online interviewing, you send an email inviting respondents to participate in an online survey. This technique is used widely because it is a
low-cost way of interviewing many respondents. Another benefit is anonymity; you can get sensitive responses that participants would not feel
comfortable providing with in-person interviewing.
When you use online interviewing, you risk not getting a representative sample. You also cannot seek clarification on
responses that are unclear.
Mailed Questionnaire
When you use this interviewing method, you send a printed questionnaire to the postal address of the respondent. The participants fill in the
questionnaire and mail it back. This interviewing method gives you the advantage of obtaining information that respondents may be unwilling to
give when interviewing in person.
The main limitation with mailed questionnaires is you are likely to get a low response rate. Keep in mind that inaccuracy in mailing address,
delays or loss of mail could also affect the response rate. Additionally, mailed questionnaires cannot be used to interview respondents with low
literacy, and you cannot seek clarifications on responses.
Focus Groups
When you use a focus group as a data collection method, you identify a group of 6 to 10 people with similar characteristics. A moderator then
guides a discussion to identify attitudes and experiences of the group. The responses are captured by video recording, voice recording or writing—
this is the data you will analyze to answer your research questions. Focus groups have the advantage of requiring fewer resources and time as
compared to interviewing individuals. Another advantage is that you can request clarifications to unclear responses.
One disadvantage you face when using focus groups is that the sample selected may not represent the population accurately. Furthermore,
dominant participants can influence the responses of others.
In an observational data collection method, you acquire data by observing any relationships that may be present in the phenomenon you are
studying. There are four types of observational methods that are available to you as a researcher: cross-sectional, case-control, cohort and
ecological.
In a cross-sectional study, you only collect data on observed relationships once. This method has the advantage of being cheaper and taking
less time as compared to case-control and cohort. However, cross-sectional studies can miss relationships that may arise over time.
Using a case-control method, you identify cases and controls and then compare them. A case has experienced the outcome of interest
while a control has not. After identifying the cases and controls, you look back in time to see how exposure to the suspected cause differs between
the two groups. This is why case-control studies are referred to as retrospective. For example, suppose you are a medical researcher who suspects
that a certain type of cosmetic is causing skin cancer. You recruit people who have skin cancer, the cases, and people who do not, the
controls. You then ask participants to remember the type of cosmetic they used and how often they used it. This method is cheaper and requires
less time than the cohort method. However, this approach has limitations when the individuals you are observing cannot accurately recall
information. We refer to this as recall bias because you rely on the ability of participants to remember information. In the cosmetic example, recall
bias would occur if participants could not accurately remember the type of cosmetic and the number of times they used it.
In a cohort method, you follow people with similar characteristics over a period. This method is advantageous when you are collecting data on
occurrences that happen over a long period. It has the disadvantage of being costly and requiring more time. It is also not suitable for occurrences
that happen rarely.
The three methods we have discussed previously collect data on individuals. When you are interested in studying a population instead of
individuals, you use an ecological method. For example, say you are interested in lung cancer rates in Iowa and North Dakota. You obtain the number
of cancer cases per 1000 people for each state from the National Cancer Institute and compare them. You can then hypothesize possible causes
of differences between the two states. When you use the ecological method, you save time and money because data is already available.
However, the data collected may lead you to infer population relationships that do not exist.
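The comparison in the ecological example rests on a simple rate calculation, which can be sketched in Python as follows. The function name and the counts are assumptions made for illustration; the two regions are placeholders and the numbers are not real cancer statistics.

```python
def rate_per_1000(cases: int, population: int) -> float:
    """Return the number of cases per 1,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 1000 / population

# Hypothetical counts for two unnamed regions (placeholders, not real data).
region_a = rate_per_1000(cases=1_200, population=2_000_000)
region_b = rate_per_1000(cases=500, population=1_000_000)

print(f"region A: {region_a:.2f} cases per 1000 people")
print(f"region B: {region_b:.2f} cases per 1000 people")
```

Expressing both figures per 1000 people puts populations of different sizes on the same scale, which is what makes the comparison meaningful.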
1.1.6 Experiments
An experiment is a data collection method where you as a researcher change some variables and observe their effect on other variables. The
variables that you manipulate are referred to as independent while the variables that change as a result of manipulation are dependent variables.
Imagine a manufacturer is testing the effect of drug strength on number of bacteria in the body. The company decides to test drug strength at
10mg, 20mg and 40mg. In this example, drug strength is the independent variable while number of bacteria is the dependent variable. The drug
administered is the treatment, while 10mg, 20mg and 40mg are the levels of the treatment.
The greatest advantage of using an experiment is that you can explore causal relationships that an observational study cannot. Additionally,
experimental research can be adapted to different fields like medical research, agriculture, sociology, and psychology. Nevertheless, experiments
have the disadvantage of being expensive and requiring a lot of time.
1.1.7 Summary
This article introduced you to the various types of data you can collect for research purposes. We discussed quantitative, qualitative, primary and
secondary data and identified the advantages and disadvantages of each data type. We also reviewed various data collection methods and
examined their benefits and drawbacks. Having read this article, you should be able to select the data collection method most appropriate for
your research question. Data is the evidence that you use to solve your research problem. When you use the correct data collection method, you
get the right data to solve your problem.
A survey is a way to ask a lot of people a few well-constructed questions. The survey is a series of unbiased questions that the subject
must answer. Some advantages of surveys are that they are efficient ways of collecting information from a large number of people, they are
relatively easy to administer, a wide variety of information can be collected and they can be focused (researchers can stick to just the questions
that interest them). Some disadvantages of surveys arise from the fact that they depend on the subjects’ motivation, honesty, memory and ability
to respond. Moreover, answer choices to survey questions could lead to vague data. For example, the choice “moderately agree” may mean
different things to different people or to whoever ends up interpreting the data.
One of the most important applications of statistics is collecting information. Statistical studies are done for many purposes:
Probability Sampling
– Simple Random Sampling
– Stratified Sampling
– Cluster Sampling
– Systematic Sampling
Non-probability sampling is a sampling technique in which the researcher selects samples based on the subjective judgment of the
researcher rather than random selection.
In non-probability sampling, not all members of the population have a chance of participating in the study unlike probability sampling, where
each member of the population has a known chance of being selected.
PROBABILITY SAMPLING
Probability sampling is a sampling technique in which samples from a larger population are chosen using a method based on the
theory of probability. For a participant to be considered part of a probability sample, he or she must be selected at random.
The most important requirement of probability sampling is that everyone in your population has a known, nonzero chance of being
selected. In simple random sampling, that chance is also the same for everyone.
Random sampling is a method in which people are chosen “out of the blue.” In a true random sample, everyone in the population must have
the same chance of being chosen. It is important that each person in the population has a chance of being picked.
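A simple random sample can be sketched in a few lines of Python. This is only an illustration: the function name, the list of ID numbers and the seed are assumptions introduced here, and the standard library's `random.sample` is used to give every member the same chance of being chosen.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical sampling frame: ID numbers for 500 people.
population = list(range(1, 501))
sample = simple_random_sample(population, sample_size=20, seed=42)
print(sample)
```

Because the draw is made without replacement, no person can appear in the sample twice.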
Stratified sampling is a method that actively seeks to poll people from many different backgrounds. The population is first divided into different
categories (or strata) and the number of members in each category is determined. Respondents are then drawn from every category so that each
one is represented in the sample. In order to lessen the chance of a biased result, the sample size must be large enough. The larger the sample
size is, the more precise the estimate is. However, the larger the sample size, the more expensive and time-consuming the statistical study becomes.
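One common way to carry out stratified sampling is proportional allocation, where each stratum contributes respondents in proportion to its size. The sketch below assumes that approach; the text does not prescribe it, and the function name, the grade-level strata and the member labels are hypothetical.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Sample from each stratum in proportion to its share of the population.

    strata maps a stratum name to the list of its members.
    Rounding means the total can differ slightly from sample_size.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: bigger strata contribute more respondents.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, min(k, len(members)))
    return sample

# Hypothetical school with three grade levels of different sizes.
strata = {
    "grade_9": [f"g9_{i}" for i in range(200)],
    "grade_10": [f"g10_{i}" for i in range(150)],
    "grade_11": [f"g11_{i}" for i in range(150)],
}
chosen = stratified_sample(strata, sample_size=50, seed=1)
print({name: len(members) for name, members in chosen.items()})
```

Counting the members of each stratum first, as the text describes, is what makes the proportional split possible.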
Cluster sampling is a method of choosing representatives that are close to one another based on a particular factor such as location, age, color
or size.
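Cluster sampling is usually carried out by dividing the population into groups (clusters), randomly choosing some of the clusters, and surveying everyone in the chosen clusters. That procedure is the common textbook one and is an assumption here, since the text gives only a brief definition. The function name and the neighborhood data below are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and return every member of those clusters."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)  # choose cluster names
    return {name: clusters[name] for name in picked}

# Hypothetical town split into neighborhoods (the clusters).
clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1", "w2"],
}
selected = cluster_sample(clusters, n_clusters=2, seed=7)
print(selected)
```

The random choice happens at the level of clusters, not individuals, which is what distinguishes this method from stratified sampling.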
Systematic Sampling is a probability sampling method where the elements are chosen from a target population by selecting a random starting
point and selecting other members at a fixed ‘sampling interval’. The sampling interval is calculated by dividing the entire population size by the
desired sample size.
For example, when selecting captains for sports teams in school, many coaches ask students to call out numbers from 1 to 5 (1 to n). The students
who call out a number chosen by the coach, say 3, then become the captains of the different teams.
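The systematic sampling procedure described above can be sketched in Python as follows. The function name and the list of ID numbers are assumptions for illustration, and integer division is used for the sampling interval, which is one reasonable choice when the population size is not an exact multiple of the sample size.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start, then take every k-th member after it."""
    if not 0 < sample_size <= len(population):
        raise ValueError("sample_size must be between 1 and the population size")
    rng = random.Random(seed)
    interval = len(population) // sample_size  # sampling interval k
    start = rng.randrange(interval)            # random start in [0, k)
    return [population[start + i * interval] for i in range(sample_size)]

# Hypothetical sampling frame: 100 people numbered 1 to 100.
population = list(range(1, 101))
sample = systematic_sample(population, sample_size=10, seed=3)
print(sample)
```

With 100 people and a desired sample of 10, the interval is 10, so the sample consists of every tenth person after the random starting point.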
Biased Samples
If the sample ends up with one or more sub-groups that are either over-represented or under-represented, then we say the sample
is biased. We would not expect the results of a biased sample to represent the entire population, so it is important to avoid selecting a biased
sample.
Some surveyors may deliberately seek a biased sample in order to obtain a particular viewpoint. For example, if a group of students were trying
to petition the school to allow eating candy in the classroom, they might only survey students immediately before lunchtime when students are
hungry. The practice of polling only those who you believe will support your cause is sometimes referred to as cherry picking.
Many surveys may have a non-response bias. In this case, a survey that is simply handed out gains few responses when compared to the
number of surveys given out. People who are either too busy or simply not interested will be excluded from the results. Non-response bias may
be reduced by conducting face-to-face interviews.
Self-selected respondents who tend to have stronger opinions on subjects than others and are more motivated to respond may also cause
bias. For this reason phone-in and online polls also tend to be poor representations of the overall population. Even though it appears that both
sides are responding, the poll may disproportionately represent extreme viewpoints from both sides, while ignoring more moderate opinions
that may, in fact, be the majority view. Self-selected polls are generally regarded as unscientific.
Biased Questions
Although your sample may be a good representation of the population, the way questions are worded in the survey can still provoke a biased
result. There are several ways to identify biased questions.
1. They may use polarizing language, words, and phrases that people associate with emotions.
o How much of your time do you waste on TV every week?
2. They may refer to a majority or to a supposed authority.
o Would you agree with the American Heart and Lung Association that smoking is bad for your health?
3. They may be phrased so as to suggest the person asking the question already knows the answer to be true, or to be false.
o You wouldn’t want criminals free to roam the streets, would you?
4. They may be phrased in an ambiguous way (often with double negatives), which may confuse people.
o Do you disagree with people who oppose the ban on smoking in public places?
Designing a Survey
Surveys can take different forms. They can be used to ask only one question or they can ask a series of questions. We can use surveys to test
out people’s opinions or to test a hypothesis. When designing a survey, keep the following steps in mind:
1. Determine the goal of your survey: What question do you want to answer?
2. Identify the sample population: Whom will you interview?
3. Choose an interviewing method: face-to-face interview, phone interview, self-administered paper survey, or internet survey.
4. Decide what questions you will ask in what order, and how to phrase them. (This is important if there is more than one piece of information
you are looking for.)
Example 1:
What if you wanted to know the percentage of the people in your town who eat breakfast at least 5 times per week? Your town is too big to ask
each person individually, so what would you do? How could you ensure that the information you find out is accurate?
Because your town is too big to ask each person individually, you could survey a random sample of residents. In order to ensure that the information you
collect is accurate, you can use an unbiased survey question.
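To see how a random sample stands in for the whole town, the Python sketch below simulates the breakfast survey. Everything in it is an assumption for illustration: the town's answers are generated by the program rather than collected, and the town size, sample size and function name are hypothetical.

```python
import random

def sample_percentage(responses):
    """Percentage of True answers among the given respondents."""
    return 100 * sum(responses) / len(responses)

# Hypothetical town: True means "eats breakfast at least 5 times per week".
# These answers are simulated for illustration, not survey results.
rng = random.Random(0)
town = [rng.random() < 0.6 for _ in range(10_000)]

sample = rng.sample(town, 200)  # simple random sample of 200 residents
print(f"estimate from sample: {sample_percentage(sample):.1f}%")
print(f"value for the whole simulated town: {sample_percentage(town):.1f}%")
```

Running the sketch shows the sample estimate landing near the town-wide percentage without asking every resident, which is the point of sampling.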
Example 2:
Raoul wants to construct a survey that shows how many hours per week the average student at his school works.
1. List the goal of the survey.
The goal of the survey is to determine the number of hours the average student at Raoul's school works.
The population to be surveyed is the student body. Care should be taken to randomly select students, so that variety of student life
is represented. Surveying the football team only would not be a good representation of the whole student body. Students can be
randomly surveyed, using dice or a random number generator, in a setting where all, or most, students will be, such as an
assembly.
CLASS ASSIGNMENT:
Comment on the way the following samples have been chosen. For the unsatisfactory cases, suggest a way to improve the sample choice.
a. You want to find whether wealthier people have more nutritious diets by interviewing people coming out of a five-star restaurant.
b. You want to find if a pedestrian crossing is needed at a certain intersection by interviewing people walking by that intersection.
c. You want to find out if women talk more than men by interviewing an equal number of men and women.
d. You want to find whether students in your school get too much homework by interviewing a stratified sample of students from each
grade level.
e. You want to find out whether there should be more public busses running during rush hour by interviewing people getting off the
bus.
f. You want to find out whether children should be allowed to listen to music while doing their homework by interviewing a stratified
sample of male and female students in your school.