LESSON 1: Obtaining Data
Introduction
In any research undertaking, data must be collected, organized, analyzed, and
interpreted. Without the necessary data, no research activity can hope to succeed.
It goes without saying, therefore, that the collection of data is an essential phase
of the research process.
In learning the concepts of data analysis, students will also learn the value of
accuracy, patience, critical thinking, timeliness, efficiency, effectiveness, and
cooperation.
Learning Content
Here's a simple analogy: Imagine you have a list of numbers: 25, 30, 28, 32, 29.
This is data. It's just a set of numbers without context.
To make this data meaningful, we need to give it context. For example, if we know
these numbers represent the daily temperatures in a city, they become information.
We can now understand that the city experienced a range of temperatures from
25 to 32 degrees.
B. Examples of Data
Numbers: 10, 5.2, 1000
Text: "Hello World", "The sky is blue."
Images: Photographs, drawings, diagrams
Audio: Recordings of speech, music
Video: Movies, documentaries
Data are facts, or a set of information gathered or under study. According to Good,
data is “an accepted number, quantity, facts, or relation used as a basis for drawing
conclusions, making inferences, or carrying out investigations.”
In short, data is the building block of information. It's the raw material that, when
processed and analyzed, transforms into knowledge and insights that can be used
to make informed decisions.
Think of it like this:
Structured Data: A neatly organized library with books shelved by subject
and author.
Unstructured Data: A chaotic attic filled with boxes, furniture, and random
items.
a. Discrete Data. Data that can only take on specific, separate values, often whole
numbers. (Counting - “No. of…”)
For example, the number of apples in a basket, the number of students in a class,
number of units enrolled, monthly salary, and grade.
b. Continuous Data. Data that can take on any value within a range. (Measuring-
“Amount of…”)
For example, height, weight, temperature, and time.
The nature of data dictates the methodology:
If the data is numerical, the methodology is quantitative.
If the data is verbal, the methodology is qualitative.
In recent years, with the advancement of technology and the growth of the internet,
the amount of data being generated and collected has exploded, leading to the
concept of "big data." This refers to extremely large and complex datasets that
require specialized tools and techniques to manage, process, and analyze
effectively.
Observations: Observing and recording behaviors or events.
Experiments: Controlled tests to gather data on the effects of specific
variables.
Databases: Existing collections of structured data (e.g., customer records,
financial transactions).
Sensors: Devices that capture real-time data (e.g., temperature sensors,
GPS trackers).
Social Media: Online platforms that generate vast amounts of user-
generated content.
Public Records: Government or publicly available data (e.g., weather data,
census data).
E. Data Cleaning
Once collected, data often needs to be cleaned and prepared for analysis. This
may involve:
Removing duplicates: Identifying and removing duplicate entries.
Handling missing values: Imputing or removing missing data points.
Formatting data: Ensuring consistency in data formats and units.
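As an illustration, here is a minimal sketch of these three cleaning steps using the pandas library in Python; the column names and values are hypothetical and are not part of the lesson's data.

```python
# A minimal sketch of the cleaning steps above using pandas.
# The columns ("city", "temp_c", "date") are hypothetical examples.
import pandas as pd

raw = pd.DataFrame({
    "city":   ["Manila", "Manila", "Cebu", "Davao", None],
    "temp_c": [30.1, 30.1, None, 31.4, 29.8],
    "date":   ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
})

clean = (
    raw.drop_duplicates()                    # removing duplicates
       .dropna(subset=["city"])              # handling missing values: drop rows missing a key field
       .assign(temp_c=lambda d: d["temp_c"].fillna(d["temp_c"].mean()))  # or impute with the mean
       .assign(date=lambda d: pd.to_datetime(d["date"]))                 # formatting data consistently
)
print(clean)
```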
The concept of data collection isn’t a new one but the world has changed. There
is far more data available today, and it exists in forms that were unheard of a
century ago. The data collection process has had to change and grow with the
times, keeping pace with technology.
Whether you’re in the world of academia, trying to conduct research, or part of the
commercial sector, thinking of how to promote a new product, you need data
collection to help you make better choices.
B. The Importance of Ensuring Accurate and Appropriate Data Collection
Accurate data collection is crucial to preserving the integrity of research,
regardless of the field of study or the preferred way of defining data
(quantitative or qualitative). Errors are less likely to occur when the right
data-gathering tools are used, whether they are brand new, updated versions of
existing tools, or tools that are already available.
The consequences of incorrectly performed data collection include the following:
Erroneous conclusions that squander resources
Decisions that compromise public policy
Incapacity to correctly respond to research inquiries
Bringing harm to participants who are humans or animals
Deceiving other researchers into pursuing futile research avenues
The study's inability to be replicated and validated
Although the degree of impact from flawed data collection may vary by discipline
and by the type of investigation, the potential for disproportionate harm is
greatest when study findings are used to support recommendations for public
policy.
d. Experiments: Experimental studies involve the manipulation of variables to
observe their impact on the outcome. Researchers control the conditions and
collect data to draw conclusions about cause-and-effect relationships.
e. Focus Groups: Focus groups bring together a small group of individuals who
discuss specific topics in a moderated setting. This method helps in understanding
opinions, perceptions, and experiences shared by the participants.
Before an analyst begins collecting data, they must answer three questions first:
What’s the goal or purpose of this research?
What kinds of data are they planning on gathering?
What methods and procedures will be used to collect, store, and process the
information?
Additionally, we can break up data into qualitative and quantitative types.
Qualitative data covers descriptions such as color, size, quality, and appearance.
Quantitative data, unsurprisingly, deals with numbers, such as statistics, poll
numbers, percentages, etc.
A. Finding Relevant Data
Finding relevant data is not easy. There are several factors that we need to
consider when trying to find relevant data, including:
Relevant domain
Relevant demographics
Relevant time period
and many other factors.
Data that is not relevant to our study on any of these factors is effectively
unusable, and we cannot proceed with its analysis. This could lead to incomplete
research or analysis, repeated re-collection of data, or shutting down the
study.
1.1.5 Issues Related to Maintaining the Integrity of Data Collection
The main reason for maintaining data integrity is to support the detection of
errors in the data-gathering process, whether they are made deliberately
(intentional falsifications) or not (systematic or random errors).
Quality assurance and quality control are two strategies that help protect data
integrity and guarantee the scientific validity of study results.
Each strategy is used at various stages of the research timeline:
Quality assurance - activities that take place before data gathering begins
Quality control - activities that take place during and after data collection
A. Quality Assurance
Because quality assurance takes place before data collection begins, its primary
goal is "prevention" (i.e., forestalling problems with data collection). Prevention
is the best way to protect the accuracy of data collection. The clearest example
of this proactive step is the uniformity of protocol established in a thorough and
exhaustive procedures manual for data collection.
The likelihood of failing to spot issues and mistakes early in the research effort
increases when such manuals are written poorly. These shortcomings can show up in
several ways:
Failure to determine the precise subjects and methods for retraining or training
staff employees in data collecting
A partial list of the items to be collected
There isn't a system in place to track modifications to processes that may occur
as the investigation continues.
Instead of detailed, step-by-step instructions on how to deliver tests, there is a
vague description of the data gathering tools that will be employed.
Uncertainty regarding the date, procedure, and identity of the person or people
in charge of examining the data
Incomprehensible guidelines for using, adjusting, and calibrating the data
collection equipment.
B. Quality Control
Despite the fact that quality control actions (detection/monitoring and intervention)
take place both after and during data collection, the specifics should be
meticulously detailed in the procedures manual. Establishing monitoring systems
requires a specific communication structure, which is a prerequisite. Following the
discovery of data collection problems, there should be no ambiguity regarding the
information flow between the primary investigators and staff personnel. A poorly
11
designed communication system promotes slack oversight and reduces
opportunities for error detection.
Detection or monitoring can take the form of direct staff observation during site
visits, conference calls, or frequent and routine reviews of data reports to spot
discrepancies, extreme values, or invalid codes. Site visits might not be
appropriate for all disciplines. Still, without routine auditing of records, whether
qualitative or quantitative, it will be challenging for investigators to confirm that data
gathering is taking place in accordance with the manual's defined methods.
Additionally, quality control determines the appropriate solutions, or "actions," to
fix flawed data gathering procedures and reduce recurrences.
Problems with data collection, for instance, that call for immediate action include:
Fraud or misbehavior
Systematic mistakes, procedure violations
Individual data items with errors
Issues with certain staff members or a site's performance
In the social and behavioral sciences, where primary data collection involves
human subjects, researchers are trained to include one or more secondary measures
that can be used to verify the quality of the information obtained from the human
subject.
A. Types of Data Analysis
Data analysis involves the inspection, cleaning, transformation, and modeling of
data to extract meaningful insights. There are various types of data analysis,
each serving a distinct purpose and offering different perspectives on data.
1. Descriptive Analysis
Descriptive analysis is the most basic type of analysis, focusing on summarizing
and describing the data. It answers the question "What happened?" by providing
a clear and concise overview of the data.
Goal: To summarize and describe the characteristics of a dataset.
Methods: Uses measures of central tendency (mean, median, mode),
measures of dispersion (range, standard deviation), and visualizations like
histograms, bar charts, and pie charts.
Example: Analyzing customer demographics to understand the average
age, gender, and location of your customers.
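As an illustration, the descriptive measures listed above can be computed with a few lines of Python. The sketch below reuses the temperature values from the introduction (25, 30, 28, 32, 29).

```python
# A minimal sketch of descriptive analysis on the temperatures from the introduction.
import statistics

temps = [25, 30, 28, 32, 29]

print("mean:  ", statistics.mean(temps))    # central tendency
print("median:", statistics.median(temps))
print("mode:  ", statistics.mode(temps))    # with all-unique values this returns the first value
print("range: ", max(temps) - min(temps))   # dispersion
print("stdev: ", statistics.stdev(temps))   # sample standard deviation
```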
2. Diagnostic Analysis
Diagnostic analytics delves deeper than descriptive analytics, seeking to
understand the "Why" behind the observed trends. It examines the underlying
causes and factors that contribute to the patterns revealed in descriptive analysis.
Goal: To investigate the "why" behind observed patterns or trends in data.
Methods: Often follows descriptive analysis, examining related data
sources, historical data, and potential causal factors.
Example: Investigating why sales of a specific product decreased in a
particular month by analyzing factors like competitor activity, seasonal
trends, and marketing campaigns.
4. Inferential Analysis
Inferential analysis uses statistical methods to draw conclusions and make
generalizations about a larger population based on a smaller sample of data. It
aims to answer the question "What can we infer about the population from this
sample?"
Goal: To draw conclusions about a larger population based on a sample of
data.
Methods: Uses statistical techniques like hypothesis testing and
confidence intervals to make inferences about the population.
Example: Conducting a survey of 100 customers to infer the satisfaction
level of all customers.
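As an illustration, here is a minimal Python sketch of an inferential step: a 95% confidence interval for the mean satisfaction of all customers, computed from a hypothetical sample of ratings (the numbers are invented, and a normal approximation is assumed).

```python
# A minimal sketch of inferential analysis: a 95% confidence interval for the
# population mean, based on a hypothetical sample of 1-5 satisfaction ratings.
import statistics
from statistics import NormalDist

sample = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5, 4, 4, 3, 5, 4, 4, 5, 3, 4, 4]
n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)            # sample standard deviation
z = NormalDist().inv_cdf(0.975)          # ~1.96 for a 95% interval (normal approximation)
margin = z * sd / n ** 0.5

print(f"sample mean = {mean:.2f}")
print(f"95% CI for the population mean: ({mean - margin:.2f}, {mean + margin:.2f})")
```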
5. Predictive Analysis
Predictive analysis uses historical data and statistical modeling to forecast future
trends and outcomes. It answers the question "What might happen in the future?"
by leveraging patterns and relationships identified in previous analyses.
Goal: To forecast future trends and outcomes based on historical data.
Methods: Uses machine learning algorithms, regression models, and time
series analysis to make predictions.
Example: Predicting the demand for a product in the next quarter based on
historical sales data.
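As an illustration, here is a minimal Python sketch of a predictive step: fitting a linear trend to hypothetical quarterly sales and extrapolating one quarter ahead. The figures are invented, and real predictive models are usually more elaborate.

```python
# A minimal sketch of predictive analysis: a linear trend fitted to hypothetical
# quarterly sales, extrapolated one quarter ahead.
from statistics import linear_regression   # Python 3.10+

quarters = [1, 2, 3, 4, 5, 6, 7, 8]
sales    = [120, 135, 150, 160, 172, 185, 198, 210]   # units sold per quarter (hypothetical)

slope, intercept = linear_regression(quarters, sales)
next_quarter = 9
forecast = slope * next_quarter + intercept

print(f"trend: sales = {slope:.1f} * quarter + {intercept:.1f} (approx.)")
print(f"forecast for quarter {next_quarter}: about {forecast:.0f} units")
```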
6. Causal Analysis
Causal analytics aims to establish cause-and-effect relationships between
variables. It goes beyond correlation, seeking to understand how changes in one
variable directly impact another.
Goal: To determine the cause-and-effect relationships between variables.
Methods: Often involves controlled experiments, randomized controlled
trials, and statistical modeling to isolate causal relationships.
Example: Conducting an A/B test to determine the impact of a new
marketing campaign on website traffic.
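As an illustration, here is a minimal Python sketch of an A/B test: comparing the conversion rates of a control page and a page with the new campaign using a two-proportion z-test. All counts are hypothetical.

```python
# A minimal sketch of an A/B test with a two-proportion z-test. Counts are invented.
from statistics import NormalDist

visitors_a, conversions_a = 1000, 120     # control group (A)
visitors_b, conversions_b = 1000, 160     # group exposed to the new campaign (B)

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = (p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b)) ** 0.5

z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided test

print(f"conversion A = {p_a:.1%}, conversion B = {p_b:.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("difference is statistically significant" if p_value < 0.05 else "no significant difference")
```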
7. Mechanistic Analysis
Mechanistic analysis is a type of data analysis that focuses on understanding the
underlying mechanisms and processes that drive observed relationships between
variables.
Goal: To understand the underlying mechanisms and processes that drive
observed relationships.
Methods: Often used in physical or engineering sciences, requiring high
precision and meticulous methodologies.
Example: Analyzing data from a nuclear fusion experiment to understand
the processes involved in energy generation.
8. Prescriptive Analysis
Prescriptive analysis goes beyond predictions, offering recommendations and
actionable insights based on the analysis. It answers the question "What should
we do next?" by suggesting optimal courses of action to achieve specific goals.
Goal: To recommend actions or strategies based on insights from previous
data analyses.
Methods: Often involves optimization algorithms, machine learning, and
simulation modeling to determine the best course of action.
Example: Recommending pricing strategies for a product based on
predicted demand and competitor pricing.
B. Number of Variables
In data analysis, univariate, bivariate, and multivariate refer to the number of
variables being analyzed simultaneously.
1. Univariate Analysis
Focus: Examines a single variable at a time.
Purpose: To describe and summarize the characteristics of that variable.
This involves measures like mean, median, mode, standard deviation, and
visualizations like histograms or bar charts.
Example: Analyzing the average age of customers in a store.
2. Bivariate Analysis
Focus: Examines the relationship between two variables.
Purpose: To determine if there's a correlation or association between the
two variables. This involves scatter plots, correlation coefficients, and
regression analysis.
Example: Investigating whether there's a relationship between the numbers
of hours spent studying and exam scores.
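As an illustration, here is a minimal Python sketch of a bivariate step: the Pearson correlation between hours spent studying and exam scores, using invented paired values.

```python
# A minimal sketch of bivariate analysis: correlation between two variables.
from statistics import correlation   # Pearson correlation, Python 3.10+

hours  = [1, 2, 3, 4, 5, 6, 7, 8]          # hours spent studying (hypothetical)
scores = [52, 55, 61, 64, 70, 74, 79, 85]  # exam scores (hypothetical)

r = correlation(hours, scores)
print(f"Pearson correlation r = {r:.2f}")  # values near +1 indicate a strong positive association
```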
3. Multivariate Analysis
Focus: Examines the relationships between three or more variables
simultaneously.
Purpose: To understand complex interactions between variables, identify
patterns, and make predictions. This involves techniques like multiple
regression, factor analysis, principal component analysis, and cluster
analysis.
Example: Analyzing the impact of age, income, and education level on a
person's likelihood of buying a specific product.
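As an illustration, here is a minimal Python sketch of a multivariate step: an ordinary least squares regression of a purchase score on age, income, and education level. The tiny dataset is invented purely to show the mechanics.

```python
# A minimal sketch of multivariate analysis: multiple linear regression by
# ordinary least squares on a small invented dataset.
import numpy as np

# predictors: age (years), income (thousands), education (years of schooling)
X = np.array([
    [25, 30, 12],
    [32, 45, 14],
    [40, 60, 16],
    [28, 38, 12],
    [52, 80, 18],
    [45, 70, 16],
])
y = np.array([0.20, 0.35, 0.55, 0.30, 0.80, 0.65])   # likelihood of buying (0-1)

X_design = np.column_stack([np.ones(len(X)), X])     # add an intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept, age, income, education coefficients:")
print(np.round(coef, 3))
```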
B. Research Process
A. Conducting a Survey
There are various methods for administering a survey. It can be done as a
face-to-face interview or a phone interview where the researcher is questioning the
subject. A different option is to have a self-administered survey where the subject
can complete a survey on paper and mail it back, or complete the survey online.
The advantages of self-administered surveys are that they are less expensive than
interviews, do not require a large staff of experienced interviewers and can be
administered in large numbers. In addition, anonymity and privacy encourage more
candid and honest responses, and there is less pressure on respondents. The
disadvantages of self-administered surveys are that respondents are more likely to
stop participating midway through the survey and that the researcher cannot ask
them to clarify their answers. In addition, there are lower response rates than in personal
interviews, and often the respondents who bother to return surveys represent
extremes of the population – those people who care about the issue strongly,
whichever way their opinion leans.
B. Designing a Survey
Surveys can take different forms. They can be used to ask only one question or they
can ask a series of questions. We can use surveys to test out people’s opinions or
to test a hypothesis.
When designing a survey, the following steps are useful:
1. Determine the goal of your survey: What question do you want to answer?
2. Identify the sample population: Whom will you interview?
3. Choose an interviewing method: face-to-face interview, phone interview, self-
administered paper survey, or internet survey.
4. Decide what questions you will ask in what order, and how to phrase them. (This
is important if there is more than one piece of information you are looking for.)
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.
C. Constructing a Survey
Example C.1. Martha wants to construct a survey that shows which sports students
at her school like to play the most.
a) List the goal of the survey.
The goal of the survey is to find the answer to the question: “Which sports do
students at Martha’s school like to play the most?”
b) What population sample should she interview?
A sample of the population would include a random sample of the student
population in Martha’s school. A good strategy would be to randomly select students
(using dice or a random number generator) as they walk into an all-school
assembly.
c) How should she administer the survey?
Face-to-face interviews are a good choice in this case. Interviews will be easy to
conduct since the survey consists of only one question which can be quickly
answered and recorded, and asking the question face to face will help eliminate
non-response bias.
d) Create a data collection sheet that she can use to record her results.
In order to collect the data for this simple survey, Martha can design a data collection
sheet such as the one below:
Example C.2. Raoul wants to construct a survey that shows how many hours per
week the average student at his school works.
a) List the goal of the survey.
The goal of the survey is to find the answer to the question “How many hours per
week do you work?”
b) What population sample will he interview?
Raoul suspects that older students might work more hours per week than younger
students. He decides that a stratified sample of the student population would be
appropriate in this case. The strata are grade levels 9th through 12th. He would
need to find out what proportion of the students in his school are in each grade
level, and then include the same proportions in his sample.
c) How would he administer the survey?
Face-to-face interviews are a good choice in this case since the survey consists of
two short questions which can be quickly answered and recorded.
d) Create a data collection sheet that Raoul can use to record his results.
In order to collect the data for this survey Raoul designed the data collection sheet
shown below:
This data collection sheet allows Raoul to write down the actual numbers of hours
worked per week by students as opposed to just collecting tally marks for several
categories.
In Example C.1., Martha interviewed 112 students and obtained the following
results.
Figure 6. Percentage calculation from Martha’s data collection results
Now we can make a graph where the height of each bar represents the percentage
of students in each category:
Figure 7. Bar graph showing the percentage of students playing each sport
To make a pie chart, we find the central angle for each category by multiplying the
percentage of each category by 360 degrees (the total number of degrees in a circle).
To draw a pie chart by hand, you can use a protractor to measure the central angles
that you find for each category.
Figure 8. Table showing the sports, percentage and central angle calculations
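As an illustration, here is a minimal Python sketch of the central-angle calculation. The sport percentages are placeholders, since Figure 8 is not reproduced here; each central angle is simply the category's percentage multiplied by 360 degrees.

```python
# A minimal sketch of converting category percentages into pie-chart central angles.
# The percentages are hypothetical placeholders, not Martha's actual results.
percentages = {"Basketball": 40, "Volleyball": 25, "Soccer": 20, "Badminton": 15}

for sport, pct in percentages.items():
    angle = pct / 100 * 360                     # share of the 360-degree circle
    print(f"{sport}: {pct}% -> central angle {angle:.0f} degrees")
```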
For the second survey, actual numerical data can be collected from each student.
In this case we can display the data using a stem-and-leaf plot, a frequency table
(a table that summarizes a data set by stating the number of times each value
occurs within the data set), a histogram, or a box-and-whisker plot.
In Example C.2., Raoul found that 30% of the students at his school are
in 9th grade, 26% of the students are in the 10th grade, 24% of the students are
in 11th grade and 20% of the students are in the 12th grade. He surveyed a total
of 60 students using these proportions as a guide for the number of students he
interviewed from each grade. Raoul recorded the following data:
We can easily see from the stem-and-leaf plot that the mode of the data is 0. This
makes sense because many students do not work in high school.
III. Draw a histogram of the data.
The histogram associated with this frequency table is shown below.
IV. Find the five number summary of the data and draw a box-and-whisker plot.
The five number summary is as follows:
smallest number = 0
largest number = 22
Since there are 60 data points, the median is the mean of the 30th and the 31st values:
median = 6.5
Since each half of the list has 30 values in it, then the first and third quartiles are
the medians of each of the smaller lists. The first quartile is the mean of
the 15th and 16th values:
first quartile = 0
The third quartile is the mean of the 45th and 46th values:
third quartile = 12
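As an illustration, here is a minimal Python sketch of the five-number summary using the same rule applied above (quartiles as medians of the lower and upper halves). Raoul's 60 actual values are not reproduced here, so a short hypothetical list is used instead.

```python
# A minimal sketch of the five-number summary, with quartiles computed as the
# medians of the lower and upper halves, as described in the lesson.
from statistics import median

def five_number_summary(values):
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower, upper = data[:half], data[n - half:]   # exclude the middle value when n is odd
    return min(data), median(lower), median(data), median(upper), max(data)

hours = [0, 0, 0, 2, 4, 5, 6, 7, 8, 10, 12, 15, 18, 20, 22]   # hypothetical hours worked per week
lo, q1, med, q3, hi = five_number_summary(hours)
print(f"min={lo}, Q1={q1}, median={med}, Q3={q3}, max={hi}")
```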
Decide what phenomenon you wish to investigate. Specify how you can
manipulate the factor and hold all other conditions fixed, to ensure that these
extraneous conditions aren't influencing the response you plan to measure.
Then measure your chosen response variable at several (at least two) settings of
the factor under study. If changing the factor causes the phenomenon to change,
then you conclude that there is indeed a cause-and-effect relationship at work.
How many factors are involved when you do an experiment? Some say two -
perhaps this is a comparative experiment? Perhaps there is a treatment group and
a control group? If you have a treatment group and a control group then, in this
case, you probably only have one factor with two levels.
How many of you have baked a cake? What are the factors involved to ensure a
successful cake? Factors might include preheating the oven, baking time,
ingredients, amount of moisture, baking temperature, etc. -- what else? You
probably follow a recipe so there are many additional factors that control the
ingredients - i.e., a mixture. In other words, someone did the experiment in
advance! What parts of the recipe did they vary to make the recipe a success?
Probably many factors, temperature and moisture, various ratios of ingredients,
and presence or absence of many additives. Now, should one keep all the factors
involved in the experiment at a constant level and just vary one to see what would
happen? This is a strategy that works but is not very efficient. This is one of the
concepts that we will address in this course.
theory with the resources at hand. From an engineering perspective we're trying
to use experimentation for the following purposes:
reduce time to design/develop new products & processes
improve performance of existing processes
improve reliability and performance of products
achieve product & process robustness
perform evaluation of materials, design alternatives, setting component &
system tolerances, etc.
We always want to fine-tune or improve the process. In today's global world this
drive for competitiveness affects all of us both as consumers and producers.
Robustness is a concept that enters into statistics at several points. At the analysis
stage, robustness refers to a technique that isn't overly influenced by bad data.
Even if there is an outlier or bad data you still want to get the right answer.
Regardless of who or what is involved in the process - it is still going to work.
Every experiment design has inputs. Back to the cake baking example: we have
our ingredients such as flour, sugar, milk, eggs, etc. Regardless of the quality of
these ingredients we still want our cake to come out successfully. In every
experiment there are inputs and in addition, there are factors (such as time of
baking, temperature, geometry of the cake pan, etc.), some of which you can
control and others that you can't control. The experimenter must think about factors
that affect the outcome. We also talk about the output and the yield or the response
to your experiment. For the cake, the output might be measured as texture, flavor,
height, or size.
B. Four Eras in the History of DOE
Here's a quick timeline:
The agricultural origins, 1918 – 1940s
o R. A. Fisher & his co-workers
o Profound impact on agricultural science
o Factorial designs, ANOVA
The first industrial era, 1951 – late 1970s
o Box & Wilson, response surfaces
o Applications in the chemical & process industries
The second industrial era, late 1970s – 1990
o Quality improvement initiatives in many companies
o CQI and TQM were important ideas and became management goals
o Taguchi and robust parameter design, process robustness
The modern era, beginning circa 1990, when economic competitiveness
and globalization are driving all sectors of the economy to be more
competitive.
Replication provides an estimate of the variance of the sample mean, i.e., s²/n,
where s² is the estimated error variance and n is the number of replicates. The
width of the confidence interval is determined by this statistic. Our estimates of
the mean become less variable as the sample size increases.
Replication is the basic issue behind every method we will use in order to get a
handle on how precise our estimates are at the end. We always want to estimate
or control the uncertainty in our results. We achieve this estimate through
replication. Another way we can achieve short confidence intervals is by reducing
the error variance itself. However, when that isn't possible, we can reduce the error
in our estimate of the mean by increasing n.
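As an illustration, the following Python sketch shows how the margin of error of a 95% confidence interval shrinks as the number of replicates n grows, with the error standard deviation held fixed at an arbitrary value.

```python
# A minimal sketch of how replication narrows the confidence interval: with the
# error variance held fixed, the 95% CI half-width shrinks in proportion to 1/sqrt(n).
from statistics import NormalDist

sigma = 4.0                              # assumed error standard deviation (arbitrary)
z = NormalDist().inv_cdf(0.975)          # ~1.96

for n in [2, 4, 8, 16, 32]:
    half_width = z * sigma / n ** 0.5    # margin of error for the sample mean
    print(f"n = {n:2d} replicates -> CI half-width = {half_width:.2f}")
```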
Another way to reduce the size, or the length, of the confidence interval is to
reduce the error variance itself - which brings us to blocking.
C. Blocking
Blocking is a technique to include other factors in our experiment which contribute
to undesirable variation. Much of the focus in this class will be to creatively use
various blocking techniques to control sources of variation that will reduce error
variance. For example, in human studies, the gender of the subjects is often an
important factor. Age is another factor affecting the response. Age and gender
are often considered nuisance factors which contribute to variability and make it
difficult to assess systematic effects of a treatment. By using these as blocking
factors, you can avoid biases that might occur due to differences between the
allocations of subjects to the treatments, and as a way of accounting for some
noise in the experiment. We want the unknown error variance at the end of the
experiment to be as small as possible. Our goal is usually to find out something
about a treatment factor (or a factor of primary interest), but in addition to this, we
want to include any blocking factors that will explain variation.
D. Multi-factor Designs
The point of all of these multi-factor designs runs contrary to the traditional
scientific method, in which everything is held constant except one factor, which is
varied. The one-factor-at-a-time method is a very inefficient way of making scientific advances. It is much
better to design an experiment that simultaneously includes combinations of
multiple factors that may affect the outcome. Then you learn not only about the
primary factors of interest but also about these other factors. These may be
blocking factors which deal with nuisance parameters or they may just help you
understand the interactions or the relationships between the factors that influence
the response.
E. Confounding
Confounding is something that is usually considered bad! Here is an example. Let's
say we are doing a medical study with drugs A and B. We put 10 subjects on drug
A and 10 on drug B. If we categorize our subjects by gender, how should we
allocate our drugs to our subjects? Let's make it easy and say that there are 10
male and 10 female subjects. A balanced way of doing this study would be to put
five males on drug A and five males on drug B, five females on drug A and five
females on drug B. This is a perfectly balanced experiment such that if there is a
difference between male and female at least it will equally influence the results
from drug A and the results from drug B.
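As an illustration, here is a minimal Python sketch of that balanced allocation: within each gender, subjects are randomly split so that five receive drug A and five receive drug B, which keeps gender from being confounded with the drug effect.

```python
# A minimal sketch of a balanced allocation: within each gender block, half of the
# subjects are randomly assigned to drug A and half to drug B.
import random

random.seed(1)
allocation = {}
for gender in ["male", "female"]:
    subjects = [f"{gender}_{i}" for i in range(1, 11)]   # 10 subjects per gender
    random.shuffle(subjects)
    allocation.update({s: "A" for s in subjects[:5]})
    allocation.update({s: "B" for s in subjects[5:]})

for subject, drug in sorted(allocation.items()):
    print(subject, "->", drug)
```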
A. Factors
We usually talk about "treatment" factors, which are the factors of primary interest
to you. In addition to treatment factors, there are nuisance factors which are not
your primary focus, but you have to deal with them. Sometimes these are called
blocking factors, mainly because we will try to block on these factors to prevent
them from influencing the results.
There are other ways that we can categorize factors:
Experimental vs. Classification Factors
Experimental Factors
These are factors that you can specify (and set the levels) and then assign
at random as the treatment to the experimental units. Examples would be
temperature, level of an additive, fertilizer amount per acre, etc.
Classification Factors
These can't be changed or assigned, these come as labels on the
experimental units. The age and sex of the participants are classification
factors which can't be changed or randomly assigned. But you can select
individuals from these groups randomly.
References:
Aquino, G. V. (1971). Essentials of Research and Thesis Writing. 1st ed.
Phoenix Publishing House, Inc.
Montgomery, D. C. (2019). Design and Analysis of Experiments, 10th Edition, John
Wiley & Sons. ISBN 978-1-119-59340-9 Accessed through this link:
https://online.stat.psu.edu/stat503/book/export/html/632
Neo, B., Urwin, M. (2024). 8 Types of Data Analysis. Accessed through this link:
https://builtin.com/data-science/types-of-data-
analysis?need_sec_link=1&sec_link_scene=im