ABE057, compiled by Engr. Macalimpas

LESSON 1

BASIC CONCEPTS OF DATA ANALYSIS

Introduction
In any research undertaking there must be data that should be collected,
organized, analyzed, and interpreted. Without the necessary data no research
activity can ever hope to succeed. It goes without saying, therefore, that the
collection of data is an essential phase of the research process.
In learning the concepts of data analysis, students will be taught the value of accuracy, patience, critical thinking, timeliness, efficiency, effectiveness, and cooperation.

Learning Content

1.1. Methods of Data Collection

1.1.1 What is Data?


Data is essentially raw, unprocessed information that, on its own, doesn't hold
much meaning. Think of it like a collection of facts, figures, symbols, or
observations. It's like a jumbled puzzle, where each piece is a piece of data.

Here's a simple analogy: Imagine you have a list of numbers: 25, 30, 28, 32, 29.
This is data. It's just a set of numbers without context.

To make this data meaningful, we need to give it context. For example, if we know
these numbers represent the daily temperatures in a city, they become information.
We can now understand that the city experienced a range of temperatures from
25 to 32 degrees.
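
To see this distinction concretely, here is a minimal Python sketch; the five temperatures are the ones from the example above, and the summary step is what turns raw data into usable information:

    # The same five numbers, first as raw data, then summarized in context.
    temps_celsius = [25, 30, 28, 32, 29]  # daily temperatures in a city

    low, high = min(temps_celsius), max(temps_celsius)
    average = sum(temps_celsius) / len(temps_celsius)

    print(f"Range: {low}-{high} degrees Celsius, average {average:.1f}")
    # Output: Range: 25-32 degrees Celsius, average 28.8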

A. Key Characteristics of Data


• Raw and Unstructured: Data is not organized or processed in a way that's easily understandable.
• Objective: Data aims to represent facts and observations without personal bias.
• Valueless without Context: Data becomes meaningful only when it's interpreted and analyzed within a specific context.

B. Examples of Data
• Numbers: 10, 5.2, 1000
• Text: "Hello World", "The sky is blue."
• Images: Photographs, drawings, diagrams
• Audio: Recordings of speech, music
• Video: Movies, documentaries

Data are facts or sets of information gathered or under study. According to Good,
data is “an accepted number, quantity, facts, or relation used as a basis for drawing
conclusions, making inferences, or carrying out investigations.”

Data refers to raw facts, statistics, observations, or information collected or generated from various sources. It can be in the form of numbers, text, images, audio, video, or any other format that represents a piece of information. Data on its own doesn't hold much meaning; it becomes valuable when it's processed, analyzed, and interpreted to reveal patterns, trends, insights, and conclusions.

In short, data is the building block of information. It's the raw material that, when
processed and analyzed, transforms into knowledge and insights that can be used
to make informed decisions.

C. Two Main Types of Data


Data can be categorized into two main types:
1. Structured Data. This is data organized in a predefined format, typically rows and columns like a spreadsheet or database. It's easily searched, analyzed, and
processed by computers. Examples include tables of sales records, inventory lists,
financial statements or records (transactions, balances), customer information in a
database (name, address, purchase history), and sensor readings (temperature,
humidity).
2. Unstructured Data. This type of data doesn't follow a predefined format and is
more complex. It's often text-heavy, multimedia, or doesn't fit neatly into rows and
columns. Extracting meaningful information from unstructured data typically
requires advanced techniques like natural language processing (NLP) and image
recognition. Examples are text documents (emails, articles, social media posts),
images (photos, videos), audio recordings (music, podcasts), and social media
data (tweets, Facebook posts).

Think of it like this:
• Structured Data: A neatly organized library with books shelved by subject and author.
• Unstructured Data: A chaotic attic filled with boxes, furniture, and random items.

While structured data is easier to work with, unstructured data is increasingly important in today's world, as it holds valuable insights that can be unlocked through advanced analytics techniques.

D. Two General Kinds of Data


1. Quantitative Data. This type of data deals with numerical values and measurements. It can be counted, measured, and expressed in numbers. These are the results of counting or measurement, and may form a discrete or continuous set of numbers:

a. Discrete Data. Data that can only take on specific, separate values, often whole
numbers. (Counting - “No. of…”)
For example, the number of apples in a basket, the number of students in a class,
number of units enrolled, monthly salary, and grade.
b. Continuous Data. Data that can take on any value within a range. (Measuring-
“Amount of…”)
For example, height, weight, temperature, and time.

2. Qualitative Data. This type of data describes qualities, characteristics, or attributes. It is often expressed verbally in words, descriptions, and observations: attributes or characteristics such as gender, educational attainment, feelings, or opinions. These are facts for which no numerical measure exists. They are usually expressed as categories or ranks.

a. Categorical Data. Data that can be sorted into categories or groups. For example, colors, types of fruits, sex or gender, marital status, color of the skin, religion, and opinions.
b. Ordinal Data. Data that can be ordered or ranked, but the difference between
values is not necessarily equal. For example, customer satisfaction ratings (very
satisfied, satisfied, neutral, dissatisfied, very dissatisfied), IQ, and educational
qualification.

The nature of data dictates the methodology:
• If the data is numerical, the methodology is quantitative.
• If the data is verbal, the methodology is qualitative.

1.1.2 What is Data Collection?


The process of turning raw data into meaningful insights involves several steps,
including data collection, cleaning (removing errors and inconsistencies),
transformation (converting data into a usable format), analysis (identifying patterns
and trends), and interpretation (drawing conclusions and making decisions based
on the analysis). Collection of data is the first step in conducting a statistical inquiry. It simply refers to data gathering, a systematic method of collecting
and measuring data from different sources of information in order to provide
answers to relevant questions.

In recent years, with the advancement of technology and the growth of the internet,
the amount of data being generated and collected has exploded, leading to the
concept of "big data." This refers to extremely large and complex datasets that
require specialized tools and techniques to manage, process, and analyze
effectively.

Data collection is the process of gathering raw, unprocessed information from various sources to be used for analysis, research, or decision-making. It's like assembling the pieces of a puzzle, gathering all the necessary elements to form a complete picture. During data collection, the researchers must identify the data types, the sources of data, and what methods are being used.

A. Purpose of Data Collection


Why are you collecting data? What questions do you want to answer? What
decisions do you need to make?
Example: A farmer might collect data on soil moisture to determine when to irrigate
their crops.

B. Sources of Data Collection


Where will you get the data from? This can include:
• Surveys: Questionnaires to gather opinions, preferences, or information from individuals.
• Interviews: Direct conversations to gather detailed information from individuals.
• Observations: Observing and recording behaviors or events.
• Experiments: Controlled tests to gather data on the effects of specific variables.
• Databases: Existing collections of structured data (e.g., customer records, financial transactions).
• Sensors: Devices that capture real-time data (e.g., temperature sensors, GPS trackers).
• Social Media: Online platforms that generate vast amounts of user-generated content.
• Public Records: Government or publicly available data (e.g., weather data, census data).

C. Methods of Data Collection


Data collection methods are techniques and procedures for gathering information
for research purposes. How will you collect the data? This could involve:
• Online forms: Using websites or applications to collect data from respondents.
• Paper forms: Using printed questionnaires to collect data manually.
• Direct observation: Observing and recording data using checklists or field notes.
• Data scraping: Extracting data from websites or online platforms using automated tools.
• API calls: Using programmatic interfaces to access and retrieve data from databases or other systems. The Application Programming Interface (API) acts as a bridge between two systems, allowing them to communicate and exchange data; a short sketch of such a call follows this list.
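
As a hedged illustration of the API-call method, the sketch below uses Python's requests library. The endpoint URL and its parameters are hypothetical placeholders, not a real service:

    import requests

    # Hypothetical endpoint and parameters, shown only to illustrate the pattern.
    response = requests.get(
        "https://api.example.com/v1/weather",
        params={"city": "Manila", "units": "metric"},
        timeout=10,
    )
    response.raise_for_status()   # raise an error if the server reports a failure
    records = response.json()     # parse the JSON payload into Python objects
    print(records)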

D. Data Storage of Collected Data


Where will you store the collected data? This could be:
• Spreadsheets: Simple tables for organizing and storing data.
• Databases: Structured systems for storing and managing large amounts of data.
• Cloud storage: Online platforms for storing and accessing data remotely.

E. Data Cleaning
Once collected, data often needs to be cleaned and prepared for analysis. This may involve the following steps (a short pandas sketch follows the list):
• Removing duplicates: Identifying and removing duplicate entries.
• Handling missing values: Imputing or removing missing data points.
• Formatting data: Ensuring consistency in data formats and units.
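
A minimal pandas sketch of these three steps, using a small hypothetical table of sensor readings:

    import pandas as pd

    df = pd.DataFrame({
        "sensor": ["A", "A", "B", "C"],
        "temp_c": [25.0, 25.0, None, 31.5],   # one duplicate row, one missing value
    })

    df = df.drop_duplicates()                                 # removing duplicates
    df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())   # handling missing values
    df["temp_f"] = df["temp_c"] * 9 / 5 + 32                  # formatting: consistent units
    print(df)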

In short, data collection is the foundation of data analysis and decision-making. It's the process of gathering the raw materials that, when processed and analyzed, can provide valuable insights and support informed decisions.

Data collection is the process of gathering, measuring, and analyzing accurate data from a variety of relevant sources to find answers to research problems,
answer questions, evaluate outcomes, and forecast trends and probabilities. The
approach of data collection is different for different fields of study, depending on
the required information. The most critical objective of data collection is ensuring
that information-rich and reliable data is collected for statistical analysis so that
data-driven decisions can be made for research. Accurate data collection is
necessary to make informed business decisions, ensure quality assurance, and
keep research integrity.

Data collection is an important aspect of any type of research study. Inaccurate data collection can affect the results of a study and lead to invalid findings. Because data collection is a main stage of research, careful collection improves the quality of the results by decreasing the errors which may occur during a research project. Therefore, alongside a good design for the study, plenty of quality time should be spent on the collection of data to obtain appropriate results, since insufficient and inaccurate data prevents assuring the accuracy of findings (Kabir, 2016). On the other hand, although a suitable data collection method helps to plan good research, it cannot necessarily guarantee the overall success of the research project (Olsen, 2012).

1.1.3 Why Do We Need Data Collection?


Before a judge makes a ruling in a court case or a general creates a plan of attack, they must have as many relevant facts as possible. The best courses of action come from informed decisions, and informed decisions are built on information derived from data.

The concept of data collection isn’t a new one but the world has changed. There
is far more data available today, and it exists in forms that were unheard of a
century ago. The data collection process has had to change and grow with the
times, keeping pace with technology.

Whether you’re in the world of academia, trying to conduct research, or part of the
commercial sector, thinking of how to promote a new product, you need data
collection to help you make better choices.

A. Importance of Data Collection


Data collection methods play a crucial role in the research process as they determine the quality and accuracy of the data collected. Here are some of the major reasons data collection methods matter.
• Quality and Accuracy: The choice of data collection technique directly impacts the quality and accuracy of the data obtained. Properly designed methods help ensure that the data collected is error-free and relevant to the research questions.
• Relevance, Validity, and Reliability: Effective data collection methods help ensure that the data collected is relevant to the research objectives, valid (measuring what it intends to measure), and reliable (consistent and reproducible).
• Bias Reduction and Representativeness: Carefully chosen data collection methods can help minimize biases inherent in the research process, such as sampling or response bias. They also aid in achieving a representative sample, enhancing the findings' generalizability.
• Informed Decision Making: Accurate and reliable data collected through appropriate methods provide a solid foundation for making informed decisions based on research findings. This is crucial for both academic research and practical applications in various fields.
• Achievement of Research Objectives: Data collection methods should align with the research objectives to ensure that the collected data effectively addresses the research questions or hypotheses. Properly collected data facilitates the attainment of these objectives.
• Support for Validity and Reliability: The choice of data collection methods can either enhance or detract from the validity and reliability of research findings. Therefore, selecting appropriate methods is critical for ensuring the credibility of the research.
The importance of data collection methods cannot be overstated, as they play a
key role in the research study’s overall success and internal validity.

B. The Importance of Ensuring Accurate and Appropriate Data Collection
Accurate data collection is crucial to preserving the integrity of research, regardless of the subject of study or the preferred way of defining data (quantitative or qualitative). Errors are less likely to occur when the right data gathering tools are used, whether they are brand-new, updated versions of existing tools, or already available.
The effects of incorrectly collected data include the following:
• Erroneous conclusions that squander resources
• Decisions that compromise public policy
• Inability to correctly answer research questions
• Harm to human or animal participants
• Misleading other researchers into pursuing futile research avenues
• Inability of the study to be replicated and validated
Although the degree of influence of flawed data collection may vary by discipline and the type of investigation, when such findings are used to support public-policy recommendations, the potential harm is disproportionate.

1.1.4 What Are the Different Methods of Data Collection?


Primary and secondary methods of data collection are two approaches used to
gather information for research or analysis purposes.
1. Primary Data Collection. Primary data collection involves the collection of original data directly from the source or through direct interaction with the respondents. This method allows researchers to obtain first-hand information specifically tailored to their research objectives. Primary data is collected first-hand and has not been used before. The data gathered by primary data collection methods are highly accurate and specific to the research's purpose.
There are various techniques for primary data collection, including:
a. Surveys and Questionnaires: Researchers design structured questionnaires
or surveys to collect data from individuals or groups. These can be conducted
through face-to-face interviews, telephone calls, mail, or online platforms.
b. Interviews: Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through video
conferencing. Interviews can be structured (with predefined questions), semi-
structured (allowing flexibility), or unstructured (more conversational).
c. Observations: Researchers observe and record behaviors, actions, or events
in their natural setting. This method is useful for gathering data on human behavior,
interactions, or phenomena without direct intervention.

d. Experiments: Experimental studies involve the manipulation of variables to
observe their impact on the outcome. Researchers control the conditions and
collect data to draw conclusions about cause-and-effect relationships.
e. Focus Groups: Focus groups bring together a small group of individuals who
discuss specific topics in a moderated setting. This method helps in understanding
opinions, perceptions, and experiences shared by the participants.

2. Secondary Data Collection. Secondary data collection involves using existing data collected by someone else for a purpose different from the original intent.
Researchers analyze and interpret this data to extract relevant information.
Secondary data can be obtained from various sources, including:
a. Published Sources: Researchers refer to books, academic journals,
magazines, newspapers, government reports, and other published materials that
contain relevant data.
b. Online Databases: Numerous online databases provide access to a wide range
of secondary data, such as research articles, statistical information, economic
data, and social surveys.
c. Government and Institutional Records: Government agencies, research
institutions, and organizations often maintain databases or records that can be
used for research purposes.
d. Publicly Available Data: Data shared by individuals, organizations, or
communities on public platforms, websites, or social media can be accessed and
utilized for research.
e. Past Research Studies: Previous research studies and their findings can serve
as valuable secondary data sources. Researchers can review and analyze the
data to gain insights or build upon existing knowledge.

Before an analyst begins collecting data, they must answer three questions first:
• What's the goal or purpose of this research?
• What kinds of data are they planning on gathering?
• What methods and procedures will be used to collect, store, and process the information?
Additionally, we can break up data into qualitative and quantitative types.
Qualitative data covers descriptions such as color, size, quality, and appearance.
Quantitative data, unsurprisingly, deals with numbers, such as statistics, poll
numbers, percentages, etc.

A. Finding Relevant Data
Finding relevant data is not easy. Several factors need to be considered when trying to find relevant data, including:
• the relevant domain
• the relevant demographics
• the relevant time period, among many other factors.
Data that is not relevant to our study in any of these factors is unusable, and we cannot effectively proceed with its analysis. This could lead to incomplete research or analysis, re-collecting data again and again, or shutting down the study.

B. Deciding the Data to Collect

Determining what data to collect is one of the most important decisions in data collection and should be made first. We must choose the subjects the data will cover, the sources we will use to gather it, and the quantity of information we will require. Our answers to these questions will depend on our aims, or what we expect to achieve using the data. As an illustration, we may choose to gather information on the categories of articles that website visitors between the ages of 20 and 50 most frequently access. We can also decide to compile data on the typical age of all the clients who made a purchase from the business over the previous month.
Not addressing this could lead to double work, collection of irrelevant data, or ruining the study as a whole.

C. Dealing with Big Data

Big data refers to exceedingly massive data sets with intricate and diversified structures. These traits typically create challenges in storing and analyzing the data and in applying additional methods of extracting results. In particular, big data refers to data sets so enormous or intricate that conventional data processing tools are insufficient: the overwhelming amount of data, both unstructured and structured, that a business faces on a daily basis.
The amount of data produced by healthcare applications, the internet, social networking sites, sensor networks, and many other sources is growing rapidly as a result of recent technological advancements. Big data refers to the vast volume of data created from numerous sources in a variety of formats at extremely fast rates. Dealing with this kind of data is one of the many challenges of data collection and is a crucial step toward collecting effective data.

1.1.5 Issues Related to Maintaining the Integrity of Data Collection
Maintaining data integrity is the main justification for detecting errors in the data gathering process, whether they were made purposefully (deliberate falsification) or not (systematic or random errors).
Quality assurance and quality control are two strategies that help protect data integrity and guarantee the scientific validity of study results. Each strategy is used at a different stage of the research timeline:
• Quality assurance - activities that happen before data gathering starts
• Quality control - activities that are performed during and after data collection

A. Quality Assurance
Since quality assurance precedes data collection, its primary goal is "prevention" (i.e., forestalling problems with data collection). Prevention is the best way to protect the accuracy of data collection, and the best example of this proactive step is the uniformity of protocol set out in a thorough and exhaustive procedures manual for data collection.
The likelihood of failing to spot issues and mistakes early in the research effort increases when guides are written poorly. These shortcomings show up in several ways:
• Failure to specify the precise subjects and methods for training or retraining staff members in data collection
• A partial list of the items to be collected
• No system in place to track modifications to procedures that may occur as the investigation continues
• A vague description of the data gathering instruments to be used, instead of detailed, step-by-step instructions on how to administer tests
• Uncertainty regarding the timing, procedure, and identity of the person or people in charge of examining the data
• Incomprehensible guidelines for using, adjusting, and calibrating the data collection equipment

B. Quality Control
Despite the fact that quality control actions (detection/monitoring and intervention)
take place both after and during data collection, the specifics should be
meticulously detailed in the procedures manual. Establishing monitoring systems
requires a specific communication structure, which is a prerequisite. Following the
discovery of data collection problems, there should be no ambiguity regarding the
information flow between the principal investigators and staff personnel. A poorly designed communication system promotes lax oversight and reduces opportunities for error detection.
Detection or monitoring can take the form of direct staff observation during site visits, conference calls, or frequent and routine assessments of data reports to spot discrepancies, out-of-range values, or invalid codes. Site visits might not be appropriate for all disciplines. Still, without routine auditing of records, whether qualitative or quantitative, it will be challenging for investigators to confirm that data gathering is taking place in accordance with the manual's defined methods. Additionally, quality control identifies the appropriate solutions, or "actions," to fix flawed data gathering procedures and reduce recurrences.
For instance, problems with data collection that call for immediate action include:
• Fraud or misbehavior
• Systematic mistakes and procedure violations
• Individual data items with errors
• Issues with certain staff members or a site's performance
In the social and behavioral sciences, where primary data collection involves human subjects, researchers are trained to include one or more secondary measures that can be used to verify the quality of the information being obtained from the human subject.

1.1.6 Analysis of Data


Data analysis is a field of statistics; it refers to the process of extracting relevant and noteworthy information from collected data using statistical tools or techniques.

The purpose of analysis is to reduce data to an intelligible and interpretable form so that the relations of research problems can be studied and tested. Interpretation is the search for meaning and implication: taking the results of the analysis, making inferences about the relations studied, and ultimately drawing conclusions about those relations.
Note: The kind of analysis that will be used will depend on the kind of data that
were obtained or gathered.

A. Type of Data Analysis


Data analysis is the process of examining raw data to extract meaningful insights
and patterns. It plays a crucial role in various fields, helping us understand trends,
make informed decisions, and solve problems. Data analysis is a crucial aspect of
data science and data analytics, involving the examination, cleaning, transformation, and modeling of data to extract meaningful insights. There are
various types of data analysis, each serving a distinct purpose and offering
different perspectives on data.
1. Descriptive Analysis
Descriptive analysis is the most basic type of analysis, focusing on summarizing
and describing the data. It answers the question "What happened?" by providing
a clear and concise overview of the data.
• Goal: To summarize and describe the characteristics of a dataset.
• Methods: Uses measures of central tendency (mean, median, mode), measures of dispersion (range, standard deviation), and visualizations like histograms, bar charts, and pie charts.
• Example: Analyzing customer demographics to understand the average age, gender, and location of your customers (see the sketch after this list).
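
A minimal sketch of these descriptive measures in Python, using a hypothetical list of customer ages:

    import statistics

    ages = [23, 35, 31, 42, 35, 28, 39, 35]   # hypothetical customer ages

    print("mean:  ", statistics.mean(ages))    # central tendency
    print("median:", statistics.median(ages))
    print("mode:  ", statistics.mode(ages))
    print("stdev: ", round(statistics.stdev(ages), 1))   # dispersion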

2. Diagnostic Analysis
Diagnostic analytics delves deeper than descriptive analytics, seeking to
understand the "Why" behind the observed trends. It examines the underlying
causes and factors that contribute to the patterns revealed in descriptive analysis.
• Goal: To investigate the "why" behind observed patterns or trends in data.
• Methods: Often follows descriptive analysis, examining related data sources, historical data, and potential causal factors.
• Example: Investigating why sales of a specific product decreased in a particular month by analyzing factors like competitor activity, seasonal trends, and marketing campaigns.

3. Exploratory Analysis (EDA)


Exploratory Data Analysis (EDA) is a crucial step in the data analysis process,
involving visualizing and exploring the data to gain a deeper understanding of its
characteristics and potential relationships. It's often used to identify patterns,
outliers, and potential areas for further investigation.
• Goal: To discover hidden patterns, relationships, and anomalies in data.
• Methods: Uses visual techniques like scatter plots, box plots, and heatmaps to identify potential correlations and relationships.
• Example: Examining the relationship between temperature and ice cream sales to see if there's a correlation (see the sketch after this list).
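
A minimal EDA sketch of the ice cream example, using matplotlib with hypothetical values:

    import matplotlib.pyplot as plt

    temperature = [18, 21, 24, 27, 30, 33]   # degrees Celsius (hypothetical)
    sales = [120, 150, 210, 260, 330, 400]   # cones sold per day (hypothetical)

    plt.scatter(temperature, sales)          # look for a relationship by eye
    plt.xlabel("Temperature (deg C)")
    plt.ylabel("Ice cream sales (cones/day)")
    plt.title("Exploring temperature vs. ice cream sales")
    plt.show()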

4. Inferential Analysis
Inferential analysis uses statistical methods to draw conclusions and make
generalizations about a larger population based on a smaller sample of data. It
aims to answer the question "What can we infer about the population from this
sample?"
• Goal: To draw conclusions about a larger population based on a sample of data.
• Methods: Uses statistical techniques like hypothesis testing and confidence intervals to make inferences about the population.
• Example: Conducting a survey of 100 customers to infer the satisfaction level of all customers (see the sketch after this list).
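
A minimal sketch of the survey example: an approximate 95% confidence interval for the mean satisfaction score of all customers, from a hypothetical sample of 1-5 ratings. The z-value of 1.96 is a normal approximation used for simplicity; a t-value would be more appropriate for small samples:

    import math
    import statistics

    sample = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5, 4, 4]   # hypothetical ratings

    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)    # standard error of the mean
    margin = 1.96 * se                              # normal approximation

    print(f"Estimated mean satisfaction: {mean:.2f} +/- {margin:.2f} (95% CI)")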

5. Predictive Analysis
Predictive analysis uses historical data and statistical modeling to forecast future
trends and outcomes. It answers the question "What might happen in the future?"
by leveraging patterns and relationships identified in previous analyses.
• Goal: To forecast future trends and outcomes based on historical data.
• Methods: Uses machine learning algorithms, regression models, and time series analysis to make predictions.
• Example: Predicting the demand for a product in the next quarter based on historical sales data (see the sketch after this list).
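
A minimal predictive sketch: fitting a least-squares line to hypothetical quarterly sales with numpy and extrapolating one quarter ahead:

    import numpy as np

    quarters = np.array([1, 2, 3, 4, 5, 6])
    sales = np.array([110, 125, 138, 150, 166, 180])   # hypothetical units sold

    slope, intercept = np.polyfit(quarters, sales, 1)  # fit a straight line
    forecast = slope * 7 + intercept                   # extrapolate to quarter 7

    print(f"Forecast for quarter 7: about {forecast:.0f} units")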

6. Causal Analysis
Causal analysis aims to establish cause-and-effect relationships between variables. It goes beyond correlation, seeking to understand how changes in one variable directly impact another.
• Goal: To determine the cause-and-effect relationships between variables.
• Methods: Often involves controlled experiments, randomized controlled trials, and statistical modeling to isolate causal relationships.
• Example: Conducting an A/B test to determine the impact of a new marketing campaign on website traffic.

7. Mechanistic Analysis
Mechanistic analysis is a type of data analysis that focuses on understanding the
underlying mechanisms and processes that drive observed relationships between
variables.
• Goal: To understand the underlying mechanisms and processes that drive observed relationships.
• Methods: Often used in the physical or engineering sciences, requiring high precision and meticulous methodologies.
• Example: Analyzing data from a nuclear fusion experiment to understand the processes involved in energy generation.

8. Prescriptive Analysis
Prescriptive analysis goes beyond predictions, offering recommendations and
actionable insights based on the analysis. It answers the question "What should
we do next?" by suggesting optimal courses of action to achieve specific goals.
• Goal: To recommend actions or strategies based on insights from previous data analyses.
• Methods: Often involves optimization algorithms, machine learning, and simulation modeling to determine the best course of action.
• Example: Recommending pricing strategies for a product based on predicted demand and competitor pricing.

B. Number of Variables
In data analysis, univariate, bivariate, and multivariate refer to the number of
variables being analyzed simultaneously.
1. Univariate Analysis
• Focus: Examines a single variable at a time.
• Purpose: To describe and summarize the characteristics of that variable. This involves measures like mean, median, mode, standard deviation, and visualizations like histograms or bar charts.
• Example: Analyzing the average age of customers in a store.

2. Bivariate Analysis
• Focus: Examines the relationship between two variables.
• Purpose: To determine if there's a correlation or association between the two variables. This involves scatter plots, correlation coefficients, and regression analysis.
• Example: Investigating whether there's a relationship between the number of hours spent studying and exam scores (see the sketch after this list).
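
A minimal bivariate sketch computing Pearson's correlation coefficient for the studying example, with hypothetical values (statistics.correlation requires Python 3.10 or later):

    import statistics

    hours = [1, 2, 3, 4, 5, 6, 7, 8]            # hypothetical study hours
    scores = [52, 55, 61, 64, 70, 74, 79, 83]   # hypothetical exam scores

    r = statistics.correlation(hours, scores)   # Pearson's r
    print(f"Correlation coefficient: r = {r:.2f}")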

3. Multivariate Analysis
• Focus: Examines the relationships between three or more variables simultaneously.
• Purpose: To understand complex interactions between variables, identify patterns, and make predictions. This involves techniques like multiple regression, factor analysis, principal component analysis, and cluster analysis.
• Example: Analyzing the impact of age, income, and education level on a person's likelihood of buying a specific product.

1.1.7 Statistics and Research Defined:


Statistics is a branch of science which deals with the collection, presentation, analysis, and interpretation of data.

Research is a systematic study or investigation of something for the purpose of answering questions posed by the researcher.

A. Relationship between Research and Statistics

    Research   : Plan           : Theory
    Statistics : Implementation : Practice

Figure 1. Relationship between Research and Statistics

Plan: Chapters 1, 2, and 3, plus the instrument used to gather data (Research Proposal or Research Outline)
Theory: Not concerned with practical application
Implementation: Implementation of the research proposal (Chapters 4 and 5)
Practice: Implementation of the findings

Meaning of the diagram:
"Plan without implementation is almost useless."
"Theory without practice is almost useless."
"Research without Statistics is almost useless (incomplete)."

B. Research Process

    Identification of the Problem -> Formulation of Hypothesis -> Data Analysis -> Testing Hypothesis -> Conclusion

Figure 2. Research Process

1.2 Planning and Conducting Surveys

1.2.1 Designing, Conducting, and Analyzing Surveys


A survey is a way to ask a lot of people a few well-constructed questions. The
survey is a series of unbiased questions that the subject must answer. Some
advantages of surveys are that they are efficient ways of collecting information
from a large number of people, they are relatively easy to administer, a wide variety
of information can be collected and they can be focused (researchers can stick to
just the questions that interest them.) Some disadvantages of surveys arise from
the fact that they depend on the subjects’ motivation, honesty, memory and ability
to respond. Moreover, answer choices to survey questions could lead to vague
data. For example, the choice “moderately agree” may mean different things to
different people or to whoever ends up interpreting the data.

A. Conducting a Survey
There are various methods for administering a survey. It can be done as a face-to-face interview or a phone interview where the researcher is questioning the subject. A different option is to have a self-administered survey where the subject can complete a survey on paper and mail it back, or complete the survey online.

There are advantages and disadvantages to each of these methods.

The advantages of face-to-face interviews include fewer misunderstood questions, fewer incomplete responses, higher response rates, and greater control over the environment in which the survey is administered; also, the researcher can collect additional information if any of the respondents' answers need clarifying. The disadvantages of face-to-face interviews are that they can be expensive and time-consuming and may require a large staff of trained interviewers. In addition, the response can be biased by the appearance or attitude of the interviewer.

The advantages of self-administered surveys are that they are less expensive than
interviews, do not require a large staff of experienced interviewers and can be
administered in large numbers. In addition, anonymity and privacy encourage more
candid and honest responses, and there is less pressure on respondents. The
disadvantages of self-administered surveys are that responders are more likely to stop participating midway through the survey, and researchers cannot ask respondents to clarify their answers. In addition, there are lower response rates than in personal
interviews, and often the respondents who bother to return surveys represent
extremes of the population – those people who care about the issue strongly,
whichever way their opinion leans.

B. Designing a Survey
Surveys can take different forms. They can be used to ask only one question or they
can ask a series of questions. We can use surveys to test out people’s opinions or
to test a hypothesis.
When designing a survey, the following steps are useful:
1. Determine the goal of your survey: What question do you want to answer?
2. Identify the sample population: Whom will you interview?
3. Choose an interviewing method: face-to-face interview, phone interview, self-
administered paper survey, or internet survey.
4. Decide what questions you will ask in what order, and how to phrase them. (This
is important if there is more than one piece of information you are looking for.)
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.

C. Constructing a Survey

Example C.1. Martha wants to construct a survey that shows which sports students
at her school like to play the most.
a) List the goal of the survey.
The goal of the survey is to find the answer to the question: “Which sports do
students at Martha’s school like to play the most?”
b) What population sample should she interview?
A sample of the population would include a random sample of the student
population in Martha’s school. A good strategy would be to randomly select students
(using dice or a random number generator) as they walk into an all-school
assembly.

c) How should she administer the survey?
Face-to-face interviews are a good choice in this case. Interviews will be easy to
conduct since the survey consists of only one question which can be quickly
answered and recorded, and asking the question face to face will help eliminate
non-response bias.
d) Create a data collection sheet that she can use to record her results.
In order to collect the data for this simple survey, Martha can design a data collection sheet such as the one below:

Figure 3. Martha’s data collection sheet

This is a good, simple data collection sheet because:
• Plenty of space is left for the tally marks.
• Only one question is being asked.
• Many possibilities are included, but space is left at the bottom in case students give answers that Martha didn't think of.
• The answer from each interviewee can be quickly collected, and then the data collector can move on to the next person.
Once the data has been collected, suitable graphs can be made to display the
results.

Example C.2. Raoul wants to construct a survey that shows how many hours per
week the average student at his school works.
a) List the goal of the survey.
The goal of the survey is to find the answer to the question “How many hours per
week do you work?”
b) What population sample will he interview?
Raoul suspects that older students might work more hours per week than younger
students. He decides that a stratified sample of the student population would be
appropriate in this case. The strata are grade levels 9th through 12th. He would need to find out what proportion of the students in his school are in each grade level, and then include the same proportions in his sample.
c) How would he administer the survey?
Face-to-face interviews are a good choice in this case since the survey consists of
two short questions which can be quickly answered and recorded.
d) Create a data collection sheet that Raoul can use to record his results.
In order to collect the data for this survey Raoul designed the data collection sheet
shown below:

Figure 4. Raoul’s data collection sheet

This data collection sheet allows Raoul to write down the actual numbers of hours
worked per week by students as opposed to just collecting tally marks for several
categories.

D. Display, Analyze, and Interpret Statistical Survey Data


In the previous section we considered two examples of surveys you might conduct
in your school. The first one was designed to find the sport that students like to
play the most. The second survey was designed to find out how many hours per
week students worked.
For the first survey, students’ choices fit neatly into separate categories.
Appropriate ways to display the data might be a pie chart (shows the relationship
of the parts to the whole by visually comparing the sizes of the sections/slices of a
circle) or a bar graph.

In Example C.1., Martha interviewed 112 students and obtained the following
results.

Figure 5. Results of Martha’s data collection

a) Make a bar graph of the results showing the percentage of students in each category.
To make a bar graph, we list the sport categories on the x-axis and let the percentage of students be represented by the y-axis.
To find the percentage of students in each category, we divide the number of students in each category by the total number of students surveyed:

Figure 6. Percentage calculation from Martha’s data collection results

Now we can make a graph where the height of each bar represents the percentage
of students in each category:

Figure 7. Bar graph showing the percentage of students playing each sport

b) Make a pie chart of the collected information, showing the percentage of students in each category.
To make a pie chart, we find the percentage of the students in each category by dividing the number of students in each category by the total surveyed, as in part a). The central angle of each slice of the pie is found by multiplying the percentage of students in each category by 360 degrees (the total number of degrees in a circle). To draw a pie chart by hand, you can use a protractor to measure the central angles that you find for each category.

Figure 8. Table showing the sports, percentage and central angle calculations

Figure 9. Pie-chart that represents the percentage of students in each category
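
The calculations and both displays can also be produced programmatically. The sketch below is a minimal matplotlib version; Martha's actual counts are in Figure 5, so the numbers here are hypothetical stand-ins that sum to 112:

    import matplotlib.pyplot as plt

    sports = ["Basketball", "Volleyball", "Soccer", "Badminton", "Other"]
    counts = [35, 28, 22, 15, 12]               # hypothetical; sums to 112
    total = sum(counts)

    percent = [100 * c / total for c in counts]
    angles = [360 * c / total for c in counts]  # central angle of each pie slice
    for s, p, a in zip(sports, percent, angles):
        print(f"{s}: {p:.1f}% -> {a:.0f} degrees")

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.bar(sports, percent)                    # bar graph of percentages
    ax1.set_ylabel("Percent of students")
    ax2.pie(counts, labels=sports, autopct="%1.1f%%")   # pie chart
    plt.show()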

For the second survey, actual numerical data can be collected from each student.
In this case we can display the data using a stem-and-leaf plot, a frequency table
(a table that summarizes a data set by stating the number of times each value
occurs within the data set), a histogram, or a box-and-whisker plot.

In Example C.2., Raoul found that 30% of the students at his school are in 9th grade, 26% of the students are in the 10th grade, 24% of the students are in 11th grade, and 20% of the students are in the 12th grade. He surveyed a total of 60 students, using these proportions as a guide for the number of students he interviewed from each grade. Raoul recorded the following data:

Figure 10. Table showing data collected by Raoul
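
As a quick check of the stratified sample sizes, each grade's share of the 60 interviews should match its share of the school population; a small sketch:

    proportions = {"9th": 0.30, "10th": 0.26, "11th": 0.24, "12th": 0.20}
    sample_size = 60

    for grade, p in proportions.items():
        print(grade, "->", round(p * sample_size), "students")
    # 9th -> 18, 10th -> 16, 11th -> 14, 12th -> 12 (rounded; total = 60)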

I. Construct a stem-and-leaf plot of the collected data


The ordered stem-and-leaf plot looks as follows:

We can easily see from the stem-and-leaf plot that the mode of the data is 0. This
makes sense because many students do not work in high school.

II. Construct a frequency table with a bin size of 5.


We construct the frequency table by counting how many students fit in each
category.

Figure 11. Frequency Table

III. Draw a histogram of the data.
The histogram associated with this frequency table is shown below.

Figure 12. Histogram of the data

IV. Find the five number summary of the data and draw a box-and-whisker plot.
The five number summary is as follows:
smallest number = 0
largest number = 22
Since there are 60 data points, the median is the mean of the 30th and the 31st values:
median = 6.5
Since each half of the list has 30 values in it, the first and third quartiles are the medians of each of the smaller lists. The first quartile is the mean of the 15th and 16th values:
first quartile = 0
The third quartile is the mean of the 45th and 46th values:
third quartile = 12

The associated box-and-whisker plot is shown below.

Figure 13. Box-and-whisker plot
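
A minimal numpy/matplotlib sketch of the same five-number-summary computation. Raoul's full data set is in Figure 10, so the list below is a short hypothetical stand-in; note that numpy's default percentile interpolation can differ slightly from the median-of-halves method used above:

    import numpy as np
    import matplotlib.pyplot as plt

    hours = np.array([0, 0, 0, 0, 2, 4, 5, 6, 7, 8, 10, 12, 12, 15, 18, 22])

    summary = {
        "min": hours.min(),
        "Q1": np.percentile(hours, 25),
        "median": np.median(hours),
        "Q3": np.percentile(hours, 75),
        "max": hours.max(),
    }
    print(summary)

    plt.boxplot(hours, vert=False)   # box-and-whisker plot
    plt.xlabel("Hours worked per week")
    plt.show()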


1.3 Planning and Conducting Experiments: Introduction to Design of
Experiments
What is the Scientific Method? Do you remember learning about this back in high
school or junior high even? What were those steps again?

Decide what phenomenon you wish to investigate. Specify how you can manipulate the factor and hold all other conditions fixed, to ensure that these extraneous conditions aren't influencing the response you plan to measure.

Then measure your chosen response variable at several (at least two) settings of
the factor under study. If changing the factor causes the phenomenon to change,
then you conclude that there is indeed a cause-and-effect relationship at work.

How many factors are involved when you do an experiment? Some say two -
perhaps this is a comparative experiment? Perhaps there is a treatment group and
a control group? If you have a treatment group and a control group then, in this
case, you probably only have one factor with two levels.

How many of you have baked a cake? What are the factors involved to ensure a
successful cake? Factors might include preheating the oven, baking time,
ingredients, amount of moisture, baking temperature, etc. -- what else? You
probably follow a recipe so there are many additional factors that control the
ingredients - i.e., a mixture. In other words, someone did the experiment in
advance! What parts of the recipe did they vary to make the recipe a success?
Probably many factors, temperature and moisture, various ratios of ingredients,
and presence or absence of many additives. Now, should one keep all the factors
involved in the experiment at a constant level and just vary one to see what would
happen? This is a strategy that works but is not very efficient. This is one of the
concepts that we will address in this course.

1.3.1 A Quick History of the Design of Experiments (DOE)


All experiments are designed experiments; it is just that some are poorly designed and some are well-designed.
A. Engineering Experiments
If we had infinite time and resource budgets there probably wouldn't be a big fuss
made over designing experiments. In production and quality control we want to
control the error and learn as much as we can about the process or the underlying theory with the resources at hand. From an engineering perspective we're trying
to use experimentation for the following purposes:
• reduce time to design/develop new products & processes
• improve performance of existing processes
• improve reliability and performance of products
• achieve product & process robustness
• perform evaluation of materials, design alternatives, setting component & system tolerances, etc.
We always want to fine-tune or improve the process. In today's global world this
drive for competitiveness affects all of us both as consumers and producers.

Robustness is a concept that enters into statistics at several points. At the analysis stage, robustness refers to a technique that isn't overly influenced by bad data: even if there is an outlier or bad data, you still want to get the right answer. Regardless of who or what is involved in the process, it is still going to work.

Figure 14. General model of a process or system

Every experiment design has inputs. Back to the cake baking example: we have
our ingredients such as flour, sugar, milk, eggs, etc. Regardless of the quality of
these ingredients we still want our cake to come out successfully. In every
experiment there are inputs and in addition, there are factors (such as time of
baking, temperature, geometry of the cake pan, etc.), some of which you can
control and others that you can't control. The experimenter must think about factors
that affect the outcome. We also talk about the output and the yield or the response
to your experiment. For the cake, the output might be measured as texture, flavor, height, or size.

B. Four Eras in the History of DOE
Here's a quick timeline:
• The agricultural origins, 1918 – 1940s
  o R. A. Fisher & his co-workers
  o Profound impact on agricultural science
  o Factorial designs, ANOVA
• The first industrial era, 1951 – late 1970s
  o Box & Wilson, response surfaces
  o Applications in the chemical & process industries
• The second industrial era, late 1970s – 1990
  o Quality improvement initiatives in many companies
  o CQI and TQM were important ideas and became management goals
  o Taguchi and robust parameter design, process robustness
• The modern era, beginning circa 1990, when economic competitiveness and globalization are driving all sectors of the economy to be more competitive.

1.3.2 The Basic Principles of DOE


A. Randomization
This is an essential component of any experiment that is going to have validity. If
you are doing a comparative experiment where you have two treatments, a
treatment and a control, for instance, you need to include in your experimental
process the assignment of those treatments by some random process. An
experiment includes experimental units. You need to have a deliberate process to
eliminate potential biases from the conclusions, and random assignment is a
critical step.
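
A minimal sketch of random assignment in Python: shuffling 20 hypothetical experimental units and splitting them evenly between treatment and control:

    import random

    units = list(range(1, 21))    # IDs of 20 experimental units
    random.shuffle(units)         # the random process that removes assignment bias

    treatment_group = sorted(units[:10])
    control_group = sorted(units[10:])
    print("Treatment:", treatment_group)
    print("Control:  ", control_group)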
B. Replication
Replication is in some sense the heart of all of statistics. To make this point, remember what the standard error of the mean is: it is the square root of the estimate of the variance of the sample mean, i.e., s/√n, the sample standard deviation divided by the square root of the sample size. The width of the confidence interval is determined by this statistic. Our estimates of the mean become less variable as the sample size increases.

Replication is the basic issue behind every method we will use in order to get a
handle on how precise our estimates are at the end. We always want to estimate
or control the uncertainty in our results. We achieve this estimate through
replication. Another way we can achieve short confidence intervals is by reducing the error variance itself. However, when that isn't possible, we can reduce the error in our estimate of the mean by increasing n.
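
A quick numerical illustration of this point: with the sample standard deviation held at an assumed s = 4, the standard error s/√n shrinks as the number of replicates n grows:

    import math

    s = 4.0   # assumed sample standard deviation
    for n in [4, 16, 64, 256]:
        print(f"n = {n:3d} -> standard error = {s / math.sqrt(n):.2f}")
    # n=4 -> 2.00, n=16 -> 1.00, n=64 -> 0.50, n=256 -> 0.25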

Another way to reduce the length of the confidence interval is to reduce the error variance itself - which brings us to blocking.

C. Blocking
Blocking is a technique to include other factors in our experiment which contribute
to undesirable variation. Much of the focus in this class will be to creatively use
various blocking techniques to control sources of variation that will reduce error
variance. For example, in human studies, the gender of the subjects is often an
important factor. Age is another factor affecting the response. Age and gender
are often considered nuisance factors which contribute to variability and make it
difficult to assess systematic effects of a treatment. By using these as blocking
factors, you can avoid biases that might occur due to differences between the
allocations of subjects to the treatments, and as a way of accounting for some
noise in the experiment. We want the unknown error variance at the end of the
experiment to be as small as possible. Our goal is usually to find out something
about a treatment factor (or a factor of primary interest), but in addition to this, we
want to include any blocking factors that will explain variation.

D. Multi-factor Designs
The point of all of these multi-factor designs stands in contrast to the classical scientific method, where everything is held constant except one factor, which is varied. The one-factor-at-a-time method is a very inefficient way of making scientific advances. It is much
better to design an experiment that simultaneously includes combinations of
multiple factors that may affect the outcome. Then you learn not only about the
primary factors of interest but also about these other factors. These may be
blocking factors which deal with nuisance parameters or they may just help you
understand the interactions or the relationships between the factors that influence
the response.

E. Confounding
Confounding is something that is usually considered bad! Here is an example. Let's
say we are doing a medical study with drugs A and B. We put 10 subjects on drug
A and 10 on drug B. If we categorize our subjects by gender, how should we
allocate our drugs to our subjects? Let's make it easy and say that there are 10
male and 10 female subjects. A balanced way of doing this study would be to put five males on drug A and five males on drug B, five females on drug A and five
females on drug B. This is a perfectly balanced experiment such that if there is a
difference between male and female at least it will equally influence the results
from drug A and the results from drug B.

An alternative scenario might occur if patients were randomly assigned treatments as they came in the door. At the end of the study, they might realize that drug A
had only been given to the male subjects and drug B was only given to the female
subjects. We would call this design totally confounded. This refers to the fact that
if you analyze the difference between the average response of the subjects on A
and the average response of the subjects on B, this is exactly the same as the
average response on males and the average response on females. You would not
have any reliable conclusion from this study at all. The difference between the two
drugs A and B, might just as well be due to the gender of the subjects since the
two factors are totally confounded.

Confounding is something we typically want to avoid, but when we are building complex experiments we can sometimes use confounding to our advantage. We will confound things we are not interested in, in order to have more efficient experiments for the things we are interested in. This will come up in multiple-factor experiments later on. We may be interested in main effects but not interactions, so we will confound the interactions in order to reduce the sample size, and thus the cost of the experiment, while still having good information on the main effects.

1.3.3 Steps for Planning, Conducting and Analyzing an Experiment


The practical steps needed for planning and conducting an experiment include:
recognizing the goal of the experiment, choice of factors, choice of response,
choice of the design, analysis and then drawing conclusions. This pretty much
covers the steps involved in the scientific method.
1. Recognition and statement of the problem
2. Choice of factors, levels, and ranges
3. Selection of the response variable(s)
4. Choice of design
5. Conducting the experiment
6. Statistical analysis
7. Drawing conclusions, and making recommendations

A. Factors
We usually talk about "treatment" factors, which are the factors of primary interest
to you. In addition to treatment factors, there are nuisance factors which are not
your primary focus, but you have to deal with them. Sometimes these are called
blocking factors, mainly because we will try to block on these factors to prevent
them from influencing the results.
There are other ways that we can categorize factors:
Experimental vs. Classification Factors
Experimental Factors
These are factors that you can specify (and set the levels of) and then assign at random as the treatment to the experimental units. Examples would be temperature, level of an additive, or fertilizer amount per acre.
Classification Factors
These can't be changed or assigned; they come as labels on the experimental units. The age and sex of the participants are classification factors which can't be changed or randomly assigned. But you can select individuals from these groups randomly.

Quantitative vs. Qualitative Factors


Quantitative Factors
You can assign any specified level of a quantitative factor. Examples:
percent or pH level of a chemical.
Qualitative Factors
These factors have categories of different types. Examples might be species of a plant or animal, a brand in the marketing field, or gender; these are not ordered or continuous but are arranged in sets.

References:

Aquino, G. V. (1971). Essentials of Research and Thesis Writing. 1st ed. Phoenix Publishing House, Inc.

Basilo, F. B. (2003). Fundamental Statistics. 1st ed. Trinitas Publishing, Inc.

Best, J. W. (1989). Research in Education. 6th ed. New Jersey: Prentice Hall, Inc.

Montgomery, D. C. (2019). Design and Analysis of Experiments. 10th ed. John Wiley & Sons. ISBN 978-1-119-59340-9. Accessed through: https://online.stat.psu.edu/stat503/book/export/html/632

Neo, B., & Urwin, M. (2024). 8 Types of Data Analysis. Accessed through: https://builtin.com/data-science/types-of-data-analysis

Planning and Conducting Surveys. CK-12 Foundation. Accessed through: https://www.ck12.org/statistics/planning-and-conducting-surveys/lesson/planning-and-conducting-surveys-alg-i