Lecture Notes 1 Introduction To Statistics and Data Analysis
Topics:
I. Introduction
II. Sources of Data
III. Data Presentation
IV. Descriptive Statistics
V. Introduction to Design of Experiments
I. INTRODUCTION
A. What is Statistics?
The word statistics in our everyday life means different things to different people. To a football fan,
statistics are the information about rushing yardage, passing yardage, and first downs, given at
halftime. To a manager of a power generating station, statistics may be information about the
quantity of pollutants being released into the atmosphere. To a school principal, statistics are
information on absenteeism, test scores and teacher salaries. To a medical researcher
investigating the effects of a new drug, statistics are evidence of the success of research efforts. And to
a college student, statistics are the grades made on all the quizzes in a course this semester.
Each of these people is using the word statistics correctly, yet each uses it in a slightly different way
and for a somewhat different purpose. Statistics is a word that can refer to quantitative data or to a
field of study.
As a field of study, statistics is the science of collecting, organizing and interpreting numerical facts,
which we call data. We are bombarded by data in our everyday life. The collection and study of data
are important in the work of many professions, so that training in the science of statistics is valuable
preparation for a variety of careers. Each month, for example, government statistical offices release the
latest numerical information on unemployment and inflation. Economists and financial advisors as
well as policy makers in government and business study these data in order to make informed
decisions. Farmers study data from field trials of new crop varieties. Engineers gather data on the
quality and reliability of manufactured products. Most areas of academic study make use of
numbers, and therefore also make use of methods of statistics.
Whatever else it may be, statistics is, first and foremost, a collection of tools used for converting raw
data into information to help decision makers in their work.
Definition 1
A population is a collection (or set) of data that describes some phenomenon of interest to
you.
Definition 2
A sample is a subset of data selected from a population.
Example 1 The population may be all women in a country, for example, in Vietnam. If from each city
or province we select 50 women, then the set of selected women is a sample.
Example 2 The set of all whisky bottles produced by a company is a population. For quality
control, 150 whisky bottles are selected at random. This portion is a sample.
Definition 3
A statistic is a numerical summary of a sample. By contrast, a numerical summary of a
population is called a parameter.
For example, if we know from ECC data that the average age of all ECC students is 29, that value is a
parameter. On the other hand, if we take a sample of 100 students and find that 63% support a new
initiative at the college, that is a statistic - since it is only a measure of the sample of 100 students, not
the entire student population.
Definition 4
The branch of statistics devoted to the summarization and description of data (population or
sample) is called descriptive statistics.
If it is too expensive or impossible to acquire every measurement in the population, then we select a
sample of data from the population and use the sample to infer the nature of the population.
Definition 5
The branch of statistics concerned with using sample data to make an inference about a
population of data is called inferential statistics.
D. What is Measurement?
In statistics, the term measurement is used more broadly and is more appropriately termed scales of
measurement. Scales of measurement refer to ways in which variables/numbers are defined and
categorized. Each scale of measurement has certain properties, which in turn determine the
appropriateness of certain statistical analyses. The four scales of measurement are nominal,
ordinal, interval, and ratio.
1. Nominal - Categorical data and numbers that are simply used as identifiers or names.
Examples: Numbers on the back of a baseball jersey (St. Louis Cardinals 1 = Ozzie Smith)
Social security number
Male = 1, Female = 2
Political affiliation; Eye color
2. Ordinal - A scale whose categories can be ranked in order, but for which the intervals between ranks are not necessarily equal.
Examples: Class standing (freshman, sophomore, junior, senior); rankings in a contest (1st, 2nd, 3rd)
3. Interval - A scale which represents quantity and has equal units but for which zero represents
simply an additional point of measurement is an interval scale.
With each of these scales there is a direct, measurable quantity with equality of units. In addition,
zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers
both above and below it (for example, -10 degrees Fahrenheit).
4. Ratio - The ratio scale of measurement is similar to the interval scale in that it also represents
quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist
below the zero).
Examples: height and weight; measuring the length of a piece of wood in centimeters
The table below will help clarify the fundamental differences between the four scales of measurement.
Interval and Ratio data are sometimes referred to as parametric and Nominal and Ordinal data are
referred to as nonparametric.
Parametric means that it meets certain requirements with respect to parameters of the population
(for example, the data will be normal - the distribution parallels the normal or bell curve).
- numbers can be added, subtracted, multiplied, and divided
- data are analyzed using statistical techniques identified as Parametric Statistics
As a rule, there are more statistical technique options for the analysis of parametric data and
parametric statistics are considered more powerful than nonparametric statistics.
Nonparametric data lack those same parameters and cannot be added, subtracted, multiplied,
and divided. For example, it does not make sense to add Social Security numbers to get a third person.
Nonparametric data are analyzed by using Nonparametric Statistics.
Cohort studies: a group of individuals (a cohort) is observed over a period of time (which can be a
long time); characteristics of the individuals are recorded, and some individuals are studied further
2. Designed Experiment
- Individuals in the study are assigned to certain groups
- Groups are given varying levels of the explanatory variable
- Values of the response variable are recorded for each group
1. Stratified Sampling
Separate the population into non-overlapping groups called strata
Obtain a simple random sample from each stratum
Each stratum should be homogeneous (or similar) in some way
2. Systematic Sampling
Obtained by selecting every kth individual from the population. The first individual
selected is a random number between 1 and k
No frame (list of the population) is needed
k is determined, when the size of the population, N, is known, by dividing N by the
sample size and rounding down (see the sketch after this list)
3. Cluster Sampling
Obtained by selecting all individuals within a randomly selected collection or group of
individuals
4. Convenience Sampling
Sample in which the individuals are easily obtained
Self-selected samples are the most popular kind (individuals voluntarily decide to be in the sample)
5. Multistage Sampling
Combination of sampling techniques
Examples: Nielsen ratings
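As referenced under systematic sampling above, here is a minimal Python sketch of that technique; the population list and sample size are illustrative assumptions, not data from these notes:

    import random

    def systematic_sample(population, n):
        # Step size k: population size N divided by the sample size, rounded down
        k = len(population) // n
        start = random.randint(0, k - 1)   # random starting individual
        return population[start::k][:n]    # every k-th individual thereafter

    # Illustrative use: a sample of 10 from a population of size 100 (k = 10)
    print(systematic_sample(list(range(100)), 10))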
D. Types of data
Data can be one of two types, qualitative and quantitative.
Definition 1
Quantitative data are observations measured on a numerical scale.
In other words, quantitative data are those that represent the quantity or amount of something.
Can be shown with a distribution, or summarized with an average, etc.
Commonly used summaries:
Average value
Maximum or Minimum value
Standard deviation (a measure of spread of the data)
Example 1 Height (in centimeters), weight (in kilograms) of each student in a group are both
quantitative data.
2. Discrete
- Can take on only particular values
o number of prerequisite courses (0, 1, 2, …)
o number of students in a course
o shoe sizes (7, 7-1/2, 8, 8-1/2,…)
Definition 2
Nonnumerical data that can only be classified into one of a group of categories are said to be
qualitative data.
In other words, qualitative data are those that have no quantitative interpretation, i.e., they can only
be classified into categories.
Example 2 Education level, nationality, sex of each person in a group of people are qualitative data.
A. Introduction
The objective of data description is to summarize the characteristics of a data set. Ultimately, we want
to make the data set more comprehensible and meaningful.
Definition 3
The category frequency for a given category is the number of observations that fall in that
category.
Definition 4
The category relative frequency for a given category is the proportion of the total number of
observations that fall in that category.
Instead of the relative frequency for a category, one often uses the percentage for a category, which is
computed as follows:
Category percentage = Category relative frequency × 100%
Example 3 The classification of students of a group by the score on the subject “Statistical analysis” is
presented in Table 2.0a. The table of frequencies for the data set generated by computer using the
software SPSS is shown in Figure 2.1.
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Bad                6      13.3            13.3                 13.3
        Excellent         18      40.0            40.0                 53.3
        Good              15      33.3            33.3                 86.7
        Medium             6      13.3            13.3                100.0
        Total             45     100.0           100.0
Figure 2.1 Output from SPSS showing the frequency table for the variable CATEGORY
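As a cross-check of Figure 2.1, here is a minimal Python sketch that computes the same frequency, percent, and cumulative percent columns; the raw list of category labels is an assumption reconstructed from the counts shown in the table:

    from collections import Counter

    # Category labels assumed from the counts shown in Figure 2.1
    data = ["Bad"] * 6 + ["Excellent"] * 18 + ["Good"] * 15 + ["Medium"] * 6

    counts = Counter(data)
    total = sum(counts.values())
    cumulative = 0.0
    for category in sorted(counts):        # SPSS lists the categories alphabetically
        freq = counts[category]
        percent = 100.0 * freq / total     # relative frequency as a percentage
        cumulative += percent
        print(f"{category:10s} {freq:4d} {percent:6.1f} {cumulative:6.1f}")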
Bar graphs give the frequency (or relative frequency) of each category with the height or length of the
bar proportional to the category frequency (or relative frequency).
Example 4a (Bar Graph) The bar graph generated by computer using SPSS for the variable
CATEGORY is depicted in Figure 2.2.
Figure 2.2 Bar graph showing the number of students of each category
Pie charts divide a complete circle (a pie) into slices, each corresponding to a category, with the
central angle and hence the area of the slice proportional to the category relative frequency.
Example 4b (Pie Chart) The pie chart generated by computer using EXCEL CHARTS for the variable
CATEGORY is depicted in Figure 2.3.
Figure 2.3 Pie chart showing the number of students of each category
In order to explain what a stem is and what a leaf is, we consider the data in Table 2.0b. In this data,
for a two-digit number, for example 79, we designate the first digit (7) as its stem and call the
last digit (9) its leaf; for a three-digit number, for example 112, we designate the first two digits (11)
as its stem and again call the last digit (2) its leaf.
Depending on the data, a display can use one, two or five lines per stem. Among the different choices,
two-line stems are widely used.
Example 5 The quantity of glucose in blood of 100 persons is measured and recorded in Table 2.0b
(unit is mg%). Using SPSS, we obtain the following Stem-and-Leaf display for this data set.
GLUCOSE Stem-and-Leaf Plot
Stem width: 10
Each leaf: 1 case(s)
(stem-and-leaf rows not reproduced here)
Figure 2.4. Output from SPSS showing the Stem-and-Leaf display for the data set of glucose
The stem-and-leaf display of Figure 2.4 partitions the data set into 12 classes corresponding to 12
stems; thus, two-line stems are used here. The number of leaves in each class gives the class
frequency.
Advantages of a stem-and-leaf display over a frequency distribution (considered in the next
section):
1. The original data are preserved.
2. A stem-and-leaf display arranges the data in an orderly fashion and makes it easy to determine
certain numerical characteristics to be discussed in the following chapter.
3. The classes and the numbers falling in them are quickly determined once we have selected the
digits that we want to use for the stems and leaves.
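A minimal Python sketch of a one-line-per-stem display follows; the input values are illustrative, since Table 2.0b is not reproduced in these notes:

    from collections import defaultdict

    def stem_and_leaf(values, stem_width=10):
        # Stem = value divided by the stem width; leaf = the remaining digit
        stems = defaultdict(list)
        for v in sorted(values):
            stems[v // stem_width].append(v % stem_width)
        for stem in sorted(stems):
            leaves = "".join(str(leaf) for leaf in stems[stem])
            print(f"{stem:3d} | {leaves}")

    # Illustrative glucose-like values (mg%)
    stem_and_leaf([79, 83, 86, 88, 93, 93, 95, 97, 101, 103, 112])

A two-line-per-stem display, as in Figure 2.4, would additionally split each stem into one line for leaves 0-4 and one line for leaves 5-9.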
A frequency distribution is a table that organizes data into classes. It shows the number of
observations from the data set that fall into each of classes. It should be emphasized that we always
have in mind non-overlapping classes, i.e. classes without common items.
3. For each class, count the number of observations that fall in that class. This number is called
the class frequency.
Class relative frequency = Class frequency / Total number of observations
Besides the frequency distribution and the relative frequency distribution, one often uses the class
percentage, which is calculated by the formula:
Class percentage = Class relative frequency × 100%
Example 6 Construct a frequency table for the data set of the quantity of glucose in the blood of 100
persons recorded in Table 2.0b (unit is mg%).
Using the software STATGRAPHICS, taking Lower limit = 62, Upper limit = 150 and Total number of
classes = 22, we obtain the following table (Table 2.1).
Remarks:
1. All classes of frequency table must be mutually exclusive.
2. Classes may be open-ended when either the lower or the upper end of a quantitative
classification scheme is limitless. For example
Class: age
birth to 7
8 to 15
........
64 to 71
72 and older
3. Classification schemes can be either discrete or continuous. Discrete classes are separate
entities that do not progress from one class to the next without a break; examples are the
number of children in each family or the number of trucks owned by moving companies.
Discrete data are data that can take only a limited number of values. Continuous data do
progress from one class to the next without a break; they involve numerical measurement,
such as the weights of cans of tomatoes or the kilograms of pressure on concrete. Usually,
continuous classes are half-open intervals. For example, the classes in Table 2.1 are the half-open
intervals [62, 66), [66, 70), ...
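A minimal Python sketch of constructing such a frequency distribution follows; the glucose values are randomly generated stand-ins (Table 2.0b is not reproduced here), while the class limits and number of classes match Example 6:

    import numpy as np

    # Stand-in data; Table 2.0b itself is not reproduced in these notes
    rng = np.random.default_rng(0)
    glucose = rng.normal(100, 15, size=100).round()

    # 22 classes of width 4 between 62 and 150, as in Example 6:
    # [62, 66), [66, 70), ... (np.histogram closes only the last bin)
    edges = np.arange(62, 151, 4)
    freq, _ = np.histogram(glucose, bins=edges)

    for lo, hi, f in zip(edges[:-1], edges[1:], freq):
        rel = f / len(glucose)         # class relative frequency
        print(f"[{lo:3.0f}, {hi:3.0f})  frequency={f:3d}  percentage={100 * rel:5.1f}%")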
1. Histogram
When plotting histograms, the phenomenon of interest is plotted along the horizontal axis, while the
vertical axis represents the number, proportion or percentage of observations per class interval –
depending on whether or not the particular histogram is respectively, a frequency histogram, a
relative frequency histogram or a percentage histogram.
Histograms are essentially vertical bar charts in which the rectangular bars are constructed at the
midpoints of the classes.
Example 7 Below we present the frequency histogram for the data set of quantities of glucose, for
which the frequency table is constructed in Table 2.1.
Figure 2.5 Frequency histogram for quantities of glucose, tabulated in Table 2.1
Remark: When comparing two or more sets of data, the various histograms cannot be constructed on
the same graph because superimposing the vertical bars of one on another would cause difficulty in
interpretation. For such cases it is necessary to construct relative frequency or percentage polygons.
2. Polygons
As with histograms, when plotting polygons, the phenomenon of interest is plotted along the
horizontal axis while the vertical axis represents the number, proportion or percentage of
observations per class interval – depending on whether or not the particular polygon is respectively, a
frequency polygon, a relative frequency polygon or a percentage polygon. For example, the frequency
polygon is a line graph connecting the midpoints of each class interval in a data set, plotted at a
height corresponding to the frequency of the class.
Example 8 Figure 2.6 is a frequency polygon constructed from data in Table 2.1.
Figure 2.6 Frequency polygon for the quantities of glucose tabulated in Table 2.1
Advantages of polygons:
1. The frequency polygon is simpler than its histogram counterpart.
2. It sketches an outline of the data pattern more clearly.
3. The polygon becomes increasingly smooth and curve like as we increase the number of classes
and the number of observations.
A cumulative frequency distribution enables us to see how many observations lie above or below
certain values, rather than merely recording the number of items within intervals.
A “less-than” cumulative frequency distribution may be developed from the frequency table as follows:
Suppose a data set is divided into n classes by boundary points x1, x2, ..., xn, xn+1. Denote the classes by
C1, C2, ..., Cn. Thus, the class Ck = [xk, xk+1). See Figure 2.7.
Figure 2.7 The classes C1 = [x1, x2), C2 = [x2, x3), ..., Ck = [xk, xk+1), ..., Cn = [xn, xn+1) shown on the number line
Suppose the frequency and relative frequency of class Ck is fk and rk (k=1, 2, ..., n), respectively. Then
the cumulative frequency that observations fall into classes C1, C2, ..., Ck or lie below the value xk+1 is
the sum f1+f2+...+fk. The corresponding cumulative relative frequency is r1 +r2+...+rk.
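These running sums are easy to compute; a minimal Python sketch follows, with hypothetical class frequencies rather than those of Table 2.1:

    import numpy as np

    # Hypothetical class frequencies f1, ..., fn
    freq = np.array([2, 4, 10, 16, 20, 18, 14, 9, 5, 2])
    rel = freq / freq.sum()          # relative frequencies r1, ..., rn

    cum_freq = np.cumsum(freq)       # f1 + f2 + ... + fk for each k
    cum_rel = np.cumsum(rel)         # r1 + r2 + ... + rk for each k
    print(cum_freq)                  # number of observations lying below x_{k+1}
    print(cum_rel)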
Example 9 Table 2.1 gives frequency, relative frequency, cumulative frequency and cumulative
relative frequency distribution for quantity of glucose in blood of 100 students. According to this table
the number of students having quantity of glucose less than 90 is 16.
A graph of a cumulative frequency distribution is called a “less-than” ogive, or simply an ogive. Figure
2.8 shows the cumulative frequency distribution for the quantity of glucose in the blood of 100 students
(data from Table 2.1).
Figure 2.8 Cumulative frequency distribution for quantity of glucose (for data in Table 2.1)
Exercises 1
1) In a National Cancer Institute survey, 1,580 adult women recently responded to the question “In
your opinion, what is the most serious health problem facing women?” The responses are
summarized in the following table:
2) The administrator of a hospital has ordered a study of the amount of time a patient must wait
before being treated by emergency room personnel. The following data were collected during a
typical day:
a) Arrange the data in an array from lowest to highest. What comment can you make about
patient waiting time from your data array?
b) Construct a frequency distribution using 6 classes. What additional interpretation can you
give to the data from the frequency distribution?
c) Construct the cumulative relative frequency polygon and from this ogive state how long 75%
of the patients should expect to wait.
3) Bacteria are the most important component of microbial ecosystems in sewage treatment plants.
Water management engineers must know the percentage of active bacteria at each stage of the
sewage treatment. The accompanying data represent the percentages of respiring bacteria in 25
raw sewage samples collected from a sewage plant.
4) At a newspaper office, the time required to set the entire front page in type was recorded for 50
days. The data, to the nearest tenth of a minute, are given below.
20.8 22.8 21.9 22.0 20.7 20.9 25.0 22.2 22.8 20.1
25.3 20.7 22.5 21.2 23.8 23.3 20.9 22.9 23.5 19.5
23.7 20.3 23.6 19.0 25.1 25.0 19.5 24.1 24.2 21.8
21.3 21.5 23.1 19.9 24.2 24.1 19.8 23.9 22.8 23.9
19.7 24.2 23.8 20.7 23.8 24.3 21.1 20.9 21.6 22.7
A. Measures of Location
1. Mean
Definition 1
The arithmetic mean of a sample (or simply the sample mean) of n observations
x1, x2, …, xn, denoted by x̄, is computed as

x̄ = (x1 + x2 + … + xn)/n = (1/n) Σ(i=1 to n) xi
Definition 1a
The population mean is defined by the formula

μ = (1/N) Σ(i=1 to N) xi = (Sum of the values of all observations in the population) / (Total number of observations in the population)
Note that the definitions of the population mean and the sample mean are the same. The same is true
for the definitions of the other measures of central tendency. In the next section, however, we will give
different formulas for the variances of a population and of a sample.
Example 1 Consider 7 observations: 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0.
By definition
x̄ = (4.2 + 4.3 + 4.7 + 4.8 + 5.0+ 5.1 + 9.0)/7 = 5.3
A disadvantage of the mean is that its value is strongly affected by extreme observations. Indeed, if in
the above example we compute the mean of the first 6 numbers and exclude the 9.0 value, then the
mean is 4.7. The one extreme value 9.0 distorts the value we get for the mean. It would be more
representative to calculate the mean without including such an extreme value.
2. Median
Definition 2
The median m of a sample of n observations x1, x2, …, xn arranged in ascending or
descending order is the middle number that divides the data set into two equal halves: one
half of the items lie above this point, and the other half lie below it.
Example 2 Find the median of the data set consisting of the observations 7, 4, 3, 5, 6, 8, 10.
Arranged in ascending order, the data are 3, 4, 5, 6, 7, 8, 10. Since the number of observations is odd
(n = 7 = 2 × 4 − 1), the median is m = x(4) = 6. We see that one half of the observations, namely 3, 4, 5,
lie below the value 6 and the other half, namely 7, 8 and 10, lie above the value 6.
Example 3 Suppose we have an even number of observations: 7, 4, 3, 5, 6, 8, 10, 1. Find the
median of this data set.
Arranged in ascending order, the data are 1, 3, 4, 5, 6, 7, 8, 10. Since the number of observations is
even, the median is the average of the two middle values: m = (5 + 6)/2 = 5.5.
Advantage of the median over the mean: Extreme values in a data set do not affect the median as
strongly as they do the mean.
Indeed, if in Example 1 we drop the extreme value 9.0, the median moves only from 4.8 to
(4.7 + 4.8)/2 = 4.75, while the mean drops from 5.3 to 4.7.
3. Mode
Definition 3
The mode of a data set x1, x2, …, xn is the value of x that occurs with the greatest
frequency, i.e., is repeated most often in the data set.
For example, consider the following data set:
70 88 95 101 106
79 93 96 101 107
83 93 97 103 108
86 93 97 103 112
87 95 98 106 115
This data set contains 25 numbers. We see that the value 93 is repeated most often. Therefore, the
mode of the data set is 93.
Multimodal distribution: A data set may have several modes, in which case it is called a multimodal
distribution.
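A minimal Python sketch computing these three measures with the standard library follows, using the data of Example 1 and the 25-number data set above:

    from statistics import mean, median, mode, multimode

    data = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]   # the data of Example 1
    print(mean(data))      # about 5.3 -- pulled upward by the extreme value 9.0
    print(median(data))    # 4.8 -- barely affected by the extreme value

    table = [70, 79, 83, 86, 87, 88, 93, 93, 93, 95, 95, 96, 97, 97, 98,
             101, 101, 103, 103, 106, 106, 107, 108, 112, 115]
    print(mode(table))     # 93, the most frequently occurring value
    print(multimode([1, 1, 2, 2, 3]))   # [1, 2]: a multimodal data set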
4. Geometric mean
Definition 4
Suppose all the n observations in a data set x1, x2, …, xn are positive. Then the geometric mean of
the data set is defined by the formula

xG = G.M. = (x1 · x2 · … · xn)^(1/n)
The geometric mean is appropriate to use whenever we need to measure the average rate of change
(the growth rate) over a period of time.
From the above formula it follows that

log xG = (1/n) Σ(i=1 to n) log xi

Thus, the logarithm of the geometric mean of the values of a data set is equal to the arithmetic mean
of the logarithms of the values of the data set.
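A minimal Python sketch follows; the three annual growth factors are an illustrative assumption:

    import math

    def geometric_mean(xs):
        # Computed via logarithms: log G = (1/n) * sum of log(x_i)
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    # Illustrative average growth factor over three years with
    # growth rates of 5%, 10% and 20%:
    print(geometric_mean([1.05, 1.10, 1.20]))   # about 1.115 per year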
B. Measures of Variability
Just as measures of central tendency locate the “center” of a relative frequency distribution, measures
of variation measure its “spread”.
The most commonly used measures of data variation are the range, the variance and the standard
deviation.
1. Range
Definition 5
The range of a quantitative data set is the difference between the largest and smallest values in
the set.
Range = Maximum - Minimum,
where Maximum = Largest value, Minimum = Smallest value.
Definition 6
The population variance of a population of observations x1, x2, …, xN is defined by the formula

σ² = Σ(i=1 to N) (xi − μ)² / N

where:
σ² = population variance
xi = the i-th item or observation
μ = population mean
N = total number of observations in the population
From Definition 6 we see that the population variance is the average of the squared distances of the
observations from the mean.
Definition 7
The standard deviation of a population is equal to the square root of the variance:

σ = √( Σ(i=1 to N) (xi − μ)² / N )
Note that for the variance, the units are the squares of the units of the data. And for the standard
deviation, the units are the same as those used in the data.
Definition 6a
The sample variance of a sample of observations x1, x2, …, xn is defined by the formula

s² = Σ(i=1 to n) (xi − x̄)² / (n − 1)

where:
s² = sample variance
x̄ = sample mean
n = total number of observations in the sample
Remark: In the denominator of the formula for s² we use n − 1 instead of n because statisticians have
proved that with this definition s² is an unbiased estimate of the variance of the population from
which the sample was selected (i.e., the expected value of s² is equal to the population variance).
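A minimal Python sketch of the sample variance and standard deviation, applied to the data of Example 1, follows:

    def sample_variance(xs):
        # Unbiased sample variance: divide by n - 1, not n
        n = len(xs)
        xbar = sum(xs) / n
        return sum((x - xbar) ** 2 for x in xs) / (n - 1)

    data = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]
    s2 = sample_variance(data)
    print(s2, s2 ** 0.5)   # about 2.773 and 1.665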
Chebyshev’s Theorem
For any data set with mean x̄ and standard deviation s, at least 75% of the values will fall
within the interval x̄ ± 2s, and at least 89% of the values will fall within the interval x̄ ± 3s.
We can measure with even more precision the percentage of items that fall within specific ranges
under a symmetrical, bell-shaped curve. In these cases, approximately 68% of the values fall within
x̄ ± s, approximately 95% fall within x̄ ± 2s, and approximately 99.7% fall within x̄ ± 3s.
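As a quick empirical check of Chebyshev's bounds, here is a minimal Python sketch applied to the data of Example 1:

    def fraction_within_k_sd(xs, k):
        # Fraction of values within k sample standard deviations of the mean
        n = len(xs)
        xbar = sum(xs) / n
        s = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5
        return sum(abs(x - xbar) <= k * s for x in xs) / n

    data = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]
    print(fraction_within_k_sd(data, 2))   # 6/7 = 0.857..., at least the guaranteed 0.75
    print(fraction_within_k_sd(data, 3))   # 1.0, at least the guaranteed 0.89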
We need a relative measure that will give us a feel for the magnitude of the deviation relative to the
magnitude of the mean. The coefficient of variation is one such relative measure of dispersion.
Definition 8
The coefficient of variation of a data set is the ratio of its standard deviation to its mean,
expressed as a percentage:

cv = Coefficient of variation = (Standard deviation / Mean) × 100%
Example 6 Suppose that each day laboratory technician A completes 40 analyses with a standard
deviation of 5. Technician B completes 160 analyses per day with a standard deviation of 15. Which
employee shows less variability?
At first glance, it appears that technician B has three times more variation in the output rate than
technician A. But B completes analyses at a rate 4 times faster than A. Taking all this information into
account, we compute the coefficient of variation for both technicians:
For technician A: cv = 5/40 × 100% = 12.5%
For technician B: cv = 15/160 × 100% = 9.4%
So, we find that, technician B who has more absolute variation in output than technician A, has less
relative variation.
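The same computation as a minimal Python sketch:

    def coefficient_of_variation(std_dev, mean):
        # cv = (standard deviation / mean) * 100%
        return std_dev / mean * 100

    print(coefficient_of_variation(5, 40))     # technician A: 12.5 (%)
    print(coefficient_of_variation(15, 160))   # technician B: 9.375, about 9.4 (%)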
An engineer is someone who solves problems of interest to society by the efficient application of
scientific principles by
1. refining an existing product or process; or
2. designing a new product or process that meets customers’ needs
The engineering, or scientific, method is the approach to formulating and solving these problems
following these steps:
1. Develop a clear and concise description of the problem.
2. Identify, at least tentatively, the important factors that affect this problem or that may play a
role in its solution.
3. Propose a model for the problem, using scientific or engineering knowledge of the
phenomenon being studied. State any limitations or assumptions of the model.
4. Conduct appropriate experiments and collect data to test or validate the tentative model or
conclusions made in steps 2 and 3.
5. Refine the model on the basis of the observed data.
Statistical techniques are a powerful aid in designing new products and systems, improving existing
designs, and designing, developing, and improving production processes.
A. Terminology
Response variable: The outcome of an experiment
Factor: Each variable that affects the response variable and has several
alternatives
Level: The values that a factor can assume
Primary Factor: The factors whose effects need to be quantified
Secondary Factor: Factors that impact the performance but whose impact we are not
interested in quantifying
Replication: Repetition of all or some experiments
Experimental Unit: Any entity that is used for the experiment
Interaction: Two factors A and B interact if the effect of one depends upon the level
of the other
Experiment: A controlled study conducted to determine the effect that varying one or more
explanatory variables (factors) has on a response variable
B. Fundamental Principles
The fundamental principles in design of experiments are solutions to the problems in
experimentation posed by the two types of nuisance factors and serve to improve the efficiency of
experiments. Those fundamental principles are
Randomization
Replication
Blocking
Orthogonality
Factorial experimentation
Randomization is a method that protects against an unknown bias distorting the results of the
experiment.
Replication increases the sample size and is a method for increasing the precision of the experiment.
Replication increases the signal-to-noise ratio when the noise originates from uncontrollable nuisance
variables. A replicate is a complete repetition of the same experimental conditions, beginning with the
initial setup. A special design called a Split Plot can be used if some of the factors are hard to vary.
Blocking is a method for increasing precision by removing the effect of known nuisance factors. An
example of a known nuisance factor is batch-to-batch variability. In a blocked design, both the
baseline and new procedures are applied to samples of material from one batch, then to samples from
another batch, and so on. The difference between the new and baseline procedures is not influenced
by the batch-to-batch differences. Blocking is a restriction of complete randomization, since both
procedures are always applied to each batch. Blocking increases precision since the batch-to-batch
variability is removed from the “experimental error.”
Orthogonality in an experiment results in the factor effects being uncorrelated and therefore more
easily interpreted. The factors in an orthogonal experiment design are varied independently of each
other. The main results of data collected using this design can often be summarized by taking
differences of averages and can be shown graphically by using simple plots of suitably chosen sets of
averages. In these days of powerful computers and software, orthogonality is no longer a necessity,
but it is still a desirable property because of the ease of explaining results.
Factorial experimentation is a method in which the effects due to each factor and to combinations
of factors are estimated. Factorial designs are geometrically constructed and vary all the factors
simultaneously and orthogonally. Factorial designs collect data at the vertices of a cube in p-
dimensions (p is the number of factors being studied). If data are collected from all of the vertices, the
design is a full factorial, requiring 2^p runs. Since the total number of combinations increases
exponentially with the number of factors studied, fractions of the full factorial design can be
constructed. As the number of factors increases, the fractions become smaller and smaller (1/2, 1/4,
1/8, 1/16, …). Fractional factorial designs collect data from a specific subset of all possible vertices and
require 2^(p−q) runs, with 2^(−q) being the fractional size of the design. If there are only three factors in the
experiment, the geometry of the experimental design for a full factorial experiment requires eight
runs, and a one-half fractional factorial experiment (an inscribed tetrahedron) requires four runs.
Factorial designs, including fractional factorials, have increased precision over other types of designs
because they have built-in internal replication. Factor effects are essentially the difference between
the average of all runs at the two levels for a factor, such as “high” and “low.” Replicates of the same
points are not needed in a factorial design, which seems like a violation of the replication principle in
design of experiments. However, half of all the data points are taken at the high level and the other
half are taken at the low level of each factor, resulting in a very large number of replicates. Replication
is also provided by the factors included in the design that turn out to have nonsignificant effects.
Because each factor is varied with respect to all of the factors, information on all factors is collected by
each run. In fact, every data point is used in the analysis many times as well as in the estimation of
every effect and interaction. Additional efficiency of the two-level factorial design comes from the fact
that it spans the factor space, that is, puts half of the design points at each end of the range, which is
the most powerful way of determining whether a factor has a significant effect.
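A minimal Python sketch enumerating these design points follows; the choice of I = ABC as the defining relation for the one-half fraction is an illustrative assumption:

    from itertools import product

    def full_factorial(p):
        # All 2**p runs: every factor set to -1 (low) or +1 (high)
        return list(product([-1, 1], repeat=p))

    runs = full_factorial(3)          # 2**3 = 8 runs, the vertices of a cube
    print(len(runs), runs)

    # One-half fraction, 2**(3-1) = 4 runs: keep the runs in which the product
    # of the three levels is +1 (defining relation I = ABC); geometrically these
    # four vertices form the inscribed tetrahedron mentioned above.
    half = [run for run in runs if run[0] * run[1] * run[2] == 1]
    print(len(half), half)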
Uses
The main uses of design of experiments are
Discovering interactions among factors
Screening many factors
Design
An experimental design consists of specifying the number of experiments, the factor level
combinations for each experiment, and the number of replications.