
Lecture Notes 1 – Introduction to Statistics and Data Analysis

Engr. Caesar Pobre Llapitan

Topics:
I. Introduction
II. Sources of Data
III. Data Presentation
IV. Descriptive Statistics
V. Introduction to Design of Experiments

I. INTRODUCTION

A. What is Statistics?
The word statistics in our everyday life means different things to different people. To a football fan,
statistics are the information about rushing yardage, passing yardage, and first downs given at
halftime. To a manager of a power generating station, statistics may be information about the
quantity of pollutants being released into the atmosphere. To a school principal, statistics are
information on absenteeism, test scores, and teacher salaries. To a medical researcher
investigating the effects of a new drug, statistics are evidence of the success of research efforts. And to
a college student, statistics are the grades made on all the quizzes in a course this semester.

Each of these people is using the word statistics correctly, yet each uses it in a slightly different way
and for a somewhat different purpose. Statistics is a word that can refer to quantitative data or to a
field of study.

As a field of study, statistics is the science of collecting, organizing and interpreting numerical facts,
which we call data. We are bombarded by data in our everyday life. The collection and study of data
are important in the work of many professions, so that training in the science of statistics is valuable
preparation for a variety of careers. Each month, for example, government statistical offices release the
latest numerical information on unemployment and inflation. Economists and financial advisors as
well as policy makers in government and business study these data in order to make informed
decisions. Farmers study data from field trials of new crop varieties. Engineers gather data on the
quality and reliability of manufactured products. Most areas of academic study make use of
numbers, and therefore also make use of methods of statistics.

Whatever else it may be, statistics is, first and foremost, a collection of tools used for converting raw
data into information to help decision makers in their work.

B. Populations and samples


In statistics, the data set that is the target of your interest is called a population. Notice that a
statistical population does not refer to people, as in our everyday usage of the term; it refers to a
collection of data.

Definition 1
A population is a collection (or set) of data that describes some phenomenon of interest to
you.

Definition 2
A sample is a subset of data selected from a population.

Example 1 The population may be all women in a country, for example, in Vietnam. If from each city
or province we select 50 women, then the set of selected women is a sample.

Example 2 The set of all whisky bottles produced by a company is a population. For the quality
control 150 whisky bottles are selected at random. This portion is a sample.

Definition 3
A statistic is a numerical summary of a sample. By contrast, a numerical summary of a
population is called a parameter.

For example, if we know from ECC data that the average age of all ECC students is 29, that value is a
parameter. On the other hand, if we take a sample of 100 students and find that 63% support a new
initiative at the college, that is a statistic, since it is only a measure of the sample of 100 students, not
the entire student population.

C. Descriptive and inferential statistics


If you have every measurement (or observation) of the population in hand, then statistical
methodology can help you to describe this typically large set of data. We will find graphical and
numerical ways to make sense out of a large mass of data. The branch of statistics devoted to this
application is called descriptive statistics.

Definition 4
The branch of statistics devoted to the summarization and description of data (population or
sample) is called descriptive statistics.

If it is too expensive or impossible to acquire every measurement in the
population, then we will want to select a sample of data from the population and use the sample to
infer the nature of the population.

Definition 5
The branch of statistics concerned with using sample data to make an inference about a
population of data is called inferential statistics.

D. What is Measurement?
In statistics, the term measurement is used more broadly and is more appropriately termed scales of
measurement. Scales of measurement refer to ways in which variables/numbers are defined and
categorized. Each scale of measurement has certain properties which in turn determines the
appropriateness for use of certain statistical analyses. The four scales of measurement are nominal,
ordinal, interval, and ratio.

1. Nominal - Categorical data and numbers that are simply used as identifiers or names.

Examples: Numbers on the back of a baseball jersey (St. Louis Cardinals 1 = Ozzie Smith)
Social security number
Male = 1 Female = 2
Political affiliation; Eye color

2. Ordinal - An ordinal scale of measurement represents an ordered series of relationships or rank


order.

Examples: first, second, third places in a competition; Likert-type scales

3. Interval - A scale which represents quantity and has equal units but for which zero represents
simply an additional point of measurement is an interval scale.

Examples: 60 degrees Fahrenheit or -10 degrees Fahrenheit; sea level measurements

With each of these scales there is a direct, measurable quantity with equality of units. In addition,
zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers
both above and below it (for example, -10 degrees Fahrenheit).

4. Ratio - The ratio scale of measurement is similar to the interval scale in that it also represents
quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist
below the zero).

Examples: height and weight; measuring the length of a piece of wood in centimeters

A negative value is not possible.

The table below will help clarify the fundamental differences between the four scales of measurement.

            Indicates    Indicates direction   Indicates amount   Absolute
            difference   of difference         of difference      zero
Nominal         X
Ordinal         X                X
Interval        X                X                    X
Ratio           X                X                    X                X

Interval and Ratio data are sometimes referred to as parametric and Nominal and Ordinal data are
referred to as nonparametric.

Parametric means that it meets certain requirements with respect to parameters of the population
(for example, the data will be normal - the distribution parallels the normal or bell curve).
- numbers can be added, subtracted, multiplied, and divided
- data are analyzed using statistical techniques identified as Parametric Statistics

As a rule, there are more statistical technique options for the analysis of parametric data and
parametric statistics are considered more powerful than nonparametric statistics.

Nonparametric data are lacking those same parameters and cannot be added, subtracted, multiplied,
and divided. For example, it does not make sense to add Social Security numbers to get a third person.
Nonparametric data are analyzed by using Nonparametric Statistics.

II. SOURCES OF DATA

A. Statistical Study Types


1. Observational Study
- Characteristics or individuals studied but data not manipulated or influenced
- Ex post facto (after the fact) because data had already been gathered
- Does not allow a researcher to claim causation, only association

Reason for Observational Studies


 Don’t collect data that has already been collected!
o Reason 1: To learn characteristics of a population
o Reason 2: To determine whether there is an association between two or more variables where
the values of the variables have already been determined.

Types of Observational Studies


 Cross-sectional studies: collect information about individuals at a specific point in time or
over a very short period of time
 Case-control studies: retrospective, requires individuals to look back in time or researchers
look at existing records

 Cohort studies: a group of individuals, cohort, observed over a period or time (can be a long
time) where characteristics about individuals recorded and some individuals studied further
2. Designed Experiment
- Individuals in study assigned to certain group
- Groups are given varying degrees of explanatory variable
- Values of the response variable are recorded for each group

Reasons for Designed Experiments


o Use when control of certain variables is desired
o If cause and effect relationships among variables desired

B. Simple Random Sampling


A sample of size n from a population of size N is obtained through simple random sampling if every
possible sample of size n has an equally likely chance of occurring.
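As an illustration, Python's standard library can draw such a sample directly (the population here is hypothetical):

```python
import random

# Hypothetical population of size N = 1000, individuals labeled 1..N
population = list(range(1, 1001))

# Draw a simple random sample of size n = 50.
# random.sample chooses without replacement, and every possible
# subset of size n is equally likely to occur.
sample = random.sample(population, 50)

print(len(sample))       # 50
print(len(set(sample)))  # 50 (no individual is chosen twice)
```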

C. Other Types of Sampling

1. Stratified Sampling
 Separate population in non-overlapping groups called strata
 Obtain simple random sample from each stratum
 Stratum should be homogeneous (or similar) in some way

Advantages of stratified sampling


o Allows fewer individuals to be surveyed while obtaining the same or more information
o Allows analysis to determine significance differences between the strata or groups

2. Systematic Sampling
 Obtained by selecting every kth individual from the population. The first individual
selected is a random number between 1 and k
 No frame (list of population) is needed
 k is determined, when the population size N is known, by dividing N by the sample
size and rounding down
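The rule above (k equal to N divided by the sample size, rounded down) can be sketched as follows; the function name and population are illustrative:

```python
import random

def systematic_sample(population, n):
    """Select every kth individual, starting at a random point in 1..k."""
    N = len(population)
    k = N // n                    # divide N by the sample size, round down
    start = random.randint(1, k)  # first individual: random number between 1 and k
    return population[start - 1::k]

# Hypothetical frame of N = 100 individuals; n = 10 gives k = 10
population = list(range(1, 101))
sample = systematic_sample(population, 10)
# 10 individuals, each exactly k = 10 positions apart
```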

Advantages of systematic sampling


o Population size does not have to be known
o Provides more information for a given cost than other sampling types
o Easier to do, less likely for interviewer error in getting sample

3. Cluster Sampling
 Obtained by selecting all individuals within a randomly selected collection or group of
individuals

Questions in cluster sampling


o How do I cluster the population?
o How many clusters?
o How many individuals should be in each cluster?

Clusters homogeneous – more clusters with fewer individuals per cluster


Clusters heterogeneous – fewer clusters with more individuals per cluster

4. Convenience Sampling
 Sample in which the individuals are easily obtained
 Self-selected most popular (voluntarily decide to be in sample)

 Examples: Magazine or Internet surveys


 Not good for making inferences about population

5. Multistage Sampling
 Combination of sampling techniques
Examples: Nielsen ratings

D. Types of data
Data can be one of two types, qualitative and quantitative.

Definition 1
Quantitative data are observations measured on a numerical scale.

In other words, quantitative data are those that represent the quantity or amount of something.
 Can be shown with a distribution, or summarized with an average, etc.
 Commonly used summaries:
Average value
Maximum or Minimum value
Standard deviation (a measure of spread of the data)
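These common summaries can be computed with Python's standard library; the heights below are made-up illustrative values:

```python
import statistics

heights = [158, 162, 165, 170, 171, 174, 180]  # hypothetical heights in cm

print(sum(heights) / len(heights))   # average value
print(max(heights), min(heights))    # maximum and minimum values
print(statistics.stdev(heights))     # standard deviation (measure of spread)
```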

Example 1 Height (in centimeters), weight (in kilograms) of each student in a group are both
quantitative data.

Two kinds of Quantitative Data


1. Continuous
- Can take on any value in an interval
- Could have any number of decimals
o weight, home value, height
o 2.45, 7.63454, 4.0, etc.

2. Discrete
- Can take on only particular values
o number of prerequisite courses (0, 1, 2, …)
o number of students in a course
o shoe sizes (7, 7-1/2, 8, 8-1/2,…)

Levels of Measurement for Quantitative Data


1. Interval level (a.k.a differences or subtraction level)
- Intervals of equal length signify equal differences in the characteristic.
o The difference in 90° and 100° Fahrenheit is the same as the difference between
80° and 90° Fahrenheit.
- Differences make sense, but ratios do not.
o 100° Fahrenheit is not twice as hot as 50° Fahrenheit.
- Occurs when a numerical scale does not have a ‘true zero’ start point (i.e. it has an
arbitrary zero).
o Zero does not signify an absence of the characteristic.
Does 0° Fahrenheit represent an absence of heat?
- Designates an equal-interval ordering.
o 1 to 2 has the same meaning as 3 to 4.
- May initially look like a qualitative ordinal variable (e.g. low, med, high), but levels are
quantitative in nature and the differences in levels have consistent meaning.
- IQ tests (interval scale).

o We don’t have meaning for a 0 IQ.


o A 120 IQ is not twice as intelligent as a 60 IQ.
- Calendar years (interval scale).
o An interval of one calendar year (2005 to 2006, 2014 to 2015) always has the
same meaning.
o But ratios of calendar years do not make sense because the choice of the year 0
is arbitrary and does not mean “the beginning of time.”
o Calendar years are therefore at the interval level of measurement.

2. Ratio level (even more meaning than interval level)


- At this level, both differences and ratios are meaningful.
o Two 2 oz glasses of water IS equal to one 4 oz glass of water
o 4 oz of water is twice as much as 2 oz of water.
- Occurs when scale does have a ‘true zero’ start point.
o 0 oz of water is a ‘true zero’ as it is empty, absence of water.
- Ratios involve division (or multiplication) rather than addition or subtraction.

Definition 2
Nonnumerical data that can only be classified into one of a group of categories are said to be
qualitative data.

In other words, qualitative data are those that have no quantitative interpretation, i.e., they can only
classify into categories.

Example 2 Education level, nationality, sex of each person in a group of people are qualitative data.

Levels of Measurement for Qualitative Data


1. Nominal level (by name)
- No natural ranking or ordering of the data exists.
o political affiliation (dem, rep, ind)

2. Ordinal level (by order)


- Provides an order, but can’t get a precise mathematical difference between levels.
o heat (low, medium, high)
o movie ratings (1-star, 2-star, etc.)
Watching two 2-star movies isn't the same as watching one 4-star movie
- Could be coded numerically

III. DATA PRESENTATION

A. Introduction
The objective of data description is to summarize the characteristics of a data set. Ultimately, we want
to make the data set more comprehensible and meaningful.

B. Qualitative data presentation


When describing qualitative observations, we define the categories in such a way that each
observation can fall in one and only one category. The data set is then described by giving the number
of observations, or the proportion of the total number of observations that fall in each of the
categories.

Definition 3
The category frequency for a given category is the number of observations that fall in that
category.

Definition 4
The category relative frequency for a given category is the proportion of the total number of
observations that fall in that category.

Relative frequency for a category = (Number of observations falling in that category) / (Total number of observations)

Instead of the relative frequency for a category, one often uses the percentage for a category, which is
computed as follows:

Percentage for a category = Relative frequency for the category x 100%

Example 3 The classification of students of a group by their scores on the subject “Statistical analysis” is
presented in Table 2.0a. The table of frequencies for the data set, generated by the
software SPSS, is shown in Figure 2.1.

Table 2.0a The classification of students


No of  CATEGORY    No of  CATEGORY    No of  CATEGORY    No of  CATEGORY
stud.              stud.              stud.              stud.
 1     Bad         13     Good        24     Good        35     Good
 2     Medium      14     Excellent   25     Medium      36     Medium
 3     Medium      15     Excellent   26     Bad         37     Good
 4     Medium      16     Excellent   27     Good        38     Excellent
 5     Good        17     Excellent   28     Bad         39     Good
 6     Good        18     Good        29     Bad         40     Good
 7     Excellent   19     Excellent   30     Good        41     Medium
 8     Excellent   20     Excellent   31     Excellent   42     Bad
 9     Excellent   21     Good        32     Excellent   43     Excellent
10     Excellent   22     Excellent   33     Excellent   44     Excellent
11     Bad         23     Excellent   34     Good        45     Good
12     Good
CATEGORY
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Bad             6        13.3        13.3              13.3
       Excelent       18        40.0        40.0              53.3
       Good           15        33.3        33.3              86.7
       Medium          6        13.3        13.3             100.0
       Total          45       100.0       100.0
Figure 2.1 Output from SPSS showing the frequency table for the variable CATEGORY
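The same frequencies and percentages can be computed by hand with Python's standard library; the counts below were tallied from Table 2.0a:

```python
from collections import Counter

# Category counts tallied from Table 2.0a (45 students in total)
counts = Counter({"Bad": 6, "Excellent": 18, "Good": 15, "Medium": 6})
total = sum(counts.values())

for category, freq in counts.items():
    rel_freq = freq / total        # relative frequency for the category
    percent = rel_freq * 100       # percentage for the category
    print(f"{category:9s} {freq:3d} {percent:5.1f}%")
# Bad 13.3%, Excellent 40.0%, Good 33.3%, Medium 13.3%, matching Figure 2.1
```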

C. Graphical description of qualitative data


Bar graphs and pie charts are two of the most widely used graphical methods for describing
qualitative data sets.

Bar graphs give the frequency (or relative frequency) of each category with the height or length of the
bar proportional to the category frequency (or relative frequency).

Example 4a (Bar Graph) The bar graph generated by computer using SPSS for the variable
CATEGORY is depicted in Figure 2.2.

[Bar graph: horizontal bars for Bad, Excelent, Good, Medium; frequency axis 0 to 20]

Figure 2.2 Bar graph showing the number of students of each category

Pie charts divide a complete circle (a pie) into slices, each corresponding to a category, with the
central angle and hence the area of the slice proportional to the category relative frequency.

Example 4b (Pie Chart) The pie chart generated by computer using EXCEL CHARTS for the variable
CATEGORY is depicted in Figure 2.3.

[Pie chart: slices for Bad, Excelent, Good, Medium]

Figure 2.3 Pie chart showing the number of students of each category

D. Graphical description of quantitative data: Stem and Leaf displays


One of the graphical methods for describing quantitative data is the stem and leaf display, which is widely
used in exploratory data analysis when the data set is small.

To explain what a stem and a leaf are, consider the data from Table 2.0b. In this data set,
for a two-digit number, for example 79, we designate the first digit (7) as its stem and call the
last digit (9) its leaf; for a three-digit number, for example 112, we designate the first two digits (11)
as its stem and again call the last digit (2) its leaf.

Steps to follow in constructing a Stem and Leaf Display


1. Divide each observation in the data set into two parts, the Stem and the Leaf.
2. List the stems in order in a column, starting with the smallest stem and ending with the
largest.
3. Proceed through the data set, placing the leaf for each observation in the appropriate stem
row.

Depending on the data, a display can use one, two or five lines per stem; two-line stems are the most
widely used.
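A minimal sketch of these three steps in Python (one line per stem; the data values are illustrative):

```python
from collections import defaultdict

def stem_and_leaf(data):
    """One-line-per-stem display: stem = all digits but the last, leaf = last digit."""
    stems = defaultdict(list)
    for x in sorted(data):             # step 3: place each leaf in its stem row
        stems[x // 10].append(x % 10)  # step 1: split into stem and leaf
    for stem in sorted(stems):         # step 2: list stems smallest to largest
        leaves = "".join(str(leaf) for leaf in stems[stem])
        print(f"{stem:3d} | {leaves}")

stem_and_leaf([70, 79, 80, 83, 85, 85, 112, 101, 96])
#   7 | 09
#   8 | 0355
#   9 | 6
#  10 | 1
#  11 | 2
```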

Example 5 The quantity of glucose in blood of 100 persons is measured and recorded in Table 2.0b
(unit is mg%). Using SPSS, we obtain the following Stem-and-Leaf display for this data set.

Table 2.0b Quantity of glucose in blood of 100 students (unit: mg%)


70 79 80 83 85 85 85 85 86 86
86 87 87 88 89 90 91 91 92 92
93 93 93 93 94 94 94 94 94 94
95 95 96 96 96 96 96 97 97 97
97 97 98 98 98 98 98 98 100 100
101 101 101 101 101 101 102 102 102 103
103 103 103 104 104 104 105 106 106 106
106 106 106 106 106 106 106 107 107 107
107 108 110 111 111 111 111 111 112 112
112 115 116 116 116 116 119 121 121 126

GLUCOSE Stem-and-Leaf Plot

Frequency Stem & Leaf

1.00 Extremes (=<70)


1.00 7 . 9
2.00 8 . 03
11.00 8 . 55556667789
15.00 9 . 011223333444444
18.00 9 . 556666677777888888
18.00 10 . 001111112223333444
16.00 10 . 5666666666677778
9.00 11 . 011111222
6.00 11 . 566669
2.00 12 . 11
1.00 Extremes (>=126)

Stem width: 10
Each leaf: 1 case(s)

Figure 2.4. Output from SPSS showing the Stem-and-Leaf display for the data set of glucose

The stem and leaf display of Figure 2.4 partitions the data set into 12 classes corresponding to 12
stems; thus, two-line stems are used here. The number of leaves in each class gives the class
frequency.

Advantages of a stem and leaf display over a frequency distribution (considered in the next
section):
1. the original data are preserved.
2. a stem and leaf display arranges the data in an orderly fashion and makes it easy to determine
certain numerical characteristics to be discussed in the following chapter.
3. the classes and numbers falling in them are quickly determined once we have selected the
digits that we want to use for the stems and leaves.

Disadvantage of a stem and leaf display:


Sometimes there is not much flexibility in choosing the stems.

E. Tabulating quantitative data: Relative frequency distributions


Frequency distributions or relative frequency distributions are most often used in scientific publications
to describe quantitative data sets. They are better suited to the description of large data sets, and they
permit greater flexibility in the choice of class widths.

A frequency distribution is a table that organizes data into classes. It shows the number of
observations from the data set that fall into each of the classes. It should be emphasized that we always
have in mind non-overlapping classes, i.e. classes without common items.

Steps for constructing a frequency distribution and relative frequency distribution:


1. Decide the type and number of classes for dividing the data set, lower limit and upper limit of
the classes:
Lower limit < Minimum of values
Upper limit > Maximum of values

2. Determine the width of class intervals:

   Width of class intervals = (Upper limit - Lower limit) / (Total number of classes)

3. For each class, count the number of observations that fall in that class. This number is called
the class frequency.

4. Calculate each class relative frequency:

   Class relative frequency = (Class frequency) / (Total number of observations)

Besides the frequency distribution and relative frequency distribution, one often also uses the relative class
percentage, which is calculated by the formula:

Relative class percentage = Class relative frequency × 100%
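The four steps can be sketched in Python; the data, limits, and class count below are illustrative, and observations are assumed to fall in [Lower limit, Upper limit):

```python
def frequency_distribution(data, lower, upper, n_classes):
    """Steps 1-4: equal-width, non-overlapping classes [a, b)."""
    width = (upper - lower) / n_classes           # step 2: class width
    freqs = [0] * n_classes
    for x in data:                                # step 3: class frequencies
        k = int((x - lower) // width)
        freqs[k] += 1
    rel = [f / len(data) for f in freqs]          # step 4: relative frequencies
    pct = [r * 100 for r in rel]                  # relative class percentages
    return width, freqs, rel, pct

data = [63, 65, 70, 71, 74, 78, 79, 82]          # hypothetical observations
width, freqs, rel, pct = frequency_distribution(data, lower=62, upper=86, n_classes=6)
# width = 4.0; freqs = [2, 0, 2, 1, 2, 1]; relative frequencies sum to 1
```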

Example 6 Construct frequency table for the data set of quantity of glucose in blood of 100 persons
recorded in Table 2.0b (unit is mg%).

Using the software STATGRAPHICS, taking Lower limit = 62, Upper limit = 150 and Total number of
classes = 22 we obtained the following table.

Table 2.1 Frequency distribution for glucose in blood of 100 persons

Class  Lower   Upper   Midpoint  Frequency  Relative    Cumulative  Cum. Rel.
       Limit   Limit                        Frequency   Frequency   Frequency
  0     62      66       64          0        0             0         0
  1     66      70       68          1        0.01          1         0.01
  2     70      74       72          0        0             1         0.01
  3     74      78       76          0        0             1         0.01
  4     78      82       80          2        0.02          3         0.03
  5     82      86       84          8        0.08         11         0.11
  6     86      90       88          5        0.05         16         0.16
  7     90      94       92         14        0.14         30         0.30
  8     94      98       96         18        0.18         48         0.48
  9     98     102      100         11        0.11         59         0.59
 10    102     106      104         18        0.18         77         0.77
 11    106     110      108          6        0.06         83         0.83
 12    110     114      112          8        0.08         91         0.91
 13    114     118      116          5        0.05         96         0.96
 14    118     122      120          3        0.03         99         0.99
 15    122     126      124          1        0.01        100         1.00
 16    126     130      128          0        0           100         1.00
 17    130     134      132          0        0           100         1.00
 18    134     138      136          0        0           100         1.00
 19    138     142      140          0        0           100         1.00
 20    142     146      144          0        0           100         1.00
 21    146     150      148          0        0           100         1.00

Remarks:
1. All classes of frequency table must be mutually exclusive.
2. Classes may be open-ended when either the lower or the upper end of a quantitative
classification scheme is limitless. For example

Class: age
birth to 7
8 to 15
........
64 to 71
72 and older

3. Classification schemes can be either discrete or continuous. Discrete classes are separate
entities that do not progress from one class to the next without a break, such as the
number of children in each family or the number of trucks owned by moving companies.
Discrete data are data that can take only a limited number of values. Continuous data do
progress from one class to the next without a break; they involve numerical measurements
such as the weights of cans of tomatoes or the kilograms of pressure on concrete. Usually,
continuous classes are half-open intervals. For example, the classes in Table 2.1 are half-open
intervals [62, 66), [66, 70), ...

F. Graphical description of quantitative data: histogram and polygon


There is an old saying that “one picture is worth a thousand words”. Indeed, statisticians have
employed graphical techniques to describe sets of data more vividly. Bar charts and pie charts were
presented in Figure 2.2 and Figure 2.3 to describe qualitative data. With quantitative data summarized
into frequency or relative frequency tables, however, histograms and polygons are used to describe the
data.

1. Histogram
When plotting histograms, the phenomenon of interest is plotted along the horizontal axis, while the
vertical axis represents the number, proportion or percentage of observations per class interval –
depending on whether or not the particular histogram is respectively, a frequency histogram, a
relative frequency histogram or a percentage histogram.

Histograms are essentially vertical bar charts in which the rectangular bars are constructed at
midpoints of classes.
Example 7 Below we present the frequency histogram for the data set of quantities of glucose, for
which the frequency table is constructed in Table 2.1.

[Frequency histogram: frequency 0 to 20 vs. quantity of glucose (mg%), class midpoints 68 to 140]

Figure 2.5 Frequency histogram for quantities of glucose, tabulated in Table 2.1

Remark: When comparing two or more sets of data, the various histograms cannot be constructed on
the same graph because superimposing the vertical bars of one on another would cause difficulty in
interpretation. For such cases it is necessary to construct relative frequency or percentage polygons.

2. Polygons
As with histograms, when plotting polygons, the phenomenon of interest is plotted along the
horizontal axis while the vertical axis represents the number, proportion or percentage of
observations per class interval – depending on whether or not the particular polygon is respectively, a
frequency polygon, a relative frequency polygon or a percentage polygon. For example, the frequency
polygon is a line graph connecting the midpoints of each class interval in a data set, plotted at a
height corresponding to the frequency of the class.

Example 8 Figure 2.6 is a frequency polygon constructed from data in Table 2.1.

[Frequency polygon: frequency 0 to 20 vs. quantity of glucose (mg%), class midpoints 68 to 140]

Figure 2.6 Frequency polygon for data of glucose in Table 2.1

Advantages of polygons:
1. The frequency polygon is simpler than its histogram counterpart.
2. It sketches an outline of the data pattern more clearly.

3. The polygon becomes increasingly smooth and curve-like as we increase the number of classes
and the number of observations.

G. Cumulative distributions and cumulative polygons


Other useful methods of presentation which facilitate data analysis and interpretation are the
construction of cumulative distribution tables and the plotting of cumulative polygons. Both may be
developed from the frequency distribution table, the relative frequency distribution table or the
percentage distribution table.

A cumulative frequency distribution enables us to see how many observations lie above or below
certain values, rather than merely recording the number of items within intervals.

A “less-than” cumulative frequency distribution may be developed from the frequency table as follows:

Suppose a data set is divided into n classes by boundary points x1, x2, ..., xn, xn+1. Denote the classes by
C1, C2, ..., Cn. Thus, the class Ck = [xk, xk+1). See Figure 2.7.

C1 C2 Ck Cn

x1 x2 xk xk+1 xn xn+1

Figure 2.7 Class intervals

Suppose the frequency and relative frequency of class Ck are fk and rk (k = 1, 2, ..., n), respectively. Then
the cumulative frequency that observations fall into classes C1, C2, ..., Ck, or lie below the value xk+1, is
the sum f1 + f2 + ... + fk. The corresponding cumulative relative frequency is r1 + r2 + ... + rk.
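These running sums are exactly what Python's itertools.accumulate produces; the class frequencies below are illustrative:

```python
from itertools import accumulate

freqs = [1, 2, 8, 5, 14]              # hypothetical class frequencies f1..f5
total = sum(freqs)
rel = [f / total for f in freqs]      # relative frequencies r1..r5

cum_freq = list(accumulate(freqs))    # f1, f1+f2, ..., f1+...+fk
cum_rel = list(accumulate(rel))       # r1, r1+r2, ..., r1+...+rk

print(cum_freq)   # [1, 3, 11, 16, 30]
# The last entries equal the totals: 30 observations, cumulative relative frequency 1.0
```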

Example 9 Table 2.1 gives frequency, relative frequency, cumulative frequency and cumulative
relative frequency distribution for quantity of glucose in blood of 100 students. According to this table
the number of students having quantity of glucose less than 90 is 16.

A graph of a cumulative frequency distribution is called a “less-than” ogive, or simply an ogive. Figure 2.8
shows the cumulative frequency distribution for the quantity of glucose in blood of 100 students (data
from Table 2.1).

[Ogive: cumulative frequency 0 to 100 vs. quantity of glucose (mg%), 68 to 140]

Figure 2.8 Cumulative frequency distribution for quantity of glucose (for data in Table 2.1)

Exercises 1
1) In a national cancer institute survey, 1,580 adult women recently responded to the question “In
your opinion, what is the most serious health problem facing women?” The responses are
summarized in the following table:

The most serious health problem for women    Relative frequency
Breast cancer 0.44
Other cancers 0.31
Emotional stress 0.07
High blood pressure 0.06
Heart trouble 0.03
Other problems 0.09

a. Use one of the graphical methods to describe the data.
b. What proportion of the respondents believe that high blood pressure or heart trouble is the
most serious health problem for women?
c. Estimate the percentage of all women who believe that some type of cancer is the most serious
health problem for women.

2) The administrator of a hospital has ordered a study of the amount of time a patient must wait
before being treated by emergency room personnel. The following data were collected during a
typical day:

WAITING TIME (MINUTES)


12 16 21 20 24 3 11 17 29 18
26 4 7 14 25 2 26 15 16 6

a) Arrange the data in an array from lowest to highest. What comment can you make about
patient waiting time from your data array?
b) Construct a frequency distribution using 6 classes. What additional interpretation can you
give to the data from the frequency distribution?
c) Construct the cumulative relative frequency polygon and from this ogive state how long 75%
of the patients should expect to wait.

3) Bacteria are the most important component of microbial eco systems in sewage treatment plants.
Water management engineers must know the percentage of active bacteria at each stage of the
sewage treatment. The accompanying data represent the percentages of respiring bacteria in 25
raw sewage samples collected from a sewage plant.

42.3 50.6 41.7 36.5 28.6


40.7 48.1 48.0 45.7 39.9
32.3 31.7 39.6 37.5 40.8
50.1 39.2 38.5 35.6 45.6
34.9 46.1 38.3 44.5 37.2

a) Construct a relative frequency distribution for the data.
b) Construct a stem-and-leaf display for the data.
c) Compare the two graphs of parts a and b.

4) At a newspaper office, the time required to set the entire front page in type was recorded for 50
days. The data, to the nearest tenth of a minute, are given below.

20.8 22.8 21.9 22.0 20.7 20.9 25.0 22.2 22.8 20.1
25.3 20.7 22.5 21.2 23.8 23.3 20.9 22.9 23.5 19.5
23.7 20.3 23.6 19.0 25.1 25.0 19.5 24.1 24.2 21.8
21.3 21.5 23.1 19.9 24.2 24.1 19.8 23.9 22.8 23.9
19.7 24.2 23.8 20.7 23.8 24.3 21.1 20.9 21.6 22.7

a) Arrange the data in an array from lowest to highest.
b) Construct a frequency distribution and a “less-than” cumulative frequency distribution
from the data, using intervals of 0.8 minutes.
c) Construct a frequency polygon from the data.
d) Construct a “less-than” ogive from the data.
e) From your ogive, estimate what percentage of the time the front page can be set in less than
24 minutes.

IV. DESCRIPTIVE STATISTICS

A. Measures of Location

1. Mean

Definition 1
The arithmetic mean of a sample (or simply the sample mean) of n observations
x1, x2, ..., xn, denoted by x̄, is computed as

x̄ = (x1 + x2 + ... + xn)/n = (1/n) Σ xi (sum over i = 1, ..., n)

Definition 1a
The population mean of N observations x1, x2, ..., xN is defined by the formula

μ = (1/N) Σ xi (sum over i = 1, ..., N)
  = (Sum of the values of all observations in the population) / (Total number of observations in the population)
Note that the formulas for the population mean and the sample mean have the same form; the same is
true for the other measures of central tendency. In the next section, however, we will give different
formulas for the variances of a population and of a sample.

Example 1 Consider 7 observations: 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0.
By definition
x̄ = (4.2 + 4.3 + 4.7 + 4.8 + 5.0+ 5.1 + 9.0)/7 = 5.3
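The computation in Example 1 can be checked directly; this is a minimal sketch of Definition 1:

```python
# Observations from Example 1
data = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]

# Sample mean: (x1 + x2 + ... + xn) / n
mean = sum(data) / len(data)
print(round(mean, 1))   # 5.3
```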

Advantages of the mean:
1. It can always be calculated for numerical data, and it is unique.
2. It is useful for performing statistical procedures such as comparing the means from several
data sets.

Disadvantage of the mean:
It is affected by extreme values that are not representative of the rest of the data.

Indeed, if in the above example we compute the mean of the first 6 numbers and exclude the 9.0
value, then the mean is 4.7. The one extreme value 9.0 distorts the value we get for the mean. It would
be more representative to calculate the mean without including such an extreme value.

2. Median

Definition 2
The median m of a sample of n observations x1, x2, ..., xn arranged in ascending or
descending order is the middle number that divides the data set into two equal halves: one
half of the items lie above this point, and the other half lie below it.

Formula for calculating the median of a data set arranged in ascending order:

m = Median = xk                if n = 2k − 1 (n is odd)
m = Median = (xk + xk+1)/2     if n = 2k (n is even)

Example 2 Find the median of the data set consisting of the observations 7, 4, 3, 5, 6, 8, 10.

Solution First, we arrange the data set in ascending order


3 4 5 6 7 8 10.

Since the number of observations is odd, n = 7 = 2 × 4 − 1, so k = 4 and the median m = x4 = 6. We see
that one half of the observations, namely 3, 4 and 5, lie below the value 6, and the other half,
namely 7, 8 and 10, lie above it.

Example 3 Suppose we have an even number of the observations 7, 4, 3, 5, 6, 8, 10, 1. Find the
median of this data set.

Solution First, we arrange the data set in ascending order


1 3 4 5 6 7 8 10.

Since the number of observations is n = 8 = 2 × 4, so k = 4, by Definition 2

Median = (x4 + x5)/2 = (5 + 6)/2 = 5.5
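Both cases of the median formula can be sketched in a short helper (using 0-based indexing, so the xk of Definition 2 becomes s[k − 1]):

```python
def median(xs):
    # Implements Definition 2: middle value of the sorted data
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:                 # n = 2k - 1 (odd): the middle element
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # n = 2k (even): average of the two middle values

print(median([7, 4, 3, 5, 6, 8, 10]))      # 6   (Example 2)
print(median([7, 4, 3, 5, 6, 8, 10, 1]))   # 5.5 (Example 3)
```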

Advantage of the median over the mean: Extreme values in a data set do not affect the median as
strongly as they do the mean.
Indeed, in Example 1 we have

Mean = 5.3, median = 4.8.

The extreme value of 9.0 does not affect the median.

3. Mode

Definition 3
The mode of a data set x1, x2, ..., xn is the value of x that occurs with the greatest
frequency, i.e., is repeated most often in the data set.

Example 4 Find the mode of the data set in Table 1.

Table 1 Quantity of glucose (mg%) in blood of 25 students



70 88 95 101 106
79 93 96 101 107
83 93 97 103 108
86 93 97 103 112
87 95 98 106 115

Solution Reading down the columns, the data set in Table 1 is already arranged in ascending order.

This data set contains 25 numbers. We see that the value 93 is repeated most often (three times).
Therefore, the mode of the data set is 93.

Multimodal distribution: A data set may have several modes, in which case it is called a multimodal
distribution.

Example 5 The data set


0 2 6 9
0 4 6 10
1 4 7 11
1 4 8 11
1 5 9 12

has two modes: 1 and 4. Such a distribution is called a bimodal distribution.
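A small sketch that finds all modes, reproducing the bimodal result of Example 5 (Python's `collections.Counter` does the frequency counting):

```python
from collections import Counter

def modes(xs):
    # Return every value that occurs with the greatest frequency
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [0, 2, 6, 9, 0, 4, 6, 10, 1, 4, 7, 11, 1, 4, 8, 11, 1, 5, 9, 12]
print(modes(data))   # [1, 4]
```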

Advantage of the mode


Like the median, the mode is not unduly affected by extreme values. Even if the high values are very
high and the low values are very low, we choose the most frequent value of the data set to be the modal
value. We can use the mode no matter how large, how small, or how spread out the values in the data
set happen to be.

Disadvantages of the mode:


1. The mode is not used as often to measure central tendency as are the mean and the median.
Too often, there is no modal value because the data set contains no values that occur more
than once. Other times, every value is the mode because every value occurs the same number
of times. Clearly, the mode is a useless measure in these cases.
2. When data sets contain two, three, or many modes, they are difficult to interpret and
compare.

Comparing the Mean, Median and Mode


 In general, the 3 measures of central tendency of a data set: the mean, the median and the mode,
are different. For example, for the data set in Table 1, mean = 96.48, median = 97 and mode =
93.
 If all observations in a data set are arranged symmetrically about an observation, then this
observation is the mean, the median and the mode.
 Which of these three measures of central tendency is better? The best measure of central
tendency for a data set depends on the type of descriptive information you want. For most
data sets encountered in business, engineering and computer science, this will be the MEAN.

4. Geometric mean

Definition 4
Suppose all the n observations in a data set x1, x2, ..., xn are positive. Then the geometric mean
of the data set is defined by the formula

xG = G.M. = (x1 · x2 · ... · xn)^(1/n)

The geometric mean is appropriate to use whenever we need to measure the average rate of change
(the growth rate) over a period of time.
From the above formula it follows that

log xG = (1/n) Σ log xi (sum over i = 1, ..., n)

where log is the logarithm function of any base.

Thus, the logarithm of the geometric mean of the values of a data set is equal to the arithmetic mean
of the logarithms of the values of the data set.
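That identity is also the numerically safer way to compute the geometric mean, since multiplying many values directly can overflow; a minimal sketch:

```python
import math

def geometric_mean(xs):
    # exp of the arithmetic mean of the logs = nth root of the product
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Illustrative data: the geometric mean of 2 and 8 is the square root of 16, i.e. 4
print(round(geometric_mean([2.0, 8.0]), 6))   # 4.0
```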

B. Measures of Variability

Just as measures of central tendency locate the “center” of a relative frequency distribution, measures
of variation measure its “spread”.

The most commonly used measures of data variation are the range, the variance and the standard
deviation.

1. Range

Definition 5
The range of a quantitative data set is the difference between the largest and smallest values in
the set.
Range = Maximum - Minimum,
where Maximum = Largest value, Minimum = Smallest value.

2. Variance and standard deviation

Definition 6
The population variance of a population of observations x1, x2, ..., xN is defined by the formula

σ² = Σ (xi − μ)² / N (sum over i = 1, ..., N)

where: σ² = population variance
xi = the ith item or observation
μ = population mean
N = total number of observations in the population.

From Definition 6 we see that the population variance is the average of the squared distances of
the observations from the mean.

Definition 7
The standard deviation of a population is equal to the square root of the variance

σ = √σ² = √( Σ (xi − μ)² / N ) (sum over i = 1, ..., N)

Note that for the variance, the units are the squares of the units of the data. And for the standard
deviation, the units are the same as those used in the data.

Definition 6a
The sample variance of a sample of observations x1, x2, ..., xn is defined by the formula

s² = Σ (xi − x̄)² / (n − 1) (sum over i = 1, ..., n)

where: s² = sample variance
x̄ = sample mean
n = total number of observations in the sample

The standard deviation of the sample is

s = √s²

Remark: In the denominator of the formula for s² we use n − 1 instead of n because statisticians have
proved that if s² is defined as above, then s² is an unbiased estimate of the variance of the population
from which the sample was selected (i.e., the expected value of s² is equal to the population variance).
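The two denominators can be compared directly; the data below are illustrative, and Python's `statistics` module agrees with both definitions:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]   # hypothetical sample
n = len(data)
m = sum(data) / n

pop_var = sum((x - m) ** 2 for x in data) / n         # Definition 6: divide by N
samp_var = sum((x - m) ** 2 for x in data) / (n - 1)  # Definition 6a: divide by n - 1

print(abs(pop_var - statistics.pvariance(data)) < 1e-9)   # True
print(abs(samp_var - statistics.variance(data)) < 1e-9)   # True
```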

Uses of the standard deviation


The standard deviation enables us to determine, with a great deal of accuracy, where the values of a
frequency distribution are located in relation to the mean. We can do this according to a theorem
devised by the Russian mathematician P.L. Chebyshev (1821-1894).

Chebyshev’s Theorem
For any data set with mean x̄ and standard deviation s, at least 75% of the values will fall
within the interval x̄ ± 2s, and at least 89% of the values will fall within the interval x̄ ± 3s.

We can measure with even more precision the percentage of items that fall within specific ranges
under a symmetrical, bell-shaped curve. In these cases, we have:

The Empirical Rule


If a relative frequency distribution of sample data is bell-shaped with mean x̄ and standard deviation
s, then the proportions of the total number of observations falling within the intervals x̄ ± s, x̄ ± 2s,
x̄ ± 3s are as follows:
x̄ ± s: close to 68%
x̄ ± 2s: close to 95%
x̄ ± 3s: near 100%

3. Relative dispersion: The coefficient of variation


The standard deviation is an absolute measure of dispersion that expresses variation in the same units
as the original data. For example, the unit of the standard deviation of a data set of students' heights
is the centimeter, while the unit of the standard deviation of their weights is the kilogram. Can we
compare the values of these standard deviations? Unfortunately, no, because they are in different
units.

We need a relative measure that will give us a feel for the magnitude of the deviation relative to the
magnitude of the mean. The coefficient of variation is one such relative measure of dispersion.

Definition 8
The coefficient of variation of a data set is the ratio of its standard deviation to its mean,
expressed as a percentage:

cv = Coefficient of variation = (Standard deviation / Mean) × 100%

This definition is applied to both population and sample.

The unit of the coefficient of variation is percent.

Example 6 Suppose that each day laboratory technician A completes 40 analyses with a standard
deviation of 5. Technician B completes 160 analyses per day with a standard deviation of 15. Which
employee shows less variability?

At first glance, it appears that technician B has three times more variation in the output rate than
technician A. But B completes analyses at a rate 4 times faster than A. Taking all this information into
account, we compute the coefficient of variation for both technicians:
For technician A: cv = 5/40 × 100% = 12.5%
For technician B: cv = 15/160 × 100% = 9.4%.

So we find that technician B, who has more absolute variation in output than technician A, has less
relative variation.
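The two coefficients of variation in Example 6 can be reproduced directly:

```python
def cv(std_dev, mean):
    # Coefficient of variation, in percent (Definition 8)
    return std_dev / mean * 100

print(round(cv(5, 40), 1))     # 12.5 -- technician A
print(round(cv(15, 160), 1))   # 9.4  -- technician B
```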

V. OVERVIEW TO DESIGN OF EXPERIMENT

An engineer is someone who solves problems of interest to society by the efficient application of
scientific principles by
1. refining an existing product or process; or
2. designing a new product or process that meets customers’ needs

The engineering, or scientific, method is the approach to formulating and solving these problems
following these steps:
1. Develop a clear and concise description of the problem.
2. Identify, at least tentatively, the important factors that affect this problem or that may play a
role in its solution.
3. Propose a model for the problem, using scientific or engineering knowledge of the
phenomenon being studied. State any limitations or assumptions of the model.
4. Conduct appropriate experiments and collect data to test or validate the tentative model or
conclusions made in steps 2 and 3.
5. Refine the model on the basis of the observed data.

6. Manipulate the model to assist in developing a solution to the problem.


7. Conduct an appropriate experiment to confirm that the proposed solution to the problem is
both effective and efficient.
8. Draw conclusions or make recommendations based on the problem solution.

Figure 1-1 The engineering method. Retrieved from the reference.

Statistical techniques are a powerful aid in designing new products and systems, improving existing
designs, and designing, developing, and improving production processes.

Design of Experiments (DOE) techniques enable designers to determine simultaneously the
individual and interactive effects of the many factors that could affect the output results in any design.
DOE also provides full insight into the interactions between design elements; therefore, it helps turn any
standard design into a robust one. Simply put, DOE helps to pinpoint the sensitive parts and
sensitive areas in designs that cause problems in yield. Designers are then able to fix these problems
and produce robust, higher-yield designs prior to going into production.

A. Terminology
Response variable: The outcome of an experiment
Factor: Each variable that affects the response variable and has several alternatives
Level: The values that a factor can assume
Primary factor: A factor whose effects need to be quantified
Secondary factor: A factor that impacts the performance but whose impact we are not interested
in quantifying
Replication: Repetition of all or some experiments
Experimental unit: Any entity that is used for the experiment
Interaction: Two factors A and B interact if the effect of one depends upon the level of the other
Experiment: A controlled study conducted to determine the effect that varying one or more
explanatory variables (factors) has on a response variable

B. Fundamental Principles
The fundamental principles in design of experiments are solutions to the problems in
experimentation posed by the two types of nuisance factors and serve to improve the efficiency of
experiments. Those fundamental principles are
 Randomization
 Replication
 Blocking
 Orthogonality
 Factorial experimentation

Randomization is a method that protects against an unknown bias distorting the results of the
experiment.

Replication increases the sample size and is a method for increasing the precision of the experiment.
Replication increases the signal-to-noise ratio when the noise originates from uncontrollable nuisance
variables. A replicate is a complete repetition of the same experimental conditions, beginning with the
initial setup. A special design called a Split Plot can be used if some of the factors are hard to vary.

Blocking is a method for increasing precision by removing the effect of known nuisance factors. An
example of a known nuisance factor is batch-to-batch variability. In a blocked design, both the
baseline and new procedures are applied to samples of material from one batch, then to samples from
another batch, and so on. The difference between the new and baseline procedures is not influenced
by the batch-to-batch differences. Blocking is a restriction of complete randomization, since both
procedures are always applied to each batch. Blocking increases precision since the batch-to-batch
variability is removed from the “experimental error.”

Orthogonality in an experiment results in the factor effects being uncorrelated and therefore more
easily interpreted. The factors in an orthogonal experiment design are varied independently of each
other. The main results of data collected using this design can often be summarized by taking
differences of averages and can be shown graphically by using simple plots of suitably chosen sets of
averages. In these days of powerful computers and software, orthogonality is no longer a necessity,
but it is still a desirable property because of the ease of explaining results.

Factorial experimentation is a method in which the effects due to each factor and to combinations
of factors are estimated. Factorial designs are geometrically constructed and vary all the factors
simultaneously and orthogonally. Factorial designs collect data at the vertices of a cube in p
dimensions (p is the number of factors being studied). If data are collected from all of the vertices, the
design is a full factorial, requiring 2^p runs. Since the total number of combinations increases
exponentially with the number of factors studied, fractions of the full factorial design can be
constructed. As the number of factors increases, the fractions become smaller and smaller (1/2, 1/4,
1/8, 1/16, …). Fractional factorial designs collect data from a specific subset of all possible vertices and
require 2^(p−q) runs, with 2^(−q) being the fractional size of the design. If there are only three factors in
the experiment, the geometry of the experimental design for a full factorial experiment requires eight
runs, and a one-half fractional factorial experiment (an inscribed tetrahedron) requires four runs
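The run counts above can be sketched by enumerating the cube's vertices; the choice of which half to keep (the defining relation I = ABC) is one common convention, shown here only for illustration:

```python
from itertools import product

p = 3   # number of two-level factors
full = list(product([-1, +1], repeat=p))   # all 2**p vertices of the cube
print(len(full))                           # 8 runs for the full factorial

# One-half fraction (q = 1): keep the runs whose levels multiply to +1.
# These four vertices form the inscribed tetrahedron, i.e. 2**(p - q) runs.
half = [run for run in full if run[0] * run[1] * run[2] == 1]
print(len(half))                           # 4
```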

Factorial designs, including fractional factorials, have increased precision over other types of designs
because they have built-in internal replication. Factor effects are essentially the difference between
the average of all runs at the two levels for a factor, such as “high” and “low.” Replicates of the same
points are not needed in a factorial design, which seems like a violation of the replication principle in
design of experiments. However, half of all the data points are taken at the high level and the other
half are taken at the low level of each factor, resulting in a very large number of replicates. Replication
is also provided by the factors included in the design that turn out to have nonsignificant effects.
Because each factor is varied with respect to all of the factors, information on all factors is collected by
each run. In fact, every data point is used in the analysis many times as well as in the estimation of
every effect and interaction. Additional efficiency of the two-level factorial design comes from the fact
that it spans the factor space, that is, puts half of the design points at each end of the range, which is
the most powerful way of determining whether a factor has a significant effect.

Uses
The main uses of design of experiments are
 Discovering interactions among factors
 Screening many factors

 Establishing and maintaining quality control


 Optimizing a process, including evolutionary operations (EVOP)
 Designing robust products

Design
An experimental design consists of specifying the number of experiments, the factor level
combinations for each experiment, and the number of replications.

Steps in Designing an Experiment


 Identify the problem to be solved
 Determine factors that affect response variable
 Determine the number of experimental units
 Determine the level of each factor
o Control
o Randomize
 Conduct experiment
o Replication
o Collect and process data
 Test the claim

C. Prerequisites to Conducting Experiment


There are certain characteristics of an experiment that are prerequisites to conducting a meaningful
experiment. Some of these are:
 The experiment should have well defined objectives. These should include identifying the
factors and their ranges; choosing experimental procedure and equipment; and stating the
applicability of the results.
 As much as possible, effects of the independent factors should not be obscured by other
variables. This is accomplished by designing the experiment such that the effects of
uncontrolled variables are minimized.
 As much as possible the experiment should be free from bias. This involves the use of
randomization and replications.
 The experiment should provide a measure of precision (experimental error), unless it is known
from previous experimentation. Replications provide the measure of precision while
randomization assures the validity of the measure of precision.
 The expected precision of the experiment should be sufficient to meet the defined objectives.
There generally is a trade-off between the expense of additional experimentation and the
precision of the results. These trade-offs should be examined prior to the collection of data.
Also, greater precision may be obtained by use of blocked designs when appropriate.

In planning an experiment, you have to decide


1. what measurement to make (the response)
2. what conditions to study
3. what experimental material to use (the units)
