Ch1 Prob&Stat NEW
GENERAL PRINCIPLES IN
STATISTICS
WHAT DO SCIENTISTS DO?
A scientist is someone who solves problems of
interest to society with the efficient application of
scientific principles by:
• Refining existing products
• Designing new products or processes
STATISTICS SUPPORTS THE CREATIVE
PROCESS
The field of statistics deals with the collection,
presentation, analysis, and use of data to:
• Make decisions
• Solve problems
• Design products and processes
It is the science of learning information from data.
BASIC TYPES OF STUDIES
INTRODUCTION: BASIC TERMS
Population Vs. Sample
Population
Sample
TYPES OF VARIABLES
Variables
Quantitative Qualitative
The bar graph and the pie chart are two types of graphs that are
commonly used to display qualitative data.
[Pie chart: three slices of 10%, 50%, and 40%]
II. Organizing and graphing Quantitative variables
How to organize and display quantitative data.
Frequency distribution of a quantitative
variable: single-valued classes
Example: A sample of 10 students is selected, and each student is asked how many
cars his or her household owns.
3 1 0 2 1
1 2 1 1 0
Cars Owned    Frequency
0             2
1             5
2             2
3             1
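The tally above can be reproduced programmatically. The following is a minimal sketch, assuming Python and its standard `collections.Counter`; the variable names `cars_owned` and `frequency` are chosen here for illustration and are not part of the original slides.

```python
from collections import Counter

# Number of cars owned by each of the 10 sampled households (from the example).
cars_owned = [3, 1, 0, 2, 1, 1, 2, 1, 1, 0]

# Counter tallies how many times each distinct value occurs.
frequency = Counter(cars_owned)

# Print the frequency distribution with single-valued classes, in increasing order.
print("Cars Owned  Frequency")
for value in sorted(frequency):
    print(f"{value:<11} {frequency[value]}")
```

The frequencies should add up to the sample size, which is a quick check that no observation was missed.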
Graphical presentation of quantitative data
Figure 1 Figure 2
DESCRIBING DATA USING NUMERICAL
MEASURES
We have already discussed that frequency distributions and graphs are
important components of statistics. However, it is also important to
describe the main characteristics of a data set numerically. We will
discuss two kinds of numerical summary measures, namely measures of:
1. Central tendency
2. Dispersion or spread
1. Measures of Central Tendency
A measure of central tendency gives the center of a histogram or a
frequency distribution curve. We will discuss four different
measures of central tendency: the mean, the trimmed mean, the
median, and the mode.
I. Mean
The mean is the most frequently used measure of central tendency.
$$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$$
Mean for population data: $$\mu = \frac{\sum x}{N}$$
Mean for sample data: $$\bar{x} = \frac{\sum x}{n}$$
Example: The following are the ages (in years) of all eight employees
of a small company:
53 32 61 27 39 44 49 57
Calculate the mean age of these employees.
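The population mean is
$$\mu = \frac{\sum x}{N} = \frac{362}{8} = 45.25 \text{ years}$$
If a sample of three of these employees, with ages 32, 39, and 57, is taken, the sample mean is
$$\bar{x} = \frac{\sum x}{n} = \frac{32 + 39 + 57}{3} = 42.67 \text{ years}$$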
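As a check on the arithmetic, here is a minimal Python sketch of both calculations. It treats the eight ages as the population and the three values 32, 39, and 57 from the second calculation as the sample; the variable names are illustrative assumptions.

```python
# Ages (in years) of all eight employees: this is the whole population.
ages = [53, 32, 61, 27, 39, 44, 49, 57]

# Population mean: sum of all values divided by the number of values N.
population_mean = sum(ages) / len(ages)

# The three ages used in the sample-mean calculation.
sample = [32, 39, 57]

# Sample mean: same formula, with the sample size n in the denominator.
sample_mean = sum(sample) / len(sample)

print(population_mean)         # 45.25
print(round(sample_mean, 2))   # 42.67
```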
Sometimes a data set may contain a few very small or a few very large
values. Such values are called outliers or extreme values.
The following table lists the total sales of six Palestinian companies for 2014.
We should know that the mean is not always the best measure of
central tendency because it is heavily influenced by outliers.
Sometimes other measures of central tendency give a more accurate
impression of a data set. For example, when a data set has outliers,
instead of using the mean, we can use either the trimmed mean or
the median as a measure of central tendency.
II. Trimmed Mean
The trimmed mean is calculated by dropping a certain percentage of values
from each end of a ranked data set. The trimmed mean is especially useful
as a measure of central tendency when a data set contains a few outliers at
each end.
Example: Suppose the following data give the ages (in years) of 10
employees of a company:
47 53 38 26 39 49 19 67 31 23
To calculate the 10% trimmed mean, first we rank these data values in
increasing order; then drop 10% of the smallest values and 10% of the
largest values. The mean of the remaining 80% of the values will give the
10% trimmed mean.
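Ranked data, with the smallest value (19) and the largest value (67) dropped:
(19) 23 26 31 38 39 47 49 53 (67)
$$\bar{x} = \frac{\sum x}{n} = \frac{306}{8} = 38.25$$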
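The procedure just described (rank, drop the same share from each end, average the rest) can be sketched in Python as follows. The helper `trimmed_mean` and its integer `percent` parameter are hypothetical names introduced for illustration. The sketch assumes that the number of values dropped from each end is rounded down when the percentage does not give a whole number.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the values from each end.

    Assumption: the count dropped per end is rounded down to a whole number.
    """
    ranked = sorted(values)                # rank in increasing order
    k = len(ranked) * percent // 100       # values to drop from each end
    kept = ranked[k:len(ranked) - k]       # remaining middle values
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)


# Ages (in years) of the 10 employees from the example.
ages = [47, 53, 38, 26, 39, 49, 19, 67, 31, 23]

print(trimmed_mean(ages, 10))  # 38.25
```

With 10 values and a 10% trim, one value is dropped from each end, so the mean is taken over the remaining eight values.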
III. Median
Another important measure of central tendency is the median, which
is the value of the middle term in a data set that has been ranked in
increasing order.
• If n is odd, the median is the middle number
• If n is even, the median is the mean of the middle two numbers
Example: Suppose the following data give the ages (in years) of 10
employees of a company: 47 53 38 26 39 49 19 67 31 23
First, we rank the given data in increasing order as follows:
19 23 26 31 38 39 47 49 53 67
The two middle values are 38 and 39, so
$$\text{Median} = \frac{38 + 39}{2} = 38.5$$
The advantage of using the median as a measure of central tendency is that it
is not influenced by outliers.
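The odd/even rule for the median can be written as a short Python sketch. The function name `median` and the list `with_outlier` are illustrative assumptions. The second list is a made-up variant of the data in which the largest age, 67, is replaced by 670 to show how an outlier affects the mean but not the median.

```python
def median(values):
    """Middle value of the ranked data (mean of the two middle values if n is even)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                        # n odd: the middle number
    return (ranked[mid - 1] + ranked[mid]) / 2    # n even: mean of the middle two


# Ages (in years) of the 10 employees from the example.
ages = [47, 53, 38, 26, 39, 49, 19, 67, 31, 23]
print(median(ages))  # 38.5

# Hypothetical variant: the largest age is replaced by an extreme value.
with_outlier = [670 if age == 67 else age for age in ages]
print(median(with_outlier))                    # still 38.5
print(sum(ages) / len(ages))                   # mean of the original data: 39.2
print(sum(with_outlier) / len(with_outlier))   # mean with the outlier: 99.5
```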
IV. Mode
The mode is the value that occurs with the highest frequency in a
data set.
Example: The following data give the speeds (in miles per hour) of
eight cars that were stopped on a road for speeding violations:
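77 82 74 81 79 84 74 78
In these data, 74 occurs twice and every other value occurs only once, so the
mode is 74 miles per hour.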
A major shortcoming of the mode is that a data set may have none or
may have more than one mode, whereas it will have only one mean
and only one median.
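A small Python sketch can illustrate the three cases (one mode, no mode, more than one mode). The helper `modes` is a hypothetical name. It follows the convention, assumed here from the text, that a data set in which every value occurs only once has no mode. The two short lists in the last lines are made-up data used only to show the other cases.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency.

    Convention assumed here: if no value repeats, the data set has no mode.
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                                  # no value repeats: no mode
    return sorted(v for v, c in counts.items() if c == highest)


# Speeds (in miles per hour) of the eight cars from the example.
speeds = [77, 82, 74, 81, 79, 84, 74, 78]
print(modes(speeds))            # [74]

# Made-up data sets illustrating the shortcoming described above.
print(modes([1, 2, 3]))         # []      (no mode)
print(modes([1, 1, 2, 2, 3]))   # [1, 2]  (two modes)
```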
RELATIONSHIP AMONG THE MEAN, MEDIAN AND
MODE
As discussed previously, two of the many shapes that a histogram
can assume are symmetric and skewed.
Knowing the values of the mean, median, and mode can give us
some idea about the shape of a frequency distribution curve.
I. For a symmetric histogram and frequency distribution curve with
one peak (see the figure below), the values of the mean, median, and
mode are identical, and they lie at the center of the distribution.
II. For a histogram skewed to the right (see the figure below), the
value of the mean is the largest, that of the mode is the smallest,
and the value of the median lies between these two. (Notice that
the mode always occurs at the peak point.) The value of the mean
is the largest in this case because it is sensitive to outliers that
occur in the right tail. These outliers pull the mean to the right.
III. If a histogram and a frequency distribution curve are skewed to
the left (see the figure below), the value of the mean is the
smallest and that of the mode is the largest, with the value of the
median lying between these two. In this case, the outliers in the
left tail pull the mean to the left.
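The ordering of the three measures under skewness can be seen on a small example. The sketch below uses Python's standard `statistics` module on a made-up, right-skewed data set; the data and the variable names are assumptions for illustration only. The left-skewed set is simply its mirror image.

```python
from statistics import mean, median, mode

# Made-up data with a long right tail (a few large values).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

# Mirror image of the same data, which has a long left tail instead.
left_skewed = [21 - x for x in right_skewed]

# Right skew: mode < median < mean (the large values pull the mean right).
print(mode(right_skewed), median(right_skewed), mean(right_skewed))

# Left skew: mean < median < mode (the small values pull the mean left).
print(mean(left_skewed), median(left_skewed), mode(left_skewed))
```

For these particular data, the mode, median, and mean are 2, 3, and 5.1 in the right-skewed case, and 19, 18, and 15.9 in the mirrored case.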
2. Measures of Dispersion
The measures of central tendency, such as the mean, median, and
mode, do not reveal the whole picture of the distribution of a data
set. Two data sets with the same mean may have completely
different spreads. The variation among the values of observations
for one data set may be much larger or smaller than for the other
data set. (Note that the words dispersion, spread, and variation
have the same meaning).
Consider the following two data sets on the ages (in years) of all
workers working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40
years. If we do not know the ages of individual workers at these two
companies and are told only that the mean age of the workers at both
companies is the same, we may deduce that the workers at these two
companies have a similar age distribution.
As we can observe, however, the variation in the workers’ ages for
each of these two companies is very different. As illustrated in the
diagram, the ages of the workers at the second company have a much
larger variation than the ages of the workers at the first company.
Company 1 (ranked): 35 36 38 39 40 45 47
Company 2 (ranked): 18 27 33 52 70
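The comparison can be checked numerically. This is a minimal Python sketch using the two data sets as listed for Company 1 and Company 2 (seven and five workers respectively); the variable names are illustrative assumptions.

```python
# Ages (in years) of all workers at each company, as listed in the example.
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    mean_age = sum(ages) / len(ages)
    # Same mean, but the youngest and oldest workers are much further apart
    # in the second company.
    print(name, "mean:", mean_age, "youngest:", min(ages), "oldest:", max(ages))
```

Both means come out to 40 years, while the ages run from 35 to 47 in the first company and from 18 to 70 in the second.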
Thus, the mean, median, or mode by itself is usually not a sufficient
measure to reveal the shape of the distribution of a data set. We also
need a measure that can provide some information about the variation
among data values.
The measures that help us learn about the spread of a data set are
called the measures of dispersion. The measures of central tendency
and dispersion taken together give a better picture of a data set than
the measures of central tendency alone. Here we will discuss three
measures of dispersion: range, variance, and standard deviation.
I. Range
The range is the simplest measure of dispersion to calculate. It is
obtained by taking the difference between the largest and the smallest
values in a data set.
Example: The following are the ages (in years) of all eight employees
of a small company:
53 32 61 27 39 44 49 57
Range = Largest value - Smallest value = 61 - 27 = 34 years
The range, like the mean, has the disadvantage of being influenced
by outliers.
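A short Python sketch of the range, with one made-up outlier added to show the sensitivity just mentioned. The function name `data_range` and the extra value 95 are assumptions for illustration.

```python
def data_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


# Ages (in years) of all eight employees from the example.
ages = [53, 32, 61, 27, 39, 44, 49, 57]
print(data_range(ages))          # 34

# Hypothetical: one additional, unusually old employee changes the range a lot.
print(data_range(ages + [95]))   # 68
```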
II. Variance and Standard Deviation
The standard deviation is the most-used measure of dispersion. The
value of the standard deviation tells how closely the values of a data
set are clustered around the mean.
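Population variance: $$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
Sample variance: $$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
The standard deviation is the positive square root of the variance.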
Example: Suppose the final scores of a sample of four students are 82,
95, 67, and 92, respectively.
Calculate the variance and standard deviation for these data.
The mean score for these four students is
$$\bar{x} = \frac{82 + 95 + 67 + 92}{4} = \frac{336}{4} = 84$$
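$x$     $x - \bar{x}$        $(x - \bar{x})^2$
82      82 - 84 = -2         4
95      95 - 84 = 11         121
67      67 - 84 = -17        289
92      92 - 84 = 8          64
        $\sum (x - \bar{x}) = 0$     $\sum (x - \bar{x})^2 = 478$
Sample variance: $$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{478}{3} = 159.3$$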
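The table-based calculation can be reproduced step by step in Python. The sketch below is illustrative, and its variable names are assumptions. It computes the deviations and squared deviations by hand, then cross-checks the result against `statistics.variance` and `statistics.stdev`, which also use the n - 1 denominator for sample data.

```python
import math
from statistics import stdev, variance

# Final scores of the sample of four students.
scores = [82, 95, 67, 92]

n = len(scores)
x_bar = sum(scores) / n                           # sample mean: 84.0

deviations = [x - x_bar for x in scores]          # x - x_bar for each score
squared_deviations = [d ** 2 for d in deviations]

sum_sq = sum(squared_deviations)                  # 478.0
sample_variance = sum_sq / (n - 1)                # 478 / 3
sample_sd = math.sqrt(sample_variance)            # square root of the variance

print(sum(deviations))             # 0.0 (deviations from the mean always sum to zero)
print(round(sample_variance, 1))   # 159.3
print(round(sample_sd, 2))         # 12.62

# Cross-check against the standard library's sample variance and deviation.
print(math.isclose(sample_variance, variance(scores)))  # True
print(math.isclose(sample_sd, stdev(scores)))           # True
```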