Introduction To Statistics

GBS 541

Uploaded by

kaszulu1

CHAPTER 1

INTRODUCTION TO STATISTICAL ANALYSIS

Reading

Newbold 1.1, 1.3, parts of 1.2.

Anderson, Sweeney, and Williams Chapter 1

Wonnacott and Wonnacott Chapter 1

James T Mc Clave, P. George Benson Chapter 1

Introductory Comments

This chapter sets the framework for the book. Read it carefully, because the ideas
introduced here are basic to this subject and to research methodology.

1. Random Sampling, Deductive and Inductive Statistics.

Random Sampling

Only in exceptional circumstances is it possible to consider every member of the
population. In most cases only a sample of the population can be considered, and
the results obtained from this sample must be generalized to apply to the
population.

In order that these generalizations should be accurate, the sample must be random;
that is, every possible sample of a given size has an equal chance of selection, and
the choice of a member of the sample must not be influenced by previous selections.
This is simple random sampling.

Example 1

Suppose that a population consists of six measurements: 1, 2, 3, 4, 5, and 7. List
all possible different samples of two measurements that could be selected from
the population. Give the probability associated with each sample in a random
sample of n = 2 measurements selected from the population.

Solution

All possible samples are listed below

Sample Measurements
1 1,2
2 1,3
3 1,4
4 1,5
5 1,7
6 2,3
7 2,4
8 2,5
9 2,7
10 3,4
11 3,5
12 3,7
13 4,5
14 4,7
15 5,7

Now suppose that we draw a single sample of n = 2 measurements from the 15
possible samples of two measurements. The sample selected is called a random sample if
every sample had an equal probability (1/15) of being selected.
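The listing in Example 1 can be reproduced with a short Python sketch. The variable names are illustrative only, and the script assumes, as the example does, that the order of the two measurements within a sample does not matter.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 5, 7]   # the six measurements of Example 1
n = 2                             # sample size

# Every unordered sample of n distinct measurements, in the order of the table.
samples = list(combinations(population, n))

# Under simple random sampling each possible sample is equally likely.
probability = Fraction(1, len(samples))

for number, sample in enumerate(samples, start=1):
    print(number, sample, probability)
```

Each of the 15 samples is printed with probability 1/15, in agreement with the table above.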

It is rather unlikely that we would ever achieve a truly random sample, because the
probabilities of selection will not always be exactly equal. But we do the best we can.
One of the simplest and most reliable ways to select a random sample of n measurements
from a population is to use a table of random numbers (see Appendix B). Random
number tables are constructed in such a way that, no matter where you start in the table
and no matter what direction you move, the digits occur randomly and with equal probability.
Thus, if we wished to choose a random sample of n = 10 measurements from a population
containing 100 measurements, we could label the measurements in the population from
0 to 99 (or 1 to 100). Then, referring to the table of random numbers and choosing a
random starting point, the next 10 two-digit numbers going across the page would indicate
the labels of the particular measurements to be included in the random sample. Similarly,
by moving up or down the page, we would also obtain a random sample.

Example 2

A small community consists of 850 families. We wish to obtain a random sample of 20
families to ascertain public acceptance of a wage and price freeze. Refer to Appendix B
to determine which families should be sampled.

Solution

Assuming that a list of all families in the community is available (such as a telephone
directory), we could label the families from 0 to 849 (or, equivalently, from 1 to 850).
Then, referring to the Appendix, we choose a starting point. Suppose we have decided to
start at line 1, column 4. Going down the page, we choose the first 20 three-digit
numbers between 000 and 849. From Table B, we have

511 791 099 671 152
584 045 783 301 568
754 750 059 498 701
258 266 105 469 160

These 20 numbers identify the 20 families that are to be included in our sample.
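Where no random number table is to hand, a pseudo-random number generator can play the same role. The following is a minimal sketch, not the procedure of Example 2 itself: the seed value is an arbitrary assumption that stands in for the choice of starting point, and the numbers it produces will differ from those read from Table B.

```python
import random

NUM_FAMILIES = 850   # families are labelled 0 to 849, as in Example 2
SAMPLE_SIZE = 20

# The seed is an arbitrary choice (an assumption) so that the draw is repeatable.
rng = random.Random(541)

# random.sample draws without replacement, so no family is chosen twice.
selected = sorted(rng.sample(range(NUM_FAMILIES), SAMPLE_SIZE))
print(selected)
```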

Deductive and Inductive Statistics.

The reasoning that is used in statistics hinges on understanding two types of logic,
namely deductive and inductive logic. The type of logic that reasons from the particular
(sample) to the general (Population) is known as inductive logic, while the type that
reasons from the general to the particular is known as deductive logic.

Learning Objectives

After working through this chapter, you should be able to:

• Explain what random sampling is

• Explain the difference between a population and a sample

CHAPTER 2

METHODS OF ORGANISING AND PRESENTING DATA

Reading

Newbold Chapter 2

James T Mc Clave and P George Benson Chapter 2

Tailoka Frank P Chapter 3

Introductory Comments

This chapter deals with the understanding of data. We construct graphical
representations of the data, which allow one to see its most important
characteristics easily. Most of the graphical representations are very tedious to construct
without the use of a computer. However, one understands much more if one tries a few
with pencil and paper.

Graphical Representations Of Data

Types of business data; methods of representation of qualitative data; cumulative
frequency distribution.

Types of business data. Although the number of business phenomena that can be
measured is almost limitless, business data can generally be classified as one of two
types: quantitative or qualitative.

Quantitative data are observations that are measured on a numerical scale. Examples of
quantitative business data are:

i. The monthly unemployment percentage.
ii. Last year’s sales for selected firms.
iii. The number of women executives in an industry.

Qualitative data are observations that are not measurable, in the sense that height is
measured, or countable, as people entering a store are counted. Many characteristics can
be classified only into one of a set of categories. Examples of qualitative business data are:

i) The political party affiliations of fifty randomly selected business executives.
Each executive would have one and only one political party affiliation.

ii) The brand of petrol last purchased by seventy-four randomly selected car owners.
Again, each measurement would fall into one and only one category.

Notice that each of the examples has nonnumerical or qualitative measurements.

Graphical methods for describing qualitative data.

(a) The Bar Graph

For example, suppose a women’s clothing store located in the downtown area of a
large city wants to open a branch in the suburbs. To obtain some information
about the geographical distribution of its present customers, the store manager
conducts a survey in which each customer is asked to identify her place of
residence with regard to the city’s four quadrants: Northwest (NW), Northeast
(NE), Southwest (SW), or Southeast (SE). Out-of-town customers are excluded
from the survey. The responses of n = 30 randomly selected resident customers
might appear as in Table 1.1 (note that the symbol n is used here and throughout
this course to represent the sample size, i.e. the number of measurements in a
sample). You can see that each of the thirty measurements falls in one and only
one of the four possible categories representing the four quadrants of the city.

Table 1.1. Customer residence survey: n = 30

Customer Residence Customer Residence Customer Residence
1 NW 11 NW 21 NE
2 SE 12 SE 22 NW
3 SE 13 SW 23 SW
4 NW 14 NW 24 SE
5 SW 15 SW 25 SW
6 NW 16 NE 26 NW
7 NE 17 NE 27 NW
8 SW 18 NW 28 SE
9 NW 19 NW 29 NE
10 SE 20 SW 30 SW

A natural and useful technique for summarizing qualitative data is to tabulate the
frequency or relative frequency of each category.

Definition:

The frequency for a category is the total number of measurements that fall in the
category. The frequency for a particular category, say category i, will be denoted by the
symbol $f_i$.

The relative frequency for a category is the frequency of that category divided by the
total number of measurements; that is, the relative frequency for category i is

$$\text{Relative frequency} = \frac{f_i}{n}$$

where n = total number of measurements in the sample

$f_i$ = frequency for the ith category.

The frequency for a category is the total number of measurements in that category,
whereas the relative frequency for a category is the proportion of measurements in the
category. Table 1.2 shows the frequency and relative frequency for the customer
residences listed in Table 1.1. Note that the sum of the frequencies should always equal
the total number of measurements in the sample and the sum of the relative frequencies
should always equal 1 (except for rounding errors) as in Table 1.2.

Table 1.2. Frequencies and relative frequencies for the customer residence survey

Category Frequency Relative Frequency
NE 5 5/30 = .167
NW 11 11/30 = .367
SE 6 6/30 = .200
SW 8 8/30 = .267

Total 30 1
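The tabulation can be checked with a few lines of Python. This is only a sketch: the list below is typed in from Table 1.1 in customer order, and the variable names are illustrative.

```python
from collections import Counter

# Responses of customers 1 to 30, copied from Table 1.1.
residences = (
    "NW SE SE NW SW NW NE SW NW SE "
    "NW SE SW NW SW NE NE NW NW SW "
    "NE NW SW SE SW NW NW SE NE SW"
).split()

n = len(residences)                       # sample size
frequency = Counter(residences)           # f_i for each category
relative_frequency = {category: f / n for category, f in frequency.items()}

for category in sorted(frequency):
    print(f"{category}  {frequency[category]:2d}  {relative_frequency[category]:.3f}")
```

The frequencies sum to n = 30 and the relative frequencies sum to 1, as the text requires.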

A common means of graphically presenting the frequencies or relative frequencies for
qualitative data is the bar chart. For this type of chart, the frequencies (or relative
frequencies) are represented by bars, one bar for each category.

The height of the bar for a given category is proportional to the category frequency (or
relative frequency). Usually the bars are placed in a vertical position with the base of the
bar on the horizontal axis of the graph. The order of the bars on the horizontal axis is
unimportant. Both a frequency bar chart and a relative frequency bar chart for the
customer’s residence are shown in Figure 1.1.

Figure 1.1 (a) A frequency bar chart and (b) a relative frequency bar chart of the
customers’ residential quadrants (NE, NW, SE, SW).

(b) The Pie Chart

The second method of describing qualitative data sets is the pie chart. This is
often used in newspaper and magazine articles to depict budgets and other
economic information. A complete circle (the pie) represents the total number of
measurements. This is partitioned into a number of slices, with one slice for each
category, and the size of each slice is proportional to the relative frequency of
its category. For example, since a complete circle spans 360°, if the relative
frequency for a category is .30, the slice assigned to that category is 30% of 360°,
or (.30)(360) = 108°.

Figure 1.2 The portion of a pie chart corresponding to a relative frequency of .3.
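The same calculation gives the slice for every category. The sketch below applies it to the quadrant frequencies of Table 1.2; the function name `slice_angle` is a hypothetical helper introduced only for this illustration.

```python
FULL_CIRCLE = 360  # degrees in a complete pie

def slice_angle(relative_frequency):
    """Central angle, in degrees, of the pie slice for one category."""
    return relative_frequency * FULL_CIRCLE

# Frequencies of the four quadrants from Table 1.2 (n = 30).
frequencies = {"NE": 5, "NW": 11, "SE": 6, "SW": 8}
n = sum(frequencies.values())

angles = {category: slice_angle(f / n) for category, f in frequencies.items()}

print(round(slice_angle(0.30)))   # the .30 example from the text
for category, angle in angles.items():
    print(category, round(angle, 1))
```

Because the relative frequencies sum to 1, the slice angles sum to 360°.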

Graphical Methods for Describing Quantitative Data.

The Frequency Histogram and Polygon.

The histogram (often called a frequency distribution) is the most popular graphical
technique for depicting quantitative data. To introduce the histogram we will use thirty
companies selected randomly from the 1980 Financial Magazine (the top 500 companies
in sales for calendar year 1979). The variable X we will be interested in is the earnings
per share (E/S) for these thirty companies. The earnings per share is computed by
dividing the year’s net profit by the total number of shares of common stock outstanding.
This figure is of interest to the economic community because it reflects the economic
health of the company.

The earnings per share figures for the thirty companies are shown (to the nearest ngwee)
in Table 1.3.

Table 1.3. Earnings per share (E/S) for thirty companies

Company E/S Company E/S Company E/S
1 1.85 11 2.80 21 2.75
2 3.42 12 3.46 22 6.58
3 9.11 13 8.32 23 3.54
4 1.96 14 4.62 24 4.65
5 6.48 15 3.27 25 0.75
6 5.72 16 1.35 26 2.01
7 1.72 17 3.28 27 5.36
8 8.56 18 3.75 28 4.40
9 0.72 19 5.23 29 6.49
10 6.28 20 2.92 30 1.12

How to construct a Histogram

1. Arrange the data in increasing order, from smallest to largest measurement.

2. Divide the interval from the smallest to the largest measurement into between five
and twenty equal sub-intervals, making sure that:

a) Each measurement falls into one and only one measurement class.

b) No measurement falls on a measurement class boundary.

Use a small number of measurement classes if you have a small amount of
data; use a larger number of classes for a large amount of data.

3. Compute the frequency (or relative frequency) of measurements in each
measurement class.

4. Using a vertical axis of about three-fourths the length of the horizontal axis, plot
each frequency (or relative frequency) as a rectangle over the corresponding
measurement class.

Since the number of measurements, n = 30, is not large, we will use six classes to
span the distance between the smallest measurement, 0.72, and the largest
measurement, 9.11. This distance divided by 6 is equal to

$$\frac{\text{largest measurement} - \text{smallest measurement}}{\text{number of intervals}} = \frac{9.11 - 0.72}{6} \approx 1.4$$

By locating the lower boundary of the first class interval at 0.715 (slightly below the
smallest measurement) and adding 1.4, we find the upper boundary to be 2.115. Adding
1.4 again, we find the upper boundary of the second class to be 3.515. Continuing this
process, we obtain the six class intervals shown in the table below. Note that each
boundary falls on a 0.005 value (one decimal place more than the measurements), which
guarantees that no measurement will fall on a class boundary.

The next step is to find the class frequencies and calculate the class relative frequencies.

Class  Measurement class  Class frequency  Class relative frequency
1      0.715 – 2.115      8                8/30 = .267
2      2.115 – 3.515      7                7/30 = .233
3      3.515 – 4.915      5                5/30 = .167
4      4.915 – 6.315      4                4/30 = .133
5      6.315 – 7.715      3                3/30 = .100
6      7.715 – 9.115      3                3/30 = .100

Total                     30               1.00

Table 1.4
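The class boundaries and frequencies can be verified with a short Python sketch. The data are typed in from the earnings per share table in company order; the names `LOWER`, `WIDTH` and `boundaries` are illustrative.

```python
# Earnings per share for companies 1 to 30.
earnings = [
    1.85, 3.42, 9.11, 1.96, 6.48, 5.72, 1.72, 8.56, 0.72, 6.28,
    2.80, 3.46, 8.32, 4.62, 3.27, 1.35, 3.28, 3.75, 5.23, 2.92,
    2.75, 6.58, 3.54, 4.65, 0.75, 2.01, 5.36, 4.40, 6.49, 1.12,
]

NUM_CLASSES = 6
LOWER = 0.715   # slightly below the smallest measurement
WIDTH = 1.4     # (9.11 - 0.72) / 6, rounded to one decimal place

# Boundaries 0.715, 2.115, ..., 9.115; rounding hides floating-point noise.
boundaries = [round(LOWER + k * WIDTH, 3) for k in range(NUM_CLASSES + 1)]

# Count the measurements that fall strictly inside each class.
frequencies = [
    sum(low < x < high for x in earnings)
    for low, high in zip(boundaries, boundaries[1:])
]
relative = [f / len(earnings) for f in frequencies]

for (low, high), f, r in zip(zip(boundaries, boundaries[1:]), frequencies, relative):
    print(f"{low:.3f} - {high:.3f}  {f}  {r:.3f}")
```

No measurement is lost at a boundary, because every boundary carries one more decimal place than the data.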

Definition

The class frequency for a given class, say class i, is equal to the total number of
measurements that fall in that class. The class frequency for class i is denoted by the
symbol $f_i$.

Definition

The class relative frequency for a given class, say class i, is equal to the class frequency
divided by the total number n of measurements, i.e.

$$\text{Relative frequency for class } i = \frac{f_i}{n}$$

Figure: (a) Frequency histogram and (b) relative frequency histogram of earnings per
share, with class boundaries at 0.715, 2.115, 3.515, 4.915, 6.315, 7.715 and 9.115.

Cumulative Frequency Distribution

It is often useful to know the number or the proportion of the total number of
measurements that are less than or equal to those contained in a particular class. These
quantities are called the class cumulative frequency and the class cumulative relative
frequency respectively.

For example, if the classes are numbered from the smallest to the largest values of x, 1, 2,
3, 4, . . . , then the cumulative frequency for the third class would equal the sum of the
class frequencies corresponding to classes 1, 2, and 3.

$$\text{Cumulative frequency for class 3} = f_1 + f_2 + f_3$$

Similarly,

$$\text{cumulative relative frequency for class 3} = \frac{f_1 + f_2 + f_3}{n}$$

where n is the total number of measurements in the sample.

Cumulative frequencies and cumulative relative frequencies for the earnings per share data.

Class  Measurement class  Class frequency  Cumulative frequency  Class relative frequency  Cumulative relative frequency
1      0.715 – 2.115      8                8                     8/30 = .267               8/30 = .267
2      2.115 – 3.515      7                (8 + 7) = 15          7/30 = .233               15/30 = .500
3      3.515 – 4.915      5                (15 + 5) = 20         5/30 = .167               20/30 = .667
4      4.915 – 6.315      4                (20 + 4) = 24         4/30 = .133               24/30 = .800
5      6.315 – 7.715      3                (24 + 3) = 27         3/30 = .100               27/30 = .900
6      7.715 – 9.115      3                (27 + 3) = 30         3/30 = .100               30/30 = 1.00

Total                     30
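The running totals in the table are simple to reproduce. A minimal sketch, starting from the six class frequencies already found (variable names are illustrative):

```python
from itertools import accumulate

class_frequencies = [8, 7, 5, 4, 3, 3]   # class frequencies for classes 1 to 6
n = sum(class_frequencies)               # total number of measurements

# Running total of the class frequencies: f1, f1 + f2, f1 + f2 + f3, ...
cumulative = list(accumulate(class_frequencies))
cumulative_relative = [c / n for c in cumulative]

for number, (c, r) in enumerate(zip(cumulative, cumulative_relative), start=1):
    print(number, c, f"{r:.3f}")
```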

Figure: Cumulative relative frequency distribution for the earnings per share data
(horizontal axis: earnings per share; vertical axis: cumulative relative frequency).

Learning Objective

After working through this chapter you should be able to:

• Draw a pie chart and a bar chart, and construct frequency tables, relative
frequency tables, and histograms.

• Interpret the diagrams. You will understand the importance of captions, axis
labels and graduation of axes.

CHAPTER 3

DESCRIPTIVE MEASURES

Reading

Newbold Chapter 2

Wonnacott and Wonnacott Chapter 2

Tailoka Frank P. Chapter 4

James T McClave , Lawrence Lapin L and P George Benson Chapter 3

Introductory Comments

This chapter contains themes which allow one to see easily the most important
characteristics of data. The idea is to find simple numbers, such as the mean and the
variance, which summarize those characteristics.

3. Numerical Description of Data.

The Mode; A measure of Central tendency.

Definition.

The mode is the measurement that occurs with the greatest frequency in the data set.
Because it emphasizes data concentration, the mode has application in marketing
as well as in the description of large data sets collected by state and federal agencies.

Unless the data set is rather large, the mode may not be very meaningful. For
example, consider the earnings per share measurements for the thirty
companies we used in the previous chapter. If you were to re-examine these data,
you would find that none of the thirty measurements is duplicated in this sample.
Thus, strictly speaking, all thirty measurements are modes for this sample.
Obviously, this information is of no practical use for data description. We can
calculate a more meaningful mode by constructing a relative frequency histogram
for the data. The interval containing the most measurements is called the modal
class, and the mode is taken to be the midpoint of this class interval.

The modal class, the one corresponding to the interval 0.715 – 2.115, lies to the left side
of the distribution. The mode is the midpoint of this interval; that is,

$$\text{Mode} = \frac{0.715 + 2.115}{2} = 1.415$$

In the sense that the mode measures data concentration, it provides a measure of central
tendency of the data.

The Arithmetic mean

A measure of Central Tendency

The most popular and best understood measure of central tendency for a quantitative
data set is the arithmetic mean (or simply the mean).

Definition

The mean of a set of quantitative data is equal to the sum of the measurements divided by
the number of measurements contained in the data set. The mean of a sample is denoted
by $\bar{x}$ (read “x bar”), and we represent the formula for its calculation as follows:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example 1

Calculate the mean of the following five sample measurements: 5, 3, 8, 5, 6.

Solution

Using the definition of the sample mean and the shorthand notation, we find

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{5 + 3 + 8 + 5 + 6}{5} = \frac{27}{5} = 5.4$$

The mean of this sample is 5.4.

The sample mean will play an important role in accomplishing our objective of making
inferences about populations based on sample information. For this reason it is important
to use a different symbol when we want to discuss the mean of a population of
measurements, i.e. the mean of the entire set of measurements in which we are interested.
We use the Greek letter μ (“mu”) for the population mean.

The Median: Another measure of Central Tendency

The median of a data set is the number such that half the measurements fall below the
median and half fall above. The median is of most value in describing large data sets. If
the data set is characterized by a relative frequency histogram, the median is the point on
the x-axis such that half the area under the histogram lies above the median and half lies
below. For a small, or even a large but finite, number of measurements, there may be
many numbers that satisfy the property indicated in the figure on the next page. For this
reason, we will adopt the following (arbitrary) convention for calculating the median of a
data set.

Calculating a median

1. If the number n of measurements in a data set is odd, the median is the middle
number when the measurements are arranged in ascending (or descending) order.

2. If the number n of measurements is even, the median is the mean of the two
middle measurements when the measurements are arranged in ascending (or
descending) order.

Example 2

Consider the following sample of n = 7 measurements.

5, 7, 4, 5, 20, 6, 2

a) Calculate the median of this sample

b) Eliminate the last measurement (the 2) and calculate the median of the remaining
n = 6 measurements.

Solution

a) The seven measurements in the sample are first arranged in ascending order

2, 4, 5, 5, 6, 7, 20

Since the number of measurements is odd, the median is the middle measurement.
Thus, the median of this sample is 5.

b) After removing the 2 from the set of measurements, we arrange the sample
measurements in ascending order as follows:

4, 5, 5, 6, 7, 20

Now the number of measurements is even, and so we average the middle two
measurements. The median is (5 + 6)/2 = 5.5.
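Examples 1 and 2 can be checked with Python's standard `statistics` module, whose `median` function follows the same odd/even convention described above. The variable names are illustrative.

```python
from statistics import mean, median

# Example 1: mean of five sample measurements.
example_1_mean = mean([5, 3, 8, 5, 6])

# Example 2: median of n = 7 measurements (odd, so the middle value).
sample = [5, 7, 4, 5, 20, 6, 2]
median_all = median(sample)

# Drop the last measurement (the 2); n = 6 is even, so average the middle two.
median_without_last = median(sample[:-1])

print(example_1_mean, median_all, median_without_last)
```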
Comparing the mean and the median

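1. If the median is less than the mean, the data set is skewed to the right.

Figure: Relative frequency distribution showing rightward skewness, with the median
to the left of the mean (horizontal axis: measurement units).

A numerical measure of skewness is

$$\text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \approx \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

2. The median will equal the mean when the data set is symmetric.

Figure: Symmetric relative frequency distribution, with the median and the mean at
the same point (horizontal axis: measurement units).

3. If the median is greater than the mean, the data set is skewed to the left.

Figure: Relative frequency distribution showing leftward skewness, with the mean to
the left of the median.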

The range: A measure of variability

Measures of Variation

Definition:

The range of a data set is equal to the largest measurement minus the smallest measurement.
When dealing with grouped data, there are two procedures which may be adopted for
determining the range.

1. Range = class mark of highest class – class mark of lowest class.


2. Range = upper class boundary of highest class – lower class boundary of lowest
class.

Variance and Standard Deviation

The sample variance for a sample of n measurements is equal to the sum of the squared
distances from the mean divided by (n − 1). In symbols, using $S^2$ to represent the
sample variance,

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The second step in finding a meaningful measure of data variability is to calculate the
standard deviation of the data set.

The sample standard deviation, S, is defined as the positive square root of the sample
variance, $S^2$. Thus,

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The corresponding quantity, the population standard deviation, measures the variability of
the measurements in the population and is denoted by σ (“sigma”). The population
variance will therefore be denoted by $\sigma^2$.

Example 3

Calculate the standard deviation of the following sample: 2, 3, 3, 3, 4.

Solution

For this set of data, $\bar{x} = 3$. Then

$$S = \sqrt{\frac{(2-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2 + (4-3)^2}{5 - 1}} = \sqrt{\frac{2}{4}} = \sqrt{0.5} \approx 0.71$$

Shortcut formula for sample variance

$$S^2 = \frac{(\text{sum of squares of sample measurements}) - \dfrac{(\text{sum of sample measurements})^2}{n}}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

Example 4

Use the shortcut formula to compute the variances of these two samples of five
measurements each.

Sample 1: 1, 2, 3, 4, 5    Sample 2: 2, 3, 3, 3, 4

Solution

We first work with sample 1. The quantities needed are:

$$\sum_{i=1}^{5} x_i = 1 + 2 + 3 + 4 + 5 = 15$$

and

$$\sum_{i=1}^{5} x_i^2 = 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = 1 + 4 + 9 + 16 + 25 = 55$$

Then the variance for sample 1 is

$$S^2 = \frac{\sum_{i=1}^{5} x_i^2 - \dfrac{\left(\sum_{i=1}^{5} x_i\right)^2}{5}}{5 - 1} = \frac{55 - \dfrac{(15)^2}{5}}{4} = \frac{55 - 45}{4} = \frac{10}{4} = 2.5$$

Similarly, for sample 2 we get

$$\sum_{i=1}^{5} x_i = 2 + 3 + 3 + 3 + 4 = 15$$

and

$$\sum_{i=1}^{5} x_i^2 = 2^2 + 3^2 + 3^2 + 3^2 + 4^2 = 4 + 9 + 9 + 9 + 16 = 47$$

Then the variance for sample 2 is

$$S^2 = \frac{\sum_{i=1}^{5} x_i^2 - \dfrac{\left(\sum_{i=1}^{5} x_i\right)^2}{5}}{5 - 1} = \frac{47 - \dfrac{(15)^2}{5}}{4} = \frac{47 - 45}{4} = \frac{2}{4} = 0.5$$
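The shortcut formula and the definition always give the same variance; the sketch below checks this for the two samples of Example 4. The function names are hypothetical helpers written for this illustration.

```python
def shortcut_variance(sample):
    """Sample variance from the sum and the sum of squares (shortcut formula)."""
    n = len(sample)
    total = sum(sample)
    total_of_squares = sum(x * x for x in sample)
    return (total_of_squares - total ** 2 / n) / (n - 1)

def definitional_variance(sample):
    """Sample variance as the sum of squared distances from the mean over n - 1."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / (n - 1)

sample_1 = [1, 2, 3, 4, 5]
sample_2 = [2, 3, 3, 3, 4]

for sample in (sample_1, sample_2):
    print(shortcut_variance(sample), definitional_variance(sample))
```

Both samples have the same mean, 3, but sample 1 is more spread out, which is reflected in its larger variance.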

Example 5

The earnings per share measurements for thirty companies selected randomly from 1980
Financial/Daily mail are listed here. Calculate the sample variance, $S^2$, and the standard
deviation, S, from these measurements.

1.85 5.72 2.80 1.35 2.75 2.01
3.42 1.72 3.46 3.28 6.58 5.36
9.11 8.56 8.32 3.75 3.54 4.40
1.96 0.72 4.62 5.23 4.65 6.49
6.48 6.28 3.27 2.92 0.75 1.12

Solution

The calculation of the sample variance, $S^2$, would be very tedious for this example if we
tried to use the formula

$$S^2 = \frac{\sum_{i=1}^{30} (x_i - \bar{x})^2}{30 - 1}$$

because it would be necessary to compute all thirty squared distances from the mean.
However, for the shortcut formula we need only compute

$$\sum_{i=1}^{30} x_i = 1.85 + 3.42 + \cdots + 1.12 = 122.47$$

and

$$\sum_{i=1}^{30} x_i^2 = (1.85)^2 + (3.42)^2 + \cdots + (1.12)^2 = 657.5239$$

Then

$$S^2 = \frac{\sum_{i=1}^{30} x_i^2 - \dfrac{\left(\sum_{i=1}^{30} x_i\right)^2}{30}}{30 - 1} = \frac{657.5239 - \dfrac{(122.47)^2}{30}}{29} = 5.4331$$

Notice that we retained four decimal places in the calculation of $S^2$ to reduce rounding
errors, even though the original data were accurate to only two decimal places.

The standard deviation is

$$S = \sqrt{S^2} = \sqrt{5.4331} \approx 2.33$$
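With a computer the tedium disappears. A minimal sketch using Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n − 1 as in the text; the data are typed in from the earnings per share table in company order.

```python
from statistics import stdev, variance

earnings = [
    1.85, 3.42, 9.11, 1.96, 6.48, 5.72, 1.72, 8.56, 0.72, 6.28,
    2.80, 3.46, 8.32, 4.62, 3.27, 1.35, 3.28, 3.75, 5.23, 2.92,
    2.75, 6.58, 3.54, 4.65, 0.75, 2.01, 5.36, 4.40, 6.49, 1.12,
]
n = len(earnings)

# Library values (sample variance and sample standard deviation).
s_squared = variance(earnings)
s = stdev(earnings)

# The shortcut formula, written out for comparison.
shortcut = (sum(x * x for x in earnings) - sum(earnings) ** 2 / n) / (n - 1)

print(round(sum(earnings), 2), round(s_squared, 4), round(s, 2))
```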

Interpreting the Standard Deviation

If we are comparing the variability of two samples selected from a population, the sample
with the larger standard deviation is the more variable of the two. Thus, we know how to
interpret the standard deviation on a relative or comparative basis, but we have not
explained how it provides a measure of variability for a single sample.

One way to interpret the standard deviation as a measure of variability of a data set would
be to answer questions such as the following. How many measurements are within 1
standard deviation of the mean? How many measurements are within 2 standard
deviations of the mean? For a specific data set, we can answer the questions by counting
the number of measurements in each of the intervals. However, if we are interested in
obtaining a general answer to these questions, the problem is more difficult. There are
two sets of guidelines to help answer the questions of how many measurements fall within
1, 2, and 3 standard deviations of the mean. The first set, which applies to any sample, is
derived from a theorem proved by the Russian mathematician Chebyshev. The second
set, the Empirical Rule, is based on empirical evidence that has accumulated over time
and applies to samples that possess mound-shaped frequency distributions: those that are
approximately symmetric, with a clustering of measurements about the midpoint of the
distribution (the mean, median and mode should all be about the same), and that tail off
as we move away from the center of the histogram.

Aids to the Interpretation of a Standard Deviation

1. A rule (from Chebyshev’s theorem) that applies to any sample of measurements,
regardless of the shape of the frequency distribution:

a) It is possible that none of the measurements will fall within 1 standard
deviation of the mean ($\bar{x} - S$ to $\bar{x} + S$).

b) At least 3/4 of the measurements will fall within 2 standard deviations of the
mean ($\bar{x} - 2S$ to $\bar{x} + 2S$).

c) At least 8/9 of the measurements will fall within 3 standard deviations of
the mean ($\bar{x} - 3S$ to $\bar{x} + 3S$).

2. A rule of thumb, called the Empirical Rule, that applies to samples with frequency
distributions that are mound-shaped:

a) Approximately 68% of the measurements will fall within 1 standard
deviation of the mean ($\bar{x} - S$ to $\bar{x} + S$).

b) Approximately 95% of the measurements will fall within 2 standard
deviations of the mean ($\bar{x} - 2S$ to $\bar{x} + 2S$).

c) Essentially all the measurements will fall within 3 standard deviations of
the mean ($\bar{x} - 3S$ to $\bar{x} + 3S$).
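The fractions 3/4 and 8/9 in the first rule are both instances of the general bound in Chebyshev’s theorem. It is stated here for reference; the symbol k, the number of standard deviations, is introduced only for this note.

```latex
\text{For any } k > 1:\quad
\text{fraction of measurements within } (\bar{x} - kS,\ \bar{x} + kS) \;\ge\; 1 - \frac{1}{k^2}

k = 2:\quad 1 - \frac{1}{2^2} = \frac{3}{4}
\qquad\qquad
k = 3:\quad 1 - \frac{1}{3^2} = \frac{8}{9}
```

For k = 1 the bound is 1 − 1 = 0, which is why the rule can promise nothing about the interval within 1 standard deviation of the mean.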

Example 6

Refer to the data for earnings per share for thirty companies selected randomly from the
1980 Financial/Daily Mail, for which $\bar{x} = 4.08$ and $S = 2.33$. Calculate the fraction
of the thirty measurements that lie within the intervals $\bar{x} \pm S$, $\bar{x} \pm 2S$, and
$\bar{x} \pm 3S$, and compare the results with those of the Chebyshev and Empirical rules.

Solution

The 1 standard deviation interval around $\bar{x}$ is

$$(\bar{x} - S,\ \bar{x} + S) = (4.08 - 2.33,\ 4.08 + 2.33) = (1.75,\ 6.41)$$

A check of the measurements shows that 19 of the 30 measurements, i.e. approximately
63%, are within 1 standard deviation of the mean. The 2 standard deviation interval

$$(\bar{x} - 2S,\ \bar{x} + 2S) = (4.08 - 4.66,\ 4.08 + 4.66) = (-0.58,\ 8.74)$$

contains 29 measurements, or approximately 97% of the n = 30 measurements. Finally,
the 3 standard deviation interval around $\bar{x}$,

$$(\bar{x} - 3S,\ \bar{x} + 3S) = (4.08 - 6.99,\ 4.08 + 6.99) = (-2.91,\ 11.07)$$

contains all the measurements. These 1, 2 and 3 standard deviation percentages (63, 97,
and 100) agree fairly well with the approximations of 68%, 95% and 100% given by the
Empirical Rule for mound-shaped distributions.
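The counting in this solution can be done mechanically. The sketch below uses the rounded values of the mean and standard deviation quoted in Example 6; `count_within` is a hypothetical helper written for this illustration.

```python
earnings = [
    1.85, 3.42, 9.11, 1.96, 6.48, 5.72, 1.72, 8.56, 0.72, 6.28,
    2.80, 3.46, 8.32, 4.62, 3.27, 1.35, 3.28, 3.75, 5.23, 2.92,
    2.75, 6.58, 3.54, 4.65, 0.75, 2.01, 5.36, 4.40, 6.49, 1.12,
]
x_bar, s = 4.08, 2.33   # rounded values quoted in Example 6

def count_within(data, centre, spread, k):
    """Number of measurements within k standard deviations of the mean."""
    low, high = centre - k * spread, centre + k * spread
    return sum(low <= x <= high for x in data)

counts = [count_within(earnings, x_bar, s, k) for k in (1, 2, 3)]
percentages = [round(100 * c / len(earnings)) for c in counts]
print(counts, percentages)
```

The 2 and 3 standard deviation counts comfortably exceed the Chebyshev minimums of 3/4 and 8/9 of the sample.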

Example 7

The aid for interpreting the value of a standard deviation can be put to an immediate
practical use as a check on the calculation of the standard deviation. Suppose you have a
data set for which the smallest measurement is 20 and the largest is 80. You have
calculated the standard deviation of the data set to be S = 190.

How can you use the Chebyshev or empirical rule to provide a rough check on your
calculated value of S?

Solution

The larger the number of measurements in a data set, the greater will be the tendency for
very large or very small measurements (extreme values) to appear in the data set. But
from the rules, you know that most of the measurements (approximately 95% if the
distribution is mound-shaped) will be within 2 standard deviations of the mean, and
regardless of how many measurements are in the data set, almost all of them will fall within
3 standard deviations of the mean. Consequently, we would expect the range to be between
4 and 6 standard deviations, i.e. between 4S and 6S.

Range = largest measurement − smallest measurement = 80 − 20 = 60.

Figure: The relation between the range and the standard deviation (the interval from
$\bar{x} - 2S$ to $\bar{x} + 2S$ spans a range of 4S).

Then if we let the range equal 6S, we obtain

Range = 6S
60 = 6S
S = 10

Or, if we let the range equal 4S, we obtain a larger (and more conservative) value for S,
namely

Range = 4S
60 = 4S
S = 15

Now you can see that it does not make much difference whether you let the range equal
4S (which is more realistic for most data sets) or 6S (which is reasonable for large data
sets). It is clear that your calculated value, S = 190, is too large, and you should check
your calculations.
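This rough check is easy to automate. The function below is a hypothetical helper, not a standard library routine, and it encodes only the rule of thumb from this example: the range should lie between about 4 and 6 standard deviations.

```python
def plausible_sd_interval(smallest, largest):
    """Rough bounds on S, assuming the range is between 4S and 6S."""
    data_range = largest - smallest
    return data_range / 6, data_range / 4

low, high = plausible_sd_interval(20, 80)
calculated_s = 190   # the suspicious value from Example 7

print(low, high)
print("check your calculations" if not low <= calculated_s <= high else "plausible")
```

A value slightly outside the interval is not necessarily wrong, since the rule is only approximate, but a value more than ten times the upper bound almost certainly signals an arithmetic error.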

Calculating a mean and standard Deviation from Grouped data

If your data have been grouped in classes of equal width and arranged in a frequency
table, you can use the following formulas to calculate $\bar{x}$, $S^2$, and S:

$x_i$ = midpoint of the ith class
$f_i$ = frequency of the ith class
K = number of classes
n = $\sum_{i=1}^{K} f_i$ = total number of measurements

$$\bar{x} = \frac{\sum_{i=1}^{K} x_i f_i}{n}$$

$$S^2 = \frac{\sum_{i=1}^{K} x_i^2 f_i - \dfrac{\left(\sum_{i=1}^{K} x_i f_i\right)^2}{n}}{n - 1}$$

$$S = \sqrt{S^2}$$

Example 8

Compute the mean and standard deviation for the earnings per share data using the
grouping shown in the frequency Table 1.4.

Solution

The six class intervals, midpoints, and frequencies are shown in the accompanying table.

Table 1.4 Earnings per share

Class            Class midpoint  Class frequency $f_i$
0.715 – 2.115    1.415           8
2.115 – 3.515    2.815           7
3.515 – 4.915    4.215           5
4.915 – 6.315    5.615           4
6.315 – 7.715    7.015           3
7.715 – 9.115    8.415           3
                                 n = $\sum f_i$ = 30

$$\bar{x} = \frac{\sum_{i=1}^{K} x_i f_i}{n} = \frac{(1.415)(8) + (2.815)(7) + (4.215)(5) + \cdots + (8.415)(3)}{30} = \frac{120.85}{30} = 4.03$$

$$S^2 = \frac{\sum_{i=1}^{K} x_i^2 f_i - \dfrac{\left(\sum_{i=1}^{K} x_i f_i\right)^2}{n}}{n - 1}$$

We found $\sum_{i=1}^{K} x_i f_i = 120.85$ when we calculated $\bar{x}$; therefore

$$S^2 = \frac{\left[(1.415)^2(8) + (2.815)^2(7) + \cdots + (8.415)^2(3)\right] - \dfrac{(120.85)^2}{30}}{30 - 1} = \frac{646.49875 - 486.82408}{29} = 5.5060$$

$$S = \sqrt{5.5060} \approx 2.35$$

You will notice that the values of $\bar{x}$, $S^2$, and S from the formulas for grouped data
usually do not agree with those obtained for the raw data ($\bar{x} = 4.08$, $S^2 = 5.4331$,
and S = 2.33). This is because we have substituted the value of the class midpoint for each
value of x in a class interval. Only when every value of x in each class is equal to its
respective class midpoint will the formulas for grouped and for ungrouped data give exactly
the same answers for $\bar{x}$, $S^2$, and S. Otherwise, the formulas for grouped data will
give only approximations to these numerical descriptive measures.
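The grouped-data formulas translate directly into code. A minimal sketch using the class midpoints and frequencies of Example 8 (variable names are illustrative):

```python
from math import sqrt

midpoints = [1.415, 2.815, 4.215, 5.615, 7.015, 8.415]   # x_i
frequencies = [8, 7, 5, 4, 3, 3]                          # f_i

n = sum(frequencies)
sum_xf = sum(x * f for x, f in zip(midpoints, frequencies))
sum_x2f = sum(x * x * f for x, f in zip(midpoints, frequencies))

grouped_mean = sum_xf / n
grouped_variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
grouped_sd = sqrt(grouped_variance)

print(round(grouped_mean, 2), round(grouped_variance, 4), round(grouped_sd, 2))
```

The results are close to, but not the same as, the raw-data values, because each measurement has been replaced by its class midpoint.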

Measures of Relative Standing

Descriptive measures of the relationship of a measurement to the rest of the data are
called measures of relative standing.

One measure of relative standing of a particular measurement is its percentile ranking.

Definition

Let $x_1, x_2, \ldots, x_n$ be a set of n measurements arranged in increasing (or decreasing)
order. The pth percentile is a number x such that p% of the measurements fall below the
pth percentile and (100 − p)% fall above it.

For example, if oil company A reports that its yearly sales are in the 90th percentile of all
companies in the industry, the implication is that 90% of all oil companies have yearly
sales less than A’s, and only 10% have yearly sales exceeding company A’s.

[Figure: relative frequency distribution of yearly sales. The area below company A’s sales is .90 and the area above is .10.]

Another measure of relative standing in popular use is the Z-score. The Z-score makes
use of the mean and standard deviation of the data set in order to specify the location of a
measurement.

Definition

The sample Z-score for a measurement x is

$$Z = \frac{x - \bar{x}}{S}$$

The population Z-score for a measurement x is

$$Z = \frac{x - \mu}{\sigma}$$

The Z-score represents the distance between a given measurement x and the mean,
expressed in standard deviation units.

Example 9

Suppose 200 steel workers are selected, and the annual income of each is determined.
The mean and standard deviation are $\bar{x}$ = K14,000 and S = K2,000.

Suppose Chipo’s annual income is K12,000. What is his sample Z-score?

[Figure: Annual income of steel workers, marking $\bar{x} - 3S$ = K8,000, Chipo’s income x = K12,000, $\bar{x}$ = K14,000, and $\bar{x} + 3S$ = K20,000.]

Solution

Chipo’s annual income lies below the mean income of the 200 steel workers.

We compute

$$Z = \frac{x - \bar{x}}{S} = \frac{12000 - 14000}{2000} = -1.0$$

which tells us that Chipo’s annual income is 1.0 standard deviation below the sample
mean; in short, his sample Z-score is -1.0.
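
The same arithmetic can be written as a small function. This is a minimal sketch; the name `z_score` is chosen here for illustration, and the figures are those of Example 9.

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / std_dev


# Example 9: Chipo's income, the sample mean, and the sample standard deviation (Kwacha)
chipo_z = z_score(12_000, 14_000, 2_000)
print(chipo_z)  # -1.0
```

A negative result means the measurement lies below the mean; a positive result means it lies above.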

Example 10

Suppose a female bank executive believes that her salary is low as a result of sex
discrimination. To try to substantiate her belief, she collects information on the salaries
of her male counterparts in the banking business. She finds that their salaries have a mean of
K17,000 and a standard deviation of K1,000. Her salary is K13,500. Does this
information support her claim of sex discrimination?

Solution

The analysis might proceed as follows: First, we calculate the Z-score for the woman’s
salary with respect to those of her male counterparts. Thus

$$Z = \frac{13500 - 17000}{1000} = -3.5$$

The implication is that the woman’s salary is 3.5 standard deviations below the mean of
the male distribution. Furthermore, if a check of the male salary data shows that the
frequency distribution is mound-shaped, we can infer that very few salaries in this
distribution should have a Z-score less than -3, as shown in the figure.

[Figure: Male salary distribution. Relative frequency against salary (K), with the mean at 17,000 and the woman’s salary of 13,500 marked at Z-score = -3.5.]

Therefore, a Z-score of -3.5 represents either a measurement from a distribution different
from the male salary distribution or a very unusual (highly improbable) measurement for
the male salary distribution.
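
To get a feel for how rare such a Z-score is, the sketch below assumes the male salaries are exactly normally distributed. That is a stronger assumption than the mound-shaped condition stated in the text, so the numbers are only indicative. It uses `NormalDist` from Python’s standard `statistics` module.

```python
from statistics import NormalDist

# Example 10 figures (Kwacha): her salary, the male mean, the male standard deviation
z = (13_500 - 17_000) / 1_000

# Assumption: salaries follow a normal distribution, so Z is standard normal
standard_normal = NormalDist()  # mean 0, standard deviation 1

prob_below_minus_3 = standard_normal.cdf(-3)  # proportion with Z-score below -3
prob_below_z = standard_normal.cdf(z)         # proportion with Z-score below hers

print(z)
print(prob_below_minus_3)
print(prob_below_z)
```

Under this assumption, well under 1% of salaries would have a Z-score below -3, and an even smaller proportion would fall below -3.5.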

Well, which of the two situations do you think prevails? Do you think the woman’s
salary is simply an unusually low one in the distribution of salaries, or do you think her
claim of sex discrimination is justified? Most people would probably conclude that
her salary does not come from the male salary distribution.

However, the careful investigator should require more information before inferring sex
discrimination as the cause. We would want to know more about the data collection
technique the woman used, and more about her competence at her job. Other factors,
such as length of employment, should perhaps also be considered in the analysis.

Learning Objectives

After working through this Chapter you should be able to

• Calculate the arithmetic mean, standard deviation, variance, median, and
quartiles for grouped or ungrouped data.

• Explain the use of all the above quantities.

Sample Examination Questions

1. (a) Briefly state, with reasons, the type of chart which would best convey the
information for each of the following:

(i) Students at the University classified by programme of study.

(ii) Members of a professional association classified by age.

(iii) Numbers of cars taxed for 2002, 2003 and 2004 in areas A, B and
C of a city.

(b) The weekly cost (K) of rented accommodation was recorded for 100
students living in an area.

Amount (thousands of Kwacha)    Frequency
0 – 4                           3
5 – 9                           17
10 – 14                         24
15 – 19                         31
20 – 24                         19
25 – 29                         6

(i) Draw a histogram.

(ii) Give the median and the interquartile range.

(iii) Calculate the mean, mode, and standard deviation.

(iv) What conclusions can you draw from the data?

2. The data below are the numbers of cigarettes sold per capita per week in 38 states of
a country.

19.20 26.82 19.24 27.18 25.96 30.14
29.27 21.10 28.91 29.92 29.64 21.94
22.58 29.92 26.91 43.40 30.18 23.86
28.56 24.75 24.32 24.78 22.17
20.96 27.38 24.44 26.89 41.46
21.08 23.57 15.80 32.10 24.44
29.04 31.34 29.60 23.12 17.08

(a) Plot the data using an appropriate graphical method.

(b) Give the mean, the median and the mode.

(c) Assuming this is a normal distribution, and given that the standard deviation of
these figures is 4.387, what proportion of the states would you expect to have
more than 20 cigarettes smoked per capita per week?

(d) How does this compare with the actual situation as shown in the table
above?

3. (a) Briefly state, with reasons, the type of chart which would best convey the
information for each of the following:

(i) A country’s total import of cigarettes by source.

(ii) Students in higher education classified by age.

(iii) Number of students registered for secondary school in years 2001,
2002 and 2003 for areas X, Y, and Z of a country.

(b) The weekly cost (K’000) of rented accommodation was recorded for 40
students living in an area.

35 56 33 30 31 55 29 27
21 32 43 33 29 27 30 29
26 26 27 26 35 32 28 27
31 27 33 24 27 28 33 49
22 19 46 36 26 38 36 55

(i) Summarize the data in a frequency distribution table.

(ii) Calculate the mean and the standard deviation from your frequency
table.

(iii) Plot a histogram for these data. What is the value of the median?

(iv) What conclusions can you draw from these data?

4. (a) For the sample of 25 observations given below, calculate:

(i) The range (ii) The arithmetic mean
(iii) The median (iv) The lower quartile
(v) The upper quartile (vi) The quartile deviation
(vii) The mean deviation (viii) The standard deviation

5 18 29 42 50 61
8 20 33 43 54 63
10 21 35 46 56 67
11 25 39 48 58 69
14

(b) Explain the term ‘measure of dispersion’ and state briefly the advantages and
disadvantages of using the following measures of dispersion:

(i) Range
(ii) Mean deviation
(iii) Standard deviation

5. A machine produces the following number of rejects in each successive period of
five minutes.

20 55 58 40 15 28 21 29 30 17
84 58 7 40 41 67 28 19 26 26
16 25 55 43 22 66 32 29 11 21
26 42 57 73 27 66 7 23 17 35
27 42 13 28 24 37 34 27 24 12

(a) Construct a frequency distribution from these data, using seven class
intervals of equal width.

(b) Using the frequency distribution, calculate:

(i) the mean
(ii) the standard deviation

(c) Briefly explain the meaning of your calculated measures.
