Introduction To Statistics Module
Compiled by:
FEBRUARY, 2013
Statistics for Agribusiness Introduction to Statistics
Table of Contents
3.1. INTRODUCTION TO STATISTICS
3.1.1 INTRODUCTION
3.1.2 OBJECTIVES
3.1.3 SECTIONS
3.1.3.1 INTRODUCTORY CONCEPTS IN STATISTICS
3.1.3.1.1 Definitions of Statistics
3.1.3.1.2 Categories of Statistics
3.1.3.1.3 Functions and scopes of statistics
3.1.3.1.4 Limitations of statistics
3.1.3.2 GRAPHICAL AND NUMERICAL DESCRIPTIVE TECHNIQUES
3.1.3.2.1 Graphical Descriptive Techniques
3.1.3.2.2 Numerical Descriptive Techniques
3.1.3.3 MEASURES OF DISPERSION
3.1.3.3.1 Absolute measures of dispersion
3.1.3.3.2 Relative Measures of Dispersion
3.1.3.4 PROBABILITY THEORIES
3.1.3.4.1 Basic Concepts of Probability
3.1.3.4.2 Definitions and Types of Probability
3.1.3.4.3 Basic Rules of Probability
3.1.3.4.4 Normal Probability Distribution
3.1.3.5 CONCEPTS OF SAMPLING AND THEIR APPLICATIONS
3.1.3.5.1 Basic Concepts of Sampling
3.1.3.5.2 Probability and Non-Probability Sampling Methods
3.1.3.5.3 Sampling Problems (Errors in Sample Survey)
3.1.3.5.4 Sampling Distributions
3.1.3.5.4.1 Sampling Distribution of the Mean
3.1.3.5.4.2 Sampling Distribution of the Proportions
3.1.3.6 STATISTICAL ESTIMATION AND HYPOTHESIS TESTING
3.1.3.6.2 Determining the Sample Size
3.1.3.6.3 Hypothesis Testing
3.1.3.6.3.1 Basic Concepts in Hypothesis Testing
3.1.3.6.3.2 Hypothesis Tests about a Population Mean
Jimma, Haramaya, Hawassa, Ambo, Adama, Samara and Wolaita Sodo Universities
Introduction to Statistics
3.1.1 Introduction
The learning task was designed to equip students with the ability to identify
the importance and application areas of statistics in their field of study;
interpret statistical information, reports, charts and figures; choose
appropriate sampling methods and procedures; explain the basic concepts of
probability distributions and their application; and use estimation and testing
methods for prediction and generalization purposes. In addition, the
learning task attempts to enable students to describe data collection tools and
procedures.
3.1.2 Objectives
Jimma, Haramaya, Hawassa, Ambo, Adama, Bahir Dar, Samara and Wolaita Sodo Universities
3.1.3 SECTIONS
Define statistics. What are the applications of statistics in the sphere of human
activity? What are the limitations of statistics?
Definitions of Statistics
Statistics has been defined differently by different authors over time.
One can find more than a hundred definitions in the literature of statistics.
The word statistics can be used either as a plural or as a singular noun.
When it is used as a plural, it refers to a systematic presentation of facts and figures. It is
in this sense that the majority of people use the word statistics.
When statistics is used as singular, it is defined as the science of collecting,
organizing, presenting, analyzing and interpreting numerical data for useful
purposes. According to this definition, the area of statistics incorporates the
following five elements:
(a) Proper collection of data
The data itself forms the foundation of statistical analysis, and hence the data
must be carefully and accurately collected and accumulated. If the data is faulty,
it will lead to a wrong conclusion. The data may be available from existing
published sources, where it may already be organized into a presentable form, or it
may be collected by the investigators themselves.
(b) Organization and classification of data
The collected data must now be edited in order to correct any inconsistencies,
biases, omissions and irrelevant answers in the survey or any mistakes in the
necessary computations. After editing, the data must be classified in suitable
terms according to some common characteristics of the elements of data. This
makes it easier for presentation.
(c) Presentation of data
The organized data can now be presented in the form of tables or diagrams.
This presentation in an orderly manner facilitates the understanding as well as
the analysis of data.
(d) Analysis of data
The basic purpose of data analysis is to make it useful for certain conclusions.
This analysis may simply be critical observation of data to draw conclusions
about it or it may involve highly complex and sophisticated mathematical
techniques.
(e) Interpretation of the data
Interpretation means drawing conclusions from the data which form the basis of
decision making. Correct interpretation requires a high degree of skill and
experience and is necessary in order to draw valid conclusions.
Categories of Statistics
Statistics can be divided into two broad categories: descriptive statistics
and inferential statistics.
(i) Descriptive statistics
It is a collection of methods that are used to summarize and describe the
important characteristics of a set of measurements (data). As the name
suggests, it merely describes the data and consists of methods used in the
collection, organization, presentation and analysis of the data in order to
describe the various characteristics of such data. The methods can be either
graphical or computational, for example frequency distribution tables, pie charts,
bar graphs and summaries of data.
(ii) Inferential statistics
It is a set of procedures used to make inferences about the population
characteristics from the information contained in the sample. It can be used, for
example, to predict the price of fertilizer in the coming year based on the
sample information in this year, to estimate the effect of the intensity of rainfall
on plant growth, etc. Figure 1.1 below shows the major divisions of statistics.
stock market prices of individual stocks and their trends are highly complex to
comprehend, but a graph of price trends gives us the picture at a glance.
(3) It facilitates comparison of data: Absolute figures by themselves do not
convey much meaning. Statistical devices such as averages,
percentages and ratios are the tools that can be employed for the purpose of
comparison.
(4) It helps in formulating and testing hypotheses for the purpose of
correlation: It helps us establish a relationship between two or more variables.
For example, the degree of association between the extent of training and
productivity can be obtained by using statistical tools.
(5) It helps in predicting future trends: Statistical methods are highly useful
in analyzing past data and predicting future trends. For example, the
sales of a particular product for the next year can be estimated by knowing
the sales volumes for that product for the previous years and the current market
trends and possible changes in the variables that affect the sales.
(6) It helps the central management in formulating policies: Based upon the
forecast of future trends, events or demand, the management can revise their
policies and plans to meet the future needs.
Limitations of statistics
Nowadays statistics has become an inevitable part of our life. Though the use
and application of statistics is vast, it is not easy to collect information or
data, and there is ample room for mistakes at each step of its application.
Besides these, statistics has some limitations, some of which are:
(a) Statistics is not suitable for the study of qualitative phenomena: Since
statistics is basically a science that deals with sets of numerical data, it is
applicable only to those subjects of enquiry which can be
expressed in terms of quantitative measurements. Qualitative
phenomena such as honesty, poverty, beauty and intelligence cannot be expressed
numerically, so statistical analysis cannot be applied to them directly.
(b) Statistics does not study individuals: Statistics does not give any specific
importance to the individual items; in fact it deals with an aggregate of objects.
Individual items, when they are taken individually do not constitute any
statistical data and do not serve any purpose for any statistical enquiry.
(c) Statistical laws are not exact: It is well known that mathematical and
physical sciences are exact. Statistical laws, however, are only
approximations. Statistical conclusions are not universally true; they
are true only on average.
(d) Statistics is liable to be misused: Statistics must be used only by experts;
otherwise, statistical methods are the most dangerous tools in the hands of the
inexpert. The use of statistical tools by inexperienced and untrained persons
might lead to wrong conclusions. Statistics can easily be misused by quoting
wrong figures.
(e) Statistics can analyze only aggregated observations or data: Any
statistic is derived from a collection of data. An individual observation does not constitute
statistics; hence, statistics analyses a collection of data and presents the overall
estimated result. For example, the average income of the laborers of a business
can be estimated by observing their per capita income. The average income does
not single out or neglect anybody's income. In this respect, statistics
gives an overall idea.
(f) Statistical rules are mutable: In some branches of science,
unchangeable principles and data are available, but in statistics they are not
found. The principles of statistics are variable, changeable and approximate.
(g) Statistics is simply a method: A practical problem can be
solved in many ways, and statistics is only one of them. Its
evidence gives a preliminary idea of a matter and needs to be
supported by other observations or data.
Learning activities
1. Which definition of statistics is more important? How?
2. What are the major uses and limitations of statistics?
Continuous Assessment
Question and answering, Quiz
Summary
In this section you have learnt the introductory concepts of statistics and
have acquainted yourself with the definitions of statistics. When statistics is
used as a singular noun, it incorporates five elements: collecting, organizing,
presenting, analyzing and interpreting numerical data. Statistics can also be
categorized into descriptive and inferential statistics. Descriptive statistics is used
for collecting, organizing, summarizing and presenting numerical data, whereas
inferential statistics is used for performing hypothesis tests, determining
relationships between variables, and making predictions. Finally, you have
learnt the major functions and limitations of statistics.
What are the various graphical descriptive techniques? What are the various types
of measures of central tendency? Which descriptive technique is more
important? Why?
The subject area of descriptive statistics includes procedures used to summarize
masses of data and present them in an understandable way. Thus, we can
classify and examine the techniques (or procedures) of descriptive statistics into
two: graphical descriptive techniques (frequency distribution) and numerical
descriptive techniques (measures of central tendency). In this learning task, we
will discuss the various types of graphical descriptive techniques and numerical
descriptive techniques.
class limit (LCL) and upper class limit (UCL). While LCL identifies the
smallest possible data value assigned to the class, the UCL identifies the
largest possible data value assigned to the class.
(d) A class mark (mid-class) – is the average value of the limits of a given class.
The class limits can be either stated or real.
(e) A class width (class interval) – is the difference between successive lower
class limits, or successive upper class limits, or successive class marks, or
the real class limits of a given class.
The three steps of constructing a quantitative frequency distribution are now
discussed briefly as follows:
(1) Determining the number of classes: Classes should be mutually exclusive.
They are formed by specifying values that will be used to group or classify
the elements in the data set. Data sets with a large number of elements
usually require a large number of classes. The objective is to use just
enough classes to capture the inherent variation within the data. Therefore,
as a rule of thumb, we shall use between 5 and 20 classes. In general, a smaller
number of classes is appropriate for small data sets, and a larger number of
classes is appropriate for large data sets. The appropriate number of classes
may be decided by Yule’s formula, which is as follows:
\[ k = 2.5 \times n^{1/4} \]
where k is the number of classes and n is the total number of observations.
From the above example, since the chosen number of classes is 6, the
approximate class width is (46-8)/6 = 6.3 ≈ 6. It is usually more convenient to
round this up (or down) to the nearest integer. The appropriate class
width may be obtained through a trial and error process, which should be
viewed as a feature of developing a frequency distribution with quantitative
data. It is, however, desirable for the class width to be the same in order to
facilitate a meaningful interpretation. For the purpose of this exercise we will
use a common class width of 6.
(3) Determining class limits: Once we decide the number of classes, the class
width and the lower limit of the first class, it will be simple to obtain the
values of the lower and upper class limits of each class. Thus, in the above
example, the first interval ranges from 8 to 14, the second 15 to 21… and the
last class 43 to 49. However, the initial and terminal intervals (or limits) are
again determined by the judgment of the investigator. This then completes
the necessary steps for the construction of quantitative frequency
distributions. The frequency distribution for the data set in our example is as
follows:
Table 2.3: Frequency distribution for audit time in days
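The three steps above can be sketched in Python. This is only an illustration under stated assumptions: the 30 observations are hypothetical (only the minimum of 8, the maximum of 46 and n = 30 follow the example), the number of classes uses Yule's rule in its common form k = 2.5 × n^(1/4), and the class limits follow the example's convention for whole-number data, where a class runs from its lower limit to lower limit + width and the next class starts one unit higher.

```python
# Hypothetical audit times in days (n = 30, minimum 8, maximum 46).
data = [8, 12, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 25, 26,
        27, 28, 29, 30, 31, 32, 33, 35, 36, 38, 40, 41, 43, 45, 46]

n = len(data)

# Step 1: number of classes by Yule's rule, k = 2.5 * n**(1/4), rounded.
k = round(2.5 * n ** 0.25)

# Step 2: approximate class width = range / number of classes, rounded.
width = round((max(data) - min(data)) / k)

# Step 3: class limits for whole-number data. Each class runs from its
# lower limit to lower + width; the next class starts one unit higher.
classes = []
lower = min(data)
for _ in range(k):
    upper = lower + width
    classes.append((lower, upper))
    lower = upper + 1

# Tally the frequencies and compute the class marks (average of the limits).
table = []
for lcl, ucl in classes:
    freq = sum(lcl <= x <= ucl for x in data)
    table.append((lcl, ucl, (lcl + ucl) / 2, freq))

for lcl, ucl, mark, freq in table:
    print(f"{lcl:>2} - {ucl:<2}  class mark = {mark:>4}  frequency = {freq}")
```

With these inputs the sketch reproduces the six classes of the example (8 to 14 up to 43 to 49); the frequencies depend entirely on the hypothetical data.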
2. Classes must be exhaustive (or all inclusive) - they must provide a place to
record every data value in the data set.
3. The number of classes and the class limits must be chosen in a way that
empty classes (i.e., classes with zero frequencies) do not occur.
4. All classes shall have the same interval if possible.
(a) ‘from the top’ cumulative frequencies, which start with the first frequency
from the top; the following frequencies are then added cumulatively downward.
(b) ‘from the bottom’ cumulative frequencies, which start with the first
frequency from the bottom; the following frequencies are then added
cumulatively upward.
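Both kinds of cumulative frequency can be computed with a running total. The sketch below uses hypothetical class frequencies (not the values of Table 2.4) listed from the first class to the last.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed from the first (top) class to the last.
frequencies = [3, 6, 8, 6, 4, 3]

# 'From the top': running total starting at the first class, moving downward.
cum_from_top = list(accumulate(frequencies))

# 'From the bottom': running total starting at the last class, moving upward,
# then reversed so that each value lines up with its own class.
cum_from_bottom = list(accumulate(reversed(frequencies)))[::-1]

print(cum_from_top)
print(cum_from_bottom)
```

The last 'from the top' value and the first 'from the bottom' value both equal the total number of observations, which is a quick check on the table.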
Table 2.4: Cumulative frequency distributions for audit time in days for n = 30
\[ CM_i = \frac{LCL_i + UCL_i}{2} \]
where CM_i = class mark of the ith class, and LCL_i and UCL_i =
the lower and upper class limits of the ith class, respectively. For
example, the CM of the first class (8 to 14) of the frequency distribution in the above
example is:
\[ CM_1 = \frac{8 + 14}{2} = 11 \]
(1) Arithmetic mean: It is also called arithmetic average, or simply the mean,
or the average. It is the most familiar and useful measure of average. It is
computed by dividing the sum of all data values by the total number of
observations in a data set. It provides a measure of a central location. The
arithmetic mean can further be divided into the following three types:
(i) Simple arithmetic mean (SAM): It is the arithmetic mean for ungrouped
data. If X is a variable having n values, x1, x2, …, xn, then its SAM can be
computed as:
\[ \bar{X} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
(iii) Weighted arithmetic mean (WAM): If the frequencies given above are
replaced by weights, the result is called the WAM. If X is a variable having n values
(or class marks), x1, x2, …, xn, with the corresponding weights w1, w2, …, wn,
then its WAM can be computed as:
\[ \bar{X}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
Example 2.5: Suppose profits per order for small, medium and large orders are
Birr 1, 2 and 3, respectively. The simple average profit per order is obtained by
using the formula of SAM. That is,
\[ \bar{X} = \frac{1 + 2 + 3}{3} = 2 \]
Suppose, in the example at hand, the numbers of small, medium and large
orders are 120, 60 and 20, respectively. Then the SAM of the three profits does
not tell us what average profit was actually earned. To find that, each
profit is weighted (or multiplied) by w, the number of orders having this profit.
The actual average profit per order in this case can be obtained by using the
formula of WAM. That is,
\[ \bar{X}_w = \frac{120(1) + 60(2) + 20(3)}{120 + 60 + 20} = \frac{300}{200} = 1.5 \]
Thus, the weighted average profit per order is Birr 1.5. The predominance of
lower profit orders makes the weighted average lower than the simple average.
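The two averages of Example 2.5 can be checked with a few lines of Python. The variable names are illustrative only.

```python
profits = [1, 2, 3]       # Birr per small, medium and large order
orders = [120, 60, 20]    # number of orders of each size (the weights)

# Simple arithmetic mean: every profit level counts equally.
sam = sum(profits) / len(profits)

# Weighted arithmetic mean: each profit level is weighted by its order count.
wam = sum(w * x for w, x in zip(orders, profits)) / sum(orders)

print(sam, wam)
```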
(iv) Geometric mean: If X is a variable having n values, x1, x2, …, xn, then
the geometric mean (GM) of a variable X for ungrouped data is computed as
the nth root of the product of the n values. That is,
\[ GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \]
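For grouped data, suppose X is a variable having k values, x1, x2, …, xk,
occurring with the corresponding frequencies f1, f2, …, fk, then the geometric
mean (GM) of a variable X is computed as:
\[ GM = \left(x_1^{f_1} \cdot x_2^{f_2} \cdots x_k^{f_k}\right)^{1/n}, \quad n = \sum_{i=1}^{k} f_i \]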
Example 2.6: Compute the GM of the data set given in example 2.3.
Solution: We can find the average yearly percent change in fuel expenditures
by using a geometric mean of growth factors as follows:
Step 3: Convert the geometric mean of the growth factors into an average yearly
percentage change.
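The procedure can be sketched as follows. The expenditure figures are hypothetical, since the original data are not shown here, and the wording of steps 1 and 2 is an assumption based on the usual growth-factor method.

```python
from statistics import geometric_mean

# Hypothetical yearly fuel expenditures (illustrative values only).
expenditures = [200.0, 220.0, 253.0, 303.6]

# Step 1 (assumed): growth factor = this year's value / last year's value.
growth_factors = [b / a for a, b in zip(expenditures, expenditures[1:])]

# Step 2 (assumed): geometric mean of the growth factors.
gm = geometric_mean(growth_factors)

# Step 3: convert the mean growth factor into an average yearly percent change.
avg_pct_change = (gm - 1) * 100

print(f"{avg_pct_change:.2f}%")
```

A useful check is that growing the first value by the mean growth factor once per year reproduces the last value, which the arithmetic mean of the growth factors would not do in general.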
(v) Harmonic Mean (HM): Harmonic mean is the reciprocal of the arithmetic
mean of the reciprocals. If X is a variable having n values, x1, x2, …, xn, then
the HM of a variable X for ungrouped data is computed as:
\[ HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \]
For grouped data, suppose X is a variable having k values, x1, x2, …, xk,
occurring with the corresponding frequencies f1, f2, …, fk, then the HM of a
variable X is computed as:
\[ HM = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}, \quad n = \sum_{i=1}^{k} f_i \]
Example 2.8: What will be the HM of the data set given in example 2.3 above?
Xi           3     5     6     6     7     10     12
Reciprocal   1/3   1/5   1/6   1/6   1/7   1/10   1/12
Note that HM ≤ GM ≤ AM, with equality when all the
numbers in the data set are the same. In general, because the mean takes into
account the value of every observation in a sample, it can be greatly distorted
(or affected) by extreme value(s).
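The ordering of the three means can be verified numerically. The sketch below uses the seven values 3, 5, 6, 6, 7, 10, 12 listed for the harmonic mean example and the functions of Python's standard `statistics` module.

```python
from statistics import mean, geometric_mean, harmonic_mean

x = [3, 5, 6, 6, 7, 10, 12]

am = mean(x)             # arithmetic mean
gm = geometric_mean(x)   # nth root of the product of the values
hm = harmonic_mean(x)    # n divided by the sum of the reciprocals

print(f"HM = {hm:.2f}, GM = {gm:.2f}, AM = {am:.2f}")
print(hm <= gm <= am)
```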
(2) Median: The median is the value of the middle observation in a set of
observations (or data), which have been arrayed in order of magnitude. Thus, to
find the median, first put the numbers in ascending or descending order and
then find the middle position in the array. The median is therefore a position
average.
Example 2.9: (a) Suppose there are n=5 data items such as: 1, 5, 4, 1 and 9.
What is the median of this data set? (b) If the data set consists of n=6 such as 1,
1, 4, 8, 6 and 10, then what will be the median?
Solution: In order to find the median of a given data set, we first arrange the
data set in either ascending or descending order.
(a) The data set is then arranged in ascending order as: 1, 1, 4, 5, 9. Since n=5 is
odd, the median is the middle (3rd) value, which is 4.
(b) The data set is arranged in ascending order as: 1, 1, 4, 6, 8, 10. Since n=6 is
even, its median is the average of the two middle values, 4 and 6,
which equals 5, i.e., (4 + 6)/2 = 5.
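Both parts of Example 2.9 can be checked with the `median` function of Python's standard `statistics` module, which sorts the data and applies the odd/even rule described above.

```python
from statistics import median

a = [1, 5, 4, 1, 9]        # n = 5 (odd): the middle value of the sorted data
b = [1, 1, 4, 8, 6, 10]    # n = 6 (even): average of the two middle values

print(sorted(a), median(a))
print(sorted(b), median(b))
```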
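For the case of grouped data (i.e., data in a frequency distribution table), the
median will be approximated as follows.
\[ \text{Median} = L + \left(\frac{\frac{n}{2} - CF}{f}\right) C \]
where L = lower class limit (or boundary) of the median class; CF = cumulative
frequency of the class preceding the median class; f = frequency of the median
class and C = class interval.
The median class of a distribution is the class where the sum of frequencies (or
cumulative frequency) is greater than or equal to n/2 for the first time. In a
frequency distribution, the median has to be interpolated within the class interval
containing the median, assuming the observations are uniformly distributed
within the median class.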
The median is the 50-50 number in a data set, meaning 50% of the data set (or
observations) fall below 6.8 while the remaining 50% fall above 6.8.
The median is resistant to extreme observations, i.e., the value of an outlier does
not affect the median. An outlier, however, has a marked effect upon the arithmetic
mean.
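Other measures of location, computed for a frequency distribution in the same way as the median, are
quartiles, deciles and percentiles. The formulas for quartiles, deciles and
percentiles are:
\[ Q_i = L + \left(\frac{\frac{i\,n}{4} - CF}{f_i}\right) C, \quad i = 1, 2, 3 \]
\[ D_i = L + \left(\frac{\frac{i\,n}{10} - CF}{f_i}\right) C, \quad i = 1, 2, \dots, 9 \]
\[ P_i = L + \left(\frac{\frac{i\,n}{100} - CF}{f_i}\right) C, \quad i = 1, 2, \dots, 99 \]
where, for deciles, L = lower class
limit of the ith decile class; CF = cumulative frequency of the class preceding the ith decile class;
fi = frequency of the ith decile class and C = class interval. The symbols are defined in the same way for quartiles and percentiles.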
In general, we should first find the values of in/4, in/10 and in/100 in order to locate the
ith quartile, the ith decile and the ith
percentile class, respectively. Note that median = Q2 = D5 = P50.
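The interpolation behind the median, quartiles, deciles and percentiles of a frequency distribution is the same calculation with a different target position, so one function can cover all of them. The sketch below is a minimal illustration; the function name `grouped_quantile` and the frequency table are hypothetical and are not the data of the worked example.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """Interpolated quantile of a frequency distribution.

    fraction is the share of observations below the quantile,
    e.g. 0.5 for the median, 0.25 for Q1, 0.8 for D8, 0.2 for P20.
    """
    n = sum(frequencies)
    target = fraction * n
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cf_before + f >= target:
            # First class whose cumulative frequency reaches the target.
            return lower + (target - cf_before) / f * width
        cf_before += f
    raise ValueError("fraction must be between 0 and 1")


# Hypothetical distribution: classes starting at 0, 10, 20, 30 with width 10.
lower_limits = [0, 10, 20, 30]
frequencies = [5, 10, 20, 5]

median = grouped_quantile(lower_limits, frequencies, 10, 0.5)
q1 = grouped_quantile(lower_limits, frequencies, 10, 1 / 4)
d8 = grouped_quantile(lower_limits, frequencies, 10, 8 / 10)
print(median, q1, d8)
```

Calling the function with 2/4, 5/10 or 50/100 gives the same value, which is the identity median = Q2 = D5 = P50 noted above.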
Solution: First we should locate the first quartile (Q1) class. In order to get the Q1
class, we first find the value of n/4, which equals 20. Since the
cumulative frequency (CF) equals 20 for the first time at the second class
(7.00 – 9.99), it is the Q1 class. Once we have the Q1 class, we can compute the
value of Q1 by using the formula.
Interpretation: The output per labor-hour for 25% of the total working days
lies below 6.995 while that for the remaining 75% lies above it. Since all the observed values in
the Q1 class are included in the computation of Q1, the value of Q1
equals the upper boundary of the Q1 class.
To get D8, we first find the value of 8n/10, which equals 8(80)/10 = 64. Since
the CF of 70 exceeds 64 for the first time at the 5th class (16.00 – 18.99), it is the
8th decile (D8) class.
Interpretation: The output per labor-hour for 80% of the total working days lie
below 17.61 while the rest 20% fall above it.
It implies that the output per labor-hour for 20% of the total working days lie
below 9.195 while the rest 80% fall above it.
(3) Mode (Mo): The mode is defined as the value that occurs most frequently in a
data set. A data set in which each value occurs only once, or in which each value
occurs with the same frequency, has no mode (a characteristic that is
not possible for the other measures of central tendency). Moreover, the mode
is not necessarily unique. A frequency distribution with one mode is called
unimodal, with two modes bimodal and with more than two modes called
multimodal frequency distribution. It is also the only measure of the average for
variables measured on a nominal scale. In the case of a continuous frequency
distribution, the mode is computed as:
\[ Mo = L + \left(\frac{f - f_1}{2f - f_1 - f_2}\right) C \]
where L is the lower limit (or boundary) of the modal class; f is the frequency of the
modal class; f1 and f2 are the frequencies of the classes preceding and
succeeding the modal class, respectively; and C is the class interval.
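The interpolation for the mode of a frequency distribution can be sketched as below, using the symbols just defined and the standard formula Mo = L + (f − f1) / (2f − f1 − f2) × C. The function name and the numbers in the call are hypothetical.

```python
def grouped_mode(lower, f, f1, f2, width):
    """Mode of a frequency distribution, interpolated within the modal class.

    lower: lower limit (or boundary) of the modal class
    f:     frequency of the modal class
    f1:    frequency of the class preceding the modal class
    f2:    frequency of the class succeeding the modal class
    width: class interval
    """
    return lower + (f - f1) / ((f - f1) + (f - f2)) * width


# Hypothetical modal class starting at 20 with frequency 20,
# neighbouring frequencies 10 (before) and 5 (after), class interval 10.
print(grouped_mode(lower=20, f=20, f1=10, f2=5, width=10))
```

The mode is pulled toward the neighbouring class with the larger frequency; when f1 = f2 it falls at the middle of the modal class.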
If the distribution is moderately asymmetrical, the mean, median and mode
obey the empirical relationship:
\[ \text{Mean} - \text{Mode} = 3\,(\text{Mean} - \text{Median}) \]
The modal class of the distribution is also the 3rd class (i.e., 7 – 9). Thus,
Comparing Mean, Median and Mode: The values of the mean, median and
mode computed from the data set in example 2.4 are 6.53, 6.8, and 7.32,
respectively. The order mean < median < mode indicates that the
distribution of this data set is left skewed (Panel a, Fig 3 below). The opposite
order (i.e., mean > median > mode) would indicate a right skewed distribution
(Panel b, Fig 3). If mean = median = mode, it is a symmetrical (or normal)
distribution (Panel c, Fig 3). Both left and right skewed distributions are also
called asymmetric distributions.
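The comparison rule can be written as a small helper. The function name is hypothetical, and the rule is only the rough ordering check described above, not a formal measure of skewness.

```python
def skew_direction(mean, median, mode):
    """Classify the shape of a distribution from the order of its averages."""
    if mean < median < mode:
        return "left skewed"
    if mean > median > mode:
        return "right skewed"
    if mean == median == mode:
        return "symmetrical"
    return "undetermined"  # the three averages follow no consistent order


# Values reported for the data set of example 2.4.
print(skew_direction(6.53, 6.8, 7.32))
```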
Learning activities
Continuous assessment
Test and Quiz
Individual assignment on constructing frequency distribution tables and
computing measures of central tendency and dispersion of a given data set.
Summary
What is the importance of measuring the variability of a data set? What are the
commonly used measures of dispersion in statistics?
Absolute measures of dispersion
Absolute measures of dispersion consist of range, inter-quartile range, mean
(absolute) deviation, variance, standard deviation and standard error of mean.
The inter-quartile range (IQR) is the difference between the third (Q3) and the first (Q1) quartiles. It shows
the width of the interval which contains the middle half of the data set.
IQR= Q3 – Q1
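The mean (absolute) deviation (MD) is the arithmetic mean of the absolute deviations of each observation from the
mean. It measures the average departure of each observation from the mean.
For ungrouped and grouped data, respectively:
\[ MD = \frac{\sum_{i=1}^{n} |x_i - \bar{X}|}{n} \quad \text{and} \quad MD = \frac{\sum_{i=1}^{k} f_i\,|X_i - \bar{X}|}{n} \]
where n is the sample size and the deviation may be from the mean or the median; k =
number of classes; fi = frequency of the ith class; Xi = class mark of the ith class.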
The difference of each observation from the mean is called a deviation. It shows
how much a number varies from the mean.
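Sample variance: \[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} \]
Population variance: \[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]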
Example 3.1: Find the mean and variance of the following data set: 46, 54, 42,
46, 32
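Example 3.1 can be worked through with Python's standard `statistics` module. Treating the five values as a sample (divisor n − 1) is an assumption made here; the population variance (divisor n) is computed alongside for comparison.

```python
from statistics import mean, variance, pvariance, stdev

data = [46, 54, 42, 46, 32]

x_bar = mean(data)            # arithmetic mean
s2 = variance(data)           # sample variance, divisor n - 1
sigma2 = pvariance(data)      # population variance, divisor n
s = stdev(data)               # sample standard deviation

# Mean absolute deviation: average distance of the observations from the mean.
md = sum(abs(x - x_bar) for x in data) / len(data)

print(x_bar, s2, sigma2, s, md)
```

The squared deviations from the mean of 44 are 4, 100, 4, 4 and 144, which sum to 256; dividing by n − 1 = 4 gives the sample variance and dividing by n = 5 gives the population variance.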
We can ignore the units of measurement for the moment but emphasize that the
variance is an important summary statistic that captures the degree of dispersion
inherent in a data set. A major use is when dispersion is being compared across
two samples of data. The sampling variance will be seen to play an important
role later in the discussion series when we explore statistical inference and
hypothesis testing.
The standard error of the mean is a measure of the deviations of the sample means from the mean of means of
repeated samples drawn from the same population. It measures how
much the value of the mean may vary from sample to sample taken from the
same population. Equivalently, it is the standard deviation of the distribution of all
possible sample means, if samples of the same size were repeatedly taken from
the same population.
It is estimated as:
\[ SE(\bar{X}) = \frac{S}{\sqrt{n}} \]
This can be proved as follows:
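One standard derivation is sketched below. It assumes (an assumption added here) that the n observations X1, …, Xn are independent draws from a population with variance σ², so that the variance of their sum is the sum of their variances; S then replaces the unknown σ in practice.

```latex
\[
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{n\,\sigma^{2}}{n^{2}}
  = \frac{\sigma^{2}}{n}
\]
\[
\sigma_{\bar{X}} = \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
  \approx \frac{S}{\sqrt{n}}
\]
```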
Two or more groups of data sets may not be comparable on the basis of absolute
measures of dispersion (which express the variation in the same units
as the original data) for the following reasons:
(a) the data sets are measured in different units; and/or
(b) their means are quite different.
Under these two conditions, the relative measures of dispersion provide a better
indicator of variability since they are unit-free and measure the dispersion of a
variable relative to its mean. The relative measures of dispersion consist of
relative range, coefficient of inter-quartile range, coefficient of mean deviation,
and coefficient of variation.
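The coefficient of variation (C.V.) expresses the standard deviation as a percentage of the mean:
\[ C.V. = \frac{S}{\bar{X}} \times 100\% \]
The coefficient of variation is small if the variation is small. Of two
groups, the one with the smaller C.V. is said to be more consistent.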
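Example 3.2: Consider the distribution of the yields (per plot) of two groundnut
varieties. For the first variety, the mean and standard deviation are 82 kg
and 16 kg respectively. For the second variety, the mean and standard deviation
are 55 kg and 8 kg respectively.
For the first variety: C.V. = (16/82) × 100% ≈ 19.5%
For the second variety: C.V. = (8/55) × 100% ≈ 14.5%
The second variety has the smaller C.V., so its yield is the more consistent of the two.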
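The comparison in Example 3.2 can be sketched in Python, taking the coefficient of variation as the standard deviation expressed as a percentage of the mean. The function name is illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


cv_first = coefficient_of_variation(16, 82)    # first groundnut variety
cv_second = coefficient_of_variation(8, 55)    # second groundnut variety

# The variety with the smaller C.V. has the more consistent yield.
more_consistent = "first" if cv_first < cv_second else "second"

print(f"{cv_first:.1f}%  {cv_second:.1f}%  more consistent: {more_consistent}")
```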
Learning activities
1. Calculate the S.D. for the yields (in kg per plot) of a cotton variety recorded
from seven plots: 5, 6, 7, 7, 9, 4, and 5.
2. If the mean wage rate for production employees is $3.50 per hour with a
standard deviation of $0.20, what would be the effect on the mean and S of raising
all rates by 10%?
3. Discuss why the relative measures of dispersion are better indicators of
variability than the absolute measures if one needs to compare the dispersions
of two or more different groups of data sets, which are measured in different
units and/or whose means are quite different.
Continuous assessment
Test and Quiz, and individual assignment on selecting representative samples
that can cover the variability of the population under study through using the
appropriate sampling methods
Summary
Measures of central tendency are not adequate to describe a given data set as
they do not provide any information about the spread of the data set (or
observations) about the center and from each other. Measures of
dispersion are therefore used to describe the amount of scatter
around the center of the distribution and from each other. If there were no
variability within a population, there would be no need for statistics, since a
single item or sampling unit would tell us all that we need to know about the
population as a whole. Hence, in order to learn about the population from
the sample, the average alone is not enough. The importance of measuring
variability is therefore to determine how representative the average is and to
compare the variability of two or more data sets.
(c) An event- is a subset of the sample space. Thus, the events in no. 2 in
example 4.1 above can be:
A = {An even number will turn up} = {2, 4, 6}; B = {An odd number will
turn up} = {1, 3, 5}; C = {A single number will turn up} = {1}, {2}, {3},
{4}, {5}, or {6}, etc.
If an experiment is performed and the outcome is a member of an event in
which we are interested, then we say that the event has occurred. Otherwise,
it has not occurred. Let A = {2, 4, 6} be an event of the random experiment of
throwing a fair die. If 2, 4 or 6 appears on a throw of the die, we say event A has
occurred. If 5 appears on a throw, then event A has not occurred.
Thus, we can say an event occurs if the experiment gives rise to an outcome
belonging to the event.
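Because an event is a subset of the sample space, Python sets give a direct way to illustrate these ideas. This is a minimal sketch for the die-throwing experiment; the helper name `has_occurred` is hypothetical.

```python
# Sample space for one throw of a die, and two events defined on it.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}    # an even number turns up
B = {1, 3, 5}    # an odd number turns up


def has_occurred(event, outcome):
    """An event occurs when the observed outcome belongs to the event."""
    return outcome in event


# Every event is a subset of the sample space.
print(A <= sample_space, B <= sample_space)

# A throw showing 4 means A has occurred; a throw showing 5 means it has not.
print(has_occurred(A, 4), has_occurred(A, 5))
```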
(d) A simple event- an event consisting of exactly one element. Thus, the
appearance of 1 in a throw of a die is a simple event, whereas the appearance
of a multiple of 3 is not a simple event. Every (non-empty) event can be written
as a disjoint union of simple events.
(e) Complementary event- All the outcomes in the sample space except those in
the given event. Thus, the complementary event of an event A is the
non-occurrence of event A in the sample space. It is denoted by $A'$ or S-A.
$A'$ contains those elements of the sample space which do not belong to A.
(f)Exhaustive events- Events are said to be (collectively) exhaustive if they
exhaust all the possible outcomes of an experiment. Thus, in the experiment of
tossing two coins, the events (a) two heads, (b) two tails, and (c) one tail, one
head exhaust all the outcomes; hence they are (collectively) exhaustive events.
(g) Equally likely events- Events which have the same probability of
occurrence. Two events A and B are said to be equally likely when there is no
reason to assume that event A occurs more than event B or vice versa. Thus,
p(A) = p(B). For instance, the two events {H} and {T} in the experiment of
tossing an ordinary coin are equally likely.
(h) A joint event- is the occurrence of two or more events in one trial.
(i) Mutually exclusive events- Two events which cannot happen at the same
time. Two events are said to be mutually exclusive if the occurrence of one
event precludes the occurrence of the other. For example, in the experiment
of tossing a coin, the occurrence of both head and tail at the same time is not
possible. Thus, the occurrence of head precludes the occurrence of tail, and vice
versa, i.e., two events A and B are mutually exclusive if they are disjoint
($A \cap B = \emptyset$). Thus, $P(A \cap B) = 0$.
(j) Independent events- Two events are independent if the occurrence of one
has no effect on the occurrence of the other. Thus, two events A and B are said
to be independent if $P(A \cap B) = P(A) \cdot P(B)$. Two events are dependent if the
occurrence of the first event affects the occurrence of the second event in such
a way that its probability is changed.
(k) Sure event- An event A is said to be sure if p(A) = 1. Thus, S is a sure event
as p(S) = 1.
(l) Impossible event- An event A is called impossible if p(A) = 0. Thus, the empty
set is an impossible event since $p(\emptyset) = 0$.
(m) Odds in favor of an event: The odds that an event occurs can be found
using the ratio of the number of ways it can occur to the number of ways it
cannot occur. Let A be an event and let the odds in favor of A (or odds for A)
be ‘a’ to ‘b’ (or a:b).
Thus, odds in favor of A $= \frac{a}{b}$. If the odds in favor of A are a:b, then $P(A) = \frac{a}{a+b}$.
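The conversion between odds and probability can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the die example (a multiple of 3 turning up) is borrowed from the simple-event definition above.

```python
from fractions import Fraction

def odds_to_probability(a, b):
    """Convert odds in favor a:b into the probability a / (a + b)."""
    return Fraction(a, a + b)

def probability_to_odds(p):
    """Convert a probability p into odds in favor, returned as the pair (a, b)."""
    p = Fraction(p)
    ratio = p / (1 - p)
    return ratio.numerator, ratio.denominator

# A multiple of 3 on a fair die: 2 favorable faces (3, 6), 4 unfavorable faces.
p_multiple_of_3 = odds_to_probability(2, 4)
odds_multiple_of_3 = probability_to_odds(Fraction(1, 3))
print(p_multiple_of_3, odds_multiple_of_3)
```

Exact fractions are used instead of floating-point numbers so that the odds come back in lowest terms.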
This implies that the probability of an event happening in the long run is the
ratio of the number of times the event occurred in the past to the total number of
observations.
Subjective probability: Subjective probability is defined as the degree of
belief assigned to the occurrence of an event by a particular individual. This
type of probability uses value judgment based on an educated guess or estimate,
employing opinions and inexact information. Subjective probabilities are also
called assessed probabilities.
Note that whether probabilities are objective or subjective, once they are
established, they are used in the same way according to the basic rules of
probability.
The probability of event B occurring given that event A has already occurred is
read "the probability of B given A" and is written: P(B|A). This is called
conditional probability, i.e., $P(A \cap B) = P(A) \cdot P(B|A)$.
If events are independent, then the probability of them both occurring is the
product of the probabilities of each occurring: $P(A \cap B) = P(A) \cdot P(B)$.
4) Rules for conditional probability: The probability of an event occurring
given that another event has already occurred is called a conditional probability.
The conditional probability of the event A given that event B has occurred is
denoted by P(A|B).
$P(A|B) = \frac{P(A \cap B)}{P(B)}$, provided P(B) > 0.
The probability that event B occurs, given that event A has already occurred is
$P(B|A) = \frac{P(A \cap B)}{P(A)}$, provided P(A) > 0.
Conditional probability is therefore the ratio of joint probability to marginal
probability. Marginal probability refers to the probability of an individual
event on its own.
Examples 4.3
1. The question, "Do you smoke?" was asked of 100 people. Results are shown
in the table.
              Do you smoke?
          Yes     No     Total
Male       19     41       60
Female     12     28       40
Total      31     69      100
(a) What is the probability of a randomly selected individual being a male
who smokes?
Solutions:
(a) This is just a joint probability. The number of "Male and Smoke"
divided by the total = 19/100 = 0.19
(b) This is the total for male divided by the total = 60/100 = 0.60. Since no
mention is made of smoking or not smoking, this is a marginal
probability; it includes all the cases.
(d) This time, you are told that you have a male, so only the "Male" row is
considered (as in stratified sampling). This is a conditional probability:
what is the probability that the male smokes? 19 males smoke out of 60
males, so 19/60 = 0.317.
(e) This time, you are told that you have a smoker and asked to find the
probability that the smoker is also male. This is also a conditional
probability. There are 19 male smokers out of 31 total smokers, so 19/31
= 0.6129.
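The joint, marginal and conditional probabilities in this example can be reproduced directly from the table counts. The short Python sketch below is only an illustration; the helper names are hypothetical and not part of the module.

```python
# Counts from the "Do you smoke?" table in Examples 4.3.
counts = {
    ("Male", "Yes"): 19, ("Male", "No"): 41,
    ("Female", "Yes"): 12, ("Female", "No"): 28,
}
total = sum(counts.values())

def joint(sex, smokes):
    """P(sex and smokes): one cell divided by the grand total."""
    return counts[(sex, smokes)] / total

def marginal_sex(sex):
    """P(sex): a row total divided by the grand total."""
    return sum(n for (s, _), n in counts.items() if s == sex) / total

def marginal_smokes(smokes):
    """P(smokes): a column total divided by the grand total."""
    return sum(n for (_, k), n in counts.items() if k == smokes) / total

p_male_and_smoker = joint("Male", "Yes")
p_male = marginal_sex("Male")
# Conditional probability = joint probability / marginal probability.
p_smoker_given_male = p_male_and_smoker / p_male
p_male_given_smoker = p_male_and_smoker / marginal_smokes("Yes")
print(p_male_and_smoker, p_male, p_smoker_given_male, p_male_given_smoker)
```

Dividing a joint probability by a marginal probability gives the same answer as dividing the cell count by the row or column total, as done in the solutions.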
5) Rule of total probability: Let B and B’ be complementary events and let A
denote an arbitrary event. Then, P(A) = P(AB) + P(AB’). The rule says that the
probability of the event A occurring is the sum of the probabilities of all joint
events in which A occurs.
Remarks:
(i) The probabilities of the events of interest here, P(B) and P(B’), are called
prior probabilities,
(ii) P(B|A) and P(B’|A) are called posterior (revised) probabilities.
(iii) Bayes’ Theorem is important in several fields of application.
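A minimal sketch of how prior probabilities are revised into posterior probabilities, using the rule of total probability for the denominator. All numbers are hypothetical and chosen only for illustration; they do not come from the module.

```python
# Hypothetical inputs (assumptions for illustration only).
p_B = 0.30               # prior probability P(B)
p_A_given_B = 0.90       # P(A|B)
p_A_given_notB = 0.40    # P(A|B')

p_notB = 1 - p_B                      # prior probability P(B')
p_AB = p_B * p_A_given_B              # joint probability P(AB)
p_AnotB = p_notB * p_A_given_notB     # joint probability P(AB')

# Rule of total probability: P(A) = P(AB) + P(AB')
p_A = p_AB + p_AnotB

# Posterior (revised) probabilities: joint probability divided by P(A)
p_B_given_A = p_AB / p_A
p_notB_given_A = p_AnotB / p_A
print(round(p_A, 4), round(p_B_given_A, 4), round(p_notB_given_A, 4))
```

The two posterior probabilities always sum to 1, which is a quick check on the arithmetic.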
6) Counting rules
(a) Basic principle of counting (mn rule): Suppose that two experiments
are to be performed. Then if experiment 1 can result in any one of m possible
outcomes and if, for each outcome of experiment 1, there are n possible
outcomes of experiment 2, then together there are mn possible outcomes of the
two experiments.
(b) Generalized basic principle of counting: If r experiments that are to be
performed are such that the first one may result in any of n1 possible outcomes,
and if for each of these n1 possible outcomes there are n2 possible outcomes of
the second experiment, and if for each of the possible outcomes of the first two
experiments there are n3 possible outcomes of the third experiment, and so on,
then there are a total of $n_1 \cdot n_2 \cdots n_r$ possible outcomes of the r experiments.
(c) Permutations (ordered arrangements): For r ≤ n, a permutation is an
ordering of r objects taken from n distinct objects (order is important). The
number of ways of ordering n distinct objects taken r at a time is given by
$_nP_r = \frac{n!}{(n-r)!}$
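The counting rules can be checked by listing the outcomes explicitly. The sketch below uses Python's standard library; the coin-and-die experiment and the five labelled objects are illustrative assumptions. Combinations (unordered selections) are included because the module relies on them later when counting the possible samples.

```python
from itertools import permutations, product
from math import comb, factorial, perm

# mn rule: a coin toss (m = 2) followed by a die throw (n = 6).
coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
outcomes = list(product(coin, die))
n_outcomes = len(outcomes)                 # m * n

# Permutations of n = 5 distinct objects taken r = 3 at a time.
n, r = 5, 3
n_perm = perm(n, r)                                  # built-in count
n_perm_formula = factorial(n) // factorial(n - r)    # n! / (n - r)!
n_perm_listed = len(list(permutations("ABCDE", r)))  # explicit listing

# Combinations: the same selection when order does not matter.
n_comb = comb(n, r)
print(n_outcomes, n_perm, n_perm_formula, n_perm_listed, n_comb)
```

`math.perm` and `math.comb` require Python 3.8 or later.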
Normal Probability Distribution
The normal distribution is probably the most important distribution in statistics.
It is a probability distribution of a continuous random variable and is often used
to approximate the distribution of discrete random variables as well as the
distribution of other continuous random variables. The basic form of the normal
distribution is that of a bell: it has a single mode and is symmetric about its
central value. The flexibility of the normal distribution is due to the fact that
the curve may be centered over any number on the real line and it may be flat or
peaked to correspond to the amount of dispersion in the values of the random
variable.
Empirical rule: Given a set of measurements x1, x2, . . ., xn whose distribution
is bell shaped, then approximately
1. 68.26% of the measurements lie within one SD of their sample mean: $(\bar{x} - s,\ \bar{x} + s)$
2. 95.44% of the measurements lie within two SDs of their sample mean: $(\bar{x} - 2s,\ \bar{x} + 2s)$
3. 99.73% of the measurements lie within three SDs of their sample mean: $(\bar{x} - 3s,\ \bar{x} + 3s)$
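The percentages in the empirical rule are areas under the normal curve, so they can be recomputed without a table. The sketch below uses the identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z; the function name is hypothetical.

```python
from math import erf, sqrt

def within_k_sd(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))

# Coverage in percent for one, two and three standard deviations.
coverage = {k: round(100 * within_k_sd(k), 2) for k in (1, 2, 3)}
print(coverage)
```

The computed values may differ from the quoted figures in the second decimal place, because the quoted figures are obtained by doubling four-decimal table entries.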
2. If the two numbers have the same sign, then subtract; if they have different
signs, then add. If there is only one z-score, then use the inequality to
determine the second sign (< is negative, and > is positive).
Finding Z-scores from probabilities- This is more difficult and requires you
to use the table inversely. You must look up the area between zero and the
value on the inside part of the table, and then read the z-score from the outside.
Finally, decide if the z-score should be positive or negative, based on whether it
was on the left side or the right side of the mean. Remember, z-scores can be
negative, but areas or probabilities cannot be.
Tabulated Values: Values of P(0 ≤ Z ≤ z) are tabulated in the appendix of any
statistics book.
Critical Values: Zα of the standard normal distribution are given by P(Z ≥ Zα) =
α which is in the tail of the distribution.
Examples 4.4:
1. Find the following probabilities from the standard Z-Table:
a. P(0 ≤ Z ≤ 1) b. P(−1 ≤ Z ≤ 1) c. P(−2 ≤ Z ≤ 2) d. P(−3 ≤ Z ≤ 3)
Solutions: a) 0.3413, b) 0.6826, c) 0.9544, d) 0.9974
The Excel formula for the cumulative probability P(Z ≤ z) is =NORMSDIST(z). For
example, P(0 ≤ Z ≤ 1) can be written as the following Excel formula:
= NORMSDIST(1) - NORMSDIST(0) = 0.841345 - 0.5 = 0.341345
2. Find the Z-score (i.e., z0) from the standard Z-Table such that
a. P(Z > z0) = 0.100 b. P(−z0 < Z < z0) = 0.050 c. P(Z > z0) = 0.025
d. P(Z < −z0) = 0.010
Answer: a) z0 = 1.28, b) z0 = 0.06, c) z0 = 1.96, d) z0 = 2.33
The Excel formula for this is =NORMSINV(p), where p is the cumulative probability
P(Z ≤ z0). For example, for P(Z > z0) = 0.100, = NORMSINV(1-0.100) = 1.28; for
P(−z0 < Z < z0) = 0.050, = NORMSINV(0.5+0.025) = 0.06
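For readers without Excel, the same lookups can be done with Python's standard library. This is a sketch, not part of the module: `NormalDist().cdf` plays the role of NORMSDIST and `NormalDist().inv_cdf` the role of NORMSINV. The variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Areas under the curve (counterpart of NORMSDIST).
p_0_to_1 = Z.cdf(1) - Z.cdf(0)        # P(0 <= Z <= 1)
p_within_1 = Z.cdf(1) - Z.cdf(-1)     # P(-1 <= Z <= 1)

# z-scores from probabilities (counterpart of NORMSINV).
z_upper_100 = Z.inv_cdf(1 - 0.100)         # P(Z > z0) = 0.100
z_upper_025 = Z.inv_cdf(1 - 0.025)         # P(Z > z0) = 0.025
z_central_050 = Z.inv_cdf(0.5 + 0.050 / 2) # P(-z0 < Z < z0) = 0.050

for value in (p_0_to_1, p_within_1, z_upper_100, z_upper_025, z_central_050):
    print(round(value, 4))
```

`statistics.NormalDist` is available from Python 3.8 onward. Results carry more decimals than a printed Z-table, so small rounding differences from table-based answers are expected.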
Standard score (Z-score)- is obtained by the following formula:
$Z = \frac{X - \mu}{\sigma}$
The mean of the standard scores is zero and the standard deviation is 1.
Learning Activities
1. In a sample of 100 plots of land, 25 have red (R), 15 grey (G), 50 black (D)
and 10 others (O) soil type. Set up a frequency distribution and find the
probabilities of the occurrence of the following events:
A = {A plot which has red soil type}
B = {A plot which has red or black soil type}
C = {A plot which has neither red nor black soil type}
2. Suppose a manufacturing firm receives spare parts from two different
suppliers. Currently, 80% of the spare parts are purchased from
Supplier 1 and the remaining 20% from Supplier 2. It is also known that 5%
and 2% of the parts supplied by Supplier 1 and Supplier 2, respectively, are
defective. If the firm uses a defective part in the process of production, the
processing machine will break down and production will stop.
(i) If a firm receives a spare part from one of the two suppliers, what is the
probability of that spare part being: (a) Good? (b) Defective?
(ii) Suppose a firm now receives a defective spare part, what is the
probability that it came from
(a) Supplier 1? (b) Supplier 2?
3. If X is a normal random variable with parameters μ = 3 and σ² = 9, find
a. P (2 < X < 5), b. P(X >0), c. P(X >9).
Continuous Assessment
Test and Quiz, and individual assignment on how to estimate and test
significance of the relationship between different variables
Summary
The theory of probability forms the basis of statistical inference, the drawing
of inferences on the basis of a random sample of data.
Bayes’ theorem provides a formula for calculating a conditional probability.
It forms the basis of Bayesian statistics, allowing us to calculate the
probability of a hypothesis being true, based on the sample evidence and
prior beliefs.
The Normal distribution is appropriate for problems where the random
variable has the familiar bell-shaped distribution. This often occurs when the
variable is influenced by many, independent factors, none of which
dominates the others.
Each of these distributions is actually a family of distributions, differing in
the parameters of the distribution. Both the Binomial and Normal
distributions have two parameters: n and P in the former case, μ and σ² in the
latter. The Poisson distribution has one parameter, its mean μ.
The mean of a random sample follows a normal distribution, because it is
influenced by many independent factors (the sample observations), none of
which dominates in the calculation of the mean. This statement is always true
if the population from which the sample is drawn follows a normal
distribution.
If the population is not normally distributed then the Central Limit Theorem
states that the sample mean is normally distributed in large samples. In this
case ‘large’ means a sample of about 30 or more.
Have you come across the words sampling and population? If yes, please try to
describe what sampling and population are. Why do we need to take a sample
instead of studying the population as a whole?
1. Cost reduction. Sampling reduces the cost of the study. It would be very
expensive to cover the entire population in the study. Usually only governments
can carry out large scale censuses. For example, the population census has been
conducted every 10 years in Ethiopia.
2. Timeliness. It saves time. Census involves a great deal of time to contact the
whole population.
3. The physical impossibility of checking all the items when a population
contains infinitely many members. Thus, it is impossible to generate detailed
Sampling frame: It is also called source list from which sample is to be drawn.
A sampling frame is a list that closely approximates all the elements in the
population. If a sampling frame is not available, you have to prepare it if the
size of the population is small. Such a list should be comprehensive, reliable
and appropriate.
Sampling units: They are the collections of elements which do not overlap and
which exhaust the entire population. A sampling unit may be a geographical area
such as a region, district, village, kebele, etc., or a construction unit such as
a house, flat, etc., or it may be an individual.
Sampling elements: They are the units of analysis (the units of final
observations) or cases in a population. It can be a person, a group, or an
organization that is being measured.
Sampling ratio: It is the ratio of the sample size to the size of the target
population. If, for example, the size of a given population is 20000 and the size
of the sample drawn from this population is 200, then the sampling ratio will be
200 divided by 20000, which equals 0.01 or 1%.
we simply go down the list taking every Kth individual, starting with a randomly
selected one among the first K individuals. Thus, in systematic sampling only the
first unit is selected randomly and the remaining units of the sample are selected
at fixed intervals.
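The selection procedure just described can be sketched as follows. The frame of 200 numbered units, the sample size of 20 and the function name are hypothetical, and the sampling interval is taken as K = N // n, which is a common convention assumed here rather than stated in the module.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Draw a systematic sample of n units from a list (the sampling frame)."""
    rng = random.Random(seed)
    k = len(frame) // n          # sampling interval K
    start = rng.randrange(k)     # random start among the first K units
    return frame[start::k][:n]   # every Kth unit from the random start

frame = list(range(1, 201))      # hypothetical frame of 200 numbered units
sample = systematic_sample(frame, n=20, seed=1)
print(sample)
```

Only `start` is random; once it is chosen, every other unit in the sample is fixed by the interval.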
The sampling ratio and the size of sample elements drawn from each stratum
for question (a) can be computed as follows:
First choose the stratum which has the smallest size and then divide the size
of each stratum by the size of the smallest stratum. Thus, you will get the
following proportions among the three strata:
Protestant : Catholic : Jews = 6 : 3 : 1
These proportions mean that if you draw 1 element from Jews, at the same
time you should draw 3 and 6 elements from Catholic and Protestant,
respectively, to be included in your sample. In this way you can draw a
proportional stratified sample from each stratum. Thus, in order to keep this
proportionality, the sampling ratio will be calculated using these figures.
If you then add these proportions (6+3+1), you will get 10, the value of the
total proportion. Finally, if you divide the smallest proportion, which equals 1,
by the total proportion 10, you will get a sampling ratio of 0.1 for all
strata.
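Applying one sampling ratio to every stratum keeps the sample in the same proportions as the population. The sketch below illustrates this; the stratum sizes (600, 300 and 100) are hypothetical values chosen only to match the 6 : 3 : 1 proportions and are not taken from the example.

```python
# Hypothetical stratum sizes in the ratio 6 : 3 : 1 (assumption, not module data).
strata = {"Protestant": 600, "Catholic": 300, "Jews": 100}
sampling_ratio = 0.1

# Number of sample elements drawn from each stratum.
allocation = {name: round(size * sampling_ratio) for name, size in strata.items()}

# The sample keeps the same proportions as the population.
smallest = min(allocation.values())
sample_proportions = {name: n // smallest for name, n in allocation.items()}
print(allocation, sample_proportions)
```

With these assumed sizes the sample contains 100 elements in total, split in the same 6 : 3 : 1 ratio as the strata.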
The elements of the sample drawn from each stratum will be:
The sampling ratio and the size of sample elements drawn from each stratum in
the case of question (b) can be computed as:
The elements of the sample drawn from each stratum will be:
big studies extending to a considerable large geographical area, say the entire
country.
Since a sample has both kinds of error, whereas a census has only
non-sampling error, you might conclude that the advantage really rests with the
census. But the scale of a census makes it difficult to reduce the risk of
non-sampling error. Many sources of bias, for instance management problems,
faulty measurement, and lost or corrupted data, are easier to control in a
tightly constructed sample survey than in a full census. Moreover, sampling
error can be controlled (or at least its extent can be estimated) in sample
surveys. Thus, there are occasions when a sample survey could produce less error
overall than a full census. Occasionally, sample results may not be
representative. A bad sample can be detected by checking whether the sample
approximately matches the percentages for gender, race, educational level, etc.
given by the latest census data. In general, there are two types of errors which
may occur in sample surveys: sampling and non-sampling errors.
(a) Sampling errors: Sampling errors are random variations in the sample
estimates around the true population parameters. For example, while population
mean is fixed for the given population, the sample mean (which estimates the
population mean) will vary from sample to sample. A discrepancy will arise
between the population parameter and sample statistic. The error thus
introduced by this statistical discrepancy is called sampling error.
Sampling errors can be estimated only for probability (or random) samples.
Random sampling allows unbiased estimates of sampling error. The
measurement of sampling errors is known as the precision of the sampling plan.
Thus, there are two major factors that cause sampling errors:
Sample bias: This is caused by the method of sample selection. When the
method of selection is inappropriate, statistical discrepancy between population
parameter and sample statistic will arise and will persist even when the sample
size is large. The sample bias is due to selection of the sample, which does not
truly represent the population from which it is drawn. Errors due to sample bias
could be corrected by the use of proper sampling method.
By chance: Even in the absence of sample bias, discrepancy between
population parameters and sample statistics could occur due to chance, as a
sample will never be the same as the population from which it is drawn and will
never reproduce exactly the characteristics of the population.
(b) Non-sampling errors: These are likely to occur in both sample surveys and
censuses. Some of the non-sampling errors are described as follows:
1. Non-coverage (sampling frame defects): Omission of part of the
intended population. For example, soldiers, students living on campus,
people in hospitals, prisoners, etc. are typically excluded from national
samples.
2. Non-response error: Some people refuse to be interviewed because they
may be ill, too busy, or simply don’t trust the interviewer.
3. Response error: This occurs due to response bias, which is a result of
vague, inaccurate, or wrong answers given by the respondent. For
example, there may be tendency on the part of the respondent to make
Sampling Distributions
The expected value of $\bar{x}$ is simply the population mean and this can be
written more formally as: $E(\bar{x}) = \mu$.
Various random samples can be expected to generate a variety of $\bar{x}$ values. It
can be shown that with simple random sampling, the expression for the standard
deviation of $\bar{x}$ depends on whether the underlying population is finite or
infinite. It is useful to define some notation that will be used
subsequently.
$\sigma_{\bar{x}}$ = the standard deviation of all possible $\bar{x}$ values; $\sigma$ = the standard
deviation of the underlying population; n = the sample size; N = the population
size.
If the population is finite, the standard deviation of $\bar{x}$ is given by:
$\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}} \cdot \frac{\sigma}{\sqrt{n}}$
A B C D E
0 3 6 3 18
(a) What is the population mean (µ)?
(b) What is the population standard deviation ($\sigma$)?
(c) Construct the sampling distribution of the mean ($\bar{x}$) for a sample size of
3.
(d) What is the mean of the sampling distribution of the mean, $\mu_{\bar{x}}$ (i.e., the
mean of means)?
(e) What is the standard deviation of the sampling distribution of the mean
(standard error of the mean, $\sigma_{\bar{x}}$)?
(f) What observations can be made with respect to the population and
sampling distribution?
Solutions:
(a) Population mean (µ): $\mu = \frac{0 + 3 + 6 + 3 + 18}{5} = 6$
(b) Population standard deviation ($\sigma$): $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{198}{5}} \approx 6.29$
(c) The number of samples of size 3 that can be drawn without replacement from
a population of size 5 is $\frac{5!}{3!\,2!} = 10$.
Thus, 10 simple random samples of size 3 can be drawn from our population.
The means for all these possible samples are given in Table 5.1.
Table 5.1 Sample Means ($\bar{x}$) from Population Values
Population values   Samples   Sample values   Sample mean ($\bar{x}$)
A = 0               ABC       0, 3, 6         3
B = 3               ABD       0, 3, 3         2
C = 6               ABE       0, 3, 18        7
D = 3               ACD       0, 6, 3         3
E = 18              ACE       0, 6, 18        8
                    ADE       0, 3, 18        7
                    BCD       3, 6, 3         4
                    BCE       3, 6, 18        9
                    BDE       3, 3, 18        8
                    CDE       6, 3, 18        9
The sampling distribution of the means for sample size n=3 is given in Table
5.2.
Table 5.2 Sampling Distribution of the Means for n=3
Sample mean ($\bar{x}$)   No of means (Frequency = f)   Probability of $\bar{x}$, p($\bar{x}$) (Relative frequency)
2                   1                             0.1
3                   2                             0.2
4                   1                             0.1
7                   2                             0.2
8                   2                             0.2
9                   2                             0.2
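(d) Mean of the sampling distribution ($\mu_{\bar{x}}$)
$\mu_{\bar{x}} = \frac{3+2+7+3+8+7+4+9+8+9}{10} = \frac{60}{10} = 6$, by using values in Table
5.1.
Or
$\mu_{\bar{x}} = \sum \bar{x}\,p(\bar{x}) = 2(0.1) + 3(0.2) + 4(0.1) + 7(0.2) + 8(0.2) + 9(0.2) = 6$,
by using values in Table 5.2.
(e) Standard error of the mean ($\sigma_{\bar{x}}$)
It is possible to find $\sigma_{\bar{x}}$ by using the $\bar{x}$ values and their associated
frequencies stated in Table 5.2.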
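Parts (d) and (e) can be checked by listing every possible sample and computing the mean and standard deviation of the resulting sample means. The sketch below does this in Python. The comparison at the end uses the usual finite-population formula for the standard error, σ·sqrt((N − n)/(N − 1))/sqrt(n), which is assumed here to be the formula the module intends.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = {"A": 0, "B": 3, "C": 6, "D": 3, "E": 18}
values = list(population.values())
N, n = len(values), 3

mu = mean(values)        # population mean
sigma = pstdev(values)   # population standard deviation (divides by N)

# Every simple random sample of size 3 drawn without replacement.
samples = list(combinations(population, n))
sample_means = [mean([population[unit] for unit in s]) for s in samples]

mu_xbar = mean(sample_means)        # mean of the sample means
sigma_xbar = pstdev(sample_means)   # standard error from the enumeration

# Standard error from the finite-population formula (assumed form).
sigma_xbar_formula = sqrt((N - n) / (N - 1)) * sigma / sqrt(n)
print(len(samples), mu, mu_xbar, round(sigma_xbar, 3), round(sigma_xbar_formula, 3))
```

The enumeration and the formula give the same standard error, and the mean of the sample means equals the population mean.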
This is the formula for the standard normal variable for the distribution of $\bar{x}$.
Thus, values of z computed from the formula can be used to enter the standard
normal table in the usual manner.
The Central Limit Theorem (CLT): When the population is normally
distributed, the sampling distribution of the mean is also normal.
Yet decision makers must deal with many populations that are not normally
distributed or whose distributions are unknown. Under these
conditions, we invoke what is called the Central Limit Theorem. In the context
of this application, the CLT states that in selecting simple random samples of
size n from a population, the sampling distribution of the mean $\bar{x}$ can be
approximated by a normal probability distribution as the sample size becomes
large. It has been generally found that for most populations, a sample size of
n ≥ 30 makes the normal approximation a reasonable assumption. However, the
larger the sample size, the better is the approximation to a normal probability
function. If, on the other hand, the population distribution is known and has a
normal probability distribution, the sampling distribution of $\bar{x}$ is a normal
probability distribution for any sample size.
The CLT states that if x1, x2, ..., xn is a simple random sample drawn from a
population X (whose distribution may not be normal), then the distribution of
the mean ($\bar{x}$) approaches the normal distribution with mean µ and standard
deviation $\sigma/\sqrt{n}$ as the sample size n becomes large. That is, if n is large,
$\bar{x} \sim N(\mu, \sigma^2/n)$ approximately.
Example 5.3: Suppose hourly wages of workers in an industry have a mean
wage rate of $5.00 per hour and a standard deviation of $0.60.
(a) What is the probability that the mean wage of a random sample of 50
workers will be between $5.10 and $5.20?
This implies that the probability that the sample mean wage is between $5.10 and
$5.20 is 0.1099 or 10.99%.
(b) By simply substituting n=36 for n=50 and keeping the other values
as they were, you get a probability value of 0.1359. This value is larger than
the above value because the sample size in this case is smaller than before and
hence the standard error of the mean is larger. The z-values for $5.10 and $5.20
are therefore smaller (1 and 2, compared with about 1.18 and 2.36), so the
interval lies closer to the center of the distribution, where more of the area
is found.
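Both probabilities in Example 5.3 can be recomputed with Python's `statistics.NormalDist`, treating the sample mean as normal with mean 5.00 and standard error 0.60/√n. This is a sketch for checking the arithmetic; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 5.00, 0.60   # population mean and standard deviation of hourly wages

def prob_mean_between(low, high, n):
    """P(low < sample mean < high) for a sample of size n, using the CLT."""
    standard_error = sigma / sqrt(n)
    xbar = NormalDist(mu, standard_error)
    return xbar.cdf(high) - xbar.cdf(low)

p_n50 = prob_mean_between(5.10, 5.20, n=50)
p_n36 = prob_mean_between(5.10, 5.20, n=36)
print(round(p_n50, 4), round(p_n36, 4))
```

The software result for n = 50 differs slightly from the table-based answer because the table requires z to be rounded to two decimals.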
To sum up, for a normally distributed population, the sample means are
normally distributed, regardless of the size of the sample. For any other
population, the sample means are nearly normally distributed when the sample is
large (n ≥ 30).
Note that the CLT is not restricted only to normally distributed populations, but
the tendency also occurs for all populations usually encountered as the sample
size n increases. Thus, empirically the CLT implies that if you have a sample
size n which is sufficiently large (n ≥ 30), you can make inferences
based on the assumption of a normal distribution. The larger the sample size n
is, the smaller the standard error of the mean ($\sigma_{\bar{x}}$) and the taller and thinner is
the distribution of means ($\bar{x}$).
Where, n in this case is the number of all possible samples of a particular size
that can be drawn from a population.
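The standard error of the proportion ($\sigma_{\bar{p}}$) is the standard deviation of all
possible sample proportions. $\sigma_{\bar{p}}$ is then computed by applying the following
formula:
$\sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}}$, if the sample size n is at most 5% of the population
size N ($n/N \le 0.05$).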
A B C D E
0 3 6 3 18
This population contains three even numbers (0, 6 & 18) and two odd numbers
(3 & 3).
(a) What will be the population proportion of even numbers (p) (i.e., the
successes are now even numbers)?
(b) Construct the sampling distribution of proportions ($\bar{p}$) for a sample size n
= 3.
(c) What is the mean of the sampling distribution of the proportion, $\mu_{\bar{p}}$ (i.e.,
the mean of the sample proportions)?
(d) What is the standard deviation (or standard error) of the sampling
distribution of the proportion (i.e., standard error of the proportion, $\sigma_{\bar{p}}$)?
Solutions:
(a) Population proportion of even numbers (p)
Given that X = 3 and N = 5. Then, $p = \frac{X}{N} = \frac{3}{5} = 0.6$
(b) Sampling distribution of the proportion ($\bar{p}$)
In order to obtain the sampling distribution of the proportion ($\bar{p}$), we should
first find the number of simple random samples of size n=3 that can be drawn
without replacement from a population of size N=5. The total possible samples
can be obtained by using the rule of combination,
$\frac{N!}{n!\,(N-n)!} = \frac{5!}{3!\,2!} = 10$. Thus, 10 simple random samples of size 3 can be
drawn from our population. The proportions for all these possible samples are
given in Table 5.3.
Table 5.3 Sample Proportions ($\bar{p}$) from Population Values
Population values   Samples   Sample values   Sample proportion ($\bar{p}$)
A = 0               ABC       0, 3, 6         2/3
B = 3               ABD       0, 3, 3         1/3
C = 6               ABE       0, 3, 18        2/3
D = 3               ACD       0, 6, 3         2/3
E = 18              ACE       0, 6, 18        1
                    ADE       0, 3, 18        2/3
                    BCD       3, 6, 3         1/3
                    BCE       3, 6, 18        2/3
                    BDE       3, 3, 18        1/3
                    CDE       6, 3, 18        2/3
The sampling distribution of the proportions for sample size n=3 is given in
Table 5.4.
Table 5.4 Sampling Distribution of the Proportion for n=3
Sample proportion ($\bar{p}$)   No of proportions (Frequency = f)   Probability of $\bar{p}$, p($\bar{p}$) (Relative frequency)
1/3                       3                                   0.3
2/3                       6                                   0.6
1                         1                                   0.1
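The mean and standard error of this sampling distribution can be verified by enumerating the ten samples. The sketch below does so in Python; the final comparison applies the finite-population correction factor sqrt((N − n)/(N − 1)) to sqrt(p(1 − p)/n), which is an assumption carried over from the usual textbook treatment because the sample here is more than 5% of the population.

```python
from fractions import Fraction
from itertools import combinations
from math import sqrt

population = {"A": 0, "B": 3, "C": 6, "D": 3, "E": 18}
N, n = len(population), 3

def is_even(x):
    return x % 2 == 0

# Population proportion of even numbers.
p = Fraction(sum(is_even(v) for v in population.values()), N)

# Proportion of even numbers in every possible sample of size 3.
proportions = [
    Fraction(sum(is_even(population[unit]) for unit in s), n)
    for s in combinations(population, n)
]

mean_p = sum(proportions) / len(proportions)
var_p = sum((q - mean_p) ** 2 for q in proportions) / len(proportions)
se_enumerated = sqrt(var_p)

# Standard error with the finite-population correction (assumed form).
se_formula = sqrt((N - n) / (N - 1)) * sqrt(p * (1 - p) / n)
print(p, mean_p, round(se_enumerated, 4), round(se_formula, 4))
```

The mean of the sample proportions equals the population proportion, and the two standard-error calculations agree.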
The CLT may also be used to justify the normal approximation to the
distribution of the sample proportion $\bar{p}$ for a sufficiently large sample size n.
If n is large (usually n ≥ 30) and both np and n(1 − p) are sufficiently large,
then the sampling
Learning Activities
1. Suppose 55% of the television audience population watched a particular
program one Saturday evening.
(a) What is the probability that, in a random sample of 100 viewers, less than
50% of the sample watched the program?
(b)If a random sample of 500 viewers is taken, what is the probability that less
than 50% of the sample saw the program?
Continuous Assessment
Individual assignment and/or written test will be given about basic concepts of
probability.
Summary
A primary data source is one where you obtain the data yourself or have
access to all the original observations. A secondary data source contains a
summary of the original data, usually in the form of tables. When collecting
data always keep detailed notes of the sources of all information, how it was
collected, precise definitions of the variables, etc. Some data can be obtained
electronically, which saves having to type it into a computer, but the data still
need to be checked for errors.
There are two principal types of sampling, namely random and non-random
sampling. There are various types of random sample, including simple,
systematic, stratified and clustered random samples. The methods are sometimes
combined in multistage samples. Common types of non-random sampling
methods include convenience, purposive, quota and snowball sampling.
The type of sampling affects the size of the standard errors of the sample
statistics. The most precise sampling method is not necessarily the best if it
costs more to collect (since the overall sample size that can be afforded will
be smaller).
The sampling frame is the list (or lists) from which the sample is drawn. If it
omits important elements of the population its use could lead to biased
results.
Careful interviewing techniques are needed to ensure reliable answers are
obtained from participants in a survey.
Estimation is the process of using sample information to make good
estimates of the value of population parameters.
There are several criteria for finding a good estimate. Two important ones are
the (lack of) bias and precision of the estimator. Sometimes there is a tradeoff
between these two criteria – one estimator might have a smaller bias but be
less precise than another.
An estimator is unbiased if it gives a correct estimate of the true value on
average. Its expected value is equal to the true value.
The precision of an estimator can be measured by its sampling variance.
Statistical Estimation
Estimation is a process by which we estimate various unknown population
parameters from sample statistics. The first step in estimation is to obtain
observations on one or more random variables. Suppose there is a single
random variable X and a single parameter θ. The observations are used to
construct estimates of θ. The formula for obtaining the estimate of a parameter
is referred to as an estimator, and the numerical value it produces is called
an estimate. Thus, an estimator is a sample statistic that is used to estimate an
unknown population parameter. The theory of estimation can be divided into
two parts: point estimation and interval estimation.
Point Estimation: A point estimate is a single number or value that estimates the
exact value of the unknown population parameter of interest. An interval estimate
is an interval that provides upper and lower bounds for a population
parameter. For example, the sample mean is an estimator of the population
mean.
(a) Interval estimation of the mean in the large sample case: Here we assume
that n ≥ 30, so the sampling distribution of x̄ is approximated by the
normal probability distribution. We also assume that the sample size is small
relative to the size of the population (n/N ≤ 0.05) and hence the finite population
correction factor is not needed.
Statisticians use the notation α to denote the probability that the sampling error
is larger than the sampling error reported in the precision statement. Thus,
α = 0.05 states there is a 0.05 probability that the sampling error is larger than
that reported in the precision statement. Since the normal distribution is
symmetric, α/2 is the area or probability in each tail of the distribution and 1 − α
is the area or probability that a sample mean will provide a sampling error
less than or equal to the sampling error used in the precision statement.
(6.3)
can be used in its place. Thus, the standard deviation of the sample means is
estimated by S/√n.
The following table contains the values of zα/2 for the most commonly used
confidence levels. The common choices for the degree of confidence, expressed
in percentage terms, are 90%, 95% and 99%. The z-value corresponding to zα/2
is also referred to as the critical value.
Table 6.1 Confidence Levels
Confidence level (%)    α       α/2      zα/2 (critical value)
90                      0.10    0.05     1.645
95                      0.05    0.025    1.96
99                      0.01    0.005    2.575
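The critical values in Table 6.1 can be reproduced from the standard normal distribution. The following is a minimal sketch using only the Python standard library; the function name z_critical is an illustrative choice, not part of the module. Note that the exact value for 99% confidence is closer to 2.576, which many tables round to 2.575.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Return z_(alpha/2) for a two-sided interval at the given confidence level."""
    alpha = 1.0 - confidence_level
    # The critical value leaves alpha/2 in the upper tail of the standard normal.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


critical_values = {level: round(z_critical(level), 3) for level in (0.90, 0.95, 0.99)}
for level, z in critical_values.items():
    print(f"{level:.0%} confidence: alpha = {1 - level:.2f}, z = {z}")
```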
(b) Interval estimation of the mean in the small sample case: The central limit
theorem played an important role in the development of the confidence interval
in Section 7, but this was only the case for n ≥ 30. If n < 30 the sampling
distribution of x̄ depends upon the distribution of the population. If the
population distribution is normal, the sampling distribution of x̄ will also be
normal regardless of the sample size. If the population standard deviation (σ)
is known, then expression (6.2) can be used regardless of the sample size. The
more likely case is that σ is unknown and the sample standard deviation,
denoted S (the square root of expression 8.3), must be used to obtain an
estimate of σ. The resultant confidence interval, x̄ ± zα/2 S/√n, is only
valid when n ≥ 30. In the small sample case, the confidence interval is based on
an alternative probability distribution known as the t-distribution. The
t-distribution is actually a family of similar probability distributions. A specific
distribution is determined by a parameter known as the degrees of freedom. A
t-distribution has a zero mean and becomes less dispersed as the degrees of
freedom increase. As the number of degrees of freedom increases, the difference
between the t-distribution and the standard normal probability distribution
becomes smaller and smaller.
computed as:
Example 6.2: John Gail is running for Congress from a Nebraska district. From a
random sample of 100 voters in the district, 60 indicate they plan to vote for
him in the upcoming election. The sample proportion (p̂) is 0.6, but the
population proportion (p) is unknown. (a) Estimate the population proportion.
(b) Develop a 95% confidence interval for the population proportion. (c) Interpret
the confidence interval estimate.
Solutions:
(a) p̂ = 60/100 = 0.6. This is the point estimate of the population proportion.
(b) p̂ ± zα/2 √(p̂(1 − p̂)/n) = 0.6 ± 1.96 × √(0.6 × 0.4/100) = 0.6 ± 0.096,
i.e. from 0.504 to 0.696.
(c) We are 95% confident that the population proportion (p) is between
0.504 and 0.696.
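The interval in Example 6.2 can be checked numerically. The sketch below assumes the usual large-sample (normal approximation) interval for a proportion, p̂ ± zα/2 √(p̂(1 − p̂)/n); the function name is illustrative. With 60 supporters out of 100, the limits round to roughly 0.504 and 0.696.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence_level=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)  # z times the standard error
    return p_hat, p_hat - margin, p_hat + margin


p_hat, lower, upper = proportion_confidence_interval(60, 100)
print(f"point estimate = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```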
2)
3) We are 95% confident that the population proportion (p) is between 0.4 and
0.8.
             H0 is true            H0 is false
Reject H0    Type I error          Correct conclusion
Accept H0    Correct conclusion    Type II error
It is usually argued that making a Type I error is more serious than making a Type II
error. We can control the probability of making a Type I error since this is
nothing more than the level of significance of the test. In the last discussion we
used the notation α = 0.05 to denote the level of significance of the test. This
specifies the maximum allowable probability of making a Type I error. The
probability of making a Type I error is controlled by setting a low value for
the significance level of the test. Conventional values for α are 0.05 and 0.01.
These values are set low to increase our confidence that a decision to reject
H0 is correct. Because of the uncertainty associated with making a Type
II error, statisticians often recommend the statement “do not reject H0”
rather than “accept H0”. The former statement does not assert that H0 is true
and so avoids the risk of making a Type II error. Only two conclusions are
possible: either reject H0 or do not reject H0.
3. Select the appropriate test statistic: A test statistic is a value obtained from
sample information that is used to determine whether H0 is rejected or not.
There are many test statistics, such as those based on the z-distribution,
t-distribution, F-distribution and χ²-distribution.
(a) The test statistic used for testing a claim about a population mean
The large sample case: The objective of hypothesis testing is to determine
whether a sample point estimate (e.g., the sample mean) is significantly
different from a claimed value. The relevant sample statistic (x̄) is converted
into a test statistic and compared to a critical value. For a large sample (n ≥ 30),
whether the population standard deviation is known or unknown, the
appropriate test statistic is the z-score. The test statistic can be computed as
follows:
z = (x̄ − μ0) / (σ/√n)    (6.4)
where μ0 is the value of the mean stated in H0. When σ is unknown, the sample
standard deviation S is used in its place:
z = (x̄ − μ0) / (S/√n)    (6.6)
The z-score is always evaluated under the null hypothesis and is not affected by
whether the alternative hypothesis suggests a left-tailed test, a right-tailed test
or a two-tailed test. The use of this test can best be illustrated by reference to
an example.
Example 6.5: A certain manufacturer claims on its label that each jar of coffee
it produces contains at least three pounds of coffee. A researcher intends to test
the validity of this claim and establish whether the manufacturing company is in
violation of its label claim.
The null and alternative hypotheses may be expressed as: H0: μ ≥ 3; HA: μ < 3.
It is clear that the test is left-tailed. Previous studies suggest that σ = 0.20
pounds. Suppose the researcher took a sample of 49 jars of coffee and
computed an average per jar of 2.95 pounds (x̄ = 2.95). We can now insert our
sample values into expression (6.5) to obtain the z-score value:
z = (2.95 − 3) / (0.20/√49) = −0.05/0.0286 = −1.75.
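The calculation for the coffee example can be sketched as follows. The figures (x̄ = 2.95, claimed mean 3, σ = 0.20, n = 49) come from Example 6.5; the 5% significance level is an assumption made here for illustration, since the example as reproduced does not state one.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, claimed_mean, sigma, n):
    """z-score of a sample mean, evaluated under the null hypothesis."""
    return (sample_mean - claimed_mean) / (sigma / sqrt(n))


z = z_statistic(2.95, 3.0, 0.20, 49)

alpha = 0.05                               # assumed significance level
z_crit = NormalDist().inv_cdf(alpha)       # left-tail critical value, about -1.645
reject_h0 = z < z_crit                     # left-tailed test: reject for small z

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```

Under that assumed 5% level the z-score falls in the left-hand rejection region, so the label claim would be rejected.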
The small sample case: If the sample size is small (n < 30) and the population
standard deviation is unknown, then we cannot use the z-distribution but must
use the t-distribution. We assume in using the t-distribution that the parent
population is normally distributed. The test statistic is
t = (x̄ − μ0) / (S/√n)    (6.7)
and is distributed with n − 1 degrees of freedom. The critical values for this test
are determined by the degrees of freedom. This test can be used to undertake
left-tailed, right-tailed and two-tailed hypothesis tests. It is
very similar to the z-score test, with the provisos that n < 30 and that the critical
values vary depending on the degrees of freedom. The null and alternative
hypotheses are set up in exactly the same way.
Example 6.6: The mean expenditure on fuel for all Addis Ababa households in
1990 was Br.600. A randomly selected sample of 15 upper-income households
was chosen to ascertain whether high-income households spent more on fuel.
The sample average for this group was calculated at Br.825 with an estimated
standard deviation of Br.150. Do high-income families spend more on fuel?
This is clearly a right-tailed test. The small sample size in conjunction with the
estimated standard deviation suggests the use of the t-statistic.
The null and alternative hypotheses in this case are:
H0: μ ≤ 600 (i.e., the mean expenditure is not greater than the national
mean)
Ha: μ > 600 (i.e., the mean expenditure is greater than the national mean)
We can now compute the t-statistic itself. If we insert the values we have
obtained from the sample into expression (6.7) we obtain the t-value as
t = (825 − 600) / (150/√15) = 225/38.73 = 5.81.
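The t-value for Example 6.6 can be computed directly from the sample figures. In this sketch the 5% significance level is an assumption, and the critical value 1.761 is the standard tabulated one-tailed value for 14 degrees of freedom, hard-coded because the Python standard library has no t-distribution.

```python
from math import sqrt


def t_statistic(sample_mean, claimed_mean, sample_sd, n):
    """t-statistic for a test about a population mean (n - 1 degrees of freedom)."""
    return (sample_mean - claimed_mean) / (sample_sd / sqrt(n))


t = t_statistic(825, 600, 150, 15)
degrees_of_freedom = 15 - 1

# Assumption: 5% level, right-tailed test; 1.761 is the tabulated t value for 14 df.
T_CRITICAL_5PCT_DF14 = 1.761
reject_h0 = t > T_CRITICAL_5PCT_DF14

print(f"t = {t:.2f} with {degrees_of_freedom} df, reject H0: {reject_h0}")
```

Since the computed t-value is far above the critical value, the sample supports the view that high-income households spend more on fuel.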
The prob-value or p-value of a test is also a useful piece of information that can
be used when undertaking hypothesis testing. A p-value is the probability of
getting a value of the sample statistic that is at least as extreme as the one from
the sample data, assuming that the null hypothesis is true. The ‘rule-of-thumb’
is that you should reject the null hypothesis if the p-value of the test is less than
or equal to the significance level of the test, α. The traditional approach
outlined above compares the test statistic to the critical value, whereas the p-value
approach compares the p-value to the significance level. Most statistical software now
generates p-values as a matter of course. The traditional approach and the
p-value approach always lead to the same conclusion.
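The agreement between the two approaches can be illustrated for a left-tailed z test. The sketch below uses the coffee-jar figures from Example 6.5 plus a few arbitrary z-values chosen only for illustration; the 5% level is likewise an assumption.

```python
from math import sqrt
from statistics import NormalDist

standard_normal = NormalDist()
alpha = 0.05  # assumed significance level


def left_tail_p_value(z):
    """Probability of a z-score at least as small as the one observed, under H0."""
    return standard_normal.cdf(z)


# z-score from Example 6.5: sample mean 2.95, claimed mean 3, sigma 0.20, n = 49
z_coffee = (2.95 - 3.0) / (0.20 / sqrt(49))
p_coffee = left_tail_p_value(z_coffee)

critical_value = standard_normal.inv_cdf(alpha)
decisions = []
for z in (-3.0, z_coffee, -1.0, 0.5):  # extra z-values are purely illustrative
    by_p_value = left_tail_p_value(z) <= alpha
    by_critical_value = z <= critical_value
    decisions.append((by_p_value, by_critical_value))

print(f"p-value for the coffee example: {p_coffee:.4f}")
print(decisions)
```

For every z-value the two entries of each pair match, which is the point made in the paragraph: the p-value rule and the critical-value rule are two views of the same decision.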
The decision rule (at a level of significance of α) in a small sample case for:
(a) A left-tailed test is: Reject H0 if t-statistic (or tcal) < −tα.
(b) A right-tailed test is: Reject H0 if t-statistic > tα.
(c) A two-tailed test is: Reject H0 if t-statistic < −tα/2 or if t-statistic > tα/2,
i.e. if |t-statistic| > tα/2.
5. Make a decision: Here we compare the test statistic and the critical value
and make a decision to reject H0 or not to reject H0. If H0 is rejected, it means
that it is highly improbable that so large a computed z-value or t-value is due to
sampling error (chance) alone. If it is not rejected, the computed z-value or
t-value is small enough to be attributed to sampling error (chance).
where p̄ is the pooled proportion.
Under the null hypothesis P1 − P2 = 0. Hence, the test statistic reduces to
the following form:
about the distribution of S². Unfortunately, this does not have a convenient
probability distribution, so we transform it to (n − 1)S²/σ² (equation 6.8), which
does have a χ² distribution, with ν = n − 1 degrees of freedom.
To construct the 95% confidence interval around the point estimate we proceed
in a similar fashion to the Normal or t distribution. First, we find the critical
values of the χ² distribution which cut off 2.5% in each tail. These are no longer
symmetric around zero as was the case with the standard Normal and t
distributions (see Table A4 in the Appendix).
As with the t table, the first column gives the degrees of freedom, so we
require the row corresponding to ν = n − 1 = 19.
For the left-hand critical value (cutting off 2.5% in the left-hand tail) we look
at the column headed 0.975, representing 97.5% in the right-hand tail. This
critical value is 8.91.
For the right-hand critical value we look up the column headed 0.025 (2.5%
in the right-hand tail), giving 32.85. In Excel, the formula =CHIINV(0.025, 19)
returns 32.85, the right-hand critical value.
We can therefore be 95% confident that (n − 1)S²/σ² lies between these two
values, i.e.
8.91 ≤ (n − 1)S²/σ² ≤ 32.85    (6.9)
We actually want an interval estimate for σ², so we need to rearrange equation
(6.9) so that σ² lies between the two inequality signs. Rearranging yields
(n − 1)S²/32.85 ≤ σ² ≤ (n − 1)S²/8.91    (6.10)
and evaluating this expression with n = 20 and S² = 625 leads to the 95%
confidence interval for σ², which is
361.5 ≤ σ² ≤ 1332.8.
Note that the point estimate, 625, is no longer at the centre of the interval but is
closer to the lower limit. This is a consequence of the skewness of the
distribution.
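The rearrangement can be checked with a few lines of arithmetic. This sketch takes the sample variance (625), the sample size implied by ν = 19 (n = 20) and the two tabulated critical values (8.91 and 32.85) from the text; the function name is illustrative.

```python
def variance_confidence_interval(sample_variance, n, chi2_lower, chi2_upper):
    """Confidence interval for a population variance.

    chi2_lower and chi2_upper are the tabulated chi-square critical values
    cutting off the left-hand and right-hand tails respectively.
    """
    degrees_of_freedom = n - 1
    scaled = degrees_of_freedom * sample_variance
    # Dividing by the larger critical value gives the lower limit, and vice versa.
    return scaled / chi2_upper, scaled / chi2_lower


lower, upper = variance_confidence_interval(625, 20, chi2_lower=8.91, chi2_upper=32.85)
print(f"95% CI for the variance: ({lower:.1f}, {upper:.1f})")
print(f"distance below estimate: {625 - lower:.1f}, above: {upper - 625:.1f}")
```

The second line of output shows the asymmetry noted in the text: the point estimate sits much closer to the lower limit than to the upper one.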
Score     Observed (O)    Expected (E)    O − E    (O − E)²    (O − E)²/E
3         15              12              3        9           0.75
4         7               12              −5       25          2.08
5         15              12              3        9           0.75
6         14              12              2        4           0.33
Totals    72              72              0                    7.66
Looking up the critical value for this test takes a little care as one needs first to
consider whether it is a one- or two-tailed test. The alternative hypothesis
suggests a two-sided test, since the error could be in either direction. However,
this intuition is wrong, for the following reason. Looking closely at equation
(6.11) reveals that large discrepancies between observed and expected values
(in whichever direction) can only lead to large values of the test statistic.
Conversely, small values of the test statistic must mean that differences
between O and E are small, which is consistent with an unbiased die. Thus the null is only
rejected by large values of the χ² statistic or, in other words, the rejection region
is in the right-hand tail only of the χ² distribution. It is a one-tailed test. This is
illustrated in Figure 6.2.
(c) Contingency Tables: Data are often presented in the form of a two-way
classification, as shown in Table 6.4. This is known as a contingency table and is
another situation where the χ² distribution is useful. It provides a test of whether
or not there is an association between the two variables represented in the table.
Table 6.4: Data on Voting Intentions by Social Class
Social class    Labour    Conservative    Liberal Democrat    Total
A               10        15              15                  40
B               40        35              25                  100
C               30        20              10                  60
Totals          80        70              50                  200
The table shows the voting intentions of a sample of 200 voters, cross-classified
by social class. Test whether there is any association between people’s voting
behaviour and their social class. Are manual workers (social class C in the
table) more likely to vote for the Labour party than for the Conservative party?
The table would appear to indicate support for this view, but is this truly the
case for the whole population or is the evidence insufficient to draw this
conclusion? This sort of problem is amenable to analysis by a χ² test. The data
presented in the table represent the observed values, so expected values need to
be calculated and then compared to them using a χ² test statistic.
The first task is to formulate a null hypothesis, on which to base the calculation
of the expected values, and an alternative hypothesis. These are: H0: there is no
association between social class and voting behaviour; H1: there is some
association between social class and voting behaviour. As always, the null
hypothesis has to be precise, so that expected values can be calculated. If H0 is
true and there is no association, we would expect the proportions voting
Labour, Conservative and Liberal Democrat to be the same in each social class.
Further, the parties would be identical in the proportions of their support
coming from social classes A, B and C. This means that, since the whole
sample of 200 splits 80:70:50 for the Labour, Conservative and Liberal
Democrat parties (see the bottom row of the table), each social class should
split the same way. Thus of the 40 people of class A, 80/200 of them should
vote Labour, 70/200 Conservative and 50/200 Liberal Democrat. This yields:
Labour: 40 × 80/200 = 16; Conservative: 40 × 70/200 = 14; Liberal Democrat: 40 × 50/200 = 10.
Both observed and expected values are presented in Table 6.5 (expected values
are in brackets). Notice that both the observed and expected values sum to the
appropriate row and column totals.
Table 6.5: Observed and expected values (latter in brackets)
Social class    Labour    Conservative    Liberal Democrat    Total
A               10(16)    15(14)          15(10)              40
B               40(40)    35(35)          25(25)              100
C               30(24)    20(21)          10(15)              60
Totals          80        70              50                  200
It can be seen that, compared with the ‘no association’ position, Labour gets too
few votes from class A and the Liberal Democrats too many. However, Labour
gets disproportionately many class C votes and the Liberal Democrats too few. The
Conservatives’ observed and expected values are nearly identical, indicating that the
propensities to vote Conservative are much the same in all social classes. A quick way
to calculate the expected value in any cell is to multiply the appropriate row
total by the column total and divide through by the grand total (200). For example,
to get the expected value for the class A/Labour cell:
(40 × 80)/200 = 16.
In carrying out the analysis care should again be taken to ensure that
information is retained about the sample size, i.e. the numbers in the table
should be actual numbers and not percentages or proportions. This can be
checked by ensuring that the grand total is always the same as the sample size.
The χ² test on a contingency table is similar to the one carried out before, the
formula being the same: χ² = Σ (O − E)²/E, with the number of degrees of freedom
given by ν = (r − 1) × (c − 1) where r is the number of rows in the table and c is
the number of columns. In this case, r = 3 and c = 3 so ν = (3 − 1) × (3 − 1) = 4.
The reason why there are only four degrees of freedom is that once a suitable set
of four cells of the contingency table (for example, a 2 × 2 block) has been
filled, the other five are fixed by
the row and column totals. The number of ‘free’ cells can always be calculated
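as the number of rows less one, times the number of columns less one, as given
above. The test statistic in this case can be calculated as follows, cell by cell:
χ² = (10 − 16)²/16 + (15 − 14)²/14 + (15 − 10)²/10
   + (40 − 40)²/40 + (35 − 35)²/35 + (25 − 25)²/25
   + (30 − 24)²/24 + (20 − 21)²/21 + (10 − 15)²/15 = 8.04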
Find the critical value from the χ² distribution with 4 degrees of freedom. At
the 5% significance level, it is 9.49 (see Table A4 in the appendix).
Make a decision: Since 8.04 < 9.49, the test statistic is smaller than the critical
value, so the null hypothesis cannot be rejected. The evidence is not strong
enough to support an association between social class and voting intention. We
cannot reject the null hypothesis of no association at the 5% significance level. Note,
however, that the test statistic is fairly close to the critical value, so there is
some weak evidence of an association, but not enough to satisfy conventional
statistical criteria.
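The whole contingency-table test can be reproduced from the observed counts in Table 6.4. The sketch below is one possible implementation in plain Python; the function names are illustrative, and the critical value 9.488 is the standard tabulated 5% value for 4 degrees of freedom, hard-coded because the standard library has no χ² distribution.

```python
def expected_counts(observed):
    """Expected cell counts under no association: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


# Rows: social classes A, B, C; columns: Labour, Conservative, Liberal Democrat
observed = [[10, 15, 15], [40, 35, 25], [30, 20, 10]]

expected = expected_counts(observed)
statistic = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

CHI2_CRITICAL_5PCT_DF4 = 9.488  # tabulated value, assumed here rather than computed
reject_h0 = statistic > CHI2_CRITICAL_5PCT_DF4

print(expected)
print(f"chi-square = {statistic:.2f}, df = {degrees_of_freedom}, reject H0: {reject_h0}")
```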
The F distribution: It has a variety of uses in statistics; in this session we
look at only two: testing for the equality of two variances and conducting an analysis of
variance (ANOVA) test. The F family of distributions resembles the χ²
distribution in shape: it is always non-negative and is skewed to the right. It has
two sets of degrees of freedom (labelled ν1 and ν2) and these determine its
precise shape. Typical F distributions are shown in Figure 8.5. As usual, for a
hypothesis test we define an area in one or both tails of the distribution to be the
rejection region. If a test statistic falls into the rejection region then the null
hypothesis upon which the test statistic was based is rejected.
It is appropriate to write the hypotheses in the form shown in (8.12) since the
random variable we shall use is the ratio of sample variances.
We write: S1²/S2² ~ F(n1 − 1, n2 − 1)    (6.13)
The F distribution thus has two parameters, the two sets of degrees of freedom,
one (ν1) associated with the numerator, the other (ν2) associated with the
denominator of the formula. In each case, the degrees of freedom are given by
the sample size minus one. Note that S2²/S1² also has an F distribution (i.e. it
doesn’t matter which variance goes into the numerator) but with the degrees of
freedom reversed, ν1 = n2 − 1, ν2 = n1 − 1. The sample data are: S1 = 25, S2 = 20,
n1 = 30, n2 = 30.
The test statistic is simply the ratio of sample variances. In testing it is less
confusing if the larger of the two variances is made the numerator of the test
statistic. Therefore, we have the following test statistic:
F = S1²/S2² = 25²/20² = 625/400 = 1.5625
This must be compared to the critical value of the F distribution with ν1 = 29,
ν2 = 29 degrees of freedom. The rejection regions for the test are the two tails of
the distribution, cutting off 2.5% in each tail. Since we have placed the larger
variance in the numerator, only large values of F reject the null hypothesis so
we need only consult the upper critical value of the F distribution, i.e. that value
which cuts off the top 2.5% of the distribution. The degrees of freedom for the
test are given along the top row (ν1) and down the first column (ν2). The
numbers in the table give the critical values cutting off the top 2.5% of the
distribution. The critical value in this case is 2.09 (see Table A5 in the Annex),
at the intersection of the row corresponding to ν2 = 29 and the column
corresponding to ν1 = 30 (ν1 = 29 is not given so 30 is used instead; this gives a
very close approximation to the correct critical value). Since the test statistic
(1.5625) does not exceed the critical value, the null hypothesis of equal variances
cannot be rejected at the 5% significance level.
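The variance-ratio test can be sketched in a few lines. The sample standard deviations (25 and 20) and the critical value 2.09 are taken from the text; the function name is an illustrative choice, and the critical value is hard-coded from the table rather than computed.

```python
def variance_ratio(sd_a, sd_b):
    """F statistic for equality of variances, with the larger variance on top."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


f_statistic = variance_ratio(25, 20)      # 625 / 400

F_CRITICAL_UPPER_2_5PCT = 2.09            # tabulated value quoted in the text
reject_h0 = f_statistic > F_CRITICAL_UPPER_2_5PCT

print(f"F = {f_statistic}, reject H0 of equal variances: {reject_h0}")
```

Putting the larger variance in the numerator guarantees F ≥ 1, which is why only the upper critical value needs to be consulted.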
Analysis of Variance
In the previous sub-section we discussed how to test the hypothesis that the
means of two samples are the same, using a z or t test, depending upon the
sample size. This type of hypothesis test can be generalised to more than two
samples using a technique called analysis of variance (ANOVA), based on the
F distribution. Although it is called analysis of variance, it actually tests
differences in means. Using this technique we can test the hypothesis that the
means of all the samples are equal, versus the alternative hypothesis that at least
one of them is different from the others. The assumptions underlying the
ANOVA technique are essentially the same as those used in the t test when
comparing two different means. We assume that the samples are randomly and
independently drawn from normally distributed populations which have equal
variances. Suppose there are three factories, whose outputs have been sampled,
with the results shown in Table 6.6. We wish to answer the question whether
this is evidence of different outputs from the three factories, or simply random
variations around an average output level.
Table 6.6: Samples of output from three factories
Observation    Factory 1    Factory 2    Factory 3
1              415          385          408
2              430          410          415
3              395          409          418
4              399          403          440
5              408          405          425
6              418          400
7                           399
The null and alternative hypotheses are therefore: H0: μ1 = μ2 = μ3; H1: at least
one mean is different from the others. This is the simplest type of ANOVA,
known as one-way analysis of variance. In this case there is only one factor
which affects output – the factory. The factor which may affect output is also
known as the independent variable. In more complex designs, there can be two
or more factors which influence output. The output from the factories is the
dependent or response variable in this case. To decide whether we reject H0, we
compare the variance of output within factories to the variance of output
between (the means of) the factories. Both methods provide estimates of the
overall true variance of output and, under the null hypothesis that factories
make no difference, should provide similar estimates. The ratio of the variances
should be approximately unity. If the null is false however, the between-
samples estimate will tend to be larger than the within-samples estimate and
their ratio will exceed unity. This ratio has an F distribution, so if it is
large enough to fall into the upper tail of the distribution then H0 is
rejected.
To formalise this we break down the total variance of all the observations into:
1. The variance due to differences between factories, and
2. The variance due to differences within factories (also known as the error
variance).
Initially we work with sums of squares rather than variances. Recall that the
sample variance is given by:
s² = Σ(x − x̄)² / (n − 1)
The numerator of the right-hand side of this expression gives the sum of
squares, i.e. the sum of squared deviations from the mean. Accordingly we have
to work with three sums of squares:
The total sum of squares measures (squared) deviations from the overall or
grand average using all the observations. It ignores the existence of the
different factors.
The between sum of squares is based upon the averages for each factor and
measures how they deviate from the grand average.
The within sum of squares is based on squared deviations of observations
from their own factor mean.
It can be shown that there is a relationship between these sums of squares, i.e.
Total sum of squares (TSS) = Between sum of squares (BSS) + Within sum of squares (WSS)
The total sum of squares is TSS = Σi Σj (Xij − x̄)², where Xij is observation j
from factory i and x̄ is the grand average. The index i runs from 1 to 3 in this case (there are
three classes or groups for this factor) and the index j (indexing the
observations) goes from 1 to 6, 7, or 5 (for factories 1, 2 and 3 respectively).
Although this looks complex, it simply means that you calculate the sum of
squared deviations from the overall mean. The overall mean of the 18 values is
410.11 and the total sum of squares may be calculated as:
TSS = (415 − 410.11)² + (430 − 410.11)² + … + (425 − 410.11)² = 2977.778
An alternative formula, often simpler for calculation, is
TSS = Σi Σj Xij² − n x̄² = 2977.778, as before, where n = 18.
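The between sum of squares (BSS), or treatment sum of squares (TrSS), is
BSS = Σi ni (x̄i − x̄)² = 6(410.83 − 410.11)² + 7(401.57 − 410.11)² + 5(421.2 − 410.11)² ≈ 1128.43
where x̄i is the mean and ni the number of observations for factory i.
Once again there is an alternative formula which may be simpler for calculation
purposes:
BSS = Σi ni x̄i² − n x̄²    (6.17)
The within sum of squares (WSS) is Σi Σj (Xij − x̄i)². The term Xij − x̄i
measures the deviations of the observations from the factor
mean and so the within sum of squares gives a measure of dispersion within the
classes. Hence, it can be calculated as:
WSS = TSS − BSS = 2977.778 − 1128.430 = 1849.348
Compute the F test statistic: The F statistic is based upon comparison between
and within sums of squares (BSS and WSS) but we must also take account of
the degrees of freedom for the test. The degrees of freedom adjust for the
number of observations and for the number of factors. Formally, the test
statistic is F = [BSS/(k − 1)] / [WSS/(n − k)], where k is the number of groups
(here 3) and n is the total number of observations (here 18).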
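The sums of squares for the factory data in Table 6.6 can be computed directly. The sketch below is one possible plain-Python implementation; the function name one_way_anova is illustrative, and it assumes the usual one-way F ratio, the between sum of squares divided by k − 1 over the within sum of squares divided by n − k, for k groups and n observations in total.

```python
def one_way_anova(groups):
    """Return the sums of squares and F ratio for a one-way analysis of variance."""
    all_values = [x for group in groups for x in group]
    n = len(all_values)                  # total number of observations
    k = len(groups)                      # number of groups (factories)
    grand_mean = sum(all_values) / n

    # Total: squared deviations of every observation from the grand mean
    tss = sum((x - grand_mean) ** 2 for x in all_values)
    # Between: squared deviations of group means from the grand mean, weighted by group size
    bss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within: squared deviations of observations from their own group mean
    wss = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    f_ratio = (bss / (k - 1)) / (wss / (n - k))
    return {"grand_mean": grand_mean, "TSS": tss, "BSS": bss, "WSS": wss, "F": f_ratio}


factory_1 = [415, 430, 395, 399, 408, 418]
factory_2 = [385, 410, 409, 403, 405, 400, 399]
factory_3 = [408, 415, 418, 440, 425]

result = one_way_anova([factory_1, factory_2, factory_3])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The output also confirms the identity TSS = BSS + WSS. The resulting F ratio would then be compared with the upper-tail critical value of the F distribution with 2 and 15 degrees of freedom.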
Learning Activities
1. The mean annual income for a sample of 250 Midroc factory workers was
calculated to be Br.25,000. The population standard deviation of annual
income for Midroc factory workers is known to be Br.5000. Assume the
sample is less than 5% of the relevant population.
a) Construct a 95% confidence interval for the population mean.
b) Construct a 90% confidence interval for the population mean.
2. A point estimate is correct if it is equal to the actual value of the parameter
being estimated while an interval estimate is correct if the actual value of the
parameter is in the interval. Which of these two estimates has the greatest
chance of being correct?