The document discusses the definition and classification of statistics. It defines statistics as the collection and analysis of quantitative data using probability theory. It classifies statistics into descriptive statistics, which summarizes data, and inferential statistics, which generalizes from samples to populations. The document also outlines the key characteristics of statistics such as being aggregated facts and being numerically expressed.

Uploaded by

Haftom Yitbarek
Copyright
© © All Rights Reserved

Debre Berhan University, Economics Program Introduction to Statistics

UNIT ONE: INTRODUCTION

1.1 Definition of Statistics

Statistics is a branch of applied mathematics concerned with the collection and interpretation of
quantitative data and the use of probability theory to estimate population parameters. It is a
science that helps us make better decisions in business and economics as well as in other fields.
Most research in different areas of study requires data in order to generate valuable information
that facilitates the decision-making process. Data are the raw material of research. Moreover, the
quality of the collected data greatly affects the precision of the results obtained
from a specific investigation. Therefore, it is extremely important to know the basics of
data collection.

Statistics is a very broad subject, with applications in a vast number of different fields.
In general, one can say that Statistics is a collection of methods for planning experiments,
obtaining data, organizing, summarizing, presenting, analyzing, interpreting and drawing
conclusions based on the data.

Statistical methods can be used to find answers to questions like:

• What kind and how much data need to be collected?
• How should we organize and summarize the data?
• How can we analyze the data and draw conclusions from it?
• How can we assess the strength of the conclusions and evaluate their uncertainty?

Statistics may also be defined as statistical data (plural sense) or as a method (singular
sense). Each of these definitions is treated separately as follows.

a. In the plural sense: statistics are the raw data themselves, like statistics of births,
statistics of deaths, statistics of students, statistics of imports and exports, etc.
b. In the singular sense: statistics is the subject that deals with the statistical methods of
collecting, organizing, presenting, analyzing and interpreting numerical data.

1.2 Classification of statistics

In general Statistical methods are classified into two groups or areas based on how data are used:
descriptive statistics and inferential statistics.

1.2.1 Descriptive statistics

Descriptive statistics, or just description, is the use of statistics to describe characteristics of the
data that we have. Descriptive Statistics consists of the collection, organization, summarization,
and presentation of numerical data. It is concerned with describing certain characteristics of a set
of observed data (usually a sample) – that is, what it is shaped like, what number the values tend
to cluster (converge) around, how much variation is present in the data, and so forth.

In short, descriptive Statistics describes the nature or characteristics of a data set without
making conclusions or generalizations. The following are some examples of descriptive Statistics.
• The average age of athletes who participated in the New York Marathon was 24 years.
• 45% of the students in Debre Berhan University are female.
• The marks of 50 students in the Introduction to Statistics course are found to range from
45 to 95.
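
The kind of summaries listed above can be sketched in a few lines of Python. This is only an illustration: the marks and the sex codes below are made-up values, not data from this module, and the variable names are assumptions.

```python
import statistics

# Hypothetical marks of 10 students (made-up numbers, for illustration only).
marks = [45, 52, 60, 67, 70, 74, 78, 83, 90, 95]
# Hypothetical sex of the same 10 students.
sexes = ["F", "M", "F", "M", "M", "F", "M", "F", "M", "M"]

mean_mark = statistics.mean(marks)        # value the marks tend to cluster around
mark_range = (min(marks), max(marks))     # smallest and largest mark
spread = statistics.stdev(marks)          # sample standard deviation (variation)
percent_female = 100 * sexes.count("F") / len(sexes)

print(f"Average mark: {mean_mark}")
print(f"Marks range from {mark_range[0]} to {mark_range[1]}")
print(f"Standard deviation: {spread:.2f}")
print(f"Female students: {percent_female:.0f}%")
```

Each printed figure only describes the ten observations at hand; nothing is claimed about any larger group of students.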

1.2.2 Inferential statistics

Inferential Statistics, also called inductive Statistics, is a method to generalize from a sample to a
population. It is concerned with the process of drawing conclusions (inferences) about specific
characteristics of a population based on information obtained from samples, performing
hypothesis testing, determining relationships among variables, and making predictions. The area
of inferential Statistics as a whole aims to give reasonable estimates of unknown
population parameters.

The following are some examples of inferential Statistics:

• The result obtained from the analysis of the income of 1000 randomly selected citizens in
Ethiopia suggests that the average per capita income of a citizen in Ethiopia is 30 Birr.
• The trend analysis on past data indicates that the average exchange rate for a dollar is
expected to be 8.30 birr in the coming month.

What is the difference between Descriptive and Inferential Statistics?

• Descriptive statistics is focused on summarizing the data collected from a sample. The
technique produces measures of central tendency and dispersion which represent how the
values of the variables are concentrated and dispersed.
• Inferential statistics generalizes the statistics obtained from a sample to the general
population to which the sample belongs. The measures of the population are termed as
parameters.
• Descriptive statistics make only summarization of the properties of the sample from which
data were acquired, but in inferential statistics, the measure from the sample is used to infer
properties of the population.

• In inferential statistics, the estimates are obtained from a sample rather than from the whole
population; therefore, some uncertainty always exists relative to the true values.

Describe briefly the difference between descriptive and inferential statistics.

1.3 Characteristics of statistics

Statistics should possess the following characteristics.

i. Statistics should be aggregates of facts

Single and isolated figures are not Statistics for the simple reason that such figures are unrelated
and can’t be compared. According to this aspect, to be Statistics, data must be in aggregate
(mass) and also the individual elements within the aggregate should relate to a common
phenomenon so that they can be compared to one another.

ii. Statistics should be affected to a marked extent by multiplicity of causes

Since Statistics are most commonly used in the social sciences, it is natural that they are affected by a
large variety of factors at the same time. Put differently, Statistics are not caused by a
single factor (force); rather, they are outcomes of a number of (multiple) factors (forces)
operating together.

iii. They should be numerically expressed

All Statistics are expressed in numbers. Nevertheless, the converse of this statement is not in
general true. That is, statements expressed in terms of numbers may not necessarily be Statistics
as there is a possibility for them not to meet the other requirements of the definition given above.

iv. They should be enumerated or estimated according to reasonable standards of accuracy

Numerical statements can either be enumerated, in which case they are supposed to be accurate
and precise or else they can be estimated by some expert observers, in which case 100%
accuracy is unlikely to be attained. In the process of estimation, reasonable standards of accuracy
must, however, be attained.

v. They should be collected in a systematic manner

If data are collected in a haphazard manner, the results obtained are likely to lead to
fallacious conclusions. Therefore, it is essential that Statistics be collected in a systematic
manner so that they conform to reasonable standards of accuracy.

vi. They should be collected for a predetermined purpose

Statistics collected without any predetermined purpose do not serve any useful purpose.
Therefore, the purpose of collecting Statistics should be defined clearly before they are collected.
Meaning, figures (Statistics) should be collected in view of some goal or target. Moreover, the
data should be collected in such a manner that it meets the predetermined needs.

vii. They should be placed in relation to each other

They should be comparable either period-wise or region-wise, or in reference to some other
means of comparison. As an example, suppose that the marketing head of a given supermarket in
Addis Ababa wants to know the average expenditure of households in the city, among other
things, so as to revise his marketing strategy. To achieve his objective, the head collects data on
expenditure from a sample of 1000 households selected using stratification (to be discussed in
the next chapter). Moreover, the head uses the interview approach to gather the required
information.

Thus, the data collected by the marketing head are Statistics as they fulfill all the requirements of
the definition.

First, the data are collected from 1000 households. Hence, they form an aggregate. Second, data on
expenditure are affected by a number of factors like income, taste, season, culture, etc. Third,
the expenditures of households, as they are expressed in terms of currency, are clearly numeric values.
Fourth, as long as expenditures are expressed to the nearest, say, birr, the degree of accuracy is
more than satisfactory. Fifth, since the investigator uses the stratified sampling technique and also
the interview data collection method, it is reasonable to say that the data are collected in a
systematic or scientific manner. The sixth requirement is fulfilled as the investigator has a
pre-determined end or goal, which is revising his marketing strategy, based on the average
expenditure level and other qualitative and quantitative variables. Finally, it is obvious that
the expenditures of households can be compared, be it region-wise or period-wise.

Describe the main characteristics of statistics.

1.4 Importance of statistics

The main function of Statistics is to collect and present numerical data in a systematic manner so
that it may be analyzed in a scientific way. Statistics basically concentrates on the analysis of a
phenomenon in a scientific manner, without proving it. The analysis of data, which is the core
objective of Statistics, is important because it helps to avoid or replace arbitrary decisions,
dogmatism, rule of thumb, tradition, and it tries to increase the custom of making decision based
on analyzed quantitative facts.

The following are the major uses of Statistics:

i. It simplifies mass of data (condensation)

It is common that the human mind is not capable of assimilating huge masses of facts and figures,
as they are complex to understand. Statistical methods simplify this complexity by making the
huge data easily intelligible and readily understandable. That is, statistical methods provide the
necessary means to condense masses of data and present them with the help of simple figures such
as averages, ratios, variations, measures of skewness, coefficients, etc. More attractive and
understandable presentations of data are also made with the help of diagrams and graphs.

ii. It presents facts in a definite form (Definiteness)

Statistics presents facts in a precise and definite (numeric) form and thus helps proper
comprehension of what is to be stated. Facts (data) stated in quantitative terms are more precise
and simple to understand, analyze and interpret. To give an example, suppose that you read the
statement:

“The average rainfall in 1995 is expected to be lower than that of 1994.”

This statement does not tell precise and complete information because it doesn’t present the
quantified rainfall in the year 1995. On the other hand, the statement may be presented as:

“The average rainfall in 1995 is expected to decrease by 5% from that of 5ml recorded in 1994.”

Clearly, since the rate of decrement in rainfall is quantitatively expressed (5%) it is possible to
have the precise expected rainfall in 1995 in quantitative form.
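
The arithmetic behind the second statement can be written out as a short sketch. The variable names are assumptions, and the unit is simply whatever unit the 1994 figure was recorded in.

```python
# Figures taken from the quantified statement above.
rainfall_1994 = 5.0      # average rainfall recorded in 1994
decrease_rate = 0.05     # expected decrease of 5%

# Expected 1995 rainfall = 1994 rainfall reduced by 5%.
rainfall_1995 = rainfall_1994 * (1 - decrease_rate)

print(f"Expected average rainfall in 1995: {rainfall_1995:.2f}")
```

The result is 5 x 0.95 = 4.75, a definite figure that the first, unquantified statement could not provide.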

iii. It facilitates Comparison

The very reason for saying numerical data are more precise is that they are amenable to (lend
themselves to) comparison. By furnishing different suitable devices or tools for comparison, like
averages and measures of dispersion, Statistics enables better understanding and appreciation of
the significance of a series of figures. Moreover, in most cases conclusions or decisions are
reached mainly based on the results obtained from different comparisons. For example, the
aggregate performance of students in two different sections (classes) can be judged by
comparing the average marks for the two classes.
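
A minimal sketch of such a comparison, assuming two small sets of made-up marks (the section names and numbers are hypothetical):

```python
import statistics

# Hypothetical marks for two sections (made-up numbers, for illustration only).
section_a = [55, 60, 65, 70, 75]
section_b = [40, 55, 65, 75, 90]

# Averages: where each section's marks are centred.
mean_a = statistics.mean(section_a)
mean_b = statistics.mean(section_b)

# Dispersion: how widely each section's marks are scattered.
sd_a = statistics.stdev(section_a)
sd_b = statistics.stdev(section_b)

print(f"Section A: mean = {mean_a}, standard deviation = {sd_a:.2f}")
print(f"Section B: mean = {mean_b}, standard deviation = {sd_b:.2f}")
```

In this made-up case the two averages are equal while the dispersion differs, which is why averages and measures of dispersion are usually read together when comparing groups.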

iv. Predictions

One of the major reasons statistical methods are so critical in business is their prediction
function. Prediction is the process of making a scientific guess about the future value of a
variable. Statistical methods make it possible to predict the likely future value of a variable based
on its past trend. Time series and regression analysis are the most commonly used methods
for prediction.
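
As a rough illustration of prediction from a past trend, the sketch below fits a straight line to a short series by ordinary least squares and extends it one period ahead. The function name `fit_linear_trend` and the sales figures are assumptions made for this example; real time series analysis involves more than a straight-line fit.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    times = range(n)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    sxx = sum((t - t_mean) ** 2 for t in times)
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


# Hypothetical sales for five past periods (made-up numbers).
sales = [100, 110, 120, 130, 140]
intercept, slope = fit_linear_trend(sales)

# Predict the next period by extending the fitted line.
next_period = len(sales)
forecast = intercept + slope * next_period
print(f"Forecast for period {next_period}: {forecast:.1f}")
```

Because the made-up series rises by exactly 10 each period, the fitted line reproduces it and the forecast is 150; with real data the line would only approximate the past values.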

v. It helps in formulation of suitable policies

If carefully handled, Statistics plays an invaluable role in the process of policy formulation.
Statistical data and statistical methods help the government in formulating suitable policies
with respect to taxation, import-export, budgeting and other socio-economic welfare programs.

vi. Formulating and Testing hypothesis

This function of Statistics refers to an aspect of inferential Statistics. In inferential Statistics,
hypotheses are formulated and tested to reach conclusions and, in some cases, to develop new
theories. A statistical hypothesis is usually a mathematical statement about the nature
(significance) of variables. In addition, statistical methods help in testing the correctness of the
laws of the different branches of knowledge. Hypothesis testing is commonly used in the fields
of medicine, agriculture, economics and business.

What are the major uses of statistics?

1.5 Limitations of Statistics

The fact that Statistics is applicable in almost all fields of study is not a guarantee of its
perfection. Of course, there is no perfect science in the world. Statistical methods also have
their own limitations. The following are the major limitations:

i. Statistics does not deal with individual items

This means that Statistics deals only with aggregates of facts and no importance is attached
to individual items. For instance, the age of a single student in a given class in a given year is not
statistical data. In contrast, the ages of all students within a given class in a given year form an
aggregate and hence can be considered as data. Likewise, the semester GPAs of a single
student over 4 semesters also form statistical data. In short, statistical methods are suited only
to those problems or situations where group characteristics are to be studied.

ii. Statistics deals only with quantitatively expressed items

Another limitation of Statistics is that it deals only with those subjects of inquiry that are capable of
being quantitatively measured and numerically expressed. Accordingly, such qualitative
characteristics as health, poverty, honesty and intelligence are not directly suitable for statistical
analysis; however, problems involving such qualitative variables are treated in Statistics indirectly. For
example, the variable health may be studied through the death rate, which is a quantitative variable.
However, these are only indirect methods.

iii. Statistical results are not universally true

As it is often said, statistical results are true only on the average. That is, the results obtained
from statistical data analysis are not true for each member or item within the data for which the
analysis is made. Statistical statements or conclusions are not generally true or applicable to
individuals, but are applicable to the majority of cases.

iv. Statistics is liable to be misused

Misuses of Statistics, unfortunately, are probably as common as valid uses of Statistics. In
reality, statistical methods can be properly used only by experienced or trained people, as it requires
skill to draw sensible conclusions from data. It is actually this limitation that hinders the
mass popularity of such a useful and applicable science.

Chapter summary

Statistics is a collection of methods for planning experiments, obtaining data, organizing,
summarizing, presenting, analyzing, interpreting and drawing conclusions based on the data.
Statistics may also be defined as statistical data (plural sense) or as a
method (singular sense). Statistical methods are classified into two groups based on how
data are used: descriptive statistics and inferential statistics. Statistics possess different
characteristics, e.g., they should be aggregates of facts and should be affected to a
marked extent by a multiplicity of causes. In addition, Statistics is important for
simplifying masses of data, presenting data in a definite form, facilitating comparison, etc.

Self evaluation test

1. Identify the following Statistics as inferential or descriptive

a) Over the past 10 years, Ashenafi’s consumption expenditure varied from
nothing to 2500 ETB.
b) According to the census performed by CSA, the per capita income of a
household in A.A is 100 birr.
c) The average height of athletes who are going to participate in the coming
Olympics is estimated to be 1.7m.
d) 40% of the plumbers in Austin made over $20,000 last year.

2. Define statistics in both plural and singular sense.

UNIT TWO: THEORY OF SAMPLING

2.1 Introduction

A sampling survey is simply the process of learning about the population on the basis of a sample
drawn from it. Thus, in the sampling technique, only part of the population is studied instead of
every unit, and conclusions are drawn on that basis for the entire population. The
sampling survey process involves three elements: selecting the sample, collecting the
information and making an inference about the population.

Basic concepts of sampling theory

Population: In Statistics the term population is used to mean the totality of cases (items) under
consideration in a given investigation or research. In other words, the largest collection of
observations on a variable constitutes the population.

Census: The process of gathering data from every element in the population.

Sample: Is part of the population of interest. Any non-empty subset of a population is called a
sample. There are different possible samples that can be selected from a single population.
Nevertheless, the one that best reflects or represents the behavior of the population is considered
to be the most appropriate one.

Sampling: The method of selecting a sample from a population.

Survey: process of gathering information from population

Statistic: It is a measurable characteristic of the sample. In short it is a sample result.

Parameter: It is a measurable characteristic of the population, or a numerical result obtained
by measuring the population.

Sampling Error: The difference between sample statistic and population parameter
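
The relationship between a parameter, a statistic and the sampling error can be sketched with simulated data. The population below is generated at random purely for illustration; the income range, the sample size and the seed are all assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is reproducible

# Hypothetical population: incomes (in birr) of 10,000 individuals.
population = [random.uniform(500, 5000) for _ in range(10_000)]

# Parameter: a measure computed from every element of the population.
parameter = statistics.mean(population)

# Statistic: the same measure computed from a sample of size n.
n = 200
sample = random.sample(population, n)
statistic = statistics.mean(sample)

# Sampling error: the difference between the statistic and the parameter.
sampling_error = statistic - parameter

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
print(f"Sampling error:              {sampling_error:.2f}")
```

In practice the parameter is unknown, so the sampling error cannot be computed directly; it can be shown here only because the whole simulated population is available.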

Sampling Unit: Elements of the population to be sampled or the unit of selection in the sampling
process.

Sample design: Is the set of procedures /methods/ for selecting the sample elements from the
population.

Sampling Frame: The list of all possible units in the reference population.

Sample Size: The number of elements/observations in a specific sample


Variable: A variable is a factor or characteristic that can take on different possible values or
outcomes. Income, height, weight, sex, age, etc are examples of variables. In an investigation,
data are collected about one or more variables of interest. A variable can be qualitative or
quantitative (numeric).

If the values of the variable can be represented numerically, it is called quantitative variable.

Example 2.1: GDP, amount of money, age, weight of a person.

The variables whose values cannot be represented numerically are called qualitative variable.

Example 2.2: sex, color, religion, nationality.

Quantitative variables are again divided into two groups: discrete and continuous. Discrete
variables are those whose values are obtained by counting, e.g., number of students, number of
books in a library. Continuous variables are those quantitative variables which can take any value
between two numbers. Their values are obtained by measurement, e.g., weight, height of a person.

2.2 Reasons for sampling

The following are the major reasons for sampling technique:

i. Cost/Economy

The unit cost of collecting data in a census is significantly less than in a sample survey.
However, due to the larger number of items in the population, the total cost of a census is
significantly higher than that of a sample. Suppose it takes Birr 200 per unit to
make a census of 10,000,000 individuals, but the unit cost of sampling 5,000 individuals is Birr
1,000. Then the total cost of the census is 10,000,000 x 200 = Birr 2,000,000,000, while that of
the sample is 5,000 x 1,000 = Birr 5,000,000.
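
The cost comparison can be checked with a short calculation. The figures are the ones given in the example; the variable names are assumptions.

```python
# Figures from the example above (costs in Birr).
census_units = 10_000_000
census_unit_cost = 200
sample_units = 5_000
sample_unit_cost = 1_000

# Total cost = number of units covered x cost per unit.
census_total = census_units * census_unit_cost
sample_total = sample_units * sample_unit_cost

print(f"Total cost of census: Birr {census_total:,}")
print(f"Total cost of sample: Birr {sample_total:,}")
print(f"The census costs {census_total // sample_total} times as much as the sample")
```

Even though each sampled unit costs five times as much as a census unit, the census is far more expensive in total because it covers 2,000 times as many units.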

ii. Timeliness

Due to the larger size of the population, the total time involved in a census is significantly
higher than that of sampling (i.e., the sample may provide us with the necessary information
quickly).

iii. Large population size


Many populations about which inferences must be made are so large that
it is impossible to cover all the items in the population. The solution is to take a sample from
such a population.

iv. Inaccessibility of the entire population

In some cases the entire population may not be accessible due to diseases, death, conflict, mental
abnormality, prisoners, etc. In that case sampling is necessary.

v. Destructive nature of many tests

Many tests destroy the item being tested, so information can be collected only
from part of the population. Examples: a blood test for a patient, the life (in hours) of a tube light,
the strength of wires, etc.

vi. Accuracy
Non-sampling error in a census is often higher than the non-sampling error committed in
a sample survey (less qualified investigators may be involved in a census, and the
supervision, monitoring and quality control mechanisms may be poorer). The
higher the degree of non-sampling error, the less reliable the result.

2.3 Methods of sampling

There are two principal methods of drawing a sample from a population: Probability sampling
and Non-probability sampling.

2.3.1 Probability Sampling

In probability sampling, every unit in the population has a known, nonzero chance of being
selected for the sample (an equal chance in the case of simple random sampling). Selection is
made by a chance mechanism; no human judgment is involved.

There are five basic types of probability sampling techniques.

i. Simple Random Sampling
ii. Stratified Sampling
iii. Systematic Sampling
iv. Cluster Sampling
v. Multi-Stage sampling

i. Simple Random Sampling

Simple Random Sampling is a method of probability sampling in which every unit in the
population has an equal, nonzero chance of being selected for the sample. In other words,
each element of the population has an equal and independent chance of being included in the
sample. For a sample of size n drawn from a population of size N, this probability is n/N.

There are two methods to select a simple random sample:

Lottery method- In this method, each population item is numbered 1 to N on identical slips or
cards (same size, shape and color). The numbered cards are placed in a bowl and mixed thoroughly,
and as many cards as needed are selected blindfolded. The subjects whose numbers are
selected constitute the sample. Since it is difficult to mix the cards thoroughly, there is a chance
of obtaining a biased sample. Thus we need another method of selecting sample elements.

Random Number method- Due to the problem of the lottery method, statisticians use another method
known as the random number method, where numbers are generated using computers or are taken
from random number tables available in the annexes of textbooks.

How to use random number table method

a) Assign a unique number to each population element in the sampling frame. Start with
serial number 1, or 01, or 001, etc. depending on the number of digits required.
b) Choose a random starting position by closing your eyes (blind fold selection) and placing
your finger on a number in the table.
c) Select serial numbers across rows or down columns or diagonally from the starting point.
d) Discard numbers that are not assigned to any population element and ignore numbers that
have already been selected.
e) Repeat the selection process until the required number of sample elements is selected.
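
Steps a) to e) can be imitated on a computer, with a pseudo-random number generator standing in for the printed table. This is a minimal sketch: the function name `random_number_sample`, the household labels and the seed are assumptions made for illustration.

```python
import random


def random_number_sample(frame, n, seed=None):
    """Select n elements from a sampling frame, imitating a random number table."""
    if n > len(frame):
        raise ValueError("sample size cannot exceed the population size")
    rng = random.Random(seed)
    population_size = len(frame)
    # Step a: serial numbers 1..N, so numbers are read with as many digits as N has.
    digits = len(str(population_size))
    selected = []
    while len(selected) < n:                  # step e: repeat until n are selected
        number = rng.randrange(10 ** digits)  # steps b-c: read the next number
        if number < 1 or number > population_size:
            continue                          # step d: not assigned to any element
        if number in selected:
            continue                          # step d: already selected
        selected.append(number)
    # Serial number s refers to the element at list position s - 1.
    return [frame[s - 1] for s in selected]


# Hypothetical sampling frame of 50 households.
frame = [f"household_{i}" for i in range(1, 51)]
sample = random_number_sample(frame, 5, seed=42)
print(sample)
```

In everyday work Python's built-in `random.sample(frame, n)` draws a simple random sample without replacement in one call; the longer version is shown only because it mirrors the table procedure step by step.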
Advantage of simple random sampling
• It ensures that the selection of the sample is unbiased.
Disadvantages of simple random sampling
• It requires a sampling frame, which is sometimes impossible to obtain (e.g. for a fish
population).
• If the population is very large, it is tedious and time consuming to number the units and
select the sample.
• Minority subgroups of the population may not be represented in the sample.

Try to explain the methods of simple random sampling

ii. Stratified sampling

In stratified sampling, a population is first divided into subgroups, called strata (singular
stratum), and a sample is selected from each stratum based on simple random or systematic
sampling method. The strata are made according to various homogeneous characteristics such as
sex, race, region or institutional affiliation such as faculty. Stratified sampling is applied if the
population is heterogeneous.

Stratified random sampling is a three-step process:

• Step 1- Divide the population into homogeneous, mutually exclusive and collectively
exhaustive groups or strata using some stratification variable (e.g. income level, sex,
education level, etc.);
• Step 2- Select an independent simple random sample from each stratum;
• Step 3- Form the final sample by consolidating all sample elements chosen in step 2.

Stratified samples can be:
Proportionate: sample elements are selected from each stratum such that the ratio
of sample elements to the total number of population elements (n/N) is constant/equal for all strata.
Disproportionate: the sample is disproportionate when the above-mentioned ratio is unequal
across strata.

Example 2.3: To select a proportionate stratified sample of 20 households from Addis Ababa
that belong to three income groups: low (50), middle (30) and high (20) (N=50+30+20=100).

• Sub-divide the households into three homogeneous sub-groups or strata by
income group: low, middle and high.
• Calculate the overall sampling fraction, f, in the following manner: f=n/N=20/100=0.2,
where n = sample size and N = population size. Then n1=0.2*50=10, n2=0.2*30=6 and n3=0.2*20=4.
Thus, n=n1+n2+n3=10+6+4=20
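
The allocation in Example 2.3 can be reproduced with a short sketch. The function name `proportionate_allocation` is an assumption; the stratum sizes and the sample size are those of the example.

```python
def proportionate_allocation(strata_sizes, n):
    """Allocate a total sample of size n to strata in proportion to their sizes."""
    population_size = sum(strata_sizes.values())
    # Each stratum receives n * N_h / N elements, i.e. sampling fraction f = n / N.
    # Rounding is harmless here; with other figures the rounded values may not
    # add up to n exactly and would need a manual adjustment.
    return {
        name: round(n * size / population_size)
        for name, size in strata_sizes.items()
    }


strata_sizes = {"low": 50, "middle": 30, "high": 20}
allocation = proportionate_allocation(strata_sizes, n=20)
print(allocation)
print("Total sample size:", sum(allocation.values()))
```

The result agrees with the hand calculation: 10 low-income, 6 middle-income and 4 high-income households. A simple random sample of that size would then be drawn within each stratum.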
Advantage of Stratified Sampling:
• The representativeness of the sample is improved.
Disadvantages of Stratified Sampling:
• If there are many variables of interest, dividing a large population into representative
subgroups requires a great deal of effort.
• If variables are somewhat complex or ambiguous (such as beliefs, attitudes, etc.), it is
difficult to separate individuals into the sub-groups according to these variables.

iii. Systematic Sampling

In systematic sampling only one random number is needed throughout the entire sampling
process. Elements of the population will be arranged in some order and the elements to be
included in the sample will be selected at a constant interval.

To use systematic sampling, a researcher needs:

a. a sampling frame of the population;
b. a skip interval (K), calculated as follows:

Skip interval (K) = population list size (N) / sample size (n), or K = N/n

The first element (number), which is between 1 and K, is determined using simple
random sampling, and then the next items are selected using the skip interval. For instance, the
j-th unit is selected first and then the (j + K)-th, (j + 2K)-th, ... units, until the required
sample size is obtained.

Example 2.4: Suppose there are 2000 subjects in the population and a sample of 50 subjects
is needed. The sampling interval (K) is 40 (2000/50). Select the starting point, say ‘x’, from 1
through 40 using simple random sampling, and then include every 40th element starting from ‘x’.
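
Example 2.4 can be sketched as follows. The function name `systematic_sample` and the seed are assumptions, and the sketch assumes that N is an exact multiple of n, as in the example.

```python
import random


def systematic_sample(population, n, seed=None):
    """Select n elements at a constant interval after a random start."""
    rng = random.Random(seed)
    skip = len(population) // n       # skip interval K = N / n
    start = rng.randint(1, skip)      # random starting position between 1 and K
    # Positions j, j + K, j + 2K, ... (counted from 1).
    positions = [start + i * skip for i in range(n)]
    return [population[p - 1] for p in positions]


# Population of 2000 subjects, identified by their serial numbers 1..2000.
population = list(range(1, 2001))
sample = systematic_sample(population, 50, seed=7)

print("First selected subject:", sample[0])
print("Sample size:", len(sample))
print("Interval between consecutive subjects:", sample[1] - sample[0])
```

Only one random number (the starting position) is drawn; every other selected position follows from the skip interval of 40.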

Advantages of Systematic Sampling:

• Less time consuming and easier to perform than SRS,
• It is more convenient to use as compared to SRS, and
• It provides a good approximation to SRS.

Disadvantages of Systematic Sampling:

If there is any sort of cyclic ordering of the subjects, the sample will not be representative of the
population. Example: if subjects in the population are arranged in a manner such as:

• Defective item
• Non-defective item
• Defective item
• Non-defective item, etc.,

then an even skip interval would select only defective items or only non-defective items,
depending on the starting point.

iv. Cluster Sampling

Cluster sampling can be used if the population is homogeneous and very large in size. It is a
type of sampling in which the population is divided into non-overlapping, internally heterogeneous
groups called clusters, and whole clusters are selected as the sampling units using
simple random sampling in the first phase (if it is two-phase cluster sampling). In
other words, cluster sampling involves dividing the population into
groups (or clusters); then one or more clusters are chosen at random and the individuals within the
chosen clusters are sampled.
A two-step process:
• Step 1- The defined population is divided into a number of mutually exclusive and
collectively exhaustive heterogeneous groups or clusters;
• Step 2- Select an independent simple random sample of clusters using simple random
sampling.
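
The two steps can be sketched as below. The function name `cluster_sample`, the kebele labels and the household labels are made up for illustration only.

```python
import random


def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly select whole clusters and include every unit inside them."""
    rng = random.Random(seed)
    # Step 2: simple random sample of cluster names (the sampling units).
    chosen = rng.sample(sorted(clusters), num_clusters)
    # Every individual within a chosen cluster enters the sample.
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Step 1: hypothetical population divided into non-overlapping clusters.
clusters = {
    "kebele_01": ["hh_01", "hh_02", "hh_03"],
    "kebele_02": ["hh_04", "hh_05"],
    "kebele_03": ["hh_06", "hh_07", "hh_08", "hh_09"],
    "kebele_04": ["hh_10", "hh_11", "hh_12"],
}

chosen, units = cluster_sample(clusters, num_clusters=2, seed=3)
print("Selected clusters:", chosen)
print("Sampled households:", units)
```

Only a list of clusters is needed to draw the sample; a list of individual households is required only for the clusters that are actually chosen.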
Advantages of Cluster Sampling:
• A list of all individual study units in the reference population is not required,
• Reduces cost, and
• Simplifies field work and is convenient.
Disadvantages:
• The members of a cluster are often more homogeneous than the members of the
whole population; therefore, the sample may not be representative.
• The elements in a cluster may not have the same variation in characteristics as
elements selected individually from the population.

v. Multi-stage and multi-phase sampling (optional)

Multi-stage sampling: several levels of nested clusters are involved, where the sample units are
clusters at each stage except the final stage. It is known as 'multistage' because there are multiple
stages, or steps, in creating the sample. The first stage in multistage sampling is the same as
cluster sampling. Thus, it is a complex form of cluster sampling. It often includes both stratified
and cluster sampling techniques. For instance, we can cluster Ethiopia into regions, zones,
Woredas and kebeles, and finally take households from the sampled kebeles using simple random
sampling or stratified sampling.

Multi-phase sampling: is designed to make use of the information collected in one phase to
develop the sampling design for the next phase. For instance, in double-phase sampling, the first
phase may consider the relationship between income and expenditure and, using the information
obtained in the first phase, the surveyed households are divided into groups based on income levels
(strata). Alternatively, it is sometimes convenient and economical to collect certain items of information
from all the units of a sample and other items of usually more detailed information
from a sub-sample of the units constituting the original sample. This may be termed two-phase
sampling. For instance, if the collection of information concerning a variate, Y, is relatively
expensive, and there exists some other variate, X, correlated with it, which is relatively cheap to
investigate, it may be profitable to carry out sampling in two phases.

Advantage of Multi-stage Sampling

 Cuts the costs of preparing sampling frame


Disadvantages
 Gives less precise estimate than SRS for the same sample size

2.3.2 Non-Probability Sampling

In the case of non-probability sampling, not every unit in the population has a chance of being
included in the sample. It involves at least some degree of personal subjectivity instead of
following predetermined, probabilistic rules for selection.

There are four basic types of non-probability sampling techniques.


i. Convenience sampling
ii. Judgmental sampling
iii. Quota sampling
iv. Snowball sampling

i. Convenience sampling

Convenience sampling implies a sample drawn at the convenience of the researcher. It is common
in exploratory research. It does not lead to any conclusion about the population.

ii. Judgmental sampling

Sampling based on some judgment, gut-feelings or experience of the researcher. It is common in
commercial marketing research projects. If inference drawing is not necessary, these samples are
quite useful.

iii. Quota sampling


In this method, the decision maker requires the sample to contain a certain number of items with
a given characteristic. It is something like judgmental sampling. Within each quota, individuals are
selected by convenience or judgment rather than at random.

Example 2.5: Suppose we know that 54% of the adults in a community are females, and the
study requires 100 respondents as a sample. In quota sampling, we might interview the first 54
females and the first 46 males.
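
A minimal sketch of Example 2.5 is given below. The stream of respondents is hypothetical (invented here so that the code runs), and quota_sample is an assumed helper name; the point is only that respondents are taken in the order they are met until each quota is full.

```python
def quota_sample(respondents, quotas):
    """Take respondents in arrival order until every group's quota is filled."""
    remaining = dict(quotas)                 # places still open in each group
    selected = []
    for person_id, group in respondents:
        if remaining.get(group, 0) > 0:
            selected.append((person_id, group))
            remaining[group] -= 1
        if not any(remaining.values()):      # all quotas filled
            break
    return selected

# Hypothetical stream of 300 people met by the interviewer
stream = [(i, "female" if i % 3 else "male") for i in range(1, 301)]
sample = quota_sample(stream, {"female": 54, "male": 46})
print(len(sample))
```

No random mechanism is involved, which is why quota sampling is classified as a non-probability method.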

iv. Snowball sampling

Snowball sampling is used in studies involving respondents who are hard to find. To start with, the
researcher compiles a short list of sample units from various sources. Each of these respondents is
then contacted to provide names of other probable respondents.

2.4. Determinant factors of sample size

Size of sample means the number of sampling units selected from the population for
investigation. If the size of the sample is small, it may not represent the population and the inference
drawn about the population may be misleading. On the other hand, if the size of the sample is very
large, it may be too burdensome financially, require a lot of time and be seriously difficult to
manage. Hence the sample size should be neither too small nor too large.

The following factors should be considered while deciding the sample size:

i. The size of the population: the larger the size of the population, the bigger should be the
sample size.
ii. The resource available: if the resources available are vast a large sample size could be taken.
However, in most cases resources constitute a big constraint on sample size.
iii. The degree of accuracy or precision desired: the greater the degree of accuracy desired, the
larger should be the sample size. However, it does not necessarily mean that bigger samples
always ensure greater accuracy.
iv. Homogeneity or heterogeneity of the population: If the population consists of homogeneous
units a small sample may serve the purpose, but if the population consists of heterogeneous
units a large sample may be inevitable.


v. Nature of study: For an intensive and continuous study a small sample may be suitable. But
for studies which are not likely to be repeated and are quite extensive in nature, it may be
necessary to take a large sample size.
vi. Method of sampling adopted: The size of the sample is also influenced by the type of sampling
plan adopted. For example, if the sample is a simple random sample, it may necessitate a
bigger sample size. However, in a properly drawn stratified sampling plan, even a small
sample may give better results.
vii. Nature of respondent: Where it is expected a large number of respondents will not co-operate
and send back the questionnaire, a large sample should be selected.

Chapter summary

On the basis of a sample study we can predict and generalize the behavior of mass phenomena.
This is possible because there is no statistical population whose elements would vary from each
other without limit. Sampling is the process of learning about the population on the basis of a
sample drawn from it. The process of sampling involves three elements: selecting the sample,
collecting the information and making an inference about the population. The major reasons for
choosing sampling over a census are: cost/economy, timeliness, large population size,
inaccessibility of the entire population, the destructive nature of many tests, and accuracy. In the
case of probability sampling, each element in the population has a known chance (greater than zero) of being
included in the sample. In the case of non-probability sampling, not every unit in the population
has a chance of being included in the sample.

Self-evaluation test

1. Suppose there are 2100 subjects in the population and a sample of 30 subjects is
needed. Identify the elements to be included in the sample using the systematic random
sampling method.
2. Discuss the reasons for sampling.
3. Compare and contrast the stratified and cluster sampling methods.
4. Explain the concepts of quota and convenience sampling using appropriate examples.


UNIT THREE: DATA COLLECTION AND PRESENTATION

3.1 Introduction

The term “Data Collection” refers to all the issues related to data sources, scope of investigation
and sampling techniques. This chapter starts with the meaning
of data collection and then
advances to the classification of data. In addition, the different methods of
collecting data from primary sources are discussed. Finally, the section on tabular, graphic and
diagrammatic presentation explains the different ways of presenting data according to the nature of
the data. It introduces graphs such as the histogram, frequency polygon and ogive, and diagrams such as the bar
chart and pie chart.

3.2 Data collection

Collection of data implies a systematic and meaningful assembly of information for the
accomplishment of the objective of a statistical investigation. It refers to the methods used in
gathering the required information from the units under investigation.

The quality of data greatly affects the final output of an investigation. Hence, utmost care should
be given to the data collection process and every possible precaution should be taken to ensure
accuracy while collecting data. Otherwise, with inaccurate and inadequate data, the whole
analysis is likely to be faulty and the decisions to be taken will be misleading.

3.3 Classification of data

Statistical data may be obtained either from primary or secondary source. A primary source is a
source from where first-hand information is gathered. On the other hand, secondary source is
the one that makes data available, which were collected by some other agency. Clearly, a source,
which is not primary, is necessarily a secondary source. Primary sources are original sources of
data.


Based on the source, data can be classified as primary data and secondary data. Data obtained
from a primary source is called primary data. Likewise, data gathered from a secondary source
is known as secondary data. For example, assume that a simple study is to be conducted to see
the age distribution of citizens living with HIV/AIDS. Clearly, the variable of study is age. Data
about the age of these citizens may be obtained by interviewing them
directly. In this specific case, the citizens themselves are primary sources, and the data
collected from them are primary data. Alternatively, one may use records of hospitals and
other related agencies to obtain the ages without the need of tracing the
individuals personally. In that case, the records of the hospitals are secondary sources and
the data copied from such records are secondary data.

In most cases, secondary data is obtained from such sources as census and survey
reports, books, official records, reported experimental results, previous research papers, bulletins,
magazines, newspapers, web sites, and other publications. Different organizations and
government agencies publish information (data) in the form of reports, periodicals, journals, etc.
In the case of Ethiopia, the Central Statistics Authority (CSA) is the first to be mentioned in
publishing such relevant information (secondary data).

Discuss the secondary sources of data.

Advantages and Disadvantages of Primary and Secondary data

The advantages of primary data:

 The primary data gives more reliable, accurate and adequate information, which is
suitable to the objective and purpose of an investigation.
 Primary source usually shows data in greater detail.
 Primary data is free from errors that may arise from copying of figures from publications,
which is the case in secondary data.

The disadvantages of primary data are:

 The process of collecting primary data is time consuming and costly.


 Often, primary data gives misleading information due to lack of integrity of investigators
and non-cooperation of respondents in providing answers to certain delicate questions.

Advantages of secondary data:

 It is readily available and hence convenient and much quicker to obtain than primary data,
 It reduces time, cost and effort as compared to primary data,
 Secondary data may be available on subjects (cases) where it is impossible to collect
primary data. Such a case can be regions where there is war.

Disadvantages of Secondary data are:

 Data obtained may not be sufficiently accurate.


 Data that exactly suit our purpose may not be found.
 Error may be made while copying figures.

3.4 Method of primary data collection

After discussing the two sources of data, primary and secondary, it is logical to discuss the
methods employed in collecting data from primary sources. Different methods are applied for the
collection of primary data.

Primary data may be obtained by applying any of the following methods:

i. Personal Enquiry Method (Interview method)
ii. Questionnaire method
iii. Direct Observation method

i. Personal Enquiry Method (Interview method)

In the personal enquiry method, a question sheet is prepared, which is called a schedule. The schedule
contains all the questions which would extract a complete report from a respondent. Usually,
schedules are pre-tested so as to remove discrepancies such as ambiguous
and irrelevant questions. This pre-testing process is called a pilot survey. It is worth mentioning
that the schedule is not directly given to the respondent. Rather, it is the interviewer who asks
the questions on the schedule and jots down the interviewee’s (respondent’s) responses.


Depending on the nature of the interview, personal enquiry method is further classified into two
types.

Direct Personal Interview: It is a type of personal enquiry where there is a face-to-face contact
with the persons from whom the information is to be obtained. In other words, the investigator
contacts each respondent personally, without the interference of third party, and asks questions
given in the schedule one by one and notes down respondent’s replies on the schedule.

Indirect Personal Enquiry (Interview): It is the second type of personal enquiry, where the
investigator contacts third parties, called witnesses, who are capable of supplying the necessary
information. Here, the information is not collected directly from the respondent but from a third
person who knows the respondent well. Such an approach is useful in cases where the respondent
is expected to conceal information about himself or herself. For example, if an enquiry about the habit
of using condoms is distributed in a village, most of the villagers may not provide the correct
information. Thus, it would be wiser to get the required information from other parties, like the
nearby shop that sells condoms.

ii. Questionnaire method

Under this method, a list of questions related to the survey is prepared and sent to the various
respondents by post, Web sites, e-mail, etc. However, this method cannot be used if the
respondent is illiterate. It is a method that is often used in many statistical investigations.

The following are the major points that we need to take into account while preparing a
questionnaire.

 The number of questions should be small. Naturally, respondents are not comfortable
with lengthy questionnaires. Lengthy questionnaires usually bore respondents. Hence,
fifteen to twenty five questions in a questionnaire are optimal. If a lengthy questionnaire is
unavoidable, it should preferably be divided into two or more parts.
 The questions should be short, clear, simple and unambiguous. Moreover, the
questions must be arranged in a logical order so that natural and spontaneous reply to each
is induced. For instance, it is not appropriate to ask a person how many packets of
cigarette he/she smokes before asking whether he/she smokes or not.


 Questions of sensitive nature should be avoided. Sensitive questions are those questions
that are too personal and pecuniary like “Sources of income”, “Drinking habit”, etc. The
logic here is that respondents do not willingly answer sensitive questions. Such
information, if necessary, may be gathered through interview or through other indirect
questions.
 Questions should be capable of objective answers. As much as possible, avoid
subjective questions and keep to questions of fact. To this end, multiple answer questions
can be used.
 Mail questionnaires should be accompanied by a covering letter, which should state the
purpose of the questionnaire, promise confidentiality of responses, etc.

Furthermore, the questions should preferably be designed in such a way that they can easily be
answered as yes/no.

iii. Direct Observation method

In this approach, an investigator stays at the place of survey and notes down the observations
himself/herself. There are no enquiries in the case of direct observation. For example, an investigator
making a study on the nutritional status of children may directly (physically) measure the weight,
height, and other required parameters himself/herself. Direct observation is more experimental
and is usually applied in scientific studies. It is time consuming and also costly.

Explain the questionnaire and direct observation methods.

3.5 Data presentation

Data presentation is a statistical procedure of arranging and putting data in the form of tables,
graphs, charts and/or diagrams.

3.5.1 Tabular method of data presentation

Tabulation is the arrangement of information or data in tables. A table makes it possible for the
analyst to present a huge mass of data in a detailed orderly manner within a minimum of space.
There are various techniques of tabulation.


i. Data array

Data array is a table showing data arranged in descending or ascending order.

Descending: (100, 99, 98, 97, 96, 95, ...)

Ascending: (1, 2, 3, 4, 5, 6, 7, 8, 9, ...)

A data array offers the following advantages:

 Determine at a glance the highest and lowest values contained in the data.
 Identify groups of similar data values.
 Easily see differences between values in the data.
ii. Frequency distribution
A frequency distribution is a table that groups data into non-overlapping intervals called classes
and records the number of observations in each class. The frequency distribution associates each
value Xi with its corresponding frequency (fi). If fi is the frequency of Xi (where Xi is the value of
each case/observation), then the set of values Xi together with their corresponding frequencies fi is
known as a frequency distribution. It summarizes data in a condensed form that can be readily
understood and easily interpreted.

Key Terms in frequency distribution

Class: Each category of the frequency distribution.

Frequency: The number of times an item or a particular observation occurs/repeats itself is
known as frequency.

Total frequency: The sum of class frequencies

Class limits (CL): the lowest and highest values that can be included in a class, such that there is
a gap between successive classes, are called class limits. The lower class limit (LCL) of a class is a
value such that no lower value can fall into that class, whereas the upper class limit (UCL) of a
class is a value such that no higher value can fall into that class.

Class Boundary (CB) or Real class limits: class boundaries are the lowest and the highest
values in each class when there is no gap between successive classes. To work with the
distribution of a variable as if it were continuous, we make use of these real class limits (also
known as class boundaries).

Finding class boundaries

Let d = LCL of a class minus UCL of the preceding class. Add half of this difference to all
upper class limits to get the upper class boundaries (UCB), and subtract it from all lower class
limits to get the lower class boundaries (LCB). That is,

$$\text{UCB}_i = \text{UCL}_i + \tfrac{1}{2}d \quad\text{and}\quad \text{LCB}_i = \text{LCL}_i - \tfrac{1}{2}d$$

Example 3.1 Find the class boundaries for the following frequency distribution.

Class limit frequency


24-30 3
31-37 1
38-44 5
45-51 9
52-58 6
59-65 1

Solution: d = 31 - 30 = 1, so $\tfrac{1}{2}d = 0.5$. For the first class,
UCB = UCL + 0.5 = 30 + 0.5 = 30.5, and LCB = LCL - 0.5 = 24 - 0.5 = 23.5. Continuing in this way,
we get the class boundaries shown in the second column of the table below.

Class limit Class boundary Frequency


24-30 23.5-30.5 3
31-37 30.5-37.5 1
38-44 37.5-44.5 5
45-51 44.5-51.5 9
52-58 51.5-58.5 6
59-65 58.5-65.5 1
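
The calculation in Example 3.1 can be reproduced with a short Python sketch. The function name class_boundaries is an assumption made for this illustration, and it assumes the gap d is the same between all successive classes.

```python
def class_boundaries(limits):
    """limits: list of (LCL, UCL) pairs in ascending order, equal gaps assumed."""
    d = limits[1][0] - limits[0][1]          # LCL of a class minus UCL of the previous class
    return [(lcl - d / 2, ucl + d / 2) for lcl, ucl in limits]

limits = [(24, 30), (31, 37), (38, 44), (45, 51), (52, 58), (59, 65)]
boundaries = class_boundaries(limits)
for (lcl, ucl), (lcb, ucb) in zip(limits, boundaries):
    print(f"{lcl}-{ucl}  ->  {lcb}-{ucb}")
```
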
Class width (w): It is the difference between the lower limits (or between the upper limits) of any
two consecutive classes.
It is also the difference between the upper and lower class boundaries of any class, and the
difference between any two consecutive class marks.

Approximately, $\text{class width} = \frac{\text{Range}}{\text{Number of classes desired}}$

Range = Maximum value – Minimum value

Class mark (CM) or mid-point: The class mark is the mid-point of the class interval, that is, the
value which lies midway between the lower and upper limits of the class. It is obtained as:

$$\text{CM} = \frac{\text{LCL} + \text{UCL}}{2} \quad\text{or}\quad \text{CM} = \frac{\text{LCB} + \text{UCB}}{2}$$

In further analysis of the data (measures of central tendency, measures of variation,
etc.), the class mark (CM) is used to represent all the items in that class.

Unit of measurement (U) or decimal point (d): the distance between two possible consecutive
measures. It is usually taken as 1, 0.1, 0.01, 0.001, etc.

Guidelines for constructing frequency distribution

1. Decide the number of classes (k): Select the number of classes desired, usually between 5
and 20, or use the widely used Sturges' rule, $k = 1 + 3.322 \log_{10} n$, where k is the number
of classes desired and n is the total number of observations.

Example 3.2: Using Sturges' rule, if n = 10, k = 4.32 ≈ 4; if n = 100, k = 7.644 ≈ 8;
if n = 1000, k = 10.97 ≈ 11.

We can also use the $2^k$ rule: select the smallest number (k) for the number of classes such that $2^k$ is
greater than the number of observations (n).

Example 3.3: Using the $2^k$ rule, if n = 10, then $2^3 = 8 < 10$ and $2^4 = 16 > 10$, so the recommended
number of classes is 4.
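
The two rules above can be checked with a small Python sketch. The function names are assumptions made for illustration, and the first rule is rounded to the nearest whole number, as in Example 3.2.

```python
import math

def sturges_classes(n):
    """Number of classes k = 1 + 3.322 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))

def two_k_classes(n):
    """Smallest k such that 2**k is strictly greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

for n in (10, 50, 100, 1000):
    print(n, sturges_classes(n), two_k_classes(n))
```

The two rules need not agree: for n = 50, the first gives 7 classes and the second gives 6.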

2. Compute the Range (R) = Maximum value - Minimum value. For instance, if the highest
value is 134 and the lowest value is 34, then

Range (R) = 134 - 34 = 100

3. Determine the Class Width (w): If the number of classes is known and it is decided to use
a uniform class width, we use $w = \frac{\text{Range}}{k}$, rounded up to the nearest integer, where
Range is the difference between the highest and the lowest value of the data.

As far as possible, a class width of 5 or a multiple of 5 is convenient and facilitates
computations.

4. Determine the Class Limits: Pick a suitable starting point less than or equal to the minimum
value. The starting point is called the lower limit of the first class. Continue to add the class
width to this lower limit to get the rest of the lower limits.

5. Determine the Upper class limits: To find the upper limit of the first class, subtract 1 (one)
from the lower limit of the second class. Then continue to add the class width to this upper
limit to find the rest of the upper limits.

6. Determine the frequency of each class: The frequency of each class can be determined
simply by counting the number of observations belonging to each class.
7. Classes must be mutually exclusive: A given data value should fall into only one
class/category.
Limits such as the following would be inappropriate:
class Frequency
15-20 3
20-25 7

Class frequency
17.0-23.5 4
22.0-27.5 3

8. The class must be exhaustive: No data value should fall outside the range covered by the
frequency distribution.


9. If possible, the classes should have equal widths. It is difficult to interpret both frequency
distribution and their graphical presentation if there is unequal class width. One exception
occurs when there is an open-ended distribution, i.e., it has no specific beginning value or no
specific ending value.
10. Sum up the frequency of each class to check whether it is equal to the total number of data
collected from the field or not.

Example 3.4: Construct a frequency distribution for the following raw data on
marks (out of 100) obtained by 50 students in Statistics.

57, 53, 65, 55, 50, 45, 64, 52, 16, 46, 42, 63, 33, 64, 53, 25, 54, 35, 48, 55, 70, 47, 39, 58, 52,
36, 65, 75, 26, 20, 55, 60, 83, 61, 45, 63, 49, 42, 35, 18, 51, 45, 42, 65, 39, 59, 45, 41, 30, 40.

Solution:
a. Since n = 50, using Sturges' rule, the number of classes is
$k = 1 + 3.322 \log_{10} 50 = 6.64 \approx 7$. Thus, the number of classes is 7.

By applying the $2^k$ rule with n = 50, $2^5 = 32 < 50$ and $2^6 = 64 > 50$, so the recommended
number of classes is 6.

Since Sturges' rule is widely used, we take the number of classes to be k = 7.

b. Range = highest value – lowest value = 83 – 16 = 67.

c. Class width: $w = \frac{\text{Range}}{k} = \frac{67}{7} = 9.57 \approx 10$.

d. Since the smallest value is 16, LCL1 can be 15 and UCL1 should be 24; the
frequency distribution would look like:

Marks      Frequency
15 – 24    3
25 – 34    4
35 – 44    10
45 – 54    15
55 – 64    12
65 – 74    4
75 – 84    2
Total      50
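
As a check on Example 3.4, the following Python sketch rebuilds the table from the raw marks. The function name frequency_distribution and its parameters are assumptions made for this illustration; the starting point 15 and k = 7 are taken from the solution above.

```python
import math

marks = [57, 53, 65, 55, 50, 45, 64, 52, 16, 46, 42, 63, 33, 64, 53, 25, 54,
         35, 48, 55, 70, 47, 39, 58, 52, 36, 65, 75, 26, 20, 55, 60, 83, 61,
         45, 63, 49, 42, 35, 18, 51, 45, 42, 65, 39, 59, 45, 41, 30, 40]

def frequency_distribution(data, k, start, unit=1):
    """Return (LCL, UCL, frequency) for k equal-width classes beginning at start."""
    width = math.ceil((max(data) - min(data)) / k)   # range / k, rounded up
    table = []
    for i in range(k):
        lcl = start + i * width
        ucl = lcl + width - unit                     # one unit below the next LCL
        freq = sum(lcl <= x <= ucl for x in data)    # count observations in the class
        table.append((lcl, ucl, freq))
    return table

table = frequency_distribution(marks, k=7, start=15)
for lcl, ucl, freq in table:
    print(f"{lcl} - {ucl}: {freq}")
```

The last guideline (the class frequencies must add up to the number of observations) can be checked by summing the third column.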


Note: For the class boundaries see the nature of the data.
 If there is no decimal point, then U or d = 1.
 If there is one digit after the decimal, then U or d = 0.1.
 If there are two digits after the decimal, then U or d = 0.01.

Generally, to construct frequency distribution use the following

Types of frequency distributions

A. Relative frequency: The ratio of the frequency of a case to the total number of observations
(sample size = n). That is,

$$rf = \frac{f}{n} \quad\text{or}\quad rf = \frac{f}{\sum_{i} f_i}$$

where f is the number of times a given element repeats itself (absolute frequency), n is the total
number of observations or total frequency, and the sum runs over all values (or classes).
B. Absolute frequency: shows the absolute number of occurrences of an entry or group of
entries in each class.

Example 3.6: Compute the relative frequency from the following frequency distribution.

i     Value of X (Xi)    Absolute frequency    Relative frequency
1     150                1                     1/33
2     152                4                     4/33
3     154                6                     6/33
4     156                2                     2/33
5     158                2                     2/33
6     160                7                     7/33
7     162                2                     2/33
8     164                4                     4/33
9     166                1                     1/33
10    168                3                     3/33
11    170                1                     1/33
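
The relative frequencies of Example 3.6 can be computed as in the sketch below. Exact fractions are used so that the results can be compared with the table; note that Python reduces fractions to lowest terms, so 6/33 is displayed as 2/11.

```python
from fractions import Fraction

# Absolute frequencies from Example 3.6 (value of X -> frequency)
freq = {150: 1, 152: 4, 154: 6, 156: 2, 158: 2, 160: 7,
        162: 2, 164: 4, 166: 1, 168: 3, 170: 1}

n = sum(freq.values())                                  # total frequency
rel_freq = {x: Fraction(f, n) for x, f in freq.items()}

for x, rf in rel_freq.items():
    print(x, rf, round(float(rf), 3))
```

The relative frequencies always add up to 1, which is a quick check on the arithmetic.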

C. Cumulative Frequency Distribution

The cumulative frequency of a class tells us how many values fall below or above that class.
As the name indicates, it accumulates frequencies starting at the lowest or the highest class
boundary. There are two types of cumulative frequency distributions: the “less than” and the
“more than” cumulative frequency distributions.

 The “less than” cumulative frequency distribution is obtained by adding the frequencies of
all the preceding (previous or earlier) classes to the frequency of the class against which it
is written. In other words, it is obtained by adding
successively the frequencies of all the previous classes including the class under
consideration. The accumulation runs from the lowest class to the highest class.
 The “more than” cumulative frequency distribution is obtained by adding the frequencies
of the succeeding (later) classes to the frequency of that class. In other words, it is
obtained by finding the cumulative total of frequencies starting from the highest class and
moving to the lowest class.
Example 3.7: Consider the distribution of marks of 50 students:

Class     fi    Class        Marks of          Less than     Marks of        More than
limits          boundary     students          cumulative    students        cumulative
10-19     2     9.5-19.5     Less than 19.5    2             9.5 or more     48+2=50
20-29     4     19.5-29.5    Less than 29.5    2+4=6         19.5 or more    44+4=48
30-39     7     29.5-39.5    Less than 39.5    6+7=13        29.5 or more    37+7=44
40-49     10    39.5-49.5    Less than 49.5    13+10=23      39.5 or more    27+10=37
50-59     16    49.5-59.5    Less than 59.5    23+16=39      49.5 or more    11+16=27
60-69     8     59.5-69.5    Less than 69.5    39+8=47       59.5 or more    3+8=11
70-79     3     69.5-79.5    Less than 79.5    47+3=50       69.5 or more    3


Interpretation of less than cumulative frequency: for instance, 23 out of 50 students have marks
less than 49.5.

Interpretation of more than cumulative frequency: for instance, 27 out of 50 students have
marks 49.5 or more.
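
Both cumulative columns of Example 3.7 can be generated from the class frequencies, as in this minimal Python sketch (the variable names are chosen here for illustration).

```python
from itertools import accumulate

freqs = [2, 4, 7, 10, 16, 8, 3]              # class frequencies, lowest class first

# "Less than": running total from the lowest class upwards
less_than = list(accumulate(freqs))

# "More than": running total from the highest class downwards,
# then reversed back into lowest-class-first order
more_than = list(accumulate(reversed(freqs)))[::-1]

print(less_than)
print(more_than)
```

For any class, the "less than" value of the previous class plus the "more than" value of that class equals the total frequency (for example, 23 + 27 = 50).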

3.5.2 Graphical method of data presentation

This topic deals with organizing a set of raw data into a frequency distribution (FD)
and describing the distribution graphically in a histogram, a frequency polygon, or a cumulative
frequency curve (ogive). Other types of numerical information can be summarized and
presented in the form of a bar chart, pie chart or pictogram.

Presenting data graphically or diagrammatically helps to;

 Understand the information easily


 Make the data attractive to the eye
 Make comparisons of items easily
 Draw attention of the observer
i. Histogram

After a frequency distribution is completed, the next step will be to construct a “picture” of
these data values using a histogram. A histogram is a graph consisting of a series of adjacent
rectangles whose bases are equal to the class width of the corresponding classes and whose
heights are proportional to the corresponding class frequencies. Here, class boundaries are
marked along the horizontal axis (x – axis) and the class frequencies along the vertical axis ( y –
axis) according to a suitable scale. It describes the shape of the data.

Steps in constructing a histogram:

 Draw x – y axis and Label the class boundaries on the x – axis, frequency on the Y – axis,
 Using the frequencies as the heights, draw vertical bars for each class

Example 3.8: Consider the following grouped frequency distribution and construct a histogram.

Class (Xi)    Class boundaries    Frequency (fi)
3 – 6         2.5 – 6.5           4
7 – 10        6.5 – 10.5          7
11 – 14       10.5 – 14.5         10
15 – 18       14.5 – 18.5         6
19 – 22       18.5 – 22.5         3
Total                             30

[Figure: Histogram for the above distribution. x-axis: class boundaries (CB) 2.5, 6.5, 10.5, 14.5, 18.5, 22.5; y-axis: class frequency (fi).]
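
As a rough stand-in for the drawn histogram of Example 3.8, the sketch below prints one bar per class using text characters. It is turned on its side (bars run horizontally rather than vertically), and the function name text_histogram is an assumption made for this illustration.

```python
def text_histogram(boundaries, freqs):
    """Return one text line per class: class boundaries, a bar of '#', and the frequency."""
    lines = []
    for (lcb, ucb), f in zip(boundaries, freqs):
        lines.append(f"{lcb:>5.1f} - {ucb:<5.1f} | {'#' * f} {f}")
    return lines

boundaries = [(2.5, 6.5), (6.5, 10.5), (10.5, 14.5), (14.5, 18.5), (18.5, 22.5)]
freqs = [4, 7, 10, 6, 3]
lines = text_histogram(boundaries, freqs)
print("\n".join(lines))
```

Because the classes share boundaries, the bars are adjacent, and the length of each bar is proportional to the class frequency.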

ii. Frequency polygon

It is a line graph of a frequency distribution. Although a histogram does demonstrate the shape of
the data, perhaps the shape can be more clearly illustrated by using a frequency polygon. Here,
you merely connect the centers of the tops of the histogram bars (located at the class midpoints)
with a series of straight lines. The resulting figure is a frequency polygon. Here the class marks
are plotted along the x – axis and the class frequencies along the y – axis. Empty classes are
included at each end so that the curve is anchored to the x – axis.
Steps in constructing a frequency polygon:
 Find the class mark for each class,
 Draw the x – y axis. Label the x – axis with the class marks and use a suitable scale on
the y – axis for the frequencies (absolute or relative), and
 Plot the points (class mark, frequency) and connect them with line segments.
Example 3.9: Consider the following grouped frequency distribution and construct a frequency
polygon.

Marks      Frequency
15 – 24    3
25 – 34    4
35 – 44    10
45 – 54    15
55 – 64    12
65 – 74    4
75 – 84    2
Total      50

[Figure: Frequency polygon for the above distribution. x-axis: marks, scaled from 10 to 90; y-axis: frequency, scaled from 0 to 20.]
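
The points to be plotted for the frequency polygon of Example 3.9 can be listed with the sketch below. The function name polygon_points is an assumption; it adds one empty class at each end, as described in the steps above, and assumes equal class widths.

```python
def polygon_points(limits, freqs):
    """Return (class mark, frequency) points, with an empty class added at each end."""
    width = limits[1][0] - limits[0][0]                  # difference between successive LCLs
    marks = [(lcl + ucl) / 2 for lcl, ucl in limits]     # class marks
    xs = [marks[0] - width] + marks + [marks[-1] + width]
    ys = [0] + list(freqs) + [0]                         # empty classes have frequency 0
    return list(zip(xs, ys))

limits = [(15, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 74), (75, 84)]
freqs = [3, 4, 10, 15, 12, 4, 2]
points = polygon_points(limits, freqs)
print(points)
```

Joining these points in order with straight line segments gives the polygon, which starts and ends on the x-axis.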

iii. Cumulative frequency graph (Ogive curves)

The Ogive curve is a graph that represents the cumulative frequencies for the classes in a
frequency distribution. It is also known as the cumulative frequency curve, which is a graphical
representation of a cumulative frequency distribution. There are two types of Ogive curves: the
“less than Ogive” and the “more than Ogive”.

a. The “less than” Ogive – the “less than” cumulative frequencies are plotted against the upper
class boundaries of their respective classes, and the points are joined by either straight lines or
smooth curves.
b. The “more than” Ogive – in this case, the “more than” cumulative frequencies are
plotted against the lower class boundaries of their respective classes, and the
points may be joined by straight lines or smooth curves.

Steps in constructing a cumulative frequency graph

- Find the cumulative frequency for each class,
- Draw the x–y axes and label the x-axis with the class boundaries and the y-axis with the
cumulative frequencies, and
- For the “less than” Ogive, plot the “less than” cumulative frequencies against the upper
class boundaries of the respective classes and join adjacent points with lines. For the
“more than” Ogive, plot the “more than” cumulative frequencies against the lower class
boundaries of the respective classes and join adjacent points with lines.

Example 3.10: Construct a “less than” and a “more than” Ogive using the following frequency
distribution.
Class boundary   Marks            Less than cumulative   Marks          More than cumulative
                                  frequency                             frequency
                 Less than 9.5    0
9.5-19.5         Less than 19.5   2                      9.5 or more    50
19.5-29.5        Less than 29.5   6                      19.5 or more   48
29.5-39.5        Less than 39.5   13                     29.5 or more   44
39.5-49.5        Less than 49.5   23                     39.5 or more   37
49.5-59.5        Less than 59.5   39                     49.5 or more   27
59.5-69.5        Less than 69.5   47                     59.5 or more   11
69.5-79.5        Less than 79.5   50                     69.5 or more   3
                                                         79.5 or more   0
[Figure: Less than cumulative frequency (Ogive) curve. x-axis: upper class boundary; y-axis:
cumulative frequency. Plotted points: (9.5, 0), (19.5, 2), (29.5, 6), (39.5, 13), (49.5, 23),
(59.5, 39), (69.5, 47), (79.5, 50).]

[Figure: More than cumulative frequency (Ogive) curve. x-axis: lower class boundary; y-axis:
cumulative frequency. Plotted points: (9.5, 50), (19.5, 48), (29.5, 44), (39.5, 37), (49.5, 27),
(59.5, 11), (69.5, 3), (79.5, 0).]

Note that Ogives can help us estimate the number of observations falling above, below or
between given values.
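As a rough illustration, the two cumulative columns of Example 3.10 can be reproduced in Python. The class frequencies below are not listed in the example; they are inferred here from successive differences of its cumulative column (an assumption of this sketch), and the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries from Example 3.10
boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
# Class frequencies inferred from the "less than" cumulative column (assumption)
frequencies = [2, 4, 7, 10, 16, 8, 3]

total = sum(frequencies)

# "Less than" cumulative frequency at each boundary (0 below the first class)
less_than_cf = [0] + list(accumulate(frequencies))
# "Or more" cumulative frequency at each boundary
more_than_cf = [total - cf for cf in less_than_cf]

# Points to plot for each Ogive: (class boundary, cumulative frequency)
less_than_points = list(zip(boundaries, less_than_cf))
more_than_points = list(zip(boundaries, more_than_cf))
print(less_than_points)
print(more_than_points)
```

At every boundary the two cumulative frequencies add up to the total number of observations, which is a quick consistency check on a hand-built table.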

3.5.3 Diagrammatic method of data presentation

Diagrams and graphs are extremely useful because they give a bird’s-eye view of the entire data
and, therefore, the information presented is easily understood. The main types of diagram are
line graphs, bar charts and pie charts.

i. Line graphs (charts)

A line graph represents the relationship between time (on the x-axis) and the values of a variable
(on the y-axis). The values are recorded with respect to the time of occurrence. Time series data
are most effectively presented on a line chart. Line charts are particularly effective for
business and economic data because they show the change or trend in a variable over time.
Example 3.11: Draw a line graph for the following time series.
Year     1986   1987   1988   1989   1990   1991
Values   20     10     30     15     25     10

[Figure: A line graph showing the above time series. x-axis: Year (1986 to 1991); y-axis: Values
(0 to 40).]


ii. Bar chart (Bar diagram)

A bar chart is a series of equally spaced bars of uniform width, where the height (length) of a
bar represents the amount (magnitude) or frequency corresponding to a category. Bars may be
drawn horizontally or vertically. Vertical bar graphs are preferred as they allow comparison with
other bars. A bar chart can be a simple, multiple or subdivided bar chart.

A. Simple bar chart

It represents a single set of data (variable) classified into different categories. A single bar
is drawn for each category, with height equal to the respective frequency.

Example 3.12: Rating of students for Introduction to Statistics

Rating          Frequency   Relative frequency   Percent (%)
Poor            2           0.10                 10
Below average   3           0.15                 15
Average         5           0.25                 25
Above average   9           0.45                 45
Excellent       1           0.05                 5

[Figure: Simple bar chart of the rating frequencies: Poor 2, Below average 3, Average 5, Above
average 9, Excellent 1.]

B. Multiple bar chart

Here two or more bars are grouped in each category to represent two or more interrelated sets of
data. The bars of related variables are kept adjacent to each other for every set of values.
These charts can be used if the overall total is not required. Each bar is shaded or colored
separately and a key is given to distinguish them.

Example 3.13: The following table shows the rating of female and male students in the
Introduction to Statistics course.

Rating          Total   Male   Female
Poor            2       1      1
Below average   3       2      1
Average         5       3      2
Above average   9       4      5
Excellent       1       0      1

[Figure: Multiple bar chart of the ratings, with the Male and Female bars side by side for each
rating (key: Male, Female).]

C. Subdivided bar chart

It is used to present data by subdividing a single bar in proportion to the component frequencies.
Each portion of the bar is then shaded or colored and a key is given to distinguish them.

Example 3.14: Using the table given in the above example, construct a subdivided bar chart.

[Figure: Subdivided bar chart of the ratings, each bar split into Male and Female portions (key:
Male, Female).]

iii. Pie chart

A pie chart is a circle divided into sectors with areas proportional to the values of the
components they represent. It shows the components in terms of percentages, not in absolute
magnitude. The angle formed at the center by each sector has to be proportional to the value
represented; that is, central angle = relative frequency × 360°.

Example 3.15: Using the data given in the table below, draw a pie chart.

Rating          Frequency   Relative frequency   %    Size of central angle (degrees)
Poor            2           0.10                 10   36
Below average   3           0.15                 15   54
Average         5           0.25                 25   90
Above average   9           0.45                 45   162
Excellent       1           0.05                 5    18

[Figure: Pie chart of the ratings (%): Poor 10, Below average 15, Average 25, Above average 45,
Excellent 5.]
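The central angles in the table of Example 3.15 follow from multiplying each relative frequency by 360 degrees. A minimal Python sketch of that arithmetic, with illustrative variable names, is:

```python
# Rating frequencies from Example 3.15
ratings = {
    "Poor": 2,
    "Below average": 3,
    "Average": 5,
    "Above average": 9,
    "Excellent": 1,
}

total = sum(ratings.values())

# Central angle of each sector = (frequency / total) * 360 degrees
angles = {name: 360 * freq / total for name, freq in ratings.items()}

for name, angle in angles.items():
    print(name, angle)
```

The angles always add up to 360 degrees, which is a useful check before drawing the chart.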

Chapter summary

Data collection is the most important stage in statistical data analysis. Before we begin data
collection, we should consider the purpose of collecting the data, the data to be collected, the
source from which we can get the data and the method(s) to be used. Based on the source, data
can be classified as primary data and secondary data. Primary data may be obtained by any of the
following methods: the personal enquiry (interview) method, the questionnaire method and the
direct observation method. The whole idea of organizing a raw data set is to present the
information in a concise way. A frequency table showing categories or classes is the most
convenient form of data organization. The graphical or diagrammatic presentation of a data set is
meant to provide a clear picture of the emerging trend or pattern in the data. Often, more than
one method can be employed for the same data set to get a pictorial view of the information
collected from the sample or directly from the target population. Among all the methods, bar
diagrams and pie charts are widely used for categorical data, whereas the histogram, frequency
polygon and Ogive are most informative for quantitative data.


Self-evaluation test

1. Discuss the advantages and disadvantages of primary and secondary data.


2. Compare direct personal interview and indirect personal enquiry.
3. Given the following data construct a grouped frequency distribution.

100 112 114 112 110 120 125 130 115 129 133 132 104

134 120 112 134 123 130 127 129 105 126 106 107 124

110 126 124 122 130 117 132 130 115 109 108 111

120 132 124 102 102 120 125 123 118 121 124 107

4. Consider the following grouped frequency distribution data and construct a histogram.

Marks frequency
15 - 24 3
25 – 34 4
35 – 44 10
45 -54 15
55 – 64 12
65 – 74 4
75 – 84 2
Total 50

5. Using the above table construct a frequency polygon and Ogive curves.


UNIT FOUR: MEASURE OF CENTRAL TENDENCY

4.1. Introduction

One of the most important objectives of statistical analysis is to get a single value that
describes the characteristics of an entire mass of unwieldy data. Such a value is called the
central value, an ‘average’ or the expected value of the variable. The average is thus a typical
value that represents a group of values.

A measure of central tendency is a typical value, or average, that describes a data set by its
center. Such measures indicate the middle, most likely or most frequent values. In other words,
they tell us where the center of the distribution of the data is located.

Objectives of Measures of Central Tendency

There are three main objectives of studying measures of central tendency.

i. To get a single value that describes the characteristics of the entire population. By
condensing the mass of data into one single value, measures of central tendency enable us to
get a clear view of the entire data of a distribution. Thus one value can represent
thousands or even millions of values.
ii. To facilitate comparison. By reducing the mass of data to a single figure, measures of
central tendency enable comparisons to be made between two or more distributions.
iii. To support further statistical studies.

Requisites of a good measure of central tendency

A good measure of central tendency should be

- Easy to understand.
- Simple to compute.
- Based on all the observations.
- Not unduly affected by extreme observations or outliers.
- Rigidly defined, that is, it should have one and only one interpretation.
- Stable with respect to sampling.
- Capable of further algebraic treatment.

What is a measure of central tendency? Discuss why we use measures of central tendency.

The summation notation

Summation or sigma notation is a convenient and simple way to give a concise expression for the
sum of the values of a variable. In statistics, the symbol ∑ (the Greek capital letter sigma)
means to add or find the sum.

For example, $\sum x_i$ means to add the numbers represented by the variable X. Thus, if X
represents 7, 4, 9, 5 and 10, then $\sum x_i = 7 + 4 + 9 + 5 + 10 = 35$.

Sometimes a subscript notation is used, such as $\sum_{i=1}^{10} x_i$. This notation means to
find the sum of ten numbers represented by X. It is read as follows: sum the values of $x_i$
from $x_1$ through $x_{10}$.

The notation $\sum (x - \bar{x})^2$ means perform the following steps:

1) Find the mean, $\bar{x} = \frac{\sum x}{n}$

2) Subtract the mean from each value

3) Square the answers

4) Find the sum
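The four steps can be traced with the values 7, 4, 9, 5 and 10 used earlier. The short Python sketch below is only an illustration; the variable names are chosen for this sketch.

```python
values = [7, 4, 9, 5, 10]

# Step 1: find the mean
n = len(values)
mean = sum(values) / n

# Step 2: subtract the mean from each value
deviations = [x - mean for x in values]

# Step 3: square the answers
squared_deviations = [d ** 2 for d in deviations]

# Step 4: find the sum
sum_of_squares = sum(squared_deviations)

print(mean, deviations, squared_deviations, sum_of_squares)
```

Here the mean is 7, the deviations are 0, -3, 2, -2 and 3, and the sum of their squares is 0 + 9 + 4 + 4 + 9 = 26.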

Properties of the summation notation

i. $\sum_{i=1}^{n} c = nc$, where $c$ is a constant
ii. $\sum_{i=1}^{n} b x_i = b \sum_{i=1}^{n} x_i$, where $b$ is a constant
iii. $\sum_{i=1}^{n} (a + b x_i) = na + b \sum_{i=1}^{n} x_i$, where $a$ and $b$ are constants
iv. $\sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i$
v. In general, $\sum_{i=1}^{n} x_i y_i \neq \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i$
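These properties can be checked numerically on small made-up data. In the Python sketch below, the lists x and y and the constants a, b and c are arbitrary illustrative choices, not values from the text.

```python
# Arbitrary illustrative data and constants (assumptions for this sketch)
x = [7, 4, 9, 5, 10]
y = [1, 2, 3, 4, 5]
a, b, c = 3, 2, 6
n = len(x)

# Property i: summing a constant n times gives n * c
prop_i = sum(c for _ in range(n)) == n * c

# Property ii: a constant factor can be taken outside the sum
prop_ii = sum(b * xi for xi in x) == b * sum(x)

# Property iii: sum of (a + b * x_i) equals n * a + b * sum of x_i
prop_iii = sum(a + b * xi for xi in x) == n * a + b * sum(x)

# Property iv: the sum of a sum is the sum of the sums
prop_iv = sum(xi + yi for xi, yi in zip(x, y)) == sum(x) + sum(y)

# Property v: the sum of products generally differs from the product of sums
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)

print(prop_i, prop_ii, prop_iii, prop_iv, sum_of_products, product_of_sums)
```

For these numbers the sum of products is 112 while the product of the sums is 35 × 15 = 525, which illustrates property v.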


4.2 Types of measures of central tendency

The following are the most important types of measures of central tendency:
4.2.1 The mean (arithmetic, weighted, geometric, harmonic)
4.2.2 The mode
4.2.3 The median
4.2.1 The Mean
I. Arithmetic Mean
The arithmetic mean is the sum of the data set values divided by the number of observations.
The arithmetic mean, or average value of a variable, is the most important numerical measure of
central tendency.

Ungrouped data

For ungrouped data, the population mean (usually denoted by μ) is the sum of all the
population values divided by the total number of population values:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where: N = number of elements in the population
μ = population mean

For ungrouped data, the sample mean is the sum of all the sample values divided by the number
of sample values:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

where: $\bar{X}$ = sample mean
n = number of elements in the sample (sample size)

Example 4.1: A sample of six teachers received the following salaries (in hundreds of Birr): 12,
13, 15, 19, 13 and 18. Find the mean salary.

$$\bar{X} = \frac{12 + 13 + 15 + 19 + 13 + 18}{6} = \frac{90}{6} = 15$$

Therefore the mean salary of the teachers is 1,500 Birr.

Grouped data

The mean of a sample of data organized in a frequency distribution is computed by the following
formula:

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$

where: $f_i$ = frequency of the i-th class
$X_i$ = class mark of the i-th class
k = number of classes

Example 4.2: Compute the arithmetic mean for the following grouped data:

Class boundaries   0-10   10-20   20-30   30-40   40-50   50-60
Frequency          5      10      25      30      20      10

Solution:
- First find the class mark
- Find the product of the frequency and the class mark
- Lastly find the mean using the formula

Class boundaries   Mid point (m)   No. of students (f)   fm
0-10               5               5                     25
10-20              15              10                    150
20-30              25              25                    625
30-40              35              30                    1,050
40-50              45              20                    900
50-60              55              10                    550
Total                              N = 100               ∑fm = 3,300

$$\bar{X} = \frac{\sum fm}{N} = \frac{3{,}300}{100} = 33$$

Therefore the arithmetic mean of the above grouped distribution is 33.
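The three solution steps of Example 4.2 can be written as a small Python function. The function name grouped_mean is an illustrative choice for this sketch, not part of any library.

```python
def grouped_mean(class_boundaries, frequencies):
    """Mean of grouped data: sum of (frequency * class mark) / total frequency."""
    # Class mark (mid point) of each class
    marks = [(lower + upper) / 2 for lower, upper in class_boundaries]
    # Sum of the products f * m
    total_fm = sum(f * m for f, m in zip(frequencies, marks))
    return total_fm / sum(frequencies)


# Data of Example 4.2
class_boundaries = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [5, 10, 25, 30, 20, 10]

mean_42 = grouped_mean(class_boundaries, frequencies)
print(mean_42)
```

With these data the sum of fm is 3,300 and the total frequency is 100, so the function returns 33.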

Calculate the arithmetic mean from the following data:

Marks             0-10   10-20   20-30   30-40
No. of students   5      12      25      8

Short-cut method: The short-cut formula can be used if the figures in the calculation have many
digits. The formula for calculating the arithmetic mean using the short-cut method is:

For ungrouped data: $\bar{X} = A + \frac{\sum d}{N}$

where A is the assumed mean and d is the deviation of the items from the assumed mean,
i.e., d = (X − A).

For grouped data: $\bar{X} = A + \frac{\sum fd}{N}$

where A is the assumed mean; d = (X − A); and N is the total number of observations, i.e., N = Σf.

Example 4.3: For the following discrete series, find the arithmetic mean by applying both the
direct and the short-cut methods.

Weight in kilograms   20   30   40   50   60   70
No. of students       8    12   20   10   6    4

Solution: Direct method

Weight in kilograms (X)   No. of students (f)   fX
20                        8                     160
30                        12                    360
40                        20                    800
50                        10                    500
60                        6                     360
70                        4                     280
Total                     N = 60                ΣfX = 2,460

$$\bar{X} = \frac{\sum fX}{N} = \frac{2{,}460}{60} = 41$$

Short-cut method

Let the assumed mean be A = 40.

Weight in kilograms (X)   No. of students (f)   d = X − 40   fd
20                        8                     −20          −160
30                        12                    −10          −120
40                        20                    0            0
50                        10                    +10          +100
60                        6                     +20          +120
70                        4                     +30          +120
Total                     N = 60                             Σfd = 60

$$\bar{X} = A + \frac{\sum fd}{N} = 40 + \frac{60}{60} = 41$$
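Both methods of Example 4.3 can be compared in Python. The function names direct_mean and shortcut_mean are illustrative, and the second assumed mean (50) is an extra value chosen here to show that the result does not depend on the choice of A.

```python
def direct_mean(values, frequencies):
    """Direct method: sum of f * X divided by N."""
    return sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)


def shortcut_mean(values, frequencies, assumed_mean):
    """Short-cut method: A + (sum of f * d) / N, where d = X - A."""
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(values, frequencies))
    return assumed_mean + sum_fd / sum(frequencies)


# Data of Example 4.3
weights = [20, 30, 40, 50, 60, 70]
students = [8, 12, 20, 10, 6, 4]

print(direct_mean(weights, students))        # direct method
print(shortcut_mean(weights, students, 40))  # assumed mean A = 40, as in the example
print(shortcut_mean(weights, students, 50))  # a different assumed mean (illustrative)
```

All three calls give 41, because the deviations from any assumed mean are corrected by adding A back at the end.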

Weighted Mean: It is a special case of the arithmetic mean. It is the mean of data values that
have been weighted according to their relative importance. The term weight stands for the
relative importance of the different items.

The formula for computing the weighted arithmetic mean of a population or a sample is

$$\mu_w \text{ or } \bar{X}_w = \frac{\sum w_i x_i}{\sum w_i}$$

where $\mu_w$ = population weighted mean
$\bar{X}_w$ = sample weighted mean
$w_i$ = weight assigned to the i-th data value
$x_i$ = the i-th data value

In the case of frequency distributions, the weighted arithmetic mean is:

$$\bar{X}_w = \frac{\sum w_i f_i X_i}{\sum w_i f_i}$$

Example 4.4: An NGO covers the monthly expenses of adults, females and children under different
packages. Under these packages, adults receive 300 Birr, females 250 Birr and children 200 Birr
every month. The numbers of adults, females and children are 10, 15 and 20 respectively. What is
the average monthly expense covered by the NGO?

Solution:

The average monthly expense covered is the weighted mean, with the numbers of recipients as
weights:

$$\bar{X}_w = \frac{10(300) + 15(250) + 20(200)}{45} = \frac{10{,}750}{45} = 238.89$$

Therefore, the average monthly expense covered by the NGO is 238.89 Birr per person.
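A weighted mean is easy to evaluate in Python. The sketch below uses the amounts and group sizes as stated in Example 4.4 (300, 250 and 200 Birr for 10, 15 and 20 recipients); the function name weighted_mean is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w * x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Monthly amount per person (Birr) and number of recipients in each group,
# as stated in Example 4.4
amounts = [300, 250, 200]
recipients = [10, 15, 20]

average_expense = weighted_mean(amounts, recipients)
print(round(average_expense, 2))
```

The numerator is 3,000 + 3,750 + 4,000 = 10,750 and the sum of the weights is 45, so the weighted mean is about 238.89 Birr. The simple (unweighted) mean of the three amounts, 250 Birr, would overstate the average because the largest group receives the smallest amount.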

Discuss the weighted arithmetic mean and its application.

Properties of Arithmetic mean

i. The sum of the deviations of each value from the mean is always zero (taking signs into
account), i.e., $\sum (X - \bar{X}) = 0$.
ii. The sum of the squared deviations of the items from the arithmetic mean is a minimum; that
is, it is less than the sum of the squared deviations of the items from any other value,
i.e., $\sum (X - \bar{X})^2$ is the minimum.
iii. If $\bar{X}_1$ and $\bar{X}_2$ are the arithmetic means of two groups with $N_1$ and $N_2$
observations respectively, then the combined arithmetic mean is

$$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$$

where $\bar{X}_{12}$ is the combined arithmetic mean of groups 1 and 2.

iv. The arithmetic mean is affected by both change of origin and change of scale. That is, if
we add or subtract a constant c to or from all data values, the new mean is the old mean
plus or minus c (change of origin). If we multiply all data values by a constant c, the new
mean is c times the old one (change of scale).
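The four properties can be verified numerically. The sketch below reuses the six salaries of Example 4.1; the split into two groups and the constants 5 and 3 are arbitrary choices made for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def sum_sq_about(values, point):
    """Sum of squared deviations of the values about a given point."""
    return sum((x - point) ** 2 for x in values)


data = [12, 13, 15, 19, 13, 18]  # salaries from Example 4.1
m = mean(data)

# Property i: deviations from the mean sum to zero
deviation_sum = sum(x - m for x in data)

# Property ii: squared deviations are smallest about the mean
about_mean = sum_sq_about(data, m)
about_14 = sum_sq_about(data, 14)
about_16 = sum_sq_about(data, 16)

# Property iii: combined mean of two groups (arbitrary split of the data)
group1, group2 = data[:2], data[2:]
combined = (len(group1) * mean(group1) + len(group2) * mean(group2)) / len(data)

# Property iv: change of origin (add 5) and change of scale (multiply by 3)
shifted_mean = mean([x + 5 for x in data])
scaled_mean = mean([3 * x for x in data])

print(m, deviation_sum, about_mean, about_14, about_16, combined, shifted_mean, scaled_mean)
```

For these data the mean is 15, the squared deviations sum to 42 about the mean but 48 about either 14 or 16, the combined mean of the two groups is again 15, and the shifted and scaled means are 20 and 45.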

Discuss the main properties of the arithmetic mean.

Merits and demerits of arithmetic mean

Merits of Arithmetic mean

- It is easy to calculate and understand
- It is based on all observations
- It is capable of further algebraic treatment
- It is least affected by fluctuations of sampling

Demerits of Arithmetic mean

- It cannot be accurately determined if even one of the values is not known
- It cannot be located graphically
- It is affected very much by extremely large or small data values, called outliers, and may
not be the appropriate average to use in such situations
- It cannot be determined for open-ended data
II. Geometric mean

The geometric mean (GM) of n positive numbers is defined as the nth root of their product. The
geometric mean is useful in finding the average of percentages, indexes, ratios, growth rates and
logarithmically distributed series. It has wide application in business and economics because we
are often interested in finding the percentage changes in sales, revenues, profits, GDP, etc.

The formula to calculate the geometric mean for ungrouped data is:

$$GM = \sqrt[n]{X_1 X_2 X_3 \cdots X_n} = \sqrt[n]{\prod_{i=1}^{n} X_i}$$

where $X_1, X_2, X_3, \ldots, X_n$ are the items of the series and $\prod$ denotes multiplication.

If n is three or more, extracting the nth root of the product by hand is difficult. To facilitate
the computation of the GM, logarithms are used:

$$\log GM = \frac{\log X_1 + \log X_2 + \cdots + \log X_n}{n} = \frac{\sum \log X}{n}$$

Therefore $GM = \text{antilog}\left[\frac{\sum \log X}{n}\right]$.

For grouped data the geometric mean is calculated as:

$$GM = \sqrt[N]{x_1^{f_1} \cdot x_2^{f_2} \cdots x_k^{f_k}}$$

or $GM = \text{antilog}\left[\frac{\sum f_i \log x_i}{N}\right]$

where $x_1, x_2, \ldots, x_k$ are the class marks, $f_1, f_2, \ldots, f_k$ are the corresponding
class frequencies and $N = \sum f_i$ is the total number of observations.

Example 4.5: Calculate the geometric mean of 2, 4 and 8.

Solution: $GM = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$

Example 4.6: The returns on investment earned by a company for 5 consecutive years are 25%, 40%,
30%, 120% and −85%. Calculate the geometric rate of return earned by the company on its
investment.
Solution:
A positive percentage return is a gain on top of what the company already has, so each return is
first expressed as a growth factor: 25%, 40%, 30%, 120% and −85% become 1.25 (1 + 0.25),
1.4 (1 + 0.4), 1.3 (1 + 0.3), 2.2 (1 + 1.2) and 0.15 (1 − 0.85).

$$GM = \sqrt[5]{1.25 \times 1.4 \times 1.3 \times 2.2 \times 0.15} = 0.944$$

Since the mean growth factor is below 1, the geometric mean rate of return is 0.944 − 1 = −0.056,
i.e., an average loss of about 5.6% per year.
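The logarithm route described earlier can be used to evaluate this example in Python. The function name geometric_mean below is an illustrative choice for this sketch; it takes the mean of the natural logarithms and then the antilog (exponential).

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms; every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Yearly returns from Example 4.6, converted to growth factors (1 + return)
returns = [0.25, 0.40, 0.30, 1.20, -0.85]
growth_factors = [1 + r for r in returns]

gm_factor = geometric_mean(growth_factors)
average_return = gm_factor - 1

print(round(gm_factor, 3), round(average_return, 3))
```

The mean growth factor comes out at about 0.944, so the average yearly return is about −0.056. The same function gives 4 (up to floating-point rounding) for the values 2, 4 and 8.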

Geometric mean is most frequently used in the determination of the average percent of change,
for example the average percentage increase in sales, production or other business or economic
series over a period of time.

Example 4.7: If a person receives a 20% rise in his initial income after one year of service and a
10% rise after the second year of service, what is the average percentage increase?

Solution:
The average percentage raise is not 15%, i.e., (20% + 10%)/2, but 14.89%, as shown below.

$$GM = \sqrt{1.2 \times 1.1} = 1.1489 \quad \text{or} \quad GM = \sqrt{120 \times 110} = 114.89\%$$

Let us check this answer by assuming that the person earns Birr 10,000 at the beginning and
receives two raises: 20% and 10%.

Raise 1 = Birr 10,000 × 20% = Birr 2,000; Raise 2 = Birr 12,000 × 10% = Birr 1,200.

The total salary increase is Birr 3,200. This total is equivalent to two raises of 14.89% each:

Raise 1 = Birr 10,000 × 14.89% = Birr 1,489

Raise 2 = Birr 11,489 × 14.89% = Birr 1,710.71

Total increase = Birr 1,489 + 1,710.71 = Birr 3,199.71 (equal to Birr 3,200 up to rounding of
the rate)

But if the arithmetic mean were applied:

Raise 1 = Birr 10,000 × 15% = Birr 1,500; Raise 2 = Birr 11,500 × 15% = Birr 1,725

Total increase = Birr 1,500 + 1,725 = Birr 3,225 (higher than the actual increase of Birr 3,200).

The above result clearly shows that to find the correct average percentage change we should
apply the geometric mean instead of the arithmetic mean.
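The comparison in Example 4.7 can be replayed in Python without rounding the rate. The starting salary of Birr 10,000 is the one assumed in the example; the variable names are illustrative.

```python
import math

start_salary = 10_000
raises = [0.20, 0.10]

# Actual salary after the two raises
actual = start_salary
for r in raises:
    actual *= 1 + r

# Average rates: geometric mean of the growth factors versus arithmetic mean
gm_rate = math.sqrt((1 + raises[0]) * (1 + raises[1])) - 1
am_rate = sum(raises) / len(raises)

# Salary after applying each average rate twice
salary_with_gm = start_salary * (1 + gm_rate) ** 2
salary_with_am = start_salary * (1 + am_rate) ** 2

print(round(gm_rate * 100, 2), round(actual), round(salary_with_gm), round(salary_with_am))
```

Applied twice, the geometric mean rate reproduces the actual final salary of Birr 13,200, while the arithmetic mean rate gives Birr 13,225. The small gap of Birr 0.29 in the hand calculation comes only from rounding the rate to 14.89%.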

Example 4.8: Find the geometric mean for the following grouped data on the percentage
increase in salary of 16 employees of a company.

% increase in salary   Number of employees   Class mark
0-4                    5                     2
5-9                    6                     7
10-14                  3                     12
15-19                  2                     17

Solution: $GM = \sqrt[16]{2^5 \times 7^6 \times 12^3 \times 17^2} = 5.85$. The geometric mean
percentage increase in salary is 5.85%.

Another use of the geometric mean is to determine the average percent increase in sales,
production or other business or economic series from one time period to another:

$$GM = \sqrt[n]{\frac{\text{value at the end of the period}}{\text{value at the beginning of the period}}} - 1$$

where n = number of time periods.

Example: The population of a country increased from 84 million in 2005 to 108 million in 2015.
Find the annual rate of growth of the population.

Solution: Here n = 10 years, so

$$GM = \sqrt[10]{\frac{108{,}000{,}000}{84{,}000{,}000}} - 1 = 1.0254 - 1 = 0.0254 \approx 2.5\%$$
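The growth-rate formula translates directly into Python. The function name average_growth_rate is an illustrative choice for this sketch.

```python
def average_growth_rate(begin_value, end_value, periods):
    """Average rate of change per period: (end / begin) ** (1 / n) - 1."""
    return (end_value / begin_value) ** (1 / periods) - 1


# Population example: 84 million in 2005 to 108 million in 2015 (n = 10 years)
rate = average_growth_rate(84_000_000, 108_000_000, 10)
print(round(rate * 100, 2))

# Check: growing 84 million at this rate for 10 years returns 108 million
projected = 84_000_000 * (1 + rate) ** 10
print(round(projected))
```

The rate is about 0.0254 per year, i.e., roughly 2.5%, and compounding it over ten years recovers the 2015 population.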


The production of sugar for a sugar factory increased from 500,000 tons in 1990 to
950,000 tons in 2006. Compute the rate of production increase per year.

Properties of a geometric mean

i. The product of the values of a series remains unchanged when the geometric mean is
substituted for each individual value. For example, the geometric mean of the series
1, 3, 9 is 3; therefore we have

1 × 3 × 9 = 27 = 3 × 3 × 3

ii. The sums of the deviations of the logarithms of the original observations above and below
the logarithm of the geometric mean are equal. Equivalently, the ratio deviations of the
observations from the geometric mean balance. Thus, using the previous example,

$$\frac{3}{1} \times \frac{3}{3} = 3 = \frac{9}{3}$$

Merits and Demerits of geometric mean

Merits of geometric mean

- It is based on all observations
- It is capable of further algebraic treatment
- It is rigidly defined

Demerits of geometric mean

- It is not easy to compute
- If the value of one observation is zero, the geometric mean becomes zero
- It may not be defined if a single observation is negative
III. Harmonic Mean (H.M.)

The harmonic mean of n numbers $x_1, x_2, \ldots, x_n$ is defined as n divided by the sum of the
reciprocals of the n numbers. It is appropriate when the average of rates is desired; for
example, it gives the average speed of a trip over a route divided into equal-distance segments,
each travelled at a constant speed. If one travels half-way to a destination at 20 mi/hr and then
covers the second half of the distance at 60 mi/hr, the average speed is
2/(1/20 + 1/60) = 30 mi/hr.

$$H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum_{i} \frac{1}{x_i}} \quad \text{(for ungrouped data)}$$

Example 4.9: 1. The H.M. of 6 and 4 is $\frac{2}{\frac{1}{6} + \frac{1}{4}} = \frac{24}{5} = 4.8$

2. The H.M. of the first six natural numbers is
$\frac{6}{1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6}} = \frac{360}{147} \approx 2.45$

$$H.M. = \frac{n}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_k}{x_k}} = \frac{n}{\sum_{i} \frac{f_i}{x_i}}, \quad n = \sum_{i} f_i \quad \text{(for grouped data)}$$

Relationship between Arithmetic mean, Geometric Mean and Harmonic Mean

For a set of data containing n positive observations, the following relationship always
holds:

HM ≤ GM ≤ AM

However, HM = GM = AM if and only if all values in the data set are equal.
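The harmonic mean examples and the ordering HM ≤ GM ≤ AM can be checked in Python. The sketch below uses exact fractions for the harmonic mean so that the results match the hand calculations; the function name harmonic_mean and the test values 2, 4 and 8 are illustrative choices.

```python
import math
from fractions import Fraction


def harmonic_mean(values):
    """n divided by the sum of reciprocals, kept exact (integer inputs only)."""
    return len(values) / sum(Fraction(1, v) for v in values)


print(harmonic_mean([6, 4]))                # Example 4.9, part 1
print(harmonic_mean([1, 2, 3, 4, 5, 6]))    # Example 4.9, part 2
print(harmonic_mean([20, 60]))              # average speed over two equal distances

# Ordering of the three means for an arbitrary positive data set
values = [2, 4, 8]
hm = float(harmonic_mean(values))
gm = math.prod(values) ** (1 / len(values))
am = sum(values) / len(values)
print(hm <= gm <= am)
```

The exact results are 24/5, 120/49 (the reduced form of 360/147) and 30. For 2, 4 and 8 the three means are about 3.43, 4 and 4.67, in the stated order.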

4.2.2 The mode

The mode of a set of data is defined as the value with the highest frequency, provided that it
occurs more than once.
Mode for ungrouped data
The mode or modal value of raw data is obtained simply by locating the observation with the
maximum frequency (if such a value exists).
Example 4.10: The examination scores of ten students are: 81, 93, 84, 75, 68, 87, 81, 75, 81 and 87.
Because the score of 81 occurs three times, more often than any other score, it is the mode.
Note: A data set may have
- No mode at all, e.g. 1, 3, 9, 0, 7, 8
- One mode (unimodal), e.g. 1, 3, 1, 7, 1, 9; the mode is 1
- Two modes (bimodal), e.g. 7, 2, 4, 4, 7; the modes are 7 and 4
- Many modes (multimodal), e.g. 1, 0, 0, 1, 3, 2, 2, 3, 7, 7, 4, 9; the modes are 1, 0, 3, 2 and 7


Mode for grouped discrete data: In the case of discrete grouped data, the mode is determined
by inspection, i.e., by looking for the value(s) with the highest frequency.
Mode for grouped continuous data: In such cases, one can only determine the modal class easily:
the class with the highest frequency. After locating this class, the mode is interpolated using:

$$\text{Mode} = L_0 + \frac{f - f_1}{(f - f_1) + (f - f_2)} \times i = L_0 + \frac{f - f_1}{2f - f_1 - f_2} \times i$$

where:
$L_0$ = lower class boundary of the modal class (i.e., the class with the highest frequency)
f = frequency of the modal class
$f_1$ = frequency of the class immediately preceding the modal class
$f_2$ = frequency of the class immediately following the modal class
i = class interval (width)
Example 4.11: Calculate the modal age for the age distribution of 228 teachers.
Age (in years)   Number of teachers
15 – 19          6
20 – 24          19
25 – 29          50
30 – 34          57
35 – 39          48
40 – 44          27
45 – 49          21
Total            228

Solution: By inspection, the mode lies in the fourth class (30 – 34), where $L_0$ = 29.5,
f = 57, $f_1$ = 50, $f_2$ = 48 and i = 5.

$$\text{Mode} = L_0 + \frac{f - f_1}{2f - f_1 - f_2} \times i = 29.5 + \frac{57 - 50}{2(57) - 50 - 48} \times 5 = 29.5 + 2.19 = 31.69$$
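Both ways of finding a mode can be sketched in Python: counting for ungrouped data and the interpolation formula for grouped continuous data. The function names modes and grouped_mode are illustrative; the numbers come from Examples 4.10 and 4.11.

```python
from collections import Counter


def modes(values):
    """Values with the highest frequency; empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def grouped_mode(lower_boundary, f_modal, f_before, f_after, width):
    """Interpolated mode: L0 + (f - f1) / (2f - f1 - f2) * i."""
    return lower_boundary + (f_modal - f_before) / (2 * f_modal - f_before - f_after) * width


scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]   # Example 4.10
print(modes(scores))
print(modes([1, 3, 9, 0, 7, 8]))                    # no mode
print(modes([7, 2, 4, 4, 7]))                       # bimodal

# Example 4.11: modal class 30 - 34 with L0 = 29.5, f = 57, f1 = 50, f2 = 48, i = 5
modal_age = grouped_mode(29.5, 57, 50, 48, 5)
print(modal_age)
```

For Example 4.11 the interpolation term is (7/16) × 5 = 2.1875, so the function returns 31.6875, i.e., about 31.7 years.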
Properties of mode

i. It is the easiest average to compute.
ii. It can be obtained for both qualitative and quantitative data.
iii. It is not affected by extreme values.
iv. The mode may not exist for a data set.
v. It is not unique. A data set can have more than one mode.
vi. The mode is not based on all observations.
Merits and Demerits of Mode

Merits of Mode

- easy to calculate and understand
- can be computed when there is an open-ended class
- not affected by extreme values
- can be used for qualitative data
Demerits of the Mode
- not rigidly defined by a mathematical formula
- doesn’t consider all the items
- not suitable for further mathematical treatment
- is an unstable average
4.2.3 The median

It has been pointed out that whenever a frequency distribution has open-ended intervals, the
arithmetic mean cannot be calculated. The mean is also greatly affected by extremely large or
small values. In such cases the mean is not a good representative, and other measures are used to
describe the data. In this section, we discuss the most popular measure of position, the median,
and other related measures known as quantiles. Positional measures are so called because they
are determined by their positions in the ordered data.

The median is, as its name indicates, the middle value when the data are arranged in ascending or
descending order of magnitude; it divides the data into two equal parts. It is the value which
exceeds, and is exceeded by, an equal number (i.e., half) of the observations. We can consider
the following three cases in finding the median:


Case 1: Ungrouped data: In such cases, the median is the value of the middle term when the data
are arranged in order of magnitude.
When the number of observations n is odd, there is always a single value in the middle of the
arrayed data. When n is even, however, there are two middle observations, and the median is the
mean of these two values. Let $x_1, x_2, \ldots, x_n$ be n ordered observations. Then the median
is

a) the $\left(\frac{n+1}{2}\right)^{th}$ value, if n is odd, and

b) the mean of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2} + 1\right)^{th}$ values, if n is even.

Example 4.12: 1. Find the median of the following data: 5, 2, 7, 1, 9, 10, 12.
Solution: The arrayed data are 1, 2, 5, 7, 9, 10, 12. Since n = 7 is odd, the median is the
(7 + 1)/2 = 4th item in the array, i.e., median = 7.
2. Find the median of the population figures (in thousands) of 10 cities: 2000, 1180, 1785,
1500, 560, 782, 1200, 385, 1123, 222.
Solution: The arrayed data are 222, 385, 560, 782, 1123, 1180, 1200, 1500, 1785, 2000. Since
n = 10 is even, the median is the mean of the 5th and 6th values, i.e.,

$$\text{Median} = \frac{1123 + 1180}{2} = 1151.5 \text{ thousand}$$
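The rule for ungrouped data can be written as a short Python function. The name median_ungrouped is an illustrative choice; the data are those of Example 4.12.

```python
def median_ungrouped(values):
    """Middle value of the arrayed data; mean of the two middle values if n is even."""
    ordered = sorted(values)          # the data must be arrayed first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # the ((n + 1) / 2)th value
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_ungrouped([5, 2, 7, 1, 9, 10, 12]))
print(median_ungrouped([2000, 1180, 1785, 1500, 560, 782, 1200, 385, 1123, 222]))
```

The two calls return 7 and 1151.5, matching the hand calculations. Python lists are indexed from 0, so the ((n + 1)/2)th value sits at index n // 2.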

Case 2: Ungrouped discrete series data: In this case also, the median is obtained by the same
formula. Only one more step is added, finding the less than cumulative frequencies, because a
cumulative frequency distribution is itself an arrangement of values in order:

- Find the less than cumulative frequencies.

- Find the first cumulative frequency that is equal to or next higher than $\frac{n+1}{2}$ when
n is odd; the corresponding value is the median. When n is even, the median is the average of
the two middle values, located in the same way.

Example 4.13: Find the median of the data given below.
$x_i$   3   5   6   8   10
$f_i$   4   4   7   9   5

Solution: First construct the cumulative frequency distribution:

$x_i$   $f_i$   Cum. freq. (<)
3       4       4
5       4       8
6       7       15
8       9       24
10      5       29

Since n = 29, the median is the $\left(\frac{29+1}{2}\right)^{th}$ = 15th value. The first
cumulative frequency that reaches 15 belongs to $x_i$ = 6, so the median is 6.
Case 3: Grouped continuous distribution
For continuous grouped data, the exact median cannot be obtained unless the original data have
been retained. Hence, the median has to be interpolated (or estimated) from the median class. An
interpolation formula, based on the assumption that the observations are uniformly distributed
within each class, is:

$$\tilde{x} = \text{Median} = L + \frac{w}{f_{med}} \left(\frac{n}{2} - CF\right)$$

where: L = the lower class boundary of the median class;
w = the class width of the median class;
$f_{med}$ = the frequency of the median class; and
CF = the cumulative frequency of the class preceding the median class, that is, the sum of the
frequencies of all classes lower than the median class.
The median class is the class which contains the $\left(\frac{n}{2}\right)^{th}$ observation,
whether n is odd or even, since the items have already lost their identity once they are grouped
into continuous classes.

Example 4.14: For the following distribution, find the median.

Grade     Frequency
40 – 49   5
50 – 59   18
60 – 69   27
70 – 79   15
80 – 89   6
Solution: Construct the less than cumulative frequency distribution as follows:

Class boundaries   Frequency ($f_i$)   CF (<)
39.5 – 49.5        5                   5
49.5 – 59.5        18                  23
59.5 – 69.5        27                  50
69.5 – 79.5        15                  65
79.5 – 89.5        6                   71

Since n = 71, n/2 = 35.5, and the smallest CF greater than or equal to 35.5 is 50; thus, the
median class is the third class. For this class, L = 59.5, w = 10, $f_{med}$ = 27 and CF = 23.
Applying the formula, we get:

$$\text{Median} = 59.5 + \frac{10}{27}(35.5 - 23) = 64.13$$
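The interpolation can be automated by scanning the cumulative frequencies for the median class. The Python sketch below assumes equal class widths, as in Example 4.14; the function name grouped_median is an illustrative choice.

```python
def grouped_median(lower_boundaries, width, frequencies):
    """Median = L + (w / f_med) * (n / 2 - CF), with classes of equal width."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # CF: cumulative frequency of the classes before the current one
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= half:
            # This is the median class: interpolate inside it
            return lower + (width / f) * (half - cumulative)
        cumulative += f
    raise ValueError("frequencies must not be empty")


# Example 4.14: lower class boundaries, class width and frequencies
lower_boundaries = [39.5, 49.5, 59.5, 69.5, 79.5]
frequencies = [5, 18, 27, 15, 6]

median_grade = grouped_median(lower_boundaries, 10, frequencies)
print(round(median_grade, 2))
```

The scan stops at the third class (CF = 23, f = 27), and the result rounds to 64.13.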

Properties of Median

i. The data must be arrayed before the median is calculated.
ii. There is a unique median for each data set.
iii. Geometrically, the median divides the histogram or cumulative frequency curve into two
parts with equal area.
iv. The median remains unaffected by the magnitude of the extreme values.
v. It can be calculated for an open-ended frequency distribution if the median class doesn't
lie in an open-ended class.

Merits and demerits of median

Merits of the median
- easy to calculate and understand
- rigidly defined by a mathematical formula
- can be computed when there is an open-ended class
- not affected by extreme values
Demerits of median
- is based only on the middle item
- is relatively less stable than the mean
- doesn’t consider all the items
- not suitable for further mathematical treatment
Positional averages (Quartiles, deciles and percentiles)

Positional averages are descriptive measures that describe the position (place) of a value in a
given data set or distribution.

Measures which divide the data into a number of equal parts are called quantiles (fractiles).
The most important of these are quartiles, deciles and percentiles.

Quartiles: are the three values which divide the given data into four equal parts. They are
denoted by Q1, Q2 and Q3.

Q1 - The lower or first quartile. It covers 25% of the distribution.

Q2 - The middle or second quartile. It covers 50% of the distribution.

Q3 - The upper or third quartile. It covers 75% of the distribution.

Deciles: are the nine values which divide the series into ten equal parts. They are denoted by
D1, D2, … , D9.
D1 = Covers 10% of the distribution
D2 = Covers 20% of the distribution
.
.
D9 = Covers 90% of the distribution
Percentiles: are the 99 values which divide the series into 100 equal parts. They are denoted by
P1, P2, … , P99.

Note that:
i. Q1 = P25, Q2 = D5 = P50 = median, Q3 = P75
ii. D1 = P10, D2 = P20, D3 = P30, … , D9 = P90.

Computation of Quartiles, Deciles and Percentiles for Ungrouped and Grouped Data

For ungrouped data and discrete series:

First, for ungrouped data, rearrange the values in the order of magnitude and for discrete series,
compute the <Cfi column. Then apply the following formula.

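$$Q_i = \text{value of the } \left[\frac{i(N+1)}{4}\right]^{th} \text{ item}, \quad i = 1, 2, 3$$

$$D_i = \text{value of the } \left[\frac{i(N+1)}{10}\right]^{th} \text{ item}, \quad i = 1, 2, \dots, 9$$

$$P_i = \text{value of the } \left[\frac{i(N+1)}{100}\right]^{th} \text{ item}, \quad i = 1, 2, \dots, 99$$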

For continuous series:

- Compute the less than cumulative frequency column.


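- Determine the quartile, decile or percentile class.
- Apply the following interpolation formula.

$$Q_i = l + \frac{c}{f}\left(\frac{iN}{4} - c.f\right)$$

$$D_i = l + \frac{c}{f}\left(\frac{iN}{10} - c.f\right)$$

$$P_i = l + \frac{c}{f}\left(\frac{iN}{100} - c.f\right)$$

Where l = lower boundary of the quartile (decile or percentile) class, c = width of that class, f = frequency of that class and c.f = cumulative frequency of the class preceding it.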

Example 4.15: For the data given below, compute the value of Quartiles, D3, D7, P15 and P88
and interpret.

Marks             Below 10   10 – 20   20 – 40   40 – 60   60 – 80   Above 80
No. of students   10         15        25        30        14        6
<cf_i             10         25        50        80        94        100

Solution:

Q1 = size of the (N/4)th item = 25th item, so the quartile class is 10 – 20.

l = 10, c = 10, f = 15, c.f = 10

$$Q_1 = l + \frac{c}{f}\left(\frac{N}{4} - c.f\right) = 10 + \frac{10}{15}(25 - 10) = 20$$

25% of the students scored below 20.

Q2 = size of the (2N/4)th item = 50th item, so the quartile class is 20 – 40.

l = 20, c = 20, f = 25, c.f = 25

$$Q_2 = l + \frac{c}{f}\left(\frac{2N}{4} - c.f\right) = 20 + \frac{20}{25}(50 - 25) = 40$$

Half of the students scored below 40.

Q3 = size of the (3N/4)th item = 75th item, so the quartile class is 40 – 60.

l = 40, c = 20, f = 30, c.f = 50

$$Q_3 = l + \frac{c}{f}\left(\frac{3N}{4} - c.f\right) = 40 + \frac{20}{30}(75 - 50) = 56.67$$

Three quarters (75%) of the students scored below 56.67.
D3 = size of the (3N/10)th item = 30th item, so the decile class is 20 – 40.

l = 20, c = 20, f = 25, c.f = 25

$$D_3 = l + \frac{c}{f}\left(\frac{3N}{10} - c.f\right) = 20 + \frac{20}{25}(30 - 25) = 24$$

30% of the students scored below 24.

D7 = size of the (7N/10)th item = 70th item, so the decile class is 40 – 60.

l = 40, c = 20, f = 30, c.f = 50

$$D_7 = l + \frac{c}{f}\left(\frac{7N}{10} - c.f\right) = 40 + \frac{20}{30}(70 - 50) = 53.33$$

70% of the students scored below 53.33.

P15 = size of the (15N/100)th item = 15th item, so the percentile class is 10 – 20.

l = 10, c = 10, f = 15, c.f = 10

$$P_{15} = l + \frac{c}{f}\left(\frac{15N}{100} - c.f\right) = 10 + \frac{10}{15}(15 - 10) = 13.3$$

15% of the students scored below 13.3.

P88 = size of the (88N/100)th item = 88th item, so the percentile class is 60 – 80.

l = 60, c = 20, f = 14, c.f = 80

$$P_{88} = l + \frac{c}{f}\left(\frac{88N}{100} - c.f\right) = 60 + \frac{20}{14}(88 - 80) = 71.43$$

88% of the students scored below 71.43.
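As a cross-check of Example 4.15, the interpolation formula for continuous series can be sketched in Python. The function `grouped_quantile` is a hypothetical helper written for this illustration. The open-ended classes "Below 10" and "Above 80" are closed at 0 and 100 purely as an assumption; these outer boundaries do not affect any of the quantiles computed here, because none of them falls in an open-ended class.

```python
def grouped_quantile(boundaries, freqs, i, parts):
    """Interpolated quantile l + (c / f) * (i * N / parts - c.f) for grouped data.

    boundaries[k] and boundaries[k + 1] are the lower and upper boundaries of class k.
    """
    n = sum(freqs)
    target = i * n / parts  # position of the required item, e.g. 3N/4 for Q3
    cum = 0  # cumulative frequency of the classes before the current one
    for k, f in enumerate(freqs):
        if f > 0 and cum + f >= target:
            lower = boundaries[k]
            width = boundaries[k + 1] - lower
            return lower + width * (target - cum) / f
        cum += f
    raise ValueError("quantile class not found")


# Assumed outer boundaries 0 and 100 for the open-ended classes.
boundaries = [0, 10, 20, 40, 60, 80, 100]
freqs = [10, 15, 25, 30, 14, 6]

q1 = grouped_quantile(boundaries, freqs, 1, 4)
q2 = grouped_quantile(boundaries, freqs, 2, 4)
q3 = grouped_quantile(boundaries, freqs, 3, 4)
d3 = grouped_quantile(boundaries, freqs, 3, 10)
d7 = grouped_quantile(boundaries, freqs, 7, 10)
p15 = grouped_quantile(boundaries, freqs, 15, 100)
p88 = grouped_quantile(boundaries, freqs, 88, 100)
print([round(v, 2) for v in (q1, q2, q3, d3, d7, p15, p88)])
```

Because D7 is the 70th item and Q3 the 75th item, Q3 cannot be smaller than D7; the sketch makes this ordering easy to verify.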

Chapter summary

Given any raw data involving either qualitative or quantitative variables, we look for one basic
feature: the central tendency of the observations. Several measures are available for this feature, and
each measure has its own advantages and drawbacks. Commonly, we use the mean, median and
mode to get some idea about the central tendency. Of these, the mean is
the most commonly used measure of central tendency, but the other two, namely the median and
mode, are no less important. The median is usually preferred when the data are ordinal (ranked) or
contain extreme values, while the mode can also be used for qualitative data.
A data set may have no mode, or it may have one, two or more than two modes.

Self-evaluation test

1. A student scored an A in Introduction to Statistics (3 credit hours), a C in Psychology (3
credit hours), a B in Microeconomics-I (4 credit hours) and a D in Macroeconomics (2
credit hours). Assuming A has 4 grade points, B has 3 grade points, C has 2 grade points
and D has 1 grade point, calculate the grade point average (GPA).
2. For the following grouped data, compute the arithmetic mean by applying the short-cut method.
   Marks             0-10   10-20   20-30   30-40   40-50   50-60
   No. of students   5      10      25      30      20      10
3. Calculate the geometric mean of the data given below.
   Marks       4-8   8-12   12-16   16-20   20-24   24-28   28-32   32-36   36-40
   Frequency   6     10     18      30      15      12      10      6       2

4. Compute the questions given below based on the following frequency distribution.

   Marks      Frequency
   15 – 24    3
   25 – 34    4
   35 – 44    f1
   45 – 54    15
   55 – 64    12
   65 – 74    f2
   75 – 84    2
   If the median is 47:

a. Find the missing frequencies.
b. Construct the frequency distribution by including class boundaries, class marks,
cumulative frequencies (less and more than) and relative frequencies.
c. Construct the appropriate graphical presentation.
d. Calculate the arithmetic mean, mode and median of the distribution.
5. Prove that if X and Y are two positive numbers (X≠Y), then AM ≥ GM ≥ HM.
6. The mean mark of 50 students in the Introduction to statistics course was 72. From the 50
students 35 are female and their mean mark is 82. Find the mean mark of the male students.


UNIT FIVE: MEASURE OF DISPERSION

5.1 Introduction

Dispersion is the scatter or variation of items from a measure of central tendency. The first
section covers the definition and objectives of measures of dispersion and the properties of a good
measure of dispersion. The second section examines the different measures of variation, i.e.,
range and relative range, quartile deviation and coefficient of quartile deviation, mean deviation and
coefficient of mean deviation, the variance, the standard deviation, the coefficient of variation
and the standard score.

5.2. Definition and objectives of measures of dispersion

'Dispersion' or 'variation' in statistics is the degree of spread of the individual items or values
around the central value of a given distribution. According to Bowley, it is “the measure of the
variation of the items.” The term dispersion indicates the extent to which the items in the data differ from
one another. In other words, it measures the lack of uniformity in the distribution.

The measures of dispersion are also called averages of the second order, since they measure the
average of the deviations taken from the central tendency of the distribution.

Example 5.1: Consider the following data on the expenditures of two groups of workers:

- Group A: ETB 6200 2000 1300 1300 1200 (the mean is ETB 2400)
- Group B: ETB 1600 1700 1300 4200 3200 (the mean is ETB 2400)

If we were given only the average expenditure of the two groups, without knowing the actual
expenditures, we would simply conclude that the two groups spend identical amounts. But the actual
observations show that there is more variation in group A.

Objectives of measures of variation

- To judge the reliability of a measure of central tendency
- To control variability itself
- To compare two or more groups of numbers in terms of their variability
- To make further statistical analysis

Properties of a good measure of dispersion

- It should be based on all observations.
- It should be easily calculated.
- It should be easily understandable.
- It should be affected as little as possible by sampling fluctuations.
- It should be capable of further statistical treatment.

Discuss the objectives of measure of dispersion.

5.3. Types of measures of dispersion

There are two types of measures of dispersion; Absolute and Relative measures of dispersion.
An absolute measure expresses the magnitude of dispersion in the same unit of measurement in
which the data are recorded. However, the relative measure (which is unit less) expresses
dispersion in percentages or ratios. It is a quotient obtained by dividing the absolute measure by
a quantity in respect to which the absolute dispersion has been computed.

Absolute measures of dispersion              Relative measures of dispersion
- Range                                      - Relative range
- Inter-quartile range, quartile deviation   - Coefficient of quartile deviation
- Mean deviation                             - Coefficient of mean deviation
- Standard deviation and variance            - Coefficient of variation
i. Range

Range is an absolute measure of dispersion. Absolute measures can be compared only if the unit of
measurement is homogeneous, i.e., if all the distributions are expressed in the same
unit such as birr, litres, meters, etc. If different distributions are expressed
in different units, their ranges cannot be effectively compared. To
ensure comparability, a relative measure of range, known as the coefficient of range, is used. This is
obtained by dividing the absolute range by the sum of the largest and smallest values in the
distribution.

Range is defined as the difference between the largest and the smallest observations in a given
set of raw data.

Distinguish the absolute measure of dispersion from the relative measure of dispersion.

Example 5.2: Find the ranges of the following two groups.

Group A: ETB 6200 2000 1300 1300 1200 (the mean is ETB 2400)

Group B: ETB 1600 1700 1300 4200 3200 (the mean is ETB 2400)
Solution:
For Group A:
The highest expenditure = 6200 birr
The lowest expenditure = 1200 birr
Range = highest value – lowest value = 6200 – 1200 = 5000 Birr
For Group B:
The highest expenditure = 4200
The lowest expenditure = 1300
Range = 4200 – 1300 = 2900 Birr
Therefore, in terms of expenditure more variation is observed in group A.

A large value of range shows lack of uniformity and consistency in the distribution. It
explains that the average is inadequate and not representative.

For discrete grouped data we use the same formula as given above, i.e., the difference between
the highest and lowest values.

Example 5.3: Compute the range of the following data.

Table: Results (out of 35%) of 20 students in an Introduction to Statistics test.

Result (x_i)      6   24   18   22   30   15
Frequency (f_i)   3    2    5    1    4    5

Solution:

Maximum value = 30 marks
Minimum value = 6 marks
Range = highest value – lowest value = 30 – 6 = 24 marks.

There are three methods to calculate the range for continuous grouped data:

- By taking the difference between the upper class limit of the last class and the lower
class limit of the first class.
- By taking the difference between the upper class boundary of the last class and the
lower class boundary of the first class.
- By taking the difference between the mid-points of the first and the last class. This
yields a result closer to the actual range, as it reduces the margin by which the range is in
error when computed using the first or the second method.

Example 5.4: Compute the range of the data given below which shows the score (out of 35%)
of 40 students in Econometrics test.

Score (35%)   Class boundaries   Number of students (f_i)
6-10          5.5-10.5           5
11-15         10.5-15.5          10
16-20         15.5-20.5          15
21-25         20.5-25.5          7
26-30         25.5-30.5          3

Solution

Range = UCBL – LCBF = 30.5 – 5.5 = 25 (using the class boundaries of the last and first classes), or
Range = UCLL – LCLF = 30 – 6 = 24 (using the class limits of the last and first classes), or
Range = 28 – 8 = 20 (using the mid-points)

Relative range or coefficient of range

The relative range or coefficient of range is defined as follows:

For raw data & discrete grouped data

$$\text{Coefficient of range} = \frac{\text{highest value} - \text{lowest value}}{\text{highest value} + \text{lowest value}} \times 100\%$$

For continuous grouped data

$$\text{Coefficient of range} = \frac{UCBL - LCBF}{UCBL + LCBF} \times 100\%$$

Where UCBL is the upper class boundary of the last class and
LCBF is the lower class boundary of the first class

Example 5.5: Find the coefficient of range (relative range) for the data given in the above table.

Solution:

UCBL = 30.5, LCBF = 5.5.

$$\text{Coefficient of range} = \frac{30.5 - 5.5}{30.5 + 5.5} \times 100\% = 69.4\%$$
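The range and its coefficient are simple enough to verify by hand, but a small Python sketch shows the two formulas side by side. The function name `coefficient_of_range` is an illustrative assumption; the inputs are the class boundaries of Example 5.4.

```python
def coefficient_of_range(high, low):
    """Relative range in percent: (high - low) / (high + low) * 100."""
    return (high - low) / (high + low) * 100


ucbl, lcbf = 30.5, 5.5  # upper boundary of the last class, lower boundary of the first class
value_range = ucbl - lcbf
coef_range = coefficient_of_range(ucbl, lcbf)
print(value_range, round(coef_range, 1))
```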

Note
- Range is as good a measure of dispersion as any other where the data consist of a few
observations.
- It is advantageous when one wants to know only the extent of the extreme dispersion
under “ordinary” conditions.
- It tells us nothing about the dispersion of the values which fall between the two
extremes.
- It is highly affected if the value of the two extremes changes.
ii. Inter- quartile range

The range, which takes into account only the two extreme values, is considered a crude measure of
dispersion. To overcome certain limitations of the range, another measure known as the 'inter-quartile
range' has been developed. The inter-quartile range is the absolute difference between the third
quartile (Q3) and the first quartile (Q1) of a given frequency distribution. In other words, it
includes only the middle 50% of the items in the distribution and ignores one quarter of the
observations at either end of the distribution. The inter-quartile range is calculated with the help
of the formula

Inter-quartile range (I.R) = Q3 - Q1

Where I.R = inter-quartile range, Q3 = the third quartile and Q1 = the first quartile.

For example, if Q3 and Q1 of a frequency distribution are 72 and 20 respectively, its inter-quartile
range is I.R = 72 – 20 = 52.

iii. Quartile deviation


Another measure of dispersion developed to overcome the limitations of the range is the
quartile deviation, or semi-inter-quartile range. It is defined as half of the
difference between the third (upper) quartile and the first (lower) quartile of the
frequency distribution. It studies the spread of the items on either side of the median
and ignores the 25% of the items at each extreme end of the distribution (50% in total). A high
quartile deviation means low uniformity, and a low quartile deviation means high uniformity. The
quartile deviation and its coefficient are calculated with the help of the following formulas.

$$Q.D. = \frac{Q_3 - Q_1}{2}, \qquad \text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\%$$
Example 5.6: The following data were collected on the results of 11 students in a final examination of
statistics. Find the quartile deviation and the coefficient of quartile deviation.
Result: 72 65 75 69 35 43 52 37 61 58 41
Solution: The data, when arranged in ascending order, read as follows:
35 37 41 43 52 58 61 65 69 72 75
Calculating the quartiles,

$$Q_3 = \text{value of the } 3\left(\frac{N+1}{4}\right)^{th} \text{ item}, \qquad Q_1 = \text{value of the } \left(\frac{N+1}{4}\right)^{th} \text{ item}$$

Here N = 11.

$$Q_3 = \text{value of the } 3\left(\frac{11+1}{4}\right)^{th} \text{ item} = \text{value of the } 3\left(\frac{12}{4}\right)^{th} \text{ item} = \text{value of the 9th item} = 69$$

$$Q_1 = \text{value of the } \left(\frac{11+1}{4}\right)^{th} \text{ item} = \text{value of the } \left(\frac{12}{4}\right)^{th} \text{ item} = \text{value of the 3rd item} = 41$$

Substituting 69 and 41 for the upper and lower quartiles respectively in the formula, we have:

$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{69 - 41}{2} = \frac{28}{2} = 14$$

$$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\% = \frac{69 - 41}{69 + 41} \times 100\% = \frac{28}{110} \times 100\% = 25.5\%$$

From the coefficient calculated above, it can be seen that there is a greater uniformity in the
distribution because the coefficient of quartile deviation is small.
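The steps of Example 5.6 can be sketched in Python as follows. The helper `quantile_item` is hypothetical and applies the i(N+1)/4 rule for ungrouped data. Linear interpolation between neighbouring items for fractional positions is an assumption added here; it is not needed for this example, where both positions are whole numbers.

```python
def quantile_item(values, i, parts):
    """Value of the i(N+1)/parts-th item (counting from 1) of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / parts
    if not 1 <= position <= n:
        raise ValueError("position falls outside the data")
    whole = int(position)
    frac = position - whole
    value = ordered[whole - 1]
    if frac:
        # Assumed convention: interpolate linearly towards the next item.
        value += frac * (ordered[whole] - value)
    return value


results = [72, 65, 75, 69, 35, 43, 52, 37, 61, 58, 41]
q1 = quantile_item(results, 1, 4)  # 3rd item
q3 = quantile_item(results, 3, 4)  # 9th item
quartile_deviation = (q3 - q1) / 2
coef_qd = (q3 - q1) / (q3 + q1) * 100
print(q1, q3, quartile_deviation, round(coef_qd, 1))
```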
Advantages of Quartile deviation
- It is easy to compute and understand
- It can be computed for open-ended classes given that Q3 & Q1 can be found.
- It is not affected by extreme values
Disadvantages of Quartile deviation
- It ignores the first 25% and the last 25% items.
- It is not capable of mathematical manipulations.
- Its value is very much affected by sampling fluctuations.
- It doesn’t show the scatter around the average, but only a distance on scale.
iv. Mean deviation

Mean deviation measures the average deviation (scatter) of a set of observations about a central
value (mean, median or mode). It is obtained by dividing the sum of the absolute deviations
taken from the average by the total number of observations. The deviations are taken in
absolute value, that is, by ignoring their algebraic signs, because if signs
are taken into consideration, the sum of the deviations from the mean is always zero, and the
positive and negative deviations from the median or mode largely cancel each other.


If the deviations are taken from the mean, it is called M.D. about the mean; if the deviations
are taken from the median, it is called M.D. about the median; and if the deviations are taken
from the mode, it is called M.D. about the mode.

$$\text{M.D. about the mean} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}, \qquad \text{M.D. about the median} = \frac{\sum_{i=1}^{n} |x_i - \text{median}|}{n}$$

$$\text{M.D. about the mode} = \frac{\sum_{i=1}^{n} |x_i - \text{mode}|}{n}$$

$$\text{The coefficient of M.D.} = \frac{M.D._{mean}}{\bar{x}} \times 100\% \quad \text{or} \quad \frac{M.D._{median}}{\text{median}} \times 100\% \quad \text{or} \quad \frac{M.D._{mode}}{\text{mode}} \times 100\%$$
Mean Deviation for Grouped data (discrete or continuous)
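For grouped data, the mean deviations about the mean, median, and mode are obtained,
respectively, as follows:

$$\frac{\sum_{i=1}^{m} f_i |X_i - \text{mean}|}{n}, \qquad \frac{\sum_{i=1}^{m} f_i |X_i - \text{median}|}{n}, \qquad \text{and} \qquad \frac{\sum_{i=1}^{m} f_i |X_i - \text{mode}|}{n}$$

Where m = number of classes, n = total frequency, and X_i = class mark of the i-th class.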
Example 5.7: Calculate the mean deviations from the mean and the median for the data given below.
Class Interval (C.I)   1-5   6-10   11-15   16-20
Frequency              4     1      2       3

Solution:
C.I     x_i   f_i   f_i|x_i − 10|   f_i|x_i − 10.5|
1-5     3     4     28              30
6-10    8     1     2               2.5
11-15   13    2     6               5
16-20   18    3     24              22.5
Total         10    60              60

$$\text{Mean} = \frac{3 \times 4 + 8 \times 1 + 13 \times 2 + 18 \times 3}{10} = 10, \quad \text{and}$$

$$\text{Median} = 5.5 + \frac{5}{1}(5 - 4) = 10.5$$

Therefore, the M.D. about the mean and the M.D. about the median are both 60/10 = 6. The corresponding coefficient of M.D. about the mean is (6/10) × 100% = 60%.
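A minimal Python sketch of the grouped mean deviation, using the class marks and frequencies of Example 5.7, is given below. The function name `grouped_mean_deviation` is an assumption made for illustration; the median value 10.5 is taken from the worked solution rather than recomputed.

```python
def grouped_mean_deviation(marks, freqs, centre):
    """Mean absolute deviation of grouped data about a chosen centre."""
    n = sum(freqs)
    return sum(f * abs(x - centre) for x, f in zip(marks, freqs)) / n


marks = [3, 8, 13, 18]  # class marks of 1-5, 6-10, 11-15, 16-20
freqs = [4, 1, 2, 3]

mean = sum(f * x for x, f in zip(marks, freqs)) / sum(freqs)
md_mean = grouped_mean_deviation(marks, freqs, mean)
md_median = grouped_mean_deviation(marks, freqs, 10.5)  # median from the worked solution
coef_md_mean = md_mean / mean * 100
print(mean, md_mean, md_median, coef_md_mean)
```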
Advantages of Mean Deviation

- It is easier to understand and compute than the standard deviation.


- It is not unduly influenced by large or small values.
- All values are used in its calculation.
Disadvantages of Mean Deviation
- It ignores the algebraic sign of the deviations.
- It is not suitable for further mathematical processing.
v. Standard deviation and variance

To overcome the limitation of ignoring the algebraic signs of the deviations, the concept of
standard deviation was introduced by Karl Pearson in 1893. Standard deviation may be defined
as the square root of the mean of the squared deviations measured from the arithmetic mean.
It is also called the “root-mean-square deviation.”
Symbolically, it is expressed by the small Greek letter σ (sigma). For calculating the standard
deviation, the deviations are always taken from the mean, because the sum of squared deviations
is a minimum when measured from the arithmetic mean. A high value of standard deviation denotes
greater variation and less uniformity, whereas a low value denotes less
variation and greater uniformity and indicates that the average is a good
representative of the data.

Variance and standard deviation are closely related to each other, since the variance is
the square of the standard deviation, or the standard deviation is the square root of the variance.

While the standard deviation and variance are the absolute measure, the relative measure is
known as coefficient of variation.

a. Population variance and S.D. (ungrouped data)

Suppose that x_1, x_2, ..., x_N are the values of the observations in a population of size N with mean
μ. Then, the population variance and standard deviation are defined by:

$$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad (\#)$$

$$\text{Standard deviation} = \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Example 5.8: Suppose that the ages of all patients in the recovery room of a certain Hospital
are: 26,30,38,40,36,20,45, and 37 years. Find the population variance and standard deviation
Solution: Given N = 8, x_1 = 26, x_2 = 30, ..., x_8 = 37.

$$\text{The population mean is } \mu = \frac{26 + 30 + \cdots + 37}{8} = \frac{272}{8} = 34.$$

Then, we construct the following table for the deviations and squared deviations from μ (the last
column is for totals).

Age (x_i)      26   30   38   40   36    20    45   37   Σx_i = 272
x_i − μ        −8   −4    4    6    2   −14    11    3   Σ(x_i − μ) = 0
(x_i − μ)²     64   16   16   36    4   196   121    9   Total = 462

Thus, using the above formulae, we have σ² = 462/8 = 57.75 and σ = √57.75 = 7.599.

Or we can apply the short-cut formula

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

Now, to solve the example above using this formula, we have Σx_i² = 9710 and Σx_i = 272, so that μ = 34,

$$\sigma^2 = \frac{1}{8}(9710) - 34^2 = 57.75 \quad \text{and} \quad \sigma = \sqrt{57.75} = 7.599.$$
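Both routes to the population variance in Example 5.8 (the definitional formula and the short-cut formula) can be compared in a few lines of Python. This is only a sketch; the variable names are illustrative, and `statistics.pvariance` from the Python standard library is used as an independent check.

```python
import math
import statistics

ages = [26, 30, 38, 40, 36, 20, 45, 37]
N = len(ages)
mu = sum(ages) / N

# Definitional formula: mean of the squared deviations from mu.
var_definition = sum((x - mu) ** 2 for x in ages) / N

# Short-cut formula: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in ages) / N - mu ** 2

sigma = math.sqrt(var_definition)
print(mu, var_definition, var_shortcut, round(sigma, 3))
print(statistics.pvariance(ages))  # library cross-check
```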
b. Population variance and S.D for grouped data

In grouped discrete or continuous data where x_1, x_2, ..., x_m have corresponding
frequencies f_1, f_2, ..., f_m, the population variance is redefined as:

$$\sigma^2 = \frac{\sum_{i=1}^{m} f_i (x_i - \mu)^2}{\sum_{i=1}^{m} f_i} = \frac{1}{N}\sum_{i=1}^{m} f_i (x_i - \mu)^2$$

Where N = Σf_i, m is the number of classes, and x_i denotes the class marks in the case of a continuous F.D.

Example 5.9: Find the variance and S.D for the population values:

x_i   2   3   5   6   8
f_i   3   4   4   5   4

Solution: First find the mean,

$$\mu = \frac{1}{N}\sum f_i x_i = \frac{100}{20} = 5,$$

and proceed as:

x_i   f_i   f_i x_i   x_i − μ   (x_i − μ)²   f_i (x_i − μ)²
2     3     6         −3        9            27
3     4     12        −2        4            16
5     4     20        0         0            0
6     5     30        1         1            5
8     4     32        3         9            36
Sum   20    100                              84

Here, N = Σf_i = 20; then, using the definitional formula, the variance is
σ² = 84/20 = 4.2 and σ = √4.2 = 2.05.


c. Variance and S.D. for samples
Recall that a parameter is a summary measure computed from the population data. In the above
cases, μ, σ and σ² are the parameters, whereas x̄, s and s² are the statistics which are used to
estimate the corresponding parameters. So far, the parameters σ² and σ have been discussed.
Naturally, one might expect that if we replace N by n and μ by x̄, the resulting formula

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

would be used to estimate σ². But, theoretically, it can be shown that
this underestimates σ² on average. If we instead use the divisor n − 1 rather than n, we obtain an
unbiased estimator of σ² (i.e., on average, the estimate equals σ²). Thus, the
sample variance for the values x_1, x_2, ..., x_n is given by:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Why do we use n-1?

In small samples, it provides a better estimate of the variance of the population from which the
sample is drawn. However, as n increases above about 30, we can use n instead of n-1, as the two
versions give approximately the same result for practical purposes.

Example 5.10: A sample of 5 students was taken from a class and their weights were found to be
48, 51, 52, 51, and 53kg. Find the variance and standard deviation.
Solution: The mean is x̄ = 255/5 = 51. Prepare the following table:

Weights (x_i)   x_i − x̄   (x_i − x̄)²
48              −3         9
51              0          0
52              1          1
51              0          0
53              2          4
Total           0          14

Thus, we have n = 5, and

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{5} (x_i - 51)^2}{5 - 1} = \frac{14}{4} = 3.5,$$

and s = √3.5 = 1.87 is the sample standard deviation.
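The effect of the divisor can be seen numerically with the data of Example 5.10. The sketch below, with illustrative variable names, computes the sample variance with divisor n − 1 and, for comparison, the value obtained with divisor n; `statistics.variance` and `statistics.stdev` from the Python standard library use the n − 1 divisor.

```python
import statistics

weights = [48, 51, 52, 51, 53]
n = len(weights)
mean = sum(weights) / n
sum_sq_dev = sum((x - mean) ** 2 for x in weights)

s2 = sum_sq_dev / (n - 1)    # sample variance (divisor n - 1)
s2_divisor_n = sum_sq_dev / n  # smaller value obtained with divisor n
s = s2 ** 0.5

print(mean, s2, s2_divisor_n, round(s, 2))
print(statistics.variance(weights), round(statistics.stdev(weights), 2))
```

With only five observations, the two divisors give noticeably different values (3.5 against 2.8); the gap shrinks as n grows.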
d. Sample variance and S.D for grouped data
If the values x_i have frequencies f_i (i = 1, 2, …, m), then the sample variance is given by:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i (x_i - \bar{x})^2, \qquad n = \sum_{i=1}^{m} f_i$$

The above definition of the sample variance also holds for a grouped continuous distribution, where
x_i = class mark of the i-th class.
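The grouped formula can be sketched in Python as follows, using the class marks and frequencies of Example 5.11 (classes 1-5, 6-10, 11-15, 16-20). The variable names are illustrative. The short-cut form (Σf x² − n x̄²)/(n − 1) is algebraically equivalent to the definitional form and is shown only as a cross-check.

```python
import math

marks = [3, 8, 13, 18]  # class marks
freqs = [4, 1, 2, 3]

n = sum(freqs)
mean = sum(f * x for x, f in zip(marks, freqs)) / n

# Definitional form: weighted squared deviations divided by n - 1.
s2_definition = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs)) / (n - 1)

# Short-cut form based on the sum of f * x^2.
sum_fx2 = sum(f * x * x for x, f in zip(marks, freqs))
s2_shortcut = (sum_fx2 - n * mean ** 2) / (n - 1)

s = math.sqrt(s2_definition)
print(n, mean, sum_fx2, round(s2_definition, 2), round(s, 2))
```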
Example 5.11: Find the sample variance and standard deviation for the distribution:
C.I 1-5 6-10 11-15 16-20

Freq. 4 1 2 3

Solution:
In a continuous F.D., xi is the class mark representing the ith class.

C.I     x_i   f_i   f_i x_i   f_i x_i²
1-5     3     4     12        36
6-10    8     1     8         64
11-15   13    2     26        338
16-20   18    3     54        972
Total         10    100       1410

Where n = Σf_i = 10, x̄ = Σf_i x_i / n = 100/10 = 10 and Σf_i x_i² = 1410, so that

$$s^2 = \frac{1}{9}\left(1410 - 10 \times 10^2\right) = \frac{410}{9} = 45.56, \quad \text{and} \quad s = \sqrt{45.56} = 6.75.$$

Important properties of Variance/Standard Deviation

- The variance/standard deviation of any constant is always zero.
- A standard deviation of zero implies that there is no variation at all in the data set. In
other words, the data values are all the same.
- A variance/standard deviation can never be a negative number.
- If a constant is added to or subtracted from each observation, the variance/standard
deviation of the resulting observations will not be affected.
- If every observation is multiplied by a constant K, then the new variance will be K² times
the original variance and the new standard deviation will be |K| times the original standard
deviation.
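The last two properties can be illustrated numerically. The sketch below reuses the weights of Example 5.10 as sample data; the constants `c = 10` and `k = -3` are arbitrary choices made for this illustration, and `statistics.pvariance` and `statistics.pstdev` are the population versions from the Python standard library.

```python
import math
import statistics

data = [48, 51, 52, 51, 53]
c, k = 10, -3  # arbitrary constants chosen for illustration

shifted = [x + c for x in data]
scaled = [k * x for x in data]

var_data = statistics.pvariance(data)
# Adding a constant leaves the variance unchanged.
print(var_data, statistics.pvariance(shifted))
# Multiplying by k multiplies the variance by k^2 and the S.D. by |k|.
print(statistics.pvariance(scaled), k ** 2 * var_data)
print(statistics.pstdev(scaled), abs(k) * statistics.pstdev(data))
```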
Advantages and disadvantages of standard deviation/variance
Advantages
- rigidly defined by mathematical formula
- is based on all the observations
- is useful for further mathematical treatments.
Disadvantages
- gives more weight to extreme values
- cannot be calculated if all the values are not given
- cannot be used when there is an open-ended class
Coefficient of variation


The relative measure of standard deviation is called the coefficient of variation. According to Karl
Pearson, who introduced this measure, it is “the percentage variation in the mean the standard
deviation being treated as the total variation in the mean.” It is used to compare the consistency,
variability or uniformity of two or more distributions. A lower value of the coefficient of
variation indicates higher consistency, more uniformity, and lower variability. Symbolically, the
coefficient of variation can be expressed as

$$C.V. = \frac{\sigma}{\mu} \times 100\% \quad \text{or} \quad C.V. = \frac{s}{\bar{x}} \times 100\%$$

Where C.V. = coefficient of variation
σ = population standard deviation, s = sample standard deviation
x̄ = sample mean
µ = population mean

Both the coefficient of standard deviation and the coefficient of variation are relative
measures of dispersion. The coefficient of variation is the coefficient of standard
deviation expressed as a percentage.
Example 5.12: Suppose typist A types out 30 pages per day on average with a S.D. of 6 and
typist B types out 45 pages per day on average with a S.D. of 10. Which typist has
shown greater consistency in her/his output?
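$$\text{Solution: C.V. (Typist A)} = \frac{s}{\bar{x}} \times 100\% = \frac{6}{30} \times 100\% = 20\%.$$

$$\text{C.V. (Typist B)} = \frac{s}{\bar{x}} \times 100\% = \frac{10}{45} \times 100\% = 22.2\%.$$

Since the C.V. of typist A is less than that of typist B, typist A is more consistent and shows greater
uniformity in his/her performance.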
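The comparison in Example 5.12 can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is an assumption made for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation in percent: (sd / mean) * 100."""
    return sd / mean * 100


cv_a = coefficient_of_variation(6, 30)   # typist A
cv_b = coefficient_of_variation(10, 45)  # typist B

# The smaller C.V. indicates the more consistent typist.
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```

Note that typist B has the larger standard deviation in absolute terms as well, but the C.V. is the appropriate basis for comparison because the two means differ.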

Explain what we can infer from the results of coefficient of variation.

Standard Scores (Z-Scores)

The standard score is one of the applications of the standard deviation; it is a relative measure that
indicates the position of an individual observation within its data set. Suppose that a student scored 66 in Statistics
and 80 in Mathematics. The question is, in which course did he score better as compared to his
classmates?

At first glance, it seems that he did much better in Mathematics. But suppose that all the students
in the class averaged 51 points in Statistics with a standard deviation of 12, and averaged 72 in
Mathematics with a standard deviation of 16.
Thus, one can argue that the student’s score in Statistics is (66 − 51)/12 = 1.25 standard deviations
above the average, while his score in Mathematics is only (80 − 72)/16 = 0.50 standard deviation
above the average for the class. Here, the grades have been converted into standard scores.
Whereas the original scores cannot be meaningfully compared, the standard scores expressed in
terms of standard deviations can be compared. Thus, the student scored much higher in Statistics
than in Mathematics compared to the rest of the class.
In general, we define the standard score as:

$$z = \frac{x - \bar{x}}{s} \quad \text{or} \quad Z = \frac{x - \mu}{\sigma}$$

The Z-score tells us how many standard deviations a value lies above (if positive) or below (if
negative) the mean of the set of data to which it belongs.
Example 5.13: If a set of measurements has the mean 48 with a S.D. of 12, convert each of the
following into standard units: a) 54; b) 72; c) 78.
Solution: a) For x = 54, Z = (54 − 48)/12 = 0.5; that is, 54 is 0.5 S.D. above the mean.

b) For x = 72, Z = (72 − 48)/12 = 2; that is, 72 is 2 S.D.'s above the mean.

c) Similarly, for x = 78, Z = (78 − 48)/12 = 2.5; that is, 78 is 2.5 standard deviations above the mean.
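The conversions above, and the Statistics versus Mathematics comparison, can be sketched in Python. The function name `z_score` is an illustrative assumption.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Student's scores compared with the class (Statistics vs. Mathematics).
z_statistics = z_score(66, 51, 12)
z_mathematics = z_score(80, 72, 16)

# Example 5.13: mean 48, S.D. 12.
z_values = [z_score(x, 48, 12) for x in (54, 72, 78)]
print(z_statistics, z_mathematics, z_values)
```

All three values in Example 5.13 are positive, so each observation lies above the mean.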

Chapter summary

Given any raw data, besides the central tendency we look for another basic
feature: the dispersion of the observations. Several measures are available for this feature, and each
measure has its own advantages and drawbacks. The range and relative range,
quartile deviation and coefficient of quartile deviation, mean deviation and coefficient of mean
deviation, variance, standard deviation and coefficient of variation are the most commonly used
measures of dispersion. Among different groups, the one with the smaller coefficient of variation
shows more consistency, or less variation, in its distribution.

Self evaluation test

1. The test grades for a number of students are: 76, 72, 100, 64, 72 and 90.
a. What are the standard deviation and variance of the test scores?
b. What is the coefficient of variation of the test scores?
2. Meteorologists interested in the consistency of temperatures in three cities during a given week
collected the following data. The temperatures for five days of the week in the three cities
were:
City 1   25   24   23   26   17
City 2   22   21   24   22   20
City 3   32   27   35   24   28
Which city has the most consistent temperature, based on these data?

3. Let n = 10, x̄ = 12 and Σx_i² = 1530 for a certain data set. Find the coefficient of variation.

4. Consider the following grouped frequency distribution and answer the questions
that follow.

Marks      Frequency
15 – 24    3
25 – 34    4
35 – 44    10
45 – 54    15
55 – 64    12
65 – 74    4
75 – 84    2
Total      50

a. Find the quartile deviation and coefficient of quartile deviation.
b. Compute the variance and standard deviation.
c. Find the coefficient of variation.

UNIT SIX: MOMENTS, SKEWNESS AND KURTOSIS

6.1 Introduction

This chapter consists of three sub sections. The first section explains the concept of moments and
the methods of measuring moments i.e., from mean, arbitrary number and the origin. The second
section discusses the concept of Skewness. Skewness refers to the lack of symmetry in a
frequency distribution. In addition to this we will find the four different ways of measuring
coefficient of Skewness; The Karl Pearson’s, The Bowley’s, Kelly’s and Moments measure of
coefficient of Skewness. The last section covers the concept kurtosis and the measures of
coefficient of kurtosis.

6.2 Moments

A moment is the mean of a given power of the deviations of the observations from a certain point. If this point
is the mean, the moments are called central moments and are denoted by μ (read
as ‘mu’). Thus μ_1 stands for the first moment about the mean, μ_2 stands for the second moment about the mean,
etc.
Central moment (Moments from the mean)
The r-th moment from the mean for ungrouped data is calculated as follows

$$\mu_r = \frac{\sum (X_i - \bar{X})^r}{n}$$

$$\mu_1 = \frac{\sum (X_i - \bar{X})}{n} = 0, \quad \text{since } \sum (X_i - \bar{X}) = 0; \text{ that is, } \mu_1 \text{ is always zero.}$$

$$\mu_2 = \frac{\sum (X_i - \bar{X})^2}{n} = (S.D.)^2 = \text{variance, i.e. } \mu_2 \text{ coincides with the variance of the distribution}$$

$$\mu_3 = \frac{\sum (X_i - \bar{X})^3}{n}$$

The r-th moment from the mean for a frequency distribution is calculated as follows

$$\mu_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$$

$$\mu_1 = \frac{\sum f_i (X_i - \bar{X})}{\sum f_i}, \qquad \mu_3 = \frac{\sum f_i (X_i - \bar{X})^3}{\sum f_i}$$

$$\mu_2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}, \qquad \mu_4 = \frac{\sum f_i (X_i - \bar{X})^4}{\sum f_i}$$

Where X_i is the class mark of the i-th class and f_i is the frequency of the i-th class

Moments about arbitrary origin


Moments about an arbitrary origin, also called 'raw moments', are denoted by $\mu'$ to distinguish
them from the moments about the mean, which are denoted by $\mu$. Thus $\mu'_1$ stands for the
first moment about an arbitrary point $A$, and so on. They are calculated as follows.

$$\mu'_1 = \frac{\sum (X_i - A)}{n} = \bar{X} - A, \qquad \mu'_2 = \frac{\sum (X_i - A)^2}{n}$$

$$\mu'_3 = \frac{\sum (X_i - A)^3}{n}, \qquad \mu'_4 = \frac{\sum (X_i - A)^4}{n}$$

For a frequency distribution,

$$\mu'_1 = \frac{\sum f_i (X_i - A)}{N} = \frac{\sum f_i d_i}{N} \quad \text{or} \quad \mu'_1 = \frac{\sum f_i d'_i}{N} \times cw$$

where $d_i = X_i - A$, $d'_i = \frac{X_i - A}{cw}$, $cw$ is the class width and $N = \sum f_i$.

$$\mu'_2 = \frac{\sum f_i (X_i - A)^2}{N}$$

and so on.

Moments from the origin (about zero)


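The moments about zero are often denoted by $\nu_1$, $\nu_2$, $\nu_3$, etc. and are obtained as follows:

$$\nu_1 = \frac{\sum f_i (X_i - 0)}{N} = \frac{\sum f_i X_i}{N}, \qquad \nu_2 = \frac{\sum f_i X_i^2}{N}$$

$$\nu_3 = \frac{\sum f_i X_i^3}{N}, \qquad \nu_4 = \frac{\sum f_i X_i^4}{N}$$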

The concept of moments is of great significance in statistical work. With the help of moments we
can measure the central tendency of a set of observations, their variability, their symmetry and
the peakedness of their frequency curve. Because moments make it so convenient to obtain
measures of the various characteristics of a frequency distribution, calculating the first
four moments about the mean may well be the first step in the analysis of a frequency
distribution.
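The three kinds of moments differ only in the reference point, so one helper can cover all of them. The sketch below is illustrative only: the function name `moment_about` and the small data set are assumptions, not part of the module.

```python
def moment_about(values, freqs, point, r):
    """Return the r-th moment of a frequency distribution about `point`."""
    n = sum(freqs)
    return sum(f * (x - point) ** r for x, f in zip(values, freqs)) / n


# Hypothetical data, chosen only for illustration.
values = [1, 2, 3]
freqs = [1, 2, 1]

mean = moment_about(values, freqs, 0, 1)    # first moment about zero is the mean
mu1 = moment_about(values, freqs, mean, 1)  # first central moment, always 0
mu2 = moment_about(values, freqs, mean, 2)  # second central moment, the variance
raw1 = moment_about(values, freqs, 1, 1)    # first moment about A = 1, equals mean - A
nu2 = moment_about(values, freqs, 0, 2)     # second moment about zero

print(mean, mu1, mu2, raw1, nu2)  # 2.0 0.0 0.5 1.0 4.5
```

Passing the mean as `point` gives central moments, passing any other value gives moments about an arbitrary origin, and passing 0 gives moments about zero.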

6.3 Skewness

Two or more distributions may have the same mean and the same standard deviation and still
differ in shape. Describing a distribution by its mean and standard deviation alone is therefore
not sufficient, and we need to introduce the concept of skewness.

The term skewness refers to the lack of symmetry: when a distribution is not
symmetrical it is called asymmetrical or skewed. We study skewness to get an idea of the
shape of the curve that can be drawn from the frequency distribution. Frequency
distributions are often skewed to one side of the central value, so that the curve has a longer
tail either to the left or to the right.

If there is a longer tail to the right of the center, the distribution is said to be positively skewed.
Positive skewness means a greater dispersal of individual observations towards the right of the
central value.
If the tail is longer to the left of the center, the distribution is said to be negatively skewed.
Negative skewness, on the other hand, implies that individual observations have greater dispersal
towards the left of the central value.
Skewness, therefore, not only refers to the lack of symmetry in a distribution; it also shows the
direction of dispersion of individual observations on either side of the center of the distribution.
Tests of Skewness

- If A.M = Median = Mode, then there is no skewness in the distribution. In other words,
the curve of the frequency distribution is symmetrical or bell-shaped.
- If A.M is less than the value of the mode, the tail of the asymmetrical distribution is on
the left side, i.e. the distribution is negatively skewed.
- If A.M is greater than the value of the mode, the tail is on the right side, i.e. the distribution
is positively skewed.


Measures of Skewness

There are four important measures of relative Skewness, namely,

A. Karl Pearson’s Coefficient of Skewness


B. Bowley’s Coefficient of Skewness
C. Kelly’s Coefficient of Skewness
D. Measure of Skewness based on moments

A. Karl Pearson’s coefficient of Skewness:

Karl Pearson's coefficient of skewness is given by

$$S_{KP} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$

where $S_{KP}$ is Karl Pearson's coefficient of skewness.

If the mode is ill-defined in a frequency distribution, the empirical relationship between the
mean, median and mode for a moderately skewed distribution,

Mode = 3 Median – 2 Mean

can be substituted into the formula to give

$$S_{KP} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

Example 6.1: Suppose the mean, the mode and the standard deviation of a certain distribution
are 32, 30.5 and 10 respectively. What is the shape of the curve representing the distribution?

Solution:

$$S_{KP} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{32 - 30.5}{10} = 0.15$$

Since $S_{KP} = 0.15 > 0$, the distribution is positively skewed.

Theoretically, the value of this coefficient lies between –3 and +3; in practice, however,
the value of $S_{KP}$ usually lies between –1 and 1. For a symmetrical distribution, its value
comes out to be zero.
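The two forms of Karl Pearson's coefficient can be sketched in a few lines. The function names and the second set of figures below are assumptions made for illustration; the first call reproduces Example 6.1.

```python
def pearson_skewness(mean, mode, std_dev):
    """Karl Pearson's coefficient: (Mean - Mode) / Standard Deviation."""
    return (mean - mode) / std_dev


def pearson_skewness_median(mean, median, std_dev):
    """Median-based form, for use when the mode is ill-defined."""
    return 3 * (mean - median) / std_dev


def describe(coefficient):
    """Translate the sign of a skewness coefficient into words."""
    if coefficient > 0:
        return "positively skewed"
    if coefficient < 0:
        return "negatively skewed"
    return "symmetrical"


sk = pearson_skewness(32, 30.5, 10)          # figures from Example 6.1
print(round(sk, 2), describe(sk))            # 0.15 positively skewed

# Hypothetical figures: mean 50, median 48, standard deviation 6.
print(pearson_skewness_median(50, 48, 6))    # 1.0
```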


B. Bowley’s Coefficient of Skewness

This measure is also called the quartile measure of skewness. It is especially useful for open-end
distributions and where extreme values are present.

$$S_{KB} = \frac{(Q_3 - \text{Median}) - (\text{Median} - Q_1)}{(Q_3 - \text{Median}) + (\text{Median} - Q_1)} = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$$

where $S_{KB}$ is Bowley's coefficient of skewness. The value of $S_{KB}$ lies between –1 and 1.

C. Kelly’s Coefficient of Skewness

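This method is a modification of Bowley's formula. Kelly suggested that a measure of skewness is
better if it covers more of the observations. Thus, instead of leaving out the observations beyond
the upper and lower quartiles, he proposed leaving out only those beyond the upper and lower
deciles, and put forward the following formula:

$$S_{KK} = \frac{P_{10} + P_{90} - 2\,\text{Median}}{P_{90} - P_{10}} \quad \text{or equivalently} \quad S_{KK} = \frac{D_1 + D_9 - 2\,\text{Median}}{D_9 - D_1}$$

where $S_{KK}$ is Kelly's coefficient of skewness, $P_{10}$ and $P_{90}$ are the 10th and 90th
percentiles, and $D_1$ and $D_9$ are the first and ninth deciles.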
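Bowley's and Kelly's coefficients have the same structure and differ only in which positional values they use. A minimal sketch, assuming the quartiles, deciles and median have already been computed; the function names and the numbers are hypothetical.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2*Median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def kelly_skewness(p10, median, p90):
    """Kelly's coefficient: (P10 + P90 - 2*Median) / (P90 - P10)."""
    return (p10 + p90 - 2 * median) / (p90 - p10)


# Hypothetical positional values for one distribution.
print(round(bowley_skewness(q1=20, median=30, q3=50), 3))   # 0.333
print(kelly_skewness(p10=12, median=30, p90=60))            # 0.25
```

Both results are positive here because the median lies closer to the lower value than to the upper one, which indicates a longer right tail.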

D. Measure of Skewness based on moments.

A measure of skewness may be obtained by making use of the third moment about the mean. The
coefficient $\beta_1$ measures skewness using the second and third central moments:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$

Since $\mu_3^2$ and $\mu_2^3$ are never negative, $\beta_1$ is never negative either, so as a measure
of skewness it cannot tell us the direction of skewness. This drawback can be removed by
calculating Karl Pearson's ratio $\gamma_1$, which is defined as

$$\gamma_1 = \pm\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3}$$

where $\sigma = \sqrt{\mu_2}$ is the standard deviation. The sign of $\gamma_1$ depends on the sign of
$\mu_3$: if $\mu_3$ is positive we have positive skewness, and if $\mu_3$ is negative we have negative
skewness. It is therefore advisable to use $\gamma_1$ as the measure of skewness.

The shape of the curve is determined by the value of $\gamma_1$:

If $\gamma_1 > 0$, the distribution is positively skewed.

If $\gamma_1 = 0$, the distribution is symmetric.

If $\gamma_1 < 0$, the distribution is negatively skewed.

- In a positively skewed distribution, smaller observations are more frequent than
larger observations, i.e. the majority of the observations have a value below the average.

- In a negatively skewed distribution, smaller observations are less frequent than larger
observations, i.e. the majority of the observations have a value above the average.

Example 6.2: Find $\beta_1$ and $\gamma_1$ and interpret the outcome using the table below.

X    2    3    4    5    6
f    1    3    7    3    1

Solution:

Calculate the first four moments about the mean, where $\bar{X} = 4$ and $x = X - \bar{X}$.

X        f      x      fx     fx^2    fx^3    fx^4
2        1     –2     –2      4      –8      16
3        3     –1     –3      3      –3       3
4        7      0      0      0       0       0
5        3     +1     +3      3      +3       3
6        1     +2     +2      4      +8      16
Total   15             0     14       0      38

$$\mu_1 = \frac{\sum f_i (X_i - \bar{X})}{\sum f_i} = \frac{\sum fx}{N} = \frac{0}{15} = 0$$

$$\mu_2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i} = \frac{\sum fx^2}{N} = \frac{14}{15} = 0.933$$

$$\mu_3 = \frac{\sum f_i (X_i - \bar{X})^3}{\sum f_i} = \frac{\sum fx^3}{N} = \frac{0}{15} = 0$$

$$\mu_4 = \frac{\sum f_i (X_i - \bar{X})^4}{\sum f_i} = \frac{\sum fx^4}{N} = \frac{38}{15} = 2.533$$

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0^2}{0.933^3} = 0$$

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3} = \frac{0}{0.966^3} = 0$$

where $\sigma = \sqrt{0.933} = 0.966$. Since $\gamma_1$ is 0, the distribution is symmetrical.
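The arithmetic of Example 6.2 can be checked with a short script. This is a sketch under the assumption that the table is entered as two parallel lists; `central_moment` is a helper name introduced here for illustration.

```python
def central_moment(values, freqs, r):
    """r-th moment about the mean of a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n


values = [2, 3, 4, 5, 6]   # X column of Example 6.2
freqs = [1, 3, 7, 3, 1]    # f column of Example 6.2

mu2 = central_moment(values, freqs, 2)
mu3 = central_moment(values, freqs, 3)

beta1 = mu3 ** 2 / mu2 ** 3   # never negative, so it hides the direction
gamma1 = mu3 / mu2 ** 1.5     # keeps the sign of mu3

print(round(mu2, 3), mu3, beta1, gamma1)  # 0.933 0.0 0.0 0.0
```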


6.4 Kurtosis

Kurtosis in Greek means "bulginess". In statistics it refers to the degree of flatness or peakedness
in the region about the mode of a frequency curve. The degree of kurtosis of a distribution is
measured relative to the peakedness of the normal curve. If a curve is more peaked than the normal
curve, it is called "Leptokurtic". In such a case the items are more closely bunched around the
mode. On the other hand, if a curve is more flat-topped than the normal curve, it is called
"Platykurtic". The normal curve itself is known as "Mesokurtic". The condition of peakedness or
flat-toppedness itself is known as kurtosis.

Moment’s measures of kurtosis

The most important measure of kurtosis is the value of the coefficient $\beta_2$. It is defined as:

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

where $\mu_4$ is the 4th moment and $\mu_2$ is the 2nd moment about the mean.

The greater the value of $\beta_2$, the more peaked is the distribution.

For a normal curve the value of $\beta_2$ is 3. When the value of $\beta_2$ is greater than 3, the curve
is more peaked than the normal curve, i.e. leptokurtic. When the value of $\beta_2$ is less than 3, the
curve is less peaked than the normal curve, i.e. platykurtic. The normal curve and other curves
with $\beta_2 = 3$ are called mesokurtic.

Sometimes $\gamma_2$, which is derived from $\beta_2$, is used as a measure of kurtosis. It is defined as

$$\gamma_2 = \beta_2 - 3$$

For a normal distribution, $\gamma_2 = 0$. If $\gamma_2$ is positive, the curve is leptokurtic, and if
$\gamma_2$ is negative, the curve is platykurtic.

Explain the meaning of mesokurtic, platykurtic and leptokurtic distribution.

Example 6.3: Based on the data given in the table of Example 6.2, find $\beta_2$ and $\gamma_2$.

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{2.533}{0.933^2} = 2.91$$

$$\gamma_2 = \beta_2 - 3 = 2.91 - 3 = -0.09$$

Since the value of $\beta_2$ is less than 3 and the value of $\gamma_2$ is less than 0, the distribution is
platykurtic.
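The kurtosis coefficients of Example 6.3 can be verified in the same way. The sketch below assumes the moments are passed in as exact fractions from Example 6.2; the function names are illustrative.

```python
def kurtosis_coefficients(mu2, mu4):
    """Return (beta2, gamma2) from the 2nd and 4th central moments."""
    beta2 = mu4 / mu2 ** 2
    return beta2, beta2 - 3


def classify(gamma2):
    """Name the shape implied by the sign of gamma2."""
    if gamma2 > 0:
        return "leptokurtic"
    if gamma2 < 0:
        return "platykurtic"
    return "mesokurtic"


# Central moments from Example 6.2: mu2 = 14/15, mu4 = 38/15.
beta2, gamma2 = kurtosis_coefficients(14 / 15, 38 / 15)
print(round(beta2, 2), round(gamma2, 2), classify(gamma2))  # 2.91 -0.09 platykurtic
```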

Chapter summary

The shape of a distribution is assessed using measures of skewness. Skewness refers to the
lack of symmetry in a frequency distribution. Depending on its shape, a distribution may be
positively skewed, negatively skewed or symmetrical. The peakedness of the
distribution is assessed using measures of kurtosis. If a curve is more peaked than the normal
curve, it is called "Leptokurtic", and if a curve is more flat-topped than the normal curve, it is
called "Platykurtic". The normal curve itself is known as "Mesokurtic".

Self-evaluation test

1. Some characteristics of the annual family income distribution (in Birr) in two regions are as
follows:

Region    Mean    Median    Standard deviation
A         6250    5100      960
B         6980    5500      940

a. Calculate the coefficient of skewness for each region.
b. Which region shows a more skewed income distribution? Give your interpretation for this
region.

2. Compute $\beta_1$, $\gamma_1$, $\beta_2$ and $\gamma_2$ and interpret the results based on the following table.

Age (in years)    Number of Teachers
15 – 19           6
20 – 24           19
25 – 29           50
30 – 34           57
35 – 39           48
40 – 44           27
45 – 49           21
Total             228

UNIT SEVEN: SIMPLE LINEAR REGRESSION AND CORRELATION

7.1 Introduction

Regression and correlation analysis are used to study relationships among variables. Simple
regression analysis examines the relationship between two variables, which we call the dependent
and the independent variable. We try to determine the change in the value of the
dependent variable for a unit change in the value of the independent variable. For example, by how
much will the yield per hectare increase as we increase the amount of fertilizer by one gram? By
how much will a student's GPA increase if he/she increases reading time by one minute? As the
frequency of visits by the development agent increases by one day, by how much will the
probability of a farmer adopting a new technology increase?

A dependent variable may be affected by one independent variable or by many
independent variables. If we consider the change in the dependent variable as a function of only
one independent variable, the relationship is called simple regression. If the dependent variable is
seen as a function of two or more independent variables, the relationship is called multiple
regression. The relationship between two variables can also be linear or non-linear. A linear
relationship implies a constant absolute change in the dependent variable in response to a unit
change in the independent variable. A non-linear relationship implies a varying marginal change in
the dependent variable in response to changes in the independent variable.


In this chapter, we will confine ourselves to the type of regression involving only two variables
and the type of relationship between our variables which is linear. Finally we will discuss the
measurement of the closeness of the relationship between variables or the correlation analysis.

7.2 Simple linear regression

Regression analysis: a statistical technique that can be used to develop a mathematical
equation showing how variables are related. A regression equation is a statement of equality that
defines the relationship between two (or more) variables.

Simple linear regression: refers to the linear relationship between two variables, the dependent
variable Y and the independent variable X.

A simple linear regression line is the line fitted to the points plotted in a scatter diagram
so as to describe the average relationship between the two variables. Therefore, to see the
type of relationship, it is advisable to prepare a scatter plot before fitting the model.

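The linear model is:

$$Y = \alpha + \beta X + \varepsilon$$

where

$Y$ = dependent variable

$X$ = independent variable

$\alpha$ = regression constant

$\beta$ = regression slope

$\varepsilon$ = random disturbance term

with the distributional assumptions

$$Y \sim N(\alpha + \beta X, \sigma^2)$$

$$\varepsilon \sim N(0, \sigma^2)$$

where $\sigma^2$ is the variance of the disturbance term.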


The equation of the line which is to be used in predicting the value of the dependent variable
takes the form $\hat{Y} = \hat{\alpha} + \hat{\beta} X_i$.

The most widely used and statistically accepted method of fitting such an
equation is the method of least squares. The method of least squares states that the values of
$\hat{\alpha}$ and $\hat{\beta}$ should be chosen so that the sum of squared residuals is minimized.

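If, in the scatter diagram, $\varepsilon_1, \varepsilon_2, \varepsilon_3, \dots, \varepsilon_n$ are the deviations of the observed Y
values from the straight line (the predicted values $\hat{Y}$), then for a sample of size $n$, fitting a
straight line in keeping with the above condition requires that

$$\varepsilon_1^2 + \varepsilon_2^2 + \dots + \varepsilon_n^2 = \sum_{i=1}^{n} \varepsilon_i^2 \quad \text{is minimum.}$$

This can be done by partially differentiating $\sum \varepsilon_i^2$ with respect to $\hat{\alpha}$ and $\hat{\beta}$ and
equating the derivatives to zero.

$\varepsilon_i$ is the error made when taking $\hat{Y}_i$ instead of $Y_i$:

$$\varepsilon_i = Y_i - \hat{Y}_i$$

$$\sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2$$

$$\sum \varepsilon_i^2 = \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$$

Differentiating with respect to $\hat{\alpha}$:

$$\frac{\partial \sum \varepsilon_i^2}{\partial \hat{\alpha}} = \frac{\partial \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2}{\partial \hat{\alpha}} = 0$$

$$-2 \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$$

$$\sum Y_i - n\hat{\alpha} - \hat{\beta} \sum X_i = 0$$

$$\frac{n\hat{\alpha}}{n} = \frac{\sum Y_i}{n} - \frac{\hat{\beta} \sum X_i}{n}$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$

Differentiating with respect to $\hat{\beta}$:

$$\frac{\partial \sum \varepsilon_i^2}{\partial \hat{\beta}} = \frac{\partial \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2}{\partial \hat{\beta}} = 0$$

$$-2 \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) X_i = 0$$

$$\sum Y_i X_i - \hat{\alpha} \sum X_i - \hat{\beta} \sum X_i^2 = 0$$

Substituting $\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$:

$$\sum Y_i X_i - (\bar{Y} - \hat{\beta} \bar{X}) \sum X_i - \hat{\beta} \sum X_i^2 = 0$$

$$\hat{\beta} = \frac{\sum Y_i X_i - \bar{Y} \sum X_i}{\sum X_i^2 - \bar{X} \sum X_i}$$

Or equivalently, multiplying both the numerator and the denominator by $n$, we get

$$\hat{\beta} = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{n \sum X_i^2 - (\sum X_i)^2}$$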

Example 7.1: Suppose we want to study the relationship between input (number of workers) and
output (thousands of Birr) of five factories, as given in the table below.

Yi    Xi
4     2
7     3
3     1
9     5
17    9

a. Find the regression line of Yi (thousands of Birr) on Xi (number of workers) using the
method of least squares.
b. Estimate the output Yi (thousands of Birr) that a factory will have if it employs
12 workers.

Solution: a.

Yi     Xi     YiXi    Xi^2
4      2      8       4
7      3      21      9
3      1      3       1
9      5      45      25
17     9      153     81

$\sum Y_i = 40$, $\bar{Y} = 8$; $\sum X_i = 20$, $\bar{X} = 4$; $\sum Y_i X_i = 230$; $\sum X_i^2 = 120$

Substituting these values in the least squares formula, we get

$$\hat{\beta} = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{n \sum X_i^2 - (\sum X_i)^2} = \frac{5(230) - 40(20)}{5(120) - (20)^2} = \frac{1150 - 800}{600 - 400} = \frac{350}{200} = \frac{7}{4}$$

$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = 8 - \frac{7}{4}(4) = 1$$

Therefore the least squares regression equation is

$$\hat{Y} = 1 + \frac{7}{4} X_i$$

b. For $X_i = 12$:

$$\hat{Y} = 1 + \frac{7}{4}(12) = 22$$

Consequently, if the factory employs 12 workers, its level of output will be 22,000 ETB.
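The least squares formulas can be applied to the data of Example 7.1 with a few lines of code. This is a minimal sketch; the function name `fit_line` and the variable names are assumptions made for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n   # alpha = Y-bar - beta * X-bar
    return intercept, slope


workers = [2, 3, 1, 5, 9]    # Xi, number of workers
output = [4, 7, 3, 9, 17]    # Yi, thousands of Birr

alpha_hat, beta_hat = fit_line(workers, output)
prediction = alpha_hat + beta_hat * 12

print(alpha_hat, beta_hat, prediction)  # 1.0 1.75 22.0
```

Swapping the two arguments, `fit_line(output, workers)`, gives the regression of X on Y.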

Regression of X on Y

Sometimes it is possible, and of interest, to fit the regression of X on Y, i.e. to treat
Y as the independent and X as the dependent variable. In such a case the general form of the
equation is given by

$$\hat{X} = \hat{\alpha} + \hat{\beta} Y_i$$

Applying the principle of least squares as before, the constants $\hat{\alpha}$ and $\hat{\beta}$ of this equation
are given as follows:

$$\hat{\alpha} = \bar{X} - \hat{\beta} \bar{Y}$$

$$\hat{\beta} = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{n \sum Y_i^2 - (\sum Y_i)^2}$$

The regression lines of Y on X and of X on Y intersect at the point $(\bar{X}, \bar{Y})$.
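The intersection at the means follows directly from the two intercept formulas. In the short derivation below, $\hat{\alpha}'$ and $\hat{\beta}'$ denote the coefficients of the regression of X on Y; the primes are notation introduced here only to keep the two fits apart.

```latex
\begin{align*}
\text{Y on X at } X = \bar{X}: \quad
\hat{Y} &= \hat{\alpha} + \hat{\beta}\bar{X}
         = (\bar{Y} - \hat{\beta}\bar{X}) + \hat{\beta}\bar{X}
         = \bar{Y} \\
\text{X on Y at } Y = \bar{Y}: \quad
\hat{X} &= \hat{\alpha}' + \hat{\beta}'\bar{Y}
         = (\bar{X} - \hat{\beta}'\bar{Y}) + \hat{\beta}'\bar{Y}
         = \bar{X}
\end{align*}
```

Both lines therefore pass through $(\bar{X}, \bar{Y})$. With the data of Example 7.1, $\sum Y_i^2 = 444$, so $\hat{\beta}' = 350/620 \approx 0.565$ and $\hat{\alpha}' = 4 - 0.565(8) \approx -0.516$; setting $Y = 8$ gives $\hat{X} \approx 4$, as expected.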

7.3 Correlation

Correlation analysis helps us determine the degree of relationship between two or more
variables. If two quantities move in such a way that movements in one are accompanied by
movements in the other, these quantities are correlated. For example, there is some
relationship between the price of a commodity and the amount demanded, and between an increase
in rainfall (up to a point) and the production of wheat.

Simple, partial and multiple correlation: The distinction between simple, partial and multiple
correlation is based upon the number of variables studied. When only two variables are studied
it is a problem of simple correlation. When three or more variables are studied it is a problem of
either multiple or partial correlation. In multiple correlation three or more variables are studied
simultaneously. In partial correlation, on the other hand, we recognize more than two variables
but consider only two variables to be influencing each other, the effect of the other influencing
variables being kept constant.

Linear and non-linear correlation: The distinction between linear and non-linear correlation is
based upon the constancy of the ratio of change between the variables. If the amount of change
in one variable tends to bear a constant ratio to the amount of change in the other variable, the
correlation is said to be linear. If it does not bear a constant ratio to the amount of change
in the other variable, the correlation is called non-linear or curvilinear.

Positive and negative correlation: If both variables vary in the same direction, i.e. if
both variables are increasing or both are decreasing, the correlation is said to be positive.
If, on the other hand, the variables vary in opposite directions, i.e. as one
variable increases the other decreases or vice versa, the correlation is said to be negative.

In this study we will concentrate on the relationship between two variables, which is simple
correlation.


The degree of relationship between the variables under consideration is measured
through correlation analysis. The measure of correlation, called the correlation coefficient or
correlation index, summarizes in one figure the direction and degree of correlation.

Karl Pearson’s coefficient of correlation

The Karl Pearson method, popularly known as Karl Pearson's coefficient of correlation, is the most
widely used in practice. The Pearsonian coefficient of correlation is denoted by the symbol $r$.

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \cdot \sum (Y_i - \bar{Y})^2}}$$

This is termed the product-moment formula, and it can be further simplified as

$$r = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{\sqrt{[n \sum X_i^2 - (\sum X_i)^2] \cdot [n \sum Y_i^2 - (\sum Y_i)^2]}}$$

Example 7.2: Find the Pearsonian coefficient of correlation for the two variables from the data
given below.

X    9    8    7    6    5    4    3    2    1

Y    15   16   14   13   11   12   10   8    9

Solution: N = 9

X     X^2    Y      Y^2     XY
9     81     15     225     135
8     64     16     256     128
7     49     14     196     98
6     36     13     169     78
5     25     11     121     55
4     16     12     144     48
3     9      10     100     30
2     4      8      64      16
1     1      9      81      9

$\sum X = 45$, $\sum X^2 = 285$, $\sum Y = 108$, $\sum Y^2 = 1{,}356$, $\sum XY = 597$

$$r = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{\sqrt{[n \sum X_i^2 - (\sum X_i)^2] \cdot [n \sum Y_i^2 - (\sum Y_i)^2]}}$$

$$r = \frac{9(597) - 45(108)}{\sqrt{[9(285) - (45)^2] \cdot [9(1356) - (108)^2]}} = \frac{513}{\sqrt{540 \times 540}} = \frac{513}{540} = +0.95$$

Interpretation: There is a strong positive correlation between X and Y.
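The simplified product-moment formula translates directly into code. The sketch below recomputes Example 7.2; `pearson_r` is a name chosen here for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation via the simplified sums formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


x = [9, 8, 7, 6, 5, 4, 3, 2, 1]
y = [15, 16, 14, 13, 11, 12, 10, 8, 9]

r = pearson_r(x, y)
print(round(r, 2))  # 0.95
```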

Properties of Pearsonian coefficient of correlation;

1. $-1 \le r \le 1$.
2. When $r = 0$ there is no linear correlation.
3. The correlation coefficient $r$ is independent of change of scale and origin.
4. The closeness of the relationship is not proportional to the value of $r$.
5. When $r = +1$ there is perfect positive correlation, and when $r = -1$ there is perfect
negative correlation. The closer $r$ is to +1 or –1, the closer the relationship between the
variables, and the closer $r$ is to 0, the less close the relationship.
6. It is free from any unit of measurement used.

Spearman’s Rank Correlation Coefficient

The Pearsonian coefficient of correlation cannot be used in cases where direct quantitative
measurement of the phenomenon under study is not possible. In such cases, we make use of
Spearman's rank correlation coefficient, which is defined as follows:

$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \quad \text{or} \quad R = 1 - \frac{6 \sum D^2}{N^3 - N}$$

where $D$ is the difference between the two ranks of an item and $N$ is the number of items.

Steps in calculating Spearman’s rank correlation coefficient

1. Rank the X values among themselves, giving rank 1 to the largest (or smallest) value,
rank 2 to the next largest (or smallest) value, and so on.

2. Rank the Y values among themselves in the same way as for X.

3. Find the sum of the squares of the differences between the ranks of the two variables, $\sum D^2$.

4. Finally, apply the formula.


Example 7.3: The rankings of 10 students in two subjects A and B are as follows. See if there
is any correlation between the students' performance in the two subjects.

A    6    5    3    10    2    4    9     7    8    1

B    3    8    4    9     1    6    10    7    5    2

Solution:

R1     R2     D^2 = (R1 – R2)^2
6      3      9
5      8      9
3      4      1
10     9      1
2      1      1
4      6      4
9      10     1
7      7      0
8      5      9
1      2      1

$\sum D^2 = 36$

$$R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 36}{10^3 - 10} = 1 - \frac{216}{990} = 0.782$$

Interpretation: Since $R = 0.782$, there is a fairly strong similarity in the ranks of the 10
students in the two subjects.
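The calculation in Example 7.3 can be sketched as follows, assuming the ranks are already given and contain no ties. The function name `spearman_from_ranks` is illustrative.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's R = 1 - 6*sum(D^2) / (N^3 - N) for untied ranks."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


rank_a = [6, 5, 3, 10, 2, 4, 9, 7, 8, 1]   # ranks in subject A
rank_b = [3, 8, 4, 9, 1, 6, 10, 7, 5, 2]   # ranks in subject B

R = spearman_from_ranks(rank_a, rank_b)
print(round(R, 3))  # 0.782
```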

Equal ranks: If two or more individuals are equal, it is customary to give each individual an
average rank. If two individuals are ranked equal at fourth place, they are each given the rank
$\frac{4+5}{2} = 4.5$; if three are ranked equal at fourth place, they are each given the rank
$\frac{4+5+6}{3} = 5$.

When equal ranks are assigned to some entries, the above formula for calculating the rank
coefficient of correlation is adjusted. The adjustment consists of adding
$\frac{1}{12}(m^3 - m)$ to the value of $\sum D^2$, where $m$ stands for the number of items whose
ranks are common. If there is more than one such group of items with a common rank, the value is
added once for each group. The formula can thus be written:

$$R = 1 - \frac{6\left\{\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right\}}{N^3 - N}$$

where $m_1, m_2, \dots$ are the sizes of the groups of tied items.
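The average-rank rule and the tie adjustment can be sketched together. The code below is an illustration under stated assumptions: rank 1 goes to the smallest value, the helper names are hypothetical, and the four-item data set is invented for the example.

```python
from collections import Counter


def average_ranks(values):
    """Rank values (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last position holding the same value as position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_with_ties(x, y):
    """Spearman's R with (m^3 - m)/12 added to sum(D^2) for each tied group."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum(
        (m ** 3 - m) / 12
        for ranks in (rx, ry)
        for m in Counter(ranks).values()
        if m > 1
    )
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


# Hypothetical scores: the two 20s tie for ranks 2 and 3, so each gets 2.5.
x = [10, 20, 20, 30]
y = [1, 2, 3, 4]
print(average_ranks(x))                      # [1.0, 2.5, 2.5, 4.0]
print(round(spearman_with_ties(x, y), 2))    # 0.9
```

Here $\sum D^2 = 0.5$ and the single tied pair adds $(2^3 - 2)/12 = 0.5$, so $R = 1 - 6(1)/60 = 0.9$.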


Chapter summary

Simple regression analysis measures the average relationship between two variables.
Regression analysis provides estimates of values of the dependent variable from values of the
independent variable. Correlation analysis helps us determine the degree of relationship
between two or more variables. The relationship between variables can be positive or negative,
and its degree is summarized by the correlation coefficient.

Self-evaluation test

1. Based on the data given below, answer the following questions.

Weight of Fathers (in Kg) (Y)    65  63  67  64  68  62  70  66  68  67  69  71
Weight of Sons (in Kg) (X)       68  66  68  65  69  66  68  65  71  67  68  70

A. Find the regression equation of X on Y.
B. Find the regression equation of Y on X.
C. Find the corresponding value of X if Y is 65.
D. Find the corresponding value of Y if X is 70.

2. The following table gives indices of industrial production and registered unemployed (in
hundred thousand). Calculate the value of the correlation coefficient.

Year                     1991  1992  1993  1994  1995  1996  1997  1998
Index of production      100   102   104   107   105   112   103   99
Number of unemployed     15    12    13    11    12    12    19    26

3. Two ladies were asked to rank 7 types of lipsticks. Calculate Spearman's rank
correlation coefficient based on the ranks given by them, which are as follows:

Lipsticks    A  B  C  D  E  F  G
Selamawit    2  1  4  3  5  7  6
Tigist       1  3  2  4  5  6  7
