Statistics is a branch of applied mathematics concerned with the collection and interpretation of
quantitative data and the use of probability theory to estimate population parameters. It is a
science that helps us make better decisions in business and economics as well as in other fields.
Most research in different areas of study requires data in order to generate valuable information
that facilitates the decision-making process. Data are the raw material of research. Moreover, the
quality of the collected data greatly affects or determines the precision of results to be obtained
from a specific investigation. Therefore, it is extremely important to know about the basics of
data collection.
Statistics is a very broad subject, with applications in a vast number of different fields.
In general, one can say that Statistics is a collection of methods for planning experiments,
obtaining data, organizing, summarizing, presenting, analyzing, interpreting and drawing
conclusions based on the data.
Statistics may be defined either as statistical data (plural sense) or as a
method (singular sense). Each of these definitions is treated separately as follows.
a. In the plural sense: statistics are the raw data themselves, like statistics of births,
statistics of deaths, statistics of students, statistics of imports and exports, etc.
b. In the singular sense: statistics is the subject that deals with the statistical methods of
collecting, organizing, presenting, analyzing and interpreting numerical data.
Debre Berhan University, Economics Program Introduction to Statistics
In general Statistical methods are classified into two groups or areas based on how data are used:
descriptive statistics and inferential statistics.
Descriptive statistics, or just description, is the use of statistics to describe characteristics of the
data that we have. Descriptive Statistics consists of the collection, organization, summarization,
and presentation of numerical data. It is concerned with describing certain characteristics of a set
of observed data (usually a sample) – that is, what it is shaped like, what number the values tend
to cluster (converge) around, how much variation is present in the data, and so forth.
In short, descriptive Statistics describes the nature or characteristics of a data set without drawing
conclusions or making generalizations. The following are some examples of descriptive Statistics.
The average age of athletes who participated in the New York Marathon was 24 years.
45% of the students in Debre Berhan University are female.
The marks of 50 students in the Introduction to statistics course are found to range from
45 to 95.
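As a minimal sketch of what descriptive statistics looks like in practice, the following Python snippet summarizes a small set of marks. The marks are made-up values chosen only for illustration; they are not data from this course.

```python
import statistics

# Hypothetical marks for ten students (illustrative values only)
marks = [45, 52, 60, 61, 68, 70, 74, 81, 88, 95]

mean_mark = statistics.mean(marks)       # value the marks tend to cluster around
mark_range = (min(marks), max(marks))    # lowest and highest mark
std_dev = statistics.pstdev(marks)       # how much variation is present

print(f"mean = {mean_mark}, range = {mark_range}, std dev = {std_dev:.2f}")
```

The snippet only describes the ten marks it was given; it makes no claim about any larger group of students.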
Inferential Statistics, also called inductive Statistics, is a method to generalize from a sample to a
population. It is concerned with the process of drawing conclusions (inferences) about specific
characteristics of a population based on information obtained from samples, performing
hypothesis testing, determining relationships among variables, and making predictions. The area
of inferential Statistics aims to give reasonable estimates of unknown
population parameters. The following are some examples of inferential Statistics.
The result obtained from the analysis of the income of 1000 randomly selected citizens in
Ethiopia suggests that the average per capita income in Ethiopia is 30 Birr.
The trend analysis on past data indicates that the average exchange rate for a dollar is
expected to be 8.30 birr in the coming month.
• Descriptive statistics is focused on summarizing the data collected from a sample. The
technique produces measures of central tendency and dispersion which represent how the
values of the variables are concentrated and dispersed.
• Inferential statistics generalizes the statistics obtained from a sample to the general
population to which the sample belongs. The measures of the population are termed as
parameters.
• Descriptive statistics only summarizes the properties of the sample from which
data were acquired, but in inferential statistics, the measure from the sample is used to infer
properties of the population.
• In inferential statistics, the parameters are estimated from a sample rather than from the whole
population; therefore, some uncertainty always exists relative to the true values.
Single and isolated figures are not Statistics for the simple reason that such figures are unrelated
and can’t be compared. According to this aspect, to be Statistics, data must be in aggregate
(mass) and also the individual elements within the aggregate should relate to a common
phenomenon so that they can be compared to one another.
Since Statistics are most commonly used in the social sciences, it is natural that they are affected by a
large variety of factors at the same time. Put differently, Statistics are not caused by a
single factor (force), rather they are outcomes of a number of (multiple) factors (forces)
operating together.
All Statistics are expressed in numbers. Nevertheless, the converse of this statement is not in
general true. That is, statements expressed in terms of numbers may not necessarily be Statistics
as there is a possibility for them not to meet the other requirements of the definition given above.
Numerical statements can either be enumerated, in which case they are supposed to be accurate
and precise or else they can be estimated by some expert observers, in which case 100%
accuracy is unlikely to be attained. In the process of estimation, reasonable standards of accuracy
must, however, be attained.
If data are collected in a haphazard manner, then the results obtained are likely to lead to
fallacious conclusions. Therefore, it is essential that Statistics be collected in a systematic
manner so that they may conform to reasonable standards of accuracy.
Statistics collected without any predetermined purpose do not serve any useful purpose.
Therefore, the purpose of collecting Statistics should be defined clearly before they are collected.
Meaning, figures (Statistics) should be collected in view of some goal or target. Moreover, the
data should be collected in such a manner that it meets the predetermined needs.
They should be comparable either period wise or region wise, or in reference to some other
means of comparison. As an example, suppose that the marketing head of a given supermarket in
Addis Ababa wants to know the average expenditure of households in the city, among other
things, so as to revise his marketing strategy. To achieve his objective, the head collects data on
expenditure from a sample of 1000 households selected using stratification (to be discussed in
the next chapter). Moreover, the head uses the interview approach to gather the required
information.
Thus, the data collected by the marketing head are Statistics as they fulfill all the requirements of
the definition.
First, the data are collected from 1000 households. Hence, they form an aggregate. Second, data on
expenditure are affected by a number of factors like income, taste, season, culture, etc. Third,
household expenditures, being expressed in terms of currency, are clearly numeric values.
Fourth, as long as expenditures are expressed to the nearest, say, birr, the degree of accuracy is
more than satisfactory. Fifth, since the investigator uses the stratified sampling technique and also
the interview data collection method, it is reasonable to say that the data are collected in a
systematic or scientific manner. The second-to-last requirement is fulfilled as the investigator has a
pre-determined end or goal, which is revising his marketing strategy, based on the average
expenditure level and other qualitative and quantitative variables. Finally, it is obvious that
the expenditure of households can be analyzed (compared) either region wise or period wise.
The main function of Statistics is to collect and present numerical data in a systematic manner so
that it may be analyzed in a scientific way. Statistics basically concentrates on the analysis of a
phenomenon in a scientific manner, without proving it. The analysis of data, which is the core
objective of Statistics, is important because it helps to avoid or replace arbitrary decisions,
dogmatism, rules of thumb and tradition, and it encourages the habit of making decisions based
on analyzed quantitative facts.
The human mind is not capable of assimilating huge masses of facts and figures,
as they are complex to understand. Statistical methods simplify this complexity by making
huge data sets easily intelligible and readily understandable. That is, Statistical methods provide the
necessary means to condense masses of data and present them with the help of simple figures such
as averages, ratios, variations, measures of skewness, coefficients, etc. More attractive and
understandable presentations of data are also made through the help of diagrams and graphs.
Statistics presents facts in a precise and definite (numeric) form and thus helps proper
comprehension of what is to be stated. Facts (data) stated in quantitative terms are more precise
and simple to understand, analyze and interpret. To give an example, suppose that you read the
statement:
This statement does not give precise and complete information because it doesn’t present the
quantified rainfall in the year 1995. On the other hand, the statement may be presented as:
“The average rainfall in 1995 is expected to decrease by 5% from that of 5ml recorded in 1994.”
Clearly, since the rate of decrease in rainfall is quantitatively expressed (5%), it is possible to
obtain the precise expected rainfall in 1995 in quantitative form.
The very reason for saying numerical data are more precise is that they are amenable to (lend
themselves to) comparison. By furnishing different suitable devices or tools for comparison, like
averages and measures of dispersion, Statistics enables better understanding and appreciation of
the significance of a series of figures. Moreover, in most cases conclusions or decisions are
reached mainly based on the results obtained from different comparisons. For example, the
aggregate performance of students in two different sections (classes) can be judged by
comparing the average marks for the two classes.
iv. Predictions
One of the major reasons Statistical methods are so critical in business is their prediction
function. Prediction is the process of making a scientific guess about the future value of a
variable. Statistical methods make it possible to predict the likely future value of a variable based
on its past trend. Time series and regression analysis are the most commonly used methods
for prediction.
If carefully handled, Statistics plays an invaluable role in the process of policy formulation.
Statistical data and Statistical methods help
the government in formulating suitable policies with respect to taxation, import-export,
budgeting and other socio-economic welfare programs.
The fact that Statistics is applicable in almost all fields of study is not a guarantee for its
perfection. Of course, there is no perfect science in the world. Statistical methods as well have
their own limitations. The following are the major limitations:
This means that Statistics deals only with aggregates of facts, and no importance is attached
to individual items. For instance, the age of a single student in a given class in a given year is not
Statistical data. In contrast, the ages of all students within a given class in a given year form an
aggregate and hence can be considered as data. Alternatively, the semester GPAs of a single
student over 4 semesters also form Statistical data. In short, Statistical methods are suited only
to those problems or situations where group characteristics are desired to be studied.
Another limitation of Statistics is that it deals with those subjects of inquiry that are capable of
being quantitatively measured and numerically expressed. Accordingly, such qualitative
characteristics as health, poverty, honesty and intelligence are not directly suitable for Statistical analysis.
However, problems involving such qualitative variables are treated in Statistics indirectly. For
example, the variable health may be studied through death rate, which is a quantitative variable.
However, these are only indirect methods.
As it is often said, Statistical results are true only on the average. Meaning, the results obtained
from Statistical data analysis are not true for each member or item within the data for which the
analysis is made. Statistical statements or conclusions are not generally true or applicable to
individuals, but are applicable to the majority of cases.
Chapter summary
2.1 Introduction
A sampling survey is simply the process of learning about the population on the basis of a sample
drawn from it. Thus, in sampling, only part of the population is studied instead of every unit, and
conclusions about the entire population are drawn on that basis. The
process of a sampling survey involves three elements: selecting the sample, collecting the
information and making an inference about the population.
Population: In Statistics the term population is used to mean the totality of cases (items) under
consideration in a given investigation or research. In other words, the largest collection of
observations on a variable constitutes the population.
Census: The process of gathering data from every element in the population.
Sample: A part of the population of interest. Any non-empty subset of a population is called a
sample. There are different possible samples that can be selected from a single population.
Nevertheless, the one that best reflects or represents the behavior of the population is considered
to be the most appropriate one.
Sampling Error: The difference between a sample statistic and the corresponding population parameter.
Sampling Unit: Elements of the population to be sampled or the unit of selection in the sampling
process.
Sample design: The set of procedures (methods) for selecting the sample elements from the
population.
Sampling Frame: The list of all possible units in the reference population.
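The terms defined above can be illustrated with a short Python sketch. The population here is randomly generated and purely hypothetical (expenditure figures for 1,000 households); the only point is to show a sample statistic, a population parameter, and the sampling error between them.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: monthly expenditure (in birr) of 1,000 households
population = [random.randint(500, 5000) for _ in range(1000)]

# A sample: a non-empty subset of the population (here 50 households)
sample = random.sample(population, 50)

population_mean = statistics.mean(population)    # population parameter
sample_mean = statistics.mean(sample)            # sample statistic
sampling_error = sample_mean - population_mean   # difference between the two

print(f"parameter = {population_mean:.1f}, statistic = {sample_mean:.1f}, "
      f"sampling error = {sampling_error:.1f}")
```

A different sample would generally give a different sample mean and therefore a different sampling error.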
If the values of a variable can be represented numerically, it is called a quantitative variable.
Variables whose values cannot be represented numerically are called qualitative variables.
Quantitative variables are further divided into two groups: discrete and continuous. Discrete
variables are those whose values are obtained by counting, e.g., the number of students or the number of
books in a library. Continuous variables are those quantitative variables which can take any value
between two numbers. Their values are obtained by measurement, e.g., the weight or height of a person.
i. Cost/Economy
The unit cost of collecting data in a census is significantly lower than in a sample survey.
However, due to the larger number of items in the population, the total cost of a census
is significantly higher than that of a sample. Suppose it costs Birr 200 per unit to
make a census of 10,000,000 individuals, but the unit cost of sampling 5000 individuals is Birr
1000. Thus, the total cost of the census is 10,000,000 x 200 = Birr 2,000,000,000, but that of the sample is 5,000 x
1000 = Birr 5,000,000.
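The cost comparison can be checked with a few lines of Python. This is only an arithmetic sketch using the unit costs and counts quoted in the example; the function name total_cost is chosen here for illustration.

```python
def total_cost(units: int, unit_cost: int) -> int:
    """Total data-collection cost in birr: number of units times cost per unit."""
    return units * unit_cost

census_total = total_cost(10_000_000, 200)   # census of the whole population
sample_total = total_cost(5_000, 1_000)      # sample of 5,000 individuals

print(f"census: Birr {census_total:,}")
print(f"sample: Birr {sample_total:,}")
print(f"the census costs {census_total // sample_total} times as much")
```

Even though each sampled unit costs five times as much as a census unit, the sample is far cheaper in total because so few units are covered.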
ii. Timeliness
Due to the larger size of the population, the total time involved in a census is significantly
higher than that of sampling (i.e., the sample may provide us with the necessary information
more quickly).
In some cases the entire population may not be accessible due to disease, death, conflict, mental
illness, imprisonment, etc. In that case sampling is necessary.
Due to the destructive nature of many tests, it is feasible to collect information only
from part of the population. For example: a blood test for a patient, the life hours of a tube light,
the strength of wires, etc.
vi. Accuracy
Non-sampling error in the case of a census is higher than the non-sampling error committed in the
case of a sample survey (as less qualified investigators are involved in a census and the
supervision, monitoring and quality control mechanisms in a census may be poor). The
higher the degree of non-sampling error, the less reliable the result may be.
There are two principal methods of drawing a sample from a population: Probability sampling
and Non-probability sampling.
In the case of probability sampling each unit in the population has a known, nonzero
chance of being selected to become part of the sample. Human judgment plays no part in the selection in the case
of probability sampling.
Simple Random Sampling is a method of probability sampling in which every unit in the
population has an equal nonzero chance of being selected as part of the sample. In other words,
each element of the population has an equal and independent chance of being included in the
sample. The probability is given by n/N, where n is the sample size and N is the population size.
Lottery method- In this method, each population item is numbered 1 to N on identical
cards (same size, shape and color). The numbered cards are then placed in a bowl and mixed thoroughly, and
as many cards as needed are selected blindfolded. The subjects whose numbers are
selected constitute the sample. Since it is difficult to mix the cards thoroughly, there is a chance
of obtaining a biased sample. Thus we need another method of selecting sample elements.
Random Number method- due to the problems of the lottery method, statisticians use another method
known as the random number method, where numbers are generated using computers or are taken from
random number tables available in the annexes of textbooks.
a) Assign a unique number to each population element in the sampling frame. Start with
serial number 1, or 01, or 001, etc. depending on the number of digits required.
b) Choose a random starting position by closing your eyes (blind fold selection) and placing
your finger on a number in the table.
c) Select serial numbers across rows or down columns or diagonally from the starting point.
d) Discard numbers that are not assigned to any population element and ignore numbers that
have already been selected.
e) Repeat the selection process until the required number of sample elements is selected.
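When random numbers are generated by computer rather than read from a table, the steps above can be sketched in Python as follows. The population size N = 100 and sample size n = 10 are assumptions made only for this illustration.

```python
import random

N = 100   # population size (assumed for illustration)
n = 10    # required sample size (assumed for illustration)

# Step a: a sampling frame with a unique serial number for every element
frame = list(range(1, N + 1))

# Steps b to e: random.sample draws n distinct serial numbers from the frame,
# so numbers outside the frame or already selected can never appear
sample = sorted(random.sample(frame, n))

print("selected serial numbers:", sample)
print("inclusion probability n/N =", n / N)
```

Because the selection is without replacement, the discard rule in step d is handled automatically by the library call.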
Advantage of simple random sampling
It ensures that the sample is unbiased.
Disadvantages of simple random sampling
It requires a Sampling Frame, and this is sometimes impossible (the case of fish
population).
If the population is very large, it is tedious and time consuming to number and select the
sample.
Minority subgroups of the population may not be represented in the sample.
In stratified sampling, a population is first divided into subgroups, called strata (singular
stratum), and a sample is selected from each stratum based on simple random or systematic
sampling method. The strata are formed so that units within each stratum are homogeneous with respect to characteristics such as
sex, race, region or institutional affiliation such as faculty. Stratified sampling is applied if the
population is heterogeneous.
Step 2- Select an independent simple random sample from each stratum;
Step 3- Form the final sample by consolidating all sample elements chosen in step 2.
Stratified samples can be:
Proportionate: involving the selection of sample elements from each stratum, such that the ratio
of sample elements to the total number of population elements (n/N) is constant/equal for all strata.
Disproportionate: the sample is disproportionate when the above-mentioned ratio is unequal across strata.
Example 2.3: Select a proportionate stratified sample of 20 households from 100 households in Addis Ababa
that belong to three income groups: low (50), middle (30) and high (20) (N=50+30+20=100).
Sub-divide the households into three homogeneous sub-groups or strata by
income group: low, middle and high.
Calculate the overall sampling fraction, f, in the following manner: f=n/N=20/100=0.2,
where n = sample size and N = population size. Then n1=0.2*50=10, n2=0.2*30=6 and n3=0.2*20=4.
Thus, n=n1+n2+n3=10+6+4=20
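The allocation in Example 2.3 can be reproduced with a short Python sketch. The household labels (such as "low-1") are hypothetical placeholders for a real sampling frame, and rounding the allocation is an assumption that happens to work here because every stratum yields a whole number.

```python
import random

strata_sizes = {"low": 50, "middle": 30, "high": 20}   # sizes from Example 2.3
n = 20                                                 # required sample size
N = sum(strata_sizes.values())                         # population size (100)
f = n / N                                              # sampling fraction (0.2)

# Proportionate allocation: each stratum contributes the same fraction f.
# In general the rounded values may not add up to n exactly and need adjusting.
allocation = {name: round(n * size / N) for name, size in strata_sizes.items()}

# Hypothetical frame: one label per household within each stratum
frame = {name: [f"{name}-{i}" for i in range(1, size + 1)]
         for name, size in strata_sizes.items()}

# Independent simple random sample from each stratum, then consolidate
stratified_sample = []
for name, households in frame.items():
    stratified_sample.extend(random.sample(households, allocation[name]))

print("sampling fraction:", f)
print("allocation:", allocation)
print("total sample size:", len(stratified_sample))
```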
Advantage of Stratified Sampling:
The representation of the sample is improved
Disadvantages of Stratified Sampling:
If there are many variables of interest, dividing a large population into representative
subgroups requires a great deal of effort,
If variables are somewhat complex or ambiguous (such as beliefs, attitudes, etc.), it is
difficult to separate individuals into the sub-groups according to these variables.
In systematic sampling only one random number is needed throughout the entire sampling
process. Elements of the population will be arranged in some order and the elements to be
included in the sample will be selected at a constant interval.
The first element (number), which is between 1 and k, is determined using simple
random sampling and then the next items are selected using the skip interval k. For instance, the
jth unit is selected first and then the (j + k)th, (j + 2k)th, etc. until the required sample size is
obtained.
Example 2.4: Suppose there are 2000 subjects in the population and a sample of 50 subjects
is needed. The sampling interval (k) is 40 (2000/50). Select the starting point, say ‘x’, from 1
through 40 using simple random sampling, and then include every 40th element starting from ‘x’.
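Example 2.4 can be sketched in Python as follows. The subjects are represented simply by their positions 1 to 2000 in the ordered list, which is an assumption made for illustration.

```python
import random

N = 2000        # population size from Example 2.4
n = 50          # required sample size
k = N // n      # sampling interval: 2000 / 50 = 40

# Random start between 1 and k, chosen by simple random sampling
start = random.randint(1, k)

# Every k-th element from the start: start, start + k, start + 2k, ...
systematic_sample = list(range(start, N + 1, k))

print("interval k =", k)
print("random start =", start)
print("first five selected positions:", systematic_sample[:5])
print("sample size:", len(systematic_sample))
```

Only one random number (the start) is used; every other selected position follows from the interval.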
If there is any sort of cyclic ordering of the subjects, the sample will not be representative of the
population. Example: If subjects in the population are arranged in a manner such as:
Defective item
Non-defective item
Defective item
Non-defective item, etc.,
then with an even sampling interval every selected item will be of the same type.
Cluster sampling can be used if the population is homogeneous and very large in size. It is a
type of sampling in which the population is divided into non-overlapping heterogeneous groups
called clusters, and clusters of elements are sampled as the sampling units using the
simple random sampling technique in the first phase (if it is two-phase cluster sampling). In
other words, cluster sampling is a type of sampling which involves dividing the population into
groups (or clusters). Then, one or more clusters are chosen at random and the individuals within the
chosen clusters are sampled.
A two-step process:
Step 1- The defined population is divided into a number of mutually exclusive and
collectively exhaustive heterogeneous groups or clusters;
Step 2- Select a sample of clusters using simple random
sampling.
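The two steps can be sketched in Python as follows. The cluster names (five kebeles labelled A to E), the four households in each, and the choice of two clusters are all hypothetical and serve only to illustrate the procedure.

```python
import random

# Step 1: the population divided into mutually exclusive, collectively
# exhaustive clusters (hypothetical kebeles with four households each)
clusters = {
    f"kebele_{c}": [f"kebele_{c}-hh{h}" for h in range(1, 5)]
    for c in "ABCDE"
}

# Step 2: choose clusters, not individuals, by simple random sampling
chosen = random.sample(list(clusters), 2)

# Every household in the chosen clusters enters the sample
cluster_sample = [hh for name in chosen for hh in clusters[name]]

print("chosen clusters:", chosen)
print("sampled households:", cluster_sample)
```

Note that only a list of clusters is needed for the random draw; a list of all individual households in the population is not.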
Advantages of Cluster Sampling:
A list of all individual study units in the reference population is not required,
Reduces cost, and
Simplifies field work and it is convenient.
Disadvantages:
The members of the clusters are often more homogeneous than the members of the
whole population and therefore, it may not be representative.
The elements in a cluster may not have the same variation in characteristics as
elements selected individually from the population.
In multi-stage sampling, several levels of nested clusters are involved, where sample units are
clusters at each stage except the final stage. It is known as 'multistage' because there are multiple
stages, or steps, to creating the sample. The first stage in multistage sampling is the same as
cluster sampling. Thus, it is a complex form of cluster sampling. It often includes both stratified
and cluster sampling techniques. For instance we can cluster Ethiopia into regions, zones,
Woredas, kebeles and finally take households from sampled kebeles using simple random
sampling or stratified sampling.
Multi-phase sampling is designed to make use of the information collected in one phase to
develop the sampling design for the next phase. For instance, in double-phase sampling, the first
phase may consider the relationship between income and expenditure, and using the information
obtained in the first phase, the surveyed households are divided into groups based on income levels
(strata). Alternatively, it is sometimes convenient and economical to collect certain items of information
from the whole of the units of a sample and other items of usually more detailed information
from a sub-sample of the units constituting the original sample. This may be termed two-phase
sampling. For instance, if the collection of information concerning a variate, Y, is relatively
expensive, and there exists some other variate, X, correlated with it, which is relatively cheap to
investigate, it may be profitable to carry out sampling in two phases.
In the case of non-probability sampling, not every unit in the population has a chance of being
included in the sample. It involves at least some degree of personal subjectivity instead of
following predetermined, probabilistic rules for selection.
i. Convenience sampling
Convenience sampling implies a sample drawn at the convenience of the researcher. It is common
in exploratory research. It does not lead to any firm conclusion about the population.
In this method, the decision maker requires the sample to contain a certain number of items with
a given characteristic. It is something like judgmental sampling. Individuals are selected from
each quota using simple random sampling.
Example 2.5: Suppose we know that 54% of the adults in a community are females, and the
study requires 100 respondents as a sample. In quota sampling, we might interview the first 54
females and the first 46 males.
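Example 2.5 can be sketched in Python as follows. The stream of respondents is hypothetical (300 people who alternate between female and male in the order they are met); only the quotas of 54 and 46 come from the example.

```python
# Quotas from Example 2.5: 54% of 100 respondents female, the rest male
quotas = {"female": 54, "male": 46}

# Hypothetical respondents in the order the interviewer meets them
respondents = [(i, "female" if i % 2 == 0 else "male") for i in range(300)]

filled = {"female": 0, "male": 0}
quota_sample = []
for person_id, sex in respondents:
    # Interview the person only if the quota for that group is still open
    if filled[sex] < quotas[sex]:
        quota_sample.append((person_id, sex))
        filled[sex] += 1
    # Stop as soon as every quota is full
    if len(quota_sample) == sum(quotas.values()):
        break

print("interviews per group:", filled)
print("total interviews:", len(quota_sample))
```

The selection depends on who happens to be met first, not on a random draw, which is why the resulting sample is a non-probability sample.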
This method is used in studies involving respondents who are hard to find. To start with, the researcher compiles
a short list of sample units from various sources. Each of these respondents is contacted to
provide names of other probable respondents.
Sample size means the number of sampling units selected from the population for
investigation. If the sample size is small, it may not represent the population and the inference
drawn about the population may be misleading. On the other hand, if the sample size is very
large, it may be too burdensome financially, require a lot of time and be difficult
to manage. Hence the sample size should be neither too small nor too large.
The following factors should be considered while deciding the sample size:
i. The size of the population: the larger the size of the population, the bigger should be the
sample size.
ii. The resources available: if the resources available are vast, a large sample could be taken.
However, in most cases resources constitute a big constraint on sample size.
iii. The degree of accuracy or precision desired: the greater the degree of accuracy desired, the
larger should be the sample size. However, it does not necessarily mean that bigger samples
always ensure greater accuracy.
iv. Homogeneity or heterogeneity of the population: If the population consists of homogeneous
units a small sample may serve the purpose, but if the population consists of heterogeneous
units a large sample may be inevitable.
v. Nature of study: For an intensive and continuous study a small sample may be suitable. But
for studies which are not likely to be repeated and are quite extensive in nature, it may be
necessary to take a large sample size.
vi. Method of sampling adopted: The sample size is also influenced by the type of sampling
plan adopted. For example, if the sample is a simple random sample, it may necessitate a
bigger sample size. However, in a properly drawn stratified sampling plan, even a small
sample may give better results.
vii. Nature of respondents: Where it is expected that a large number of respondents will not co-operate
and send back the questionnaire, a large sample should be selected.
Chapter summary
On the basis of a sample study we can predict and generalize the behavior of mass phenomena.
This is possible because there is no statistical population whose elements would vary from each
other without limit. Sampling is the process of learning about the population on the basis of a
sample drawn from it. The process of sampling involves three elements: selecting the sample,
collecting the information and making an inference about the population. The major reasons for
choosing sampling over a census are: cost/economy, timeliness, large population size,
inaccessibility of the entire population, the destructive nature of many tests, and accuracy. In the
case of probability sampling each element in the population has a known chance (>zero) of being
included in the sample. In the case of non-probability sampling, not every unit in the population
has a chance of being included in the sample.
Self-evaluation test
1. Suppose there are 2100 subjects in the population and a sample of 30 subjects is
needed. Identify the elements to be included in the sample using the systematic random
sampling method.
2. Discuss the reasons for sampling.
3. Compare and contrast the stratified and cluster sampling methods.
4. Explain the concepts of quota and convenience sampling using appropriate examples.
3.1 Introduction
The term “Data Collection” refers to all the issues related to data sources, scope of investigation
and sampling techniques. In this chapter, our discussion starts with the meaning
of data collection. Having acquainted the reader with the meaning of data collection, the chapter
advances to the discussion of the classification of data. In addition, the different methods of
collecting data from primary sources are discussed. Finally, the section on tabular, graphic and diagrammatic
presentation explains the different ways of presenting data according to the nature of the
data. It introduces graphs such as the histogram, frequency polygon and ogive, and diagrams such as the bar
chart and pie chart.
Collection of data implies a systematic and meaningful assembly of information for the
accomplishment of the objective of a statistical investigation. It refers to the methods used in
gathering the required information from the units under investigation.
The quality of data greatly affects the final output of an investigation. Hence, utmost care should
be given to the data collection process and every possible precaution should be taken to ensure
accuracy while collecting data. Otherwise, with inaccurate and inadequate data, the whole
analysis is likely to be faulty and the decisions taken will be misleading.
Statistical data may be obtained from either a primary or a secondary source. A primary source is a
source from which first-hand information is gathered. On the other hand, a secondary source is
one that makes available data which were collected by some other agency. Clearly, a source,
which is not primary, is necessarily a secondary source. Primary sources are original sources of
data.
Based on the source, data can be classified as primary data and secondary data. Data obtained
from a primary source is called primary data. Likewise, data gathered from a secondary source
is known as secondary data. For example, assume that a simple study is to be conducted to see
the age distribution of people living with HIV/AIDS. Clearly, the variable of study is age. Data
about the age of these people may be obtained by interviewing them directly. In this case, the
interviewees are primary sources and the data collected from them are primary data.
Alternatively, one may use the records of hospitals and other related agencies to obtain the ages
without the need to trace each person. In that case, the hospital records are secondary sources and
the data copied from such records are secondary data.
In most cases, secondary data is obtained from such sources as census and survey
reports, books, official records, reported experimental results, previous research papers, bulletins,
magazines, newspapers, web sites, and other publications. Different organizations and
government agencies publish information (data) in the form of reports, periodicals, journals, etc.
In the case of Ethiopia, the Central Statistics Authority (CSA) is the first to be mentioned in
publishing such relevant information (secondary data).
The primary data gives more reliable, accurate and adequate information, which is
suitable to the objective and purpose of an investigation.
Primary source usually shows data in greater detail.
Primary data is free from errors that may arise from copying of figures from publications,
which is the case in secondary data.
Often, primary data gives misleading information due to lack of integrity of investigators
and non-cooperation of respondents in providing answers to certain delicate questions.
Secondary data is readily available and hence more convenient and much quicker to obtain than
primary data.
Secondary data reduces time, cost and effort as compared to primary data.
After discussing the two sources of data, primary and secondary, it is logical to discuss the
methods employed in collecting data from primary sources. Different methods are applied for the
collection of primary data.
In the personal enquiry method, a question sheet called a schedule is prepared. The schedule
contains all the questions needed to extract a complete report from a respondent. Usually,
schedules are pre-tested to remove discrepancies such as ambiguous or irrelevant questions.
This pre-testing process is called a pilot survey. It is worth mentioning
that the schedule is not given directly to the respondent. Rather, the interviewer asks
the questions on the schedule and jots down the interviewee’s (respondent’s) responses.
Depending on the nature of the interview, the personal enquiry method is further classified into two
types.
Direct Personal Interview: It is a type of personal enquiry where there is face-to-face contact
with the persons from whom the information is to be obtained. In other words, the investigator
contacts each respondent personally, without the interference of a third party, asks the questions
given in the schedule one by one and notes down the respondent’s replies on the schedule.
Indirect Personal Enquiry (Interview): It is the second type of personal enquiry, where the
investigator contacts third parties, called witnesses, who are capable of supplying the necessary
information. Here, the information is not collected directly from the respondent but from a third
person who knows the respondent well. Such an approach is useful in cases where the respondent
is expected to conceal information about himself or herself. For example, if an enquiry about the
habit of using condoms is distributed in a village, most of the villagers may not provide the correct
information. Thus, it would be wiser to get the required information from other parties, like the
nearby shop that sells condoms.
Under the questionnaire method, a list of questions related to the survey is prepared and sent to
the various respondents by post, web sites, e-mail, etc. It is a method that is often used in
statistical investigations. However, it cannot be used if the respondent is illiterate.
The following are the major points that we need to take into account while preparing a
questionnaire.
The number of questions should be small. Naturally, respondents are not comfortable
with lengthy questionnaires. Lengthy questionnaires usually bore respondents. Hence,
fifteen to twenty five questions in a questionnaire are optimal. If a lengthy questionnaire is
unavoidable, it should preferably be divided into two or more parts.
The questions should be short, clear, simple and unambiguous. Moreover, the
questions must be arranged in a logical order so that natural and spontaneous reply to each
is induced. For instance, it is not appropriate to ask a person how many packets of
cigarette he/she smokes before asking whether he/she smokes or not.
Questions of sensitive nature should be avoided. Sensitive questions are those questions
that are too personal and pecuniary like “Sources of income”, “Drinking habit”, etc. The
logic here is that respondents do not willingly answer sensitive questions. Such
information, if necessary, may be gathered through interview or through other indirect
questions.
Questions should be capable of objective answers. As much as possible, avoid
subjective questions and keep to questions of fact. To this end, multiple answer questions
can be used.
Mail questionnaires should be accompanied by a covering letter, which should state the
purpose of the questionnaire, promise confidentiality of responses, etc.
Furthermore, the questions should preferably be designed in such a way that they can easily be
answered as yes/no.
In the direct observation method, an investigator stays at the place of survey and notes down the
observations himself/herself. There are no enquiries in the case of direct observation. For example,
an investigator making a study on the nutritional status of children may directly (physically)
measure the weight, height, and other required parameters himself/herself. Direct observation is
more experimental and is usually applied in scientific studies. It is time consuming and also costly.
Data presentation is a statistical procedure of arranging and putting data in the form of tables,
graphs, charts and/or diagrams.
Tabulation is the arrangement of information or data in tables. A table makes it possible for the
analyst to present a huge mass of data in a detailed orderly manner within a minimum of space.
There are various techniques of tabulation.
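i. Data array
A data array lists the observations in ascending or descending order. Arranging data in an array
makes it possible to:
Determine at a glance the highest and lowest values contained in the data.
Identify groups of similar data values.
Easily see differences between values in the data.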
ii. Frequency distribution
A frequency distribution is a table that groups data into non-overlapping intervals called classes
and records the number of observations in each class. It associates each value Xi (the value of
each case/observation) with its corresponding frequency fi. It summarizes data in a condensed
form that can be readily understood and easily interpreted.
Class limits (CL): the lowest and highest values that can be included in a class, such that there is
a gap between successive classes, are called class limits. The lower class limit (LCL) of a class is a
value such that no lower value can fall into that class, whereas the upper class limit (UCL) of a
class is a value such that no higher value can fall into that class.
Class Boundary (CB) or Real class limits: class boundaries are the lowest and the highest
values in each class when there is no gap between successive classes. To work with the
distribution of a variable as if it was continuous, we make use of these real class limits (also
known as class boundaries).
Let d = LCL of a class minus UCL of the preceding class. Add half of this difference to all
upper class limits to get the upper class boundaries (UCB), and subtract it from all lower class
limits to get the lower class boundaries (LCB). That is,
$UCB_i = UCL_i + \frac{1}{2}d$ and $LCB_i = LCL_i - \frac{1}{2}d$
Example 3.1 Find the class boundaries for the following frequency distribution.
Solution: d = 31 − 30 = 1, so $\frac{1}{2}d = 0.5$. For the first class,
UCB = UCL + 0.5 = 30 + 0.5 = 30.5, and LCB = LCL − 0.5 = 24 − 0.5 = 23.5. Continuing in this way,
we get the class boundaries in the second column of the table above.
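The rule above can be sketched in a few lines of Python. This is only an illustration: it borrows the class limits of the marks table in Example 3.4 (15-24 up to 75-84), and the function name `class_boundaries` is a made-up helper, not part of any library.

```python
# Class limits borrowed from the marks table of Example 3.4 (whole-number data).
class_limits = [(15, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 74), (75, 84)]


def class_boundaries(limits):
    """Return (LCB, UCB) pairs using d = LCL of a class minus UCL of the preceding class."""
    d = limits[1][0] - limits[0][1]  # gap between successive classes (1 here)
    half = d / 2
    return [(lcl - half, ucl + half) for lcl, ucl in limits]


boundaries = class_boundaries(class_limits)
for (lcl, ucl), (lcb, ucb) in zip(class_limits, boundaries):
    print(f"{lcl}-{ucl}: {lcb}-{ucb}")
```

Because half of the gap is added on one side and subtracted on the other, the upper boundary of each class coincides with the lower boundary of the next one, so no gap remains.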
Class width (w): the difference between the upper and lower class boundaries of any class. It is
also the difference between any two consecutive class marks.
Approximately, $\text{class width} = \frac{\text{Range}}{\text{Number of classes desired}}$
Class mark (CM) or mid-point: The class mark is the mid-point of the class interval, i.e. the
value which lies midway between the lower and upper limits of the class. It is obtained as:
$CM = \frac{LCL + UCL}{2}$
Units of measurement (U) or decimal point (d): the distance between two possible consecutive
measures. It is usually taken as 1, 0.1, 0.01, 0.001, -----.
1. Decide the number of classes (k): Select the number of classes desired, usually between 5
and 20, or use the widely used Sturges' rule $k = 1 + 3.322 \log_{10} n$, where k is the number
of classes desired and n is the total number of observations.
Example 3.2: Using Sturges' rule, if n = 10, k = 4.32 ≈ 4; if n = 100, k = 7.644 ≈ 8;
if n = 1000, k = 10.966 ≈ 11.
We can also use the $2^k$ rule: select the smallest number (k) of classes such that $2^k$ is
greater than the number of observations (n).
Example 3.3: Using the $2^k$ rule, if n = 10, $2^3 = 8 < 10$ and $2^4 = 16 > 10$, so the
recommended number of classes is 4.
2. Compute the Range (R) = Maximum value − Minimum value. For instance, if the highest
value is 134 and the lowest value is 34, then
Range (R) = 134 − 34 = 100
3. Determine the Class Width (w): If the number of classes is known and it is decided to use
a uniform class width, we use $w = \frac{\text{Range}}{k}$, rounded up to the nearest integer, where
Range is the difference between the highest and the smallest value of the data.
4. Determine the Class Limits: Pick a suitable starting point less than or equal to the minimum
value. The starting point is called the lower limit of the first class. Continue to add the class
width to this lower limit to get the rest of the lower limits.
5. Determine the Upper class limits: To find the upper limit of the first class, subtract 1 (one)
from the lower limit of the second class. Then continue to add the class width to this upper
limit to find the rest of the upper limits.
6. Determine the frequency of each class: The frequency of each class can be determined
simply by counting the number of observations belonging to each class.
7. Classes must be mutually exclusive: A given data value should fall into only one
class/category.
Limits such as the following would be inappropriate:
class Frequency
15-20 3
20-25 7
Class frequency
17.0-23.5 4
22.0-27.5 3
8. The class must be exhaustive: No data value should fall outside the range covered by the
frequency distribution.
9. If possible, the classes should have equal widths. It is difficult to interpret both frequency
distribution and their graphical presentation if there is unequal class width. One exception
occurs when there is an open-ended distribution, i.e., it has no specific beginning value or no
specific ending value.
10. Sum up the frequency of each class to check whether it is equal to the total number of data
collected from the field or not.
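Steps 1 to 3 can be sketched in Python as follows. The function names are illustrative only, and rounding Sturges' value to the nearest integer is an assumption that matches the worked examples in the text.

```python
import math


def sturges_classes(n):
    """Sturges' rule: k = 1 + 3.322 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def two_k_classes(n):
    """2^k rule: the smallest k such that 2**k is greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def class_width(maximum, minimum, k):
    """Uniform class width: range divided by k, rounded up to the next integer."""
    return math.ceil((maximum - minimum) / k)


print(sturges_classes(10), sturges_classes(100), sturges_classes(1000))
print(two_k_classes(10))
# Highest mark 83 and lowest mark 16 are taken from the data of Example 3.4.
print(class_width(83, 16, sturges_classes(50)))
```

The two rules need not agree: for n = 50 Sturges' rule suggests 7 classes while the 2^k rule suggests 6, so the choice remains a matter of judgment.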
Example 3.4: Construct a continuous frequency distribution for the following raw data on
marks (out of 100) obtained by 50 students in Statistics.
57, 53, 65, 55, 50, 45, 64, 52, 16, 46, 42, 63, 33, 64, 53, 25, 54, 35, 48, 55, 70, 47, 39, 58, 52,
36, 65, 75, 26, 20, 55, 60, 83, 61, 45, 63, 49, 42, 35, 18, 51, 45, 42, 65, 39, 59, 45, 41, 30, 40.
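Solution:
a. Since n = 50, using Sturges' rule, the number of classes is:
$k = 1 + 3.322 \log_{10} 50 = 6.64 \approx 7$. Thus, the number of classes is 7.
By applying the $2^k$ rule, in this case n = 50, $2^5 = 32 < 50$ and $2^6 = 64 > 50$, so the
recommended number of classes is 6.
Since Sturges' rule is widely used, the appropriate number of classes is k = 7.
b. Range = 83 − 16 = 67, so the class width is w = 67/7 = 9.57, rounded up to 10.
c. Taking 15 as the lower limit of the first class, the frequency distribution is: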
Marks frequency
15 - 24 3
25 – 34 4
35 – 44 10
45 -54 15
55 – 64 12
65 – 74 4
75 – 84 2
Total 50
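The table can be checked with a short Python sketch. It assumes, as the table does, that the first class starts at 15 (any convenient value not greater than the minimum mark would do) and that k = 7 classes are used.

```python
import math

marks = [
    57, 53, 65, 55, 50, 45, 64, 52, 16, 46, 42, 63, 33, 64, 53, 25, 54, 35, 48, 55,
    70, 47, 39, 58, 52, 36, 65, 75, 26, 20, 55, 60, 83, 61, 45, 63, 49, 42, 35, 18,
    51, 45, 42, 65, 39, 59, 45, 41, 30, 40,
]

k = 7                                              # number of classes (Sturges' rule)
width = math.ceil((max(marks) - min(marks)) / k)   # (83 - 16) / 7 rounded up
start = 15                                         # assumed lower limit of the first class

# Lower limits advance by the class width; each upper limit is one unit below the next lower limit.
classes = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]
frequencies = [sum(lo <= m <= hi for m in marks) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo} - {hi}: {f}")
print("Total", sum(frequencies))
```

Summing the class frequencies and comparing the total with the number of observations is the check described in step 10.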
Note: For the class boundaries see the nature of the data.
If there is no decimal point, then U or d = 1.
If there is one digit after the decimal, then U or d = 0.1.
If there are two digits after the decimal, then U or d = 0.01.
A. Relative frequency: The ratio of the frequency of a case to the total number of observations
(sample size = n). That is, $rf = \frac{f}{n}$ or $rf_i = \frac{f_i}{\sum_{j=1}^{k} f_j}$,
where f is the number of times a given element repeats itself (absolute frequency), k is the
number of classes and n is the total number of observations or total frequency.
B. Absolute frequency: shows the absolute number of occurrences of an entry or group of
entries in each class.
Example 3.6: compute relative frequency from the following frequency distribution.
4 156 2 2/33
5 158 2 2/33
6 160 7 7/33
7 162 2 2/33
8 164 4 4/33
9 166 1 1/33
10 168 3 3/33
11 170 1 1/33
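As a hedged illustration of the same computation, the sketch below finds relative frequencies for the marks distribution of Example 3.4 rather than for the table above. Exact fractions are used so that the relative frequencies can be seen to add up to 1.

```python
from fractions import Fraction

# Absolute frequencies from the marks table of Example 3.4.
frequencies = {
    "15-24": 3, "25-34": 4, "35-44": 10, "45-54": 15,
    "55-64": 12, "65-74": 4, "75-84": 2,
}

n = sum(frequencies.values())  # total number of observations
relative = {cls: Fraction(f, n) for cls, f in frequencies.items()}

for cls, rf in relative.items():
    print(f"{cls}: {rf} = {float(rf):.2f}")
```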
The cumulative frequency of a class tells us how often the values fall below or above that class.
Or as the name indicates it cumulates frequencies starting at the lowest or the highest class
boundary. There are two types of cumulative frequency distributions: the “less than” and the
“more than” cumulative frequency distributions.
The “less than” cumulative frequency of a class is obtained by adding the frequencies of
all the preceding (earlier) classes to the frequency of that class. In other words, the
frequencies are accumulated from the lowest class to the highest class.
The “more than” cumulative frequency of a class is obtained by adding the frequencies
of all the succeeding (later) classes to the frequency of that class. In other words, the
frequencies are accumulated from the highest class to the lowest class.
Example 3.7: consider the distribution of marks of 50 students:
Interpretation of less than cumulative frequency: for instance, 23 out of 50 students have marks
less than 49.5.
Interpretation of more than cumulative frequency: for instance, 27 out of 50 students have
marks 49.5 or more.
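Both cumulative distributions can be produced with a running sum, as in this minimal sketch. The class frequencies used here are not listed in the text; they are obtained by differencing the “less than” column shown in Example 3.10 and should be read as derived values.

```python
from itertools import accumulate

# Class frequencies for boundaries 9.5-19.5, ..., 69.5-79.5 (derived from Example 3.10).
frequencies = [2, 4, 7, 10, 16, 8, 3]

# "Less than": accumulate from the lowest class upwards.
less_than = list(accumulate(frequencies))

# "More than": accumulate from the highest class downwards, then restore class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)
print(more_than)
```

For the class 39.5-49.5 the two results are 23 and 37: 23 students scored less than 49.5 and 37 students scored 39.5 or more.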
This topic deals with the study of organizing a set of raw data in to a Frequency distribution (FD)
and describes the distribution graphically in a histogram, a frequency polygon, & a cumulative
frequency curve (ogive). The other types of numerical information will be summarized &
presented in the form of bar chart, pie chart or a pictogram.
After a frequency distribution is completed, the next step will be to construct a “picture” of
these data values using a histogram. A histogram is a graph consisting of a series of adjacent
rectangles whose bases are equal to the class width of the corresponding classes and whose
heights are proportional to the corresponding class frequencies. Here, class boundaries are
marked along the horizontal axis (x – axis) and the class frequencies along the vertical axis ( y –
axis) according to a suitable scale. It describes the shape of the data.
Draw the x–y axes and label the class boundaries on the x-axis and the frequencies on the y-axis,
Using the frequencies as the heights, draw vertical bars for each class.
Example 3.8: Consider the following grouped frequency distribution and construct a histogram.
[Figure: histogram; vertical axis: class frequency (fi)]
A frequency polygon is a line graph of a frequency distribution. Although a histogram does
demonstrate the shape of the data, the shape can often be illustrated more clearly by a frequency
polygon. Here, you merely connect the centers of the tops of the histogram bars (located at the
class midpoints) with a series of straight lines. The resulting figure is a frequency polygon. The
class marks are plotted along the x-axis and the class frequencies along the y-axis. Empty classes
are included at each end so that the curve is anchored to the x-axis.
Steps in constructing a frequency polygon:
Find class marks for each class,
Draw the x – y axis. Label the x – axis with the class marks and use a suitable scale on
the y – axis for the frequencies (absolute or relative), and
Connect the coordinates (x, y) with line segments.
Example 3.9: Consider the following grouped frequency distribution and construct a frequency
polygon.
Marks      Frequency
15 – 24    3
25 – 34    4
35 – 44    10
45 – 54    15
55 – 64    12
65 – 74    4
75 – 84    2
Total      50
[Figure: Frequency Polygon; horizontal axis: marks (10 to 90), vertical axis: frequency (0 to 20)]
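The points to be joined in this polygon can be listed with a short sketch. It only computes the coordinates and leaves the drawing to any plotting tool; the variable names are illustrative.

```python
# Class limits and frequencies from the marks table of Example 3.9.
class_limits = [(15, 24), (25, 34), (35, 44), (45, 54), (55, 64), (65, 74), (75, 84)]
frequencies = [3, 4, 10, 15, 12, 4, 2]

# Class mark = mid-point of the lower and upper class limits.
class_marks = [(lo + hi) / 2 for lo, hi in class_limits]
width = class_marks[1] - class_marks[0]  # distance between consecutive class marks

# Add an empty class (frequency 0) at each end so the polygon is anchored to the x-axis.
points = (
    [(class_marks[0] - width, 0)]
    + list(zip(class_marks, frequencies))
    + [(class_marks[-1] + width, 0)]
)

for x, y in points:
    print(x, y)
```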
The Ogive curve is a graph that represents the cumulative frequencies for the classes in a
frequency distribution. It is also known as the cumulative frequency curve, which is a graphical
representation of a cumulative frequency distribution. There are two types of Ogive curves: the
“less than Ogive” and the “more than Ogive”.
a. The “less than” Ogive: the “less than” cumulative frequencies are plotted against the upper
class boundaries of their respective classes and are joined by either straight lines or
smooth curves.
b. The “more than” Ogive: the “more than” cumulative frequencies are plotted against the
lower class boundaries of their respective classes and are joined by either straight lines
or smooth curves.
Draw the x–y axes and label the x-axis with the class boundaries and the y-axis with the
cumulative frequencies, and
To draw the “less than” Ogive, plot the “less than” cumulative frequencies against the upper
class boundaries of the respective classes and join adjacent points with lines. To draw the
“more than” Ogive, plot the “more than” cumulative frequencies against the lower class
boundaries of the respective classes and join adjacent points with lines.
Example 3.10: Construct a less than and more than Ogive curve using the following frequency
distribution.
Class boundary   Marks            Less than cumulative   Marks          More than cumulative
                 Less than 9.5    0
9.5-19.5         Less than 19.5   2                      9.5 or more    50
19.5-29.5        Less than 29.5   6                      19.5 or more   48
29.5-39.5        Less than 39.5   13                     29.5 or more   44
39.5-49.5        Less than 49.5   23                     39.5 or more   37
49.5-59.5        Less than 59.5   39                     49.5 or more   27
59.5-69.5        Less than 69.5   47                     59.5 or more   11
69.5-79.5        Less than 79.5   50                     69.5 or more   3
                                                         79.5 or more   0
[Figure: “less than” Ogive; horizontal axis: class boundaries (0 to 100), vertical axis: cumulative frequency; labelled points: (9.5, 0), (19.5, 2), (29.5, 6), (39.5, 13)]
Note that ogives can help us estimate the number of observations falling
above, below or in between given values.
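One way to make such estimates is to read the “less than” ogive by linear interpolation, which is what joining the points with straight lines amounts to. The sketch below uses the table of Example 3.10 and assumes that observations are spread evenly within each class; `estimate_below` is an illustrative name.

```python
# Upper class boundaries and "less than" cumulative frequencies from Example 3.10.
boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
less_than_cf = [0, 2, 6, 13, 23, 39, 47, 50]


def estimate_below(x):
    """Estimate the number of observations below x by linear interpolation on the ogive."""
    if x <= boundaries[0]:
        return 0.0
    for i in range(1, len(boundaries)):
        x0, x1 = boundaries[i - 1], boundaries[i]
        if x <= x1:
            y0, y1 = less_than_cf[i - 1], less_than_cf[i]
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    return float(less_than_cf[-1])  # x lies beyond the last boundary


print(estimate_below(45))                        # students with marks below 45
print(less_than_cf[-1] - estimate_below(45))     # students with marks of 45 or more
print(estimate_below(60) - estimate_below(30))   # students with marks between 30 and 60
```

The results are estimates, not counts: they can be fractional because the raw marks inside each class are no longer known.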
Diagrams and graphs are extremely useful because they give a bird’s-eye view of the entire data
and, therefore, the information presented is easily understood. The main types of diagram are line
graphs, bar charts and pie charts.
A line graph represents the relationship between time (on the x-axis) and the values of a variable
(on the y-axis). The values are recorded with respect to the time of occurrence. Time series data
are most effectively presented on a line chart. Line charts are particularly effective for business
and economic data because they show the change or trend in a variable over time.
Example 3.11. Draw a line graph for the following time series.
Year 1986 1987 1988 1989 1990 1991
values 20 10 30 15 25 10
[Figure: line graph of the values against Year, 1986 to 1991]
A bar chart is a series of equally spaced bars of uniform width where the height (length) of a bar
represents the amount (magnitude) or frequency corresponding to a category. Bars may be
drawn horizontally or vertically. Vertical bar graphs are often preferred because the heights of the
bars are easy to compare. A bar chart can be a simple, multiple or sub-divided bar chart.
A simple bar chart represents a single set of data (variable) classified in different categories.
Single bars are drawn with heights equal to the respective frequencies.
[Figure: bar graph of ratings; Poor 2, Below average 3, Average 5, Above average 9, Excellent 1]
In a multiple bar chart, two or more bars are grouped in each category, with heights equal to the
corresponding frequencies, to represent two or more interrelated sets of data. The bars of related
variables are kept adjacent to each other for every set of values. These charts can be used if the
overall total is not required. Each bar is shaded or colored separately and a key is given to
distinguish them.
Example 3.13: The following table shows rating of female and male students in Introduction to
statistics course.
[Figure: multiple bar chart of ratings (Poor, Below average, Average, Above average, Excellent) with separate bars for Male and Female]
A sub-divided bar chart is used to present data by subdividing a single bar in proportion to the
component frequencies. Each portion of the bar is then shaded or colored and a key is given to
distinguish them.
Example 3.14: Using the table given in the above example, construct a subdivided bar chart.
[Figure: subdivided bar chart of ratings (Poor, Below average, Average, Above average, Excellent), each bar divided into Male and Female]
A pie chart is a circle divided into various sectors with areas proportional to the values of the
components they represent. It shows the components in terms of percentages, not absolute
magnitudes. The angle formed at the center by each sector has to be proportional to the value
represented.
Example 3.15: Using the data given in the table below draw a pie-chart.
Rating          Frequency   Relative frequency   Percentage   Angle (degrees)
Poor            2           0.1                  10           36
Below average   3           0.15                 15           54
Average         5           0.25                 25           90
Above average   9           0.45                 45           162
Excellent       1           0.05                 5            18
[Figure: pie chart of ratings in percent; Poor 10, Below average 15, Average 25, Above average 45, Excellent 5]
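The percentage and angle columns follow directly from the frequencies, since each sector takes the same share of 360 degrees as its category takes of the total. A minimal sketch using the frequencies of Example 3.15:

```python
# Rating frequencies from Example 3.15.
ratings = {"Poor": 2, "Below average": 3, "Average": 5, "Above average": 9, "Excellent": 1}

total = sum(ratings.values())

# Share of each category as a percentage and as a central angle in degrees.
percentages = {name: f * 100 / total for name, f in ratings.items()}
angles = {name: f * 360 / total for name, f in ratings.items()}

for name in ratings:
    print(f"{name}: {percentages[name]:.0f}%  {angles[name]:.0f} degrees")
```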
Chapter summary
Data collection is the most important stage in statistical data analysis. Before we begin data
collection, we should consider the purpose of collecting the data, the data to be collected, the
source from which we can get the data and the method(s) to be used. Based on
the source, data can be classified as primary data and secondary data. Primary data may be
obtained by any of the following methods: the personal enquiry (interview) method, the
questionnaire method and the direct observation method. The whole idea of
organizing a raw data set is to present the information in a concise way. A frequency table
showing categories or classes is the most convenient form of data organization. The graphical or
diagrammatic presentation of a data set is meant to provide a clear picture of the emerging
trend or pattern in the data. Often, more than one method can be employed for the same data set
to get a pictorial view of the information collected from the sample or directly from the target
population. Among all the methods, bar diagrams and pie charts are widely used for categorical
data, whereas the histogram, frequency polygon and ogive are most informative for
quantitative data.
Self-evaluation test
100 112 114 112 110 120 125 130 115 129 133 132 104
134 120 112 134 123 130 127 129 105 126 106 107 124
110 126 124 122 130 117 132 130 115 109 108 111
120 132 124 102 102 120 125 123 118 121 124 107
4. Consider the following grouped frequency distribution data and construct a histogram.
Marks frequency
15 - 24 3
25 – 34 4
35 – 44 10
45 -54 15
55 – 64 12
65 – 74 4
75 – 84 2
Total 50
5. Using the above table construct a frequency polygon and Ogive curves.
4.1. Introduction
One of the most important objectives of statistical analysis is to get one single value that
describes the characteristics of the entire mass of unwieldy data. Such a value is called the
central value or an ‘average’ or the expected value of the variable. So the average is a typical
value that represents a group of values.
Measure of central tendency uses a typical value or an average that describes a data set near its
center. They provide indications on middle values or most likely or most frequent values. In
other words, they tell us where the center of the distribution of the data is located.
Easy to understand.
Simple to compute.
Based on all the observation.
Not be unduly affected by extreme observations or outliers.
Rigidly defined, i.e. it should have one and only one interpretation.
Stable with respect to sampling.
What is a measure of central tendency? Discuss the reasons why we use measures
of central tendency.
Summation or sigma notation is a convenient and simple form used to give a concise expression
for a sum of the values of a variable. In statistics, the symbol ∑ (Greek letter sigma) means to
add or find the sum.
For example, ∑ xi means to add the numbers represented by the variable X. Thus, if X represents
7, 4, 9, 5, and 10, then ∑ xi = 7 + 4 + 9 + 5 + 10 = 35.
Similarly, $\sum_{i=1}^{10} x_i$ means to add the ten numbers represented by X. This notation is
read as follows: sum the values of $x_i$ from $x_1$ through $x_{10}$.
The following are the most important types of measure of central tendency
4.5.1 The Mean (Arithmetic, Weighted, Geometric, Harmonic)
4.5.2 The mode
4.5.3 The median
4.2.1 The Mean
I. Arithmetic Mean
The arithmetic mean is the sum of the data set values divided by the number of observations.
The arithmetic mean, or average value of a variable, is the most important numerical measure of
central tendency.
Ungrouped data
For ungrouped data, the population mean (usually denoted by “μ”) is the sum of all the
population values divided by the total number of population values:
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
where: N = number of elements in the population
μ = population mean
For ungrouped data, the sample mean is the sum of all the sample values divided by the number
of sample values:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
where: $\bar{X}$ = sample mean
n = number of elements in the sample (sample size)
Example 4.1: A sample of six teachers received the following salaries (Birr in hundred): 12,
13, 15, 19, 13, and 18. Find the mean salary.
$\bar{X} = \frac{12 + 13 + 15 + 19 + 13 + 18}{6} = \frac{90}{6} = 15$
Grouped data
The mean of a sample of data organized in a frequency distribution is computed by the following
formula:
$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$
where: $f_i$ = frequency of the i-th class
$X_i$ = class mark of the i-th class
k = number of classes
Example 4.2: Compute the arithmetic mean for the following grouped data:
$\bar{X} = \frac{\sum fm}{N} = \frac{3,300}{100} = 33$
No. of students 5 12 25 8
Short-cut method: The short-cut formula can be used if the figures in the calculation have many
digits. The formulas for calculating the arithmetic mean using the short-cut method are:
For ungrouped data: $\bar{x} = A + \frac{\sum d}{N}$
where A is the assumed mean and d is the deviation of the items from the assumed mean,
i.e., $d = (X - A)$
For grouped data: $\bar{X} = A + \frac{\sum fd}{N}$
Example 4.3: For the following discrete series, find the arithmetic mean by
applying both the direct and short-cut methods.
Weights in kilograms 20 30 40 50 60 70
No. of students 8 12 20 10 6 4
Direct method
Weight (X)   No. of students (f)   fX
20           8                     160
30           12                    360
40           20                    800
50           10                    500
60           6                     360
70           4                     280
ΣfX = 2,460
$\bar{X} = \frac{\sum fX}{N} = \frac{2,460}{60} = 41$
Short-cut method (assumed mean A = 40)
Weight (X)   No. of students (f)   d = X − A   fd
20           8                     -20         -160
30           12                    -10         -120
40           20                    0           0
50           10                    +10         +100
60           6                     +20         +120
70           4                     +30         +120
Σfd = 60
$\bar{X} = A + \frac{\sum fd}{N} = 40 + \frac{60}{60} = 41$
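Both methods of Example 4.3 can be reproduced with the sketch below. The assumed mean A = 40 is the one implied by the deviations in the short-cut table; any other value of A would give the same mean.

```python
# Data of Example 4.3: weights in kilograms and number of students.
weights = [20, 30, 40, 50, 60, 70]
students = [8, 12, 20, 10, 6, 4]

n = sum(students)  # total number of students (N)

# Direct method: sum of f * X divided by N.
direct_mean = sum(f * x for x, f in zip(weights, students)) / n

# Short-cut method: assumed mean A plus the mean of the deviations d = X - A.
A = 40
sum_fd = sum(f * (x - A) for x, f in zip(weights, students))
shortcut_mean = A + sum_fd / n

print(direct_mean, shortcut_mean)
```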
Weighted Mean: It is a special case of arithmetic mean. It is the mean value of data values that
have been weighted according to their relative importance. The term weight itself stands for the
relative importance of the different items.
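The formula for computing the weighted arithmetic mean of a population or a sample is
$\mu_w \text{ or } \bar{X}_w = \frac{\sum w_i x_i}{\sum w_i}$
where $\mu_w$ is the population weighted mean, $\bar{X}_w$ is the sample weighted mean and
$w_i$ is the weight attached to the value $x_i$.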
∑ 𝑊(𝑓𝑖𝑋𝑖)
𝑋̅ 𝑤 =
∑ 𝑋𝑖
Example 4.4: An NGO covers the monthly expenses of adults, females and children under
different packages. Based on the package, adults receive 300 birr, females 250 birr and children
200 birr every month. The numbers of adults, females and children are 10, 15 and 20
respectively. What is the average monthly expense covered by the NGO?
Solution:
The average monthly expense covered is the weighted mean, with the amounts received as the
values and the numbers of recipients as the weights:
$\bar{X}_w = \frac{(10 \times 300) + (15 \times 250) + (20 \times 200)}{10 + 15 + 20} = \frac{10,750}{45} \approx 238.89$
Therefore, the average monthly expense covered by the NGO is about 238.89 birr per person per month.
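As a cross-check, the weighted mean can be computed with a few lines of Python. The sketch uses the amounts and head counts exactly as stated in Example 4.4; with those figures the computation gives roughly 238.89 birr.

```python
# Monthly amounts (values) and number of recipients (weights) as stated in Example 4.4.
amounts = [300, 250, 200]   # adults, females, children
recipients = [10, 15, 20]

# Weighted mean = sum of (weight * value) divided by the sum of the weights.
total_expense = sum(w * x for x, w in zip(amounts, recipients))
weighted_mean = total_expense / sum(recipients)

print(total_expense, round(weighted_mean, 2))
```

The unweighted mean of the three amounts would be 250 birr, which overstates the average because the largest group receives the smallest amount.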
i. The sum of the deviations of each value from the mean is always zero (taking signs into
account), i.e. $\sum (X - \bar{X}) = 0$.
ii. The sum of the squared deviations of the items from the arithmetic mean is a minimum, that
is, it is less than the sum of the squared deviations of the items from any other value, i.e.
$\sum (X - \bar{X})^2$ is the minimum.
iii. If $\bar{X}_1$ and $\bar{X}_2$ are the arithmetic means of two groups with $N_1$ and $N_2$
observations, then the combined arithmetic mean will be
$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
iv. Given a mean for data values, if we multiply all data values by a constant number c, then
the new mean will be c times the old one (change of scale).
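These properties can be checked numerically. The sketch below reuses the six salaries of Example 4.1; splitting them into two groups of three and the constant c = 100 are arbitrary choices made only for illustration. Exact fractions avoid rounding noise.

```python
from fractions import Fraction

salaries = [12, 13, 15, 19, 13, 18]            # data of Example 4.1
mean = Fraction(sum(salaries), len(salaries))  # arithmetic mean (15)

# Property i: deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in salaries)


# Property ii: squared deviations are smallest when taken about the mean.
def squared_deviations(a):
    return sum((x - a) ** 2 for x in salaries)


# Property iii: combined mean of two groups (an arbitrary split of the data).
group1, group2 = salaries[:3], salaries[3:]
mean1 = Fraction(sum(group1), len(group1))
mean2 = Fraction(sum(group2), len(group2))
combined_mean = (len(group1) * mean1 + len(group2) * mean2) / (len(group1) + len(group2))

# Property iv: multiplying every value by c multiplies the mean by c.
c = 100
scaled_mean = Fraction(sum(c * x for x in salaries), len(salaries))

print(deviation_sum, squared_deviations(mean), squared_deviations(14), combined_mean, scaled_mean)
```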
The geometric mean (GM) of “n” positive numbers is defined as the nth root of their product. The
geometric mean is useful in finding the average of percents, indexes, ratios, growth rates and
logarithmically distributed series. It has a wide application in business and economics because we are
often interested in finding the percentage changes in sales, revenues, profits, GDP, etc.
The formula to calculate the geometric mean for ungrouped data is:
$GM = \sqrt[n]{X_1 \times X_2 \times X_3 \times \dots \times X_n}$
where $X_1, X_2, X_3, \dots, X_n$ are the various items of the series.
If n is three or more, extracting the nth root of the product by hand is difficult. To facilitate
the computation of the GM, logarithms are used:
$\log GM = \frac{\sum \log x}{N}$, therefore $GM = \text{Antilog}\left[\frac{\sum \log x}{N}\right]$
where N = n is the number of items.
For grouped data, the geometric mean is:
$GM = \sqrt[N]{x_1^{f_1} \times x_2^{f_2} \times \dots \times x_n^{f_n}}$
or $GM = \text{Antilog}\left[\frac{\sum f \log x_i}{N}\right]$
where $x_1, x_2, \dots, x_n$ are the class marks of the classes, $f_1, f_2, \dots, f_n$ are the
corresponding frequencies and $N = \sum f_i$ is the total number of observations.
Solution: $GM = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$
Example 4.6: The return on investment earned by a company for 5 consecutive years is given by
25%, 40%, 30%, 120% and -85%. Calculate the geometric rate of return earned by the company
on investment.
Solution:
A positive percentage return means an additional gain on what the company already has. The
returns of 25%, 40%, 30%, 120% and -85% can therefore be expressed as the growth factors
1.25 (1 + 0.25), 1.4 (1 + 0.4), 1.3 (1 + 0.3), 2.2 (1 + 1.2) and 0.15 (1 - 0.85).
$$GM = \sqrt[5]{1.25 \times 1.4 \times 1.3 \times 2.2 \times 0.15} = 0.944$$
The geometric rate of return is therefore 0.944 - 1 = -0.056, i.e. an average loss of about
5.6% per year.
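The calculation in Example 4.6 can be checked with a short Python sketch. The helper name `geometric_mean` is only illustrative and is not defined anywhere in these notes; it follows the logarithm route described earlier (mean of the logs, then the antilog).

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers via logarithms (mean of logs, then antilog)."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

# Yearly returns of Example 4.6, rewritten as growth factors (1 + rate).
returns = [0.25, 0.40, 0.30, 1.20, -0.85]
factors = [1 + r for r in returns]

gm = geometric_mean(factors)
average_rate = gm - 1  # geometric rate of return per year

print(f"GM of growth factors: {gm:.3f}")
print(f"Geometric rate of return: {average_rate:.1%}")
```

A negative `average_rate` means that, on average, the investment lost value each year, even though four of the five yearly returns were positive.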
Example 4.7: If a person receives a 20% rise in his initial income after one year of service and a
10% rise after the second year of service, what is the average percentage increase?
Solution:
The average percentage raise is not 15% (that is, (20% + 10%)/2) but 14.89%, since
$$GM = \sqrt{1.20 \times 1.10} = \sqrt{1.32} \approx 1.1489$$
Let’s show this answer is right by assuming that the person earns Birr 10,000 at the beginning
and receives two raises: 20% and 10%.
Applying the arithmetic mean of 15% to both years gives a total increase of Birr 1,500 + 1,725 =
Birr 3,225, which is higher than the actual increase of Birr 3,200 (Birr 2,000 + 1,200).
The above result clearly shows that to find the correct average percentage change we should
apply the geometric mean instead of the arithmetic mean.
Example 4.8: Find the geometric mean for the following grouped data on the percentage
increase in salary of 16 employees of a company.
Percentage increase   Number of employees (f)   Class mark (x)
0-4                   5                         2
5-9                   6                         7
10-14                 3                         12
15-19                 2                         17
Solution: $GM = \sqrt[16]{2^5 \times 7^6 \times 12^3 \times 17^2} = 5.85\%$. The geometric mean
percentage increase in salary is 5.85%.
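As a rough check of Example 4.8, the grouped formula GM = Antilog[Σ f log x / N] can be sketched in Python. The class marks (2, 7, 12, 17) and frequencies (5, 6, 3, 2) are read off the worked solution; the function name is a hypothetical helper.

```python
import math

def grouped_geometric_mean(class_marks, frequencies):
    """Geometric mean of grouped data: antilog of the frequency-weighted mean log."""
    total = sum(frequencies)
    weighted_log_sum = sum(f * math.log10(x) for x, f in zip(class_marks, frequencies))
    return 10 ** (weighted_log_sum / total)

class_marks = [2, 7, 12, 17]   # mid-points of the salary-increase classes
frequencies = [5, 6, 3, 2]     # number of employees in each class

gm_increase = grouped_geometric_mean(class_marks, frequencies)
print(f"Geometric mean increase: {gm_increase:.2f}%")
```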
Another use of the geometric mean is to determine the percent increase in sales, production or
other business or economic series from one time period to another.
Example: The population of a country increased from 84 million in 2005 to 108 million in 2015.
Find the annual rate of growth of population.
Solution: Annual growth rate $= \sqrt[10]{\frac{108{,}000{,}000}{84{,}000{,}000}} - 1 \approx 0.025 = 2.5\%$
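The growth-rate calculation can be sketched as follows. The function `annual_growth_rate` is an illustrative name; it simply takes the nth root of the ratio of the final value to the initial value and subtracts 1, where n is the number of years between the two observations.

```python
def annual_growth_rate(start_value, end_value, periods):
    """Average compound growth rate per period between two observations."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1

# Population example: 84 million in 2005, 108 million in 2015.
population_rate = annual_growth_rate(84_000_000, 108_000_000, 2015 - 2005)
print(f"Annual population growth: {population_rate:.1%}")
```

The same function applies to any series observed at two points in time, such as production or sales figures.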
The production of sugar for a sugar factory increased from 500,000 tons in 1990 to
950,000 tons in 2006. Compute the rate of production increase per year.
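i. The product of the values of a series remains unchanged when the value of the
geometric mean is substituted for each individual value. For example, the geometric mean
of the series 1, 3, 9 is 3; therefore we have
1*3*9=27=3*3*3
ii. The sum of the deviations of the logarithms of the original observations above the
logarithm of the geometric mean equals the sum of the deviations below it. Equivalently, the
product of the ratios of the geometric mean to the values not above it equals the product of the
ratios of the values above it to the geometric mean. Thus, using the previous example,
$$\frac{3}{1} \times \frac{3}{3} = 3 = \frac{9}{3}$$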
The harmonic mean (H.M.) is the mean of n numbers $x_1, x_2, \ldots, x_n$ and is defined as n
divided by the sum of the reciprocals of the n numbers. It is appropriate for situations when the
average of rates is desired (e.g., it helps to find the average speed of a trip over a route divided
into constant-speed segments of equal distance). For example, if one travels half-way to a
destination at 20 mi/hr and then goes 60 mi/hr for the second half of the distance, the average
speed is 30 mi/hr, since 2/(1/20 + 1/60) = 30.
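$$H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} = \frac{n}{\sum \frac{1}{x_i}} \quad \text{(for ungrouped data)}$$
Example 4.9:
1. The H.M. of 6 and 4 is $\frac{2}{\frac{1}{6} + \frac{1}{4}} = \frac{24}{5} = 4.8$
2. The H.M. of the first six natural numbers is
$\frac{6}{1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \frac{1}{6}} = \frac{360}{147} \approx 2.45$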
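$$H.M. = \frac{n}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \cdots + \frac{f_k}{x_k}} = \frac{n}{\sum \left(\frac{f_i}{x_i}\right)} \quad \text{(for grouped data, with } n = \sum f_i\text{)}$$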
For a set of data containing n positive-valued observations, the following relationship always
holds:
HM ≤ GM ≤ AM
However, HM = GM = AM if and only if all values in the data set are equal.
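The harmonic mean examples and the relationship HM ≤ GM ≤ AM can be illustrated with a small Python sketch. The three helper functions are written out by hand here so that each formula stays visible; their names are illustrative only.

```python
import math

def harmonic_mean(values):
    """n divided by the sum of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

def geometric_mean(values):
    """Antilog of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def arithmetic_mean(values):
    return sum(values) / len(values)

average_speed = harmonic_mean([20, 60])   # two half-distances at 20 and 60 mi/hr
hm_pair = harmonic_mean([6, 4])           # Example 4.9, part 1
first_six = [1, 2, 3, 4, 5, 6]
hm_six = harmonic_mean(first_six)         # Example 4.9, part 2

hm, gm, am = hm_six, geometric_mean(first_six), arithmetic_mean(first_six)
print(f"HM = {hm:.3f}, GM = {gm:.3f}, AM = {am:.3f}")
print("HM <= GM <= AM holds:", hm <= gm <= am)
```

With identical values, for example `[5, 5, 5]`, all three functions return 5 (up to floating-point rounding), which is the equality case of the relationship.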
The mode of a set of data is defined as the value with the highest frequency, provided that it
occurs more than once.
Mode for ungrouped data
The mode or the modal value of raw data is simply obtained by locating the observation with the
maximum frequency (if there exists such a value).
Example 4.10: The examination scores for ten students are: 81, 93, 84, 75, 68, 87, 81, 75, 81 and 87.
Because the score of 81 occurs three times, it is the mode.
Note: A data set may have
- No mode at all, e.g. 1, 3, 9, 0, 7, 8
- One mode (unimodal), e.g. 1, 3, 1, 7, 1, 9, mode is 1
- Two modes (bimodal), e.g. 7, 2, 4, 4, 7, the modes are 7 and 4.
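The cases in the note above can be reproduced with a short Python sketch. The function `modes` is a hypothetical helper that returns every value sharing the highest frequency, and an empty list when no value occurs more than once.

```python
from collections import Counter

def modes(values):
    """Return all modal values (sorted), or an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]   # Example 4.10
print(modes(scores))               # unimodal
print(modes([1, 3, 9, 0, 7, 8]))   # no mode
print(modes([7, 2, 4, 4, 7]))      # bimodal
```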
For grouped data, the mode is estimated from the modal class as:
$$\text{Mode} = L_0 + \frac{f - f_1}{(f - f_1) + (f - f_2)} \times i = L_0 + \frac{f - f_1}{2f - f_1 - f_2} \times i$$
Where:
L0 = lower class boundary of the modal class (i.e., the class with the highest frequency)
f = frequency of the modal class
f1 = frequency of the class immediately preceding the modal class
f2 = frequency of the class immediately following the modal class
i = class interval/width
Example 4.11: Calculate the modal age for the age distribution of 228 teachers.
Age in years (Class Interval)   Number of Teachers
15 – 19 6
20 – 24 19
25 – 29 50
30 – 34 57
35 – 39 48
40 – 44 27
45 – 49 21
Total 228
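A possible way to work Example 4.11 is sketched below in Python. It assumes, as elsewhere in these notes, that class boundaries lie half a unit outside the stated class limits (so the class 30 – 34 has lower boundary 29.5); the function name and its `gap` parameter are illustrative.

```python
def grouped_mode(lower_limits, frequencies, width, gap=1.0):
    """Mode of grouped data: L0 + (f - f1) / (2f - f1 - f2) * i.

    `gap` is the distance between one class's upper limit and the next
    class's lower limit; half of it is subtracted to get the boundary.
    """
    k = max(range(len(frequencies)), key=lambda j: frequencies[j])  # modal class index
    f = frequencies[k]
    f1 = frequencies[k - 1] if k > 0 else 0
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0
    lower_boundary = lower_limits[k] - gap / 2
    return lower_boundary + (f - f1) / (2 * f - f1 - f2) * width

lower_limits = [15, 20, 25, 30, 35, 40, 45]
teachers = [6, 19, 50, 57, 48, 27, 21]

modal_age = grouped_mode(lower_limits, teachers, width=5)
print(f"Modal age: {modal_age:.2f} years")
```

Here the modal class is 30 – 34 (frequency 57), with f1 = 50 and f2 = 48, so the estimate is 29.5 + (7/16) × 5.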
Merits of Mode
It has been pointed out that whenever there is a frequency distribution with open-end intervals, the
arithmetic mean cannot be calculated. Also the mean is greatly affected by extremely large or small
values. Hence, in such cases, the mean cannot be a good representative. Instead, other measures are
used to describe the data. In this section, we will discuss the most popular measure of position, the
median, and other related measures known as quantiles. Positional measures are so called because
they are determined by their position in the ordered data.
The median is, as its name indicates, the middlemost value in the arrangement of the data in
ascending or descending order of magnitude, which divides the data into two equal parts. It is the
value which exceeds, and is exceeded by, an equal number (i.e., half) of the observations. That is,
the median is found by arranging the data in increasing or decreasing order of magnitude. We can
consider the following three cases in finding the median:
Case 1: Ungrouped data: In such cases, the median is the value of the middle term when the data
are arranged in order of magnitude.
When the number of observations is odd, there will always be a single value in the middle of the
arrayed data. When n is even, however, there will be two middle observations, and the median is the
mean of these two values. Let $x_1, x_2, \ldots, x_n$ be n ordered observations. Then the median is
the value of the $\left(\frac{n+1}{2}\right)^{th}$ item when n is odd, and the mean of the
$\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2} + 1\right)^{th}$ items when n is even.
Since n = 10 is even, the median is the mean of the 5th and 6th values; i.e.,
Median $= \frac{1123 + 1180}{2} = 1151.5$ thousands.
Case 2: Ungrouped discrete series data: In this case also, the median is obtained by the same
formula. Only one more step, finding the less than cumulative frequencies, is added, because
cumulative frequency distribution is itself an arrangement of values in an order:
- Look at the cumulative frequency and find the total which is either equal to or next higher
than the position of the $\left(\frac{n+1}{2}\right)^{th}$ observation when n is odd; the
corresponding value is the median. When n is even, take the average of the values corresponding
to the two middle observations.
Example 4.13: Find the median of the data given below
xi 3 5 6 8 10
fi 4 4 7 9 5
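Solution: Here n = 29 is odd, so the median is the value of the
$\left(\frac{n+1}{2}\right)^{th}$ = 15th item. The less than cumulative frequencies are 4, 8, 15, 24
and 29, so the 15th item, and hence the median, is 6.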
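Cases 1 and 2 can be sketched in Python as follows. Both functions are illustrative helpers; the first is applied to the examination scores of Example 4.10 (reused here only as convenient raw data), the second to the discrete series of Example 4.13.

```python
def median_ungrouped(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def median_discrete(values, frequencies):
    """Median of a discrete series located through less-than cumulative frequencies."""
    pairs = sorted(zip(values, frequencies))
    n = sum(frequencies)

    def item(position):
        # Value of the observation at a 1-based position in the ordered data.
        cumulative = 0
        for value, freq in pairs:
            cumulative += freq
            if cumulative >= position:
                return value
        raise ValueError("position is beyond the number of observations")

    if n % 2 == 1:
        return item((n + 1) // 2)
    return (item(n // 2) + item(n // 2 + 1)) / 2

scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]
median_scores = median_ungrouped(scores)
median_series = median_discrete([3, 5, 6, 8, 10], [4, 4, 7, 9, 5])
print(median_scores, median_series)
```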
Case 3: Grouped continuous distribution
For continuous grouped data, the exact median cannot be obtained unless the original data has been
retained. Hence, the median has to be interpolated (or estimated) from the median class. An
interpolation formula which is based on the assumption that the observations are uniformly
distributed within each class is:
$$\text{Median } (\tilde{x}) = L + \frac{w}{f_{med}}\left(\frac{n}{2} - CF\right)$$
Where: L = the lower class boundary of the median class;
w = the class width of the median class;
$f_{med}$ = the frequency of the median class; and
CF = the cumulative frequency corresponding to the class preceding the median class, that is,
the sum of the frequencies of all classes lower than the median class.
The median class is the class which contains the $\left(\frac{n}{2}\right)^{th}$ observation,
whether n is odd or even, since the items have already lost their identity once they are grouped
into continuous classes.
Grade Frequency
40 – 49 5
50 – 59 18
60 – 69 27
70 – 79 15
80 – 89 6
Solution: Construct the less than cumulative frequency distribution as follows
Since n = 71, n/2 = 35.5, and the smallest CF greater than or equal to 35.5 is 50; thus, the median
class is the third class. For this class, L = 59.5, w = 10, $f_{med} = 27$ and CF = 23. Then, applying
the formula, Median $= 59.5 + \frac{10}{27}(35.5 - 23) = 64.13$.
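The interpolation for the grouped median can be sketched in Python. The function below is an illustrative helper; the lower class boundaries (39.5, 49.5, ...) are derived from the grade classes in the table above by the usual half-unit adjustment, consistent with L = 59.5 in the solution.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped continuous data: L + (w / f_med) * (n/2 - CF)."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # CF: total frequency of the classes before the current one
    for lower, freq in zip(lower_boundaries, frequencies):
        if cumulative + freq >= half:
            return lower + width / freq * (half - cumulative)
        cumulative += freq
    raise ValueError("frequencies must contain at least one observation")

lower_boundaries = [39.5, 49.5, 59.5, 69.5, 79.5]
grade_frequencies = [5, 18, 27, 15, 6]

median_grade = grouped_median(lower_boundaries, grade_frequencies, width=10)
print(f"Median grade: {median_grade:.2f}")
```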
Properties of Median
Descriptive measures that describe the position (place) of a value in a given data set or
distribution are positional averages.
Measures which divide data into a number of equal parts are called quantiles (fractiles). The most
important of these are quartiles, deciles and percentiles.
Quartiles: are the three values which divide the given data into four equal parts. They are
denoted by Q1, Q2 and Q3.
Deciles: are the nine values which divide the series into ten equal parts. They are denoted by
D1, D2, … , D9.
D1 = Covers 10% of the distribution
D2 = Covers 20% of the distribution
.
.
D9 = Covers 90% of the distribution
Percentiles: are the 99 values which divide the series into 100 equal parts. They are denoted by
P1, P2 , … , P99.
Computation of Quartiles, Deciles and Percentiles for Ungrouped and Grouped Data
First, for ungrouped data, rearrange the values in order of magnitude and, for discrete series,
compute the less than cumulative frequency (<Cfi) column. Then apply the following formulas.
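$$Q_i = \text{value of the } \left[\frac{i(N+1)}{4}\right]^{th} \text{ item}, \quad i = 1, 2, 3$$
$$D_i = \text{value of the } \left[\frac{i(N+1)}{10}\right]^{th} \text{ item}, \quad i = 1, 2, \ldots, 9$$
$$P_i = \text{value of the } \left[\frac{i(N+1)}{100}\right]^{th} \text{ item}, \quad i = 1, 2, \ldots, 99$$
For grouped continuous data:
$$Q_i = l + \frac{c}{f}\left(\frac{iN}{4} - c.f.\right)$$
$$D_i = l + \frac{c}{f}\left(\frac{iN}{10} - c.f.\right)$$
$$P_i = l + \frac{c}{f}\left(\frac{iN}{100} - c.f.\right)$$
where l is the lower class boundary of the class containing the required item, c is the width of
that class, f is its frequency and c.f. is the cumulative frequency of the preceding class.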
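Example 4.15: For the data given below, compute the value of Quartiles, D3, D7, P15 and P88
and interpret.
Solution:
Q1: size of the $\left(\frac{N}{4}\right)^{th}$ item = 25th item, so 10 – 20 is the quartile class.
$$Q_1 = l + \frac{c}{f}\left(\frac{N}{4} - c.f.\right) = 10 + \frac{10}{15}(25 - 10) = 20$$
Q2: size of the $\left(\frac{2N}{4}\right)^{th}$ item = 50th item, so 20 – 40 is the quartile class.
$$Q_2 = l + \frac{c}{f}\left(\frac{2N}{4} - c.f.\right) = 20 + \frac{20}{25}(50 - 25) = 40$$
Q3: size of the $\left(\frac{3N}{4}\right)^{th}$ item = 75th item, so 40 – 60 is the quartile class.
$$Q_3 = l + \frac{c}{f}\left(\frac{3N}{4} - c.f.\right) = 40 + \frac{20}{30}(75 - 50) = 56.67$$
The marks of three quarters of the students are below 56.67.
D3: size of the $\left(\frac{3N}{10}\right)^{th}$ item = 30th item, so 20 – 40 is the decile class.
$$D_3 = l + \frac{c}{f}\left(\frac{3N}{10} - c.f.\right) = 20 + \frac{20}{25}(30 - 25) = 24$$
D7: size of the $\left(\frac{7N}{10}\right)^{th}$ item = 70th item, so 40 – 60 is the decile class.
$$D_7 = l + \frac{c}{f}\left(\frac{7N}{10} - c.f.\right) = 40 + \frac{20}{30}(70 - 50) = 53.33$$
P15: size of the $\left(\frac{15N}{100}\right)^{th}$ item = 15th item, so 10 – 20 is the percentile class.
$$P_{15} = l + \frac{c}{f}\left(\frac{15N}{100} - c.f.\right) = 10 + \frac{10}{15}(15 - 10) = 13.3$$
P88: size of the $\left(\frac{88N}{100}\right)^{th}$ item = 88th item, so 60 – 80 is the percentile class.
$$P_{88} = l + \frac{c}{f}\left(\frac{88N}{100} - c.f.\right) = 60 + \frac{20}{14}(88 - 80) = 71.43$$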
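The grouped formulas for quartiles, deciles and percentiles all share one interpolation step, which can be sketched as a single Python function. The function name is illustrative. N = 100 and the class details (lower boundary, width, frequency, preceding cumulative frequency) are read off the worked solution of Example 4.15; for Q3 the cumulative frequency before the class 40 – 60 is taken as 50, the same value the solution uses for D7 in that class.

```python
def grouped_quantile(lower, width, freq, cum_before, position):
    """l + (c / f) * (position - c.f.) for the class that contains `position`."""
    return lower + width / freq * (position - cum_before)

N = 100  # total number of observations implied by the solution (N/4 = 25)

q1 = grouped_quantile(10, 10, 15, 10, 1 * N / 4)
q2 = grouped_quantile(20, 20, 25, 25, 2 * N / 4)
q3 = grouped_quantile(40, 20, 30, 50, 3 * N / 4)
d3 = grouped_quantile(20, 20, 25, 25, 3 * N / 10)
d7 = grouped_quantile(40, 20, 30, 50, 7 * N / 10)
p15 = grouped_quantile(10, 10, 15, 10, 15 * N / 100)
p88 = grouped_quantile(60, 20, 14, 80, 88 * N / 100)

for name, value in [("Q1", q1), ("Q2", q2), ("Q3", q3), ("D3", d3),
                    ("D7", d7), ("P15", p15), ("P88", p88)]:
    print(f"{name} = {value:.2f}")
```

A useful sanity check is that each result must fall inside its own class, for example Q3 between 40 and 60.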
Chapter summary
Given any raw data involving either qualitative or quantitative variables, we look for one basic
feature: the central tendency of the observations. Several measures are available for this feature,
and each measure has its own advantages and drawbacks. Commonly we use the mean, median and
mode to get some idea about the central tendency. Of these, the mean is the most commonly used
measure of central tendency. But the other two, namely the median and mode, are not any less
important. The median is usually used if the collected data are of a qualitative type.
A data set may have no mode, or it may have one, two or more than two modes.
Self-evaluation test
4. Compute the questions given below based on the following frequency distribution.
Marks frequency
15 - 24 3
25 – 34 4
35 – 44 f1
45 -54 15
55 – 64 12
65 – 74 f2
75 – 84 2
If the median is 47
5.1 Introduction
Dispersion is the scatter or variation of items from a measure of central tendency. The first
section covers the definition and objectives of measures of dispersion and the properties of a good
measure of dispersion. The second section examines the different measures of variation, i.e.,
range and relative range, quartile deviation and coefficient of quartile deviation, mean deviation
and coefficient of mean deviation, the variance, the standard deviation and coefficient of
variation, and the standard score.
'Dispersion' or 'variation' in statistics is the degree of spread of each individual item or value
from the central value in the given distribution. According to Bowley, it is “the measure of the
variation of the items.” The term dispersion indicates the extent to which the items in the data
differ from one another. In other words, it measures the lack of uniformity in the distribution.
The measures of dispersion are also called averages of the second order, since they measure the
average of the deviations taken from the central tendency of the distribution.
Example 5.1: Consider the following data on the expenditures of two groups of workers:
- Group A: ETB 6200 2000 1300 1300 1200 (the mean is ETB 2400)
- Group B: ETB 1600 1700 1300 4200 3200 (the mean is ETB 2400)
If we were given only the average expenditure of the two groups without knowing the actual
expenditures, we would simply conclude that the two groups spend identical amounts. But the actual
observations indicate that more variation is observed in group A.
There are two types of measures of dispersion; Absolute and Relative measures of dispersion.
An absolute measure expresses the magnitude of dispersion in the same unit of measurement in
which the data are recorded. However, the relative measure (which is unit less) expresses
dispersion in percentages or ratios. It is a quotient obtained by dividing the absolute measure by
a quantity in respect to which the absolute dispersion has been computed.
Range is an absolute measure of dispersion. Such measures can be compared if the unit of
measurement is homogeneous, i.e. if all the sets of distribution are expressed in the same
statistical unit such as birr, litres, meters, etc. But if different sets of distributions are expressed
in different statistical units, the absolute measure of range cannot be effectively compared. For
ensuring comparability, a relative measure of range, known as the coefficient of range, is used.
This is obtained by dividing the absolute range by the sum of the largest and smallest values in the
distribution.
Range is defined as the difference between the smallest and the largest observations in a given
set of raw data.
Distinguish the absolute measure of dispersion from the relative measure of dispersion.
Group A: ETB 6200 2000 1300 1300 1200 (the mean is ETB 2400)
Group B: ETB 1600 1700 1300 4200 3200 (the mean is ETB 2400)
Solution:
For Group A:
The highest expenditure = 6200 birr
The lowest expenditure = 1200 birr
Range = highest value – lowest value = 6200 – 1200 = 5000 Birr
For Group B:
The highest expenditure = 4200
The lowest expenditure = 1300
Range = 4200 – 1300 = 2900 Birr
Therefore, in terms of expenditure more variation is observed in group A.
A large value of range shows lack of uniformity and consistency in the distribution. It
explains that the average is inadequate and not representative.
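The comparison of the two groups can be sketched in Python, using the Group A and Group B expenditures as listed in Example 5.1. The two helper functions are illustrative; the coefficient of range follows the description above (the range divided by the sum of the largest and smallest values).

```python
def value_range(values):
    """Absolute range: largest value minus smallest value."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """Relative range in percent: (max - min) / (max + min) * 100."""
    return (max(values) - min(values)) / (max(values) + min(values)) * 100

group_a = [6200, 2000, 1300, 1300, 1200]
group_b = [1600, 1700, 1300, 4200, 3200]

range_a, range_b = value_range(group_a), value_range(group_b)
print(f"Range A = {range_a} Birr, Range B = {range_b} Birr")
print(f"Coefficient of range A = {coefficient_of_range(group_a):.1f}%")
print(f"Coefficient of range B = {coefficient_of_range(group_b):.1f}%")
```

Both groups have the same mean (ETB 2400), yet the range shows clearly that the expenditures in group A are spread more widely.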
For discrete grouped data we use the same formula as given above, i.e., the difference between
the highest and lowest values.
Xi 6 24 18 22 30 15
Yi 3 2 5 1 4 5
Solution:
- By taking the difference between the upper class limit of the last class and the lower
limit of the first class.
- By taking the difference between the upper class boundary of the last class and the
lower class boundary of the first class.
- By taking the difference between the mid points of the first and the last class. This
does yield a result closer to the actual range, as it reduces the margin by which it is in
error when computed by using the first or the second method.
Example 5.4: Compute the range of the data given below which shows the score (out of 35%)
of 40 students in Econometrics test.
$$\text{Coefficient of range} = \frac{UCB_L - LCB_F}{UCB_L + LCB_F} \times 100\%$$
Where $UCB_L$ is the upper class boundary of the last class and
$LCB_F$ is the lower class boundary of the first class
Example 5.5: Find the coefficient of range (relative range) for the data given in the above table.
Solution:
$$\text{Coefficient of range} = \frac{30.5 - 5.5}{30.5 + 5.5} \times 100\% = 69.4\%$$
Note
- Range is as good a measure of dispersion as any other where the data consist of a few
observations.
- It is advantageous when one wants to know only the extent of the extreme dispersion
under “ordinary” conditions.
- It tells us nothing about the dispersion of the values which fall between the two
extremes.
- It is highly affected if the value of the two extremes changes.
ii. Inter- quartile range
The range, which takes into account only two extreme values, is considered a crude measure of
dispersion. To overcome certain limitations of range, another method known as `inter-quartile
range’ has been developed. Inter-quartile range is the absolute difference between the third
quartile (Q3) and the first quartile (Q1) of a given frequency distribution. In other words, it
includes only the middle 50% of items in the distribution and ignores one quarter of the
observations on either end of the distribution. The inter-quartile range is calculated with the help
of the formula
I.R = Q3 – Q1
Where I.R = Inter-quartile range, Q3 = the third quartile and Q1 = the first quartile.
For example, if for a frequency distribution Q3 and Q1 are 72 and 20 respectively, its inter-quartile
range is I.R = 72 – 20 = 52.
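iii. Quartile deviation
The quartile deviation (Q.D.) is half of the inter-quartile range:
$$Q.D. = \frac{Q_3 - Q_1}{2} \quad \text{and coefficient of } Q.D. = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\%$$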
Example 5.6: The following data were collected on the results of 11 students in a final examination
of statistics.
Result: 72 65 75 69 35 43 52 37 61 58 41
Solution: The data when arranged in ascending order will read as follows:
35 37 41 43 52 58 61 65 69 72 75
Calculating the quartiles,
Q3 = value of the $3\left(\frac{N+1}{4}\right)^{th}$ item = value of the
$3\left(\frac{11+1}{4}\right)^{th}$ item = value of the $3\left(\frac{12}{4}\right)^{th}$ item
= value of the 9th item = 69
Q1 = value of the $\left(\frac{N+1}{4}\right)^{th}$ item = value of the
$\left(\frac{11+1}{4}\right)^{th}$ item = value of the $\left(\frac{12}{4}\right)^{th}$ item
= value of the 3rd item = 41
Substituting 69 and 41 for the upper and lower quartiles respectively in the formulas, we have:
$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{69 - 41}{2} = \frac{28}{2} = 14$$
$$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\% = \frac{69 - 41}{69 + 41} \times 100\% = \frac{28}{110} \times 100\% = 25.5\%$$
From the coefficient calculated above, it can be seen that there is a greater uniformity in the
distribution because the coefficient of quartile deviation is small.
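Example 5.6 can be reproduced with the Python sketch below. The helper `item_at` is hypothetical; for a whole-number position it simply picks that item, and for a fractional position it interpolates linearly between neighbours, which is an assumption not spelled out in these notes.

```python
def item_at(ordered, position):
    """Value at a 1-based position in ordered data (linear interpolation if fractional)."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

results = [72, 65, 75, 69, 35, 43, 52, 37, 61, 58, 41]
ordered = sorted(results)
n = len(ordered)

q1 = item_at(ordered, (n + 1) / 4)        # 3rd item
q3 = item_at(ordered, 3 * (n + 1) / 4)    # 9th item
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1) * 100

print(f"Q1 = {q1}, Q3 = {q3}")
print(f"Q.D. = {quartile_deviation}, coefficient = {coefficient_qd:.1f}%")
```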
Advantages of Quartile deviation
- It is easy to compute and understand
- It can be computed for open-ended classes given that Q3 & Q1 can be found.
- It is not affected by extreme values
Disadvantages of Quartile deviation
- It ignores the first 25% and the last 25% items.
- It is not capable of mathematical manipulations.
- Its value is very much affected by sampling fluctuations.
- It doesn’t show the scatter around the average, but only a distance on scale.
iv. Mean deviation
Mean Deviation measures the average deviation /scatters of a set of observations about a central
value (mean/median). Mean deviation is obtained by dividing the sum of the absolute deviations
taken from the average by the total number of observations. Generally, the result is in the
absolute value to denote that deviations are taken by ignoring algebraic signs because, if signs
are taken into consideration, the sum of deviations from mean will be Zero, and the sum of
deviations from median or mode will be nearly Zero.
If the deviations are taken from the mean then it is called M.D around the mean, if the deviations
are taken from the median, it is called M.D around the median, or if the deviations are taken
from the mode, it is called M.D around the mode.
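$$\text{M.D. about the mean} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}, \quad \text{M.D. about the median} = \frac{\sum_{i=1}^{n} |x_i - \text{median}|}{n}$$
$$\text{M.D. about the mode} = \frac{\sum_{i=1}^{n} |x_i - \text{mode}|}{n}$$
$$\text{The coefficient of M.D.} = \frac{M.D._{mean}}{\bar{x}} \times 100\% \text{ or } \frac{M.D._{median}}{\text{median}} \times 100\% \text{ or } \frac{M.D._{mode}}{\text{mode}} \times 100\%$$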
Mean Deviation for Grouped data (discrete or continuous)
For grouped data, the mean deviations around the mean, median, and mode are obtained,
respectively, as follows:
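$$\frac{\sum_{i=1}^{m} f_i |X_i - \text{mean}|}{n}, \quad \frac{\sum_{i=1}^{m} f_i |X_i - \text{median}|}{n} \quad \text{and} \quad \frac{\sum_{i=1}^{m} f_i |X_i - \text{mode}|}{n}$$
Where m = number of classes, and Xi = class mark of the ith class.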
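Example 5.7: Calculate the mean deviations from the mean and the median for the data given below
Class Interval (C.I)   1-5   6-10   11-15   16-20
Frequency              4     1      2       3
Solution:
C.I     xi   fi   fi|xi − 10|   fi|xi − 10.5|
1-5     3    4    28            30
6-10    8    1    2             2.5
11-15   13   2    6             5
16-20   18   3    24            22.5
Total        10   60            60
$$\text{Mean} = \frac{3 \times 4 + 8 \times 1 + 13 \times 2 + 18 \times 3}{10} = \frac{100}{10} = 10,$$
and the median, interpolated from the median class 6-10, is $5.5 + \frac{5}{1}(5 - 4) = 10.5$.
Hence the M.D. about the mean is 60/10 = 6 and the M.D. about the median is 60/10 = 6.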
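The grouped mean deviation of Example 5.7 can be sketched in Python. The function name is illustrative, and the median value 10.5 is taken from the column heading of the solution table rather than recomputed here.

```python
def grouped_mean_deviation(class_marks, frequencies, center):
    """Sum of f * |x - center| divided by the total frequency."""
    n = sum(frequencies)
    return sum(f * abs(x - center) for x, f in zip(class_marks, frequencies)) / n

class_marks = [3, 8, 13, 18]
frequencies = [4, 1, 2, 3]

n = sum(frequencies)
mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n
median = 10.5  # value used in the solution table of Example 5.7

md_mean = grouped_mean_deviation(class_marks, frequencies, mean)
md_median = grouped_mean_deviation(class_marks, frequencies, median)
print(f"mean = {mean}, M.D. about mean = {md_mean}, M.D. about median = {md_median}")
```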
To overcome the limitation of ignoring the algebraic signs of the deviations, the concept of
standard deviation was introduced by Karl Pearson in 1893. Standard deviation may be defined
as the square root of the mean of the squared deviations measured from the arithmetic mean. It is
also called the “root-mean-square deviation.” Symbolically it is expressed by the small Greek
letter σ (sigma). For calculating the standard deviation, the deviations are always taken from the
mean, because the sum of squared deviations is minimum when measured from the arithmetic mean.
A high value of standard deviation denotes greater variation and less uniformity, and a low value
denotes lesser variation and greater uniformity. A lower value also indicates that the average is a
good representative of the data.
Variance and standard deviation are closely related to each other, since the variance is
the square of the standard deviation, or the standard deviation is the square root of the variance.
While the standard deviation and variance are absolute measures, the relative measure is
known as the coefficient of variation.
Suppose that x1, x2, ..., xN are the values of the observations in a population of size N with mean
μ. Then, the population variance and standard deviation are defined by:
$$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \quad \ldots (\#)$$
$$\text{Standard deviation} = \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Example 5.8: Suppose that the ages of all patients in the recovery room of a certain Hospital
are: 26, 30, 38, 40, 36, 20, 45, and 37 years. Find the population variance and standard deviation.
Solution: Given N = 8, x1 = 26, x2 = 30, …, x8 = 37.
The population mean is $\mu = \frac{26 + 30 + \cdots + 37}{8} = \frac{272}{8} = 34$.
Then, we construct the following table for the deviations and squared deviations from μ (the last
column is for totals).
Age xi        26   30   38   40   36   20    45    37   Σxi = 272
xi − μ        -8   -4   4    6    2    -14   11    3    Σ(xi − μ) = 0
(xi − μ)²     64   16   16   36   4    196   121   9    Total = 462
Thus, using the above formulae we have $\sigma^2 = 462/8 = 57.75$ and $\sigma = \sqrt{57.75} = 7.599$.
Or we can apply
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
Now, to solve the example above using this formula, we have $\sum_{i=1}^{8} x_i^2 = 9710$ and
$\mu = \frac{\sum x_i}{8} = \frac{272}{8} = 34$, so that
$\sigma^2 = \frac{1}{8}(9710) - 34^2 = 57.75$ and $\sigma = \sqrt{57.75} = 7.599$.
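Both routes to the population variance in Example 5.8 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
import math

ages = [26, 30, 38, 40, 36, 20, 45, 37]
N = len(ages)
mu = sum(ages) / N

# Definition: mean of the squared deviations from the population mean.
variance_definition = sum((x - mu) ** 2 for x in ages) / N

# Shortcut: mean of the squares minus the square of the mean.
variance_shortcut = sum(x * x for x in ages) / N - mu ** 2

sigma = math.sqrt(variance_definition)
print(f"mu = {mu}, variance = {variance_definition}, sigma = {sigma:.3f}")
```

The two expressions are algebraically identical, so any difference between them can only come from rounding.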
b. Population variance and S.D for grouped data
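For a frequency distribution with m classes (or distinct values) and $N = \sum f_i$, the
population variance is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{m} f_i (x_i - \mu)^2$$
and the standard deviation is its square root.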
Example 5.9: Find the variance and S.D for the population values:
xi 2 3 5 6 8
fi 3 4 4 5 4
Solution: First find the mean,
$$\mu = \frac{1}{N}\sum f_i x_i = \frac{100}{20} = 5,$$
and proceed as: $\sum f_i (x_i - \mu)^2 = 27 + 16 + 0 + 5 + 36 = 84$, so that
$\sigma^2 = 84/20 = 4.2$ and $\sigma = \sqrt{4.2} \approx 2.05$.
estimate the corresponding parameters. So far, the parameters σ² and σ have been discussed.
Normally, one can say that if we replace N by n and μ by x̄, the resulting formula would become
$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
and this would be used to estimate σ². But, theoretically, it can be shown that the sample
variance should instead be computed as
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Why do we use n-1?
In small samples, it provides a better estimate of the variance of the population from which the
sample is drawn.
However, as n increases above about 30, we can use n instead of n-1, as the two versions give
approximately the same result for practical purposes.
Example 5.10: A sample of 5 students was taken from a class and their weights were found to be
48, 51, 52, 51, and 53kg. Find the variance and standard deviation.
Solution: The mean is $\bar{x} = 255/5 = 51$; prepare the following table:
Weights (xi)   xi − x̄   (xi − x̄)²
48             -3        9
51             0         0
52             1         1
51             0         0
53             2         4
Total          0         14
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{5} (x_i - 51)^2}{5 - 1} = \frac{14}{4} = 3.5,$$
and $s = \sqrt{3.5} = 1.87$ is the sample standard deviation.
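Example 5.10 can be checked in Python, and the effect of dividing by n − 1 instead of n can be seen at the same time. The sketch uses the standard library functions `statistics.variance` and `statistics.stdev` (both divide by n − 1) and `statistics.pvariance` (divides by n).

```python
import statistics

weights = [48, 51, 52, 51, 53]  # sample of 5 students (kg)

sample_variance = statistics.variance(weights)    # divides by n - 1
sample_sd = statistics.stdev(weights)             # square root of the above
divide_by_n = statistics.pvariance(weights)       # divides by n, for comparison

print(f"s^2 = {sample_variance}, s = {sample_sd:.2f}")
print(f"dividing by n instead gives {divide_by_n}")
```

For a sample this small the two versions differ noticeably (3.5 against 2.8); the gap shrinks as n grows.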
d. Sample variance and S.D for grouped data
If the values xi have frequencies fi (i = 1, 2, …, m), then the sample variance is given by:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i (x_i - \bar{x})^2$$
The above definition for sample variance also holds for a grouped continuous distribution, where
xi = class mark of the ith class.
Example 5.11: Find the sample variance and standard deviation for the distribution:
C.I 1-5 6-10 11-15 16-20
Freq. 4 1 2 3
Solution:
In a continuous F.D., xi is the class mark representing the ith class.
C.I     xi   fi   fi xi   fi xi²
1-5     3    4    12      36
6-10    8    1    8       64
11-15   13   2    26      338
16-20   18   3    54      972
Where $n = \sum f_i = 10$, $\bar{x} = \frac{\sum f_i x_i}{n} = \frac{100}{10} = 10$ and
$\sum f_i x_i^2 = 1410$, so that
$$s^2 = \frac{1}{9}\left(1410 - 10 \times 10^2\right) = \frac{410}{9} = 45.56, \quad \text{and} \quad s = \sqrt{45.56} = 6.75.$$
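The computation in Example 5.11 uses the shortcut form of the grouped sample variance, (Σ f x² − n x̄²)/(n − 1). A minimal Python sketch of that shortcut follows; the function name is a hypothetical helper.

```python
import math

def grouped_sample_variance(class_marks, frequencies):
    """Sample variance of grouped data via (sum f*x^2 - n * mean^2) / (n - 1)."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(class_marks, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(class_marks, frequencies))
    mean = sum_fx / n
    return (sum_fx2 - n * mean ** 2) / (n - 1)

class_marks = [3, 8, 13, 18]
frequencies = [4, 1, 2, 3]

s2 = grouped_sample_variance(class_marks, frequencies)
s = math.sqrt(s2)
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```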
The relative measure of standard deviation is called the coefficient of variation. According to Karl
Pearson, who introduced this measure, it is “the percentage variation in the mean, the standard
deviation being treated as the total variation in the mean.” It is used to measure the consistency
and variability or uniformity of two or more distributions. A lower value of the coefficient of
variation indicates higher consistency, more uniformity, and lower variability. Symbolically, the
coefficient of variation can be expressed as
$$C.V. = \frac{s}{\bar{x}} \times 100\% \quad \text{or} \quad C.V. = \frac{\sigma}{\mu} \times 100\%$$
Where C.V = coefficient of variation
s (or σ) = standard deviation
x̄ = sample mean
µ = population mean
Both the coefficient of standard deviation and the coefficient of variation are relative
measures of dispersion. The coefficient of variation is the coefficient of standard deviation
expressed as a percentage.
Example 5.12: Suppose typist A types out 30 pages per day on average with a S.D. of 6 and
typist B types out 45 pages per day on average with a S.D. of 10. Which typist has
shown greater consistency in her/his output?
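Solution: $C.V.(\text{Typist A}) = \frac{s}{\bar{x}} \times 100\% = \frac{6}{30} \times 100\% = 20\%$.
$C.V.(\text{Typist B}) = \frac{s}{\bar{x}} \times 100\% = \frac{10}{45} \times 100\% = 22.2\%$.
Since the C.V. of typist A is less than that of typist B, typist A is more consistent and shows more
uniformity in his/her performance.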
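The comparison in Example 5.12 can be sketched in Python. The function name is illustrative; it returns the coefficient of variation as a percentage.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100

cv_a = coefficient_of_variation(6, 30)    # typist A: 30 pages/day, S.D. 6
cv_b = coefficient_of_variation(10, 45)   # typist B: 45 pages/day, S.D. 10

more_consistent = "A" if cv_a < cv_b else "B"
print(f"C.V. A = {cv_a:.1f}%, C.V. B = {cv_b:.1f}%, more consistent: typist {more_consistent}")
```

Comparing the standard deviations alone (6 against 10) would ignore that typist B also produces more pages on average; dividing by the mean puts both on the same footing.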
The standard score is one application of the standard deviation and is a relative measure. It
indicates the relative position of individual observations. Suppose that a student scored 66 in
Statistics and 80 in Mathematics. The question is, in which course did he score better as compared
to his classmates?
At first glance, it seems that he did much better in Mathematics. But suppose that all the students
in the class averaged 51 points in Statistics with a standard deviation of 12, and averaged 72 in
Mathematics with a standard deviation of 16.
Thus, one can argue that the student’s score in Statistics is $\frac{66 - 51}{12} = 1.25$ standard
deviations above the average, while his score in Mathematics is only $\frac{80 - 72}{16} = 0.50$
standard deviation above the average for the class. Here, the grades have been converted into
standard scores. Whereas the original scores cannot be meaningfully compared, the standard scores
expressed in terms of standard deviations can be compared. Thus, the student scored much higher in
Statistics than in Mathematics compared to the rest of the class.
In general, we define the standard score as:
$$z = \frac{x - \bar{x}}{s} \quad \text{or} \quad Z = \frac{x - \mu}{\sigma}$$
The Z- score tells us how many standard deviations a value lies above (if positive) or below (if
negative) the mean of the set of data to which it belongs.
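Example 5.13: If a set of measurements has the mean 48 with a S.D. of 12, convert each of the
following into standard units: a) 54; b) 72; c) 78.
Solution: a) For x = 54, $Z = \frac{54 - 48}{12} = 0.5$; that is, 54 is 0.5 S.D’s above the mean.
b) For x = 72, $Z = \frac{72 - 48}{12} = 2$; that is, 72 is 2 S.D’s above the mean.
c) For x = 78, $Z = \frac{78 - 48}{12} = 2.5$; that is, 78 is 2.5 S.D’s above the mean.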
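Standard scores are simple to compute, as the following Python sketch shows. The function `z_score` is an illustrative helper; a positive result means the value lies above the mean and a negative result means it lies below.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# The student's two courses.
z_statistics = z_score(66, mean=51, std_dev=12)
z_mathematics = z_score(80, mean=72, std_dev=16)
print(f"Statistics: {z_statistics}, Mathematics: {z_mathematics}")

# Example 5.13: mean 48, S.D. 12.
standard_units = [z_score(x, 48, 12) for x in (54, 72, 78)]
print(standard_units)
```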
Chapter summary
Given any raw data involving either qualitative or quantitative variables, we look for one basic
feature: the dispersion of the observations. Several measures are available for this feature, and
each measure has its own advantages and drawbacks. The range and relative range, quartile
deviation and coefficient of quartile deviation, mean deviation and coefficient of mean
deviation, variance, standard deviation and coefficient of variation are the most commonly used
measures of dispersion. Among different groups, the one which has the smaller coefficient of
variation shows more consistency, or less variation, in its distribution.
1. The test grade for a number of students are: 76, 72, 100, 64, 72 and 90
a. What is the standard deviation and variance of the test score?
b. What is the coefficient of variation of the test score?
2. Meteorologists interested in the consistency of temperatures in three cities during a given week
collected the following data. The temperatures for five days of the week in the three cities
were:
City 1 25 24 23 26 17
City 2 22 21 24 22 20
City 3 32 27 35 24 28
Which city has the most consistent temperature, based on these data?
3. Let $n = 10$, $\bar{x} = 12$ and $\sum x_i^2 = 1530$ for a certain data set. Find the coefficient
of variation.
4. Consider the following grouped frequency distribution and answer the following
questions.
Marks frequency
15 - 24 3
25 – 34 4
35 – 44 10
45 -54 15
55 – 64 12
65 – 74 4
75 – 84 2
Total 50
6.1 Introduction
This chapter consists of three sub sections. The first section explains the concept of moments and
the methods of measuring moments i.e., from mean, arbitrary number and the origin. The second
section discusses the concept of Skewness. Skewness refers to the lack of symmetry in a
frequency distribution. In addition to this, we will look at four different ways of measuring the
coefficient of Skewness: Karl Pearson’s, Bowley’s, Kelly’s and the moments measure of the
coefficient of Skewness. The last section covers the concept of kurtosis and the measures of the
coefficient of kurtosis.
6.2 Moments
A moment is the mean of different powers of deviations of observations from a certain point. If
this point is the mean, then the moments about the mean are called the central moments, denoted by
μ (read as ‘mu’). Thus μ1 stands for the first moment about the mean, μ2 stands for the second
moment about the mean, etc.
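Central moment (Moments from the mean)
The rth moment from the mean for ungrouped data is calculated as follows
$$\mu_r = \frac{\sum (X_i - \bar{X})^r}{n}$$
$$\mu_1 = \frac{\sum (X_i - \bar{X})}{n}; \text{ since } \sum (X_i - \bar{X}) = 0, \ \mu_1 \text{ is always zero: } \mu_1 = 0$$
$$\mu_2 = \frac{\sum (X_i - \bar{X})^2}{n}, \quad \mu_3 = \frac{\sum (X_i - \bar{X})^3}{n}$$
The rth moment from the mean for a frequency distribution is calculated as follows
$$\mu_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$$
$$\mu_1 = \frac{\sum f_i (X_i - \bar{X})}{\sum f_i}, \quad \mu_3 = \frac{\sum f_i (X_i - \bar{X})^3}{\sum f_i}$$
$$\mu_2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}, \quad \mu_4 = \frac{\sum f_i (X_i - \bar{X})^4}{\sum f_i}$$
Where Xi is the class mark of the ith class and fi is the frequency of the ith class.
Moments about an arbitrary number A, for ungrouped data:
$$\text{First moment about } A = \frac{\sum (X_i - A)}{n}, \quad \text{third moment about } A = \frac{\sum (X_i - A)^3}{n}$$
$$\text{Second moment about } A = \frac{\sum (X_i - A)^2}{n}, \quad \text{fourth moment about } A = \frac{\sum (X_i - A)^4}{n}$$
For a frequency distribution,
$$\text{Second moment about } A = \frac{\sum f_i (X_i - A)^2}{N}, \text{ and so on.}$$
Moments about the origin (A = 0):
$$\text{First moment} = \frac{\sum f_i (X_i - 0)}{N} = \frac{\sum f_i X_i}{N}, \quad \text{third moment} = \frac{\sum f_i X_i^3}{N}$$
and
$$\text{Second moment} = \frac{\sum f_i X_i^2}{N}, \quad \text{fourth moment} = \frac{\sum f_i X_i^4}{N}$$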
The concept of moments is of great significance in statistical work. With the help of moments we
can measure the central tendency of a set of observations, their variability, their symmetry and
the peakedness of their frequency curve. Because moments make it so convenient to obtain
measures of the various characteristics of a frequency distribution, calculating the first
four moments about the mean may well be the first step in the analysis of a frequency
distribution.
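The central moment formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `central_moment` and the small frequency distribution are invented here and do not come from the module.

```python
def central_moment(values, r, freqs=None):
    """Return the r-th moment about the mean.

    values: observations, or class marks for a frequency distribution.
    freqs:  matching frequencies; None means every value occurs once.
    """
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(freqs)  # n for raw data, sum of f_i for grouped data
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / total


# Hypothetical frequency distribution, invented for illustration only.
marks = [1, 2, 3, 4, 5]
counts = [1, 2, 4, 2, 1]

first_four = [central_moment(marks, r, counts) for r in (1, 2, 3, 4)]
print(first_four)
```

For this symmetric illustration the first and third moments come out as zero, which matches the statement that $\mu_1$ is always zero.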
6.3 Skewness
Two or more distributions may have the same mean and the same standard deviation and still differ
in shape. Describing a distribution only by its mean and standard deviation is therefore not
sufficient, and we are led to introduce the concept of skewness.
The term skewness refers to the lack of symmetry: when a distribution is not
symmetrical it is called asymmetrical or skewed. We study skewness to get an idea of the
shape of the curve that can be drawn from the frequency distribution. Frequency
distributions are often skewed to one side of the central value. A skewed distribution has a
longer tail either to the left or to the right.
If there is a longer tail to the right of the center, the distribution is said to be positively skewed.
Positive skewness means a greater dispersal of individual observations towards the right of the
central value.
If the tail is longer to the left of the center, the distribution is said to be negatively skewed.
Negative skewness, on the other hand, implies that individual observations have greater dispersal
towards the left of the central value.
Skewness, therefore, refers not only to the lack of symmetry in a distribution but also shows the
direction of dispersion of individual observations on either side of the center of the distribution.
Tests of Skewness
- If A.M = Median = Mode, then there is no skewness in the distribution. In other words,
the curve of the frequency distribution is symmetrical or bell-shaped.
- If A.M is less than the value of the mode, the tail of an asymmetrical distribution is on
the left side, i.e. the distribution is negatively skewed.
- If A.M is greater than the value of the mode, the tail is on the right side, i.e. the distribution
is positively skewed.
Measures of Skewness
$$SK_P = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
where $SK_P$ = Karl Pearson's coefficient of skewness.
If the mode is ill defined in some frequency distribution, the empirical relationship between
mean, median and mode for a moderately skewed distribution (Mode = 3 Median - 2 Mean) gives
$$SK_P = \frac{3\,\text{Mean} - 3\,\text{Median}}{\text{Standard deviation } (\delta)}$$
Example 6.1: Suppose the mean, the mode and the standard deviation of a certain distribution
are 32, 30.5 and 10 respectively. What is the shape of the curve representing the distribution?
Solution:
$$SK_P = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{32 - 30.5}{10} = 0.15$$
Since $SK_P$ is positive, the distribution is positively skewed.
Theoretically, the value of the coefficient based on the median lies between -3 and +3; however,
in practice the value of $SK_P$ usually lies between -1 and 1. For a symmetrical distribution, its
value comes out to be zero.
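As a rough check of Example 6.1, both forms of Karl Pearson's coefficient can be written as small Python functions. The function names and the helper `describe` are illustrative assumptions, not part of the module.

```python
def pearson_skewness(mean, mode, sd):
    """Karl Pearson's coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd


def pearson_skewness_median(mean, median, sd):
    """Form used when the mode is ill defined: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


def describe(sk):
    """Translate the sign of a skewness coefficient into words."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"


sk = pearson_skewness(mean=32, mode=30.5, sd=10)  # figures from Example 6.1
print(round(sk, 2), describe(sk))
```

With the figures of Example 6.1 the coefficient is 0.15, so the curve has its longer tail on the right.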
Bowley's coefficient of skewness is based on the quartiles and is therefore called the quartile
measure of skewness. It is also useful for open-end distributions and where extreme values are
present.
$$SK_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$$
where $SK_B$ = Bowley's coefficient of skewness. The value of $SK_B$ lies between -1 and 1.
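A minimal Python sketch of the quartile measure follows. The quartile values are hypothetical and are passed in directly, because the sketch makes no assumption about how the quartiles were computed.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2 * Median) / (Q3 - Q1)."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 must differ")
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Hypothetical quartiles, invented for illustration only.
sk_b = bowley_skewness(q1=20, median=30, q3=50)
print(round(sk_b, 3))
```

Here the median lies closer to Q1 than to Q3, so the coefficient is positive.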
Kelly's measure is a modification of Bowley's formula. Kelly suggested that a measure of skewness
is better if it covers more of the observations. Thus, instead of the upper and the lower
quartiles, he proposed using the upper and the lower deciles (the ninth decile $D_9$ and the first
decile $D_1$) and put forward the formula as follows:
$$SK_K = \frac{D_9 + D_1 - 2\,\text{Median}}{D_9 - D_1}$$
A measure of skewness may be obtained by making use of the third moment about the mean. $\beta_1$
measures skewness using the second and third moments.
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
Since $\mu_3^2$ and $\mu_2^3$ are never negative, $\beta_1$ is never negative either, so as a
measure of skewness it cannot tell us about the direction of skewness. This drawback can be
removed by calculating Karl Pearson's ratio $\gamma_1$, which is defined as
$$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{(\delta^2)^{3/2}} = \frac{\mu_3}{\delta^3}$$
where $\delta^2 = \mu_2$ is the variance and $\gamma_1$ takes the sign of $\mu_3$.
$\beta_1$ is never negative, so the sign of skewness depends upon the value of $\mu_3$. If $\mu_3$ is
positive, we will have positive skewness and if $\mu_3$ is negative, we will have negative skewness.
Thus it is advisable to use $\gamma_1$ as a measure of skewness.
The shape of the curve is determined by the value of $\gamma_1$:
- If $\gamma_1 = 0$, the distribution is symmetrical.
- If $\gamma_1 > 0$, the distribution is positively skewed.
- If $\gamma_1 < 0$, the distribution is negatively skewed. In a negatively skewed distribution,
smaller observations are less frequent than larger observations, i.e. the majority of the
observations have a value above the average.
Example 6.2: Find $\beta_1$ and $\gamma_1$ and interpret the outcome using the table below.
X 2 3 4 5 6
f 1 3 7 3 1
Solution:
X       f       x = X − X̄ (X̄ = 4)      fx        fx²        fx³        fx⁴
2       1       -2                       -2        4          -8         16
3       3       -1                       -3        3          -3         3
4       7       0                        0         0          0          0
5       3       +1                       +3        3          +3         3
6       1       +2                       +2        4          +8         16
Total   15                               Σfx = 0   Σfx² = 14  Σfx³ = 0   Σfx⁴ = 38
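$$\mu_2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i} = \frac{\sum f x^2}{N} = \frac{14}{15} = 0.933$$
$$\mu_3 = \frac{\sum f_i (X_i - \bar{X})^3}{\sum f_i} = \frac{\sum f x^3}{N} = \frac{0}{15} = 0$$
$$\mu_4 = \frac{\sum f_i (X_i - \bar{X})^4}{\sum f_i} = \frac{\sum f x^4}{N} = \frac{38}{15} = 2.533$$
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0^2}{0.933^3} = 0$$
$$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{(\delta^2)^{3/2}} = \frac{\mu_3}{\delta^3} = \frac{0}{0.966^3} = 0$$
Since $\gamma_1 = 0$, the distribution is symmetrical.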
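The arithmetic of Example 6.2 can be reproduced with a short Python sketch. The variable names and the helper `mu` are assumptions made for illustration; only the data come from the example.

```python
X = [2, 3, 4, 5, 6]   # values from Example 6.2
f = [1, 3, 7, 3, 1]   # their frequencies

N = sum(f)
mean = sum(fi * xi for xi, fi in zip(X, f)) / N


def mu(r):
    """r-th central moment of the frequency distribution (X, f)."""
    return sum(fi * (xi - mean) ** r for xi, fi in zip(X, f)) / N


mu2, mu3 = mu(2), mu(3)
beta1 = mu3 ** 2 / mu2 ** 3   # never negative, so it hides the direction
gamma1 = mu3 / mu2 ** 1.5     # keeps the sign of mu3

print(round(mu2, 3), mu3, beta1, gamma1)
```

Because the third moment is zero, both coefficients are zero and the sketch agrees with the conclusion that the distribution is symmetrical.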
6.4 Kurtosis
Kurtosis in Greek means “bulginess”. In statistics it refers to the degree of flatness or peakedness
in the region about the mode of a frequency curve. The degree of kurtosis of a distribution is
measured relative to the peakedness of the normal curve. If a curve is more peaked than the normal
curve, it is called “Leptokurtic”. In such a case the items are more closely bunched around the
mode. On the other hand, if a curve is more flat-topped than the normal curve, it is called
“Platykurtic”. The normal curve itself is known as “Mesokurtic”. The condition of peakedness or
flat-toppedness itself is known as kurtosis.
The most important measure of kurtosis is the value of the coefficient $\beta_2$. It is defined as:
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
where $\mu_4$ = 4th moment and $\mu_2$ = 2nd moment about the mean.
The greater the value of $\beta_2$, the more peaked is the distribution.
For a normal curve the value of $\beta_2$ = 3. When the value of $\beta_2$ is greater than 3, the
curve is more peaked than the normal curve, i.e. leptokurtic. When the value of $\beta_2$ is less
than 3, the curve is less peaked than the normal curve, i.e. platykurtic. The normal curve and
other curves with $\beta_2$ = 3 are called mesokurtic.
Sometimes $\gamma_2$, which is derived from $\beta_2$, is used as a measure of kurtosis. $\gamma_2$
is defined as
$$\gamma_2 = \beta_2 - 3$$
Example 6.3: Based on the data given in the table of Example 6.2, find $\beta_2$ and $\gamma_2$.
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{2.533}{0.933^2} = 2.91$$
$$\gamma_2 = \beta_2 - 3 = 2.91 - 3 = -0.09$$
Since the value of $\beta_2$ is less than 3 and the value of $\gamma_2$ is less than 0, the
distribution is platykurtic.
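The kurtosis calculation can be sketched in Python as follows. The function names `kurtosis_coefficients` and `classify`, and the tolerance used to decide when $\beta_2$ counts as equal to 3, are assumptions made for this illustration.

```python
def kurtosis_coefficients(values, freqs):
    """Return (beta2, gamma2) for a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    mu2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    mu4 = sum(f * (x - mean) ** 4 for x, f in zip(values, freqs)) / n
    beta2 = mu4 / mu2 ** 2
    return beta2, beta2 - 3


def classify(beta2, tol=1e-9):
    """Name the shape relative to the normal curve (beta2 = 3)."""
    if abs(beta2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"


# Data of Example 6.2, reused in Example 6.3.
beta2, gamma2 = kurtosis_coefficients([2, 3, 4, 5, 6], [1, 3, 7, 3, 1])
print(round(beta2, 2), round(gamma2, 2), classify(beta2))
```

The sign of $\gamma_2$ carries the same information as comparing $\beta_2$ with 3, which is why either may be reported.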
Chapter summary
The shape of the distribution is assessed using the measures of skewness. Skewness refers to the
lack of symmetry in a frequency distribution. Depending on its shape, a distribution may be
positively skewed, negatively skewed or symmetrical. The peakedness of the
distribution is assessed using kurtosis. If a curve is more peaked than the normal curve, it is
called “Leptokurtic”, and if a curve is more flat-topped than the normal curve, it is called
“Platykurtic”. The normal curve itself is known as “Mesokurtic”.
Self-evaluation test
1. Some characteristics of the annual family income distribution (in Birr) in two regions are as
follows
Region Mean Median Standard deviation
A 6250 5100 960
B 6980 5500 940
7.1 Introduction
Regression and correlation analysis are used to study relationships among variables. Simple
regression analysis examines the relationship between two variables, which we call the dependent
and independent variables. We try to determine the change in the value of the
dependent variable for a unit change in the value of the independent variable. For example, by how
much will the yield per hectare increase as we increase the amount of fertilizer by one gram? By
how much will a student's GPA increase if he/she increases the time spent reading by one minute?
As the frequency of visits by the development agent increases by one day, by how much will the
probability of a farmer adopting new technology increase?
In this chapter, we will confine ourselves to regression involving only two variables
whose relationship is linear. Finally we will discuss the
measurement of the closeness of the relationship between variables, that is, correlation analysis.
Simple linear regression: refers to the linear relationship between two variables, the dependent
variable Y and the independent variable X.
A simple linear regression line is the line fitted to the points plotted in the scatter diagram
that describes the average relationship between the two variables. Therefore, to see the
type of relationship, it is advisable to prepare a scatter plot before fitting the model.
$$Y = \alpha + \beta X + \varepsilon$$
Where
Y = dependent variable
X = independent variable
α = regression constant
β = regression slope
ε = random error term
$$Y \sim N(\alpha + \beta X, \delta^2)$$
$$\varepsilon \sim N(0, \delta^2)$$
The equation of the line which is to be used in predicting the value of the dependent variable
takes the form $\hat{Y} = \hat{\alpha} + \hat{\beta} X_i$.
The most widely used and statistically accepted method of fitting such an
equation is the method of least squares. The method of least squares states that the values of
$\hat{\alpha}$ and $\hat{\beta}$ should be chosen so that the sum of squared residuals is minimized.
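As shown in the scatter diagram, if $\varepsilon_1, \varepsilon_2, \varepsilon_3, \dots, \varepsilon_n$
are the deviations of the observed Y values from the straight line (the predicted values
$\hat{Y}$), fitting a straight line in keeping with the above condition requires that (for a
sample of size n)
$$\varepsilon_1^2 + \varepsilon_2^2 + \dots + \varepsilon_n^2 = \sum_{i=1}^{n} \varepsilon_i^2 \quad \text{is minimum}$$
This can be done by partially differentiating $\sum \varepsilon_i^2$ with respect to $\hat{\alpha}$
and $\hat{\beta}$ and equating the derivatives to zero.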
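$$\varepsilon_i = Y_i - \hat{Y}_i$$
$$\sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2$$
$$\sum \varepsilon_i^2 = \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$$
$$\frac{\partial \sum \varepsilon_i^2}{\partial \hat{\alpha}} = \frac{\partial \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2}{\partial \hat{\alpha}} = 0$$
$$-2 \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$$
$$\sum Y_i - \sum \hat{\alpha} - \sum \hat{\beta} X_i = 0$$
$$\frac{n \hat{\alpha}}{n} = \frac{\sum Y_i}{n} - \frac{\hat{\beta} \sum X_i}{n}$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$
$$\frac{\partial \sum \varepsilon_i^2}{\partial \hat{\beta}} = \frac{\partial \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2}{\partial \hat{\beta}} = 0$$
$$-2 \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) X_i = 0$$
$$\sum Y_i X_i - \hat{\alpha} \sum X_i - \hat{\beta} \sum X_i^2 = 0$$
$$\sum Y_i X_i - (\bar{Y} - \hat{\beta} \bar{X}) \sum X_i - \hat{\beta} \sum X_i^2 = 0$$
$$\hat{\beta} = \frac{\sum Y_i X_i - \bar{Y} \sum X_i}{\sum X_i^2 - \bar{X} \sum X_i}$$
$$\hat{\beta} = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{n \sum X_i^2 - (\sum X_i)^2}$$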
Example 7.1: Suppose we want to study the relationship between input (number of workers) and
output (thousands of Birr) of five factories given in the table below.
Yi Xi
4 2
7 3
3 1
9 5
17 9
a. Find the regression line of Yi (output in thousands of Birr) on Xi (number of workers) using
the method of least squares.
b. Estimate the output Yi (thousands of Birr) that a factory will have if it employs
12 workers.
Solution: a.
Yi      Xi      YiXi    Xi²
4       2       8       4
7       3       21      9
3       1       3       1
9       5       45      25
17      9       153     81
∑Yi = 40, Ȳ = 8    ∑Xi = 20, X̄ = 4    ∑YiXi = 230    ∑Xi² = 120
$$\hat{\beta} = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{n \sum X_i^2 - (\sum X_i)^2} = \frac{5(230) - 40(20)}{5(120) - (20)^2} = \frac{1150 - 800}{600 - 400} = \frac{350}{200} = \frac{7}{4}$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = 8 - \frac{7}{4}(4) = 1$$
$$\hat{Y} = 1 + \frac{7}{4} X_i$$
b. For Xi = 12,
$$\hat{Y} = 1 + \frac{7}{4}(12) = 22$$
Consequently, if the factory employs 12 workers, its level of output will be 22,000 ETB.
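The least squares formulas used in Example 7.1 can be checked with a small Python sketch. The function name `fit_line` and the variable names are illustrative assumptions; the data are those of the example.

```python
def fit_line(x, y):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha = sum_y / n - beta * sum_x / n   # alpha = Y-bar - beta * X-bar
    return alpha, beta


workers = [2, 3, 1, 5, 9]   # Xi, number of workers
output = [4, 7, 3, 9, 17]   # Yi, thousands of Birr

alpha, beta = fit_line(workers, output)
predicted = alpha + beta * 12   # part b: a factory with 12 workers
print(alpha, beta, predicted)
```

The sketch returns an intercept of 1 and a slope of 1.75 (that is, 7/4), giving a predicted output of 22 thousand Birr for 12 workers.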
Regression of X on Y
Sometimes it is possible, and of interest, to fit the regression of X on Y, i.e. to treat
Y as the independent and X as the dependent variable. In such a case the general form of the
equation is given by
$$\hat{X} = \hat{\alpha} + \hat{\beta} Y_i$$
Applying the principle of least squares as before, the constants $\hat{\alpha}$ and $\hat{\beta}$
are given as follows
$$\hat{\alpha} = \bar{X} - \hat{\beta} \bar{Y}$$
$$\hat{\beta} = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{n \sum Y_i^2 - (\sum Y_i)^2}$$
7.3 Correlation
Correlation analysis helps us in determining the degree of relationship between two or more
variables. If two quantities move in such a way that movements in one are accompanied by
movements in the other, these quantities are correlated. For example, there exists some
relationship between the price of a commodity and the amount demanded, and between an increase in
rainfall (up to a point) and the production of wheat.
Simple, partial and multiple correlation: The distinction between simple, partial and multiple
correlation is based upon the number of variables studied. When only two variables are studied
it is a problem of simple correlation. When three or more variables are studied it is a problem of
either multiple or partial correlation. In multiple correlation three or more variables are studied
simultaneously. On the other hand, in partial correlation we recognize more than two variables,
but consider only two variables to be influencing each other, the effect of the other influencing
variables being kept constant.
Linear and non-linear correlation: The distinction between linear and non-linear correlation is
based upon the constancy of the ratio of change between the variables. If the amount of change
in one variable tends to bear a constant ratio to the amount of change in the other variable, the
correlation is said to be linear; if it does not bear a constant ratio to the amount of change
in the other variable, the correlation is called non-linear or curvilinear.
Positive and negative correlation: If both variables vary in the same direction, i.e. if
both variables are increasing or both variables are decreasing, correlation is said to be positive.
If, on the other hand, the variables vary in opposite directions, i.e. as one
variable increases the other decreases or vice versa, correlation is said to be negative.
In this study we will concentrate on the relationship between two variables, which is simple
correlation.
The Karl Pearson method, popularly known as Karl Pearson’s coefficient of correlation, is the most
widely used in practice. The Pearsonian coefficient of correlation is denoted by the symbol $r$.
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
This is termed the product-moment formula, and it can be further simplified as
$$r = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{\sqrt{[n \sum X_i^2 - (\sum X_i)^2][n \sum Y_i^2 - (\sum Y_i)^2]}}$$
Example 7.2: Find the Pearsonian coefficient of correlation for the two variables from the data
given below.
X 9 8 7 6 5 4 3 2 1
Y 15 16 14 13 11 12 10 8 9
Solution: N = 9
X       X²      Y       Y²      XY
9       81      15      225     135
8       64      16      256     128
7       49      14      196     98
6       36      13      169     78
5       25      11      121     55
4       16      12      144     48
3       9       10      100     30
2       4       8       64      16
1       1       9       81      9
∑X = 45    ∑X² = 285    ∑Y = 108    ∑Y² = 1,356    ∑XY = 597
$$r = \frac{n \sum Y_i X_i - \sum Y_i \sum X_i}{\sqrt{[n \sum X_i^2 - (\sum X_i)^2][n \sum Y_i^2 - (\sum Y_i)^2]}} = \frac{9(597) - 108(45)}{\sqrt{[9(285) - (45)^2][9(1356) - (108)^2]}} = \frac{513}{\sqrt{540 \times 540}} = 0.95$$
The value $r = 0.95$ shows that X and Y are strongly positively correlated.
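The simplified product-moment formula can be applied to the data of Example 7.2 with a short Python sketch. The function name `pearson_r` is an assumption made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from the simplified product-moment formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


X = [9, 8, 7, 6, 5, 4, 3, 2, 1]           # data of Example 7.2
Y = [15, 16, 14, 13, 11, 12, 10, 8, 9]

r = pearson_r(X, Y)
print(round(r, 2))
```

With the column totals of the table (513 in the numerator and 540 in the denominator), the sketch gives r = 0.95.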
The coefficient of correlation $r$ has the following properties:
1. $-1 \le r \le 1$
2. When $r = 0$ there is no linear correlation.
3. The correlation coefficient $r$ is independent of change of scale and origin.
4. The closeness of the relationship is not proportional to the value of $r$.
5. When $r = +1$, there is perfect positive correlation; when $r = -1$, there is perfect negative
correlation. The closer $r$ is to +1 or -1, the closer the relationship between the variables,
and the closer $r$ is to 0, the less close the relationship.
6. It is free from any unit of measurement used.
The Pearsonian coefficient of correlation cannot be used in cases when the direct quantitative
measurement of the phenomenon under study is not possible. In such cases, we make use of
Spearman’s rank correlation coefficient, which is defined as follows:
$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \quad \text{or} \quad R = 1 - \frac{6 \sum D^2}{N^3 - N}$$
where D is the difference between the two ranks of an item and N is the number of paired
observations. The steps are:
1. Rank the X values among themselves, giving rank (1) to the largest (or smallest) value,
(2) to the next largest (or smallest) value and so on.
2. Rank the Y values among themselves in the same way.
3. Find the sum of the squares of the differences between the ranks of the two variables.
Example 7.3: The rankings of 10 students in two subjects A and B are as follows. See if there
is any correlation between the students' performance in the two subjects.
A 6 5 3 10 2 4 9 7 8 1
B 3 8 4 9 1 6 10 7 5 2
Solution:
R1      R2      (R1 − R2)² = D²
6       3       9
5       8       9
3       4       1
10      9       1
2       1       1
4       6       4
9       10      1
7       7       0
8       5       9
1       2       1
∑D² = 36
$$R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 36}{10^3 - 10} = 1 - \frac{216}{990} = 0.782$$
Interpretation: Since $R = 0.782$, the ranks of the 10 students in the two subjects are fairly
similar, i.e. they are positively correlated.
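Example 7.3 can be verified with a few lines of Python. The function name `spearman_from_ranks` is an assumption, and the sketch expects ranks without ties.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's R = 1 - 6 * sum(D^2) / (N^3 - N) for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


A = [6, 5, 3, 10, 2, 4, 9, 7, 8, 1]   # ranks in subject A
B = [3, 8, 4, 9, 1, 6, 10, 7, 5, 2]   # ranks in subject B

R = spearman_from_ranks(A, B)
print(round(R, 3))
```

The sum of squared rank differences is 36, so the sketch reproduces R = 0.782.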
Equal ranks: If two or more individuals are ranked equal, it is customary to give each individual
an average rank. If two individuals are ranked equal at fourth place, they are each given the rank
$\frac{4+5}{2} = 4.5$; if three are ranked equal at fourth place, they are each given the rank
$\frac{4+5+6}{3} = 5$.
When equal ranks are assigned to some entries, an adjustment is made in the above formula for
calculating the rank coefficient of correlation. The adjustment consists of adding
$\frac{1}{12}(m^3 - m)$ to the value of $\sum D^2$, where $m$ stands for the number of items whose
ranks are common. If there is more than one such group of items with a common rank, the value is
added as many times as the number of such groups. The formula can thus be written:
$$R = 1 - \frac{6 \left\{ \sum D^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \dots \right\}}{N^3 - N}$$
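The adjustment for equal ranks can be sketched in Python as below. The function names and the four-item data set are hypothetical, and rank 1 is given to the smallest value (the text allows either direction).

```python
from collections import Counter


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(x, y):
    """Rank correlation with the equal-ranks adjustment added to sum(D^2)."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    adjusted = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * adjusted / (n ** 3 - n)


# Hypothetical scores, invented for illustration: two X values are tied.
x = [10, 20, 20, 30]
y = [1, 2, 3, 4]
print(average_ranks(x), spearman_with_ties(x, y))
```

In this illustration the two tied X values each receive rank 2.5, the sum of squared differences is 0.5 and one group with m = 2 adds another 0.5, so R = 1 - 6(1)/60 = 0.9.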
Chapter summary
Simple regression analysis measures the average relationship between two variables.
Regression analysis provides estimates of values of the dependent variable from values of the
independent variable. Correlation analysis helps us in determining the degree of relationship
between two or more variables. The relationship between variables can be positive or negative. The
degree of relationship between the variables under consideration is measured through
correlation analysis.
Self-evaluation test