1 Descriptive Part
Introduction
Uses of statistics
Statistics is used in almost all fields of human activity. Government bodies, private business firms, and
research agencies use it as a major tool. Some of its uses are:
✓ It helps in formulating and testing hypotheses and in developing new theories.
✓ It condenses and summarizes complex data.
✓ It helps to predict future trends.
Application area of statistics
✓ In research work: statistics is indispensable in research work
✓ In engineering areas and physical science
✓ In economics and biological science
✓ In social science, politics, etc.
Limitation of statistics
Although statistical methods are very useful, carrying out and interpreting statistical studies involves many
potential errors and limitations.
✓ Complete accuracy in statistics is often impossible.
✓ It cannot deal with a single value; it deals with a set of data.
✓ It cannot deal with qualitative data directly; it deals only with data that can be quantified. Ex: it does not
deal with marital status (married, single) as such, but it deals with the number of married and the number of
single people.
✓ Statistical values are true on average. The conclusions drawn from the analysis of a sample may differ
from the conclusions that would be drawn from the entire population. For this reason,
statistics is not an exact science.
Some Basic Terminologies in Statistics
Population:
✓ It is the totality of things, objects, people, etc. with which the researcher is concerned.
✓ It can be qualitative or quantitative, finite or infinite.
Sample:
✓ It is a portion or part of the population of interest.
Parameter:
✓ It is a numerical characteristic of an entire population (Greek letters)
Statistic:
✓ It is a numerical characteristic of a sample (Latin letters)
Variable:
✓ It is a characteristic that differs from object to object.
Examples: Weight, stock prices, height, price of gasoline
Types of variables
1. Quantitative variables:
✓ They are variables that can be expressed numerically.
✓ They are variables that assume values of the measurable quantity.
✓ It can be classified as:
a. Discrete variables:
o They are variables whose values are obtained by counting.
o The possible values for such variables are 0, 1, 2, .... Ex: number of children in a family,
number of trees in a forest.
b. Continuous variables:
o They are variables that can take any value between two numbers.
o Their values are obtained by measuring. Ex: weight, height, rainfall records.
2. Qualitative variables:
✓ They are variables that cannot be expressed numerically.
✓ They are also known as categorical variables.
Note:
✓ For a quantitative variable, an operation such as addition or averaging makes sense; for a qualitative
variable it does not.
✓ A categorical variable is also known as an attribute, whereas a quantitative variable is often referred
to simply as a variable.
✓ If the variable can assume only one value, it is called a constant.
✓ In general, measurements give rise to continuous data, while enumerations, or counts, give rise to
discrete data.
Data:
✓ It is the set of values collected for the variable for each of the elements of a population or sample
✓ Data are a numerical representation of a phenomenon.
✓ It is information expressed in quantitative form.
Types of Data
1. Nominal data - Categorical data where the categories are not ordered (e.g., ethnic group); that is, data
classified into categories that cannot be arranged in any particular order.
2. Ordinal data - Categorical data that can be ordered, but where the increment between specific values is
arbitrary; that is, data arranged in some order, but for which the differences between data values cannot be
determined or are meaningless.
3. Cardinal data - Data on a scale where addition is meaningful (e.g., a change of 3 inches in height).
There are two types of cardinal data:
a) Ratio-scale data - Cardinal data on a scale where ratios between values are meaningful (e.g.,
serum-cholesterol levels).
b) Interval-scale data - Cardinal data where the zero point is arbitrary. For such data, ratios are
not meaningful (e.g., Julian dates: we can calculate the number of days between two dates, but we
can't say that one date is twice as large as another date).
Note:
✓ For ratio data, the origin (i.e., the value zero) is meaningful, but the origin has no meaning for interval data.
Consequently, we can add and subtract interval data, but we cannot divide or multiply them. With ratio data we
can use all operations (i.e., addition, subtraction, division, and multiplication).
✓ Nominal and ordinal scales belong to qualitative variables, whereas interval and ratio scales are
quantitative.
Chapter Two
Methods of Data Collection and Presentation
Data can be collected in a variety of ways. One of the most common methods is through the use of surveys.
Question: What is a survey?
Survey:
✓ It is acquiring data from individuals, directly or indirectly.
✓ It can be conducted through the mail, telephone, personal interview, etc.
✓ There are two kinds of survey:
1. Census survey (complete enumeration survey):
• It is a survey that includes every element in the population.
2. Sample survey:
• It is a survey that includes only a subset of the population.
Note:
✓ If your data represents only a portion of the population you have a sample.
✓ If your data represents the entire population you have a census.
✓ A sample survey is better than a census because it reduces cost, reduces effort, and accommodates more
detailed information.
✓ A census is better than a sample survey when the population is small or when the population is
heterogeneous.
Definition: frequency distribution (f.d)
✓ It is organizing data in table form, using classes & frequencies.
✓ It shows how many observations fall in various categories.
✓ It can be classified as:
A. Categorical (qualitative) f.d
B. Numerical (quantitative) f.d
Ex: 25 army inductees were given a blood test to determine their blood type. The data set is
A B AB B O
O O AB B B
B B A O O
A O O O AB
AB A B O A
Note:
We can transform the frequency distribution into a relative frequency distribution, a percentage frequency
distribution, and a cumulative frequency distribution.
✓ To transform a f.d. to a relative f.d., we use the following formula:
Relative f.d. $= \frac{f}{n}$, where f = actual frequency and n = total frequency
✓ To transform a f.d. to a percentage distribution, we multiply the relative f.d. by 100%,
i.e. percentage distribution $= \frac{f}{n} \times 100\%$
✓ In order to transform f.d to cumulative f.d we have to define cumulative f.d
Definition: The cumulative frequency of a class is the sum of all frequencies preceding or succeeding
that class, including the frequency of that class. There are two types of cumulative frequency distributions, namely
the “less than” and “more than” cumulative frequency distributions.
I. The “less than” cumulative frequency distribution (LCF) of a class is obtained by adding the frequency
of the preceding classes including the frequency of that class.
II. The “more than” cumulative frequency distribution (MCF)of a class is obtained by adding the
frequency of the succeeding classes including the frequency of that class.
• From the above example, let us construct all forms of f.d.
Class   Frequency   Relative frequency   Percentage frequency   Cumulative frequency
                                                                LCF     MCF
A       5           0.20                 20%                    5       25
B       7           0.28                 28%                    12      20
O       9           0.36                 36%                    21      13
AB      4           0.16                 16%                    25      4
Note: From the above table we can construct
the relative f.d. as follows:
Class   Relative frequency
A       0.20
B       0.28
O       0.36
AB      0.16
the percentage f.d. as follows:
Class   Percentage frequency
A       20%
B       28%
O       36%
AB      16%
the cumulative f.d. as follows:
Class   Cumulative frequency
        LCF     MCF
A       5       25
B       12      20
O       21      13
AB      25      4
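The blood-type tables above can be reproduced with a short script. This is a minimal sketch in Python (the notes themselves do not prescribe any software); the variable names are illustrative, and the class order A, B, O, AB is copied from the table.

```python
from collections import Counter

# Blood types of the 25 inductees, copied from the example.
blood_types = (
    "A B AB B O "
    "O O AB B B "
    "B B A O O "
    "A O O O AB "
    "AB A B O A"
).split()

classes = ["A", "B", "O", "AB"]      # same class order as the table
counts = Counter(blood_types)
n = len(blood_types)                 # total frequency

frequency = [counts[c] for c in classes]
relative = [f / n for f in frequency]
percentage = [100 * r for r in relative]

# "Less than" cumulative frequency: running total from the first class down.
lcf = []
running = 0
for f in frequency:
    running += f
    lcf.append(running)

# "More than" cumulative frequency: running total from the last class up.
mcf = []
running = 0
for f in reversed(frequency):
    running += f
    mcf.append(running)
mcf.reverse()

for row in zip(classes, frequency, relative, percentage, lcf, mcf):
    print("{:<3} {:>3} {:>6.2f} {:>5.0f}% {:>4} {:>4}".format(*row))
```

The last LCF value and the first MCF value both equal n, which is a quick check that no observation was missed.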
12 17 12 14 16 18 16 18 12 16
17 15 15 16 12 15 16 16 12 14
15 12 15 15 19 13 16 18 16 14
Solution:
Step1. Determine the class, (i.e. the classes are 12, 13, 14, 15, 16, 17, 18, and 19)
Step2. Determine the frequency for each class
Therefore the f.d is as follows
Class 12 13 14 15 16 17 18 19
Frequency 6 1 3 6 8 2 3 1
b) Grouped f.d.: When the range of the data is large, the data must be grouped into classes that are more than one
unit in width, in what is called a grouped (continuous) f.d.
Ex: A machine produces the following number of rejects in each successive period of five minutes. Construct a f.d.
16 21 26 24 11 17 26 25 13 27
24 26 3 27 23 24 15 22 22 12
22 29 18 22 28 25 7 17 22 28
19 23 23 22 3 19 13 31 23 28
24 9 20 33 30 23 20 8 21 24
Solution:
Step1. Determine the classes
For a grouped f.d., a class can be stated in two ways: by class limits (CL) and by class boundaries (CB). To
form the classes, we use the following procedure:
I. Determine the number of classes (K). It can be calculated as $k = 1 + 3.322 \log_{10} n$, where k = number of
classes required (if the value is a decimal, round up to the next whole number) and n = number of observations in
the sample. Alternatively, we can find K by using $k = 2.5\, n^{1/4}$.
II. Determine the class width (interval, size) (W). It can be calculated as W = Range/K, where Range = max - min.
(If the value is a decimal, round up to the next whole number.)
III. Select the starting point, or lowest class limit (LCL). This can be the smallest data value. Add the width
to the starting point to get the lower limit of the next class. Keep adding W until
the number of classes becomes K.
IV. Subtract one unit from the lower limit of the 2nd class to get the upper limit of the 1st class. Then add the
class width to each upper limit to get all the upper limits.
V. Find the class boundaries by subtracting 0.5 from each lower class limit (LCL) and adding 0.5 to each upper
class limit (UCL).
Step2. Determine the frequency for each class
The completed grouped f.d is as follows:
Class limit Class boundary Frequency
3-7 2.5-7.5 3
8-12 7.5-12.5 4
13-17 12.5-17.5 6
18-22 17.5-22.5 13
23-27 22.5-27.5 17
28-32 27.5-32.5 6
33-37 32.5-37.5 1
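The grouping procedure can be traced step by step in code. The sketch below is one possible Python rendering, assuming "round to the next whole number" means rounding up (`math.ceil`) and that the starting point is the smallest observation; the variable names are illustrative.

```python
import math

# Number of rejects per five-minute period, copied from the example.
rejects = [
    16, 21, 26, 24, 11, 17, 26, 25, 13, 27,
    24, 26, 3, 27, 23, 24, 15, 22, 22, 12,
    22, 29, 18, 22, 28, 25, 7, 17, 22, 28,
    19, 23, 23, 22, 3, 19, 13, 31, 23, 28,
    24, 9, 20, 33, 30, 23, 20, 8, 21, 24,
]

n = len(rejects)
k = math.ceil(1 + 3.322 * math.log10(n))     # step I: number of classes
data_range = max(rejects) - min(rejects)
w = math.ceil(data_range / k)                # step II: class width

# Steps III-V: lower limits, upper limits, boundaries, then frequencies.
table = []
for i in range(k):
    lcl = min(rejects) + i * w               # lower class limit
    ucl = lcl + w - 1                        # upper class limit
    freq = sum(lcl <= x <= ucl for x in rejects)
    table.append((lcl, ucl, lcl - 0.5, ucl + 0.5, freq))

print("k =", k, " w =", w)
for lcl, ucl, lcb, ucb, freq in table:
    print(f"{lcl}-{ucl}   {lcb}-{ucb}   {freq}")
```

With n = 50 this gives seven classes of width 5 starting at 3, matching the completed table.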
Definition of graph:
✓ The word graph comes from the Greek word meaning “to draw or write.”
✓ We define a graph as a pictorial representation of a set of data.
✓ Many types of graphs are employed in statistics, depending on the nature of the data involved and the
purpose for which the graph is intended.
The step of pictorial representation comes after the raw data set has been pruned and organized.
The most common and simple forms of pictorial representation of data are:
✓ Bar chart
✓ Pie chart
✓ Histogram
Bar chart/bar diagram/bar graph
✓ It is used to display distributions of categorical variables.
✓ One bar per category – height is determined by frequency or relative frequency
✓ Order of categories is arbitrary.
✓ Does NOT let you talk about the shape of a distribution.
Features of a bar chart
✓ Bars can be horizontal or vertical
✓ Bars are of uniform width and uniformly spaced (leave space between bars to indicate that the categories are distinct).
✓ The length of the bar represents values of the variable being displayed, the frequency of occurrence, or
the percentage of occurrence. The same measurement scale is used for the length of each bar.
✓ The graph is well annotated with title, labels for each bar, & vertical scale or actual value for the length
of each bar.
✓ It can be classified as:
• Simple bar chart
• Component bar chart
• Multiple bar chart
Example: Construct a bar chart to show the religious affiliation of the American population.
Religion Number of population(million)
Protestant 79
Roman Catholic 31
Jewish 4
Others 2
Figure: Simple bar diagram (vertical bars) of the number of population (million) for Protestant, Roman Catholic, Jewish, and others.
Note:
✓ The above graph shows that each bar has an equal width but unequal length.
✓ The length indicates the number of population.
✓ It has a limitation because such a diagram can display only one classification or one category of data.
✓ The simple bars in the above figure are drawn vertically and are therefore known as vertical bars.
The same bars can also be drawn horizontally, as shown in the figure below.
Figure: Horizontal simple bar diagram of the number of population (million) for Protestant, Roman Catholic, Jewish, and others.
Component Bar Diagram: As the name of this diagram implies, it shows subdivisions of components in a
single bar. When it is desired to show how a total is divided into its components, we use a component bar chart.
In this type of bars different colors are used for identification.
Example: Display the following data on the yield of farmers in SNNPR using a suitable chart.
CROP/YEAR 1990 1991 1992 1993
PEAS 14 15 26 19
WHEAT 10 15 14 25
MAIZE 2 6 10 3
TOTAL 26 36 50 47
Figure: Component bar diagram of yield by year (1990 to 1993), each bar divided into peas, wheat, and maize.
Multiple Bars: When two or more interrelated series of data are depicted by a bar diagram, the diagram
is known as a multiple-bar diagram. Suppose we have the birth rate and death rate of five different countries. We
can display them by two bars close to each other for each country, one representing the birth rate and the other
the death rate. The figure shows such a diagram based on hypothetical data.
Example: The following table gives the birth rates and death rates of five different countries during 1998.
Country Birth Rate Death Rate
A 33 24
B 16 11
C 20 14
D 40 18
Figure: Multiple bar diagram of birth rate and death rate for countries A to D.
Example: Draw a pie diagram for the following data on Five-Year Plan public sector outlays.
Agriculture and rural Development 12.9%
Irrigation etc 12.5%
Energy 27.2%
Industry and minerals 15.4%
Transport communication 15.9%
Social services and others 16.1%
Solution: The angle at the center is given by
$\frac{\text{percentage outlay}}{100} \times 360^\circ = \text{percentage outlay} \times 3.6^\circ$
Sector                               Percentage outlay   Angle at the center
Agriculture and rural Development    12.9%               12.9 × 3.6 ≈ 46°
Irrigation etc                       12.5%               12.5 × 3.6 = 45°
Energy                               27.2%               27.2 × 3.6 ≈ 98°
Industry and minerals                15.4%               15.4 × 3.6 = 55.44°, taken as 56° so that the angles total 360°
Transport communication              15.9%               15.9 × 3.6 ≈ 57°
Social services and others           16.1%               16.1 × 3.6 ≈ 58°
Figure: Pie diagram of percentage outlay by sector.
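The angle calculation is simple enough to check by machine. The following Python sketch applies the 3.6° per percentage point rule to the outlay figures from the example; the dictionary name and layout are illustrative choices, not part of the original notes.

```python
# Percentage outlays copied from the example.
outlay = {
    "Agriculture and rural development": 12.9,
    "Irrigation etc": 12.5,
    "Energy": 27.2,
    "Industry and minerals": 15.4,
    "Transport communication": 15.9,
    "Social services and others": 16.1,
}

# Each percentage point corresponds to 360 / 100 = 3.6 degrees.
angles = {sector: pct * 3.6 for sector, pct in outlay.items()}

for sector, angle in angles.items():
    print(f"{sector:<36} {angle:7.2f} degrees")

# The unrounded angles always add up to a full circle.
print("total:", round(sum(angles.values()), 2))
```

Because each angle is rounded separately when drawing by hand, the rounded values can total 359° or 361°; adjusting one sector by a degree is a common remedy.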
Ogive curves
So far we have discussed graphic devices that show frequencies as they are given to us, that is,
non-cumulative frequencies. We now take up another type of graph, which is based on cumulative frequencies.
An ogive is a graph that represents the cumulative frequencies for the classes in a f.d.
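As an illustration, the points of a "less than" ogive can be computed from the grouped f.d. of machine rejects. This Python sketch assumes the common convention of plotting each LCF against the upper class boundary and starting the curve at zero on the lowest boundary; that convention is an assumption here, since the notes do not spell out the construction.

```python
from itertools import accumulate

# Class boundaries and frequencies from the grouped f.d. of rejects.
lower_boundary_first_class = 2.5
upper_boundaries = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5]
frequencies = [3, 4, 6, 13, 17, 6, 1]

# "Less than" cumulative frequencies: running totals of the frequencies.
lcf = list(accumulate(frequencies))

# Points (x, y) to join with line segments; the curve starts at zero.
ogive_points = [(lower_boundary_first_class, 0)] + list(zip(upper_boundaries, lcf))

for x, y in ogive_points:
    print(f"less than {x:>4}: {y}")
```

The final point always has a height equal to the total frequency (here 50).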
Chapter Three
Numerical representation of a data set
There are three basic ways to summarize numerical data. These are
1. Measure of Central Tendency(MCT)
2. Measure of Variation (Dispersion)
Measure of Central Tendency (MCT):
✓ Quantitative variables contained in raw data or in frequency tables can be summarized by means of a few
numerical values. A key element of this summary is called the MCT. It is also called a measure of average.
Three measures of the center of a distribution are commonly used: mean, median, and mode. Any of them can
be used with normally distributed data; however, with ordinal data, the mean of the raw scores is usually not
appropriate. Especially if one is computing certain statistics, the mean of the ranked scores of ordinal data
provides useful information. With nominal data, the mode is the only appropriate measure.
Mean
✓ It is a measure of location or central value for a continuous variable.
✓ Most useful when the data have a symmetric distribution and do not contain outliers.
✓ It is the most popular and best understood MCT for a quantitative data set. Thus, it is usually the statistic
of choice, assuming that the data are normally distributed.
Properties of the summation notation
1. $\sum_{i=1}^{n} i = 1 + 2 + 3 + \dots + n$
2. $\sum_{i=1}^{n} 1 = n$
3. $\sum_{i=1}^{n} c = nc$, where c is a constant number
4. $\sum_{i=1}^{n} (x_i + c) = \sum_{i=1}^{n} x_i + nc$
5. $\sum_{i=1}^{n} c\,x_i = c \sum_{i=1}^{n} x_i$, where c is a constant number
6. $\sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i$
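These properties can be checked numerically on any small data set. The Python sketch below uses hypothetical values for x, y, and c (they are not taken from the notes) and also shows that no analogous rule holds for products.

```python
# Hypothetical data chosen only for illustration.
x = [2, 5, 7, 10]
y = [1, 3, 4, 6]
c = 3
n = len(x)

checks = {
    "sum of 1 equals n": sum(1 for _ in range(n)) == n,
    "sum of c equals n*c": sum(c for _ in range(n)) == n * c,
    "sum of (x_i + c)": sum(xi + c for xi in x) == sum(x) + n * c,
    "sum of c*x_i": sum(c * xi for xi in x) == c * sum(x),
    "sum of (x_i + y_i)": sum(xi + yi for xi, yi in zip(x, y)) == sum(x) + sum(y),
    "sum of (x_i - y_i)": sum(xi - yi for xi, yi in zip(x, y)) == sum(x) - sum(y),
}
for name, holds in checks.items():
    print(f"{name}: {holds}")

# The sum of products is generally NOT the product of the sums.
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)
print(sum_of_products, product_of_sums)
```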
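Arithmetic mean (AM)
For population data:
✓ Raw (ungrouped) data: $AM = E(x) = \mu = \frac{\sum_{i=1}^{N} x_i}{N}$
✓ Ungrouped frequency distribution: $AM = E(x) = \mu = \frac{\sum_{i=1}^{k} f_i x_i}{N}$
✓ Grouped frequency distribution: $AM = E(x) = \mu = \frac{\sum_{i=1}^{k} f_i m_i}{N}$
For sample data:
✓ Raw (ungrouped) data: $AM = M(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
✓ Ungrouped frequency distribution: $AM = M(x) = \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$
✓ Grouped frequency distribution: $AM = M(x) = \bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{n}$
Where:
✓ $x_i$ = the i-th value, $f_i$ = the frequency of the i-th value or class, $m_i$ = the midpoint (class mark) of the i-th class,
✓ k = the number of distinct values or classes, and N (or n) = the total number of observations, i.e. the sum of the frequencies.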
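A short worked example may help connect the three forms of the mean. The Python sketch below applies them to data that already appear in these notes: the set 6, 18, 30; the ungrouped f.d. with classes 12 to 19; and the grouped f.d. of machine rejects. The function names are illustrative, and class midpoints are taken as the average of the class limits.

```python
def mean_raw(values):
    """Mean of raw data: sum of the values divided by their number."""
    return sum(values) / len(values)

def mean_frequency(points, freqs):
    """Mean from a frequency table.

    `points` holds the values x_i (ungrouped f.d.) or the class
    midpoints m_i (grouped f.d.); `freqs` holds the frequencies f_i.
    """
    return sum(f * p for p, f in zip(points, freqs)) / sum(freqs)

# Ungrouped frequency distribution from the notes.
values = [12, 13, 14, 15, 16, 17, 18, 19]
freqs = [6, 1, 3, 6, 8, 2, 3, 1]

# Grouped frequency distribution of machine rejects from the notes.
class_limits = [(3, 7), (8, 12), (13, 17), (18, 22), (23, 27), (28, 32), (33, 37)]
class_freqs = [3, 4, 6, 13, 17, 6, 1]
class_marks = [(low + high) / 2 for low, high in class_limits]

print(mean_raw([6, 18, 30]))
print(round(mean_frequency(values, freqs), 2))
print(round(mean_frequency(class_marks, class_freqs), 2))
```

The grouped mean is an approximation: it treats every observation in a class as if it sat at the class midpoint, so it need not equal the mean of the raw data exactly.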
Note:
✓ There can be more than one mode, or there may be no mode when all observations in the data set have
equal frequency.
✓ When all the values occur the same number of times, we usually say that there is no unique mode.
The following formulas give the median and the mode (for sample data).
Median, for ungrouped data (values arranged in order):
✓ If n is odd: Median = the $\left(\frac{n+1}{2}\right)^{th}$ value
✓ If n is even: Median = $\frac{\left(\frac{n}{2}\right)^{th}\text{ value} + \left(\frac{n}{2}+1\right)^{th}\text{ value}}{2}$
Median, for grouped data:
$\text{Median} = LCB + \left(\frac{\frac{n}{2} - Cf}{f_{mi}}\right) w$
Where:
✓ LCB = the lower class boundary of the median class,
✓ Cf = the LCF of the class preceding (listed above) the median class,
✓ $f_{mi}$ = the frequency of the median class, and
✓ w = the width of the median class.
Mode, for ungrouped data:
✓ Mode = the value that has the highest frequency
Mode, for grouped data:
$\text{Mode} = l_o + \left(\frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)}\right) w$
Where:
✓ $l_o$ = the lower class boundary of the modal class,
✓ $f_1$ = the frequency of the modal class,
✓ $f_0$ = the frequency of the class preceding the modal class,
✓ $f_2$ = the frequency of the class succeeding the modal class, and
✓ w = the class width.
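To see the median and mode formulas in action, the sketch below applies them in Python to the grouped f.d. of machine rejects (class boundaries 2.5-7.5 up to 32.5-37.5). The function names are illustrative. Two assumptions are made: the median class is taken as the first class whose LCF reaches n/2, and when several classes tie for the highest frequency the first one is used as the modal class.

```python
def median_ungrouped(values):
    """Median of raw data, following the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]          # ((n+1)/2)th value
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def median_grouped(lower_boundaries, freqs, w):
    """Median = LCB + ((n/2 - Cf) / f_mi) * w."""
    n = sum(freqs)
    cf = 0                                        # LCF of preceding classes
    for lcb, f in zip(lower_boundaries, freqs):
        if cf + f >= n / 2:                       # median class found
            return lcb + (n / 2 - cf) / f * w
        cf += f

def mode_grouped(lower_boundaries, freqs, w):
    """Mode = l_o + ((f1 - f0) / ((f1 - f0) + (f1 - f2))) * w."""
    m = freqs.index(max(freqs))                   # first modal class
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0
    return lower_boundaries[m] + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * w

lower_boundaries = [2.5, 7.5, 12.5, 17.5, 22.5, 27.5, 32.5]
freqs = [3, 4, 6, 13, 17, 6, 1]

print(median_ungrouped([6, 18, 30]))
print(round(median_grouped(lower_boundaries, freqs, 5), 2))
print(round(mode_grouped(lower_boundaries, freqs, 5), 2))
```

For this table the median class is 18-22 (Cf = 13, frequency 13) and the modal class is 23-27 (frequency 17), so the grouped median is about 22.12 and the grouped mode about 23.83.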
Note:
The value of central tendency, however, does not completely describe the data. Therefore, some additional
characteristics of the data must be used to provide for a more complete summary and description of the data and
to distinguish between dissimilar data sets. The next section deals with this additional characteristic, the
variability of the data.
Example: Consider the following two sets of data.
i. 6, 18, 30 and
ii. 17, 18, 19
$\bar{x}_{i} = \frac{6 + 18 + 30}{3} = \frac{54}{3} = 18$ and $\bar{x}_{ii} = \frac{17 + 18 + 19}{3} = \frac{54}{3} = 18$
Observation: Even though the two sets of data have the same arithmetic mean, the values in i are more scattered
or dispersed than those in ii.
When comparing sets of data, it is useful to have a way of measuring the scatter or spread of the data.
✓ Variation or dispersion is the degree to which numerical data is scattered or spread about some measure
of central tendency (usually the mean).
Variance (V): Variance also indicates a relationship between the mean of a distribution and the data points; it is
determined by averaging the squared deviations from the mean. Squaring the differences instead of taking the
absolute values allows for greater flexibility in further algebraic manipulation of the data. Another measure of
variation is the standard deviation, which is the square root of the variance.
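The following formulas give the variance and the standard deviation.
Variance: for population data
✓ Raw (ungrouped) data: $\sigma^2 = E(x - \mu)^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
✓ Ungrouped frequency distribution: $\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{N}$
✓ Grouped frequency distribution: $\sigma^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \mu)^2}{N}$
Variance: for sample data
✓ Raw (ungrouped) data: $S^2 = M(x - \bar{x})^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
✓ Ungrouped frequency distribution: $S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n - 1}$
✓ Grouped frequency distribution: $S^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1}$
Standard deviation: for population data (all three cases)
✓ $\sigma = \sqrt{\sigma^2}$
Standard deviation: for sample data (all three cases)
✓ $S = \sqrt{S^2}$
Where:
✓ $x_i$ = the i-th value, $f_i$ = the frequency of the i-th value or class, $m_i$ = the midpoint (class mark) of the i-th class,
✓ k = the number of distinct values or classes, $\mu$ and $\bar{x}$ = the population and sample means, and N and n = the
population and sample sizes.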
Note:
✓ The denominator in the sample variance formula is n - 1. This is because, with n in the denominator, the
sample variance underestimates the population variance.
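The two small data sets from the dispersion example (6, 18, 30 and 17, 18, 19) make the effect of these formulas easy to see. The Python sketch below computes the sample variance and standard deviation with the n - 1 denominator, and the population variance with the N denominator for comparison; the function names are illustrative.

```python
import math

def population_variance(data):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Sum of squared deviations from the mean, dividing by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

set_i = [6, 18, 30]
set_ii = [17, 18, 19]

for name, data in (("i", set_i), ("ii", set_ii)):
    s2 = sample_variance(data)
    print(f"set {name}: S^2 = {s2}, S = {math.sqrt(s2)}")

# Dividing by N instead of n - 1 gives a smaller value for the same data.
print(population_variance(set_i), sample_variance(set_i))
```

Both sets have mean 18, yet set i has a sample standard deviation of 12 while set ii has 1, which quantifies the difference in spread noted earlier.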