CH-2 Stat I


CHAPTER TWO

DATA COLLECTION AND PRESENTATION

INTRODUCTION

The term “data collection” refers to all the issues related to data sources, scope of investigation
and sampling techniques. This chapter begins with the meaning of data collection and then
turns to the two sources of data, namely primary and secondary sources. In addition, the
different methods of collecting data from primary sources are discussed.

2.1. MEANING OF COLLECTION OF DATA

Collection of data implies a systematic and meaningful assembly of information for the
accomplishment of the objective of a statistical investigation. It refers to the methods used in
gathering the required information from the units under investigation.

The quality of data greatly affects the final output of an investigation. Hence, utmost care should
be taken in the data collection process, and every possible precaution should be taken to ensure
accuracy while collecting data. Otherwise, with inaccurate or inadequate data, the whole
analysis is likely to be faulty and the decisions based on it will be misleading.

PRIMARY AND SECONDARY DATA


Meaning and distinction between primary and secondary data

Statistical data may be obtained from either a primary or a secondary source. A primary source is a
source from which first-hand information is gathered. A secondary source, on the other hand, is
one that makes available data that were collected by some other agency. Clearly, a source
that is not primary is necessarily a secondary source. Primary sources are original sources of
data.

Data obtained from a primary source are called primary data. Likewise, data gathered from a
secondary source are known as secondary data. For example, assume that a simple study is to be
conducted to see the age distribution of citizens living with HIV/AIDS. Clearly, the variable of study
is age. Data about the age of these citizens may be obtained by interviewing them directly.
In this case, the citizens themselves are the primary sources, and the data collected from them
are primary data. Alternatively, one may use records of hospitals and other related agencies to
obtain the ages without the need to trace the individuals personally. The records of the
hospitals, in this case, are secondary sources and the data copied from such records are
secondary data.

In most cases, secondary data is obtained from such sources as census and survey reports, books,
official records, reported experimental results, previous research papers, bulletins, magazines,
newspapers, web sites, and other publications. Different organizations and government agencies
publish information (data) in the form of reports, periodicals, journals, etc. In the case of
Ethiopia, the Central Statistics Authority (CSA) is the first to be mentioned in publishing such
relevant information (secondary data).

Advantages and Disadvantages of Primary and Secondary data

The following are major advantages of primary data over secondary data.

• Primary data give more reliable, accurate and adequate information, which is
  suitable to the objective and purpose of an investigation.
• A primary source usually shows data in greater detail.
• Primary data are free from errors that may arise from copying figures from
  publications, which is a risk with secondary data.

The disadvantages of primary data are:

• The process of collecting primary data is time consuming and costly.
• Often, primary data give misleading information due to lack of integrity of
  investigators and non-cooperation of respondents in providing answers to certain
  delicate questions.

Advantages of secondary data:

• It is readily available and hence more convenient and much quicker to obtain than primary
  data.
• It reduces time, cost and effort as compared to primary data.
• Secondary data may be available on subjects (cases) where it is impossible to collect
  primary data, for example regions where there is war.

Some of the disadvantages of secondary data are:

• Data obtained may not be sufficiently accurate.
• Data that exactly suit our purpose may not be found.
• Errors may be made while copying figures.

The choice between primary data and secondary data is determined by factors like the nature and
scope of the enquiry, availability of financial resources, availability of time, degree of accuracy
desired, and the collecting agency. Often, primary data are used in situations where secondary
data do not provide an adequate basis for analysis. That is, when the secondary data do not suit a
specific investigation, we use primary data. Apart from such cases, most statistical investigations
rest upon secondary data since they minimize cost and save time. Nevertheless, the following
points should be carefully considered while using secondary data in an investigation.

• One should closely examine whether or not the data are suitable for the intended study.
• The source of the data should be examined for reliability. If there is any doubt about the
  reliability of the data, they should not be used.
• It should be checked that the data are not obsolete.
• In case the data are based on a sample, one should see whether the sample is a proper
  representative of the population.
• The original (primary) data should have been collected and handled carefully by skilled persons.

Finally, it should be clear that primary data in the hands of one person might be secondary in the
hands of another. That is why it is often said, “the difference between primary and secondary
data is largely one of degree.”

Methods of collecting primary data

After discussing the two sources of data, primary and secondary, it is logical to say a few words
about the methods employed in collecting data from its original or primary source.
Many authors commonly state three methods of collecting primary data. These are:

a. Personal Enquiry Method (Interview method)
b. Direct Observation
c. Questionnaire method

a) Personal Enquiry Method (Interview method)


In the personal enquiry method, a question sheet called a schedule is prepared. The schedule
contains all the questions that would extract a complete report from a respondent. Usually,
schedules are pre-tested so as to remove discrepancies such as ambiguous or irrelevant
questions. This pre-testing process is called a pilot survey. It is worth mentioning
that the schedule is not given directly to the respondent. Rather, it is the interviewer who asks
the questions on the schedule and jots down the interviewee’s (respondent’s) responses.
Depending on the nature of the interview, the personal enquiry method is further classified into two
types.

• Direct Personal Interview: It is a type of personal enquiry where there is face-to-face
  contact with the persons from whom the information is to be obtained. In other words, the
  investigator contacts each respondent personally, without the interference of a third party,
  asks the questions given in the schedule one by one and notes down the respondent’s replies
  on the schedule.
• Indirect Personal Enquiry (Interview): It is the second type of personal enquiry, where
  the investigator contacts third parties, called witnesses, who are capable of supplying the
  necessary information. Here, the information is not collected directly from the respondent
  but from a third person who knows the respondent well. Such an approach is useful in
  cases where the respondent is expected to conceal information about himself or herself. For
  example, if an enquiry about the habit of using condoms is conducted in a village, most
  of the villagers may not provide the correct information. Thus, it would be wiser to get
  the required information from other parties, like the nearby shop that sells condoms.

b) Direct Observation
In this approach, an investigator stays at the place of survey and notes down the observations
personally. No enquiries are made in the case of direct observation. For example, an investigator
making a study on the nutritional status of children may directly (physically) measure the weight,
height, and other required parameters himself/herself. Direct observation is more experimental
and is usually applied in scientific studies. It is time consuming and also costly.
c) Questionnaire Method
Under this method, a list of questions related to the survey is prepared and sent to the various
respondents by post, web sites, e-mail, etc. It is a method that is often used in statistical
investigations. However, it cannot be used if the respondent is illiterate.

The following are the major points that need to be taken into account while preparing a
questionnaire.

• The number of questions should be small. Respondents are not comfortable with
  lengthy questionnaires, which usually bore them. Hence, fifteen to
  twenty-five questions in a questionnaire are optimal. If a lengthy questionnaire is
  unavoidable, it should preferably be divided into two or more parts.

• The questions should be short, clear, simple and unambiguous. Moreover, the questions must
  be arranged in a logical order so that a natural and spontaneous reply to each is induced. For
  instance, it is not appropriate to ask a person how many packets of cigarettes he/she smokes
  before asking whether he/she smokes or not.

• Questions of a sensitive nature should be avoided. Sensitive questions are those that
  are too personal or pecuniary, like “sources of income”, “drinking habit”, etc. The logic
  here is that respondents do not willingly answer sensitive questions. Such information, if
  necessary, may be gathered through interview or through other indirect questions.

• Questions should be capable of objective answers. As much as possible, avoid subjective
  questions and keep to questions of fact. To this end, multiple-answer questions can be used.

• Mail questionnaires should be accompanied by a covering letter, which should state the
  purpose of the questionnaire, promise confidentiality of responses, etc.

Furthermore, the questions should preferably be designed in such a way that they can easily be
answered as yes/no.

2.2. LEVEL (SCALE) OF MEASUREMENT
There are four general levels of measurement: nominal, ordinal, interval and ratio.

i. Nominal Level
The terms nominal level of measurement and nominal scale are commonly used to refer to data
that can only be classified into categories. In the strict sense of the words, however, there are no
measurements and no scales involved. Instead, there are just counts.

Look at the information presented in the table below.

Religion reported by the population of the United States 14 years old and older

Religion                       Total
Protestant                78,952,000
Roman Catholic            30,669,000
Jewish                     3,868,000
Other religion             1,545,000
No religion                3,195,000
Religion not reported      1,104,000
Total                    119,333,000

In the above table, the order in which the religions are listed could have been changed. This
indicates that for the nominal level of measurement, there is no particular order for the groupings.
Further, the categories are considered to be mutually exclusive.
The nominal level is considered the most primitive, the lowest or the most limited type of
measurement.

ii. Ordinal Level
Look at the data below.

Ratings of the company commander

Rating      Number of nurses
Superior    6
Good        28
Average     25
Poor        17
Inferior    0

The table lists the ratings of a company commander by the nurses under her command. This is an
illustration of the ordinal level of measurement. One category is higher than the next one; that is,
“superior” is a higher rating than “good”, “good” is higher than “average”, and so on.

If 1 is substituted for “superior”, 2 for “good” and so on, a 1 ranking is obviously
higher than a 2 ranking, and a 2 ranking is higher than a 3 ranking. However, it cannot be said
that (as an example) a company commander rated good is twice as competent as one rated
average, or that a company commander rated superior is twice as competent as one rated good. It
can only be said that a rating of superior is greater than a rating of good, and a good rating is
greater than an average rating.

The major difference between a nominal level and an ordinal level of measurement is the
“greater than” relationship between the ordinal-level categories. Otherwise, the ordinal scale of
measurement has the same characteristics as the nominal scale; namely, the categories are
mutually exclusive and exhaustive.

iii. Interval level


The interval scale of measurement is the next higher level. It includes all the characteristics of
the ordinal scale, but in addition, the distance between values is a constant size. If one
observation is greater than another by a certain amount, and the zero point is arbitrary, the
measurement is on at least an interval scale. For example, the difference between temperatures of
70 degrees and 80 degrees is 10 degrees. Likewise, a temperature of 90 degrees is 10 degrees
more than a temperature of 80 degrees, and so on. Scores on a statistics or mathematics
examination are also examples of the interval scale of measurement.
iv. Ratio level
Ratio level is the highest level of measurement. This level has all the characteristics of the interval
level. The distances between numbers are of a known, constant size; the categories are mutually
exclusive, and so on.

The major differences between interval and ratio levels of measurement are these: (1) ratio-level
data have a meaningful zero point and (2) the ratio between two numbers is meaningful. Money is
a good illustration: having zero dollars has meaning, namely that you have none! Weight is another
ratio-level measurement.

If the dial on a scale is at zero, there is a complete absence of weight. Also, if you earn $40,000 a
year and John earns $10,000, you earn four times what he does. Likewise, if you weigh 80 kg
and John weighs 40 kg, you weigh twice as much as John. Such ratio comparisons are not
meaningful at the interval level of measurement.
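
The contrast between interval and ratio levels can be checked numerically. The sketch below is only an illustration: the Celsius and Fahrenheit scales, the temperatures 40 and 20, and the kilogram-to-pound factor are assumptions chosen for the example, not values from the text. It shows that a ratio of two temperatures changes when the unit changes, while a ratio of two weights does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32


def kg_to_lb(kg):
    """Convert kilograms to pounds (1 kg is about 2.20462 lb)."""
    return kg * 2.20462


# Interval scale: the zero point is arbitrary, so ratios depend on the unit.
ratio_c = 40 / 20
ratio_f = celsius_to_fahrenheit(40) / celsius_to_fahrenheit(20)

# Ratio scale: zero means "none", so ratios survive a change of unit.
ratio_kg = 80 / 40
ratio_lb = kg_to_lb(80) / kg_to_lb(40)

print("temperature ratio:", ratio_c, "in Celsius,", round(ratio_f, 3), "in Fahrenheit")
print("weight ratio:", ratio_kg, "in kg,", round(ratio_lb, 3), "in lb")
```

Because the temperature ratio is not the same in the two units, a statement such as "twice as hot" has no fixed meaning, whereas "twice as heavy" holds in any unit of weight.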

2.3. CLASSIFICATION OF DATA

After collecting relevant information (data) for the purpose of a statistical investigation, the next
important task is the classification and presentation of these data. It is difficult to grasp the meaning
of any considerable volume of numerical data unless their mass is somehow reduced to
relatively few convenient classes or categories and presented with the help of some kind of
visual aid.

This section discusses classification of data. Presentation of data using graphs and charts will be
seen in the next unit.

Classification is the process of arranging things in groups or classes according to their
resemblance.

Purposes of Classification:
• To eliminate unnecessary detail
• To bring out clearly points of similarity and dissimilarity
• To enable one to form mental pictures of objects on measurements
• To enable one to make comparisons and draw inferences
2.4. METHODS (TYPES) OF CLASSIFICATION
1. Geographical Classification: Data are arranged according to places like continents,
regions, and countries.
Example

Region          Dominant Language Spoken
East Africa     Swahili
West Africa     French
North Africa    Arabic
South Africa    English

2. Chronological Classification: Data are arranged according to time, like year or month.
Example

Year (in EC)    Population (in million)
1974            30
1986            52
1991            60

3. Qualitative Classification: Data are arranged according to attributes like color, religion,
marital status, sex, educational background, etc.
Example 3.

            Employees in Factory X
      Educated              Uneducated
    Male    Female        Male    Female

4. Quantitative Classification: In this type of classification, the statistical data are classified
according to some quantitative variable. The variable may be either discrete or continuous.

Example 4.

Mr. X    Height (X) in cm
A        160
B        182
C        175
D        178

A. Discrete Variables – are variables that are associated with enumeration or counting
Example
Number of students in a class
Number of children in a family, etc

B. Continuous Variables – are variables associated with measurement.


Example
Weights of 10 students.
The heights of 12 persons.
Distance covered by a car between two stations etc.

2.4.1. FREQUENCY DISTRIBUTION

Once the raw data have been collected, they should be put into an ordered array, in ascending
or descending order, so that they can be examined more objectively. The data are then
organized into a frequency distribution (FD), which simply lists the values or classes with their
corresponding frequencies in tabular form. Here, frequency refers to the number of times a certain
value occurs in the data.
The tabular representation of the values of a variable together with their corresponding frequencies is
called a Frequency Distribution (FD).
Definition:

A frequency distribution is the organization of raw data in table form, using classes and
frequencies.

Frequency distributions are of two kinds.

A. Ungrouped Frequency Distribution (UFD)
An ungrouped frequency distribution shows each value of a variable together with its frequency.
Example 7. Consider the number of children in 15 families.
1 0 3 2 0
2 4 1 3 1
4 1 2 2 3
Construct an ungrouped FD for the above data.
Solution:

No. of Children (Values)    No. of Families (Tallies)    Frequency
0                           //                           2
1                           ////                         4
2                           ////                         4
3                           ///                          3
4                           //                           2
Total                                                    15
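
The tally in Example 7 can also be reproduced programmatically. The following is a minimal sketch using Python's standard `collections.Counter`; the function name `ungrouped_fd` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter

# Number of children in the 15 families of Example 7
children = [1, 0, 3, 2, 0,
            2, 4, 1, 3, 1,
            4, 1, 2, 2, 3]


def ungrouped_fd(data):
    """Return (value, frequency) pairs sorted by value."""
    return sorted(Counter(data).items())


fd = ungrouped_fd(children)
total = sum(freq for _, freq in fd)

for value, freq in fd:
    # One tally mark per observation, followed by the frequency
    print(f"{value:>5}  {'/' * freq:<6} {freq:>3}")
print(f"Total {total:>10}")
```

The frequencies it prints should agree with the table above, and their total should equal the number of observations.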

Exercise
Consider the following scores in a statistics test obtained by 20 students in a given class.
10, 4, 4, 7, 5, 7, 7, 8, 5, 7, 8, 5, 10, 8, 7, 5, 7, 8, 7, 4
Prepare an ungrouped FD
B. Grouped Frequency Distribution (GFD)
If the volume of data is very large, it is necessary to condense the data into an appropriate
number of classes or groups of values of a variable and indicate the number of observed values
that fall into each class. A GFD is therefore a frequency distribution in which the values of a
variable are grouped into classes, each shown with the number of observations in that class.
Example * 2.8

Values (xi)       1 – 25    26 – 50    51 – 75    76 – 100
Frequency (fi)    3         10         18         6

COMMON TERMINOLOGIES IN A GFD
i. Class: a group of values of a variable between two specified numbers called the lower class limit
(LCL) and the upper class limit (UCL).

In Example *, the GFD contains four classes: 1 – 25, 26 – 50, 51 – 75, and 76 – 100.

LCL1 = 1,  UCL1 = 25      LCL3 = 51, UCL3 = 75
LCL2 = 26, UCL2 = 50      LCL4 = 76, UCL4 = 100

ii. Class Frequency (or simply Frequency): the number of observations
corresponding to a class.

In Example *, the class frequencies of the 1st, 2nd, 3rd and 4th classes are 3, 10, 18 and
6 respectively.

iii. Class Boundaries: boundaries obtained by subtracting half of the unit of measurement
(u) from the lower limit of a class and adding ½(u) to its upper limit,
i.e. UCBi = UCLi + ½(u)
     LCBi = LCLi - ½(u)
where UCBi = upper class boundary and
      LCBi = lower class boundary of the ith class.
Remark: The unit of measurement (u) is the gap between any two successive classes, i.e.
u = lower limit of a class – upper limit of the preceding class.

In Example *, consider the 2nd class, 26 – 50. Since u = 26 – 25 = 1,

LCL2 = 26                    UCL2 = 50
LCB2 = 26 - ½(1) = 25.5      UCB2 = 50 + ½(1) = 50.5

iv. Class Width (size of a class or class interval): it is the difference between the upper and
lower class limits or the difference between the upper and lower class boundaries of any class.

Remarks:
1. If both the LCL & UCL are included in a class, it is called an inclusive class. For
inclusive classes,
Class width (cw) = UCBi - LCBi

2. If LCL is included and the UCL is not included in a class, it is called an exclusive class.
For exclusive classes

cw = UCLi – LCLi

To be consistent, we use inclusive classes.

v. Class Mark (cm): the midpoint (center) of a class,

cmi = (UCBi + LCBi) / 2

Note: the difference between any two successive class marks is equal to the width of a class.
vi. Range (R): the difference between the largest (L) and the smallest (S) values in a
data set,

R = L – S
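
The definitions of class boundaries, class width and class mark can be collected into one short calculation. The sketch below assumes inclusive classes and uses a hypothetical helper named `class_summary`; it is applied to the class 26 – 50 with u = 1.

```python
def class_summary(lcl, ucl, u=1):
    """Boundaries, width and mark of an inclusive class lcl to ucl.

    u is the unit of measurement (the gap between successive classes).
    """
    lcb = lcl - u / 2          # LCB = LCL - ½(u)
    ucb = ucl + u / 2          # UCB = UCL + ½(u)
    width = ucb - lcb          # cw = UCB - LCB for inclusive classes
    mark = (ucb + lcb) / 2     # cm = (UCB + LCB) / 2
    return {"LCB": lcb, "UCB": ucb, "cw": width, "cm": mark}


second = class_summary(26, 50)
print(second)
```

For the class 26 – 50 this gives boundaries 25.5 and 50.5, a class width of 25 and a class mark of 38.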
Exercise: Consider the following GFD.

Class      Frequency (f)
5 – 9      2
10 – 14    6
15 – 19    12
20 – 24    7
25 – 29    3
Total      30

a. What is the class frequency of the 3rd class?
b. How many observations (items) fall into the last class?
c. Find
   i. the LCL and UCL of the fourth class
   ii. the UCB and LCB of the third class
   iii. the class interval (class width) of the fifth class
   iv. the class mark (midpoint) of the second class

RULES FOR FORMING A GROUPED FREQUENCY DISTRIBUTION
To construct a GFD, the following points should be considered.
1) The classes should be clearly defined. That is, each observation should fall into one and
only one class.
2) The number of classes should be neither too large nor too small.
Normally, 5 to 20 classes are recommended.
3) All the classes should be of the same width. An approximate suitable class width can be
obtained as:

cw ≈ Range / Number of classes, i.e. cw ≈ R / n = (L – S) / n

Example 8. Let R / n = 6.8263.
o If all the observations are whole numbers, cw = 7
o If all the observations are to one decimal place, cw = 6.8
o If all the observations are to two decimal places, cw = 6.83, etc.
Note that a suitable number of classes can be obtained by using the formula
n ≈ 1 + 3.322 log N,
rounded up or down to the nearest whole number, where N is the total number of observations
and log is the logarithm to base 10.

Remark: Unequal class intervals create problems in graphing and in computing some statistical
measures.

4) Determine the class limits.
i. Determine the lower class limit of the first class (LCL1), then
LCL2 = LCL1 + cw, LCL3 = LCL2 + cw, …, LCLi+1 = LCLi + cw
ii. Determine the upper class limit of the first class (UCL1), i.e.
UCL1 = LCL1 + cw – u, where u = the unit of measurement, then
UCL2 = UCL1 + cw, UCL3 = UCL2 + cw, …, UCLi+1 = UCLi + cw
5) Complete the GFD with the respective class frequencies.
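
Rules 3 and 4 can be sketched in a few lines of code. The function names `number_of_classes` and `class_limits` are hypothetical, and the inputs N = 30, LCL1 = 16 and cw = 10 are illustrative values, so this is a sketch of the rules rather than a definitive procedure.

```python
import math


def number_of_classes(N):
    """n ≈ 1 + 3.322 log10(N), rounded to the nearest whole number."""
    return round(1 + 3.322 * math.log10(N))


def class_limits(lcl1, cw, n, u=1):
    """Inclusive class limits (LCLi, UCLi) for n classes of width cw."""
    limits = []
    for i in range(n):
        lcl = lcl1 + i * cw        # LCLi+1 = LCLi + cw
        ucl = lcl + cw - u         # UCLi = LCLi + cw - u
        limits.append((lcl, ucl))
    return limits


n = number_of_classes(30)
limits = class_limits(16, 10, n)
print(n)
print(limits)
```

Rounding to the nearest whole number is one reading of "rounded up or down"; an analyst may still prefer a different n when it gives more convenient class limits.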
Example 9. The number of customers on 30 consecutive days in a supermarket was recorded as
follows:

20 48 65 25 48 49
35 25 72 42 22 58
53 42 23 57 65 37
18 65 37 16 39 42
49 68 69 63 29 67
a. Construct a GFD with a suitable number of classes.
b. Complete the distribution obtained in (a) with class boundaries and class marks.
Solution:
i. Range = largest value – smallest value
         = 72 – 16 = 56
ii. N = 30 (total number of observations)
    Number of classes, n = 1 + 3.322 log 30
                         = 1 + 3.322 (1.4771)
                         = 5.9
    Hence a suitable number of classes is n = 6.
iii. Class width, cw = Range / n = 56 / 6 = 9.33
     For the sake of convenience, take cw to be 10 (note that it is also possible to
     choose cw to be 9).
iv. Take the lower limit of the 1st class (LCL1) to be 16 and u = 1,
    i.e. LCL1 = 16 and UCL1 = LCL1 + cw – u = 16 + 10 – 1 = 25
    LCL2 = LCL1 + cw = 16 + 10 = 26    UCL2 = UCL1 + cw = 25 + 10 = 35
    LCL3 = LCL2 + cw = 26 + 10 = 36    UCL3 = UCL2 + cw = 35 + 10 = 45
    and so on for the remaining classes.
Therefore, the GFD would be:
a)

Class (xi)    Frequency (fi)
16 – 25       7
26 – 35       2
36 – 45       6
46 – 55       5
56 – 65       6
66 – 75       4

b)

Class (xi)    Frequency (fi)    CBi            cmi
16 – 25       7                 15.5 – 25.5    20.5
26 – 35       2                 25.5 – 35.5    30.5
36 – 45       6                 35.5 – 45.5    40.5
46 – 55       5                 45.5 – 55.5    50.5
56 – 65       6                 55.5 – 65.5    60.5
66 – 75       4                 65.5 – 75.5    70.5
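
The table in Example 9 can be checked by counting the observations in each class directly. The sketch below uses a hypothetical helper `grouped_fd` and assumes inclusive classes with LCL1 = 16, cw = 10, n = 6 and u = 1, as chosen in the solution.

```python
# Daily customer counts from Example 9
customers = [20, 48, 65, 25, 48, 49,
             35, 25, 72, 42, 22, 58,
             53, 42, 23, 57, 65, 37,
             18, 65, 37, 16, 39, 42,
             49, 68, 69, 63, 29, 67]


def grouped_fd(data, lcl1, cw, n, u=1):
    """Return rows (LCL, UCL, frequency, LCB, UCB, class mark)."""
    rows = []
    for i in range(n):
        lcl = lcl1 + i * cw
        ucl = lcl + cw - u
        freq = sum(1 for x in data if lcl <= x <= ucl)  # inclusive class
        lcb, ucb = lcl - u / 2, ucl + u / 2
        rows.append((lcl, ucl, freq, lcb, ucb, (lcb + ucb) / 2))
    return rows


table = grouped_fd(customers, lcl1=16, cw=10, n=6)
for lcl, ucl, freq, lcb, ucb, cm in table:
    print(f"{lcl} - {ucl}  {freq:>2}  {lcb} - {ucb}  {cm}")
```

Every observation falls into exactly one class, so the frequencies add up to the 30 observations.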

Exercise
Construct a grouped frequency distribution for the following ages of 50 persons with 6 classes.
37 40 69 35 36 70 72 62 36 72
65 64 47 59 55 42 45 50 46 65
54 63 51 50 61 60 58 58 56 58
55 45 49 51 50 56 44 60 70 44
52 43 55 46 42 62 57 48 60 55

I. CUMULATIVE FREQUENCY DISTRIBUTION (CFD)


A cumulative frequency distribution shows the number of observations lying above or below
specified values in a distribution. CFD is of two types.
a. ‘Less Than’ Cumulative Frequency Distribution (<CFD): shows the number of
cases lying below the upper class boundary of each class.

b. ‘More Than’ Cumulative Frequency Distribution (>CFD): shows the number of
cases lying above the lower class boundary of each class.

Remark: A frequency distribution does not tell us directly the number of units above or
below specified values of the classes. This can be determined from a cumulative frequency
distribution.

Example 11. Consider the following frequency distribution with class width 4.

Class (xi)    Frequency (fi)    Less than cumulative    More than cumulative
                                frequency (<cfi)        frequency (>cfi)
3 – 6         4                 4                       30
7 – 10        7                 11                      26
11 – 14       10                21                      19
15 – 18       6                 27                      9
19 – 22       3                 30                      3

From the ‘less than’ cumulative frequency distribution, there are 4 observations below
6.5, 11 observations below 10.5, etc. From the ‘more than’ cumulative frequency
distribution, 30 observations are above 2.5, 26 above 6.5, etc.
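
Both cumulative columns are running totals of the class frequencies, taken from opposite ends. A minimal sketch using Python's standard `itertools.accumulate` follows; the function names are hypothetical.

```python
from itertools import accumulate

# Class frequencies for the classes 3–6, 7–10, 11–14, 15–18, 19–22
frequencies = [4, 7, 10, 6, 3]


def less_than_cf(freqs):
    """Running total starting from the first class."""
    return list(accumulate(freqs))


def more_than_cf(freqs):
    """Running total starting from the last class."""
    return list(accumulate(reversed(freqs)))[::-1]


less = less_than_cf(frequencies)
more = more_than_cf(frequencies)
print(less)
print(more)
```

The last ‘less than’ value and the first ‘more than’ value both equal the total number of observations.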

II. RELATIVE FREQUENCY DISTRIBUTION (RFD)


A relative frequency distribution enables the researcher to know the proportion or percentage of
cases in each class. Relative frequencies are obtained by dividing the frequency of each class by
the total frequency. A relative frequency can be converted into a percentage frequency by
multiplying it by 100%,
i.e.

Rfi = fi / n

where Rfi is the relative frequency of the ith class,
      fi is the frequency of the ith class, and
      n is the total number of observations.
Note: Pfi = Rfi × 100%,
where Pfi is the percentage frequency of the ith class.

Example 14: The relative and percentage frequency distribution of Example 11 is:

xi         fi    Rfi      % freq. (Pfi)
3 – 6      4     4/30     4/30 × 100
7 – 10     7     7/30     7/30 × 100
11 – 14    10    10/30    10/30 × 100
15 – 18    6     6/30     6/30 × 100
19 – 22    3     3/30     3/30 × 100
Total      30    1        100%
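
The fractions in the table can be evaluated as follows. This is a small sketch of Rfi = fi / n and Pfi = Rfi × 100%; the function name `relative_frequencies` is a hypothetical helper.

```python
# Class frequencies fi for the classes 3–6, 7–10, 11–14, 15–18, 19–22
frequencies = [4, 7, 10, 6, 3]


def relative_frequencies(freqs):
    """Rfi = fi / n, where n is the total number of observations."""
    n = sum(freqs)
    return [f / n for f in freqs]


rf = relative_frequencies(frequencies)
pf = [r * 100 for r in rf]       # Pfi = Rfi × 100%

for f, r, p in zip(frequencies, rf, pf):
    print(f"{f:>3}  {r:.4f}  {p:.2f}%")
```

The relative frequencies add up to 1 and the percentage frequencies to 100%, apart from floating-point rounding.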

QUESTIONS

1. Determine whether each statement is true or false.

a) A frequency distribution is the organization of raw data, in table form, that lists values or
classes with their corresponding frequencies.
b) The mid point of a class can be obtained by adding the upper and lower limits, and
dividing by 2.
c) If the gap between any two successive classes is one and the limits of a class are 10-19,
then the width of the class is 9.
d) If the limits of a class in a frequency distribution are 26-30, then the boundaries are
25.5-30.5.
e) When data is first collected, it is called raw data.
f) A frequency distribution should contain between 50 and 100 classes.
g) It is not important to keep the width of each class the same in a frequency distribution.

2. Classify each variable as discrete or continuous.


a) Number of cartons of milk manufactured each day.
b) Temperatures of airplane interiors at a given airport.
c) Lifetimes of transistors in a stereo set.
d) Weights of newborn calves.
3. 100 employees were surveyed in a factory to find out their ages. The result was obtained as
follows.

32 21 28 31 35 46 48 49 49 48
36 37 22 31 28 34 20 45 44 48
38 33 33 23 28 29 33 26 36 30
43 42 32 36 24 27 27 32 45 45
39 39 38 32 33 25 30 28 37 36
42 43 38 40 35 34 20 30 36 32
40 38 38 40 46 36 35 21 31 35
41 42 39 40 46 44 32 37 22 27
41 39 40 38 44 45 48 36 32 23
40 41 40 44 49 49 49 49 37 33

Construct a Grouped Frequency Distribution (GFD) with five classes for the above data.

PRESENTATION OF DATA

AIMS AND OBJECTIVES

The aim of this section is to study how to present data using different types of
graphs, charts, and diagrams that facilitate comparisons and, in general, give a good
overall picture of the data.

INTRODUCTION
This unit deals with organizing a set of raw data into a Frequency Distribution (FD)
and describing the distribution graphically with a histogram, a frequency polygon, and a cumulative
frequency curve (ogive). Other types of numerical information will be summarized and
presented in the form of a bar chart, pie chart or pictogram.
Definition:

Presentation is a statistical procedure of arranging and putting data in the form of tables,
graphs, charts and/or diagrams.

A. HISTOGRAM

After you complete a frequency distribution, your next step will be to construct a “picture” of
these data values using a histogram. A histogram is a graph consisting of a series of adjacent
rectangles whose bases are equal to the class widths of the corresponding classes and whose
heights are proportional to the corresponding class frequencies. Class boundaries are
marked along the horizontal axis (x-axis) and class frequencies along the vertical axis
(y-axis) according to a suitable scale. A histogram describes the shape of the data. You can use it
to answer quickly such questions as: Are the data symmetric? Where do most of the data values lie?

Example 1. Consider the following GFD and construct a histogram.

Class (xi)    Frequency (fi)
3 – 6         4
7 – 10        7
11 – 14       10
15 – 18       6
19 – 22       3
Total         30

Solution:
[Figure: Histogram for the above distribution. Horizontal axis: class boundaries (CBi); vertical axis: class frequency (fi).]

CYP 1. Construct a histogram for the following distribution.

Class (xi)    Frequency (fi)
5 – 10        4
10 – 15       7
15 – 20       9
20 – 25       12
25 – 30       6
30 – 35       5

B. FREQUENCY POLYGON

It is a line graph of a frequency distribution. Although a histogram does demonstrate the shape of
the data, the shape can perhaps be illustrated more clearly by a frequency polygon. Here,
you merely connect the centers of the tops of the histogram bars (located at the class midpoints)
with a series of straight lines. The resulting figure is a frequency polygon. The class marks
are plotted along the x-axis and the class frequencies along the y-axis. Empty classes are
included at each end so that the polygon is anchored to the x-axis.
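
The points to be joined in a frequency polygon can be listed before any drawing is done. The sketch below assumes equal class widths and uses a hypothetical helper `polygon_points`; the distribution with classes 3–6, 7–10, 11–14, 15–18 and 19–22 is used only as an illustration.

```python
def polygon_points(class_marks, freqs):
    """Vertices (class mark, frequency) of a frequency polygon.

    Assumes equal class widths; one empty class is added at each end
    so that the polygon starts and finishes on the x-axis.
    """
    cw = class_marks[1] - class_marks[0]      # gap between class marks
    xs = [class_marks[0] - cw] + list(class_marks) + [class_marks[-1] + cw]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))


# Illustrative distribution: classes 3–6, 7–10, 11–14, 15–18, 19–22
marks = [4.5, 8.5, 12.5, 16.5, 20.5]
freqs = [4, 7, 10, 6, 3]
points = polygon_points(marks, freqs)
print(points)
```

Joining these points in order with straight lines, using any plotting tool or by hand, gives the polygon.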

Example 2. Construct a frequency polygon for the frequency distribution given in Example 9.
Solution:
[Figure: A frequency polygon for the distribution in Example 9. Horizontal axis: class marks (cmi); vertical axis: frequency (fi).]

CYP 2 Construct a frequency polygon for the frequency distribution given under CYP 1

C. CUMULATIVE FREQUENCY CURVE, (OGIVE)

It is the graphic representation of a cumulative frequency distribution. Ogives are of two kinds:
the ‘less than’ ogive (< ogive) and the ‘more than’ ogive (> ogive).
A) ‘Less than’ ogive: here, the ‘less than’ cumulative frequencies are plotted against the upper
class boundaries of the respective classes, and the points are joined by straight lines.
Example 3. Draw a ‘less than’ ogive for the frequency distribution in Example 11.

Solution:
[Figure: A ‘less than’ ogive for the frequency distribution above. Horizontal axis: upper class boundary (UCBi); vertical axis: less than cumulative frequency (<Cfi).]

B) ‘More than’ ogive: here, the ‘more than’ cumulative frequencies are plotted against the lower
class boundaries of the respective classes, and the points are joined by straight lines.

Example 4. Draw a ‘more than’ ogive for the frequency distribution in Example 11.
Solution:
[Figure: A ‘more than’ ogive for the above frequency distribution. Horizontal axis: lower class boundaries (LCBi); vertical axis: more than cumulative frequency (>Cfi).]

D. LINE GRAPH

It represents the relationship between time (on the x-axis) and the values of a variable (on the
y-axis). The values are recorded with respect to the time of occurrence.

Example 5. Draw a line graph for the following time series.

Year      1986    1987    1988    1989    1991
Values    20      10      30      15      1

Solution:
[Figure: A line graph showing the above time series. Horizontal axis: Year; vertical axis: Values.]

E. VERTICAL LINE GRAPH

It is a graphical representation of discrete data (or characteristics expressed in whole numbers)
with respect to their frequencies. Vertical solid lines are used to indicate the frequencies.

Example 6. Draw a vertical line graph for the following data.

Family                A    B    C    D    E
Number of children    3    2    7    6    4

Solution:
[Figure: Vertical line graph showing number of children in family A, B, C, D and E. Horizontal axis (X): family; vertical axis (Y): number of children.]

F. BAR CHART (BAR DIAGRAM)

Histograms, frequency polygons and ogives are used for data having an interval or ratio level of
measurement. Other ways of presenting statistical data, each suited to particular kinds of
situations, are bar charts, pie charts and pictographs.

A bar chart is a series of equally spaced bars of uniform width, where the height (length) of a bar
represents the amount (magnitude) of frequency corresponding to a category. Bars may be drawn
horizontally or vertically. Vertical bars are usually preferred because their heights are easy to
compare with one another.

TYPES OF BAR CHARTS

A. SIMPLE BAR CHART:


It represents a single set of data (variable) classified into different categories. A single bar is
drawn for each category, with a height equal to the respective frequency.

Example 18: The revenue (in millions of Birr) of company X from 1980 to 1982 is given below.

Year Revenue
1980 50
1981 150
1982 200

Solution:
[Figure: A simple bar chart showing the revenues of company X from 1980 to 1982. x-axis: year;
y-axis: revenue, from 0 to 250.]
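
The idea that the length of each bar is proportional to its frequency can be illustrated with a
small Python sketch that prints a horizontal text bar chart for the revenue data of Example 18.
The function name text_bar_chart and the scale of one '#' per 10 million Birr are assumptions made
only for this illustration.

```python
def text_bar_chart(data, unit):
    """Return one text line per category; each '#' stands for `unit` of the value."""
    width = max(len(str(label)) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / unit)  # bar length proportional to the value
        lines.append(f"{str(label):<{width}} | {bar} {value}")
    return lines


# Revenue of company X in millions of Birr (Example 18).
revenue = {1980: 50, 1981: 150, 1982: 200}
chart = text_bar_chart(revenue, unit=10)  # assumed scale: one '#' = 10 million Birr
print("\n".join(chart))
```

Because every bar uses the same scale, the bar for 1982 is four times as long as the bar for 1980,
just as 200 is four times 50.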

B. MULTIPLE BAR CHART:


Here two or more bars are grouped in each category to represent two or more interrelated sets of
data, each bar being drawn with its corresponding frequency. The bars of related variables are
kept adjacent to each other for every set of values. These charts are suitable when the overall
total is not required. Each bar is shaded or colored separately, and a key is given to distinguish
them.

Example 19: The following table shows the production of wheat and maize in hundreds of
quintals.

Year Maize Wheat

1980 40 80

1981 20 60

1982 60 100

Solution:
[Figure: A multiple bar chart of the wheat and maize production in the table above. x-axis: year,
1980 to 1982; y-axis: number of quintals, from 0 to 100; for each year a maize bar and a wheat bar
are drawn side by side, as shown by the key.]

C. SUBDIVIDED BAR CHART:


It is used to present data by subdividing a single bar in proportion to the component frequencies.
Each portion of the bar is then shaded or colored, and a key is given to distinguish them.

Example 20: The number of quintals of wheat and maize (in millions of quintals) produced by
country X in the indicated years.

Year Wheat Maize

1980 150 150

1981 300 200

1982 350 100

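Solution:
[Figure: A subdivided bar chart of the number of quintals of wheat and maize produced by country X.
x-axis: year, 1980 to 1982; y-axis: number of quintals, from 0 to 600; each bar is subdivided into
wheat (150, 300, 350) and maize (150, 200, 100), as shown by the key.]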

D. PERCENTAGE BAR CHART:
It is a subdivided bar chart in which percentages, rather than the actual frequencies, are used for
each classification.

Example 21: Construct a percentage bar chart for the data in Example 20.

Solution:
Year    % of Wheat Production      % of Maize Production
1980    150/300 × 100 = 50         150/300 × 100 = 50
1981    300/500 × 100 = 60         200/500 × 100 = 40
1982    350/450 × 100 ≈ 78         100/450 × 100 ≈ 22

[Figure: Percentage of wheat and maize production from 1980 to 1982. x-axis: year; y-axis:
percentage produced, from 0% to 100%; each bar is subdivided into wheat (50, 60, 78) and maize
(50, 40, 22), as shown by the key.]
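
The percentages used in the chart can be checked with a few lines of Python. The sketch below
recomputes them from the wheat and maize figures tabulated above for 1980 to 1982; the function
name percentage_shares is an assumption made for illustration, and the shares are rounded to whole
percents as in the table.

```python
def percentage_shares(row):
    """Convert the component values of one bar into whole-number percentages."""
    total = sum(row.values())
    return {name: round(value * 100 / total) for name, value in row.items()}


# Wheat and maize production in millions of quintals, by year.
production = {
    1980: {"wheat": 150, "maize": 150},
    1981: {"wheat": 300, "maize": 200},
    1982: {"wheat": 350, "maize": 100},
}
shares = {year: percentage_shares(row) for year, row in production.items()}
for year, row in shares.items():
    print(year, row)
```

With these data the rounded shares of each year add up to 100. In general, rounding each share
separately can leave a total of 99 or 101, in which case one more decimal place may be kept.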

G. PIE CHART

A pie chart is a circle divided into sectors whose areas are proportional to the values of the
components they represent. It shows the components in terms of percentages, not in absolute
magnitude. The angle formed at the center by each sector has to be proportional to the value it
represents.

Example 22: The monthly expenditure of a certain family is given below.

Items            Expenditure    % Proportion (Pfi)       Degrees (360° × Rfi)
Clothing         100            100/1000 × 100 = 10      100/1000 × 360° = 36°
Food             350            350/1000 × 100 = 35      350/1000 × 360° = 126°
House Rent       250            250/1000 × 100 = 25      250/1000 × 360° = 90°
Miscellaneous    300            300/1000 × 100 = 30      300/1000 × 360° = 108°
Total            1000           100%                     360°

Solution: The pie chart for the above expenditure is as follows.
[Figure: Pie chart with four sectors: Food (350), House rent (250), Clothing (100) and
Misc. (300).]
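
The percentage and the central angle of each sector follow from the same proportion: a component's
share of the total, multiplied by 100 or by 360 degrees. The Python sketch below reproduces the
calculation for the family's monthly expenditure; the function name pie_sectors is an assumption
made for illustration.

```python
def pie_sectors(expenditure):
    """Return {item: (percentage, angle in degrees)} for each component."""
    total = sum(expenditure.values())
    return {
        item: (amount * 100 / total, amount * 360 / total)
        for item, amount in expenditure.items()
    }


# Monthly expenditure of the family in Example 22.
expenditure = {"Clothing": 100, "Food": 350, "House Rent": 250, "Miscellaneous": 300}
sectors = pie_sectors(expenditure)
for item, (percent, angle) in sectors.items():
    print(f"{item:<14} {percent:5.1f}%  {angle:6.1f} degrees")
```

A useful check when drawing the chart by hand is that the percentages add up to 100 and the angles
add up to 360 degrees.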

H. PICTOGRAPH (PICTOGRAM)

A pictograph is a graph that uses symbols or pictures to represent data.

Example 23: To compare the population of a country from 1990 to 1992, we simply draw pictures of
people, where each picture represents 1,000,000 people.

Key: ☺ = 1,000,000 people

1992 - ☺ ☺ ☺ ☺

1991 - ☺ ☺ ☺

1990 - ☺ ☺

Summary
This unit discussed how to present organized data. Once a frequency distribution is constructed,
representing the data with graphs is a simple task. The most commonly used graphs in research
statistics are the histogram, the frequency polygon and the ogive. Other graphs and diagrams, such
as bar charts, pie charts and pictograms, can also be used. Some of these graphs are seen
frequently in newspapers, magazines and various statistical reports.

ANSWERS TO CHECK YOUR PROGRESS (CYP) QUESTIONS

CYP 1
[Figure: graph with frequency on the y-axis (scale up to 12) and class boundaries (CBi) on the
x-axis (marked from 5 to 35).]
CYP 2
[Figure: graph with cumulative frequency on the y-axis (scale up to 12) and class marks (cmi) on
the x-axis (marked from 2.5 to 37.5).]

QUESTION

Direction: Answer each of the following questions.


1. Determine whether each statement is true or false.
a. The ogive uses cumulative frequencies.
b. A histogram can be drawn by using vertical or horizontal bars.
c. In the construction of a frequency polygon, the class limits are used for the x-axis.
d. Data collected over a period of time can be graphed by using a pie chart.
e. When the data is represented graphically by symbols or pictures, the graph is
called a frequency curve.
3. Construct a histogram, frequency polygon, and both ogives to represent the data shown
below.

Class Boundaries (CBi) Frequency fi

5.5-10.5 1

10.5-15.5 2

15.5-20.5 3

20.5-25.5 5

25.5-30.5 4

30.5-35.5 3

35.5-40.5 2

4. Construct a horizontal and a vertical bar chart for the areas (in square kilometers) of each of
the great lakes in Ethiopia.
Lake Area (km²)
Tana 3600
Abaya 1160
Chamo 551
Ziway 434
Shala 409
