CH-2 Stat I
INTRODUCTION
The term “Data Collection” refers to all the issues related to data sources, scope of investigation
and sampling techniques. This chapter starts with the meaning of data collection. It then
discusses the two sources of data, namely primary and secondary sources. In
addition, the different methods of collecting data from primary sources are discussed.
Collection of data implies a systematic and meaningful assembly of information for the
accomplishment of the objective of a statistical investigation. It refers to the methods used in
gathering the required information from the units under investigation.
The quality of data greatly affects the final output of an investigation. Hence, utmost care should
be given to the data collection process, and every possible precaution should be taken to ensure
accuracy while collecting data. Otherwise, with inaccurate and inadequate data, the whole
analysis is likely to be faulty and the decisions based on it will be misleading.
Statistical data may be obtained from either a primary or a secondary source. A primary source is a
source from which first-hand information is gathered. On the other hand, a secondary source is
one that makes available data which were collected by some other agency. Clearly, a source
that is not primary is necessarily a secondary source. Primary sources are original sources of
data.
Data obtained from a primary source is called primary data. Likewise, data gathered from a
secondary source is known as secondary data. For example, assume that a simple study is to be
conducted to see the age distribution of HIV/AIDS victims. Clearly, the variable of study
is age. Data about the age of HIV/AIDS victims may be obtained by interviewing
the victims directly. Note that, in this specific case, the victims are primary sources.
Moreover, the data to be collected from them are primary data. Alternatively, one may use
records of hospitals and other related agencies to obtain the age of the victim citizens without the
need of tracing the victims personally. Therefore, the records of the hospitals, in our case, are
secondary sources and the data copied from such records are secondary data.
In most cases, secondary data is obtained from such sources as census and survey reports, books,
official records, reported experimental results, previous research papers, bulletins, magazines,
newspapers, web sites, and other publications. Different organizations and government agencies
publish information (data) in the form of reports, periodicals, journals, etc. In the case of
Ethiopia, the Central Statistics Authority (CSA) is the foremost publisher of such
relevant information (secondary data).
The following are the major advantages of primary data over secondary data.
Primary data give more reliable, accurate and adequate information, which is
suitable to the objective and purpose of an investigation.
A primary source usually shows data in greater detail.
Primary data is free from errors that may arise from copying of figures from
publications, which is the case in secondary data.
The following are the major advantages of secondary data over primary data.
It is readily available and hence more convenient and much quicker to obtain than primary
data.
It reduces time, cost and effort as compared to primary data.
Secondary data may be available in subjects (cases) where it is impossible to collect
primary data. Such a case can be regions where there is war.
The choice between primary data and secondary data is determined by factors like nature and
scope of the enquiry, availability of financial resources, availability of time, degree of accuracy
desired, and the collecting agency. Often, primary data are used in situations where secondary
data do not provide an adequate basis for analysis. That is, when the secondary data do not suit a
specific investigation, we use primary data. Except in such cases, most statistical investigations
rest upon secondary data, since this minimizes cost and saves time. Nevertheless, the following
points should be carefully considered while using secondary data in an investigation.
One should closely examine whether or not the data are suitable for the intended study.
The reliability of the source of the data should be assessed. If there is any doubt about the
reliability of the data, they should not be used.
One should check that the data are not obsolete.
In case the data are based on a sample, one should see whether the sample properly
represents the population.
One should check that the primary data were collected and handled carefully by skilled persons.
Finally, it should be clear that primary data in the hands of one person might be secondary in the
hands of another. That is why it is often said, “the difference between primary and secondary
data is largely one of degree.”
After discussing the two sources of data, primary and secondary, it is logical to say a few words
about the methods employed in collecting data from its original or primary source.
Many authors commonly state three methods of collecting primary data. These are:
b) Direct Observation
In this approach, an investigator stays at the place of survey and notes down the observations
himself/herself. There are no enquiries in the case of direct observation. For example, an investigator
making a study on the nutritional status of children may directly (physically) measure the weight,
height, and other required parameters himself/herself. Direct observation is more experimental
and is usually applied in scientific studies. It is time consuming and also costly.
c) Questionnaire Method
Under this method, a list of questions related to the survey is prepared and sent to the various
respondents by post, Web sites, e-mail, etc. However, this method cannot be used if the
respondent is illiterate. It is a method that is often used in many statistical investigations.
The following are the major points that we need to take into account while preparing a
questionnaire.
The number of questions should be small. Naturally, respondents are not comfortable with
lengthy questionnaires. Lengthy questionnaires usually bore respondents. Hence, fifteen to
twenty five questions in a questionnaire are optimal. If a lengthy questionnaire is
unavoidable, it should preferably be divided into two or more parts.
The questions should be short, clear, simple and unambiguous. Moreover, the questions must
be arranged in a logical order so that natural and spontaneous reply to each is induced. For
instance, it is not appropriate to ask a person how many packets of cigarette he/she smokes
before asking whether he/she smokes or not.
Questions of a sensitive nature should be avoided. Sensitive questions are those that
are too personal or pecuniary, like “Sources of income”, “Drinking habit”, etc. The logic
here is that respondents do not willingly answer sensitive questions. Such information, if
necessary, may be gathered through interview or through other indirect questions.
Questions should be capable of objective answers. As much as possible, avoid subjective
questions and keep to questions of fact. To this end, multiple answer questions can be used.
Mail questionnaires should be accompanied by a covering letter, which should state the
purpose of the questionnaire, a promise of confidentiality of responses, etc.
Furthermore, the questions should preferably be designed in such a way that they can easily be answered as yes/no.
2.2. LEVEL (SCALE) OF MEASUREMENT
There are four general levels of measurement: nominal, ordinal, interval and ratio.
i. Nominal Level
The terms nominal level of measurement and nominal scaled are commonly used to refer to data
that can only be classified into categories. In the strict sense of the words, however, there are no
measurements and no scales involved. Instead, there are just counts.
In the above table, the arrangement of religions could have been changed. This indicates that for
nominal level of measurement, there is no particular order for the groupings. Further, the
categories are considered to be mutually exclusive.
Nominal level is considered the most primitive, the lowest or the most limited type of
measurement.
ii. Ordinal Level
Look at the data below.
Ratings of the company commander
The table lists the ratings of the company commander by the nurses under her command. This is an
illustration of the ordinal level of measurement. One category is higher than the next one; that is,
“superior” is a higher rating than “good”, “good” is higher than “average”, and so on.
If 1 is substituted for “superior”, 2 for “good” and so on, a 1 ranking is obviously
higher than a 2 ranking, and a 2 ranking is higher than a 3 ranking. However, it cannot be said
that (as an example) a company commander rated good is twice as competent as one rated
average, or that a company commander rated superior is twice as competent as one rated good. It
can only be said that a rating of superior is greater than a rating of good, and a good rating is
greater than an average rating.
The major difference between a nominal level and an ordinal level of measurement is the
“greater than” relationship between the ordinal-level categories. Otherwise, the ordinal scale of
measurement has the same characteristics as the nominal scale; namely, the categories are
mutually exclusive and exhaustive.
The major differences between interval and ratio levels of measurement are these: (1) ratio-level
data have a meaningful zero point, and (2) the ratio between two numbers is meaningful. Money is
a good illustration: having zero dollars has meaning, you have none! Weight is another ratio-level
measurement.
If the dial on a scale is at zero, there is a complete absence of weight. Also, if you earn $40,000 a
year and John earns $10,000, you earn four times what he does. Likewise, if you weigh 80 kg
and John weighs 40 kg, you weigh twice as much as John. But such comparisons are impossible at the interval
level of measurement.
After collecting relevant information (data) for the purpose of statistical investigation, the next
important task is the classification and presentation of the data. It is difficult to grasp the meaning
of any considerable volume of numerical data unless their mass is somehow reduced to
relatively few convenient classes or categories and presented with the help of some kind of
visual aid.
This section discusses classification of data. Presentation of data using graphs and charts will be
seen in the next unit.
Purposes of Classification:
To eliminate unnecessary detail.
To bring out clearly points of similarity and dissimilarity.
To enable one to form mental pictures of objects or measurements.
To enable one to make comparisons and draw inferences.
2.4. METHODS (TYPES) OF CLASSIFICATION
1. Geographical Classification: - Data are arranged according to places like continents,
regions, and countries
Example
2. Chronological Classification:- Data are arranged according to time like year, month.
Example
Year (in EC) Population (in million)
1974 30
1986 52
1991 60
3. Qualitative Classification: - Data are arranged according to attributes like color, religion,
marital-status, sex, educational background, etc.
Example 3.
Employees in a Factory x
Educated Uneducated
Example 4.
Mr. x Height (X) in cm
A 160
B 182
C 175
D 178
A. Discrete Variables – are variables that are associated with enumeration or counting
Example
Number of students in a class
Number of children in a family, etc
When the raw data have been collected, they should be put into an ordered array in ascending
or descending order so that they can be looked at more objectively. The data must then be
organized into a frequency distribution (FD), which simply lists the values or classes with their corresponding
frequencies in a tabular form. Here, frequency refers to the number of times a certain
value occurs in the data.
The tabular representation of values of a variable together with the corresponding frequency is
called a Frequency Distribution (FD).
Definition:
A frequency distribution is the organization of raw data in table form, using classes and
frequencies.
A. Ungrouped Frequency Distribution (UFD)
Shows a distribution where the values of a variable are linked with the respective frequencies.
Example 7. Consider the number of children in 15 families.
1 0 3 2 0
2 4 1 3 1
4 1 2 2 3
Construct ungrouped FD for the above data.
Solution:
No. of Children (Values)    No. of Families (Tallies)    Frequency
0                           //                           2
1                           ////                         4
2                           ////                         4
3                           ///                          3
4                           //                           2
Total                                                    15
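The tallying in Example 7 can also be checked with a short script. The following is only a minimal sketch in Python; the function name ungrouped_fd is an illustrative choice and not part of the course material.

```python
from collections import Counter

# Number of children in the 15 families of Example 7
children = [1, 0, 3, 2, 0,
            2, 4, 1, 3, 1,
            4, 1, 2, 2, 3]

def ungrouped_fd(data):
    """Return (value, frequency) pairs sorted by value (hypothetical helper)."""
    counts = Counter(data)          # counts how many times each value occurs
    return sorted(counts.items())   # order the rows by the value of the variable

table = ungrouped_fd(children)
print("Value  Tally  Frequency")
for value, freq in table:
    # one "/" per observation, as in the tally column
    print(f"{value:<6} {'/' * freq:<6} {freq}")
print("Total ", sum(freq for _, freq in table))
```

The printed frequencies should agree with the table above, and their total should equal the number of families.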
Exercise
Consider the following scores in a statistics test obtained by 20 students in a given class.
10, 4, 4, 7, 5, 7, 7, 8, 5, 7, 8, 5, 10, 8, 7, 5, 7, 8, 7, 4
Prepare an ungrouped FD
B. Grouped Frequency Distribution (GFD)
If the mass of the data is very large, it is necessary to condense the data into an appropriate
number of classes or groups of values of a variable and indicate the number of observed values
that fall into each class. Therefore, a GFD is a frequency distribution where values of a
variable are put into groups and matched with the number of observations in each group.
Example * 2.8
Values (xi) 1 - 25 26 - 50 51 - 75 76 - 100
Frequency (fi) 3 10 18 6
COMMON TERMINOLOGIES IN A GFD
i. Class:- group of values of a variable between two specified numbers called lower class limit
(LCL) & upper class limit (UCL)
In Example *, the GFD contains four classes: 1 – 25, 26 – 50, 51 – 75, and 76 – 100
ii. Class Frequency (or Simply Frequency): refers to the number of observations
corresponding to a class.
In Example *, the class frequencies of the 1st, 2nd, 3rd, and 4th classes are 3, 10, 18 and 6,
respectively.
iii. Class Boundaries: are boundaries obtained by subtracting half of the unit of measurement
(u) from the lower limits or by adding ½ (u) on the upper limits of a class.
i.e UCBi = UCLi + ½ (u)
LCBi = LCLi - ½ (u)
Where UCBi = Upper Class Boundaries and
LCBi = Lower Class Boundaries
Remark: The unit of measurement (u) is the gap between any two successive classes, i.e.
u = lower limit of a class – upper limit of the preceding class.
For instance, for the second class in Example *, u = 26 – 25 = 1, so that
LCL2 = 26 UCL2 = 50
LCB2 = 26 - ½(1) = 25.5 UCB2 = 50 + ½(1) = 50.5
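The boundary formulas can be applied to every class of Example * at once. The sketch below is an illustration only: the function name class_boundaries is hypothetical, and it assumes that the gap u is the same between all successive classes.

```python
def class_boundaries(limits):
    """limits: list of (LCL, UCL) pairs for successive classes.
    Returns a list of (LCB, UCB) pairs (hypothetical helper)."""
    # u = lower limit of a class - upper limit of the preceding class
    u = limits[1][0] - limits[0][1]
    # LCB = LCL - u/2 and UCB = UCL + u/2
    return [(lcl - u / 2, ucl + u / 2) for lcl, ucl in limits]

# Class limits of the GFD in Example *
limits = [(1, 25), (26, 50), (51, 75), (76, 100)]
boundaries = class_boundaries(limits)
for (lcl, ucl), (lcb, ucb) in zip(limits, boundaries):
    print(f"limits {lcl} - {ucl}: boundaries {lcb} - {ucb}")
```

Note that the upper boundary of each class coincides with the lower boundary of the next one, so the boundaries leave no gaps between classes.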
iv. Class Width (size of a class or class interval): it is the difference between the upper and
lower class limits or the difference between the upper and lower class boundaries of any class.
Remarks:
1. If both the LCL & UCL are included in a class, it is called an inclusive class. For
inclusive classes,
Class width (cw) = UCBi - LCBi
2. If LCL is included and the UCL is not included in a class, it is called an exclusive class.
For exclusive classes
cw = UCLi – LCLi
Range (R): R = L – S, where L is the largest and S is the smallest observation.
Exercise: Consider the following GFD
RULES FOR FORMING A GROUPED FREQUENCY DISTRIBUTION
To construct a GFD the following points should be considered
1) The classes should be clearly defined. That is, each observation should fall into one and
only one class.
2) The number of classes should be neither too large nor too small.
Normally, 5 to 20 classes are recommended.
3) All the classes should be of the same width. An approximate suitable class width can be
obtained as:
$$cw = \frac{\text{Range}}{\text{Number of classes}}, \quad \text{i.e. } cw = \frac{R}{n} = \frac{L - S}{n}$$
Example 8. Let $R/n = 6.8263$. Then:
o If all the observations are whole numbers, cw = 7
o If all the observations are to one decimal place, cw = 6.8
o If all the observations are to two decimal places, cw = 6.83, etc.
Note that a suitable number of classes can be obtained by using the formula
$$n = 1 + 3.322 \log N,$$
rounded up or down to the nearest whole number, where N is the total number of observations
and log is the logarithm to base 10.
Remark: Unequal class intervals create problems in graphing and in computing some statistical
measures.
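The two rules of thumb above (number of classes and approximate class width) can be tried out numerically. This is a minimal sketch; the function names number_of_classes and round_width are illustrative assumptions, and Python's built-in round is used for the rounding step.

```python
import math

def number_of_classes(N):
    """n = 1 + 3.322 log10(N), rounded to the nearest whole number."""
    return round(1 + 3.322 * math.log10(N))

def round_width(ratio, decimals):
    """Round R/n to the same number of decimal places as the observations."""
    return round(ratio, decimals)

# Number of classes suggested for N = 30 observations
print(number_of_classes(30))

# Example 8: R/n = 6.8263, for data recorded to 0, 1 and 2 decimal places
for decimals in (0, 1, 2):
    print(decimals, round_width(6.8263, decimals))
```

In practice the rounded width may still need adjusting so that the chosen number of classes covers every observation.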
20 48 65 25 48 49
35 25 72 42 22 58
53 42 23 57 65 37
18 65 37 16 39 42
49 68 69 63 29 67
a. Construct a GFD with a suitable number of classes
b. Complete the distribution obtained in (a) with class boundaries & class marks
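Solution: i. Range = Largest value – smallest value
= 72 – 16 = 56
ii. N = 30 (total number of observations)
Number of classes, n = 1 + 3.322 log 30
= 1 + 3.322 (1.4771)
= 5.9
Hence a suitable number of classes, n, is chosen to be 6.
iii. Class width, cw = R/n = 56/6 ≈ 9.3, which is taken as 10 so that six classes cover all the
observations.
a. The class limits and frequencies of the resulting GFD are: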
16 – 25 7
26 – 35 2
36 – 45 6
46 – 55 5
56 – 65 6
66 – 75 4
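The frequencies in this GFD can be reproduced by counting the observations that fall within each pair of class limits. The sketch below is only an illustration: the function name grouped_fd is hypothetical, and the starting limit 16 and the width 10 are read off the classes shown above rather than derived by the code.

```python
# The 30 observations of the example
data = [20, 48, 65, 25, 48, 49,
        35, 25, 72, 42, 22, 58,
        53, 42, 23, 57, 65, 37,
        18, 65, 37, 16, 39, 42,
        49, 68, 69, 63, 29, 67]

def grouped_fd(data, first_lcl, width, n_classes):
    """Build inclusive classes of whole numbers and count the observations
    in each one (hypothetical helper; assumes a unit of measurement of 1)."""
    table = []
    for i in range(n_classes):
        lcl = first_lcl + i * width       # lower class limit
        ucl = lcl + width - 1             # upper class limit (inclusive)
        freq = sum(lcl <= x <= ucl for x in data)
        table.append(((lcl, ucl), freq))
    return table

gfd = grouped_fd(data, first_lcl=16, width=10, n_classes=6)
for (lcl, ucl), freq in gfd:
    print(f"{lcl} - {ucl}: {freq}")
print("Total:", sum(freq for _, freq in gfd))
```

Because every observation falls into one and only one class, the frequencies must add up to N = 30.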
b)
Exercise
Construct a grouped frequency distribution for the following ages of 50 persons with 6 classes.
37 40 69 35 36 70 72 62 36 72
65 64 47 59 55 42 45 50 46 65
54 63 51 50 61 60 58 58 56 58
55 45 49 51 50 56 44 60 70 44
52 43 55 46 42 62 57 48 60 55
Remark: The frequency distribution does not tell us directly the number of units above or
below specified values of the classes. This can be determined from a “Cumulative Frequency
Distribution”.
Example 11. Consider the following frequency distribution with a class width of 4 (class boundaries 2.5 – 6.5, 6.5 – 10.5, etc.)
Class (xi)   Frequency (fi)   Less than Cumulative Frequency (<cfi)   More than Cumulative Frequency (>cfi)
3 – 6        4                4                                       30
7 – 10       7                11                                      26
11 – 14      10               21                                      19
15 – 18      6                27                                      9
19 – 22      3                30                                      3
This means that, from the “less than” cumulative frequency distribution, there are 4 observations less
than 6.5, 11 observations below 10.5, etc., and from the “more than” cumulative frequency
distribution, 30 observations are above 2.5, 26 above 6.5, etc.
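Both cumulative columns of Example 11 are running totals of the class frequencies, taken from the top for “less than” and from the bottom for “more than”. A minimal sketch in Python, using the standard itertools.accumulate function (the variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies of Example 11 (classes 3-6, 7-10, 11-14, 15-18, 19-22)
freqs = [4, 7, 10, 6, 3]

# "Less than" cumulative frequency: running total from the first class down
less_than = list(accumulate(freqs))

# "More than" cumulative frequency: running total from the last class up
more_than = list(accumulate(reversed(freqs)))[::-1]

print(less_than)   # [4, 11, 21, 27, 30]
print(more_than)   # [30, 26, 19, 9, 3]
```

The last “less than” value and the first “more than” value both equal the total number of observations.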
$$Rf_i = \frac{f_i}{n}$$
Where Rfi – is the relative frequency of the ith class
fi – is the frequency of the ith class
n – is the total number of observations
Note: $Pf_i = Rf_i \times 100\%$
Where Pfi is the percentage frequency of each class.
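As an illustration, the relative and percentage frequencies can be computed for the distribution of Example 11. The choice of that example here is an assumption made for the sketch, and the variable names rf and pf are illustrative.

```python
# Class frequencies of Example 11
classes = ["3 - 6", "7 - 10", "11 - 14", "15 - 18", "19 - 22"]
freqs = [4, 7, 10, 6, 3]

n = sum(freqs)                    # total number of observations
rf = [f / n for f in freqs]       # relative frequency: Rfi = fi / n
pf = [r * 100 for r in rf]        # percentage frequency: Pfi = Rfi x 100%

for c, f, r, p in zip(classes, freqs, rf, pf):
    print(f"{c:<8} f = {f:<3} Rf = {r:.3f}  Pf = {p:.1f}%")
```

The relative frequencies always add up to 1, and the percentage frequencies to 100%.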
QUESTIONS
a) A frequency distribution is the organization of raw data, in table form, that lists values or
classes with their corresponding frequencies.
b) The mid point of a class can be obtained by adding the upper and lower limits, and
dividing by 2.
c) If the gap between any two successive classes is one and the limits of a class are 10-19,
then the width of the class is 9.
d) If the limits of a class in a frequency distribution are 26-30, then the boundaries are
25.5-30.5.
e) When data is first collected, it is called raw data.
f) A frequency distribution should contain between 50 and 100 classes.
g) It is not important to keep the width of each class the same in a frequency distribution.
32 21 28 31 35 46 48 49 49 48
36 37 22 31 28 34 20 45 44 48
38 33 33 23 28 29 33 26 36 30
43 42 32 36 24 27 27 32 45 45
39 39 38 32 33 25 30 28 37 36
42 43 38 40 35 34 20 30 36 32
40 38 38 40 46 36 35 21 31 35
41 42 39 40 46 44 32 37 22 27
41 39 40 38 44 45 48 36 32 23
40 41 40 44 49 49 49 49 37 33
Construct a Grouped Frequency Distribution (GFD) with five classes for the above data.
PRESENTATION OF DATA
The aim of this section is to study how to construct and present data using different types of
graphs, charts, and diagrams that facilitate comparisons and, in general, give a good overall
picture of the data.
INTRODUCTION
This unit deals with organizing a set of raw data into a Frequency Distribution (FD)
and describing the distribution graphically in a histogram, a frequency polygon, and a cumulative
frequency curve (ogive). Other types of numerical information will be summarized and
presented in the form of a bar chart, pie chart or pictogram.
Definition:
A. HISTOGRAM
After you complete a frequency distribution, your next step will be to construct a “picture” of
these data values using a histogram. A histogram is a graph consisting of a series of adjacent
rectangles whose bases are equal to the class width of the corresponding classes and whose
heights are proportional to the corresponding class frequencies. Here, class boundaries are
marked along the horizontal axis (x – axis) and the class frequencies along the vertical axis ( y –
axis) according to a suitable scale. It describes the shape of the data. You can use it to quickly answer
such questions as: Are the data symmetric? Where do most of the data values lie?
Example 1. Consider the following GFD and construct a histogram.
Class (xi) Frequency (fi)
3–6 4
7 – 10 7
11 – 14 10
15 – 18 6
19 - 22 3
Total 30
Solution:
[Figure: Histogram for the above distribution; vertical axis: Class frequency (fi)]
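The rectangles of the histogram for Example 1 can be described numerically before they are drawn: each bar runs from the lower to the upper class boundary and its height is the class frequency. The sketch below is only an illustration; the function name histogram_bars and the text rendering with "#" are assumptions, not the course's drawing method.

```python
# GFD of Example 1
classes = [(3, 6), (7, 10), (11, 14), (15, 18), (19, 22)]
freqs = [4, 7, 10, 6, 3]

def histogram_bars(classes, freqs):
    """Return one dict per bar: left/right class boundaries, width, height."""
    u = classes[1][0] - classes[0][1]     # gap between successive classes
    bars = []
    for (lcl, ucl), f in zip(classes, freqs):
        lcb, ucb = lcl - u / 2, ucl + u / 2
        bars.append({"left": lcb, "right": ucb, "width": ucb - lcb, "height": f})
    return bars

bars = histogram_bars(classes, freqs)
for bar in bars:
    # crude text picture: one "#" per observation in the class
    print(f"{bar['left']:>5} to {bar['right']:<5} | {'#' * bar['height']} ({bar['height']})")
```

Since the bars share boundaries, they are adjacent with no gaps, which is what distinguishes a histogram from a bar chart.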
5 – 10 4
10 – 15 7
15 – 20 9
20 – 25 12
25 - 30 6
30 – 35 5
B. FREQUENCY POLYGON
It is a line graph of frequency distribution. Although a histogram does demonstrate the shape of
the data, perhaps the shape can be more clearly illustrated by using a frequency polygon. Here,
you merely connect the centers of the tops of the histogram bars (located at the class midpoints)
with a series of straight lines. The resulting figure is a frequency polygon. Here the class marks
are plotted along the x – axis and the class frequencies along the y – axis. Empty classes are
included at each end so that the curve will anchor to the x – axis.
Example 2. Construct a frequency polygon for the frequency distribution given in Example 9
Solution:
[Figure: A frequency polygon for the distribution in Example 9; x-axis: Class marks (cmi), y-axis: frequency (fi)]
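The vertices of a frequency polygon can be listed before plotting. The sketch below assumes that Example 9 refers to the GFD with classes 16 – 25, ..., 66 – 75 constructed earlier, and takes the class mark as the midpoint of the class limits; the function name polygon_vertices is hypothetical.

```python
# GFD assumed to be that of Example 9 (class limits and frequencies)
classes = [(16, 25), (26, 35), (36, 45), (46, 55), (56, 65), (66, 75)]
freqs = [7, 2, 6, 5, 6, 4]

def polygon_vertices(classes, freqs):
    """Return (class mark, frequency) points, with an empty class added
    at each end so that the polygon closes on the x-axis."""
    marks = [(lcl + ucl) / 2 for lcl, ucl in classes]   # class marks
    step = marks[1] - marks[0]                          # distance between marks
    points = list(zip(marks, freqs))
    return [(marks[0] - step, 0)] + points + [(marks[-1] + step, 0)]

vertices = polygon_vertices(classes, freqs)
for x, y in vertices:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon; the first and last points lie on the x-axis.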
CYP 2 Construct a frequency polygon for the frequency distribution given under CYP 1
C. OGIVE (CUMULATIVE FREQUENCY CURVE)
An ogive is the graphic representation of a cumulative frequency distribution. Ogives are of two kinds:
the “less than” ogive (< ogive) and the “more than” ogive (> ogive).
A) “Less than” ogive: here, upper class boundaries are plotted against the “less than”
cumulative frequencies of the respective classes, and the points are joined by straight lines.
Example 3. Draw a “less than” ogive for the frequency distribution in Example 11
Solution:
[Figure: A less than ogive showing the frequency distribution above; x-axis: Upper class boundary (UCBi), y-axis: Less than cumulative frequency (<Cfi)]
B) “More than” ogive: here, lower class boundaries are plotted against the “more than”
cumulative frequencies of their respective classes, and the points are joined by straight lines.
Example 4. Draw a “more than” ogive for the frequency distribution in Example 11
Solution:
[Figure: A more than ogive for the above frequency distribution; x-axis: Lower class boundaries (LCBi), y-axis: More than cumulative frequency (>Cfi)]
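The points plotted in the two ogives of Example 11 can be generated from the class limits and frequencies. This is a minimal sketch; the function name ogive_points is an illustrative choice, and the gap between successive classes is assumed to be constant.

```python
from itertools import accumulate

# Frequency distribution of Example 11
classes = [(3, 6), (7, 10), (11, 14), (15, 18), (19, 22)]
freqs = [4, 7, 10, 6, 3]

def ogive_points(classes, freqs):
    """Return the points of the 'less than' and 'more than' ogives."""
    u = classes[1][0] - classes[0][1]                 # gap between classes
    ucb = [ucl + u / 2 for _, ucl in classes]         # upper class boundaries
    lcb = [lcl - u / 2 for lcl, _ in classes]         # lower class boundaries
    less_cf = list(accumulate(freqs))
    more_cf = list(accumulate(reversed(freqs)))[::-1]
    # 'less than': UCB against <cf ; 'more than': LCB against >cf
    return list(zip(ucb, less_cf)), list(zip(lcb, more_cf))

less_pts, more_pts = ogive_points(classes, freqs)
print(less_pts)
print(more_pts)
```

The “less than” ogive rises from left to right, while the “more than” ogive falls.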
D. LINE GRAPH
It represents the relationship between time (on the x-axis) and the values of a variable (on the y-axis).
The values are recorded with respect to the time of occurrence.
Values 20 10 30 15 1
Solution:
[Figure: A line graph showing the above time series; x-axis: Year (1986 to 1991), y-axis: Values]
This is a graphical representation of discrete data (or characteristics expressed with whole numbers)
with respect to their frequencies. Vertical solid lines are used to indicate the frequencies.
Number of children 3 2 7 6 4
Solution:
[Figure: Vertical line graph of the above data; x-axis (X): A to E, y-axis (Y): 1 to 7]
Histograms, frequency polygons and ogives are used for data having an interval or ratio level of
measurement. Other ways of presenting statistical data, suitable for particular kinds of
situations, are bar charts, pie charts and pictographs.
A bar chart is a series of equally spaced bars of uniform width where the height (length) of a bar
represents the amount (magnitude) of frequency corresponding to a category. Bars may be
drawn horizontally or vertically. Vertical bar graphs are preferred as they allow comparison with
other bars.
Example 18: The revenue (in millions of Birr) of company X from 1980 to 1982 is given below.
Year Revenue
1980 50
1981 150
1982 200
Solution:
[Figure: A simple bar chart showing revenues of company X from 1980 to 1982; x-axis: year, y-axis: Revenue]
Example 19: The following table shows the production of wheat and maize in hundreds of
quintals.
Year Wheat Maize
1980 40 80
1981 20 60
1982 60 100
Solution:
[Figure: The number of quintals (in thousands) of wheat and maize production; x-axis: Year, y-axis: Number of quintals; legend: maize, wheat]
Example 20: The number of quintals of wheat and maize (in millions of quintals) produced by
country x in the indicated years.
Solution:
[Figure: The number of quintals of wheat and maize produced by country X; x-axis: Year (1980 to 1982), y-axis: Number of quintals; legend: Maize, Wheat]
D. PERCENTAGE BAR CHART:
It is a subdivided bar chart where percentages are used in each classification rather than the actual
frequencies.
Example 21: Construct a percentage bar chart for the data in Example 19.
Solution:
Year % of Wheat Production % of Maize Production
[Figure: Percentage bar chart; x-axis: Year (1980 to 1982), y-axis: 0% to 100%; legend: wheat, maize; segment labels: 22 and 78 (1980), 50 and 50 (1981), 40 and 60 (1982)]
G. PIE CHART
A pie chart is a circle divided into various sectors with areas proportional to the values of the
components they represent. It shows the components in terms of percentages, not in absolute
magnitude. The angle formed at the center by each sector has to be proportional to the value
represented.
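The percentage and the central angle of each sector follow directly from the proportions. The sketch below uses made-up expenditure figures purely for illustration (they are hypothetical and are not the data of any example in this chapter), and the function name pie_sectors is likewise an assumption.

```python
def pie_sectors(values):
    """For each item return (percentage, angle in degrees),
    where angle = 360 x (value / total)."""
    total = sum(values.values())
    return {item: (v * 100 / total, v * 360 / total) for item, v in values.items()}

# Hypothetical monthly expenditures (illustrative values only)
spending = {"Food": 500, "Rent": 250, "Clothing": 150, "Other": 100}

sectors = pie_sectors(spending)
for item, (percent, degrees) in sectors.items():
    print(f"{item:<9} {percent:5.1f}%  {degrees:6.1f} degrees")
```

Whatever the data, the percentages add up to 100 and the angles to 360 degrees.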
Example 22: The monthly expenditure of a certain family is given below.
Items Expenditure % Proportion (Pfi) Degrees (360° × Rfi)
[Figure: Pie chart of the monthly expenditure; sectors: Food, House rent, Clothing, Misc.; sector labels: 300, 350, 100, 250]
H. PICTOGRAPH (PICTOGRAM)
Example 23: In comparing the population of a country from 1990 to 1992, we simply draw
pictures of people where each picture may represent 1,000,000 people.
[Figure: Pictograph with one row of pictures for each of the years 1990 and 1991]
Summary
This unit discussed how to present organized data. Once a frequency distribution is
constructed, the representation of the data by using graphs is a simple task. The most commonly
used graphs in research statistics are the histogram, the frequency polygon and the ogive. Other
graphs and diagrams, like bar charts, pie charts and pictograms, can also be used. Some of
these graphs are seen frequently in newspapers, magazines, and various statistical reports.
CYP 1
[Figure: Histogram; x-axis (X): Class boundaries (CBi), 5 to 35; y-axis (y): frequency]
CYP 2
[Figure: x-axis: Class Marks (cmi), 2.5 to 37.5; y-axis (y): Cumulative Frequency]
QUESTION
Class Boundaries (CBi) Frequency fi
5.5-10.5 1
10.5-15.5 2
15.5-20.5 3
20.5-25.5 5
25.5-30.5 4
30.5-35.5 3
35.5-40.5 2
4. Construct a horizontal and a vertical bar chart for the areas (in square kilometres) of each of the
great lakes in Ethiopia.
Lake Area (km²)
Tana 3600
Abaya 1160
Chamo 551
Ziway 434
Shala 409