Data Collection + Data Presentation
Statistical data is collected from different sources and different methods are adopted to collect
adequate and reliable data. The data is collected in order to conduct some inquiries and to
analyse some problems.
STATISTICAL INQUIRIES
In order to collect data for a particular investigation it is important to keep in mind the
following:
i. OBJECT AND SCOPE OF INQUIRY
Every investigation has objectives to be achieved. Therefore, the objective and scope must
be determined beforehand, i.e. what the objective of the inquiry is, and from where, from
whom and when the data should be collected.
A semi-official inquiry is one made by bodies that are supported by the government.
A non-official inquiry is conducted by private bodies.
a. Physical units
These are units which are used in day-to-day life, e.g. kg, metres, litres, centimetres etc.
b. Arbitrary units
These are units adopted by statisticians for their own use in statistics e.g. salaries, wages,
workers, sales etc.
CLASSIFICATION OF DATA
Depending on the source, data can be classified as primary data or secondary data.
Primary Data
It is data collected for the first time, whether directly or indirectly. It is original in character
and shape, e.g. a census in Kenya. The following methods are used to collect primary data:
1. Questionnaire:
It is the most commonly used method in surveys. A questionnaire is a list of questions, either
open-ended or closed-ended, to which the respondents give answers. A questionnaire can be
administered by telephone, by mail, in person in a public area or in an institution, through
electronic mail, through fax and by other methods.
Types of Questionnaires
a. Open ended or unstructured Questionnaire
These are questions that allow the respondents to express their views in a free-flowing
manner.
With such questions, respondents do not have to follow set criteria for answering and can
truly express their beliefs and suggestions.
An ideal questionnaire includes open-ended questions and also invites feedback and
suggestions for future improvements.
Disadvantages
i. Time-consuming
ii. Coding: very costly and slow to process
The respondent is restricted to expressing opinions through the options set by the surveyor.
The respondent therefore selects one or more options from a pre-determined set of responses.
Disadvantages:
i) Loss of spontaneous response
ii) Bias in answer categories
iii) May irritate respondents
iv) Non-confidential: Questions used should be non-confidential in nature because no
one would like to answer personal questions.
v) Relevant questions: The questions should be relevant to the problem under
investigation.
vi) Definiteness: Questions should be framed in such a way that the answers to them
are perfectly definite, i.e. in the form of ‘yes’ or ‘no’.
2. Interview
An interview is a face-to-face conversation with the respondent. The main problem arises
when the respondent deliberately hides information; otherwise it is an in-depth source of
information. The interviewer can not only record what the interviewee says but can also
observe body language, expressions and other reactions to the questions. This enables the
interviewer to draw conclusions easily.
Advantages
i) Information collected by this method is reliable and accurate
ii) It is a good method for intensive investigation
iii) The interviewer can explain part of the questions not understood
by the respondents
iv) It is a quick method of obtaining information
v) It gives satisfactory results provided the scope of inquiry is narrow
Disadvantages
i) It requires a lot of money and time
ii) Sometimes the respondent may not be willing to answer the questions
iii) The method is not suitable for extensive inquiry
3. Observation
Observation can be done with the observed person aware that he or she is being observed,
or without his or her knowledge. Observations can be made in natural settings as well as in
artificially created environments.
Advantages
i) Data collected is highly reliable
ii) It is a cheap method
iii) It gives more relevant and accurate information
Disadvantages
i) The presence of an investigator may make the person being observed behave
differently
ii) The results may be different under different conditions
4. Sampling
In this method a part of the population (a sample) is selected to represent the whole. In
random sampling each unit of the population has an equal chance of being selected.
SECONDARY DATA
Data collected from a source that has already been published in any form is called
secondary data. The review of literature in any research is based on secondary data. It is
collected by someone else for some other purpose (but is utilized by the investigator for
another purpose). For example, census data may be used to analyse the impact of education
on career choice and earnings.
Common sources of secondary data for social science include censuses, organizational records
and data collected through qualitative methodologies or qualitative research.
Importance of Secondary Data:
Secondary data can be less valid than primary data, but it is still important.
Sometimes it is difficult to obtain primary data; in these cases getting information
from secondary sources is easier and possible.
Sometimes primary data does not exist; in such a situation one has to confine the
research to secondary data.
Sometimes primary data is present but the respondents are not willing to reveal it; in
such a case too secondary data can suffice. For example, if the research is on the
psychology of transsexuals, it is first difficult to find transsexual respondents and,
second, they may not be willing to give the information you want for your research, so
you can collect data from books or other published sources.
A clear benefit of using secondary data is that much of the background work needed
has already been carried out. For example, literature reviews and case studies might have
been carried out, published texts and statistics could already have been used
elsewhere, and media promotion and personal contacts may also have been utilized. This
wealth of background work means that secondary data generally have a pre-established
degree of validity and reliability which need not be re-examined by the
researcher who is re-using such data.
Furthermore, secondary data can also be helpful in the research design of subsequent
primary research and can provide a baseline with which the collected primary data
results can be compared. Therefore, it is always wise to begin any research activity
with a review of the secondary data.
Errors in statistics
A mistake in statistics refers to incorrect presentation or calculation due to human factors.
Mistakes may occur in the collection of data, e.g. a respondent might mistakenly tick the
‘yes’ box instead of the ‘no’ box. Mistakes can also occur when the collected data is plotted
incorrectly on a graph.
An error refers to the difference between the actual figure and the estimated figure. The
deviation arises by chance and not from human carelessness. Normally errors arise
due to approximation or rounding off of figures.
Types of errors
1. Sampling error
It refers to the difference between the actual value and the estimated value as obtained from
the sample. The amount of the sampling error will depend on the size of the sample.
Therefore the greater the sample size the smaller the size of sampling error and vice versa.
3. Biased errors
These are errors which arise due to the bias on the part of the investigator, enumerator or the
instrument.
4. Unbiased errors
These are errors which arise by chance in the usual course of the investigation. These errors
are compensatory in nature since negative and positive errors tend to cancel each other, so
the estimated value is usually close to the actual value.
Measurement of Errors
Errors are measured absolutely or relatively.
Absolute error
It is the difference between the actual value (A) and the estimated value (E).
Ae = A - E
Example
The actual sales of an enterprise amounted to Ksh.987,500 but the estimated sales were
Ksh.1,000,000. Find the absolute error.
Ae = A - E
= 987,500 - 1,000,000
= -12,500
Relative error
Re = Ae / A = Absolute error / Actual value
Re = Ae / A = -12,500 / 987,500
= -0.013
NB: Relative error can be expressed as a percentage by multiplying it by 100. When relative
error is expressed as a percentage it is known as percentage error.
Example
Assume that the population of a town was estimated as 1,424,880 whereas the actual
population was 1,578,620. Find the following:
a) Absolute error
b) Relative error
c) Percentage error
Solution
Absolute error (Ae) = A - E
= 1,578,620 - 1,424,880
= 153,740
Relative error (Re) = Ae / A
= 153,740 / 1,578,620
= 0.1 (to one decimal place)
Percentage error = 0.1 × 100 = 10%
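The error measures above can be checked with a short script. The following is a minimal sketch in Python; the function names are illustrative assumptions and do not come from these notes.

```python
def absolute_error(actual, estimate):
    """Absolute error: Ae = A - E."""
    return actual - estimate


def relative_error(actual, estimate):
    """Relative error: Re = Ae / A."""
    return absolute_error(actual, estimate) / actual


def percentage_error(actual, estimate):
    """Relative error expressed as a percentage."""
    return relative_error(actual, estimate) * 100


# Sales example: actual Ksh. 987,500, estimate Ksh. 1,000,000
sales_ae = absolute_error(987_500, 1_000_000)
sales_re = relative_error(987_500, 1_000_000)

# Population example: actual 1,578,620, estimate 1,424,880
pop_ae = absolute_error(1_578_620, 1_424_880)
pop_re = relative_error(1_578_620, 1_424_880)
pop_pe = percentage_error(1_578_620, 1_424_880)

print(sales_ae, round(sales_re, 3))
print(pop_ae, round(pop_re, 3), round(pop_pe, 1))
```

Carried through without intermediate rounding, the percentage error in the population example is about 9.7%. Rounding the relative error to 0.1 first gives the 10% shown in the solution.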
ORGANISATION OF DATA
Data organisation refers to classification and tabulation of data. The collected data is mostly
large in quantity and it is necessary to organise data in such a way that further analysis and
interpretation of data is made easily and correctly.
Classification of Data
It refers to arranging of data in groups or classes according to some resemblance of the data in
each group or class. In data classification the elements which possess the same characteristics
are grouped in one class and therefore the whole data is divided into a number of classes.
Tabulation of data
It refers to the systematic arrangement of statistical data in columns and rows.
Example
Out of the total number of 2,500 women who were interviewed for employment in a factory,
1,500 were married and the rest unmarried. Amongst the married women 900 were
experienced and the rest inexperienced while from the unmarried 300 were experienced.
Present the information in tabular form.
Solution
Job interview
             Experienced   Inexperienced   Total
Married      900           600             1,500
Unmarried    300           700             1,000
Total        1,200         1,300           2,500
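The missing cells of a two-way table like the one above follow by subtraction from the figures given in the question. The sketch below shows that reasoning in Python; the variable names and the dictionary layout are illustrative choices, not part of the notes.

```python
# Figures stated in the question
total_women = 2500
married = 1500
married_experienced = 900
unmarried_experienced = 300

# Figures derived by subtraction
unmarried = total_women - married

table = {
    "Married": {
        "Experienced": married_experienced,
        "Inexperienced": married - married_experienced,
    },
    "Unmarried": {
        "Experienced": unmarried_experienced,
        "Inexperienced": unmarried - unmarried_experienced,
    },
}

# Row totals
for row in table.values():
    row["Total"] = row["Experienced"] + row["Inexperienced"]

# Column totals
columns = ("Experienced", "Inexperienced", "Total")
table["Total"] = {
    col: table["Married"][col] + table["Unmarried"][col] for col in columns
}

# Print the table with aligned columns
print(f"{'':<10}" + "".join(f"{col:>15}" for col in columns))
for name, row in table.items():
    print(f"{name:<10}" + "".join(f"{row[col]:>15,}" for col in columns))
```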
Example
The following report was prepared by an examination officer on the performance of Meru
Central district in a national examination. Out of 4,000 male candidates below 20 years of age
3,000 passed and 1,000 failed. Of the 1,100 male candidates 20 years old and over 500 passed
and 600 failed. As regards the female candidates out of 600 below 20 years of age 400 passed
and 200 failed. Of the 350 females 20 years old and over 100 passed and 250 failed. Present
the information in a tabular form.
Solution
Question
On 1st January 2010, a company had 50 employees. Among them 18 were women. During the
year 9 employees left and 5 of those were men. The total of new employees in the year was 11
out of whom 4 were women. During the year 2011, 2 men left employment and 14 men and 4
women joined the work force. Present the information in a tabular form.
Statistical series
It refers to arrangement of statistical data in a systematic manner.
Types of statistical series
1. Spatial series
It refers to the data that is arranged in relation to geographical location. e.g.
Town       Profit (Sh. '000')
Nairobi    42,590
Mombasa    11,243
Meru       2,348
2. Time series
It is data arranged with respect to time e.g.
Year Sales (sh.million)
2011 324
2012 489
2013 1,128
2014 2,056
3. Condition series
It is data arranged with respect to a specific condition such as examination marks, height,
weight, expenditure etc e.g.
FREQUENCY DISTRIBUTION
It refers to grouping of statistical data according to size or magnitude. A frequency
distribution will consist of class intervals and their corresponding frequencies.
Class interval: It determines how large a class is. In order to find the class interval we find
the range and divide it by the number of classes.
Class interval = Range / Number of classes
Class limits: for example, in the class 15-19, 15 is the lower limit and 19 is the upper limit.
Frequency in each class: the number of values falling in a particular class is called the
frequency of that class. It is counted using strokes (tally marks) on a tally sheet.
Example
The following data shows the number of children in the families of 32 employees:
5, 8, 3, 4, 2, 1, 4, 3, 3, 4, 1, 2, 7, 5, 6, 4, 5, 5, 4, 5, 8, 2, 1, 2, 2, 4, 3, 6, 0, 4, 7, 6
Construct a frequency distribution table from the data.
Solution
Number of children   Frequency
0                    1
1                    3
2                    5
3                    4
4                    7
5                    5
6                    3
7                    2
8                    2
Total                32
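The tally for this example can be reproduced by counting how often each value occurs. The sketch below uses Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Number of children in the families of the 32 employees
children = [5, 8, 3, 4, 2, 1, 4, 3, 3, 4, 1, 2, 7, 5, 6, 4,
            5, 5, 4, 5, 8, 2, 1, 2, 2, 4, 3, 6, 0, 4, 7, 6]

# Counter plays the role of the tally sheet: one count per distinct value
frequency = Counter(children)

print("Number of children  Frequency")
for value in sorted(frequency):
    print(f"{value:<20}{frequency[value]}")
print(f"{'Total':<20}{sum(frequency.values())}")
```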
Question
The following data represents the number of refrigerators sold on 22 working days by a
leading company.
23, 30, 40, 23, 23, 28, 30, 30, 40, 40, 30, 30, 20, 20, 26, 28, 40, 26, 23, 20, 20, 20
Construct a frequency distribution table for the data.
a. Inclusive form of grouping/ Inclusive method
Here both the lower limit and the upper limit are included when placing items in a
class, e.g. if the first class is 1-9 then both the values 1 and 9 are included.
Example
The following data relates to marks of 30 students in a statistics test.
10, 36, 40, 30, 26, 20, 19, 10, 10, 16, 19, 27, 15, 26, 20, 19, 7, 44, 33, 21, 26, 27, 6, 20, 11, 37,
37, 30, 20, 5
Construct a frequency distribution table with 8 classes using exclusive method.
Solution
Number of classes = 8
Range = 44 - 5 = 39
Class interval = 39/8 = 4.875, rounded up to 5
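The class interval calculation, and the grouping it leads to, can be sketched in Python as follows. Starting the first class at the smallest mark (5) is an assumption made for illustration, since the notes do not show the finished table. In the exclusive method the lower limit of a class is included and the upper limit is excluded.

```python
import math

marks = [10, 36, 40, 30, 26, 20, 19, 10, 10, 16, 19, 27, 15, 26, 20,
         19, 7, 44, 33, 21, 26, 27, 6, 20, 11, 37, 37, 30, 20, 5]
number_of_classes = 8

data_range = max(marks) - min(marks)                # 44 - 5 = 39
width = math.ceil(data_range / number_of_classes)   # 39 / 8 = 4.875, rounded up to 5

# Assumption: the first class starts at the smallest observation
start = min(marks)
classes = [(start + i * width, start + (i + 1) * width)
           for i in range(number_of_classes)]

# Exclusive method: lower <= mark < upper
frequencies = [sum(1 for m in marks if lower <= m < upper)
               for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower}-{upper}: {f}")
print("Total:", sum(frequencies))
```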
Example
The following data relate to marks of 60 applicants who were given a certain test for the
purpose of selection to a particular post.
41, 17, 83, 63, 58, 92, 60, 58, 70, 57, 67, 82, 33, 44, 51, 49, 34, 73, 54, 6, 36, 52, 32, 75, 60,
33, 9, 79, 28, 63, 42, 93, 43, 80, 3, 32, 57, 67, 84, 30, 63, 11, 35, 28, 10, 23, 8, 41, 60, 64, 72,
53, 92, 88, 62, 55, 60, 33, 40, 32
Construct a frequency distribution table using inclusive method. Take the first class as 0-9
Solution
Class     Frequency
0-9       4
10-19     3
20-29     3
30-39     10
40-49     7
50-59     9
60-69     11
70-79     5
80-89     5
90-99     3
Total     60
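The inclusive grouping in this example can be checked with a few lines of Python. This is a sketch with illustrative names; each class counts every mark from its lower limit up to and including its upper limit.

```python
marks = [41, 17, 83, 63, 58, 92, 60, 58, 70, 57, 67, 82, 33, 44, 51,
         49, 34, 73, 54, 6, 36, 52, 32, 75, 60, 33, 9, 79, 28, 63,
         42, 93, 43, 80, 3, 32, 57, 67, 84, 30, 63, 11, 35, 28, 10,
         23, 8, 41, 60, 64, 72, 53, 92, 88, 62, 55, 60, 33, 40, 32]

# Inclusive classes 0-9, 10-19, ..., 90-99
classes = [(lower, lower + 9) for lower in range(0, 100, 10)]

# Inclusive method: lower <= mark <= upper
frequencies = [sum(1 for m in marks if lower <= m <= upper)
               for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower}-{upper}: {f}")
print("Total:", sum(frequencies))
```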
Question
Group the following data taking the class interval of 5 in:
a. Exclusive form of grouping
b. Inclusive form of grouping
2, 4, 1, 3, 5, 7, 9, 2, 13, 15, 18, 11, 14, 10, 12, 16, 7, 6, 19, 22, 11, 23, 22, 24, 2, 5, 3, 4, 3, 2
Question
The following data shows the amount of unsecured personal loans in thousands of shillings
from a commercial bank.
700 450 725 1,125 625 1,650 750 400 1,050
500 750 850 1,250 725 475 925 1,050 925
850 625 900 1,750 700 825 550 925 850
475 750 550 725 575 575 1,450 700 450
700 1,650 925 500 675 1,300 1,125 775 850
Required:
i) Frequency distribution table of the data with Sh.200 thousand class interval
ii) The mean of the personal loans
iii) The standard deviation of the personal loans
iv) An ogive for the personal loans data.
Limitations/ Disadvantages
1. Diagrams do not give an accurate result, only a rough idea.
2. A technically skilled person can construct a diagram, but a layperson may not construct
it correctly.
3. Diagrams take more time to construct than tables.
4. This method of data presentation is very expensive.
5. Many people are not accustomed to diagrams and therefore they do not attach much
importance to them.
6. Comparison of diagrams may not be possible unless the units used are the same.
Types of diagrams
1. Bar charts
2. Pie chart
3. Pictogram
In this bar chart, the component figures are shown as separate bars adjoining each other.
The height of each bar represents the actual value of the component figure. These charts are
suitable when the totals of the components are not required.
Disadvantages
a) They are not very informative.
b) They are restricted to three or four component figures only.
Example
The national income statistics of a country for 3 years are given in the following table:
Solution
[Figure: A simple bar chart showing National Income Statistics. Y axis: National Income (Sh. million), 0 to 500; X axis: Years (2015, 2016, 2017).]
Scale
Y axis: 1 cm represents Sh. 50 million
[Figure: A Component Bar Chart Showing National Income Statistics. Components: Agriculture, Industry and Other sectors. Y axis: National Income (Sh. million), 0 to 500; X axis: Years (2015, 2016, 2017).]
Scale
Y axis: 1 cm represents Sh. 50 million
[Figure: bar chart on a percentage scale (0 to 100) with components Agriculture, Industry and Other sectors. X axis: Years (2015, 2016, 2017).]
Scale
Y axis: 1 cm represents 20%
[Figure: bar chart showing Agriculture, Industry and Other sectors on a scale of 0 to 250. X axis: Years (2015, 2016, 2017).]
Scale
Y axis: 1 cm represents Sh. 50 million
Question
ABC Limited are manufacturers of biscuits, bread and cakes. Their sales for a period of four
years were as follows:
Sales (Sh. '000')
Year    Biscuits   Bread   Cakes
2016    50         80      40
2017    60         100     50
2018    70         110     30
2019    90         120     50
Question
The following data shows the number of different types of insurance policies issued in the
month of September 2019 by four insurance companies: Wyed Ltd., Xed Ltd., Yed Ltd., and
Zed Ltd.:
Insurance
Company
Required:
Present the above data using a component bar chart
2. Pie chart
It is a circle divided by lines into sectors so that the area of each sector is proportional to the
size of the figure represented. To find the angle of each component we use the following
formula:
Angle of sector = (Value of component / Total of all components) × 360°
When the angles of the various sectors are known, arrange them in ascending order of
magnitude and then draw the circle.
Disadvantages
Changes in the overall total cannot be shown by changing the size of the pie chart.
Example
Draw a pie chart from the following data
Country Production (units)
A 30,000
B 25,000
C 24,000
D 23,000
E 20,000
F 10,000
Solution
Country   Production (units)   Angle of sector (degrees)
A         30,000               82
B         25,000               68
C         24,000               65
D         23,000               63
E         20,000               55
F         10,000               27
Total     132,000              360
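The sector angles for this example can be worked out by taking each country's share of the total and multiplying by 360 degrees. A minimal Python sketch, with illustrative variable names:

```python
production = {"A": 30_000, "B": 25_000, "C": 24_000,
              "D": 23_000, "E": 20_000, "F": 10_000}

total = sum(production.values())  # 132,000 units

# Angle of sector = (value of component / total) x 360 degrees
angles = {country: units / total * 360
          for country, units in production.items()}

# Round to whole degrees for drawing with a protractor
rounded = {country: round(angle) for country, angle in angles.items()}

for country in production:
    print(f"{country}: {production[country]:>6,} units -> {rounded[country]} degrees")
print("Sum of angles:", sum(rounded.values()))
```

In this data set the rounded angles happen to add up to exactly 360. With other data, rounding may leave the sum a degree out, and one sector then has to be adjusted.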
[Figure: A Pie Chart showing Production in units of countries. Sectors: A 82°, B 68°, C 65°, D 63°, E 55°, F 27°.]
Question
The data below shows the number of households in a village using electricity over the past 5
years
Year Number of households
2010 6
2011 9
2012 12
2013 15
2014 18
Characteristics of a graph
1. A graph must be neat and clean
2. The graph must have a clear title
3. It must not be overcrowded with curves
4. The scale chosen along the x axis and y axis must be suitable according to the given
data.
5. The graph must give the correct impression
Principles/ Rules for construction of graphs
1. The independent variable should always be taken along the x-axis and the dependent
variable along the y-axis
2. The vertical scale should always start at zero and if not possible a break should be
shown in the scale between zero and the next number.
3. The scale chosen must be one which can easily accommodate the whole data.
4. Against each value of independent variable given, there is a corresponding value of
dependent variable.
5. A graph must have a clear and comprehensive title
6. The source of data must be given
7. If more than one graph is plotted on the same Cartesian plane, then a different type of
line should be used for each curve, e.g. a dotted line, a solid line etc.
8. The scale caption for x-axis is placed under the centre of the horizontal axis while the
scale caption for y-axis is placed at the top of y scale.
Types of Graphs
1. OGIVE CURVE
An ogive curve is the name given to the curve obtained when the cumulative frequency of a
distribution is graphed.
The following steps are followed when drawing an ogive curve:
1. Compute the cumulative frequency (CF) of the distribution
2. Prepare a graph with cumulative frequency on the y axis and class intervals on the x
axis
3. Plot a starting point at zero (0) on the y axis and the lower class limit of the first class
on the x axis.
4. Plot the cumulative frequency on the graph at the upper class limit of the classes to
which they refer. This cumulative frequency is called less than cumulative
frequency.
5. Join the points with a smooth curve.
NB: An ogive curve is used to find the values of median, quartiles, deciles and percentiles
graphically.
Example
The following distribution shows the daily wages of 100 employees.
Wages (sh.) Number of employees
0-30 20
30-60 35
60-90 30
90-120 15
Solution
Wages (sh.)   Number of employees   Cumulative frequency
0-30          20                    20
30-60         35                    55
60-90         30                    85
90-120        15                    100
[Figure: ogive curve of the cumulative frequencies. Y axis: cumulative frequency; X axis: Wages (Sh.), marked at 30, 60, 90 and 120.]
Scale
Y axis: 1 cm represents 20 CF
X axis: 1 cm represents sh.30
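The plotting points for this ogive can be computed as sketched below in Python. The helper `read_off` is a hypothetical name. It estimates a value such as the median by straight-line interpolation between neighbouring points, which only approximates a reading taken from a hand-drawn smooth curve.

```python
from itertools import accumulate

classes = [(0, 30), (30, 60), (60, 90), (90, 120)]
frequencies = [20, 35, 30, 15]

# Step 1: "less than" cumulative frequencies
cumulative = list(accumulate(frequencies))

# Steps 3 and 4: start at (lower limit of first class, 0), then plot each
# cumulative frequency against the upper limit of its class
points = [(classes[0][0], 0)] + [
    (upper, cf) for (_, upper), cf in zip(classes, cumulative)
]


def read_off(target_cf, points):
    """Estimate the x value at a given cumulative frequency by
    linear interpolation between neighbouring ogive points."""
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 < cf1 and cf0 <= target_cf <= cf1:
            return x0 + (target_cf - cf0) / (cf1 - cf0) * (x1 - x0)
    raise ValueError("target is outside the cumulative frequency range")


median = read_off(cumulative[-1] / 2, points)  # value at the 50th employee

print(points)
print(round(median, 2))
```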
Example
The table below shows the age distribution of employees of XYZ Limited
Age group Frequency
21-25 5
26-30 12
31-35 23
36-40 39
41-45 32
46-50 21
51-55 9
56-60 2
Draw an ogive curve using the above data.
Solution
Age group   Frequency   Class boundaries   Cumulative Frequency
21-25       5           20.5-25.5          5
26-30       12          25.5-30.5          17
31-35       23          30.5-35.5          40
36-40       39          35.5-40.5          79
41-45       32          40.5-45.5          111
46-50       21          45.5-50.5          132
51-55       9           50.5-55.5          141
56-60       2           55.5-60.5          143
Total       143
[Figure: ogive curve. Y axis: Cumulative Frequency (0 to 140); X axis: Age Group, marked at the class boundaries 20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5, 55.5 and 60.5.]
Scale
Y axis: 1 cm represents 20 CF
X axis: 1 cm represents 5
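When classes are given in inclusive form, as in this example, the class boundaries are found by halving the gap between the upper limit of one class and the lower limit of the next. The Python sketch below reproduces the boundaries and cumulative frequencies of the solution table; the variable names are illustrative.

```python
from itertools import accumulate

age_groups = [(21, 25), (26, 30), (31, 35), (36, 40),
              (41, 45), (46, 50), (51, 55), (56, 60)]
frequencies = [5, 12, 23, 39, 32, 21, 9, 2]

# Gap between the upper limit of one class and the lower limit of the next
gap = age_groups[1][0] - age_groups[0][1]  # 26 - 25 = 1

# Move each limit outward by half the gap to close it
boundaries = [(lower - gap / 2, upper + gap / 2) for lower, upper in age_groups]

# "Less than" cumulative frequencies, plotted at the upper class boundaries
cumulative = list(accumulate(frequencies))

for (low, high), f, cf in zip(boundaries, frequencies, cumulative):
    print(f"{low}-{high}  {f:>3}  {cf:>4}")
```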
Question
The table below shows the profit made by 160 companies in the manufacturing industry for
the year ended December 2014.
[Figure: A Percentage Cumulative Frequency curve showing Age Distribution of 500 employees. Y axis: % Cumulative Frequency; X axis: Age Group (Years).]
Scale
Y axis: 1 cm represents 20%
X axis: 1 cm represents 5 years
Question
Using the following data draw a percentage cumulative frequency curve
2. HISTOGRAM
It is a graph that represents the class frequencies in a frequency distribution by vertical
rectangles. It consists of a series of rectangles, each having a base measured along the x axis
that is proportional to the class interval and a height measured along the y axis that represents
the class frequency.
Example
Present the following data by means of a frequency histogram
Class Frequency
0-10 5
10-20 11
20-30 19
30-40 21
40-50 16
50-60 8
60-70 10
70-80 6
80-90 3
90-100 1
Solution
[Figure: A Frequency Histogram. Y axis: Frequency; X axis: Class Interval (0-10 to 90-100).]
Scale
Y axis: 1 cm represents 2
X axis: 1 cm represents 10
Question
The following table gives a frequency distribution of mass of 40 objects
Mass frequency
10-14 3
15-19 4
20-24 10
25-29 12
30-34 6
35-39 3
40-44 2
Draw a histogram
3. FREQUENCY POLYGON
It is a line graph drawn from a histogram by joining the midpoints of the tops of the class
interval rectangles. The points of the frequency polygon are joined with straight lines.
The frequency polygon encloses the same area as the histogram because it includes as much
area from outside the histogram as it leaves out from the inside.
Example
Use the following data to draw a frequency polygon
Class interval Frequency
0-10 3
10-20 5
20-30 7
30-40 12
40-50 15
50-60 8
60-70 4
Solution
Class interval Frequency Midpoint
0-10 3 5
10-20 5 15
20-30 7 25
30-40 12 35
40-50 15 45
50-60 8 55
60-70 4 65
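The midpoints in the table above are the averages of the class limits. The Python sketch below computes them and also checks the equal-area property of the frequency polygon. Closing the polygon on the x axis at the midpoints of the empty classes on either side (-5 and 75 here) is an assumption of this sketch, not something the notes state.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [3, 5, 7, 12, 15, 8, 4]

# Midpoint = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]
polygon_points = list(zip(midpoints, frequencies))

# Assumption: close the polygon at the midpoints of the neighbouring empty classes
width = classes[0][1] - classes[0][0]
closed = ([(midpoints[0] - width, 0)] + polygon_points
          + [(midpoints[-1] + width, 0)])

# Area under the polygon, summed trapezium by trapezium
polygon_area = sum((y0 + y1) / 2 * (x1 - x0)
                   for (x0, y0), (x1, y1) in zip(closed, closed[1:]))

# Area of the histogram: frequency times class width for each rectangle
histogram_area = sum(f * (upper - lower)
                     for (lower, upper), f in zip(classes, frequencies))

print(polygon_points)
print(polygon_area, histogram_area)
```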
Direct construction
[Figure: A Frequency Polygon showing Class Interval and Frequency. Y axis: Frequency; X axis: Midpoints (5, 15, 25, 35, 45, 55, 65).]
Scale
Y axis: 1 cm represents 2
X axis: 1 cm represents 5
Histogram and frequency polygon
[Figure: A Histogram and Frequency Polygon showing Class Interval and Frequency. Y axis: Frequency; X axis: Class Interval (marked at 5, 15, 25, 35, 45, 55, 65).]
Scale
Y axis: 1 cm represents 2
X axis: 1 cm represents 5
Question
Use the following data to draw a histogram and superimpose a frequency polygon
Class interval Frequency
0-10 3
10-20 5
20-30 7
30-40 12
40-50 15
50-60 8
60-70 4
4. FREQUENCY CURVE
It has the same structure as a frequency polygon, except that the midpoints are joined with a
smooth curve, rounding off the top.
Example
Construct a frequency histogram for the data below and superimpose a frequency curve on the
same graph
Daily profit (Sh. '000') Frequency
0-50 12
50-100 18
100-150 27
150-200 20
200-250 17
250-300 6
[Figure: frequency histogram with a superimposed frequency curve. Y axis: Frequency; X axis: Daily Profit (Sh.'000'), classes 0-50 to 250-300.]