Stat I CH - 2
CHAPTER TWO
2. DATA COLLECTION AND PRESENTATION
2.1. Data collection
Statistical investigation is a comprehensive process. It requires the systematic collection of data about
some group of people or objects, describing and organizing the data, analyzing the data with the help of
different statistical methods, summarizing the analysis, and using the results to make judgments,
decisions and predictions. The validity and accuracy of the final judgment are crucial and depend heavily
on how well the data were collected in the first place. The quality of the data will greatly affect the
conclusions; hence utmost importance must be given to this process, and every possible precaution should
be taken to ensure accuracy while collecting the data.
b) Experimentation
We record the results of our experiment.
In experimentation, researchers are interested in identifying cause-and-effect relationships
between variables.
c) Observational Research
We see what is happening and record it, e.g. traffic accidents.
Observation relies on watching or listening, then, counting or measuring.
There are no respondents.
It is time consuming/expensive.
a) The set of classes must be mutually exclusive. That is, a given data value should fall into only one
class/category. There should be no overlap between classes and limits.
b) The classes must be exhaustive. That is, we have to include all possible data values. No data value
should fall outside the range covered by the frequency distribution.
c) If possible, the classes should have equal widths. Unequal class widths make it difficult to interpret
both the frequency distribution and its graphical presentation. One exception occurs when there is an
open-ended distribution, i.e., one that has no specific beginning value or no specific ending value. In an
open-ended distribution, the lowest class lacks a lower limit or the highest class lacks an upper limit.
d) Select an appropriate number of classes. There is no hard and fast rule for determining the number of
classes for a data set; it is a subjective process. If we have too few classes, important characteristics
of the data may be buried within the small number of categories. If there are too many classes, many
categories will contain either zero or a small number of values. In general, 5 to 20 classes are
recommended.
e) When possible, class widths should be round numbers (e.g. 5, 10, 25, 50, 100, etc.).
f) If possible, avoid using open – ended classes.
Types of frequency distributions
There are three types of frequency distribution tables. These are:
a. the absolute frequency
b. the relative frequency
c. the cumulative frequency
a) Absolute frequency: An absolute frequency distribution table shows the absolute number of
occurrences of an entry or groups of entries in a data set. To construct an absolute frequency distribution
table, list all the scores in the first column and count the number of times each score occurs in the
original data set. Record this against each item in the second column.
b) Relative frequency: The relative frequency distribution table shows the number of occurrences of
each item or class of items in the data set as a proportion of the total number of observations. This can
be expressed in decimal, fraction or percentage form: $RF = \frac{AF}{TF} = \frac{AF}{n}$, where
RF = relative frequency, AF = absolute frequency, and TF = total frequency (that is, the number of
observations, n).
c) Cumulative frequency: The cumulative frequency distribution table shows the absolute frequency of
occurrence added at each successive class in the data set. Alternatively one can use the relative
cumulative frequency table based on relative frequencies.
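The three table types can be sketched in a few lines of Python. This is only an illustration: the function name frequency_table and the list of scores are invented for this example and are not part of the notes.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(scores):
    """Return rows of (value, absolute, relative, cumulative), sorted by value."""
    counts = Counter(scores)            # absolute frequency of each distinct score
    n = len(scores)                     # total number of observations
    values = sorted(counts)
    absolute = [counts[v] for v in values]
    relative = [af / n for af in absolute]      # RF = AF / n
    cumulative = list(accumulate(absolute))     # running total of AF
    return list(zip(values, absolute, relative, cumulative))


# Hypothetical scores, invented for this illustration only.
scores = [3, 5, 3, 4, 5, 5, 2, 4, 5, 3]
table = frequency_table(scores)
for value, af, rf, cf in table:
    print(f"{value:>5} {af:>3} {rf:>6.2f} {cf:>3}")
```

The last cumulative frequency always equals n, and the relative frequencies add up to 1.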
Given the following frequency distribution
The class boundaries in the second column are used to separate classes so that there are no gaps in the
frequency distribution. The basic rule of thumb is that the class limits should have the same decimal
place value as the data, but the class boundaries should have one additional place value. Example: for a
class with limits 31 and 37, lower boundary = lower limit - 0.5 = 31 - 0.5 = 30.5, and
upper boundary = upper limit + 0.5 = 37 + 0.5 = 37.5.
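The boundary rule can be written as a small helper. This is a sketch; the function name class_boundaries and the parameter unit (the smallest step in which the data are recorded, assumed to be 1 for whole-number data) are assumptions made for illustration.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (lower boundary, upper boundary) for one class.

    `unit` is the smallest recording step of the data (assumed 1 for
    whole numbers); half of it is subtracted from the lower limit and
    added to the upper limit.
    """
    half = unit / 2
    return lower_limit - half, upper_limit + half


print(class_boundaries(31, 37))    # (30.5, 37.5)
print(class_boundaries(100, 104))  # (99.5, 104.5)
```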
The “less than” and “more than” cumulative frequencies
The “less than” cumulative frequency of a class is the total frequency of all values less than the upper
boundary of the class and the “more than” cumulative frequency of a class is the total frequency of all
values which are greater than the lower boundary of the class.
Example:
Class     Class         Upper     Absolute   Relative   Less than    Lower     More than
limits    boundaries    boundary  frequency  frequency  cumulative   boundary  cumulative
                                                         frequency              frequency
100-104   99.5-104.5    104.5     2          0.04       2            99.5      50
105-109   104.5-109.5   109.5     8          0.16       10           104.5     48
110-114   109.5-114.5   114.5     18         0.36       28           109.5     40
115-119   114.5-119.5   119.5     13         0.26       41           114.5     22
120-124   119.5-124.5   124.5     7          0.14       48           119.5     9
125-129   124.5-129.5   129.5     1          0.02       49           124.5     2
130-134   129.5-134.5   134.5     1          0.02       50           129.5     1
Total                             50
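The two cumulative columns of this table can be reproduced from the absolute frequencies alone. The sketch below uses the frequencies from the table; the function names are chosen here for illustration.

```python
from itertools import accumulate


def less_than_cumulative(freqs):
    """Running total from the first class down: values below each upper boundary."""
    return list(accumulate(freqs))


def more_than_cumulative(freqs):
    """Running total from the last class up: values above each lower boundary."""
    return list(accumulate(reversed(freqs)))[::-1]


freqs = [2, 8, 18, 13, 7, 1, 1]        # absolute frequencies from the table
print(less_than_cumulative(freqs))     # [2, 10, 28, 41, 48, 49, 50]
print(more_than_cumulative(freqs))     # [50, 48, 40, 22, 9, 2, 1]
```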
Example: The following data set gives the monthly household incomes (in birr) of a community. Construct a
frequency distribution and
a. Calculate the absolute, relative and cumulative frequencies
b. Calculate the less than and the more than cumulative frequencies
c. Interpret the values found in (a) and (b) above
Data set
112 100 127 120 134 105 110 118 109 112
110 118 117 116 118 114 114 122 105 109
107 112 114 115 118 118 122 117 106 110
116 108 110 121 113 119 111 120 104 110
120 113 120 117 105 118 112 110 114 114
n = 50
Solution: Steps:
1. Array the data
2. Determine the number of classes
Rules of thumb
i) We could use Sturges' formula to determine the number of classes (k): $k = 1 + 3.322 \log_{10} n$,
where n is the number of observations.
In this case, $\log_{10} 50 \approx 1.699$, so $k = 1 + 3.322 \times 1.699 \approx 1 + 5.64 = 6.64 \approx 7$.
ii) Apply the $2^k$ rule: this guide suggests selecting the smallest number k of classes such that $2^k$
is greater than the number of observations.
With n = 50: $2^5 = 32 < 50$ and $2^6 = 64 > 50$, so the recommended number of classes is 6.
The frequency distribution below uses 7 classes.
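Both rules of thumb are easy to check numerically. The sketch below is illustrative only: the function names are invented here, and rounding the Sturges value up to the next whole number is an assumption (the notes simply round 6.64 to 7).

```python
import math


def sturges_classes(n):
    """Sturges' formula k = 1 + 3.322 * log10(n), rounded up (assumed convention)."""
    return math.ceil(1 + 3.322 * math.log10(n))


def two_to_the_k_classes(n):
    """Smallest k such that 2**k is greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


print(sturges_classes(50))        # 7
print(two_to_the_k_classes(50))   # 6
```

The two rules need not agree; they are starting points, and the final choice remains a matter of judgment.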
Class     Class         Upper     Absolute   Relative   Less than    Lower     More than
limits    boundaries    boundary  frequency  frequency  cumulative   boundary  cumulative
                                                         frequency              frequency
100-104   99.5-104.5    104.5     2          0.04       2            99.5      50
105-109   104.5-109.5   109.5     8          0.16       10           104.5     48
110-114   109.5-114.5   114.5     18         0.36       28           109.5     40
115-119   114.5-119.5   119.5     13         0.26       41           114.5     22
120-124   119.5-124.5   124.5     7          0.14       48           119.5     9
125-129   124.5-129.5   129.5     1          0.02       49           124.5     2
130-134   129.5-134.5   134.5     1          0.02       50           129.5     1
Total                             50
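The absolute and relative frequency columns can be reproduced by tallying the raw data. The sketch below assumes, as in the table, 7 classes of width 5 starting at 100; the function name grouped_frequencies is chosen here for illustration.

```python
data = [
    112, 100, 127, 120, 134, 105, 110, 118, 109, 112,
    110, 118, 117, 116, 118, 114, 114, 122, 105, 109,
    107, 112, 114, 115, 118, 118, 122, 117, 106, 110,
    116, 108, 110, 121, 113, 119, 111, 120, 104, 110,
    120, 113, 120, 117, 105, 118, 112, 110, 114, 114,
]


def grouped_frequencies(values, start, width, k):
    """Count how many values fall in each of k classes of equal width."""
    freqs = [0] * k
    for v in values:
        idx = (v - start) // width      # index of the class containing v
        if not 0 <= idx < k:            # the classes must be exhaustive
            raise ValueError(f"{v} falls outside the classes")
        freqs[idx] += 1
    return freqs


freqs = grouped_frequencies(data, start=100, width=5, k=7)
n = len(data)
for i, f in enumerate(freqs):
    lower = 100 + 5 * i
    print(f"{lower}-{lower + 4}: {f:>2}  {f / n:.2f}")
```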
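* Note that the sum of the relative frequencies is always 1 or 100%. That is,
$\sum_{i=1}^{k} \frac{f_i}{n} = 1$, where $f_i$ is the absolute frequency of class i, k is the number of
classes and n is the total number of observations.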
c) Interpretation
31 (18 + 13) of the households earn a monthly income of birr 110 to 119.
62% of the households earn a monthly income of birr 110 to 119 (31/50 × 100%).
28 of the households earn a monthly income of less than birr 114.5.
40 of the households earn a monthly income of at least birr 109.5.
The reasons for constructing a frequency distribution are:
a. To organize the data in a meaningful way
b. To enable researchers to draw charts and graphs for the presentation of data.
c. To enable a reader to make comparisons among different data sets.
a) The Histogram: a graph that displays the data by using adjacent vertical rectangles (adjacent unless
the frequency of a class is zero) of various heights to represent the frequencies of the classes. That is,
in a histogram the class boundaries are marked on the horizontal axis and the class frequencies on the
vertical axis. N.B.: The height of each rectangle (along the y-axis) can be the absolute or relative
frequency of the class. The tallest rectangle in a histogram is associated with the class having the
greatest number of observations.
Example: Below is the frequency distribution of the selling prices of vehicles sold at Nyala Motors last
month. Construct a histogram.
Selling prices           Frequency
(thousands of birr)
170 up to 180            4
180 up to 190            12
190 up to 200            6
200 up to 210            8
210 up to 220            11
220 up to 230            3
230 up to 240            10
240 up to 250            6
Total                    60
Solution: To construct a histogram, the class frequencies are scaled along the vertical axis (Y-axis) and
either the class limits or the class midpoints along the horizontal axis. The complete histogram is shown
in the above chart. Note that there is no space between the bars. This is a feature of histogram. In bar
charts, which are described in a later section, the vertical bars are separated slightly.
[Figure: Histogram of the selling prices of 60 automobiles at Nyala Motors. X-axis: selling price.]
Thus, the histogram provides an easily interpreted visual representation of a frequency distribution. We
should also point out that we would have reached the same conclusions and the shape of the histogram
would have been the same had we used a relative frequency distribution instead of the actual frequencies.
That is, if we had used the relative frequencies, we would have had a histogram of the same shape as
Chart 2-1. The difference is that the vertical axis would have been reported in percent of vehicles instead
of the number of vehicles.
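The claim that the shape does not change can be checked directly: each relative frequency is the absolute frequency multiplied by the same constant, 100/n. The sketch below uses the Nyala Motors frequencies and draws a rough text histogram; the bars of '#' characters are only a stand-in for the real chart.

```python
classes = ["170-180", "180-190", "190-200", "200-210",
           "210-220", "220-230", "230-240", "240-250"]
freqs = [4, 12, 6, 8, 11, 3, 10, 6]

n = sum(freqs)                               # 60 vehicles
percents = [100 * f / n for f in freqs]      # relative frequency in percent

# One '#' per vehicle; the percent column rescales the same bar heights.
for label, f, p in zip(classes, freqs, percents):
    print(f"{label} | {'#' * f:<12} {f:>2} ({p:.1f}%)")
```

Because every bar is scaled by the same factor, the tallest class (180 up to 190) is the same in both versions.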
Frequency Polygon
A frequency polygon is similar to a histogram. It consists of line segments connecting the points formed
by the intersections of the class midpoints and the class frequencies. The construction of a frequency
polygon is illustrated in Chart 2-1. We use the vehicle selling prices for the cars sold last month at Nyala
Motors. The midpoint of each class is scaled on the X-axis and the class frequencies on the Y-axis.
Recall that the class midpoint is the value at the centre of a class and represents the values in the class.
The class frequency is the number of observations in a particular class. The vehicle selling prices at
Nyala Motors are:
As noted previously, the 170,000 up to 180,000 class is represented by its midpoint, 175,000. To
construct a frequency polygon, move horizontally on the graph to the midpoint, 175, and then vertically
to 4, the class frequency, and place a dot. The X and Y values of this point are called the coordinates. The
coordinates of the next point are X = 185 and Y = 12. The process is continued for all classes. Then, the
points are connected in order. That is, the point representing the lowest class is joined to the one
representing the second class and so on.
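The coordinates of the polygon can be listed with a short sketch. The class limits and frequencies come from the Nyala Motors table; closing the polygon at zero one class width beyond each end (at 165 and 255) is an assumption made here, not a step stated in the notes.

```python
lower_limits = [170, 180, 190, 200, 210, 220, 230, 240]   # thousands of birr
freqs = [4, 12, 6, 8, 11, 3, 10, 6]
width = 10

# Midpoint of each class: lower limit plus half the class width.
midpoints = [lo + width / 2 for lo in lower_limits]
points = list(zip(midpoints, freqs))
print(points[:2])   # [(175.0, 4), (185.0, 12)]

# Assumed convention: anchor the polygon to the X-axis at both ends.
closed = [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]
```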
[Figure: frequency polygon of the vehicle selling prices. X-axis: selling price (in 000), from 165 to 255; Y-axis: frequency, from 0 to 14.]
Both the histogram and the frequency polygon allow us to get a quick picture of the main characteristics
of the data (highs, lows, points of concentration, etc.). Although the two representations are similar in
purpose, the histogram has the advantage of depicting each class as a rectangle, with the area of the
rectangular bar representing the frequency of each class. The frequency polygon, in turn,
has an advantage over the histogram: it allows us to compare directly two or more frequency
distributions.
Total 60
Less Than Cumulative Frequency Distribution for Vehicle Selling Price
To plot a less-than cumulative frequency distribution, scale the upper limit of each class along the X-axis
and the cumulative frequencies along the Y-axis. To provide additional information, you can scale the
vertical axis on the left in units and the vertical axis on the right in percent.
To begin the plotting, 4 vehicles sold for less than 180,000, so the first plot is at X = 180 and Y = 4. The
coordinates for the next plot are X = 190 and Y = 16. The rest of the points are plotted and then the dots
connected to form the chart, which is called an ogive. To find the selling price below which 90 percent of
the cars sold, we draw a line from the 90 percent mark on the right-hand vertical axis over to the ogive,
then drop down to the X-axis and read the selling price. The value on the X-axis is 240, so we estimate
that 90 percent of the vehicles sold for less than 240,000.
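Reading a value off the ogive amounts to linear interpolation between the plotted points. The sketch below is illustrative: the function name price_below_which is invented here, and starting the curve at (170, 0) is an assumption. The frequencies are those of the Nyala Motors table.

```python
from itertools import accumulate

upper_limits = [180, 190, 200, 210, 220, 230, 240, 250]   # thousands of birr
freqs = [4, 12, 6, 8, 11, 3, 10, 6]
cum = list(accumulate(freqs))      # [4, 16, 22, 30, 41, 44, 54, 60]


def price_below_which(percent):
    """Selling price (in 000) below which `percent` of the vehicles sold."""
    target = percent * cum[-1] / 100       # cumulative count to reach
    prev_x, prev_c = 170, 0                # assumed start of the ogive
    for x, c in zip(upper_limits, cum):
        if target <= c:
            # Interpolate along the line segment joining the two points.
            return prev_x + (target - prev_c) / (c - prev_c) * (x - prev_x)
        prev_x, prev_c = x, c
    return float(upper_limits[-1])


print(price_below_which(90))   # 240.0
print(price_below_which(50))   # 210.0
```

The same function gives the price below which half the vehicles sold: 210, that is, 210,000 birr.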
[Figure: less-than cumulative frequency polygon (ogive) of the vehicle selling prices. X-axis: selling price (in 000); Y-axis: frequency, from 0 to 70.]
a) Line Charts
Line charts are particularly effective in business because they show the change in a variable
over time. The variable, such as the number of units sold or the total value of sales, is scaled along the
vertical axis and time along the horizontal axis. The chart below shows the line chart for the above data.
[Figure: line chart for the above data. Horizontal axis ticks from 175 to 245; vertical axis ticks from 0 to 14.]
b) Bar Chart
Bar Charts: This is used when the horizontal axis deals with information that is qualitative or non –
continuous in nature, e.g. Gender, Marital status, etc. When we represent data using bar charts, the bars
are not joined together. All the bars must have equal width and the distance between bars must be equal.
The following chart shows the bar chart of educational background of managers in a certain company.
c) Pie – Chart:
Pie – Chart: - is useful for displaying a relative frequency distribution. A circle is divided proportionally
to the relative frequency and portions of the circle are allocated for the different groups. We will use the
information in the following table which shows a breakdown of the educational qualification of
employees at Hawassa University, to explain the details of constructing a pie chart.
[Figure: pie chart of educational qualification, with the categories PhD, MA/MSc, BA/BSc and Other and slices of 10%, 25%, 15% and 50%.]
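Dividing the circle "proportionally to the relative frequency" means giving each group a central angle equal to its share of 360 degrees. The sketch below uses the four shares that appear with the chart (10%, 25%, 15% and 50%); which share belongs to which qualification cannot be recovered from the text, so no labels are attached. The function name pie_angles is chosen here for illustration.

```python
def pie_angles(percentages):
    """Central angle in degrees for each slice of a pie chart."""
    total = sum(percentages)
    return [360 * p / total for p in percentages]


shares = [10, 25, 15, 50]        # percent shares shown with the chart
angles = pie_angles(shares)
print(angles)                    # [36.0, 90.0, 54.0, 180.0]
```

The angles always add up to 360 degrees, just as the relative frequencies add up to 100%.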