Processing of Data
Variable
If, as we observe a characteristic, we find that it takes on different values in different
persons, places or things, we label the characteristic a variable. We do this for the simple
reason that the characteristic is not the same when observed in different possessors of it.
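Example
Types of variables (diagram): a variable is either qualitative or quantitative. A
quantitative variable is either discrete (e.g. the number of children in a family) or
continuous (e.g. the weight of a student).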
Types of Variables
1. Qualitative
2. Quantitative
Qualitative Variable or an attribute
When the characteristic being studied is nonnumeric, it is called a qualitative variable or
an attribute.
Examples
Qualitative variables are gender, religious affiliation, type of automobile owned, state of
birth and eye color.
When the data are qualitative, we are usually interested in how many or what proportion
fall in each category. For example, what percent of the population has blue eyes? How
many Catholics and how many Protestants are there in the United States?
Quantitative Variable
When the variable studied can be reported numerically, the variable is called a
quantitative variable.
Examples
Quantitative variables are the balance in your checking account, the ages of company
presidents, the life of an automobile battery (such as 42 months) and the number of
children in a family.
i. Discrete Variables
Discrete variables can assume only certain values, and there are usually “gaps” between
the values. Examples of discrete variables are the number of bedrooms in a house (1, 2, 3, 4
etc.), the number of cars arriving and the number of students in each section of a course
(25 in section A, 42 in section B and 18 in section C).
ii. Continuous Variables
Observations of a continuous variable can assume any value within a specific range.
Examples of continuous variables are the air pressure in a tire and the weight of a
shipment of tomatoes.
Classification of Data
After the collection and editing of data, an important step towards processing the data is
classification.
Types of Classification
Broadly, data can be classified on the following four bases:
i. Geographical
ii. Chronological
iii. Qualitative
iv. Quantitative
i. Geographical Classification
Geographical classifications are usually listed in alphabetical order for easy reference.
Items may also be listed by size to emphasize the important areas as in ranking the states
by the population.
ii. Chronological Classification
When data are observed over a period of time, the type of classification is known as
chronological classification. For example, the sales figures of a company are given
below:
Year    Sales (Tk. lakhs)
2000 18810
2001 23601
2002 23816
2003 32435
2004 39343
iii. Qualitative Classification
In qualitative classification, data are classified on the basis of some attribute or quality
such as sex, color of hair, literacy, religion etc. The point to note in this type of
classification is that an attribute cannot be measured; we can only find out whether it is
present or absent. For example, if the attribute under study is blindness, we may find out
how many persons are blind in a given population.
Population: Blind | Non-blind
iv. Quantitative Classification
Monthly Wages (Tk.)    No. of Workers
1500-1600              50
1600-1700              200
1700-1800              260
Formation of a Frequency Distribution
The process of preparing this type of distribution is very simple. We just have to count
the number of times a particular value is repeated, which is called the frequency of that
value. To facilitate counting, place all possible values of the variable in one column, from
the lowest to the highest, and prepare a second column of “tallies”. Then, for each
observation, put a bar (vertical line) opposite the value to which it relates.
Finally, count the number of bars corresponding to each value of the variable and
place it in the frequency column.
Example
The number of refrigerators sold on 22 working days by a leading agency house:
23 30 20 26 30 20 23 40 40 26 20 30
23 40 28 26 23 40 28 28 30 30
No. of Refrigerators    Tally    Frequency (No. of Days)
20                      |||      3
23                      ||||     4
26                      |||      3
28                      |||      3
30                      |||||    5
40                      ||||     4
The table clearly shows that on 3 days 20 refrigerators were sold each day, on 4
days 23 refrigerators were sold each day etc.
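The tallying procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `frequency_distribution` is an assumption introduced here, and the standard library's `Counter` stands in for the manual tally column.

```python
from collections import Counter

# Daily refrigerator sales for the 22 working days in the example above.
sales = [23, 30, 20, 26, 30, 20, 23, 40, 40, 26, 20, 30,
         23, 40, 28, 26, 23, 40, 28, 28, 30, 30]

def frequency_distribution(values):
    """Return (value, frequency) pairs sorted from lowest to highest value."""
    counts = Counter(values)        # counts how often each value occurs
    return sorted(counts.items())   # lowest value first, as in the table

table = frequency_distribution(sales)
for value, freq in table:
    # One bar per occurrence, mirroring the manual tally column.
    print(f"{value:>4}  {'|' * freq:<6} {freq}")
```

The frequencies add up to 22, the number of working days, which is a quick check that no observation was missed.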
This method of classification helps in condensing the data only where values are
largely repeated; otherwise there will be hardly any condensation. In order to make the
series more compact so that its characteristics can be easily studied, data may be
classified according to class intervals.
Cumulative Frequency
In some situations, we may be interested not in the frequencies in the various classes
but rather in the frequencies or proportions of observations which are “less than” or
“greater than” a given value. This leads to a cumulative frequency distribution, which is
derived from a frequency distribution by forming a cumulative frequency column. This
column is computed by adding the successive class frequencies from top to bottom. The
entry corresponding to the top interval is the frequency of that class, the entry opposite
the second interval is the sum of the frequencies in the first and second class intervals,
and so on.
Class interval    Frequency    Cumulative frequency    Cumulative proportion
0-10              4            4                       4/96
10-20             12           16                      16/96
20-30             24           40                      40/96
30-40             36           76                      76/96
40-50             20           96                      96/96
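The cumulative column above is a running total of the class frequencies. A minimal sketch, assuming the class labels and frequencies of the table above (the helper name `cumulative_frequencies` is introduced here for illustration):

```python
from itertools import accumulate

classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [4, 12, 24, 36, 20]

def cumulative_frequencies(freqs):
    """Running total of class frequencies, from the top class to the bottom."""
    return list(accumulate(freqs))

cum = cumulative_frequencies(frequencies)
total = cum[-1]                          # the last entry is the total frequency
proportions = [c / total for c in cum]   # "less than" cumulative proportions

for label, f, c in zip(classes, frequencies, cum):
    print(f"{label:>6} {f:>3} {c:>3}  {c}/{total}")
```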
Classification according to class intervals
i. Class limits
The class limits are the lowest and the highest values that can be included in the class.
For example, take the class 20-40. The lowest value of this class is 20 and the highest 40.
The two boundaries of a class are known as the lower limit and upper limit of the class.
The lower limit of a class is the value below which there can be no value in that class.
The upper limit of a class is the value above which no value can belong to that class. Of
the class 70-89, 70 is the lower limit and 89 is the upper limit, i.e. in this class there can
be no value which is less than 70 or more than 89. Similarly, if we take the class 90-109,
there can be no value in that class that is less than 90 or more than 109.
ii. Class intervals
The span of a class, that is, the difference between the upper limit and the lower limit, is
known as the class interval. For example, in the class 20-40, the class interval is 20 (i.e.
40 minus 20). The size of the class interval is determined by the number of classes and
the total range of the data.
iii. Class mid-point
The mid-point of a class is the value lying half-way between the lower and the upper class
limits of a class interval. It is ascertained as follows:
Mid-point of a class = (Upper limit of the class + Lower limit of the class) / 2
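As a small worked illustration of the two definitions above, the class interval and the mid-point can be computed directly from the class limits. The function names below are hypothetical helpers, not part of the original material.

```python
def class_interval(lower, upper):
    """Span of a class: upper limit minus lower limit."""
    return upper - lower

def class_midpoint(lower, upper):
    """Value half-way between the lower and upper class limits."""
    return (upper + lower) / 2

print(class_interval(20, 40))   # 20
print(class_midpoint(20, 40))   # 30.0
print(class_midpoint(70, 89))   # 79.5
```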
Methods of classifying the data according to class interval
There are two methods of classifying the data according to class intervals namely
a. Exclusive method
b. Inclusive method
a. Exclusive Method
When the class intervals are so fixed that the upper limit of one class is the lower limit of
the next class, it is known as the “Exclusive” method of classification. The following data
are classified on this basis:
Income(Tk.) No of Employees
1800-1900 50
1900-2000 100
2000-2200 200
It is clear that the “Exclusive” method ensures continuity of data inasmuch as the upper
limit of one class is the lower limit of the next class. Thus in the above example, there are
50 persons whose income is between Tk. 1800 and Tk. 1899.99. A person who is getting
exactly Tk. 1900 would be included in the class 1900-2000.
Whenever this method is used, it is necessary to give clear instructions in the
questionnaire. However, the reader should note that if class intervals are given like 0-10,
10-20, etc., it is always presumed that the upper limit is exclusive, i.e. an observation
exactly equal to the upper limit is not included in that class.
b. Inclusive method
Under the “Inclusive” method of classification, the upper limit of one class is included in
that class itself.
Income(Tk.) No of Employees
800-899 50
900-999 100
1000-1099 200
In the class 800-899 we include persons whose income is between Tk. 800 and Tk.
899. If a person's income is exactly Tk. 900, that person is included in the next class.
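The difference between the two methods comes down to whether the upper limit belongs to the class. A minimal sketch, with function names introduced here purely for illustration:

```python
def in_class_exclusive(value, lower, upper):
    """Exclusive method: lower limit included, upper limit excluded."""
    return lower <= value < upper

def in_class_inclusive(value, lower, upper):
    """Inclusive method: both the lower and the upper limit are included."""
    return lower <= value <= upper

# Exclusive classes 1800-1900 and 1900-2000: Tk. 1900 falls in the second.
print(in_class_exclusive(1900, 1800, 1900))  # False
print(in_class_exclusive(1900, 1900, 2000))  # True

# Inclusive classes 800-899 and 900-999: Tk. 899 stays in the first.
print(in_class_inclusive(899, 800, 899))     # True
print(in_class_inclusive(900, 800, 899))     # False
```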
Principles of Classification
It is difficult to lay down any hard and fast rules for classifying data. However, the
following general principles may be kept in mind:
1. The number of classes should preferably be between 5 and 15. However, there
is no rigidity about it. The classes can be more than 15 depending upon the total number
of observations in the series and the details required, but they should not be less than five
because in that case the classification may not reveal the essential characteristics.
Sturges suggested the following formula for determining the approximate number
of classes:
K = 1 + 3.322 log N
where N is the total number of observations and the logarithm is to base 10.
However, the precise number of classes to be used for a given variable depends
upon personal judgment and other considerations such as the details required, the ease of
calculation in further statistical work etc.
2. As far as possible one should avoid awkward values of class intervals, e.g. 3, 7, 11,
26, 39 etc. Preferably, one should have class intervals of either five or multiples of five
like 10, 20, 25, 100 etc.
3. The starting point, i.e. the lower limit of the first class, should either be zero or
5 or a multiple of 5. For example, if the lowest value of the series is 63 and we have taken a
class interval of 10, then the first class should be 60-70 instead of 63-73. Similarly, if the
lowest value of the series is 76 and the class interval is 5, then the first class should be 75
to 80 rather than 76 to 81.
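The principles above can be sketched numerically. The snippet assumes the logarithm in Sturges' formula is to base 10, and `aligned_start` is a hypothetical helper that rounds the lowest value down to a multiple of 5 as principle 3 suggests.

```python
import math

def sturges_classes(n):
    """Approximate number of classes: K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)

def aligned_start(lowest, step=5):
    """Round the lowest value down to a multiple of `step` (principle 3)."""
    return (lowest // step) * step

k = sturges_classes(30)
print(round(k, 2))         # 5.91
print(aligned_start(63))   # 60
print(aligned_start(76))   # 75
```

The value of K is only a guide; as noted above, the final number of classes remains a matter of judgment.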
Example
The profits (in lakhs of Tk.) of 30 Bangladeshi companies for the year 2005-2006 are
given below:
18 16 23 37 35 49 63 65 55
45 58 57 69 20 22 35 42 37
42 48 53 49 65 39 48 67 25
29 58 65
Solution
Let us determine a suitable class interval with the help of the following formula:
i = Range / K
where K = 1 + 3.322 log N and Range = Highest value - Lowest value.
K = 1 + 3.322 log 30 = 5.91 ≈ 6, Range = 69 - 16 = 53
i = Range / K = 53 / 5.91 = 8.97, or about 9
Since values like 3, 7, 9 etc. should be avoided, we will take 10 as the class interval and
let the first class be 15-25.
Profits (Tk. lakhs)    Tally       Frequency
15-25                  |||||       5
25-35                  ||          2
35-45                  ||||| ||    7
45-55                  ||||| |     6
55-65                  |||||       5
65-75                  |||||       5
Total                              30
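The grouping in this example can be reproduced with a short sketch. It assumes the exclusive method (a value equal to an upper limit goes to the next class), a first class of 15-25 and a class interval of 10; `grouped_frequencies` is a name introduced here for illustration.

```python
profits = [18, 16, 23, 37, 35, 49, 63, 65, 55,
           45, 58, 57, 69, 20, 22, 35, 42, 37,
           42, 48, 53, 49, 65, 39, 48, 67, 25,
           29, 58, 65]

def grouped_frequencies(values, start, width, n_classes):
    """Count values per class [lower, lower + width), i.e. the exclusive method."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width     # which class the value falls in
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the class range")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

table = grouped_frequencies(profits, start=15, width=10, n_classes=6)
for lower, upper, freq in table:
    print(f"{lower}-{upper}  {freq}")
```

Counting the raw data this way gives 5 companies in the class 15-25, and the six class frequencies add up to 30.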
Example
The following are the marks of the 30 students in statistics. Prepare a frequency
distribution taking a suitable class interval.
12 33 23 25 18 35 37 49 54 51 37 15
27 33 42 45 47 55 69 65 63 46 29 18
37 45 46 59 29 55
Solution
Let us determine a suitable class interval with the help of the following formula:
i = Range / K
where K = 1 + 3.322 log N and Range = Highest value - Lowest value.
K = 1 + 3.322 log 30 = 5.91 ≈ 6, Range = 69 - 12 = 57
i = Range / K = 57 / 5.91 = 9.64, or about 10
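Taking 10 as the class interval and 10-20 as the first class:
Marks     Tally       Frequency
10-20     ||||        4
20-30     |||||       5
30-40     ||||| |     6
40-50     ||||| ||    7
50-60     |||||       5
60-70     |||         3
Total                 30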
Tabulation of Data
One of the simplest and most revealing devices for summarizing data and presenting
them in a meaningful fashion is the statistical table. A table is a systematic arrangement of
statistical data in columns and rows. Rows are horizontal arrangements, whereas columns
are vertical ones.
Parts of a table
The various parts of a table may vary from case to case depending upon the given data.
But a good table must contain at least the following parts:
1. Table number
2. Title of the table
3. Caption
4. Stub
5. Body of the table
6. Head note
7. Footnote
1. Table number
Each table should be numbered. There are different practices with regard to the place
where this number is to be given. The number may be given either in the centre at the top
above the title, or at the side of the table at the top, or at the bottom of the table on the
left-hand side.
2. Title of the table
Every table must have a suitable title.
3. Caption
Captions refer to the column headings. A caption explains what the column represents. It
may consist of one or more column headings. Under a column heading there may be
sub-heads.
4. Stub
As distinguished from caption, stubs are the designation of rows or row headings.
5. Body
The body of the table contains the numerical information. This is the most vital part of
the table.
6. Head note
It is used to explain certain points relating to the whole table that have not been included
in the title or in the captions or stubs. For example, the unit of measurement is
frequently written as the head note, such as “in thousands” or “in millions” or “in crores”
etc.
7. Footnote
Anything in a table which the reader may find difficult to understand from the title,
captions and stubs should be explained in footnotes.
Types of Tables
1. Simple and Complex Tables
In a simple (one-way) table only one characteristic is shown. This is the simplest type of
table. The following is an illustration of such a table:
Total 180
A two-way table shows two characteristics and is formed when either the stub or the
caption is divided into two coordinate parts.
When three or more characteristics are represented in the same table, such a table is
called a higher order table.
2. General Purpose and Special Purpose Tables
General purpose tables, also known as reference tables or repository tables, provide
information for general use or reference.
Special purpose tables, also known as summary or analytical tables, provide information
for a particular discussion. They show the relationship between different groups of figures.
Example
Charting Data
A chart can take the shape of either a diagram or a graph. For the sake of clarity we will
discuss them under two separate heads:
i. Diagrams
ii. Graphs
Diagrams
For representing data, diagrams are more commonly used than graphs.
1. Title
Every diagram must be given a suitable title. The title should convey in as few words as
possible the main idea that the diagram is intended to portray.
4. Footnotes
In order to clarify certain points about the diagram, footnotes may be given at the bottom
of the diagram.
5. Index
An index illustrating the different types of lines or the different shades and colors should
be given so that the reader can easily make out the meaning of the diagram.
7. Simplicity
Diagrams should be as simple as possible so that the reader can understand their meaning
clearly.
Types of Diagrams
Bar diagrams are the most common type of diagram used in practice. A bar is a thick
line whose width is shown merely for attention. Bar diagrams are called one-dimensional
because it is only the length of the bar that matters and not the width.
2. The gap between one bar and another bar should be uniform throughout.
3. Bars may be either horizontal or vertical. Vertical bars should be preferred
because they look better and also facilitate comparison.
4. While constructing the bar diagrams, it is desirable to write the respective figure
at the end of each bar so that the reader can know the precise value without
looking at the scale.
Simple Bar Diagrams
A simple bar diagram is used to represent only one variable. For example the figures of
sales, production, population etc, for various years may be shown by means of a simple
bar diagram. However, an important limitation of such diagrams is that they can present
only one classification or one category of data.
Example
The funds flow of Goodwill India Ltd from 1991-92 to 1995-96 is given below:
Year       Funds flow (Rs. crores)
1991-92    85.8
1992-93    109.61
1993-94    204.29
1994-95    126.31
1995-96    209.89
[Figure: simple bar diagram of funds flow (Rs. crores) by year, 1991-92 to 1995-96.]
Sub-divided Bar Diagrams
These diagrams are used to represent various parts of the total. For example, the number
of employees in various departments of a company may be represented by a sub-divided
bar diagram. While constructing such a diagram the various components in each bar
should be kept in the same order. To distinguish between the different components, it is
useful to use different shades or colors. Sub-divided bar diagrams can be vertical as well
as horizontal.
Example
Represent the following data by sub-divided bar diagrams
(in Rs.Crores)
Multiple bar Diagrams
In a multiple bar diagram two or more sets of inter-related data are represented. The
technique of drawing such a diagram is the same as that of a simple bar diagram. The only
difference is that since more than one phenomenon is represented, different shades, colors
or crossings are used to distinguish between the bars.
Example
[Figure: multiple bar diagram comparing gross profits, profits before tax and profits after
tax for 1994-95 and 1995-96. X-axis: year. Bar labels that survive extraction: 982, 1219,
1376 and 1663.]
Graphs of Frequency Distributions
A frequency distribution can be presented graphically in any of the following ways:
1. Histogram
2. Frequency Polygon
3. Smoothed frequency curve
4. Cumulative frequency curves or “Ogives”.
Histogram
A histogram is a graphical method for presenting data, where the observations are located
on a horizontal axis (usually grouped into intervals) and the frequency of those
observations is depicted along the vertical axis.
While constructing histograms the variable (class interval) is always taken on the X-axis
and the frequencies depending on it on the Y axis. The distance for each rectangle on the
X-axis shall remain the same in case the class intervals are uniform throughout; if they
are different the width of the rectangles shall also vary. The Y axis represents the
frequencies of each class which constitute the height of its rectangle.
Construction of Histogram when Class-intervals are Equal
When class intervals are equal, take the frequency on the Y-axis and the variable on the
X-axis, and construct adjacent rectangles. In such a case the heights of the rectangles will
be proportional to the frequencies.
Example
[Figure: histogram with equal class intervals. X-axis: size class (0-10 to 90-100);
Y-axis: frequency.]
Construction of Histogram when Class-intervals are Unequal
When class intervals are unequal, the frequencies must be adjusted before constructing the
histogram. For making the adjustment we take the class which has the smallest class
interval and adjust the frequencies of the other classes in the following manner. If one
class interval is twice as wide as the smallest class interval, we divide the height of
its rectangle by two; if it is three times as wide, we divide the height of its rectangle by
three.
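The adjustment rule above amounts to dividing each frequency by the ratio of its class width to the smallest class width. The sketch below uses made-up illustrative data (not taken from the text), and `adjusted_heights` is a hypothetical helper name.

```python
def adjusted_heights(classes):
    """classes: list of (lower, upper, frequency) tuples.

    Returns rectangle heights scaled so that the narrowest class keeps
    its frequency as its height.
    """
    narrowest = min(upper - lower for lower, upper, _ in classes)
    return [freq / ((upper - lower) / narrowest)
            for lower, upper, freq in classes]

# Illustrative data only: two classes of width 10, one of 20, one of 30.
example = [(0, 10, 8), (10, 20, 12), (20, 40, 20), (40, 70, 15)]
print(adjusted_heights(example))  # [8.0, 12.0, 10.0, 5.0]
```

With this scaling the area of each rectangle, rather than its height, is proportional to the class frequency.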
Example
Represent the following data by means of a histogram.
Pie Diagram
This type of diagram enables us to show the partitioning of a total into component parts. A
very common use of the pie chart is to represent the division of a sum of money into its
components. For example, the entire circle, or pie, may represent the budget of a family
for a month and the sections may represent portions of the budget allotted to rent, food,
clothing and so on. Similarly, through a pie diagram we can show how a rupee spent by a
firm is distributed over various heads such as wages, raw materials, administration
expenses etc.
Example
Areas of continents of the world
The pie diagram is intended to compare the distinct components which together constitute
a whole. The whole is represented by a circle of arbitrary radius and the segments of the
circle represent the component parts. To construct such a diagram we use the fact that “the
whole” (51.5 in the above illustration) corresponds to the total number of degrees in the
circle, namely 360°. This 360° is then proportionately divided among the various
components of the whole. Thus, in the above illustration, the arc of the segment
representing Asia subtends an angle of 73° (= 10.4 × 360°/51.5) at the centre of the circle.
This diagram should be used sparingly, especially if there are many segments.
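The angle calculation for a segment can be checked with a short sketch. It uses only the two figures quoted in the text (10.4 for Asia out of a total of 51.5); `sector_angle` is a name introduced here for illustration.

```python
def sector_angle(part, total):
    """Angle in degrees subtended at the centre by a component of the whole."""
    return part / total * 360

angle = sector_angle(10.4, 51.5)
print(round(angle, 1))  # 72.7
print(round(angle))     # 73
```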
Line Diagram
If we are given values of a variable at different points of time, the set of values is known
as a time series. The line diagram is used to represent this type of data. In this diagram
time is represented along the X-axis and the variable is plotted along the Y-axis. Thus we
get a point for each time period, and successive points, when connected by straight lines,
give the desired diagram. Often a smooth curve is drawn through these points. This
diagram is alternatively called a line diagram or a time series graph.
Example
Below are given the figures of production (in thousand quintals) of a sugar factory:
Year    Production (in '000 qtl)
1999 80
2000 90
2001 92
2002 83
2003 94
2004 99
2005 92
[Figure: line diagram of production (in '000 qtl) against year, 1999 to 2005.]