Module 3 Notes-RM

Uploaded by

Prachurjya Jena
Copyright
© © All Rights Reserved
DESCRIPTIVE AND INFERENTIAL STATISTICS

• Descriptive statistics are numbers that are used to summarize and describe data. The word "data" refers
to the information that has been collected from an experiment, a survey, a historical record, etc.

• Descriptive statistics consists of the collection, organization, summarization, and presentation of data.
• Inferential statistics consists of generalizing information from samples to
populations, performing estimations and hypothesis tests, and making predictions.
• Probability distributions, hypothesis testing, correlation testing and regression analysis all fall under the
category of inferential statistics.
DESCRIPTIVE STATISTICS
• Number
• Frequency Count
• Percentage
• Deciles and quartiles
• Measures of Central Tendency (Mean, Median (midpoint), Mode)
• Variability
• Variance and standard deviation
• Graphs
• Normal Curve
INFERENTIAL STATISTICS
• Allows for comparisons across variables
• e.g. is there a relation between one's occupation and their expenditure style?
• Hypothesis Testing
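The descriptive measures listed above can be computed directly with Python's standard `statistics` module. The sketch below is only illustrative: the `marks` values are hypothetical and do not come from these notes, and the variance and standard deviation shown are the sample versions.

```python
import statistics

# Hypothetical sample: marks of 10 students (illustrative data only)
marks = [42, 55, 55, 61, 48, 70, 55, 66, 59, 49]

summary = {
    "number": len(marks),                            # count of observations
    "mean": statistics.mean(marks),                  # arithmetic average
    "median": statistics.median(marks),              # middle value of the sorted data
    "mode": statistics.mode(marks),                  # most frequent value
    "variance": statistics.variance(marks),          # sample variance
    "std_dev": statistics.stdev(marks),              # sample standard deviation
    "quartiles": statistics.quantiles(marks, n=4),   # three cut points Q1, Q2, Q3
}

for name, value in summary.items():
    print(name, value)
```

These numbers only summarise the ten observations at hand; generalising from them to a wider population would be a task for inferential statistics.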
CLASSIFICATION OF DATA

• Chronological classification
• Spatial Classification
• Qualitative Classification
• Quantitative Classification
ORGANISATION OF DATA

• Variable
  • Discrete: can take only certain values.
  • Continuous: can take any value; may be whole numbers, fractional values, a range of values, etc.
FREQUENCY DISTRIBUTION

• Data are grouped according to class intervals.
• Reflects the number of items that fall under a particular class or group.
• Each class is bounded by class limits.
• The class interval is the difference between the upper class limit and the lower class limit.
HOW TO PREPARE A FREQUENCY DISTRIBUTION

• Should we have equal or unequal sized class intervals?
• How many classes should we have?
• What should be the size of each class?
• How should we determine the class limits?
• How should we get the frequency for each class?
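One possible set of answers to the questions above can be sketched in Python. Everything here is an assumption chosen for illustration: the raw `marks` are hypothetical, the classes are equal-sized with a width of 20, and a value equal to an upper class limit is counted in the next class.

```python
# Hypothetical raw data: marks of 20 students (illustrative only)
marks = [12, 47, 35, 68, 91, 55, 23, 74, 49, 60,
         38, 82, 5, 66, 71, 44, 59, 97, 30, 52]

class_width = 20          # assumed size of each (equal-sized) class
lower, upper = 0, 100     # assumed range covered by the classes

# Class limits: 0-20, 20-40, ..., 80-100
classes = [(lo, lo + class_width) for lo in range(lower, upper, class_width)]

# Tally each observation into its class; a value equal to an upper limit
# goes to the next class, except the overall maximum (100), which stays
# in the last class.
frequency = {c: 0 for c in classes}
for x in marks:
    for lo, hi in classes:
        if lo <= x < hi or (x == upper and hi == upper):
            frequency[(lo, hi)] += 1
            break

for (lo, hi), f in frequency.items():
    print(f"{lo}-{hi}: {f}")
```

Changing `class_width` shows how the number of classes and the frequencies depend on the size chosen for each class.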
CHAPTER

Presentation of Data

Studying this chapter should enable you to:
• present data using tables;
• represent data using appropriate diagrams.

1. INTRODUCTION
You have already learnt in previous chapters how data are collected and organised. As data are generally voluminous, they need to be put in a compact and presentable form. This chapter deals with presentation of data precisely so that the voluminous data collected could be made usable readily and are easily comprehended. There are generally three forms of presentation of data:
• Textual or Descriptive presentation
• Tabular presentation
• Diagrammatic presentation.

2. TEXTUAL PRESENTATION OF DATA
In textual presentation, data are described within the text. When the quantity of data is not too large this form of presentation is more suitable. Look at the following cases:

Case 1
In a bandh call given on 08 September 2005 protesting the hike in prices of petrol and diesel, 5 petrol pumps were found open and 17 were closed whereas 2 schools were closed and remaining 9 schools were found open in a town of Bihar.

2019-20
PRESENTATION OF DATA 41

Case 2
Census of India 2001 reported that Indian population had risen to 102 crore of which only 49 crore were females against 53 crore males. Seventy-four crore people resided in rural India and only 28 crore lived in towns or cities. While there were 62 crore non-worker population against 40 crore workers in the entire country. Urban population had an even higher share of non-workers (19 crore) against workers (9 crore) as compared to the rural population where there were 31 crore workers out of a 74 crore population...

In both the cases data have been presented only in the text. A serious drawback of this method of presentation is that one has to go through the complete text of presentation for comprehension. But, it is also true that this method often enables one to emphasise certain points of the presentation.

3. TABULAR PRESENTATION OF DATA
In a tabular presentation, data are presented in rows (read horizontally) and columns (read vertically). For example see Table 4.1 tabulating information about literacy rates. It has three rows (for male, female and total) and three columns (for urban, rural and total). It is called a 3 × 3 Table giving 9 items of information in 9 boxes called the "cells" of the Table. Each cell gives information that relates an attribute of gender ("male", "female" or total) with a number (literacy percentages of rural people, urban people and total). The most important advantage of tabulation is that it organises data for further statistical treatment and decision-making. Classification used in tabulation is of four kinds:
• Qualitative
• Quantitative
• Temporal and
• Spatial

Qualitative classification
When classification is done according to attributes, such as social status, physical status, nationality, etc., it is called qualitative classification. For example, in Table 4.1 the attributes for classification are sex and location which are qualitative in nature.

TABLE 4.1
Literacy in India by sex and location (per cent)

          Location
Sex       Rural   Urban   Total
Male      79      90      82
Female    59      80      65
Total     68      84      74

Source: Census of India 2011. (Literacy rates relate to population aged 7 years and above)

2019-20
42 STATISTICS FOR ECONOMICS

Quantitative classification
In quantitative classification, the data are classified on the basis of characteristics which are quantitative in nature. In other words these characteristics can be measured quantitatively. For example, age, height, production, income, etc are quantitative characteristics. Classes are formed by assigning limits called class limits for the values of the characteristic under consideration. An example of quantitative classification is given in Table 4.2. Calculate the missing figures in the Table.

TABLE 4.2
Distribution of 542 respondents by their age in an election study in Bihar

Age group (yrs)   No. of respondents   Per cent
20–30             3                    0.55
30–40             61                   11.25
40–50             132                  24.35
50–60             153                  28.24
60–70             ?                    ?
70–80             51                   9.41
80–90             2                    0.37
All               ?                    100.00

Source: Assembly election Patna central constituency 2005, A.N. Sinha Institute of Social Studies, Patna.

Here classifying characteristic is age in years and is quantifiable.

Activities
• Discuss how the total values are arrived at in Table 4.1
• Construct a table presenting data on preferential liking of the students of your class for Star News, Zee News, BBC World, CNN, Aaj Tak and DD News.
• Prepare a table of
  (i) heights (in cm) and
  (ii) weights (in kg) of students of your class.

Temporal classification
In this classification time becomes the classifying variable and data are categorised according to time. Time may be in hours, days, weeks, months, years, etc. For example, see Table 4.3.

TABLE 4.3
Yearly sales of a tea shop from 1995 to 2000

Years   Sale (Rs in lakhs)
1995    79.2
1996    81.3
1997    82.4
1998    80.5
1999    100.2
2000    91.2

Data Source: Unpublished data.

In this table the classifying characteristic is sales in a year and takes values in the scale of time.

Activity
• Go to your school office and collect data on the number of students studied in the school in each class for the last ten years and present the data in a table.

Spatial classification
When classification is done on the basis of place, it is called spatial classification. The place may be a village/town, block, district, state, country, etc. Table 4.4 is an example of spatial classification.
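Returning to Table 4.2, the missing figures follow from the stated total of 542 respondents. The short sketch below shows the arithmetic; the variable names are illustrative, and the only inputs are the counts printed in the table.

```python
total_respondents = 542  # from the title of Table 4.2

# Known counts by age group; the 60-70 group is the missing entry.
known = {"20-30": 3, "30-40": 61, "40-50": 132, "50-60": 153,
         "70-80": 51, "80-90": 2}

# Missing count = stated total minus the sum of the known counts
missing_count = total_respondents - sum(known.values())

# Per cent share of the missing group, rounded to two decimals as in the table
missing_percent = round(100 * missing_count / total_respondents, 2)

print(missing_count, missing_percent)  # 140 25.83
```

As a check, the per cent column then adds up to 100.00 once 25.83 is included.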

TABLE 4.4
Export from India to rest of the world in 2013-14 as share of total export (per cent)

Destination                       Export share
USA                               12.5
Germany                           2.4
Other EU                          10.9
UK                                3.1
Japan                             2.2
Russia                            0.7
China                             4.7
West Asia - Gulf Coop. Council    15.3
Other Asia                        29.4
Others                            18.8
All                               100.0

(Total Exports: US $ 314.40 billion)

Activity
• Construct a table presenting data collected from students of your class according to their native states/residential locality.

4. TABULATION OF DATA AND PARTS OF A TABLE
To construct a table it is important to learn first what are the parts of a good statistical table. When put together systematically these parts form a table. The most simple way of conceptualising a table is to present the data in rows and columns along with some explanatory notes. Tabulation can be done using one-way, two-way or three-way classification depending upon the number of characteristics involved. A good table should essentially have the following:

(i) Table Number
Table number is assigned to a table for identification purpose. If more than one table is presented, it is the table number that distinguishes one table from another. It is given at the top or at the beginning of the title of the table. Generally, table numbers are whole numbers in ascending order if there are many tables in a book. Subscripted numbers, like 1.2, 3.1, etc., are also used for identifying the table according to its location. For example, Table 4.5 should be read as the fifth table of the fourth chapter, and so on (See Table 4.5).

(ii) Title
The title of a table narrates about the contents of the table. It has to be clear, brief and carefully worded so that the interpretations made from the table are clear and free from ambiguity. It finds place at the head of the table succeeding the table number or just below it (See Table 4.5).

(iii) Captions or Column Headings
At the top of each column in a table a column designation is given to explain figures of the column. This is called caption or column heading (See Table 4.5).

(iv) Stubs or Row Headings
Like a caption or column heading, each row of the table has to be given a heading. The designations of the rows are also called stubs or stub items, and the complete left column is known as stub column. A brief description of the row headings may also be given at the left hand top in the table. (See Table 4.5).

(v) Body of the Table
Body of a table is the main part and it contains the actual data. Location of any one figure/data in the table is fixed and determined by the row and column of the table. For example, data in the second row and fourth column indicate that 25 crore females in rural India were non-workers in 2001 (See Table 4.5).

(vi) Unit of Measurement
The unit of measurement of the figures in the table (actual data) should always be stated along with the title. If different units are there for rows or columns of the table, these units must be stated along with 'stubs' or 'captions'. If figures are large, they should be rounded up and the method of rounding should be indicated (See Table 4.5).

(vii) Source
It is a brief statement or phrase indicating the source of data presented in the table. If more than one source is there, all the sources are to be written in the source. Source is generally written at the bottom of the table. (See Table 4.5).

(viii) Note
Note is the last part of the table. It explains the specific feature of the data content of the table which is not self explanatory and has not been explained earlier.

Table 4.5 Population of India according to workers and non-workers by gender and location, 2001
(Crore)

Location  Gender   Workers                      Non-worker   Total
                   Main    Marginal   Total
Rural     Male     17      3          20        18           38
          Female   6       5          11        25           36
          Total    23      8          31        43           74
Urban     Male     7       1          8         7            15
          Female   1       0          1         12           13
          Total    8       1          9         19           28
All       Male     24      4          28        25           53
          Female   7       5          12        37           49
          Total    31      9          40        62           102

Source: Census of India 2001
Note: Figures are rounded to nearest crore

Parts labelled on Table 4.5: Table Number, Title, Units, Column Headings/Captions, Row Headings/Stubs, Body of the table, Source, Note.

(Note: Table 4.5 presents the same data in tabular form already presented through case 2 in textual presentation of data)

Activities
• How many rows and columns are essentially required to form a table?
• Can the column/row headings of a table be quantitative?
• Can you present tables 4.2 and 4.3 after rounding off figures appropriately?
• Present the first two sentences of case 2 on p.41 as a table. Some details for this would be found elsewhere in this chapter.

5. DIAGRAMMATIC PRESENTATION OF DATA
This is the third method of presenting data. This method provides the quickest understanding of the actual situation to be explained by data in comparison to tabular or textual presentations. Diagrammatic presentation of data translates quite effectively the highly abstract ideas contained in numbers into more concrete and easily comprehensible form.

Diagrams may be less accurate but are much more effective than tables in presenting the data.

There are various kinds of diagrams in common use. Amongst them the important ones are the following:
(i) Geometric diagram
(ii) Frequency diagram
(iii) Arithmetic line graph

Geometric Diagram
Bar diagram and pie diagram come in the category of geometric diagram. The bar diagrams are of three types: simple, multiple and component bar diagrams.

Bar Diagram

Simple Bar Diagram
Bar diagram comprises a group of equispaced and equiwidth rectangular bars for each class or category of data. Height or length of the bar reads the magnitude of data. The lower end of the bar touches the base line such that the height of a bar starts from the zero unit. Bars of a bar diagram can be visually compared by their relative height and accordingly data are comprehended quickly. Data for this can be of frequency or non-frequency type. In non-frequency type data a particular characteristic, say production, yield, population, etc. at various points of time or of different states are noted and corresponding bars are made of the respective heights according to the values of the characteristic to construct the diagram. The values of the characteristics (measured or counted) retain the identity of each value. Figure 4.1 is an example of a bar diagram.

Activity
• Collect the number of students in each class studying in the current year in your school. Draw a bar diagram for the same table.

Different types of data may require different modes of diagrammatical representation. Bar diagrams are suitable both for frequency type and non-frequency type variables and attributes. Discrete variables like family size, spots on a dice, grades in an examination, etc. and attributes such as gender, religion, caste, country, etc. can be represented by bar diagrams. Bar diagrams are more convenient for non-frequency data such as income-expenditure profile, export/imports over the years, etc.

A category that has a longer bar (literacy of Kerala) than another category (literacy of West Bengal), has more of the measured (or enumerated) characteristics than the other. Bars (also called columns) are usually used in time series data (food grain produced between 1980 and 2000, decadal variation in work participation rate, registered unemployed over the years, literacy rates, etc.) (Fig 4.2).

Bar diagrams can have different forms such as multiple bar diagram and component bar diagram.

TABLE 4.6
Literacy Rates of Major States of India

                         2001              2011
Major Indian States      Male    Female    Male    Female
Andhra Pradesh (AP)      70.3    50.4      75.6    59.7
Assam (AS)               71.3    54.6      78.8    67.3
Bihar (BR)               59.7    33.1      73.4    53.3
Jharkhand (JH)           67.3    38.9      78.4    56.2
Gujarat (GJ)             79.7    57.8      87.2    70.7
Haryana (HR)             78.5    55.7      85.3    66.8
Karnataka (KA)           76.1    56.9      82.9    68.1
Kerala (KE)              94.2    87.7      96.0    92.0
Madhya Pradesh (MP)      76.1    50.3      80.5    60.0
Chhattisgarh (CH)        77.4    51.9      81.5    60.6
Maharashtra (MR)         86.0    67.0      89.8    75.5
Odisha (OD)              75.3    50.5      82.4    64.4
Punjab (PB)              75.2    63.4      81.5    71.3
Rajasthan (RJ)           75.7    43.9      80.5    52.7
Tamil Nadu (TN)          82.4    64.4      86.8    73.9
Uttar Pradesh (UP)       68.8    42.2      79.2    59.3
Uttarakhand (UK)         83.3    59.6      88.3    70.7
West Bengal (WB)         77.0    59.6      82.7    71.2
India                    75.3    53.7      82.1    65.5

Fig. 4.1: Bar diagram showing male literacy rates of major states of India, 2011. (Literacy rates relate to population aged 7 years and above)

Activities
• How many states (among the major states of India) had higher female literacy rate than the national average in 2011?
• Has the gap between maximum and minimum female literacy rates over the states in two consecutive census years 2001 and 2011 declined?

Multiple Bar Diagram
Multiple bar diagrams (Fig.4.2) are used for comparing two or more sets of data, for example income and expenditure or import and export for different years, marks obtained in different subjects in different classes, etc.

Component Bar Diagram
Component bar diagrams or charts (Fig.4.3), also called sub-diagrams, are very useful in comparing the sizes of different component parts (the elements or parts which a thing is made up of) and also for throwing light on the relationship among these integral parts. For example, sales proceeds from different products, expenditure pattern in a typical Indian family (components being food, rent, medicine, education, power, etc.), budget outlay for receipts and expenditures, components of labour force, population etc. Component bar diagrams are usually shaded or coloured suitably.

Fig. 4.2: Multiple bar (column) diagram showing female literacy rates over two census years 2001 and 2011 by major states of India. (Data Source Table 4.6)

Interpretation: It can be very easily derived from Figure 4.2 that female literacy rate over the years was on increase throughout the country. Similar other interpretations can be made from the figure. For example, the figure shows that the states of Bihar, Jharkhand and Uttar Pradesh experienced the sharpest rise in female literacy, etc.

TABLE 4.7
Enrolment by gender at schools (per cent) of children aged 6–14 years in a district of Bihar

Gender   Enrolled (per cent)   Out of school (per cent)
Boy      91.5                  8.5
Girl     58.6                  41.4
All      78.0                  22.0

Data Source: Unpublished data

A component bar diagram shows the bar and its sub-divisions into two or more components. For example, the bar might show the total population of children in the age-group of 6–14 years. The components show the proportion of those who are enrolled and those who are not. A component bar diagram might also contain different component bars for boys, girls and the total of children in the given age group range, as shown in Figure 4.3. To construct a component bar diagram, first of all, a bar is constructed on the x-axis with its height equivalent to the total value of the bar [for per cent data the bar height is of 100 units (Figure 4.3)]. Otherwise the height is equated to total value of the bar and proportional heights of the components are worked out using unitary method. Smaller components are given priority in parting the bar.

Fig. 4.3: Enrolment at primary level in a district of Bihar (Component Bar Diagram)
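The unitary method mentioned for component bar diagrams can be sketched as follows. The component values are the "All / Total" figures of Table 4.5 (main workers, marginal workers and non-workers, in crore); the drawing height of 10 cm is an assumption made purely for illustration.

```python
# Components of India's population (crore), "All / Total" row of Table 4.5
components = {"Main workers": 31, "Marginal workers": 9, "Non-workers": 62}

bar_height_cm = 10.0   # assumed drawing height of the whole bar

total = sum(components.values())   # 102 crore
unit = bar_height_cm / total       # height that represents 1 crore

# Proportional height of each component, rounded to two decimals
heights = {name: round(value * unit, 2) for name, value in components.items()}

for name, h in heights.items():
    print(f"{name}: {h} cm")
```

For per cent data no such conversion is needed, since the bar is simply taken as 100 units high and each component's height equals its percentage.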

Pie Diagram
A pie diagram is also a component diagram, but unlike a bar diagram, here it is a circle whose area is proportionally divided among the components (Fig.4.4) it represents. It is also called a pie chart. The circle is divided into as many parts as there are components by drawing straight lines from the centre to the circumference.

Pie charts usually are not drawn with absolute values of a category. The values of each category are first expressed as percentage of the total value of all the categories. A circle in a pie chart, irrespective of its value of radius, is thought of having 100 equal parts of 3.6° (360°/100) each. To find out the angle the component shall subtend at the centre of the circle, each percentage figure of every component is multiplied by 3.6°. An example of this conversion of percentages of components into angular components of the circle is shown in Table 4.8.

It may be interesting to note that data represented by a component bar diagram can also be represented equally well by a pie chart, the only requirement being that absolute values of the components have to be converted into percentages before they can be used for a pie diagram.

TABLE 4.8
Distribution of Indian population (2011) by their working status (crores)

Status            Population   Per cent   Angular Component
Marginal Worker   12           9.9        36°
Main Worker       36           29.8       107°
Non-worker        73           60.3       217°
All               121          100.0      360°

Fig. 4.4: Pie diagram for different categories of Indian population according to working status 2011.

Activities
• Represent data presented through Figure 4.4 by a component bar diagram.
• Does the area of a pie have any bearing on the total value of the data to be represented by the pie diagram?
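The conversion behind Table 4.8 can be reproduced in a few lines. This is a sketch only: the total is taken as the sum of the three listed components (12 + 36 + 73 = 121 crore), and the variable names are illustrative.

```python
# Populations (in crores) by working status, from Table 4.8
population = {"Marginal Worker": 12, "Main Worker": 36, "Non-worker": 73}

total = sum(population.values())   # 121 crore

angles = {}
for status, value in population.items():
    percent = 100 * value / total           # share of the total, in per cent
    angles[status] = round(percent * 3.6)   # each 1 per cent subtends 3.6 degrees
    print(f"{status}: {percent:.1f} per cent, {angles[status]} degrees")
```

The rounded angles come to 36°, 107° and 217°, which together make up the full 360° of the circle.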

Frequency Diagram
Data in the form of grouped frequency distributions are generally represented by frequency diagrams like histogram, frequency polygon, frequency curve and ogive.

Histogram
A histogram is a two dimensional diagram. It is a set of rectangles with base as the intervals between class boundaries (along X-axis) and with areas proportional to the class frequency (Fig.4.5). If the class intervals are of equal width, which they generally are, the areas of the rectangles are proportional to their respective frequencies. However, in some type of data, it is convenient, at times necessary, to use varying width of class intervals. For example, when tabulating deaths by age at death, it would be meaningful as well as useful to have very short age intervals (0, 1, 2, ..., yrs/ 0, 7, 28, ..., days) at the beginning when death rates are very high compared to deaths at most other higher age segments of the population. For graphical representation of such data, the height of a rectangle is the quotient of its area (here frequency) and base (here width of the class interval). When intervals are equal, that is, when all rectangles have the same base, area can conveniently be represented by the frequency of any interval for purposes of comparison. When bases vary in their width, the heights of rectangles are to be adjusted to yield comparable measurements. The answer in such a situation is frequency density (class frequency divided by width of the class interval) instead of absolute frequency.

TABLE 4.9
Distribution of daily wage earners in a locality of a town

Daily earning (Rs)   No. of wage earners (f)
45–49                2
50–54                3
55–59                5
60–64                3
65–69                6
70–74                7
75–79                12
80–84                13
85–89                9
90–94                7
95–99                6
100–104              4
105–109              2
110–114              3
115–119              3

Source: Unpublished data

Since histograms are rectangles, a line parallel to the base line and of the same magnitude is to be drawn at a vertical distance equal to frequency (or frequency density) of the class interval. A histogram is never drawn for a discrete variable. Since, for continuous variables, the lower class boundary of a class interval fuses with the upper class boundary of the previous interval, equal or unequal, the rectangles are all adjacent and there is no open space between two consecutive rectangles. If the classes are not continuous they are first converted into continuous classes as discussed in Chapter 3. Sometimes the common portion between two adjacent rectangles (Fig.4.6) is omitted giving a better impression of continuity. The resulting figure gives the impression of a double staircase.

A histogram looks similar to a bar diagram. But there are more differences than similarities between the two than may appear at first impression. In a bar diagram, the spacing and the width or the area of bars are all arbitrary. It is the height and not the width or the area of the bar that really matters. A single vertical line could have served the same purpose as a bar of same width. Moreover, in histogram no space is left between two rectangles, but in a bar diagram some space must be left between consecutive bars (except in multiple bar or component bar diagram). Although the bars have the same width, the width of a bar is unimportant for the purpose of comparison. The width in a histogram is as important as its height. We can have a bar diagram both for discrete and continuous variables, but histogram is drawn only for a continuous variable. Histogram also gives value of mode of the frequency distribution graphically as shown in Figure 4.5 and the x-coordinate of the dotted vertical line gives the mode.

Fig. 4.5: Histogram for the distribution of 85 daily wage earners in a locality of a town.

Frequency Polygon
A frequency polygon is a plane bounded by straight lines, usually four or more lines. Frequency polygon is an alternative to histogram and is also derived from histogram itself. A frequency polygon can be fitted to a histogram for studying the shape of the curve. The simplest method of drawing a frequency polygon is to join the midpoints of the topside of the consecutive rectangles of the histogram. It leaves us with the two ends away from the base line, denying the calculation of the area under the curve. The solution is to join the two end-points thus obtained to the base line at the mid-values of the two classes with zero frequency immediately at each end of the distribution. Broken lines or dots may join the two ends with the base line. Now the total area under the curve, like the area in the histogram, represents the total frequency or sample size.

Frequency polygon is the most common method of presenting grouped frequency distribution. Both class boundaries and class-marks can be used along the X-axis, the distances between two consecutive class marks being proportional/equal to the width of the class intervals. Plotting of data becomes easier if the class-marks fall on the heavy lines of the graph paper. No matter whether class boundaries or midpoints are used in the X-axis, frequencies (as ordinates) are always plotted against the mid-point of class intervals. When all the points have been plotted in the graph, they are carefully joined by a series of short straight lines. Broken lines join midpoints of two intervals, one in the beginning and the other at the end, with the two ends of the plotted curve (Fig.4.6). When comparing two or more distributions plotted on the same axes, frequency polygon is likely to be more useful since the vertical and horizontal lines of two or more distributions may coincide in a histogram.

Fig. 4.6: Frequency polygon drawn for the data given in Table 4.9

Frequency Curve
The frequency curve is obtained by drawing a smooth freehand curve passing through the points of the frequency polygon as closely as possible. It may not necessarily pass through all the points of the frequency polygon but it passes through them as closely as possible (Fig. 4.7).

Fig. 4.7: Frequency curve for Table 4.9

Ogive
Ogive is also called cumulative frequency curve. As there are two types of cumulative frequencies, for example "less than" type and "more than" type, accordingly there are two ogives for any grouped frequency distribution data. Here in place of simple frequencies as in the case of frequency polygon, cumulative frequencies are plotted along y-axis against class limits of the frequency distribution. For "less than" ogive the cumulative frequencies are plotted against the respective upper limits of the class intervals whereas for "more than" ogives the cumulative frequencies are plotted against the respective lower limits of the class interval. An interesting feature of the two ogives together is that their intersection point gives the median of the frequency distribution (Fig. 4.8 (b)). As the shapes of the two ogives suggest, "less than" ogive is never decreasing and "more than" ogive is never increasing.

TABLE 4.10
Frequency distribution of marks obtained in mathematics

Table 4.10 (a)
Frequency distribution of marks obtained in mathematics

Marks     Number of students
0-20      6
20-40     5
40-60     33
60-80     14
80-100    6
Total     64

Table 4.10 (b)
Less than cumulative frequency distribution of marks obtained in mathematics

Marks            'Less than' cumulative frequency
Less than 20     6
Less than 40     11
Less than 60     44
Less than 80     58
Less than 100    64

Table 4.10 (c)
More than cumulative frequency distribution of marks obtained in mathematics

Marks           'More than' cumulative frequency
More than 0     64
More than 20    58
More than 40    53
More than 60    20
More than 80    6

Fig. 4.8(a): 'Less than' and 'More than' ogive for data given in Table 4.10

Fig. 4.8(b): 'Less than' and 'More than' ogive for data given in Table 4.10

Arithmetic Line Graph
An arithmetic line graph is also called time series graph. In this graph, time (hour, day/date, week, month, year, etc.) is plotted along x-axis and the value of the variable (time series data) along y-axis. The line graph obtained by joining these plotted points is called an arithmetic line graph (time series graph). It helps in understanding the trend, periodicity, etc., in a long term time series data.
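The cumulative frequencies of Table 4.10 and the median located by the two ogives can be checked numerically. The sketch below rests on one assumption not stated in the chapter: the ogives are treated as straight-line segments between plotted points, so their intersection is where the 'less than' curve reaches half the total frequency, found here by linear interpolation within the class.

```python
# Class intervals and frequencies from Table 4.10 (a)
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
freq = [6, 5, 33, 14, 6]
n = sum(freq)   # 64 students

# 'Less than' cumulative frequencies (plotted against upper class limits)
less_than = []
running = 0
for f in freq:
    running += f
    less_than.append(running)

# 'More than' cumulative frequencies (plotted against lower class limits)
more_than = [n - (c - f) for c, f in zip(less_than, freq)]

# The two ogives cross where the cumulative frequency equals n/2.
# Interpolate linearly inside the class in which that level is reached.
half = n / 2
cum_before = 0
for (lo, hi), f, cum in zip(classes, freq, less_than):
    if cum >= half:
        median = lo + (half - cum_before) / f * (hi - lo)
        break
    cum_before = cum

print(less_than)         # [6, 11, 44, 58, 64]
print(more_than)         # [64, 58, 53, 20, 6]
print(round(median, 2))  # 52.73
```

The two printed lists match Tables 4.10 (b) and 4.10 (c), and the interpolated median of about 52.7 marks falls in the 40-60 class, which holds the largest frequency.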

Here you can see from Fig. 4.9 that for the period 1993-94 to 2013-14, the imports were more than the exports all through the period. You may notice the value of both exports and imports rising rapidly after 2001-02. Also the gap between the two (imports and exports) has widened after 2001-02.

TABLE 4.11
Value of Exports and Imports of India (Rs in 100 crores)

Year         Exports   Imports
1993-94      698       731
1994-95      827       900
1995-96      1064      1227
1996-97      1188      1389
1997-98      1301      1542
1998-99      1398      1783
1999-2000    1591      2155
2000-01      2036      2309
2001-02      2090      2452
2002-03      2549      2964
2003-04      2934      3591
2004-05      3753      5011
2005-06      4564      6604
2006-07      5718      8815
2007-08      6559      10123
2008-09      8408      13744
2009-10      8455      13637
2010-11      11370     16835
2011-12      14660     23455
2012-13      16343     26692
2013-14      19050     27154

Source: DGCI&S, Kolkata

Fig. 4.9: Arithmetic line graph for time series data given in Table 4.11

6. CONCLUSION
By now you must have been able to learn how the data could be presented using various forms of presentation: textual, tabular and diagrammatic. You are now also able to make an appropriate choice of the form of data presentation as well as the type of diagram to be used for a given set of data. Thus you can make presentation of data meaningful, comprehensive and purposeful.

Recap
• Data (even voluminous data) speak meaningfully through
presentation.
• For small data (quantity) textual presentation serves the purpose
better.
• For large quantity of data tabular presentation helps in
accommodating any volume of data for one or more variables.
• Tabulated data can be presented through diagrams which enable
quicker comprehension of the facts presented otherwise.

EXERCISES

Answer the following questions, 1 to 10, choosing the correct answer


1. Bar diagram is a
(i) one-dimensional diagram
(ii) two-dimensional diagram
(iii) diagram with no dimension
(iv) none of the above
2. Data represented through a histogram can help in finding graphically the
(i) mean
(ii) mode
(iii) median
(iv) all the above
3. Ogives can be helpful in locating graphically the
(i) mode
(ii) mean
(iii) median
(iv) none of the above
4. Data represented through arithmetic line graph help in understanding
(i) long term trend
(ii) cyclicity in data
(iii) seasonality in data
(iv) all the above
5. Width of bars in a bar diagram need not be equal (True/False).
6. Width of rectangles in a histogram should essentially be equal (True/
False).
7. Histogram can only be formed with continuous classification of data
(True/False).

8. Histogram and column diagram are the same method of presentation of data. (True/False)
9. Mode of a frequency distribution can be known graphically with the
help of histogram. (True/False)
10. Median of a frequency distribution cannot be known from the ogives.
(True/False)
11. What kind of diagrams are more effective in representing the following?
(i) Monthly rainfall in a year
(ii) Composition of the population of Delhi by religion
(iii) Components of cost in a factory
12. Suppose you want to emphasise the increase in the share of urban
non-workers and lower level of urbanisation in India as shown in
Example 4.2. How would you do it in the tabular form?
13. How does the procedure of drawing a histogram differ when class
intervals are unequal in comparison to equal class intervals in a
frequency table?
14. The Indian Sugar Mills Association reported that, ‘Sugar production
during the first fortnight of December 2001 was about 3,87,000 tonnes,
as against 3,78,000 tonnes during the same fortnight last year (2000).
The off-take of sugar from factories during the first fortnight of December
2001 was 2,83,000 tonnes for internal consumption and 41,000 tonnes
for exports as against 1,54,000 tonnes for internal consumption and
nil for exports during the same fortnight last season.’
(i) Present the data in tabular form.
(ii) Suppose you were to present these data in diagrammatic form which
of the diagrams would you use and why?
(iii) Present these data diagrammatically.
15. The following table shows the estimated sectoral real growth rates
(percentage change over the previous year) in GDP at factor cost.

Year Agriculture and allied sectors Industry Services


1994–95 5.0 9.2 7.0
1995–96 –0.9 11.8 10.3
1996–97 9.6 6.0 7.1
1997–98 –1.9 5.9 9.0
1998–99 7.2 4.0 8.3
1999–2000 0.8 6.9 8.2
Represent the data as multiple time series graphs.

CHAPTER

Measures of Central Tendency

Studying this chapter should enable you to:
• understand the need for summarising a set of data by one single number;
• recognise and distinguish between the different types of averages;
• learn to compute different types of averages;
• draw meaningful conclusions from a set of data;
• develop an understanding of which type of average would be the most useful in a particular situation.

1. INTRODUCTION

In the previous chapter, you have read about the tabular and graphic representation of the data. In this chapter, you will study the measures of central tendency which is a numerical method to explain the data in brief. You can see examples of summarising a large set of data in day-to-day life, like average marks obtained by students of a class in a test, average rainfall in an area, average production in a factory, average income of persons living in a locality or working in a firm, etc.

Baiju is a farmer. He grows food grains in his land in a village called Balapur in Buxar district of Bihar. The village consists of 50 small farmers. Baiju has 1 acre of land. You are interested in knowing the economic condition of small farmers of Balapur. You want to compare the economic


condition of Baiju in Balapur village. For this, you may have to evaluate the size of his land holding, by comparing with the size of land holdings of other farmers of Balapur. You may like to see if the land owned by Baiju is –
1. above average in ordinary sense (see the Arithmetic Mean)
2. above the size of what half the farmers own (see the Median)
3. above what most of the farmers own (see the Mode)

In order to evaluate Baiju’s relative economic condition, you will have to summarise the whole set of data of land holdings of the farmers of Balapur. This can be done by the use of central tendency, which summarises the data in a single value in such a way that this single value can represent the entire data. The measuring of central tendency is a way of summarising the data in the form of a typical or representative value.

There are several statistical measures of central tendency or “averages”. The three most commonly used averages are:
• Arithmetic Mean
• Median
• Mode

You should note that there are two more types of averages i.e. Geometric Mean and Harmonic Mean, which are suitable in certain situations. However, the present discussion will be limited to the three types of averages mentioned above.

2. ARITHMETIC MEAN

Suppose the monthly income (in Rs) of six families is given as:
1600, 1500, 1400, 1525, 1625, 1630.

The mean family income is obtained by adding up the incomes and dividing by the number of families.

$\frac{1600 + 1500 + 1400 + 1525 + 1625 + 1630}{6} = \frac{9280}{6} \approx$ Rs 1,547

It implies that on an average, a family earns Rs 1,547.

Arithmetic mean is the most commonly used measure of central tendency. It is defined as the sum of the values of all observations divided by the number of observations and is usually denoted by $\bar{X}$. In general, if there are N observations as $X_1, X_2, X_3, ..., X_N$, then the Arithmetic Mean is given by

$\bar{X} = \frac{X_1 + X_2 + X_3 + ... + X_N}{N}$

The right hand side can be written as $\frac{\sum_{i=1}^{N} X_i}{N}$. Here, i is an index which takes successive values 1, 2, 3, ..., N.

For convenience, this will be written in simpler form without the index i. Thus

$\bar{X} = \frac{\sum X}{N}$, where $\sum X$ = sum of all observations and N = total number of observations.
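
The definition above (sum of all observations divided by their number) can be sketched in a few lines of Python. This is only an illustration; the function name `arithmetic_mean` is an assumption made for this sketch and does not come from the text.

```python
def arithmetic_mean(observations):
    """Return the sum of the observations divided by their count (sum X / N)."""
    if not observations:
        raise ValueError("at least one observation is required")
    return sum(observations) / len(observations)


# Monthly incomes (in Rs) of the six families used in the text.
incomes = [1600, 1500, 1400, 1525, 1625, 1630]

mean_income = arithmetic_mean(incomes)
print(round(mean_income))  # 1547, i.e. 9280 / 6 rounded to the nearest rupee
```

The unrounded value is 9280/6, a little under 1546.67, which is why the text reports Rs 1,547.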


How Arithmetic Mean is Calculated

The calculation of arithmetic mean can be studied under two broad categories:
1. Arithmetic Mean for Ungrouped Data.
2. Arithmetic Mean for Grouped Data.

Arithmetic Mean for Series of Ungrouped Data

Direct Method

Arithmetic mean by direct method is the sum of all observations in a series divided by the total number of observations.

Example 1

Calculate Arithmetic Mean from the data showing marks of students in a class in an economics test: 40, 50, 55, 78, 58.

$\bar{X} = \frac{\sum X}{N} = \frac{40 + 50 + 55 + 78 + 58}{5} = 56.2$

The average mark of students in the economics test is 56.2.

Assumed Mean Method

If the number of observations in the data is more and/or figures are large, it is difficult to compute arithmetic mean by direct method. The computation can be made easier by using assumed mean method.

In order to save time in calculating mean from a data set containing a large number of observations as well as large numerical figures, you can use assumed mean method. Here you assume a particular figure in the data as the arithmetic mean on the basis of logic/experience. Then you may take deviations of the said assumed mean from each of the observation. You can, then, take the summation of these deviations and divide it by the number of observations in the data. The actual arithmetic mean is estimated by taking the sum of the assumed mean and the ratio of sum of deviations to number of observations. Symbolically,


Let, A = assumed mean
X = individual observations
N = total numbers of observations
d = deviation of assumed mean from individual observation, i.e. d = X – A

Then sum of all deviations is taken as $\sum d = \sum (X - A)$

Then find $\frac{\sum d}{N}$

Then add A and $\frac{\sum d}{N}$ to get $\bar{X}$

Therefore, $\bar{X} = A + \frac{\sum d}{N}$

You should remember that any value, whether existing in the data or not, can be taken as assumed mean. However, in order to simplify the calculation, centrally located value in the data can be selected as assumed mean.

Example 2

The following data shows the weekly income of 10 families.

Family                   A     B     C     D     E     F     G     H     I     J
Weekly Income (in Rs)    850   700   100   750   5000  80    420   2500  400   360

Compute mean family income.

TABLE 5.1
Computation of Arithmetic Mean by Assumed Mean Method

Families   Income (X)   d = X – 850   d' = (X – 850)/10
A              850            0              0
B              700         –150            –15
C              100         –750            –75
D              750         –100            –10
E             5000        +4150           +415
F               80         –770            –77
G              420         –430            –43
H             2500        +1650           +165
I              400         –450            –45
J              360         –490            –49
Total        11160        +2660           +266

Arithmetic Mean using assumed mean method

$\bar{X} = A + \frac{\sum d}{N} = 850 + \frac{2660}{10}$ = Rs 1,116.

Thus, the average weekly income of a family by both methods is Rs 1,116. You can check this by using the direct method.

Step Deviation Method

The calculations can be further simplified by dividing all the deviations taken from assumed mean by the common factor ‘c’. The objective is to avoid large numerical figures, i.e., if d = X – A is very large, then find d'. This can be done as follows:

$d' = \frac{d}{c} = \frac{X - A}{c}$

The formula is given below:

$\bar{X} = A + \frac{\sum d'}{N} \times c$

where d' = (X – A)/c, c = common factor, N = number of observations, A = Assumed mean.

Thus, you can calculate the arithmetic mean in the example 2, by the step deviation method,

$\bar{X} = 850 + \frac{266}{10} \times 10$ = Rs 1,116.
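
As a rough check on Example 2, the assumed mean and step deviation methods can be sketched in Python as below. The function names are assumptions made for this illustration; the data, the assumed mean A = 850 and the common factor c = 10 are taken from the example.

```python
def mean_by_assumed_mean(observations, assumed_mean):
    """X-bar = A + (sum of d) / N, where d = X - A."""
    deviations = [x - assumed_mean for x in observations]
    return assumed_mean + sum(deviations) / len(observations)


def mean_by_step_deviation(observations, assumed_mean, common_factor):
    """X-bar = A + (sum of d') / N * c, where d' = (X - A) / c."""
    steps = [(x - assumed_mean) / common_factor for x in observations]
    return assumed_mean + (sum(steps) / len(observations)) * common_factor


# Weekly income (in Rs) of the 10 families in Example 2.
weekly_income = [850, 700, 100, 750, 5000, 80, 420, 2500, 400, 360]

print(round(mean_by_assumed_mean(weekly_income, 850), 2))        # 1116.0
print(round(mean_by_step_deviation(weekly_income, 850, 10), 2))  # 1116.0
print(round(sum(weekly_income) / len(weekly_income), 2))         # 1116.0 (direct method)
```

All three routes agree because the assumed mean and the common factor only rearrange the arithmetic; any other choice of A or c would give the same mean.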


Calculation of arithmetic mean for Grouped data

Discrete Series

Direct Method

In case of discrete series, frequency against each observation is multiplied by the value of the observation. The values, so obtained, are summed up and divided by the total number of frequencies. Symbolically,

$\bar{X} = \frac{\sum fX}{\sum f}$

Where, $\sum fX$ = sum of the product of variables and frequencies.
$\sum f$ = sum of frequencies.

Example 3

Plots in a housing colony come in only three sizes: 100 sq. metre, 200 sq. metre and 300 sq. metre, and the number of plots are respectively 200, 50 and 10.

TABLE 5.2
Computation of Arithmetic Mean by Direct Method

Plot size in      No. of       fX       d' = (X – 200)/100    fd'
Sq. metre (X)     Plots (f)
100               200         20000          –1              –200
200                50         10000           0                 0
300                10          3000          +1               +10
Total             260         33000           0              –190

Arithmetic mean using direct method,

$\bar{X} = \frac{\sum fX}{\sum f} = \frac{33000}{260}$ = 126.92 Sq. metre

Therefore, the mean plot size in the housing colony is 126.92 Sq. metre.

Assumed Mean Method

As in case of individual series the calculations can be simplified by using assumed mean method, as described earlier, with a simple modification. Since frequency (f) of each item is given here, we multiply each deviation (d) by the frequency to get fd. Then we get $\sum fd$. The next step is to get the total of all frequencies i.e. $\sum f$. Then find out $\sum fd / \sum f$. Finally, the arithmetic mean is calculated by

$\bar{X} = A + \frac{\sum fd}{\sum f}$

using assumed mean method.

Step Deviation Method

In this case, the deviations are divided by the common factor ‘c’ which simplifies the calculation. Here we estimate

$d' = \frac{d}{c} = \frac{X - A}{c}$

in order to reduce the size of numerical figures for easier calculation. Then get fd' and $\sum fd'$. The formula for arithmetic mean using step deviation method is given as,

$\bar{X} = A + \frac{\sum fd'}{\sum f} \times c$

Activity
• Find the mean plot size for the data given in example 3, by using step deviation and assumed mean methods.
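
One way to check hand calculations for a discrete series is the short Python sketch below. It applies the direct formula (sum of fX divided by sum of f) and the step deviation formula to the plot data of Example 3. The function names are assumptions made for this illustration.

```python
def grouped_mean(values, frequencies):
    """Direct method for a discrete series: sum(f * X) / sum(f)."""
    total_frequency = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / total_frequency


def grouped_mean_step_deviation(values, frequencies, assumed_mean, common_factor):
    """Step deviation method: A + (sum(f * d') / sum(f)) * c, with d' = (X - A) / c."""
    total_frequency = sum(frequencies)
    sum_fd = sum(
        f * (x - assumed_mean) / common_factor
        for x, f in zip(values, frequencies)
    )
    return assumed_mean + (sum_fd / total_frequency) * common_factor


plot_sizes = [100, 200, 300]   # X, in sq. metre
plot_counts = [200, 50, 10]    # f, number of plots of each size

print(round(grouped_mean(plot_sizes, plot_counts), 2))                           # 126.92
print(round(grouped_mean_step_deviation(plot_sizes, plot_counts, 200, 100), 2))  # 126.92
```

With A = 200 and c = 100 the second function reproduces the d' and fd' columns of Table 5.2 (sum of fd' = –190) before converting back to square metres.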


Continuous Series

Here, class intervals are given. The process of calculating arithmetic mean in case of continuous series is same as that of a discrete series. The only difference is that the mid-points of various class intervals are taken. We have already known that class intervals may be exclusive or inclusive or of unequal size. Example of exclusive class interval is, say, 0–10, 10–20 and so on. Example of inclusive class interval is, say, 0–9, 10–19 and so on. Example of unequal class interval is, say, 0–20, 20–50 and so on. In all these cases, calculation of arithmetic mean is done in a similar way.

Example 4

Calculate average marks of the following students using (a) Direct method (b) Step deviation method.

Direct Method

Marks             0–10   10–20   20–30   30–40   40–50   50–60   60–70
No. of Students     5      12      15      25       8       3       2

TABLE 5.3
Computation of Average Marks for Exclusive Class Interval by Direct Method

Mark     No. of          Mid value    fm           d' = (m – 35)/10    fd'
(x)      students (f)    (m)          (2)×(3)
(1)      (2)             (3)          (4)          (5)                 (6)
0–10       5               5            25           –3                –15
10–20     12              15           180           –2                –24
20–30     15              25           375           –1                –15
30–40     25              35           875            0                  0
40–50      8              45           360            1                  8
50–60      3              55           165            2                  6
60–70      2              65           130            3                  6
Total     70                          2110                             –34

Steps:
1. Obtain mid values for each class denoted by m.
2. Obtain $\sum fm$ and apply the direct method formula:

$\bar{X} = \frac{\sum fm}{\sum f} = \frac{2110}{70}$ = 30.14 marks

Step deviation method

1. Obtain $d' = \frac{m - A}{c}$
2. Take A = 35, (any arbitrary figure), c = common factor.

$\bar{X} = A + \frac{\sum fd'}{\sum f} \times c = 35 + \frac{(-34)}{70} \times 10$ = 30.14 marks

Two interesting properties of A.M.

(i) the sum of deviations of items about arithmetic mean is always equal to zero. Symbolically, $\sum (X - \bar{X}) = 0$.
(ii) arithmetic mean is affected by extreme values. Any large value, on either end, can push it up or down.

Weighted Arithmetic Mean

Sometimes it is important to assign weights to various items according to their importance when you calculate the arithmetic mean. For example, there are two commodities, mangoes and potatoes. You are interested in


finding the average price of mangoes ($P_1$) and potatoes ($P_2$). The arithmetic mean will be $\frac{P_1 + P_2}{2}$. However, you might want to give more importance to the rise in price of potatoes ($P_2$). To do this, you may use as ‘weights’ the share of mangoes in the budget of the consumer ($W_1$) and the share of potatoes in the budget ($W_2$). Now the arithmetic mean weighted by the shares in the budget would be

$\frac{W_1 P_1 + W_2 P_2}{W_1 + W_2}$

In general the weighted arithmetic mean is given by,

Weighted arithmetic mean $= \frac{W_1 X_1 + W_2 X_2 + ... + W_n X_n}{W_1 + W_2 + ... + W_n} = \frac{\sum WX}{\sum W}$

When the prices rise, you may be interested in the rise in prices of commodities that are more important to you. You will read more about it in the discussion of Index Numbers in Chapter 8.

Activities
• Check property of arithmetic mean for the following example:
X: 4 6 8 10 12
• In the above example if mean is increased by 2, then what happens to the individual observations.
• If first three items increase by 2, then what should be the values of the last two items, so that mean remains the same.
• Replace the value 12 by 96. What happens to the arithmetic mean? Comment.

3. MEDIAN

Median is that positional value of the variable which divides the distribution into two equal parts, one part comprises all values greater than or equal to the median value and the other comprises all values less than or equal to it. The Median is the “middle” element when the data set is arranged in order of the magnitude. Since the median is determined by the position of different values, it remains unaffected if, say, the size of the largest value increases.

Computation of median

The median can be easily computed by sorting the data from smallest to largest and finding out the middle value.

Example 5

Suppose we have the following observation in a data set: 5, 7, 6, 1, 8, 10, 12, 4, and 3.

Arranging the data, in ascending order you have:
1, 3, 4, 5, 6, 7, 8, 10, 12.

The “middle score” is 6, so the median is 6. Half of the scores are larger than 6 and half of the scores are smaller.

If there are even numbers in the data, there will be two observations


which fall in the middle. The median in this case is computed as the arithmetic mean of the two middle values.

Activities
• Find mean and median for all four values of the series. What do you observe?

TABLE 5.4
Mean and Median of different series

Series   X (Variable Values)   Mean   Median
A        1, 2, 3               ?      ?
B        1, 2, 30              ?      ?
C        1, 2, 300             ?      ?
D        1, 2, 3000            ?      ?

• Is median affected by extreme values? What are outliers?
• Is median a better method than mean?

Example 6

The following data provides marks of 20 students. You are required to calculate the median marks.

25, 72, 28, 65, 29, 60, 30, 54, 32, 53, 33, 52, 35, 51, 42, 48, 45, 47, 46, 33.

Arranging the data in an ascending order, you get

25, 28, 29, 30, 32, 33, 33, 35, 42, 45, 46, 47, 48, 51, 52, 53, 54, 60, 65, 72.

You can see that there are two observations in the middle, 45 and 46. The median can be obtained by taking the mean of the two observations:

Median $= \frac{45 + 46}{2}$ = 45.5 marks

In order to calculate median it is important to know the position of the median i.e. item/items at which the median lies. The position of the median can be calculated by the following formula:

Position of median $= \frac{(N+1)}{2}$th item

Where N = number of items.

You may note that the above formula gives you the position of the median in an ordered array, not the median itself. Median is computed by the formula:

Median = size of $\frac{(N+1)}{2}$th item

Discrete Series

In case of discrete series the position of median i.e. (N+1)/2th item can be located through cumulative frequency. The corresponding value at this position is the value of median.

Example 7

The frequency distribution of the number of persons and their respective incomes (in Rs) are given below. Calculate the median income.

Income (in Rs):       10   20   30   40
Number of persons:     2    4   10    4

In order to calculate the median income, you may prepare the frequency distribution as given below.


TABLE 5.5
Computation of Median for Discrete Series

Income     No. of         Cumulative
(in Rs)    persons (f)    frequency (cf)
10           2              2
20           4              6
30          10             16
40           4             20

The median is located in the (N+1)/2 = (20+1)/2 = 10.5th observation. This can be easily located through cumulative frequency. The 10.5th observation lies in the c.f. of 16. The income corresponding to this is Rs 30, so the median income is Rs 30.

Continuous Series

In case of continuous series you have to locate the median class where N/2th item [not (N+1)/2th item] lies. The median can then be obtained as follows:

Median $= L + \frac{(N/2 - c.f.)}{f} \times h$

Where, L = lower limit of the median class,
c.f. = cumulative frequency of the class preceding the median class,
f = frequency of the median class,
h = magnitude of the median class interval.

No adjustment is required if frequency is of unequal size or magnitude.

Example 8

Following data relates to daily wages of persons working in a factory. Compute the median daily wage.

Daily wages (in Rs):   55–60   50–55   45–50   40–45   35–40   30–35   25–30   20–25
Number of workers:       7       13      15      20      30      33      28      14

The data is arranged in descending order here.

In the above illustration the median class is the class containing the (N/2)th item, i.e. the 160/2 = 80th item of the series, which lies in the 35–40 class interval. Applying the formula of the median as:


TABLE 5.6
Computation of Median for Continuous Series

Daily wages    No. of         Cumulative
(in Rs)        Workers (f)    Frequency
20–25            14             14
25–30            28             42
30–35            33             75
35–40            30            105
40–45            20            125
45–50            15            140
50–55            13            153
55–60             7            160

With L = 35, N/2 = 80, c.f. = 75, f = 30 and h = 5,

Median $= 35 + \frac{(80 - 75)}{30} \times 5$ = Rs 35.83

Thus, the median daily wage is Rs 35.83. This means that 50% of the workers are getting less than or equal to Rs 35.83 and 50% of the workers are getting more than or equal to this wage.

You should remember that median, as a measure of central tendency, is not sensitive to all the values in the series. It concentrates on the values of the central items of the data.

Quartiles

Quartiles are the measures which divide the data into four equal parts, each portion contains equal number of observations. There are three quartiles. The first Quartile (denoted by $Q_1$) or lower quartile has 25% of the items of the distribution below it and 75% of the items are greater than it. The second Quartile (denoted by $Q_2$) or median has 50% of items below it and 50% of the observations above it. The third Quartile (denoted by $Q_3$) or upper Quartile has 75% of the items of the distribution below it and 25% of the items above it. Thus, $Q_1$ and $Q_3$ denote the two limits within which central 50% of the data lies.

Percentiles

Percentiles divide the distribution into hundred equal parts, so you can get 99 dividing positions denoted by $P_1, P_2, P_3, ..., P_{99}$. $P_{50}$ is the median value. If you have secured 82 percentile in a management entrance examination, it means that your position is below 18 per cent of the total candidates who appeared in the examination. If a total of one lakh students appeared, where do you stand?

Calculation of Quartiles

The method for locating the Quartile is same as that of the median in case of individual and discrete series. The value of $Q_1$ and $Q_3$ of an ordered series can be obtained by the following


formula where N is the number of observations.

$Q_1$ = size of $\frac{(N+1)}{4}$th item

$Q_3$ = size of $\frac{3(N+1)}{4}$th item.

Example 9

Calculate the value of lower quartile from the data of the marks obtained by ten students in an examination.
22, 26, 14, 30, 18, 11, 35, 41, 12, 32.

Arranging the data in an ascending order,
11, 12, 14, 18, 22, 26, 30, 32, 35, 41.

$Q_1$ = size of $\frac{(N+1)}{4}$th item = size of $\frac{(10+1)}{4}$th item = size of 2.75th item
= 2nd item + .75 (3rd item – 2nd item)
= 12 + .75(14 – 12) = 13.5 marks.

Activity
• Find out $Q_3$ yourself.

5. MODE

Sometimes, you may be interested in knowing the most typical value of a series or the value around which maximum concentration of items occurs. For example, a manufacturer would like to know the size of shoes that has maximum demand or style of the shirt that is more frequently demanded. Here, Mode is the most appropriate measure. The word mode has been derived from the French word “la Mode” which signifies the most fashionable values of a distribution, because it is repeated the highest number of times in the series. Mode is the most frequently observed data value. It is denoted by $M_o$.

Computation of Mode

Discrete Series

Consider the data set 1, 2, 3, 4, 4, 5. The mode for this data is 4 because 4 occurs most frequently (twice) in the data.

Example 10

Look at the following discrete series:

Variable     10   20   30   40   50
Frequency     2    8   20   10    5

Here, as you can see the maximum frequency is 20, the value of mode is 30. In this case, as there is a unique value of mode, the data is unimodal. But, the mode is not necessarily unique, unlike arithmetic mean and median. You can have data with two modes (bi-modal) or more than two modes (multi-modal). It may be possible that there may be no mode if no value appears more frequent than any other value in the distribution. For example, in a series 1, 1, 2, 2, 3, 3, 4, 4, there is no mode.

[Figure: Unimodal Data and Bimodal Data]
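
The idea of the mode for a discrete series, including the bi-modal and no-mode cases, can be sketched in Python as follows. The helper name `modes` and the convention of returning an empty list when every value is equally frequent are assumptions made for this illustration.

```python
from collections import Counter


def modes(observations):
    """Return the most frequent value(s), sorted; [] if no value stands out."""
    if not observations:
        raise ValueError("at least one observation is required")
    counts = Counter(observations)
    highest = max(counts.values())
    # If several distinct values all share the same frequency, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)


print(modes([1, 2, 3, 4, 4, 5]))        # [4]
print(modes([1, 1, 2, 2, 3, 3, 4, 4]))  # [] (no mode)

# Example 10: expand the frequency table into individual observations.
variable = [10, 20, 30, 40, 50]
frequency = [2, 8, 20, 10, 5]
expanded = [v for v, f in zip(variable, frequency) for _ in range(f)]
print(modes(expanded))                  # [30]
```

A result with two entries would indicate bi-modal data, and more than two entries multi-modal data.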


Continuous Series

In case of continuous frequency distribution, modal class is the class with largest frequency. Mode can be calculated by using the formula:

$M_o = L + \frac{D_1}{D_1 + D_2} \times h$

Where L = lower limit of the modal class
$D_1$ = difference between the frequency of the modal class and the frequency of the class preceding the modal class (ignoring signs).
$D_2$ = difference between the frequency of the modal class and the frequency of the class succeeding the modal class (ignoring signs).
h = class interval of the distribution.

You may note that in case of continuous series, class intervals should be equal and series should be exclusive to calculate the mode. If mid points are given, class intervals are to be obtained.

Example 11

Calculate the value of modal worker family’s monthly income from the following data:

Less than cumulative frequency distribution of income per month (in ’000 Rs)

Income per month    Cumulative
(in ’000 Rs)        Frequency
Less than 50          97
Less than 45          95
Less than 40          90
Less than 35          80
Less than 30          60
Less than 25          30
Less than 20          12
Less than 15           4

As you can see this is a case of cumulative frequency distribution. In order to calculate mode, you will have to convert it into an exclusive series. In this example, the series is in the descending order. This table should be converted into an ordinary frequency table (Table 5.7) to determine the modal class.

Income Group     Frequency
(in ’000 Rs)
45–50            97 – 95 = 2
40–45            95 – 90 = 5
35–40            90 – 80 = 10
30–35            80 – 60 = 20
25–30            60 – 30 = 30
20–25            30 – 12 = 18
15–20            12 – 4 = 8
10–15            4

The value of the mode lies in 25–30 class interval. By inspection also, it can be seen that this is a modal class.

Now L = 25, $D_1$ = (30 – 18) = 12, $D_2$ = (30 – 20) = 10, h = 5

Using the formula, you can obtain the value of the mode as:

$M_o$ (in ’000 Rs) $= L + \frac{D_1}{D_1 + D_2} \times h = 25 + \frac{12}{12 + 10} \times 5 = 27.727$

Thus the modal worker family’s monthly income is Rs 27,727 (27.727 in ’000 Rs).
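
The two steps of Example 11 (converting the less-than cumulative frequencies into ordinary class frequencies, then applying the mode formula) can be sketched in Python as below. The function names are assumptions made for this illustration, and equal class widths are assumed, as the text requires. Substituting L = 25, D1 = 12, D2 = 10 and h = 5 gives 25 + (12/22) × 5, which is approximately 27.727.

```python
def frequencies_from_less_than(cumulative):
    """Convert ascending 'less than' cumulative frequencies to class frequencies."""
    return [cumulative[0]] + [
        later - earlier for earlier, later in zip(cumulative, cumulative[1:])
    ]


def grouped_mode(lower_limits, frequencies, class_width):
    """Mode = L + D1 / (D1 + D2) * h for the class with the largest frequency."""
    i = max(range(len(frequencies)), key=lambda k: frequencies[k])
    previous_f = frequencies[i - 1] if i > 0 else 0
    next_f = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = abs(frequencies[i] - previous_f)
    d2 = abs(frequencies[i] - next_f)
    if d1 + d2 == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_limits[i] + d1 / (d1 + d2) * class_width


# Cumulative frequencies for "less than 15, 20, ..., 50" (income in '000 Rs).
cumulative = [4, 12, 30, 60, 80, 90, 95, 97]
lower_limits = [10, 15, 20, 25, 30, 35, 40, 45]

frequencies = frequencies_from_less_than(cumulative)
print(frequencies)                                      # [4, 8, 18, 30, 20, 10, 5, 2]
print(round(grouped_mode(lower_limits, frequencies, 5), 3))  # 27.727
```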


Activities
• A shoe company, making shoes for adults only, wants to know the most popular size of shoes. Which average will be most appropriate for it?
• Which average will be most appropriate for the companies producing the following goods? Why?
(i) Diaries and notebooks
(ii) School bags
(iii) Jeans and T-Shirts
• Take a small survey in your class to know the students' preference for Chinese food using appropriate measure of central tendency.
• Can mode be located graphically?

6. RELATIVE POSITION OF ARITHMETIC MEAN, MEDIAN AND MODE

Suppose we express,
Arithmetic Mean = $M_e$
Median = $M_i$
Mode = $M_o$

For a moderately skewed distribution with a single mode, the relative magnitudes of the three are usually $M_e > M_i > M_o$ or $M_e < M_i < M_o$ (suffixes occurring in alphabetical order). That is, the median usually lies between the arithmetic mean and the mode.

7. CONCLUSION

Measures of central tendency or averages are used to summarise the data. Each specifies a single most representative value to describe the data set. Arithmetic mean is the most commonly used average. It is simple to calculate and is based on all the observations. But it is unduly affected by the presence of extreme items. Median is a better summary for such data. Mode is generally used to describe the qualitative data. Median and mode can be easily computed graphically. In case of open-ended distribution they can also be easily computed. Thus, it is important to select an appropriate average depending upon the purpose of analysis and the nature of the distribution.
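
The remark that the arithmetic mean is pulled by extreme items while the median is not can be checked numerically. The sketch below is only an illustration: it uses Python's standard `statistics` module and reuses the weekly incomes of Example 2, which contain two very large values (2500 and 5000).

```python
from statistics import mean, median

# Weekly income (in Rs) of the 10 families from Example 2.
weekly_income = [850, 700, 100, 750, 5000, 80, 420, 2500, 400, 360]

print(mean(weekly_income))    # 1116
print(median(weekly_income))  # 560.0

# Dropping the two extreme incomes moves the mean far more than the median.
without_extremes = [x for x in weekly_income if x < 2500]
print(mean(without_extremes))    # 457.5
print(median(without_extremes))  # 410.0
```

Here the mean falls from 1116 to 457.5 once the two largest incomes are removed, while the median moves only from 560 to 410, which is why the median is often the better summary for such data.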


Recap
• The measure of central tendency summarises the data with a single
value, which can represent the entire data.
• Arithmetic mean is defined as the sum of the values of all observations
divided by the number of observations.
• The sum of deviations of items from the arithmetic mean is always
equal to zero.
• Sometimes, it is important to assign weights to various items
according to their importance.
• Median is the central value of the distribution in the sense that the
number of values less than the median is equal to the number greater
than the median.
• Quartiles divide the total set of values into four equal parts.
• Mode is the value which occurs most frequently.
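
The positional rules recapped above for the median and the quartiles can be sketched in Python. The helper names are assumptions made for this illustration; the position formulas (N+1)/2 and (N+1)/4, the interpolation between neighbouring items, and the grouped formula L + ((N/2 – c.f.)/f) × h are the ones used in Examples 6, 8 and 9.

```python
def value_at_position(sorted_values, position):
    """Size of the item at a 1-based, possibly fractional, position."""
    if not 1 <= position <= len(sorted_values):
        raise ValueError("position lies outside the series")
    whole = int(position)            # e.g. 2 for the 2.75th item
    fraction = position - whole      # e.g. 0.75
    if fraction == 0:
        return sorted_values[whole - 1]
    lower, upper = sorted_values[whole - 1], sorted_values[whole]
    return lower + fraction * (upper - lower)


def median(values):
    """Size of the (N + 1) / 2 th item of the ordered series."""
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) + 1) / 2)


def lower_quartile(values):
    """Size of the (N + 1) / 4 th item of the ordered series."""
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) + 1) / 4)


def grouped_median(lower_limits, frequencies, class_width):
    """Median = L + ((N/2 - c.f.) / f) * h for classes in ascending order."""
    half = sum(frequencies) / 2
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:   # the N/2 th item falls in this class
            return lower + (half - cumulative) / f * class_width
        cumulative += f
    raise ValueError("no classes supplied")


marks = [25, 72, 28, 65, 29, 60, 30, 54, 32, 53,
         33, 52, 35, 51, 42, 48, 45, 47, 46, 33]
print(median(marks))                                         # 45.5 (Example 6)
print(lower_quartile([22, 26, 14, 30, 18, 11, 35, 41, 12, 32]))  # 13.5 (Example 9)

wage_lower_limits = [20, 25, 30, 35, 40, 45, 50, 55]
workers = [14, 28, 33, 30, 20, 15, 13, 7]
print(round(grouped_median(wage_lower_limits, workers, 5), 2))   # 35.83 (Example 8)
```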

EXERCISES

1. Which average would be suitable in the following cases?


(i) Average size of readymade garments.
(ii) Average intelligence of students in a class.
(iii) Average production in a factory per shift.
(iv) Average wage in an industrial concern.
(v) When the sum of absolute deviations from average is least.
(vi) When quantities of the variable are in ratios.
(vii) In case of open-ended frequency distribution.
2. Indicate the most appropriate alternative from the multiple choices
provided against each question.
(i) The most suitable average for qualitative measurement is
(a) arithmetic mean
(b) median
(c) mode
(d) geometric mean
(e) none of the above
(ii) Which average is affected most by the presence of extreme items?
(a) median
(b) mode
(c) arithmetic mean
(d) none of the above
(iii) The algebraic sum of deviation of a set of n values from A.M. is
(a) n
(b) 0
(c) 1


(d) none of the above


[Ans. (i) b (ii) c (iii) b]
3. Comment whether the following statements are true or false.
(i) The sum of deviation of items from median is zero.
(ii) An average alone is not enough to compare series.
(iii) Arithmetic mean is a positional value.
(iv) Upper quartile is the lowest value of top 25% of items.
(v) Median is unduly affected by extreme observations.
[Ans. (i) False (ii) True (iii) False (iv) True (v) False]
4. If the arithmetic mean of the data given below is 28, find (a) the missing
frequency, and (b) the median of the series:
Profit per retail shop (in Rs) 0-10 10-20 20-30 30-40 40-50 50-60
Number of retail shops 12 18 27 - 17 6
(Ans. The value of missing frequency is 20 and value of the median is
Rs 27.41)
5. The following table gives the daily income of ten workers in a factory.
Find the arithmetic mean.
Workers A B C D E F G H I J
Daily Income (in Rs) 120 150 180 200 250 300 220 350 370 260
(Ans. Rs 240)
6. Following information pertains to the daily income of 150 families.
Calculate the arithmetic mean.
Income (in Rs) Number of families
More than 75 150
,, 85 140
,, 95 115
,, 105 95
,, 115 70
,, 125 60
,, 135 40
,, 145 25
(Ans. Rs 116.3)
7. The size of land holdings of 380 families in a village is given below. Find
the median size of land holdings.
Size of Land Holdings (in acres)
Less than 100 100–200 200 – 300 300–400 400 and above.
Number of families
40 89 148 64 39
(Ans. 241.22 acres)
8. The following series relates to the daily income of workers employed in
a firm. Compute (a) highest income of lowest 50% workers (b) minimum
income earned by the top 25% workers and (c) maximum income earned
by lowest 25% workers.


Daily Income (in Rs) 10–14 15–19 20–24 25–29 30–34 35–39
Number of workers 5 10 15 20 10 5
(Hint: compute median, lower quartile and upper quartile.)
[Ans. (a) Rs 25.11 (b) Rs 19.92 (c) Rs 29.19]
9. The following table gives production yield in kg. per hectare of wheat of
150 farms in a village. Calculate the mean, median and mode values.
Production yield (kg. per hectare)
50–53 53–56 56–59 59–62 62–65 65–68 68–71 71–74 74–77
Number of farms
3 8 14 30 36 28 16 10 5
(Ans. mean = 63.82 kg. per hectare, median = 63.67 kg. per hectare,
mode = 63.29 kg. per hectare)

CHAPTER

7 Correlation

Studying this chapter should enable you to:
• understand the meaning of the term correlation;
• understand the nature of relationship between two variables;
• calculate the different measures of correlation;
• analyse the degree and direction of the relationships.

1. INTRODUCTION

In previous chapters you have learnt how to construct summary measures out of a mass of data and changes among similar variables. Now you will learn how to examine the relationship between two variables.

As the summer heat rises, hill stations are crowded with more and more visitors. Ice-cream sales become more brisk. Thus, the temperature is related to number of visitors and sale of ice-creams. Similarly, as the supply of tomatoes increases in your local mandi, its price drops. When the local harvest starts reaching the market, the price of tomatoes drops from Rs 40 per kg to Rs 4 per kg or even less. Thus supply is related to price. Correlation analysis is a means for examining such relationships systematically. It deals with questions such as:
• Is there any relationship between two variables?


• If the value of one variable changes, does the value of the other also change?
• Do both the variables move in the same direction?
• How strong is the relationship?

2. TYPES OF RELATIONSHIP

Let us look at various types of relationship. The relation between movements in quantity demanded and the price of a commodity is an integral part of the theory of demand, which you will study in Class XII. Low agricultural productivity is related to low rainfall. Such examples of relationship may be given a cause and effect interpretation. Others may be just coincidence. The relation between the arrival of migratory birds in a sanctuary and the birth rates in the locality cannot be given any cause and effect interpretation. The relationships are simple coincidence. The relationship between size of the shoes and money in your pocket is another such example. Even if relationships exist, they are difficult to explain.

In another instance a third variable’s impact on two variables may give rise to a relation between the two variables. Brisk sale of ice-creams may be related to higher number of deaths due to drowning. The victims are not drowned due to eating of ice-creams. Rising temperature leads to brisk sale of ice-creams. Moreover, large number of people start going to swimming pools to beat the heat. This might have raised the number of deaths by drowning. Thus, temperature is behind the high correlation between the sale of ice-creams and deaths due to drowning.

What Does Correlation Measure?

Correlation studies and measures the direction and intensity of relationship among variables. Correlation measures covariation, not causation. Correlation should never be interpreted as implying cause and effect relation. The presence of correlation between two variables X and Y simply means that when the value of one variable is found to change in one direction, the value of the other


variable is found to change either in the same direction (i.e. positive change) or in the opposite direction (i.e. negative change), but in a definite way. For simplicity we assume here that the correlation, if it exists, is linear, i.e. the relative movement of the two variables can be represented by drawing a straight line on graph paper.

Types of Correlation
Correlation is commonly classified into negative and positive correlation. The correlation is said to be positive when the variables move together in the same direction. When the income rises, consumption also rises. When income falls, consumption also falls. Sale of ice-cream and temperature move in the same direction. The correlation is negative when they move in opposite directions. When the price of apples falls its demand increases. When the prices rise its demand decreases. When you spend more time in studying, chances of your failing decline. When you spend fewer hours in your studies, chances of scoring low marks/grades increase. These are instances of negative correlation. The variables move in opposite directions.

3. TECHNIQUES FOR MEASURING CORRELATION
Three important tools used to study correlation are scatter diagrams, Karl Pearson’s coefficient of correlation and Spearman’s rank correlation.
A scatter diagram visually presents the nature of association without giving any specific numerical value. A numerical measure of linear relationship between two variables is given by Karl Pearson’s coefficient of correlation. A relationship is said to be linear if it can be represented by a straight line. Spearman’s coefficient of correlation measures the linear association between ranks assigned to individual items according to their attributes. Attributes are those variables which cannot be numerically measured such as intelligence of people, physical appearance, honesty, etc.

Scatter Diagram
A scatter diagram is a useful technique for visually examining the form of relationship, without calculating any numerical value. In this technique, the values of the two variables are plotted as points on a graph paper. From a scatter diagram, one can get a fairly good idea of the nature of relationship. In a scatter diagram the degree of closeness of the scatter points and their overall direction enable us to examine the relationship. If all the points lie on a line, the correlation is perfect and is said to be in unity. If the scatter points are widely dispersed around the line, the correlation is low. The correlation is said to be linear if the scatter points lie near a line or on a line.
Scatter diagrams spanning over Fig. 7.1 to Fig. 7.5 give us an idea of


the relationship between two variables. Fig. 7.1 shows a scatter around an upward rising line indicating the movement of the variables in the same direction. When X rises Y will also rise. This is positive correlation. In Fig. 7.2 the points are found to be scattered around a downward sloping line. This time the variables move in opposite directions. When X rises Y falls and vice versa. This is negative correlation. In Fig. 7.3 there is no upward rising or downward sloping line around which the points are scattered. This is an example of no correlation. In Fig. 7.4 and Fig. 7.5, the points are no longer scattered around an upward rising or downward falling line. The points themselves are on the lines. This is referred to as perfect positive correlation and perfect negative correlation respectively.

Activity
• Collect data on height, weight and marks scored by students in your class in any two subjects in class X. Draw the scatter diagram of these variables taking two at a time. What type of relationship do you find?

A careful observation of the scatter diagram gives an idea of the nature and intensity of the relationship.

Karl Pearson’s Coefficient of Correlation
This is also known as product moment correlation coefficient or simple correlation coefficient. It gives a precise numerical value of the degree of linear relationship between two variables X and Y.
It is important to note that Karl Pearson’s coefficient of correlation should be used only when there is a linear relation between the variables. When there is a non-linear relation between X and Y, then calculating the Karl Pearson’s coefficient of correlation can be misleading. Thus, if the true relation is of the linear type as shown by the scatter diagrams in Figures 7.1, 7.2, 7.4 and 7.5, then the Karl Pearson’s coefficient of correlation should be calculated and it will tell us the direction and intensity of the relation between the variables. But if the true relation is of the type shown in the scatter diagrams in Figures 7.6 or 7.7, then it means there is a non-linear relation between X and Y and we should not try to use the Karl Pearson’s coefficient of correlation.
It is, therefore, advisable to first examine the scatter diagram of the relation between the variables before calculating the Karl Pearson’s correlation coefficient.
Let X1, X2, ..., XN be N values of X and Y1, Y2, ..., YN be the corresponding values of Y. In the subsequent presentations, the subscripts indicating the unit are dropped for the sake of simplicity. The arithmetic means of X and Y are defined as
\[ \bar{X} = \frac{\sum X}{N}; \quad \bar{Y} = \frac{\sum Y}{N} \]


Fig. 7.1: Positive Correlation
Fig. 7.2: Negative Correlation
Fig. 7.3: No Correlation
Fig. 7.4: Perfect Positive Correlation
Fig. 7.5: Perfect Negative Correlation
Fig. 7.6: Positive non-linear relation
Fig. 7.7: Negative non-linear relation


and their variances are as follows
\[ \sigma_x^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{\sum X^2}{N} - \bar{X}^2 \]
and
\[ \sigma_y^2 = \frac{\sum (Y - \bar{Y})^2}{N} = \frac{\sum Y^2}{N} - \bar{Y}^2 \]
The standard deviations of X and Y, respectively, are the positive square roots of their variances. Covariance of X and Y is defined as
\[ \mathrm{Cov}(X,Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} = \frac{\sum xy}{N} \]
where x = X – X̄ and y = Y – Ȳ are the deviations of the i th values of X and Y from their mean values respectively.
The sign of covariance between X and Y determines the sign of the correlation coefficient. The standard deviations are always positive. If the covariance is zero, the correlation coefficient is always zero. The product moment correlation or the Karl Pearson’s measure of correlation is given by
\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \tag{1} \]
or
\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} \tag{2} \]
or
\[ r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\sum X^2 - \dfrac{(\sum X)^2}{N}}\,\sqrt{\sum Y^2 - \dfrac{(\sum Y)^2}{N}}} \tag{3} \]
or
\[ r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2}\,\sqrt{N \sum Y^2 - (\sum Y)^2}} \tag{4} \]

Properties of Correlation Coefficient
Let us now discuss the properties of the correlation coefficient.
• r has no unit. It is a pure number. It means units of measurement are not part of r. r between height in feet and weight in kilograms, for instance, could be say 0.7.
• A negative value of r indicates an inverse relation. A change in one variable is associated with change in the other variable in the opposite direction. When price of a commodity rises, its demand falls. When the rate of interest rises the demand for funds also falls. It is because now funds have become costlier.
• If r is positive the two variables move in the same direction. When the price of coffee, a substitute of tea, rises the demand for tea also rises. Improvement in irrigation facilities is associated with higher yield. When temperature rises the sale of ice-creams becomes brisk.


• The value of the correlation coefficient lies between minus one and plus one, –1 ≤ r ≤ 1. If, in any exercise, the value of r is outside this range it indicates error in calculation.
• The magnitude of r is unaffected by the change of origin and change of scale. Given two variables X and Y let us define two new variables
\[ U = \frac{X - A}{B}; \quad V = \frac{Y - C}{D} \]
where A and C are assumed means of X and Y respectively. B and D are common factors and of same sign. Then
\[ r_{xy} = r_{uv} \]
This property is used to calculate correlation coefficient in a highly simplified manner, as in the step deviation method.
• If r = 0 the two variables are uncorrelated. There is no linear relation between them. However other types of relation may be there.
• If r = 1 or r = –1 the correlation is perfect and there is exact linear relation.
• A high value of r indicates strong linear relationship. Its value is said to be high when it is close to +1 or –1.
• A low value of r (close to zero) indicates a weak linear relation. But there may be a non-linear relation.
As you have read in Chapter 1, the statistical methods are no substitute for common sense. Here is another example, which highlights the need for understanding the data properly before correlation is calculated and interpreted. An epidemic spreads in some villages and the government sends a team of doctors to the affected villages. The correlation between the number of deaths and the number of doctors sent to the villages is found to be positive. Normally, the healthcare facilities provided by the doctors are expected to reduce the number of deaths, showing a negative correlation. The positive correlation arises for other reasons. The data relate to a specific time period. Many of the reported deaths could be terminal cases where the doctors could do little. Moreover, the benefit of the presence of doctors becomes visible only after some time. It is also possible that the reported deaths are not due to the epidemic. A tsunami suddenly hits the state and death toll rises.
Let us illustrate the calculation of r by examining the relationship between years of schooling of farmers and the annual yield per acre.


Example 1

No. of years of schooling of farmers    Annual yield per acre in ’000 (Rs)
0                                       4
2                                       4
4                                       6
6                                       10
8                                       10
10                                      8
12                                      7

Formula (1) needs the values of Σxy, σx and σy.

TABLE 7.1
Calculation of r between years of schooling of farmers and annual yield
(X = years of education, Y = annual yield per acre in ’000 Rs)

X      X – X̄    (X – X̄)²    Y      Y – Ȳ    (Y – Ȳ)²    (X – X̄)(Y – Ȳ)
0      –6        36          4      –3        9           18
2      –4        16          4      –3        9           12
4      –2        4           6      –1        1           2
6      0         0           10     3         9           0
8      2         4           10     3         9           6
10     4         16          8      1         1           4
12     6         36          7      0         0           0

ΣX = 42; Σ(X – X̄)² = 112; ΣY = 49; Σ(Y – Ȳ)² = 38; Σ(X – X̄)(Y – Ȳ) = 42

From Table 7.1 we get
\[ \sum xy = 42, \quad \sigma_x = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{112}{7}}, \quad \sigma_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}} = \sqrt{\frac{38}{7}} \]
Substituting these values in formula (1)
\[ r = \frac{42}{7 \sqrt{\dfrac{112}{7}} \sqrt{\dfrac{38}{7}}} = 0.644 \]
The same value can be obtained from formula (2) also.
\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} \tag{2} \]
\[ r = \frac{42}{\sqrt{112}\,\sqrt{38}} = 0.644 \]
Thus, years of education of farmers and annual yield per acre are positively correlated. The value of r is also fairly large. It suggests that the more years farmers invest in education, the higher the yield per acre tends to be. It underlines the importance of farmers’ education.
To use formula (3)


\[ r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\sum X^2 - \dfrac{(\sum X)^2}{N}}\,\sqrt{\sum Y^2 - \dfrac{(\sum Y)^2}{N}}} \tag{3} \]
the values of the following expressions have to be calculated, i.e. ΣXY, ΣX² and ΣY².
Now apply formula (3) to get the value of r.
Let us now look at the interpretation of different values of r. The correlation coefficient between marks secured in English and Statistics is, say, 0.1. It means that though the marks secured in the two subjects are positively correlated, the strength of the relationship is weak. Students with high marks in English may be getting relatively low marks in Statistics. Had the value of r been, say, 0.9, students with high marks in English would very likely get high marks in Statistics.
An example of negative correlation is the relation between arrival of vegetables in the local mandi and price of vegetables. If r is –0.9, a large vegetable supply in the local mandi will be accompanied by a lower price of vegetables. Had it been –0.1, a large vegetable supply would still be accompanied by a lower price, but not as low as when r is –0.9. The extent of price fall depends on the absolute value of r. Had it been zero, there would have been no fall in price, even after large supplies in the market. This is also a possibility if the increase in supply is taken care of by a good transport network transferring it to other markets.

Activity
• Look at the following table. Calculate r between annual growth of national income at current price and the Gross Domestic Saving as percentage of GDP.

TABLE 7.2
Year       Annual growth of National Income    Gross Domestic Saving as percentage of GDP
1992–93    14                                  24
1993–94    17                                  23
1994–95    18                                  26
1995–96    17                                  27
1996–97    16                                  25
1997–98    12                                  25
1998–99    16                                  23
1999–00    11                                  25
2000–01    8                                   24
2001–02    10                                  23
Source: Economic Survey, (2004–05) Pg. 8,9

Step deviation method to calculate correlation coefficient.
When the values of the variables are large, the burden of calculation can be considerably reduced by using a property of r. The property is that r is independent of change in origin and scale. Using it in this way is known as the step deviation method. It involves the transformation of the variables X and Y as follows:


\[ U = \frac{X - A}{h}; \quad V = \frac{Y - B}{k} \]
where A and B are assumed means, h and k are common factors and have same signs.
Then
\[ r_{UV} = r_{XY} \]
This can be illustrated with the exercise of analysing the correlation between price index and money supply.

Example 2

Price index (X)                   120     150     190     220     230
Money supply in Rs crores (Y)     1800    2000    2500    2700    3000

The simplification, using step deviation method, is illustrated below. Let A = 100; h = 10; B = 1700 and k = 100.
The table of transformed variables is as follows:

TABLE 7.3
Calculation of r between price index and money supply using step deviation method

U = (X – 100)/10    V = (Y – 1700)/100    U²      V²      UV
2                   1                     4       1       2
5                   3                     25      9       15
9                   8                     81      64      72
12                  10                    144     100     120
13                  13                    169     169     169

ΣU = 41; ΣV = 35; ΣU² = 423; ΣV² = 343; ΣUV = 378

Substituting these values in formula (3)
\[ r = \frac{\sum UV - \dfrac{(\sum U)(\sum V)}{N}}{\sqrt{\sum U^2 - \dfrac{(\sum U)^2}{N}}\,\sqrt{\sum V^2 - \dfrac{(\sum V)^2}{N}}} \tag{3} \]
\[ = \frac{378 - \dfrac{41 \times 35}{5}}{\sqrt{423 - \dfrac{(41)^2}{5}}\,\sqrt{343 - \dfrac{(35)^2}{5}}} = \frac{91}{\sqrt{86.8 \times 98}} \approx 0.987 \]
The strong positive correlation between price index and money supply is an important premise of monetary policy. When the money supply grows the price index also rises.

Activity
• Using data related to India’s population and national income, calculate the correlation between them using step deviation method.

Spearman’s rank correlation
Spearman’s rank correlation was developed by the British psychologist C.E. Spearman. It is used in the following situations:
1. Suppose we are trying to estimate the correlation between the heights and weights of students in a remote village where neither measuring rods nor weighing machines are


available. In such a situation, we cannot measure height or weight, but we can certainly rank the students according to weight and height. These ranks can then be used to calculate Spearman’s rank correlation coefficient.
2. Suppose we are dealing with things such as fairness, honesty or beauty. These cannot be measured in the same way as we measure income, weight or height. At most, these things can be measured relatively, for example, we may be able to rank people according to beauty (some people would argue that even this is not possible because standards and criteria of beauty may differ from person to person and culture to culture). If we wish to find the relation between variables, at least one of which is of this type, then Spearman’s rank correlation coefficient is to be used.
3. Spearman’s rank correlation coefficient can be used in some cases where there is a relation whose direction is clear but which is non-linear, as shown when the scatter diagrams are of the type shown in Figures 7.6 and 7.7.
4. Spearman’s correlation coefficient is not affected by extreme values. In this respect, it is better than Karl Pearson’s correlation coefficient. Thus if the data contains some extreme values, Spearman’s correlation coefficient can be very useful.
Rank correlation coefficient and simple correlation coefficient have the same interpretation. Its formula has been derived from simple correlation coefficient where individual values have been replaced by ranks. These ranks are used for the calculation of correlation. This coefficient provides a measure of linear association between ranks assigned to these units, not their values. The Spearman’s rank correlation formula is
\[ r_s = 1 - \frac{6 \sum D^2}{n^3 - n} \tag{4} \]
where n is the number of observations and D the deviation of ranks assigned to a variable from those assigned to the other variable.
All the properties of the simple correlation coefficient are applicable here. Like the Pearsonian coefficient of correlation it lies between 1 and –1. However, generally it is not as accurate as the ordinary method. This is due to the fact that not all the information concerning the data is utilised. The first difference is the difference of consecutive values. The first differences of the values of items in the series, arranged in order of magnitude, are almost never constant. Usually the data cluster around the central values with smaller differences in the middle of the array.
If the first differences were constant, then r and rs would give identical results. In general rs is less than or equal to r.

Calculation of Rank Correlation Coefficient
The calculation of rank correlation will be illustrated under three situations.


1. The ranks are given.
2. The ranks are not given. They have to be worked out from the data.
3. Ranks are repeated.

Case 1: When the ranks are given

Example 3
Five persons are assessed by three judges in a beauty contest. We have to find out which pair of judges has the nearest approach to common perception of beauty.

         Competitors
Judge    1    2    3    4    5
A        1    2    3    4    5
B        2    4    1    5    3
C        1    3    5    2    4

There are 3 pairs of judges necessitating calculation of rank correlation thrice. Formula (4) will be used.
\[ r_s = 1 - \frac{6 \sum D^2}{n^3 - n} \tag{4} \]
The rank correlation between A and B is calculated as follows:

A        B        D        D²
1        2        –1       1
2        4        –2       4
3        1        2        4
4        5        –1       1
5        3        2        4
Total                      14

Substituting these values in formula (4)
\[ r_s = 1 - \frac{6 \sum D^2}{n^3 - n} = 1 - \frac{6 \times 14}{5^3 - 5} = 1 - \frac{84}{120} = 1 - 0.7 = 0.3 \]
The rank correlation between A and C is calculated as follows:

A        C        D        D²
1        1        0        0
2        3        –1       1
3        5        –2       4
4        2        2        4
5        4        1        1
Total                      10

Substituting these values in formula (4) the rank correlation is 0.5. Similarly, for judges B and C we get ΣD² = 28, so the rank correlation between their rankings is 1 – 168/120 = –0.4. Thus, the perceptions of judges A and C are the closest. Judges B and C have very different tastes.

Case 2: When the ranks are not given

Example 4
We are given the percentage of marks secured by 5 students in Economics and Statistics. Then the ranking has to be worked out and the rank correlation is to be calculated.

Student    Marks in Statistics (X)    Marks in Economics (Y)
A          85                         60
B          60                         48
C          55                         49
D          65                         50
E          75                         55


Student    Ranking in Statistics (Rx)    Ranking in Economics (Ry)
A          1                             1
B          4                             5
C          5                             4
D          3                             3
E          2                             2

Once the ranking is complete formula (4) is used to calculate rank correlation. Here ΣD² = 2, so rs = 1 – 12/120 = 0.9.

Case 3: When the ranks are repeated

Example 5
The values X and Y are given as follows

X        Y
1200     75
1150     65
1000     50
990      100
800      90
780      85
760      90
750      40
730      50
700      60
620      50
600      75

In order to work out the rank correlation, the ranks of the values are worked out. Common ranks are given to the repeated items. The common rank is the mean of the ranks which those items would have assumed if they were slightly different from each other. The next item will be assigned the rank next to the rank already assumed.
Here Y has the value 50 at the 9th, 10th and 11th rank. Hence all three are given the average rank, i.e. 10.

Rank of X    Rank of Y    Deviation in Ranks (D)    D²
1            5.5          –4.5                      20.25
2            7            –5                        25.00
3            10           –7                        49.00
4            1            3                         9.00
5            2.5          2.5                       6.25
6            4            2                         4.00
7            2.5          4.5                       20.25
8            12           –4                        16.00
9            10           –1                        1.00
10           8            2                         4.00
11           10           1                         1.00
12           5.5          6.5                       42.25
Total                                               198.00

The formula of Spearman’s rank correlation coefficient when the ranks are repeated is as follows
\[ r_s = 1 - \frac{6\left[\sum D^2 + \dfrac{m_1^3 - m_1}{12} + \dfrac{m_2^3 - m_2}{12} + \cdots\right]}{n(n^2 - 1)} \]
where m1, m2, ..., are the number of repetitions of ranks and (m1³ – m1)/12, ..., their corresponding correction factors. In this data Y has three repeated values: 50 occurs three times, while 90 and 75 occur twice each. The necessary correction for this data thus is
\[ \frac{3^3 - 3}{12} + \frac{2^3 - 2}{12} + \frac{2^3 - 2}{12} = \frac{36}{12} = 3 \]
Substituting the values of these expressions
\[ r_s = 1 - \frac{6(198 + 3)}{12^3 - 12} = 1 - \frac{1206}{1716} = 1 - 0.70 = 0.30 \]
Thus, there is positive rank correlation between X and Y. Both X and Y move in the same direction. However, the


relationship cannot be described as strong.

Activity
• Collect data on marks scored by 10 of your classmates in class IX and X examinations. Calculate the rank correlation coefficient between them. If your data do not have any repetition, repeat the exercise by taking a data set having repeated ranks. What are the circumstances in which rank correlation coefficient is preferred to simple correlation coefficient? If data are precisely measured will you still prefer rank correlation coefficient to simple correlation? When can you be indifferent to the choice? Discuss in class.

4. CONCLUSION
We have discussed some techniques for studying the relationship between two variables, particularly the linear relationship. The scatter diagram gives a visual presentation of the relationship and is not confined to linear relations. Karl Pearson’s coefficient of correlation and Spearman’s rank correlation measure linear relationship among variables. When the variables cannot be measured precisely, rank correlation can be used. These measures however do not imply causation. The knowledge of correlation gives us an idea of the direction and intensity of change in a variable when the correlated variable changes.

Recap
• Correlation analysis studies the relation between two variables.
• Scatter diagrams give a visual presentation of the nature of
relationship between two variables.
• Karl Pearson’s coefficient of correlation r measures numerically only
linear relationship between two variables. r lies between –1 and 1.
• When the variables cannot be measured precisely Spearman’s rank
correlation can be used to measure the linear relationship
numerically.
• Repeated ranks need correction factors.
• Correlation does not mean causation. It only means
covariation.
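
The rank correlation calculations of Examples 3 and 5 can be sketched in Python as a check on the hand computation. This is only an illustrative sketch: the function names average_ranks and spearman_rs are our own, ranks are assigned in descending order (largest value gets rank 1) as in Example 5, and one correction factor is added for every group of repeated ranks.

```python
from collections import Counter


def average_ranks(values):
    """Rank values in descending order; repeated values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rs(rank_x, rank_y):
    """r_s = 1 - 6 * (sum of D^2 + tie corrections) / (n^3 - n)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = 0.0
    for ranks in (rank_x, rank_y):
        for m in Counter(ranks).values():
            correction += (m ** 3 - m) / 12  # contributes zero when m == 1
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


# Example 3: ranks given by the three judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 4, 1, 5, 3]
judge_c = [1, 3, 5, 2, 4]
print(round(spearman_rs(judge_a, judge_b), 2))  # 0.3
print(round(spearman_rs(judge_a, judge_c), 2))  # 0.5
print(round(spearman_rs(judge_b, judge_c), 2))  # -0.4

# Example 5: ranks worked out from the data, with repeated values in Y.
x = [1200, 1150, 1000, 990, 800, 780, 760, 750, 730, 700, 620, 600]
y = [75, 65, 50, 100, 90, 85, 90, 40, 50, 60, 50, 75]
print(round(spearman_rs(average_ranks(x), average_ranks(y)), 2))  # 0.3
```

When no ranks are repeated the correction term is zero and the function reduces to formula (4).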


EXERCISES

1. The unit of correlation coefficient between height in feet and weight in kgs is
(i) kg/feet
(ii) percentage
(iii) non-existent
2. The range of simple correlation coefficient is
(i) 0 to infinity
(ii) minus one to plus one
(iii) minus infinity to infinity
3. If rxy is positive the relation between X and Y is of the type
(i) When Y increases X increases
(ii) When Y decreases X increases
(iii) When Y increases X does not change
4. If rxy = 0 the variable X and Y are
(i) linearly related
(ii) not linearly related
(iii) independent
5. Of the following three measures which can measure any type of relationship
(i) Karl Pearson’s coefficient of correlation
(ii) Spearman’s rank correlation
(iii) Scatter diagram
6. If precisely measured data are available the simple correlation coefficient is
(i) more accurate than rank correlation coefficient
(ii) less accurate than rank correlation coefficient
(iii) as accurate as the rank correlation coefficient
7. Why is r preferred to covariance as a measure of association?
8. Can r lie outside the –1 and 1 range depending on the type of data?
9. Does correlation imply causation?
10. When is rank correlation more precise than simple correlation
coefficient?
11. Does zero correlation mean independence?
12. Can simple correlation coefficient measure any type of relationship?
13. Collect the price of five vegetables from your local market every day for
a week. Calculate their correlation coefficients. Interpret the result.
14. Measure the height of your classmates. Ask them the height of their benchmate. Calculate the correlation coefficient of these two variables. Interpret the result.
15. List some variables where accurate measurement is difficult.
16. Interpret the values of r as 1, –1 and 0.
17. Why does rank correlation coefficient differ from Pearsonian correlation
coefficient?
18. Calculate the correlation coefficient between the heights of fathers in
inches (X) and their sons (Y)
X 65 66 57 67 68 69 70 72
Y 67 56 65 68 72 72 69 71
(Ans. r = 0.603)
19. Calculate the correlation coefficient between X and Y and comment on
their relationship:
X –3 –2 –1 1 2 3
Y 9 4 1 1 4 9
(Ans. r = 0)
20. Calculate the correlation coefficient between X and Y and comment on
their relationship
X 1 3 4 5 7 8
Y 2 6 8 10 14 16
(Ans. r = 1)
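
The answers to Exercises 19 and 20 can be checked with a short Python sketch of formula (2). The function name pearson_r is our own choice for illustration. Exercise 19 is a useful reminder that r = 0 only rules out a linear relation: there Y equals X squared exactly.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the means, as in formula (2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Exercise 19: Y = X**2, an exact but non-linear relation.
r19 = pearson_r([-3, -2, -1, 1, 2, 3], [9, 4, 1, 1, 4, 9])

# Exercise 20: Y = 2X, an exact linear relation.
r20 = pearson_r([1, 3, 4, 5, 7, 8], [2, 6, 8, 10, 14, 16])

print(abs(round(r19, 3)), round(r20, 3))
```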

Activity
• Use all the formulae discussed here to calculate r between
India’s national income and exports taking at least ten
observations.
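
Before working on real data for this activity, it may help to see formula (3) and the step deviation method side by side. The sketch below is illustrative only: the function names r_raw_sums and step_deviation are our own, and the data are those of Example 1 and Example 2.

```python
from math import sqrt


def r_raw_sums(xs, ys):
    """Formula (3): r from the sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (sxy - sx * sy / n) / (sqrt(sxx - sx ** 2 / n) * sqrt(syy - sy ** 2 / n))


def step_deviation(values, assumed_mean, common_factor):
    """Change of origin and scale: (value - assumed mean) / common factor."""
    return [(v - assumed_mean) / common_factor for v in values]


# Example 2: price index (X) and money supply in Rs crores (Y).
price_index = [120, 150, 190, 220, 230]
money_supply = [1800, 2000, 2500, 2700, 3000]
u = step_deviation(price_index, 100, 10)
v = step_deviation(money_supply, 1700, 100)
r_xy = r_raw_sums(price_index, money_supply)
r_uv = r_raw_sums(u, v)
print(round(r_xy, 3), round(r_uv, 3))  # 0.987 0.987

# Example 1: years of schooling and annual yield per acre.
schooling = [0, 2, 4, 6, 8, 10, 12]
annual_yield = [4, 4, 6, 10, 10, 8, 7]
print(round(r_raw_sums(schooling, annual_yield), 3))  # 0.644
```

The two values for Example 2 agree because r is independent of change in origin and scale. The transformed values are simply smaller numbers to work with by hand.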

...if we find any association between two or more variables, we might be interested in estimating the value of one variable for known value(s) of another variable(s)

5.1 INTRODUCTION
In business, it often becomes necessary to have some forecast so that the management can take a decision regarding a product or a particular course of action. In order to make a forecast, one has to ascertain some relationship between two or more variables relevant to a particular situation. For example, a company is interested to know how far the demand for television sets will increase in the next five years, keeping in mind the growth of population in a certain town. Here, it clearly assumes that the increase in population will lead to an increased demand for television sets. Thus, to determine the nature and extent of relationship between these two variables becomes important for the company.

In the preceding lesson, we studied in some depth linear correlation between two variables. Here we have a similar concern, the association between variables, except that we develop it further in two respects. First, we learn how to build statistical models of relationships between the variables to have a better understanding of their features. Second, we extend the models to consider their use in forecasting.

For this purpose, we have to use the technique of regression analysis, which forms the subject-matter of this lesson.

5.2 WHAT IS REGRESSION?

In 1889, Sir Francis Galton, a cousin of Charles Darwin, published a work on heredity, “Natural Inheritance”. He reported his discovery that sizes of seeds of sweet pea plants appeared to “revert” or “regress” to the mean size in successive generations. He also reported results of a study of the relationship between heights of fathers and heights of their sons. A straight line was fit to the data pairs: height of father versus height of son. Here, too, he found a “regression to mediocrity”. The heights of the sons represented a movement away from their fathers, towards the average height. We credit Sir Francis Galton with the idea of statistical regression.

While most applications of regression analysis may have little to do with the “regression to the mean” discovered by Galton, the term “regression” remains. It now refers to the statistical technique of modeling the relationship between two or more variables. In a general sense, regression analysis means the estimation or prediction of the unknown value of one variable from the known value(s) of the other variable(s). It is one of the most important and widely used statistical techniques in almost all sciences: natural, social or physical.

In this lesson we will focus only on simple regression, that is, linear regression involving only two variables: a dependent variable and an independent variable. Regression analysis for studying more than two variables at a time is known as multiple regression.

5.2.1 INDEPENDENT AND DEPENDENT VARIABLES

Simple regression involves only two variables; one variable is predicted by another variable. The variable to be predicted is called the dependent variable. The predictor is called the independent variable, or explanatory variable. For example, when we are trying to predict the demand for television sets on the basis of population growth, we are using the demand for television sets as the dependent variable and the population growth as the independent or predictor variable.

The decision as to which variable is which sometimes causes problems. Often the choice is obvious, as in the case of demand for television sets and population growth, because it would make no sense to suggest that population growth could be dependent on TV demand! The population growth has to be the independent variable and the TV demand the dependent variable.

If we are unsure, here are some points that might be of use:

• if we have control over one of the variables then that is the independent variable. For example, a manufacturer can decide how much to spend on advertising and expect his sales to be dependent upon how much he spends
• if there is any lapse of time between the two variables being measured, then the latter must depend upon the former; it cannot be the other way round
• if we want to predict the values of one variable from our knowledge of the other variable, the variable to be predicted must be dependent on the known one

5.3 LINEAR REGRESSION

The task of bringing out a linear relationship consists of developing methods of fitting a straight line, or a regression line as it is often called, to the data on two variables.

The line of regression is the graphical representation of the best estimate of one variable for any given value of the other variable. The nomenclature of the line depends on the independent and dependent variables. If X and Y are two variables whose relationship is to be indicated, the line that gives the best estimate of Y for any value of X is called the regression line of Y on X. If the dependent variable changes to X, then the line that gives the best estimate of X for any value of Y is called the regression line of X on Y.

5.3.1 REGRESSION LINE OF Y ON X

For purposes of illustration as to how a straight line relationship is obtained, consider the sample paired data on sales in each of N = 5 months of a year and the marketing expenditure incurred in each month, as shown in Table 5-1.

Table 5-1
Month      Sales (Rs lac), Y    Marketing Expenditure (Rs thousands), X
April      14                   10
May        17                   12
June       23                   15
July       21                   20
August     25                   23

Let Y, the sales, be the dependent variable and X, the marketing expenditure, the independent variable. We note that for each value of independent variable X, there is a specific value of the dependent variable Y, so that each value of X and Y can be seen as paired observations.

5.3.1.1 Scatter Diagram

Before obtaining a straight-line relationship, it is necessary to discover whether the relationship between the two variables is linear, that is, one which is best explained by a straight line. A good way of doing this is to plot the data on X and Y on a graph so as to yield a scatter diagram, as may be seen in Figure 5-1. A careful reading of the scatter diagram reveals that:

• the overall tendency of the points is to move upward, so the relationship is positive
• the general course of movement of the various points on the diagram can be best explained by a straight line
• there is a high degree of correlation between the variables, as the points lie close to a straight line

Figure 5-1 Scatter Diagram with Line of Best Fit

5.3.1.2 Fitting a Straight Line on the Scatter Diagram

If the movement of various points on the scatter diagram is best described by a straight line, the next step is to fit a straight line on the scatter diagram. It has to be so fitted that on the whole it lies as close as possible to every point on the scatter diagram. The necessary requirement for meeting this condition is that the sum of the squares of the vertical deviations of the observed Y values from the straight line is minimum.

As shown in Figure 5-1, if d1, d2, ..., dN are the vertical deviations of observed Y values from the straight line, fitting a straight line requires that
\[ d_1^2 + d_2^2 + \cdots + d_N^2 = \sum_{j=1}^{N} d_j^2 \]
is the minimum. The deviations dj have to be squared to avoid negative deviations canceling out the positive deviations. Since a straight line so fitted best approximates all the points on the scatter diagram, it is better known as the best approximating line or the line of best fit. A line of best fit can be fitted by means of:

1. Free hand drawing method, and
2. Least square method

Free Hand Drawing:

Free hand drawing is the simplest method of fitting a straight line. After a careful

inspection of the movement and spread of various points on the scatter diagram, a

straight line is drawn through these points by using a transparent ruler such that on the

whole it is closest to every point. A straight line so drawn is particularly useful when

future approximations of the dependent variable are promptly required.

Whereas the use of free hand drawing may yield a line nearest to the line of best fit, the major

drawback is that the slope of the line so drawn varies from person to person because of the

influence of subjectivity. Consequently, the values of the dependent variable estimated on the

basis of such a line may not be as accurate and precise as those based on the line of best fit.

Least Square Method:

The least square method of fitting a line of best fit requires minimizing the sum of the

squares of vertical deviations of each observed Y value from the fitted line. These deviations,

such as d1 and d3, are shown in Figure 5-1 and are given by Y - Yc, where Y is the observed

value and Yc the corresponding computed value given by the fitted line

Yc = a + bXi …………(5.1)

for the ith value of X.

The straight line relationship in Eq.(5.1), is stated in terms of two constants a and b

• The constant a is the Y-intercept; it indicates the height on the vertical axis from

where the straight line originates, representing the value of Y when X is zero.

• Constant b is a measure of the slope of the straight line; it shows the absolute change

in Y for a unit change in X. As the slope may be positive or negative, it indicates the

nature of relationship between Y and X. Accordingly, b is also known as the

regression coefficient of Y on X.

Since a straight line is completely defined by its intercept a and slope b, the task of fitting the

same reduces only to the computation of the values of these two constants. Once these two

values are known, the computed Yc values against each value of X can be easily obtained by

substituting X values in the linear equation.

In the method of least squares the values of a and b are obtained by solving simultaneously

the following pair of normal equations

$$ \sum Y = aN + b\sum X $$ …………(5.2)

$$ \sum XY = a\sum X + b\sum X^2 $$ …………(5.2)

The values of the expressions $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ can be obtained from the given

observations and then can be substituted in the above equations to obtain the value of a and b.

Since solving the two normal equations simultaneously for a and b may quite often be

cumbersome and time consuming, the two values can be obtained directly as

$$ a = \bar{Y} - b\bar{X} $$ …………(5.3)

and

$$ b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2} $$ …………(5.4)

Note: Eq. (5.3) is obtained simply by dividing both sides of the first of Eqs. (5.2) by N, and
Eq. (5.4) is obtained by substituting $(\bar{Y} - b\bar{X})$ in place of a in the second of Eqs. (5.2).

Instead of directly computing b, we may first compute the value of a as

$$ a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - \left(\sum X\right)^2} $$ …………(5.5)

and

$$ b = \frac{\bar{Y} - a}{\bar{X}} $$ …………(5.6)

Note: Eq. (5.5) is obtained by substituting $\frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2}$ for b in Eq. (5.3), and Eq.

(5.6) is obtained simply by rearranging Eq. (5.3).

Table 5-2
Computation of a and b

Y      X      XY     X²     Y²
14     10     140    100    196
17     12     204    144    289
23     15     345    225    529
21     20     420    400    441
25     23     575    529    625

∑Y = 100   ∑X = 80   ∑XY = 1684   ∑X² = 1398   ∑Y² = 2080

So using Eqs. (5.5) and (5.4)

$$ a = \frac{100 \times 1398 - 80 \times 1684}{5 \times 1398 - (80)^2} = \frac{139800 - 134720}{6990 - 6400} = \frac{5080}{590} = 8.6101695 $$

and

$$ b = \frac{5 \times 1684 - 80 \times 100}{5 \times 1398 - (80)^2} = \frac{8420 - 8000}{6990 - 6400} = \frac{420}{590} = 0.7118644 $$

Now given a = 8.61 and b = 0.71

The regression Eq.(5.1) takes the form

Yc = 8.61 + 0.71X …………(5.1a)
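
The least-squares computation can be reproduced with a short script. This is a minimal sketch and not part of the original notes: the function name fit_line is an illustrative choice, and the data are the five (X, Y) pairs of Table 5-2.

```python
# Data from Table 5-2.
X = [10, 12, 15, 20, 23]
Y = [14, 17, 23, 21, 25]


def fit_line(x, y):
    """Return (a, b) of the least-squares line y = a + b*x, per Eqs. (5.3)-(5.4)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Eq. (5.4)
    a = sum_y / n - b * sum_x / n                                 # Eq. (5.3)
    return a, b


a, b = fit_line(X, Y)
print(round(a, 2), round(b, 2))
```

Rounded to two decimals, the printed coefficients agree with a = 8.61 and b = 0.71 used in Eq. (5.1a).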

Figure 5-2 Regression Line of Y on X

Then, to draw the line of best fit on the scatter diagram, only two computed Yc values are

needed. These can be easily obtained by substituting any two values of X in Eq. (5.1a). When

these are plotted on the diagram against their corresponding values of X, we get two points;

joining them by means of a straight line gives us the required line of best fit, as shown

in Figure 5-2.

Some Important Relationships

We can derive some important relationships for data analysis, involving other measures such as

$\bar{X}$, $\bar{Y}$, Sx, Sy and the correlation coefficient rxy.

Substituting $\bar{Y} - b\bar{X}$ [from Eq. (5.3)] for a in Eq. (5.1)

$$ Y_c = (\bar{Y} - b\bar{X}) + bX $$

or $$ Y_c - \bar{Y} = b(X - \bar{X}) $$ …………(5.7)

Dividing the numerator and denominator of Eq. (5.4) by N², we get

$$ b = \frac{\frac{\sum XY}{N} - \left(\frac{\sum X}{N}\right)\left(\frac{\sum Y}{N}\right)}{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2} $$

or $$ b = \frac{\frac{\sum XY}{N} - \bar{X}\bar{Y}}{S_x^2} $$

or $$ b = \frac{Cov(X, Y)}{S_x^2} $$ …………(5.8)

We know that the coefficient of correlation rxy is given by

$$ r_{xy} = \frac{Cov(X, Y)}{S_x S_y} $$

or $$ Cov(X, Y) = r_{xy} S_x S_y $$

So Eq. (5.8) becomes

$$ b = r_{xy}\frac{S_x S_y}{S_x^2} $$

$$ b = r_{xy}\frac{S_y}{S_x} $$ …………(5.9)

Substituting $r_{xy}\frac{S_y}{S_x}$ for b in Eq. (5.7), we get

$$ Y_c - \bar{Y} = r_{xy}\frac{S_y}{S_x}(X - \bar{X}) $$ …………(5.10)

These are important relationships for data analysis.

5.3.1.3 Predicting an Estimate and its Preciseness

The main objective of regression analysis is to know the nature of relationship between two

variables and to use it for predicting the most likely value of the dependent variable

corresponding to a given, known value of the independent variable. This can be done by

substituting in Eq.(5.1a) any known value of X corresponding to which the most likely

estimate of Y is to be found.

For example, the estimate of Y (i.e. Yc), corresponding to X = 15 is

Yc = 8.61 + 0.71(15)

= 8.61 + 10.65

= 19.26

It may be appreciated that an estimate of Y derived from a regression equation will not be

exactly the same as the Y value which may actually be observed. The difference between

estimated Yc values and the corresponding observed Y values will depend on the extent of

scatter of various points around the line of best fit.

The closer the various paired sample points (Y, X) are clustered around the line of best fit, the

smaller the difference between the estimated Yc and observed Y values, and vice-versa. On the

whole, the lesser the scatter of the various points around, and the lesser the vertical distance

by which these deviate from the line of best fit, the more likely it is that an estimated Yc value

is close to the corresponding observed Y value.

The estimated Yc values will coincide with the observed Y values only when all the points on the

scatter diagram fall on a straight line. If this were so, the sales for a given marketing

expenditure could be estimated with 100 percent accuracy. But such a situation is too

rare to obtain. Since some of the points must lie above and some below the straight line,

perfect prediction is practically non-existent in the case of most business and economic

situations.

This means that the estimated values of one variable based on the known values of the other

variable are always bound to differ. The smaller the difference, the greater the precision of

the estimate, and vice-versa. Accordingly, the preciseness of an estimate can be obtained only

through a measure of the magnitude of error in the estimates, called the error of estimate.

5.3.1.4 Error of Estimate

A measure of the error of estimate is given by the standard error of estimate of Y on X,

denoted as Syx and defined as

$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} $$ …………(5.11)

Syx measures the average absolute amount by which observed Y values depart from the

corresponding computed Yc values.

Computation of Syx becomes a little cumbersome when the number of observations N is large.

In such cases Syx may be computed directly by using the equation:

$$ S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N}} $$ …………(5.12)

By substituting the values of $\sum Y^2$, $\sum Y$ and $\sum XY$ from Table 5-2, and the calculated

values of a and b, we have

$$ S_{yx} = \sqrt{\frac{2080 - 8.61 \times 100 - 0.71 \times 1684}{5}} = \sqrt{\frac{2080 - 861 - 1195.64}{5}} = \sqrt{\frac{23.36}{5}} = \sqrt{4.67} = 2.16 $$
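
The two forms of Syx, Eqs. (5.11) and (5.12), can be compared numerically. The sketch below is an illustration rather than part of the original notes; the variable names are assumptions. It also shows that the shortcut form subtracts large, nearly equal quantities and is therefore sensitive to rounding in a and b: with unrounded coefficients both forms give about 2.01, whereas the rounded a = 8.61 and b = 0.71 used in the text give 2.16.

```python
from math import sqrt

# Data from Table 5-2.
X = [10, 12, 15, 20, 23]
Y = [14, 17, 23, 21, 25]
N = len(X)

sum_xy = sum(x * y for x, y in zip(X, Y))
b = (N * sum_xy - sum(X) * sum(Y)) / (N * sum(x * x for x in X) - sum(X) ** 2)
a = sum(Y) / N - b * sum(X) / N

# Definition, Eq. (5.11): root mean squared residual about the fitted line.
s_definition = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / N)

# Shortcut, Eq. (5.12), with unrounded coefficients.
s_shortcut = sqrt((sum(y * y for y in Y) - a * sum(Y) - b * sum_xy) / N)

# Shortcut with the coefficients rounded to two decimals, as in the text.
s_rounded = sqrt((2080 - 8.61 * 100 - 0.71 * 1684) / N)

print(round(s_definition, 2), round(s_shortcut, 2), round(s_rounded, 2))
```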

Interpretations of Syx

A careful observation of how the standard error of estimate is computed reveals the

following:

1. Syx is a concept statistically parallel to the standard deviation Sy . The only difference

between the two being that the standard deviation measures the dispersion around the

mean; the standard error of estimate measures the dispersion around the regression

line. Similar to the property of arithmetic mean, the sum of the deviations of different

Y values from their corresponding estimated Yc values is equal to zero. That is

$\sum (Y_i - \bar{Y}) = \sum (Y_i - Y_c) = 0$, where i = 1, 2, ..., N.

2. Syx tells us the amount by which the estimated Yc values will, on an average, deviate

from the observed Y values. Hence it is an estimate of the average amount of error in

the estimated Yc values. The actual error (the residual of Y and Yc) may, however, be

smaller or larger than the average error. Theoretically, these errors follow a normal

distribution. Thus, assuming that N ≥ 30, Yc ± 1 Syx means that 68.27% of the estimates

based on the regression equation will be within 1 Syx of the observed values. Similarly, Yc ± 2 Syx means that

95.45% of the estimates will fall within 2 Syx of the observed values.

Further, for the estimated value of sales against marketing expenditure of Rs 15

thousand being Rs 19.26 lac, one may like to know how good this estimate is. Since

Syx is estimated to be Rs 2.16 lac, it means there are about 68 chances (68.27) out of

100 that this estimate is in error by not more than Rs 2.16 lac above or below Rs

19.26 lac. That is, there are 68% chances that actual sales would fall between (19.26 -

2.16) = Rs 17.10 lac and (19.26 + 2.16) = Rs 21.42 lac.

3. Since Syx measures the closeness of the observed Y values and the estimated Yc values,

it also serves as a measure of the reliability of the estimate. Greater the closeness

between the observed and estimated values of Y, the lesser the error and,

consequently, the more reliable the estimate. And vice-versa.

4. Standard error of estimate Syx can also be seen as a measure of correlation insofar as it

expresses the degree of closeness of scatter of observed Y values about the regression

line. The closer the observed Y values scattered around the regression line, the higher

the correlation between the two variables.

A major difficulty in using Syx as a measure of correlation is that it is expressed in the

same units of measurement as the data on the dependent variable. This creates

problems in situations requiring comparison of two or more sets of data in terms of

correlation. It is mainly due to this limitation that the standard error of estimate is not

generally used as a measure of correlation. However, it does serve as the basis of

evolving the coefficient of determination, denoted as r2, which provides an alternate

method of obtaining a measure of correlation.

5.3.2 REGRESSION LINE OF X ON Y

So far we have considered the regression of Y on X, in the sense that Y was in the role of

dependent and X in the role of an independent variable. In their reverse position, such that X

is now the dependent and Y the independent variable, we fit a line of regression of X on Y.

The regression equation in this case will be

Xc = a’ + b’Y …………(5.13)

Where Xc denotes the computed values of X against the corresponding values of Y. a’ is the

X-intercept and b’ is the slope of the straight line.

Two normal equations to solve a’and b’ are

$$ \sum X = a'N + b'\sum Y $$ …………(5.14)

$$ \sum XY = a'\sum Y + b'\sum Y^2 $$ …………(5.14)

The values of a' and b' can also be obtained directly

$$ a' = \bar{X} - b'\bar{Y} $$ …………(5.15)

and

$$ b' = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - \left(\sum Y\right)^2} $$ …………(5.16)

or

$$ a' = \frac{\sum X \sum Y^2 - \sum Y \sum XY}{N\sum Y^2 - \left(\sum Y\right)^2} $$ …………(5.17)

and

$$ b' = \frac{\bar{X} - a'}{\bar{Y}} $$ …………(5.18)

$$ b' = \frac{Cov(Y, X)}{S_y^2} $$ …………(5.19)

$$ b' = r_{yx}\frac{S_x}{S_y} $$ …………(5.20)

So, the regression equation of X on Y may also be written as

$$ X_c - \bar{X} = b'(Y - \bar{Y}) $$ …………(5.21)

$$ X_c - \bar{X} = r_{yx}\frac{S_x}{S_y}(Y - \bar{Y}) $$ …………(5.22)

As before, once the values of a' and b' have been found, their substitution in Eq. (5.13) will

enable us to get an estimate of X corresponding to a known value of Y.

Standard error of estimate of X on Y, i.e. Sxy, will be

$$ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} $$ …………(5.23)

or

$$ S_{xy} = \sqrt{\frac{\sum X^2 - a'\sum X - b'\sum XY}{N}} $$ …………(5.24)

For example, if we want to estimate the marketing expenditure to achieve a sale target of Rs

40 lac, we have to obtain regression line of X on Y i. e.

Xc = a’ + b’Y

So using Eqs. (5.17) and (5.16), and substituting the values of $\sum X$, $\sum Y$, $\sum Y^2$ and $\sum XY$

from Table 5-2, we have

$$ a' = \frac{80 \times 2080 - 100 \times 1684}{5 \times 2080 - (100)^2} = \frac{166400 - 168400}{10400 - 10000} = \frac{-2000}{400} = -5.00 $$

and

$$ b' = \frac{5 \times 1684 - 80 \times 100}{5 \times 2080 - (100)^2} = \frac{8420 - 8000}{10400 - 10000} = \frac{420}{400} = 1.05 $$

Now given that a' = -5.00 and b' = 1.05, regression equation (5.13) takes the form

Xc = -5.00 + 1.05Y

So when Y = 40 (Rs lac), the corresponding X value is

Xc = -5.00 + 1.05 x 40

= -5 + 42

= 37

That is to achieve a sale target of Rs 40 lac, there is a need to spend Rs 37 thousand on

marketing.
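
The regression of X on Y uses the same arithmetic as before with the roles of the two variables exchanged. A minimal sketch, assuming the Table 5-2 data and an illustrative helper named fit_line (not from the original notes):

```python
# Data from Table 5-2: X = marketing expenditure, Y = sales.
X = [10, 12, 15, 20, 23]
Y = [14, 17, 23, 21, 25]


def fit_line(indep, dep):
    """Least-squares intercept and slope for dep = intercept + slope * indep."""
    n = len(indep)
    s_i, s_d = sum(indep), sum(dep)
    s_id = sum(i * d for i, d in zip(indep, dep))
    s_ii = sum(i * i for i in indep)
    slope = (n * s_id - s_i * s_d) / (n * s_ii - s_i ** 2)
    intercept = s_d / n - slope * s_i / n
    return intercept, slope


# Regression of X on Y: Y is now the independent variable, as in Eq. (5.13).
a_prime, b_prime = fit_line(Y, X)
x_at_40 = a_prime + b_prime * 40  # estimated expenditure for a sales target of 40

print(round(a_prime, 2), round(b_prime, 2), round(x_at_40, 2))
```

Note that the line of X on Y is fitted separately; it is not obtained by algebraically inverting the line of Y on X.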

5.4 PROPERTIES OF REGRESSION COEFFICIENTS


As explained earlier, the slope of regression line is called the regression coefficient. It tells

the effect on dependent variable if there is a unit change in the independent variable. Since

for a paired data on X and Y variables, there are two regression lines: regression line of Y on X

and regression line of X on Y, so we have two regression coefficients:

a. Regression coefficient of Y on X, denoted by byx [b in Eq.(5.1)]

b. Regression coefficient of X on Y, denoted by bxy [b’ in Eq.(5.13)]

The following are the important properties of regression coefficients that are helpful in data

analysis

1. Both the regression coefficients cannot be greater than 1 in absolute value at the

same time. Either both are below 1, or if one of them exceeds 1 the other must be below 1, so

that the square root of the product of the two regression coefficients lies within the limits

±1.

2. Coefficient of correlation is the geometric mean of the regression coefficients, i.e.

$$ r = \pm\sqrt{b \cdot b'} $$ …………(5.25)

The signs of both the regression coefficients are the same, and so the value of r will

also have the same sign.

3. The mean of both the regression coefficients is either equal to or greater than the

coefficient of correlation, i.e.

$$ \frac{b + b'}{2} \geq r $$

4. Regression coefficients are independent of change of origin but not of change of

scale. Mathematically, if given variables X and Y are transformed to new variables U

and V by change of origin and scale, i.e.

$$ U = \frac{X - A}{h} \quad \text{and} \quad V = \frac{Y - B}{k} $$

where A, B, h and k are constants, h > 0, k > 0, then

Regression coefficient of Y on X = k/h (Regression coefficient of V on U)

$$ b_{yx} = \frac{k}{h}\, b_{vu} $$

and

Regression coefficient of X on Y = h/k (Regression coefficient of U on V)

$$ b_{xy} = \frac{h}{k}\, b_{uv} $$

5. Coefficient of determination is the product of both the regression coefficients i.e.

r2 = b.b’
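
Several of these properties can be checked numerically on the data of Table 5-2. The sketch below is illustrative only (the variable names are assumptions); it computes both regression coefficients and r from the covariance and variances, then tests the geometric-mean, arithmetic-mean and "not both above 1" properties.

```python
from math import sqrt

# Data from Table 5-2.
X = [10, 12, 15, 20, 23]
Y = [14, 17, 23, 21, 25]
N = len(X)

mean_x, mean_y = sum(X) / N, sum(Y) / N
cov_xy = sum(x * y for x, y in zip(X, Y)) / N - mean_x * mean_y
var_x = sum(x * x for x in X) / N - mean_x ** 2
var_y = sum(y * y for y in Y) / N - mean_y ** 2

b_yx = cov_xy / var_x                 # regression coefficient of Y on X
b_xy = cov_xy / var_y                 # regression coefficient of X on Y
r = cov_xy / sqrt(var_x * var_y)      # coefficient of correlation

geometric_mean_ok = abs(r - sqrt(b_yx * b_xy)) < 1e-9   # property 2 (r > 0 here)
arithmetic_mean_ok = (b_yx + b_xy) / 2 >= r             # property 3
not_both_above_one = not (abs(b_yx) > 1 and abs(b_xy) > 1)  # property 1

print(geometric_mean_ok, arithmetic_mean_ok, not_both_above_one)
```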

5.5 REGRESSION LINES AND COEFFICIENT OF CORRELATION


The two regression lines indicate the nature and extent of correlation between the variables.

The two regression lines can be represented as

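$$ Y - \bar{Y} = r\frac{S_y}{S_x}(X - \bar{X}) \quad \text{and} \quad X - \bar{X} = r\frac{S_x}{S_y}(Y - \bar{Y}) $$

The regression coefficients of these lines are

$$ b = r\frac{S_y}{S_x} \quad \text{and} \quad b' = r\frac{S_x}{S_y} $$

Measured against the X-axis, the slope of the first line is b and the slope of the second line is 1/b'.

If θ is the angle between these lines, then

$$ \tan\theta = \frac{b - \frac{1}{b'}}{1 + \frac{b}{b'}} = \frac{S_x S_y}{S_x^2 + S_y^2}\left(\frac{r^2 - 1}{r}\right) $$

or $$ \theta = \tan^{-1}\left[\frac{S_x S_y}{S_x^2 + S_y^2}\left(\frac{r^2 - 1}{r}\right)\right] $$ …………(5.26)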

Figure 5-3 Regression Lines and Coefficient of Correlation
Eq. (5.26) reveals the following:

• In case of perfect positive correlation (r = +1) and in case of perfect negative

correlation (r = -1), θ = 0, so the two regression lines will coincide, i.e. we have only

one line, see (a) and (b) in Figure 5-3.

• The farther the two regression lines are from each other, the lesser will be the degree of

correlation, and the nearer the two regression lines, the more will be the degree of correlation,

see (c) and (d) in Figure 5-3.

• If the variables are independent, i.e. r = 0, the lines of regression will cut each other at

right angles. See (g) in Figure 5-3.

Note: Both the regression lines cut each other at the mean values of X and Y, i.e. at

the point ($\bar{X}$, $\bar{Y}$).
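
The behaviour of the angle in Eq. (5.26) can be explored with a small function. This is a hedged sketch: the function name is hypothetical, the absolute value is taken so that the acute angle is returned, and r = 0 is handled separately as a right angle because the formula divides by r.

```python
from math import atan, degrees, radians, tan


def angle_between_regression_lines(r, s_x, s_y):
    """Acute angle in degrees between the two regression lines, from Eq. (5.26)."""
    if r == 0:
        return 90.0  # lines are perpendicular when there is no correlation
    tan_theta = abs((s_x * s_y) / (s_x ** 2 + s_y ** 2) * (r ** 2 - 1) / r)
    return degrees(atan(tan_theta))


# Illustrative values: Sx = 3, Sy = 4 with different degrees of correlation.
for r in (1.0, 0.9, 0.6, 0.3, 0.0, -1.0):
    print(r, round(angle_between_regression_lines(r, 3, 4), 2))
```

As |r| falls from 1 towards 0 the printed angle widens from 0 to 90 degrees, which matches the three observations listed above.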

5.6 COEFFICIENT OF DETERMINATION


Coefficient of determination gives the percentage variation in the dependent variable that is

accounted for by the independent variable. In other words, the coefficient of determination

gives the ratio of the explained variance to the total variance. The coefficient of

determination is given by the square of the correlation coefficient, i.e. r2. Thus,

Coefficient of determination

$$ r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}} $$

$$ r^2 = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2} $$ …………(5.27)

We can calculate another coefficient K², known as the coefficient of non-determination, which

is the ratio of unexplained variance to the total variance.

$$ K^2 = \frac{\text{Unexplained Variance}}{\text{Total Variance}} $$

$$ K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} $$ …………(5.28)

$$ K^2 = 1 - \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - r^2 $$ …………(5.29)

The square root of the coefficient of non-determination, i.e. K, gives the coefficient of

alienation

$$ K = \pm\sqrt{1 - r^2} $$ …………(5.30)

Relation Between Syx and r:

A simple algebraic operation on Eq. (5.30) brings out some interesting points about the

relation between Syx and r. Thus, since

$$ \sum (Y - Y_c)^2 = N S_{yx}^2 \quad \text{and} \quad \sum (Y - \bar{Y})^2 = N S_y^2 $$

So we have coefficient of non-determination

$$ K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} = \frac{N S_{yx}^2}{N S_y^2} = \frac{S_{yx}^2}{S_y^2} $$

So $$ 1 - r^2 = \frac{S_{yx}^2}{S_y^2} $$

or $$ \frac{S_{yx}}{S_y} = \sqrt{1 - r^2} $$ …………(5.31)

If the coefficient of correlation, r, is defined as the square root of the coefficient of determination

$$ r = \sqrt{r^2} $$

$$ r^2 = 1 - \frac{S_{yx}^2}{S_y^2} $$

$$ r = \sqrt{1 - \frac{S_{yx}^2}{S_y^2}} $$ …………(5.32)

On carefully observing Eq. (5.32), it will be noticed that the ratio Syx/Sy will be large if the

coefficient of determination is small, and it will be small when the coefficient of

determination is large. Thus

• if r2 = r = 0, Syx/Sy = 1, which means that Syx = Sy.

• if r2 = r = 1, Syx/Sy = 0, which means that Syx = 0.

• when r = 0.865, Syx = 0.502 Sy, which means that Syx is 50.2% of Sy.

Eq. (5.32) also implies that Syx is generally less than Sy. The two can at the most be equal, but

only in the extreme situation when r = 0.
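
The decomposition of total variation into explained and unexplained parts, and the relation Syx/Sy = sqrt(1 - r²), can be verified on the Table 5-2 data. The following is a sketch under the assumption that unrounded least-squares coefficients are used; the variable names are illustrative.

```python
from math import sqrt

# Data from Table 5-2.
X = [10, 12, 15, 20, 23]
Y = [14, 17, 23, 21, 25]
N = len(X)

sum_xy = sum(x * y for x, y in zip(X, Y))
b = (N * sum_xy - sum(X) * sum(Y)) / (N * sum(x * x for x in X) - sum(X) ** 2)
a = sum(Y) / N - b * sum(X) / N
fitted = [a + b * x for x in X]
mean_y = sum(Y) / N

total = sum((y - mean_y) ** 2 for y in Y)                    # total variation
explained = sum((yc - mean_y) ** 2 for yc in fitted)         # Eq. (5.27) numerator
unexplained = sum((y - yc) ** 2 for y, yc in zip(Y, fitted))  # Eq. (5.28) numerator

r2 = explained / total    # coefficient of determination
k2 = unexplained / total  # coefficient of non-determination
s_y = sqrt(total / N)
s_yx = sqrt(unexplained / N)

print(round(r2, 4), round(k2, 4), round(s_yx / s_y, 4))
```

The ratio printed last equals sqrt(1 - r²), as Eq. (5.31) requires, and r² + K² adds to one.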

Interpretations of r2:

1. Even though the coefficient of determination, whose square root measures the degree

of correlation, is based on Syx, it is expressed as 1 − (Syx²/Sy²). As it is a dimensionless

pure number, the unit in which Syx is measured becomes irrelevant. This facilitates

comparison between two sets of data in terms of their coefficient of determination

r2 (or the coefficient of correlation r). This was not possible in terms of Syx as the

units of measurement could be different.

2. The value of r2 can range between 0 and 1. When r2 = 1, all the points on the scatter

diagram fall on the regression line and the entire variations are explained by the

straight line. On the other hand, when r2 = 0, the regression line explains none of the

variation in Y, meaning thereby that there is no linear relationship between the

two variables. However, being always non-negative, the coefficient of determination does

not tell us about the direction of the relationship (whether it is positive or negative)

between the two variables.

3. When r2 = 0.7455 (or any other value), 74.55% of the total variations in sales are

explained by the marketing expenditure used. What remains is the coefficient of non-

determination K2 (= 1 - r2) = 0.2545. It means 25.45% of the total variations remain

unexplained, which are due to factors other than the changes in the marketing

expenditure.

4. r2 provides the necessary link between regression and correlation which are the two

related aspects of a single problem of the analysis of relationship between two

variables. Unlike regression, correlation quantifies the degrees of relationship

between the variables under study, without making a distinction between the

dependent and independent ones. Nor does it, therefore, help in predicting the value of

one variable for a given value of the other.

5. The coefficient of correlation overstates the degree of relationship and its meaning is

not as explicit as that of the coefficient of determination. The coefficient of

correlation r = 0.865, as compared to r2 = 0.7455, indicates a higher degree of

correlation between sales and marketing expenditure. Therefore, the coefficient of

determination is a more objective measure of the degree of relationship.

6. The sum of r and K never adds to one, unless one of the two is zero. That is, r + K can

be unity either when there is no correlation or when there is perfect correlation.

Except in these two extreme situations, (r + K) > 1.

5.7 CORRELATION ANALYSIS VERSUS REGRESSION ANALYSIS


Correlation and Regression are the two related aspects of a single problem of the analysis of

the relationship between the variables. If we have information on more than one variable, we

might be interested in seeing if there is any connection - any association - between them. If

we found such an association, we might again be interested in predicting the value of one

variable for the given and known values of other variable(s).

1. Correlation literally means the relationship between two or more variables that vary in

sympathy so that the movements in one tend to be accompanied by the corresponding

movements in the other(s). On the other hand, regression means stepping back or

returning to the average value and is a mathematical measure expressing the average

relationship between the two variables.

2. Correlation coefficient rxy between two variables X and Y is a measure of the direction

and degree of the linear relationship between two variables that is mutual. It is

symmetric, i.e., ryx = rxy and it is immaterial which of X and Y is dependent variable

and which is independent variable.

Regression analysis aims at establishing the functional relationship between the two

(or more) variables under study and then using this relationship to predict or estimate

the value of the dependent variable for any given value of the independent variable(s).

It also reflects upon the nature of the variables, i.e., which is the dependent variable and

which is the independent variable. Regression coefficients are not symmetric in X and Y,

i.e., byx ≠ bxy.

3. Correlation need not imply cause and effect relationship between the variable under

study. However, regression analysis clearly indicates the cause and effect relationship

between the variables. The variable corresponding to cause is taken as independent

variable and the variable corresponding to effect is taken as dependent variable.

4. Correlation coefficient rxy is a relative measure of the linear relationship between X

and Y and is independent of the units of measurement. It is a pure number lying

between ±1.

On the other hand, the regression coefficients, byx and bxy are absolute measures

representing the change in the value of the variable Y (or X), for a unit change in the

value of the variable X (or Y). Once the functional form of regression curve is known;

by substituting the value of the independent variable we can obtain the value of the

dependent variable and this value will be in the units of measurement of the

dependent variable.

5. There may be non-sense correlation between two variables that is due to pure chance

and has no practical relevance, e.g., the correlation, between the size of shoe and the

intelligence of a group of individuals. There is no such thing like non-sense

regression.

5.8 SOLVED PROBLEMS


Example 5-1
The following table shows the number of motor registrations in a certain territory for

a term of 5 years and the sale of motor tyres by a firm in that territory for the same

period.

Year Motor Registrations No. of Tyres Sold


1 600 1,250
2 630 1,100
3 720 1,300
4 750 1,350
5 800 1,500
Find the regression equation to estimate the sale of tyres when the motor registration

is known. Estimate sale of tyres when registration is 850.

Solution: Here the dependent variable is number of tyres; dependent on motor registrations.

Hence we put motor registrations as X and sales of tyres as Y and we have to establish the

regression line of Y on X.

Calculations of values for the regression equation are given below:

X       Y        dx = X − X̄    dy = Y − Ȳ    dx²       dx dy
600     1,250    -100          -50           10,000    5,000
630     1,100    -70           -200          4,900     14,000
720     1,300    20            0             400       0
750     1,350    50            50            2,500     2,500
800     1,500    100           200           10,000    20,000

∑X = 3,500   ∑Y = 6,500   ∑dx = 0   ∑dy = 0   ∑dx² = 27,800   ∑dx dy = 41,500

$$ \bar{X} = \frac{\sum X}{N} = \frac{3,500}{5} = 700 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{6,500}{5} = 1,300 $$

byx = Regression coefficient of Y on X

$$ b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{41,500}{27,800} = 1.4928 $$

Now we can use these values for the regression line

$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$

or Y – 1300 = 1.4928 (X - 700)

Y = 1.4928 X + 255.04

When X = 850, the value of Y can be calculated from the above equation, by putting X = 850

in the equation.

Y = 1.4928 x 850 + 255. 04

= 1523.92

= 1,524 Tyres
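
The deviation-from-mean method used in this solution can be written out as a short script. This is an illustrative sketch (variable names are assumptions), using the registration and tyre-sales figures given in the example.

```python
# Example 5-1 data: X = motor registrations, Y = tyres sold.
X = [600, 630, 720, 750, 800]
Y = [1250, 1100, 1300, 1350, 1500]
N = len(X)

mean_x, mean_y = sum(X) / N, sum(Y) / N
dx = [x - mean_x for x in X]  # deviations of X from its mean
dy = [y - mean_y for y in Y]  # deviations of Y from its mean

b_yx = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)
intercept = mean_y - b_yx * mean_x

estimate_850 = intercept + b_yx * 850  # estimated tyre sales for 850 registrations
print(round(b_yx, 4), round(intercept, 2), round(estimate_850))
```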

Example 5-2
A panel of Judges A and B graded seven debaters and independently awarded the

following marks:

Debater   Marks by A   Marks by B
1         40           32
2         34           39
3         28           26
4         30           30
5         44           38
6         38           34
7         31           28

An eighth debater was awarded 36 marks by Judge A, while Judge B was not present. If

Judge B were also present, how many marks would you expect him to award to the eighth

debater, assuming that the same degree of relationship exists in their judgement?

Solution: Let us use marks from Judge A as X and those from Judge B as Y. Now we have to

work out the regression line of Y on X from the calculation below:

Debater   X     Y     U = X-35   V = Y-30   U²    V²    UV
1         40    32    5          2          25    4     10
2         34    39    -1         9          1     81    -9
3         28    26    -7         -4         49    16    28
4         30    30    -5         0          25    0     0
5         44    38    9          8          81    64    72
6         38    34    3          4          9     16    12
7         31    28    -4         -2         16    4     8

N = 7   ∑U = 0   ∑V = 17   ∑U² = 206   ∑V² = 185   ∑UV = 121

With A = 35 and B = 30 as the assumed origins,

$$ \bar{X} = A + \frac{\sum U}{N} = 35 + \frac{0}{7} = 35 \quad \text{and} \quad \bar{Y} = B + \frac{\sum V}{N} = 30 + \frac{17}{7} = 32.43 $$

$$ b_{yx} = b_{vu} = \frac{N\sum UV - \sum U \sum V}{N\sum U^2 - \left(\sum U\right)^2} = \frac{7 \times 121 - 0 \times 17}{7 \times 206 - 0} = 0.587 $$

Hence the regression equation can be written as

$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$

Y – 32.43 = 0.587 (X-35)

or Y = 0.587X + 11.87

When X = 36 (awarded by Judge A)

Y = 0.587 x 36 + 11.87

= 33

Thus if Judge B were present, he would have awarded 33 marks to the eighth debater.

Example 5-3
For some bivariate data, the following results were obtained.

Mean value of variable X = 53.2

Mean value of variable Y = 27.9

Regression coefficient of Y on X = - 1.5

Regression coefficient of X on Y = - 0.2

What is the most likely value of Y, when X = 60?

What is the coefficient of correlation between X and Y?

Solution: Given data indicate

$\bar{X}$ = 53.2, $\bar{Y}$ = 27.9

byx = -1.5, bxy = -0.2

To obtain value of Y for X = 60, we establish the regression line of Y on X,

$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$

Y – 27.9 = -1.5 (X-53.2)

or Y = -1.5X + 107.7

Putting value of X = 60, we obtain

Y = -1.5 x 60 + 107.7

= 17.7

Coefficient of correlation between X and Y is given by the geometric mean of byx and bxy

r2 = byx bxy

= (-1.5) x (–0.2)

= 0.3

So r = ± √0.3 = ± 0.5477

Since both the regression coefficients are negative, we assign negative value to the

correlation coefficient

r = - 0.5477

Example 5-4
Write regression equations of X on Y and of Y on X for the following data

X: 45 48 50 55 65 70 75 72 80 85
Y: 25 30 35 30 40 50 45 55 60 65

Solution: We prepare the table for working out the values for the regression lines.

X     Y     U = X-65   V = Y-45   U²     UV     V²
45    25    -20        -20        400    400    400
48    30    -17        -15        289    255    225
50    35    -15        -10        225    150    100
55    30    -10        -15        100    150    225
65    40    0          -5         0      0      25
70    50    5          5          25     25     25
75    45    10         0          100    0      0
72    55    7          10         49     70     100
80    60    15         15         225    225    225
85    65    20         20         400    400    400

∑X = 645   ∑Y = 435   ∑U = -5   ∑V = -15   ∑U² = 1813   ∑UV = 1675   ∑V² = 1725

We have,

$$ \bar{X} = \frac{\sum X}{N} = \frac{645}{10} = 64.5 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{435}{10} = 43.5 $$

$$ b_{yx} = \frac{N\sum UV - \sum U \sum V}{N\sum U^2 - \left(\sum U\right)^2} = \frac{10 \times 1675 - (-5) \times (-15)}{10 \times 1813 - (-5)^2} = \frac{16750 - 75}{18130 - 25} = \frac{16675}{18105} = 0.921 $$

Regression equation of Y on X is

$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$

Y – 43.5 = 0.921 (X-64.5)

or Y = 0.921X – 15.90

Similarly bxy can be calculated as

$$ b_{xy} = \frac{N\sum UV - \sum U \sum V}{N\sum V^2 - \left(\sum V\right)^2} = \frac{10 \times 1675 - (-5) \times (-15)}{10 \times 1725 - (-15)^2} = \frac{16750 - 75}{17250 - 225} = \frac{16675}{17025} = 0.979 $$

Regression equation of X on Y will be

$$ X - \bar{X} = b_{xy}(Y - \bar{Y}) $$

X – 64.5 = 0.979 (Y-43.5)

or X = 0.979Y + 21.91
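
Because hand tabulation with an assumed origin is error-prone, the two regression coefficients for this example can be recomputed directly from the raw X and Y values as a cross-check. The sketch below is illustrative and not part of the original solution; it relies on the property that regression coefficients are unaffected by a change of origin, so the shifted values U = X - 65 and V = Y - 45 are used exactly as in the table.

```python
# Example 5-4 data.
X = [45, 48, 50, 55, 65, 70, 75, 72, 80, 85]
Y = [25, 30, 35, 30, 40, 50, 45, 55, 60, 65]
N = len(X)

# Shift the origin as in the worked table (A = 65, B = 45).
U = [x - 65 for x in X]
V = [y - 45 for y in Y]

sum_u, sum_v = sum(U), sum(V)
sum_uv = sum(u * v for u, v in zip(U, V))
sum_u2 = sum(u * u for u in U)
sum_v2 = sum(v * v for v in V)

b_yx = (N * sum_uv - sum_u * sum_v) / (N * sum_u2 - sum_u ** 2)
b_xy = (N * sum_uv - sum_u * sum_v) / (N * sum_v2 - sum_v ** 2)

mean_x, mean_y = sum(X) / N, sum(Y) / N
intercept_y_on_x = mean_y - b_yx * mean_x  # constant term of the Y-on-X line
intercept_x_on_y = mean_x - b_xy * mean_y  # constant term of the X-on-Y line

print(sum_u, sum_v, sum_u2, sum_uv, sum_v2)
print(round(b_yx, 3), round(b_xy, 3))
```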

Example 5-5
The lines of regression of a bivariate population are

8X – 10Y + 66 = 0

40X – 18Y = 214

The variance of X is 9. Find

(i) The mean value of X and Y

(ii) Correlation coefficient between X and Y

(iii) Standard deviation of Y

Solution: The regression lines given are

8X – 10Y + 66 = 0

40X – 18Y = 214

(i) Since both the lines of regression pass through the mean values, the point ($\bar{X}$, $\bar{Y}$) will satisfy

both the equations.

Hence these equations can be written as

$$ 8\bar{X} - 10\bar{Y} + 66 = 0 $$

$$ 40\bar{X} - 18\bar{Y} - 214 = 0 $$

Solving these two equations for $\bar{X}$ and $\bar{Y}$, we obtain

$\bar{X}$ = 13 and $\bar{Y}$ = 17

(ii) For correlation coefficient between X and Y, we have to calculate the values of byx and

bxy

Rewriting the equations

10Y = 8X + 66

byx = + 8/10 = + 4/5

Similarly, 40X = 18Y + 214

bxy = 18/40 = 9/20

By these values, we can now work out the correlation coefficient.

r2 = byx . bxy

= 4/5 x 9/20 = 9/25

So r = ± √(9/25)

= ± 0.6

Both the values of the regression coefficients being positive, we have to consider only the

positive value of the correlation coefficient. Hence r = 0.6

(iii) We have been given variance of X i.e Sx2 = 9

Sx = ± 3

We consider Sx = 3 as SD is always positive

Since byx = r Sy /Sx

Substituting the values of byx, r and Sx we obtain,

Sy = 4/5 x 3/0.6

= 4
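
The three parts of this solution can be retraced in code. This is a sketch under the same assumption made in the solution, namely that the first equation is the line of Y on X and the second the line of X on Y; the coefficient names a1, b1, c1 and so on are illustrative.

```python
from math import copysign, sqrt

# Regression lines written as a*X + b*Y = c.
a1, b1, c1 = 8, -10, -66    # 8X - 10Y + 66 = 0   (assumed Y on X)
a2, b2, c2 = 40, -18, 214   # 40X - 18Y = 214     (assumed X on Y)

# (i) Both lines pass through the means: solve the 2x2 system by Cramer's rule.
det = a1 * b2 - a2 * b1
mean_x = (c1 * b2 - c2 * b1) / det
mean_y = (a1 * c2 - a2 * c1) / det

# (ii) Regression coefficients read off the two lines, then r.
b_yx = -a1 / b1   # slope of Y on X: Y = (8/10)X + 6.6
b_xy = -b2 / a2   # slope of X on Y: X = (18/40)Y + 5.35
r = copysign(sqrt(b_yx * b_xy), b_yx)  # r takes the common sign of the coefficients

# (iii) Standard deviation of Y from b_yx = r * Sy / Sx, with Var(X) = 9.
s_x = sqrt(9)
s_y = b_yx * s_x / r

print(mean_x, mean_y, round(r, 4), round(s_y, 4))
```

If the roles of the two lines were swapped, the product of the coefficients would exceed 1, which is how the assignment can be checked.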

Example 5-6
The height of a child increases at a rate given in the table below. Fit the straight line

using the method of least-square and calculate the average increase and the standard

error of estimate.

Month: 1 2 3 4 5 6 7 8 9 10
Height: 52.5 58.7 65 70.2 75.4 81.1 87.2 95.5 102.2 108.4

Solution: For Regression calculations, we draw the following table

Month (X) Height (Y) X2 XY


1 52.5 1 52.5
2 58.7 4 117.4
3 65.0 9 195.0
4 70.2 16 280.8
5 75.4 25 377.0
6 81.1 36 486.6
7 87.2 49 610.4
8 95.5 64 764.0
9 102.2 81 919.8
10 108.4 100 1084.0

∑X =55 ∑ Y =796.2 ∑X 2
= 385 ∑ XY = 4887.5

Considering the regression line as Y = a + bX, we can obtain the values of a and b from the

above values.

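$$ a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - \left(\sum X\right)^2} = \frac{796.2 \times 385 - 55 \times 4887.5}{10 \times 385 - 55 \times 55} = 45.73 $$

$$ b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2} = \frac{10 \times 4887.5 - 55 \times 796.2}{10 \times 385 - 55 \times 55} = 6.16 $$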

Hence the regression line can be written as

Y = 45.73 + 6.16X

For standard error of estimation, we note the calculated values of the variable against the

observed values,

When X = 1, Y1 = 45.73 + 6.16 = 51.89

for X = 2, Y2 = 45.73 + 6.16 x 2 = 58.05

Other values for X = 3 to X = 10 are calculated and are tabulated as follows:

Month (X) Height (Y) Yi Y-Yi (Y-Yi) 2


1 52.5 51.89 0.61 0.372
2 58.7 58.05 0.65 0.423
3 65.0 64.21 0.79 0.624
4 70.2 70.37 -0.17 0.029
5 75.4 76.53 -1.13 1.277
6 81.1 82.69 -1.59 2.528
7 87.2 88.85 -1.65 2.723
8 95.5 95.01 0.49 0.240
9 102.2 101.17 1.03 1.061
10 108.4 107.33 1.07 1.145

∑(Y − Yi)² = 10.421

Standard error of estimation

$$ S_{yx} = \sqrt{\frac{1}{N}\sum (Y - Y_i)^2} = \sqrt{\frac{10.421}{10}} = 1.02 $$
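
The whole of this example, from the fitted line to the standard error of estimate, can be reproduced as follows. The sketch is illustrative (names are assumptions) and uses unrounded coefficients, so intermediate residuals differ slightly from the tabulated ones, although the results agree to two decimals.

```python
from math import sqrt

# Example 5-6 data: month (X) and height (Y).
X = list(range(1, 11))
Y = [52.5, 58.7, 65.0, 70.2, 75.4, 81.1, 87.2, 95.5, 102.2, 108.4]
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

denominator = N * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # intercept, Eq. (5.5)
b = (N * sum_xy - sum_x * sum_y) / denominator       # average monthly increase, Eq. (5.4)

residuals = [y - (a + b * x) for x, y in zip(X, Y)]
s_yx = sqrt(sum(e * e for e in residuals) / N)       # standard error of estimate

print(round(a, 2), round(b, 2), round(s_yx, 2))
```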
Example 5-7
Given X = 4Y+5 and Y = kX + 4 are the lines of regression of X on Y and of Y on X

respectively. If k is positive, prove that it cannot exceed ¼.

If k = 1/16, find the means of the two variables and coefficient of correlation between them.

Solution: Line X = 4Y + 5 is regression line of X on Y

So bxy = 4

Similarly from regression line of Y on X , Y = kX + 4,

We get byx = k

Now

r2 = bxy. byx

= 4k

Since 0 ≤ r 2 ≤ 1, we obtain 0 ≤ 4k ≤ 1,

1
Or 0≤k ≤ ,
4

1
Now for k = ,
16

1 1
r 2 = 4x =
16 4

r=+½

= ½ since byx and byx are positive

163
1
When k = , the regression line of Y on X becomes
16

1
Y= X+4
16

Or X – 16Y + 64 = 0

Since the lines of regression pass through the mean values of the variables, the means X̄ and Ȳ satisfy both equations:

X̄ - 4Ȳ - 5 = 0

X̄ - 16Ȳ + 64 = 0

Solving these two equations, we get

X̄ = 28 and Ȳ = 5.75
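
The steps of Example 5-7 can be sketched in Python as follows. This is an illustrative check rather than part of the original solution: the function name `solve_example` and its parameters are hypothetical, and exact fractions are used so that the means come out without rounding.

```python
import math
from fractions import Fraction


def solve_example(k, b_xy=Fraction(4), c_x=Fraction(5), c_y=Fraction(4)):
    """Lines X = b_xy*Y + c_x (X on Y) and Y = k*X + c_y (Y on X)."""
    r_squared = b_xy * k
    # r^2 must lie in [0, 1]; r^2 = 1 is excluded here because the
    # substitution below would divide by zero
    if not 0 <= r_squared < 1:
        raise ValueError("b_xy * b_yx must lie in [0, 1)")
    # Both regression coefficients are positive, so r takes the positive root
    r = math.sqrt(r_squared)
    # Substitute X = b_xy*Y + c_x into Y = k*X + c_y and solve for the mean of Y
    y_mean = (k * c_x + c_y) / (1 - b_xy * k)
    x_mean = b_xy * y_mean + c_x
    return r, x_mean, y_mean


r, x_mean, y_mean = solve_example(Fraction(1, 16))
print(r, float(x_mean), float(y_mean))  # prints: 0.5 28.0 5.75
```

Calling `solve_example(Fraction(1, 2))` raises `ValueError`, which mirrors the result that a positive k cannot exceed ¼.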

Example 5-8
A firm knows from its past experience that its monthly average expenses (X) on

advertisement are Rs 25,000 with standard deviation of Rs 25.25. Similarly, its average

monthly product sales (Y) have been Rs 45,000 with standard deviation of Rs 50.50. Given

this information and also the coefficient of correlation between sales and advertisement

expenditure as 0.75, estimate

(i) the most appropriate value of sales against an advertisement expenditure of Rs

50,000

(ii) the most appropriate advertisement expenditure for achieving a sales target of

Rs 80,000

Solution: Given the following

X̄ = Rs 25,000    Sx = Rs 25.25

Ȳ = Rs 45,000    Sy = Rs 50.50

r = 0.75

(i) Using the equation $Y_c - \bar{Y} = r \frac{S_y}{S_x} (X - \bar{X})$, the most appropriate value of sales Yc for an advertisement expenditure X = Rs 50,000 is

$$Y_c - 45{,}000 = 0.75 \times \frac{50.50}{25.25} \times (50{,}000 - 25{,}000)$$

$$Y_c = 45{,}000 + 37{,}500 = \text{Rs } 82{,}500$$

(ii) Using the equation $X_c - \bar{X} = r \frac{S_x}{S_y} (Y - \bar{Y})$, the most appropriate value of advertisement expenditure Xc for achieving a sales target Y = Rs 80,000 is

$$X_c - 25{,}000 = 0.75 \times \frac{25.25}{50.50} \times (80{,}000 - 45{,}000)$$

$$X_c = 13{,}125 + 25{,}000 = \text{Rs } 38{,}125$$
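
Both estimates in Example 5-8 use the same pattern: start from the mean and add r times the ratio of standard deviations times the deviation of the given value from its own mean. A minimal Python sketch of that pattern follows; the function names `estimate_y` and `estimate_x` are illustrative and do not come from the notes.

```python
def estimate_y(x, mean_x, mean_y, s_x, s_y, r):
    """Regression of Y on X: Yc = mean_y + r * (s_y / s_x) * (x - mean_x)."""
    return mean_y + r * (s_y / s_x) * (x - mean_x)


def estimate_x(y, mean_x, mean_y, s_x, s_y, r):
    """Regression of X on Y: Xc = mean_x + r * (s_x / s_y) * (y - mean_y)."""
    return mean_x + r * (s_x / s_y) * (y - mean_y)


# Values given in Example 5-8
params = dict(mean_x=25_000, mean_y=45_000, s_x=25.25, s_y=50.50, r=0.75)

sales = estimate_y(50_000, **params)        # sales for Rs 50,000 of advertising
advertising = estimate_x(80_000, **params)  # advertising for Rs 80,000 of sales
print(sales, advertising)  # prints: 82500.0 38125.0
```

Note that the two lines are not inverses of each other: feeding the estimated sales of Rs 82,500 back into `estimate_x` does not return Rs 50,000 unless r = ±1.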

1.8 SELF-ASSESSMENT QUESTIONS


1. Explain clearly the concept of Regression. Explain with suitable examples its role in

dealing with business problems.

2. What do you understand by linear regression?

3. What is meant by ‘regression’? Why should there be, in general, two lines of regression for each bivariate distribution? How are the two regression lines useful in studying correlation between two variables?

4. Why is the regression line known as line of best fit?

5. Write short note on

(i) Regression Coefficients

(ii) Regression Equations

(iii) Standard Error of Estimate

(iv) Coefficient of Determination

(v) Coefficient of Non-determination

6. Distinguish clearly between correlation and regression as concepts used in statistical analysis.

7. Fit a least-square line to the following data:

(i) Using X as independent variable

(ii) Using X as dependent variable

X : 1 3 4 8 9 11 14
Y : 1 2 4 5 7 8 9

Hence obtain

c) The regression coefficients of Y on X and of X on Y

d) X̄ and Ȳ

e) Coefficient of correlation between X and Y

f) What is the estimated value of Y when X = 10 and of X when Y = 5?

8. What are regression coefficients? Show that r2 = byx. bxy where the symbols have their

usual meanings. What can you say about the angle between the regression lines when

(i) r = 0, (ii) r = 1 (iii) r increases from 0 to 1?

9. Obtain the equation of the line of regression of Y on X from the following data.

X : 12 18 24 30 36 42 48
Y : 5.27 5.68 6.25 7.21 8.02 8.71 8.42

Estimate the most probable value of Y, when X = 40.

10. The following table gives the ages and blood pressure of 9 women.

Age (X)            : 56  42  36  47  49  42  60  72  63

Blood Pressure (Y) : 147 125 118 128 145 140 155 160 149

Find the correlation coefficient between X and Y.

(i) Determine the least square regression equation of Y on X.

(ii) Estimate the blood pressure of a woman whose age is 45 years.

11. Given the following results for the height (X) and weight (Y) in appropriate units of

1,000 students:

X̄ = 68, Ȳ = 150, Sx = 2.5, Sy = 20 and r = 0.6.

Obtain the equations of the two lines of regression. Estimate the height of a student A

who weighs 200 units and also estimate the weight of the student B whose height is

60 units.

12. From the following data, find out the probable yield when the rainfall is 29”.

Rainfall Yield
Mean 25” 40 units per hectare
Standard Deviation 3” 6 units per hectare

Correlation coefficient between rainfall and production = 0.8.

13. A study of wheat prices at two cities yielded the following data:

                     City A      City B

Average Price        Rs 2.463    Rs 2.797

Standard Deviation   Rs 0.326    Rs 0.207

Coefficient of correlation r is 0.774. Estimate from the above data the most likely price of wheat

(i) at City A corresponding to the price of Rs 2.334 at City B

(ii) at City B corresponding to the price of Rs 3.052 at City A

14. Find out the regression equation showing the regression of capacity utilisation on

production from the following data:

Average Standard Deviation


Production (in lakh units) 35.6 10.5
Capacity Utilisation (in percentage) 84.8 8.5

r = 0.62

Estimate the production, when capacity utilisation is 70%.

15. The following table shows the mean and standard deviation of the prices of two shares

in a stock exchange.

Share Mean (in Rs) Standard Deviation (in Rs)


A Ltd. 39.5 10.8
B Ltd. 47.5 16.0
If the coefficient of correlation between the prices of two shares is 0.42, find the most

likely price of share A corresponding to a price of Rs 55, observed in the case of share

B.

16. Find out the regression coefficients of Y on X and of X on Y on the basis of following

data:

∑X = 50, X̄ = 5, ∑Y = 60, Ȳ = 6, ∑XY = 350

Variance of X = 4, Variance of Y = 9

17. Find the regression equation of X and Y and the coefficient of correlation from the

following data:

∑X = 60, ∑Y = 40, ∑XY = 1150, ∑X² = 4160, ∑Y² = 1720 and N = 10.

18. By using the following data, find out the two lines of regression and from them compute Karl Pearson’s coefficient of correlation.

∑X = 250, ∑Y = 300, ∑XY = 7900, ∑X² = 6500, ∑Y² = 10000, N = 10

19. The equations of two regression lines between two variables are expressed as

2X – 3Y = 0 and 4Y – 5X – 8 = 0.

(i) Identify which of the two can be called the regression line of Y on X and which that of X on Y.

(ii) Find X̄ and Ȳ and the correlation coefficient r from the equations.

20. If the two lines of regression are

4X - 5Y + 30 = 0 and 20X – 9Y – 107 = 0

Which of these is the line of regression of X on Y? Find rxy and Sy when Sx = 3.

(iii) Only through interpretation can the researcher better appreciate why his findings are what
they are and make others understand the real significance of his research findings.
(iv) The interpretation of the findings of an exploratory research study often results in hypotheses
for experimental research, and as such interpretation is involved in the transition from
exploratory to experimental research. Since an exploratory study does not have a hypothesis
to start with, the findings of such a study have to be interpreted on a post-factum basis, in
which case the interpretation is technically described as ‘post factum’ interpretation.

SIGNIFICANCE OF REPORT WRITING

The research report is considered a major component of the research study, for the research task remains
incomplete till the report has been presented and/or written. As a matter of fact, even the most
brilliant hypothesis, the most well-designed and well-conducted research study, and the most striking
generalizations and findings are of little value unless they are effectively communicated to others.
The purpose of research is not well served unless the findings are made known to others. Research
results must invariably enter the general store of knowledge. All this explains the significance of
writing the research report. There are people who do not consider the writing of the report an integral part of
the research process. But the general opinion is in favour of treating the presentation of research
results or the writing of the report as part and parcel of the research project. Writing the report is the last
step in a research study and requires a set of skills somewhat different from those called for in
the earlier stages of research. This task should be accomplished by the researcher with
utmost care; he may seek the assistance and guidance of experts for the purpose.

DIFFERENT STEPS IN WRITING REPORT


Research reports are the product of slow, painstaking, accurate inductive work. The usual steps
involved in writing a report are: (a) logical analysis of the subject-matter; (b) preparation of the final
outline; (c) preparation of the rough draft; (d) rewriting and polishing; (e) preparation of the final
bibliography; and (f) writing the final draft. Though all these steps are self-explanatory, a brief
mention of each one of them will be appropriate for better understanding.
Logical analysis of the subject matter: It is the first step, which is primarily concerned with the
development of a subject. There are two ways in which to develop a subject: (a) logically and
(b) chronologically. The logical development is made on the basis of mental connections and
associations between one thing and another by means of analysis. Logical treatment often consists
in developing the material from the simplest possible to the most complex structures. Chronological
development is based on a connection or sequence in time or occurrence. The directions for doing or
making something usually follow the chronological order.
Preparation of the final outline: It is the next step in writing the research report. “Outlines are the
framework upon which long written works are constructed. They are an aid to the logical organisation
of the material and a reminder of the points to be stressed in the report.”3
Preparation of the rough draft: This follows the logical analysis of the subject and the preparation
of the final outline. Such a step is of utmost importance for the researcher now sits to write down
what he has done in the context of his research study. He will write down the procedure adopted by
him in collecting the material for his study along with various limitations faced by him, the technique
of analysis adopted by him, the broad findings and generalizations and the various suggestions he
wants to offer regarding the problem concerned.
Rewriting and polishing of the rough draft: This step happens to be the most difficult part of all
formal writing. Usually this step requires more time than the writing of the rough draft. Careful
revision makes the difference between a mediocre and a good piece of writing. While rewriting and
polishing, one should check the report for weaknesses in logical development or presentation. The
researcher should also “see whether or not the material, as it is presented, has unity and cohesion;
does the report stand upright and firm and exhibit a definite pattern, like a marble arch? Or does it
resemble an old wall of moldering cement and loose brick.”4 In addition, the researcher should
check whether he has been consistent throughout his rough draft. He should also check the
mechanics of writing: grammar, spelling and usage.
Preparation of the final bibliography: Next in order comes the task of the preparation of the final
bibliography. The bibliography, which is generally appended to the research report, is a list of books
in some way pertinent to the research which has been done. It should contain all those works which
the researcher has consulted. The bibliography should be arranged alphabetically and may be divided
into two parts; the first part may contain the names of books and pamphlets, and the second part may
contain the names of magazine and newspaper articles. Generally, this pattern of bibliography is
considered convenient and satisfactory from the point of view of reader, though it is not the only way
of presenting bibliography. The entries in bibliography should be made adopting the following order:
For books and pamphlets the order may be as under:
1. Name of author, last name first.
2. Title, underlined to indicate italics.
3. Place, publisher, and date of publication.
4. Number of volumes.
Example
Kothari, C.R., Quantitative Techniques, New Delhi, Vikas Publishing House Pvt. Ltd., 1978.
For magazines and newspapers the order may be as under:
1. Name of the author, last name first.
2. Title of article, in quotation marks.
3. Name of periodical, underlined to indicate italics.
4. The volume or volume and number.
5. The date of the issue.
6. The pagination.
Example
Robert V. Roosa, “Coping with Short-term International Money Flows”, The Banker, London,
September, 1971, p. 995.
The above examples are just the samples for bibliography entries and may be used, but one
should also remember that they are not the only acceptable forms. The only thing important is that,
whatever method one selects, it must remain consistent.
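
The entry orders listed above can be sketched as two small formatting helpers. This is only an illustration under stated assumptions: the function names are hypothetical, plain strings cannot show the underlining that indicates italics, the "vols" and "Vol." labels are assumptions, and the article example is entered last name first as the stated order requires.

```python
def book_entry(last_name, first_names, title, place, publisher, year, volumes=None):
    """Author (last name first), title, place, publisher, date, then volumes."""
    parts = [f"{last_name}, {first_names}", title, place, publisher, str(year)]
    if volumes is not None:
        parts.append(f"{volumes} vols")  # label format is an assumption
    return ", ".join(parts) + "."


def article_entry(last_name, first_names, title, periodical, date, pages, volume=None):
    """Author, quoted article title, periodical, volume, date of issue, pagination."""
    parts = [f"{last_name}, {first_names}", f'"{title}"', periodical]
    if volume is not None:
        parts.append(f"Vol. {volume}")  # label format is an assumption
    parts += [date, pages]
    return ", ".join(parts) + "."


book = book_entry("Kothari", "C.R.", "Quantitative Techniques",
                  "New Delhi", "Vikas Publishing House Pvt. Ltd.", 1978)
article = article_entry("Roosa", "Robert V.",
                        "Coping with Short-term International Money Flows",
                        "The Banker", "September 1971", "p. 995")
print(book)
print(article)
```

Whatever form is chosen, generating every entry through one helper is a simple way to keep the bibliography consistent.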
Writing the final draft: This constitutes the last step. The final draft should be written in a concise
and objective style and in simple language, avoiding vague expressions such as “it seems”, “there
may be”, and the like ones. While writing the final draft, the researcher must avoid abstract terminology
and technical jargon. Illustrations and examples based on common experiences must be incorporated
in the final draft as they happen to be most effective in communicating the research findings to
others. A research report should not be dull, but must enthuse people and maintain interest and must
show originality. It must be remembered that every report should be an attempt to solve some
intellectual problem and must contribute to the solution of a problem and must add to the knowledge
of both the researcher and the reader.

LAYOUT OF THE RESEARCH REPORT


Anybody who reads the research report must be told enough about the study
so that he can place it in its general scientific context, judge the adequacy of its methods and thus
form an opinion of how seriously the findings are to be taken. For this purpose a proper layout of
the report is needed. The layout of the report means what the research report should
contain. A comprehensive layout of the research report should comprise (A) preliminary pages; (B)
the main text; and (C) the end matter. Let us deal with them separately.

(A) Preliminary Pages


In its preliminary pages the report should carry a title and date, followed by acknowledgements in
the form of ‘Preface’ or ‘Foreword’. Then there should be a table of contents followed by list of
tables and illustrations so that the decision-maker or anybody interested in reading the report can
easily locate the required information in the report.

(B) Main Text


The main text provides the complete outline of the research report along with all details. Title of the
research study is repeated at the top of the first page of the main text and then follows the other
details on pages numbered consecutively, beginning with the second page. Each main section of the
report should begin on a new page. The main text of the report should have the following sections:
(i) Introduction; (ii) Statement of findings and recommendations; (iii) The results; (iv) The implications
drawn from the results; and (v) The summary.
(i) Introduction: The purpose of introduction is to introduce the research project to the readers. It
should contain a clear statement of the objectives of research i.e., enough background should be
given to make clear to the reader why the problem was considered worth investigating. A brief
summary of other relevant research may also be stated so that the present study can be seen in that
context. The hypotheses of study, if any, and the definitions of the major concepts employed in the
study should be explicitly stated in the introduction of the report.
The methodology adopted in conducting the study must be fully explained. The scientific reader
would like to know in detail about such things: How was the study carried out? What was its basic
design? If the study was an experimental one, then what were the experimental manipulations? If the
data were collected by means of questionnaires or interviews, then exactly what questions were
asked (The questionnaire or interview schedule is usually given in an appendix)? If measurements
were based on observation, then what instructions were given to the observers? Regarding the
sample used in the study the reader should be told: Who were the subjects? How many were there?
How were they selected? All these questions are crucial for estimating the probable limits of
generalizability of the findings. The statistical analysis adopted must also be clearly stated. In addition
to all this, the scope of the study should be stated and the boundary lines be demarcated. The various
limitations, under which the research project was completed, must also be narrated.
(ii) Statement of findings and recommendations: After introduction, the research report must
contain a statement of findings and recommendations in non-technical language so that it can be
easily understood by all concerned. If the findings happen to be extensive, at this point they should be
put in the summarised form.

(iii) Results: A detailed presentation of the findings of the study, with supporting data in the form of
tables and charts together with a validation of results, is the next step in writing the main text of the
report. This generally comprises the main body of the report, extending over several chapters. The
result section of the report should contain statistical summaries and reductions of the data rather than
the raw data. All the results should be presented in logical sequence and split into readily identifiable
sections. All relevant results must find a place in the report. But how one is to decide about what is
relevant is the basic question. Quite often guidance comes primarily from the research problem and
from the hypotheses, if any, with which the study was concerned. But ultimately the researcher must
rely on his own judgement in deciding the outline of his report. “Nevertheless, it is still necessary that
he states clearly the problem with which he was concerned, the procedure by which he worked on
the problem, the conclusions at which he arrived, and the bases for his conclusions.”5
(iv) Implications of the results: Toward the end of the main text, the researcher should again put
down the results of his research clearly and precisely. He should state the implications that flow
from the results of the study, for the general reader is interested in the implications for understanding
the human behaviour. Such implications may have three aspects as stated below:
(a) A statement of the inferences drawn from the present study which may be expected to
apply in similar circumstances.
(b) The conditions of the present study which may limit the extent of legitimate generalizations
of the inferences drawn from the study.
(c) The relevant questions that still remain unanswered or new questions raised by the study
along with suggestions for the kind of research that would provide answers for them.
It is considered a good practice to finish the report with a short conclusion which summarises and
recapitulates the main points of the study. The conclusion drawn from the study should be clearly
related to the hypotheses that were stated in the introductory section. At the same time, a forecast of
the probable future of the subject and an indication of the kind of research which needs to be done in
that particular field is useful and desirable.
(v) Summary: It has become customary to conclude the research report with a very brief summary,
restating in brief the research problem, the methodology, the major findings and the major conclusions
drawn from the research results.

(C) End Matter


At the end of the report, appendices should be included for all technical data such as
questionnaires, sample information, mathematical derivations and the like. A bibliography of sources
consulted should also be given. An index (an alphabetical listing of names, places and topics along with
the numbers of the pages in a book or report on which they are mentioned or discussed) should
invariably be given at the end of the report. The value of index lies in the fact that it works as a guide
to the reader for the contents in the report.
TYPES OF REPORTS

(A) Technical Report


In the technical report the main emphasis is on (i) the methods employed, (ii) assumptions made in
the course of the study, (iii) the detailed presentation of the findings including their limitations and
supporting data.
A general outline of a technical report can be as follows:
1. Summary of results: A brief review of the main findings just in two or three pages.
2. Nature of the study: Description of the general objectives of study, formulation of the problem in
operational terms, the working hypothesis, the type of analysis and data required, etc.
3. Methods employed: Specific methods used in the study and their limitations. For instance, in
sampling studies we should give details of sample design viz., sample size, sample selection, etc.
4. Data: Discussion of data collected, their sources, characteristics and limitations. If secondary
data are used, their suitability to the problem at hand be fully assessed. In case of a survey, the
manner in which data were collected should be fully described.
5. Analysis of data and presentation of findings: The analysis of data and presentation of the
findings of the study with supporting data in the form of tables and charts be fully narrated. This, in
fact, happens to be the main body of the report usually extending over several chapters.
6. Conclusions: A detailed summary of the findings and the policy implications drawn from the
results be explained.
7. Bibliography: Bibliography of various sources consulted be prepared and attached.
8. Technical appendices: Appendices be given for all technical matters relating to questionnaire,
mathematical derivations, elaboration on particular technique of analysis and the like ones.
9. Index: Index must be prepared and be given invariably in the report at the end.
The order presented above only gives a general idea of the nature of a technical report; the order
of presentation may not necessarily be the same in all the technical reports. This, in other words,
means that the presentation may vary in different reports; even the different sections outlined above
will not always be the same, nor will all these sections appear in any particular report.
It should, however, be remembered that even in a technical report, simple presentation and ready
availability of the findings remain an important consideration and as such the liberal use of charts and
diagrams is considered desirable.

(B) Popular Report


The popular report is one which gives emphasis on simplicity and attractiveness. The simplification
should be sought through clear writing, minimization of technical, particularly mathematical, details
and liberal use of charts and diagrams. Attractive layout along with large print, many subheadings,
even an occasional cartoon now and then is another characteristic feature of the popular report.
Besides, in such a report emphasis is given on practical aspects and policy implications.
We give below a general outline of a popular report.
1. The findings and their implications: Emphasis in the report is given on the findings of most
practical interest and on the implications of these findings.
2. Recommendations for action: Recommendations for action on the basis of the findings of the
study are made in this section of the report.
3. Objective of the study: A general review of how the problem arose is presented along with the
specific objectives of the project under study.
4. Methods employed: A brief and non-technical description of the methods and techniques used,
including a short review of the data on which the study is based, is given in this part of the report.
5. Results: This section constitutes the main body of the report wherein the results of the study are
presented in clear and non-technical terms with liberal use of all sorts of illustrations such as charts,
diagrams and the like ones.
6. Technical appendices: More detailed information on methods used, forms, etc. is presented in
the form of appendices. But the appendices are often not detailed if the report is entirely meant for
general public.
There can be several variations of the form in which a popular report can be prepared. The only
important thing about such a report is that it gives emphasis on simplicity and policy implications from
the operational point of view, avoiding the technical details of all sorts to the extent possible.

MECHANICS OF WRITING A RESEARCH REPORT


1. Size and physical design: The manuscript should be written on unruled paper 8½″ × 11″ in
size. If it is to be written by hand, then black or blue-black ink should be used. A margin of at least
one and one-half inches should be allowed at the left hand and of at least half an inch at the right hand
of the paper. There should also be one-inch margins, top and bottom. The paper should be neat and
legible. If the manuscript is to be typed, then all typing should be double-spaced on one side of the
page only except for the insertion of the long quotations.
2. Procedure: The various steps in writing the report should be strictly adhered to (all such steps have
already been explained earlier in this chapter).
3. Layout: Keeping in view the objective and nature of the problem, the layout of the report should
be thought of and decided and accordingly adopted (The layout of the research report and various
types of reports have been described in this chapter earlier which should be taken as a guide for
report-writing in case of a particular problem).
4. Treatment of quotations: Quotations should be placed in quotation marks and double spaced,
forming an immediate part of the text. But if a quotation is of a considerable length (more than four
or five type written lines) then it should be single-spaced and indented at least half an inch to the right
of the normal text margin.
5. The footnotes: Regarding footnotes one should keep in view the following:
(a) The footnotes serve two purposes viz., the identification of materials used in quotations in
the report and the notice of materials not immediately necessary to the body of the research
text but still of supplemental value. In other words, footnotes are meant for cross references,
citation of authorities and sources, acknowledgement and elucidation or explanation of a
point of view. It should always be kept in view that a footnote is neither an end in itself nor a means of
displaying scholarship. The modern tendency is to make minimum use of footnotes,
for scholarship does not need to be displayed.
(b) Footnotes are placed at the bottom of the page on which the reference or quotation which
they identify or supplement ends. Footnotes are customarily separated from the textual
material by a space of half an inch and a line about one and a half inches long.
(c) Footnotes should be numbered consecutively, usually beginning with 1 in each chapter
separately. The number should be put slightly above the line, say at the end of a quotation.
At the foot of the page, again, the footnote number should be indented and typed a little
above the line. Thus, consecutive numbers must be used to correlate the reference in the
text with its corresponding note at the bottom of the page, except in case of statistical
tables and other numerical material, where symbols such as the asterisk (*) or the like one
may be used to prevent confusion.
(d) Footnotes are always typed in single space though they are divided from one another by
double space.
6. Documentation style: Regarding documentation, the first footnote reference to any given work
should be complete in its documentation, giving all the essential facts about the edition used. Such
documentary footnotes follow a general sequence. The common order may be described as under:
(i) Regarding the single-volume reference
1. Author’s name in normal order (and not beginning with the last name as in a bibliography)
followed by a comma;
2. Title of work, underlined to indicate italics;
3. Place and date of publication;
4. Pagination references (The page number).
Example
John Gassner, Masters of the Drama, New York: Dover Publications, Inc. 1954, p. 315.
(ii) Regarding multivolumed reference
1. Author’s name in the normal order;
2. Title of work, underlined to indicate italics;
3. Place and date of publication;
4. Number of volume;
5. Pagination references (The page number).
(iii) Regarding works arranged alphabetically
For works arranged alphabetically such as encyclopedias and dictionaries, no pagination
reference is usually needed. In such cases the order is illustrated as under:
Example 1
“Salamanca,” Encyclopaedia Britannica, 14th Edition.
Example 2
“Mary Wollstonecraft Godwin,” Dictionary of national biography.
But if there should be a detailed reference to a long encyclopedia article, volume and
pagination reference may be found necessary.
(iv) Regarding periodicals reference
1. Name of the author in normal order;
2. Title of article, in quotation marks;
3. Name of periodical, underlined to indicate italics;
4. Volume number;
5. Date of issuance;
6. Pagination.
(v) Regarding anthologies and collections reference
Quotations from anthologies or collections of literary works must be acknowledged not
only by author, but also by the name of the collector.
(vi) Regarding second-hand quotations reference
In such cases the documentation should be handled as follows:
1. Original author and title;
2. “quoted or cited in,”;
3. Second author and work.
Example
J.F. Jones, Life in Polynesia, p. 16, quoted in History of the Pacific Ocean area, by R.B. Abel,
p. 191.
(vii) Case of multiple authorship
If there are more than two authors or editors, then in the documentation the name of only the first
is given and the multiple authorship is indicated by “et al.” or “and others”.
Subsequent references to the same work need not be so detailed as stated above. If the work is
cited again without any other work intervening, it may be indicated as ibid., followed by a comma and
the page number. A single page should be referred to as p., but more than one page should be referred to as
pp. If several consecutive pages are referred to, the practice is often to give only the first page number,
for example, pp. 190ff, which means page 190 and the following pages; for page 190
and the following page only, ‘190f’ is used. Roman numerals are generally used to indicate the number of the
volume of a book. Op. cit. (opere citato, in the work cited) or Loc. cit. (loco citato, in the place cited)
are two of the very convenient abbreviations used in the footnotes. Op. cit. or Loc. cit. after the
writer’s name would suggest that the reference is to a work by the writer which has been cited in
detail in an earlier footnote but is separated from it by some other references.
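
The rule for subsequent references can be sketched as a short Python routine. This is a simplified illustration, not a complete citation system: the function name `build_footnotes` is hypothetical, loc. cit. is not handled, and every page number other than p. 315 for Gassner is invented purely for the example.

```python
def build_footnotes(citations):
    """citations: list of (author, full_reference, page) tuples in text order."""
    seen = set()          # works already cited in full
    previous_work = None  # work cited in the immediately preceding footnote
    notes = []
    for author, full_reference, page in citations:
        if full_reference == previous_work:
            # Same work again with nothing intervening
            note = f"Ibid., p. {page}."
        elif full_reference in seen:
            # Cited in full earlier, but other references have intervened
            note = f"{author}, op. cit., p. {page}."
        else:
            # First reference: complete documentation
            note = f"{full_reference}, p. {page}."
        seen.add(full_reference)
        previous_work = full_reference
        notes.append(note)
    return notes


gassner = ("John Gassner",
           "John Gassner, Masters of the Drama, New York: Dover Publications, Inc. 1954")
kothari = ("C.R. Kothari",
           "C.R. Kothari, Quantitative Techniques, New Delhi, Vikas Publishing House Pvt. Ltd., 1978")

notes = build_footnotes([
    (*gassner, 315),
    (*gassner, 320),  # hypothetical page
    (*kothari, 10),   # hypothetical page
    (*gassner, 315),
])
for number, note in enumerate(notes, start=1):
    print(number, note)
```

The second note comes out as "Ibid., p. 320." and the fourth as "John Gassner, op. cit., p. 315.", matching the two cases described above.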
7. Punctuation and abbreviations in footnotes: The first item after the number in the footnote is
the author’s name, given in the normal signature order. This is followed by a comma. After the
comma, the title of the book is given: the article (such as “A”, “An”, “The” etc.) is omitted and only
the first word and proper nouns and adjectives are capitalized. The title is followed by a comma.
Information concerning the edition is given next. This entry is followed by a comma. The place of
publication is then stated; it may be mentioned in an abbreviated form, if the place happens to be a
famous one such as Lond. for London, N.Y. for New York, N.D. for New Delhi and so on. This
entry is followed by a comma. Then the name of the publisher is mentioned and this entry is closed
by a comma. It is followed by the date of publication if the date is given on the title page. If the date
appears in the copyright notice on the reverse side of the title page or elsewhere in the volume, the
comma should be omitted and the date enclosed in square brackets [c 1978], [1978]. The entry is
followed by a comma. Then follow the volume and page references and are separated by a comma
if both are given. A period closes the complete documentary reference. But one should remember
that the documentation regarding acknowledgements from magazine articles and periodical literature
follow a different form as stated earlier while explaining the entries in the bibliography.
Certain English and Latin abbreviations are often used in bibliographies and footnotes to
eliminate tedious repetition. The following is a partial list of the abbreviations most commonly
used in report-writing (the researcher should learn both to recognise them and to
use them):
anon., anonymous
ante., before
art., article
aug., augmented
bk., book
bull., bulletin
cf., compare
ch., chapter
col., column
diss., dissertation
ed., editor, edition, edited.
ed. cit., edition cited
e.g., exempli gratia: for example
enl., enlarged
et al., and others
et seq., et sequens: and the following
ex., example
f., ff., and the following
fig(s)., figure(s)
fn., footnote
ibid., ibidem: in the same place (when two or more successive footnotes refer to the
same work, it is not necessary to repeat complete reference for the second
footnote. Ibid. may be used. If different pages are referred to, pagination
must be shown).
id., idem: the same
ill., illus., or illust(s)., illustrated, illustration(s)
Intro., intro., introduction
l. or ll., line(s)
loc. cit., loco citato: in the place cited; used as op. cit. (when new reference
is made to the same pagination as cited in the previous note)
MS., MSS., Manuscript or Manuscripts
N.B., nota bene: note well
n.d., no date
n.p., no place
no pub., no publisher
no(s)., number(s)
o.p., out of print
op. cit., opere citato: in the work cited (if reference has been made to a work
and a new reference is to be made, ibid. may be used; if intervening
reference has been made to different works, op. cit. must be used. The
name of the author must precede.)
p. or pp., page(s)
passim: here and there
post: after
rev., revised
tr., trans., translator, translated, translation
vid or vide: see, refer to
viz., namely
vol. or vol(s)., volume(s)
vs., versus: against
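The choice between ibid., op. cit. and loc. cit. described in the list above can be sketched as a small Python routine. This is only one reading of the rules, under stated assumptions: the function name short_references and the sample authors and titles are hypothetical, a work is identified by its author and title together, and loc. cit. is taken to apply when a non-consecutive reference points to the same page as the previous reference to that work.

```python
def short_references(citations):
    """Turn (author, work, page) tuples, in footnote order, into footnote
    texts using ibid., op. cit. and loc. cit. (hypothetical helper)."""
    notes = []
    last_page = {}    # (author, work) -> page most recently cited
    previous = None   # work cited in the immediately preceding footnote
    for author, work, page in citations:
        key = (author, work)
        if key not in last_page:
            # First reference to a work is given in full.
            note = f"{author}, {work}, p. {page}."
        elif previous == key:
            # Successive footnotes to the same work: ibid., with the
            # page shown only when it differs.
            note = "Ibid." if last_page[key] == page else f"Ibid., p. {page}."
        elif last_page[key] == page:
            # Other works intervene, same page as before: loc. cit.
            note = f"{author}, loc. cit."
        else:
            # Other works intervene, different page: op. cit.
            note = f"{author}, op. cit., p. {page}."
        notes.append(note)
        last_page[key] = page
        previous = key
    return notes


sample = [("Smith", "Title A", 10), ("Smith", "Title A", 10),
          ("Smith", "Title A", 12), ("Jones", "Title B", 5),
          ("Smith", "Title A", 12), ("Jones", "Title B", 9)]
for n, note in enumerate(short_references(sample), start=1):
    print(f"{n}. {note}")
```

In each of the op. cit. and loc. cit. cases the author’s name precedes the abbreviation, as the list requires.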
8. Use of statistics, charts and graphs: A judicious use of statistics in research reports is often
considered a virtue for it contributes a great deal towards the clarification and simplification of the
material and research results. One may well remember that a good picture is often worth more than
a thousand words. Statistics are usually presented in the form of tables, charts, bar and line graphs,
and pictograms. Such a presentation should be self-explanatory and complete in itself. It should be
suitable and appropriate to the problem at hand. Finally, statistical presentation should be neat
and attractive.
9. The final draft: Revising and rewriting the rough draft of the report should be done with great
care before writing the final draft. For this purpose, the researcher should ask himself questions
like: Are the sentences written in the report clear? Are they grammatically correct? Do they say
what is meant? Do the various points incorporated in the report fit together logically? “Having at
least one colleague read the report just before the final revision is extremely helpful. Sentences that
seem crystal-clear to the writer may prove quite confusing to other people; a connection that had
seemed self evident may strike others as a non-sequitur. A friendly critic, by pointing out passages
that seem unclear or illogical, and perhaps suggesting ways of remedying the difficulties, can be an
invaluable aid in achieving the goal of adequate communication.”6
10. Bibliography: Bibliography should be prepared and appended to the research report as discussed
earlier.
11. Preparation of the index: At the end of the report, an index should invariably be given; its
value lies in the fact that it acts as a good guide for the reader. An index may be prepared both
as a subject index and as an author index. The former gives the names of the subject-topics or concepts
along with the numbers of the pages on which they appear or are discussed in the report, whereas the
latter gives similar information regarding the names of authors. The index should always be
arranged alphabetically. Some people prefer to prepare only one index common to names of authors,
subject-topics, concepts and the like.
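The preparation of an alphabetical index can be sketched in a few lines of Python. The function name build_index and the sample entries are hypothetical; the sketch assumes the researcher has already noted each entry (a subject-topic, concept or author name) together with the page on which it occurs, and the same routine serves for a subject index, an author index or a combined one.

```python
from collections import defaultdict


def build_index(occurrences):
    """Build index lines from (entry, page) pairs (hypothetical helper)."""
    pages_by_entry = defaultdict(set)   # a set drops repeated page numbers
    for entry, page in occurrences:
        pages_by_entry[entry].add(page)

    lines = []
    # Entries are arranged alphabetically, ignoring letter case.
    for entry in sorted(pages_by_entry, key=str.casefold):
        pages = ", ".join(str(p) for p in sorted(pages_by_entry[entry]))
        lines.append(f"{entry}, {pages}")
    return lines


print(build_index([("sampling", 12), ("Hypothesis", 40),
                   ("sampling", 3), ("sampling", 12)]))
# ['Hypothesis, 40', 'sampling, 3, 12']
```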

ORAL PRESENTATION
At times oral presentation of the results of the study is considered effective, particularly in cases
where policy recommendations are indicated by project results. The merit of this approach lies in the
fact that it provides an opportunity for give-and-take decisions which generally lead to a better
understanding of the findings and their implications. But the main demerit of this sort of presentation
is the lack of any permanent record of the research details, and the
findings may well fade from people’s memory even before any action is taken. In order to
overcome this difficulty, a written report may be circulated before the oral presentation and referred
to frequently during the discussion. Oral presentation is effective when supplemented by various
visual devices. Use of slides, wall charts and blackboards is quite helpful in contributing to clarity and
in reducing boredom, if any. Distributing a broad outline, with a few important tables and charts
concerning the research results, keeps the listeners attentive, since they have a ready outline on which to
focus their thinking. This very often happens in academic institutions where the researcher discusses
his research findings and policy implications with others either in a seminar or in a group discussion.
Thus, research results can be reported in more than one way, but the usual practice adopted, in
academic institutions particularly, is that of writing the Technical Report and then preparing several
research papers to be discussed at various forums in one form or another. But in the practical field and
with problems having policy implications, the technique followed is that of writing a popular report.
Research done on governmental account or on behalf of major public or private organisations
is usually presented in the form of technical reports.

PRECAUTIONS FOR WRITING RESEARCH REPORTS


A research report is a channel for communicating the research findings to its readers. A
good research report is one which does this task efficiently and effectively. As such it must be
prepared keeping the following precautions in view:
1. While determining the length of the report (since research reports vary greatly in length),
one should keep in view the fact that it should be long enough to cover the subject but short
enough to maintain interest. In fact, report-writing should not be a means to learning more
and more about less and less.
2. A research report should not, if this can be avoided, be dull; it should be such as to sustain
the reader’s interest.
3. Abstract terminology and technical jargon should be avoided in a research report. The
report should be able to convey the matter as simply as possible. This, in other words,
means that the report should be written in an objective style in simple language, avoiding
vague expressions such as “it seems,” “there may be” and the like.
4. Readers are often interested in acquiring a quick knowledge of the main findings and as
such the report must provide a ready availability of the findings. For this purpose, charts,
graphs and the statistical tables may be used for the various results in the main report in
addition to the summary of important findings.
5. The layout of the report should be well thought out and must be appropriate and in accordance
with the objective of the research problem.
6. The reports should be free from grammatical mistakes and must be prepared strictly in
accordance with the techniques of composition of report-writing such as the use of quotations,
footnotes, documentation, proper punctuation and use of abbreviations in footnotes and the
like.
7. The report must present the logical analysis of the subject matter. It must reflect a structure
wherein the different pieces of analysis relating to the research problem fit well.
8. A research report should show originality and should necessarily be an attempt to solve
some intellectual problem. It must contribute to the solution of a problem and must add to
the store of knowledge.
9. Towards the end, the report must also state the policy implications relating to the problem
under consideration. It is usually considered desirable if the report makes a forecast of the
probable future of the subject concerned and indicates the kinds of research that still need to
be done in that particular field.
10. Appendices should be listed in respect of all the technical data in the report.
11. Bibliography of sources consulted is a must for a good report and must necessarily be
given.
12. Index is also considered an essential part of a good report and as such must be prepared
and appended at the end.
13. The report must be attractive in appearance, neat and clean, whether typed or printed.
14. Calculated confidence limits must be mentioned and the various constraints experienced in
conducting the research study may also be stated in the report.
15. The objective of the study, the nature of the problem, the methods employed and the analysis
techniques adopted must all be clearly stated at the beginning of the report in the form of
an introduction.
