Module 3 Notes-RM
• Descriptive statistics are numbers that are used to summarize and describe data. The word "data" refers
to the information that has been collected from an experiment, a survey, a historical record, etc.
Hypothesis Testing
CLASSIFICATION OF DATA
• Chronological classification
• Spatial Classification
• Qualitative Classification
• Quantitative Classification
ORGANISATION OF DATA
• Variable
  • Discrete: can take only certain values.
  • Continuous: can take any value; may be a whole number, a fractional value, or any value within a range.
FREQUENCY DISTRIBUTION
Presentation of Data
2019-20
PRESENTATION OF DATA 41
42 STATISTICS FOR ECONOMICS
stub column. A brief description of the row headings may also be given at the left hand top in the table. (See Table 4.5).
(v) Body of the Table
Body of a table is the main part and it contains the actual data. The location of any one figure/data in the table is fixed and determined by the row and column of the table. For example, data in the second row and fourth column indicate that 25 crore females in rural India were non-workers in 2001 (See Table 4.5).
(vi) Unit of Measurement
The unit of measurement of the figures in the table (actual data) should always be stated along with the title. If different units are there for rows or columns of the table, these units must be stated along with ‘stubs’ or ‘captions’. If figures are large, they should be rounded up and the method
Rural
Female 1 0 1 12 13
Total 8 1 9 19 28
Male 24 4 28 25 53
All
Female 7 5 12 37 49
Total 31 9 40 62 102
(Note : Table 4.5 presents the same data in tabular form already presented through case 2 in
textual presentation of data)
of rounding should be indicated (See Table 4.5).
(vii) Source
It is a brief statement or phrase indicating the source of data presented in the table. If more than one source is there, all the sources are to be written in the source. Source is generally written at the bottom of the table. (See Table 4.5).
(viii) Note
Note is the last part of the table. It explains the specific feature of the data content of the table which is not self explanatory and has not been explained earlier.
Activities
• How many rows and columns are essentially required to form a table?
• Can the column/row headings of a table be quantitative?
• Can you present tables 4.2 and 4.3 after rounding off figures appropriately?
• Present the first two sentences of case 2 on p.41 as a table. Some details for this would be found elsewhere in this chapter.
5. DIAGRAMMATIC PRESENTATION OF DATA
This is the third method of presenting data. This method provides the quickest understanding of the actual situation to be explained by data in comparison to tabular or textual presentations. Diagrammatic presentation of data translates quite effectively the highly abstract ideas contained in numbers into more concrete and easily comprehensible form. Diagrams may be less accurate but are much more effective than tables in presenting the data.
There are various kinds of diagrams in common use. Amongst them the important ones are the following:
(i) Geometric diagram
(ii) Frequency diagram
(iii) Arithmetic line graph
Geometric Diagram
Bar diagram and pie diagram come in the category of geometric diagram. The bar diagrams are of three types: simple, multiple and component bar diagrams.
Bar Diagram
Simple Bar Diagram
Bar diagram comprises a group of equispaced and equiwidth rectangular bars for each class or category of data. Height or length of the bar reads the magnitude of data. The lower end of the bar touches the base line such that the height of a bar starts from the zero unit. Bars of a bar diagram can be visually compared by their relative height and accordingly data are comprehended quickly. Data for this can be of frequency or non-frequency type. In non-frequency type data a particular characteristic, say production, yield, population, etc. at various points of time or of different states are noted and corresponding bars are made of the respective heights according to the values of the characteristic to construct the diagram. The values of the characteristics (measured or counted)
Fig. 4.1: Bar diagram showing male literacy rates of major states of India, 2011. (Literacy rates
relate to population aged 7 years and above)
Fig. 4.2: Multiple bar (column) diagram showing female literacy rates over two census years 2001
and 2011 by major states of India. (Data Source Table 4.6)
Interpretation: Figure 4.2 shows that the female literacy rate increased throughout the country between the two census years. Other interpretations can be made from the figure as well. For example, it shows that the states of Bihar, Jharkhand and Uttar Pradesh experienced the sharpest rise in female literacy.
TABLE 4.8
Distribution of Indian population (2011) by their working status (crores)
Status             Population    Per cent    Angular Component
Marginal Worker        12           9.9            36°
Main Worker            36          29.8           107°
Non-worker             73          60.3           217°
All                   121         100.0           360°
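The per cent and angular columns of Table 4.8 can be recomputed from the three population figures. The sketch below is only an illustration; the variable names are assumptions, and the total is taken as the sum of the three categories.

```python
# Population by working status in crores, taken from Table 4.8.
population = {"Marginal Worker": 12, "Main Worker": 36, "Non-worker": 73}

# The "All" row is the sum of the three categories.
total = sum(population.values())

# Share of each category as a percentage and as a pie-chart angle in degrees.
percent = {k: round(100 * v / total, 1) for k, v in population.items()}
angle = {k: round(360 * v / total) for k, v in population.items()}

for status, count in population.items():
    print(f"{status:16s} {count:4d} {percent[status]:6.1f} {angle[status]:4d}")
print(f"{'All':16s} {total:4d}")
```

The three angles add up to 360 degrees, which is a quick check that the pie diagram closes.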
when death rates are very high compared to deaths at most other higher age segments of the population. For graphical representation of such data, the height of a rectangle is the quotient of its area (here frequency) and its base (here width of the class interval). When intervals are equal, that is, when all rectangles have the same base, area can conveniently be represented by the frequency of any interval for purposes of comparison. Since histograms are rectangles, a line parallel to the base line and of the same magnitude is to be drawn at a vertical distance equal to frequency (or frequency density) of the class interval. A histogram is never drawn for a discrete variable. Since, for continuous variables, the lower class boundary of a class interval fuses with the upper class boundary of the previous interval, equal or unequal, the rectangles are all adjacent and there is no open space between two consecutive rectangles. If the classes are not continuous they are first converted into continuous classes as discussed in Chapter 3. Sometimes the common portion between two adjacent rectangles (Fig. 4.6) is omitted giving a better impression of continuity. The resulting figure gives the impression of a double staircase.
A histogram looks similar to a bar diagram, but there are more differences than similarities between the two than may appear at first impression. In a bar diagram, the spacing and the width or the area of bars are all arbitrary. It is the height and not the width or the area of the bar that really matters. A single vertical line could have served the same purpose as a bar of same width. Moreover, in a histogram no space is left between two rectangles, but in a bar diagram some space must be left between consecutive bars (except in multiple bar or component bar diagram). Although the bars have the same width, the width of a bar is unimportant for the purpose of comparison. The width in a histogram is as important as its height. We can have a bar diagram both for discrete and continuous variables, but a histogram is drawn only for a continuous variable. A histogram also gives the value of the mode of the frequency distribution graphically, as shown in Figure 4.5: the x-coordinate of the dotted vertical line gives the mode.
Frequency Polygon
A frequency polygon is a plane bounded by straight lines, usually four or more lines. Frequency polygon is an alternative to histogram and is also derived from histogram itself.
Fig. 4.5: Histogram for the distribution of 85 daily wage earners in a locality of a town.
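When class intervals are unequal, the rectangle height in a histogram is the frequency density (frequency divided by class width), so that area stands for frequency. A minimal sketch follows; the class intervals and frequencies are hypothetical and are not the data behind Fig. 4.5.

```python
# Hypothetical class intervals (lower, upper) with unequal widths, and their frequencies.
classes = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 12, 16, 8]


def frequency_density(classes, frequencies):
    """Return frequency / class width for each class, i.e. the rectangle height."""
    return [f / (upper - lower) for (lower, upper), f in zip(classes, frequencies)]


heights = frequency_density(classes, frequencies)

# Height times width gives back the frequency, so area represents frequency.
areas = [h * (upper - lower) for h, (lower, upper) in zip(heights, classes)]

print(heights)
print(areas)
```

With equal class widths every height is divided by the same number, which is why frequencies themselves can be plotted in that case.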
Fig. 4.6: Frequency polygon drawn for the data given in Table 4.9
TABLE 4.10
Frequency distribution of marks obtained in mathematics
Total 64
Here you can see from Fig. 4.9 that for the period 1993-94 to 2013-14, the imports were more than the exports all through the period. You may notice the value of both exports and imports rising rapidly after 2001-02. Also the gap between the two (imports and exports) has widened after 2001-02.
TABLE 4.11
Value of Exports and Imports of India (Rs in 100 crores)
Year         Exports    Imports
1993-94          698        731
1994-95          827        900
1995-96         1064       1227
1996-97         1188       1389
1997-98         1301       1542
1998-99         1398       1783
1999-2000       1591       2155
2000-01         2036       2309
2001-02         2090       2452
2002-03         2549       2964
2003-04         2934       3591
2004-05         3753       5011
2005-06         4564       6604
2006-07         5718       8815
2007-08         6559      10123
2008-09         8408      13744
2009-10         8455      13637
2010-11        11370      16835
2011-12        14660      23455
2012-13        16343      26692
2013-14        19050      27154
Source: DGCI&S, Kolkata
Fig. 4.9: Arithmetic line graph for time series data given in Table 4.11
6. CONCLUSION
By now you must have been able to learn how the data could be presented using various forms of presentation: textual, tabular and diagrammatic. You are now also able to make an appropriate choice of the form of data presentation as well as the type of diagram to be used for a given set of data. Thus you can make presentation of data meaningful, comprehensive and purposeful.
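The two observations drawn from Fig. 4.9 (imports exceed exports throughout, and the gap widens after 2001-02) can be checked directly against the figures of Table 4.11. This is a small illustrative sketch; the variable names are assumptions.

```python
# (year, exports, imports) in Rs 100 crores, from Table 4.11.
trade = [
    ("1993-94", 698, 731), ("1994-95", 827, 900), ("1995-96", 1064, 1227),
    ("1996-97", 1188, 1389), ("1997-98", 1301, 1542), ("1998-99", 1398, 1783),
    ("1999-2000", 1591, 2155), ("2000-01", 2036, 2309), ("2001-02", 2090, 2452),
    ("2002-03", 2549, 2964), ("2003-04", 2934, 3591), ("2004-05", 3753, 5011),
    ("2005-06", 4564, 6604), ("2006-07", 5718, 8815), ("2007-08", 6559, 10123),
    ("2008-09", 8408, 13744), ("2009-10", 8455, 13637), ("2010-11", 11370, 16835),
    ("2011-12", 14660, 23455), ("2012-13", 16343, 26692), ("2013-14", 19050, 27154),
]

# Gap between imports and exports for every year.
gaps = {year: imports - exports for year, exports, imports in trade}

# True when imports exceeded exports in every year of the period.
imports_always_higher = all(gap > 0 for gap in gaps.values())

print(imports_always_higher)
print(gaps["2001-02"], gaps["2013-14"])
```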
Recap
• Data (even voluminous data) speak meaningfully through
presentation.
• For small data (quantity) textual presentation serves the purpose
better.
• For large quantity of data tabular presentation helps in
accommodating any volume of data for one or more variables.
• Tabulated data can be presented through diagrams which enable
quicker comprehension of the facts presented otherwise.
EXERCISES
MEASURES OF CENTRAL TENDENCY
Calculation of arithmetic mean for Grouped data
Discrete Series
Direct Method
In case of discrete series, the frequency against each observation is multiplied by the value of the observation. The values so obtained are summed up and divided by the total number of frequencies. Symbolically,
$\bar{X} = \dfrac{\sum fX}{\sum f}$
where $\sum fX$ = sum of the products of variables and frequencies, and $\sum f$ = sum of frequencies.
Example 3
Plots in a housing colony come in only three sizes: 100 sq. metre, 200 sq. metre and 300 sq. metre, and the number of plots are respectively 200, 50 and 10.
TABLE 5.2
Computation of Arithmetic Mean by Direct Method
Plot size in Sq. metre (X)    No. of Plots (f)    fX       d' = (X – 200)/100    fd'
100                           200                 20000    –1                    –200
200                            50                 10000     0                       0
300                            10                  3000    +1                      10
Total                         260                 33000     0                    –190
Arithmetic mean using direct method,
$\bar{X} = \dfrac{\sum fX}{\sum f} = \dfrac{33000}{260} = 126.92$ Sq. metre
Therefore, the mean plot size in the housing colony is 126.92 Sq. metre.
Assumed Mean Method
As in case of individual series the calculations can be simplified by using assumed mean method, as described earlier, with a simple modification. Since frequency (f) of each item is given here, we multiply each deviation (d) by the frequency to get fd. Then we get $\sum fd$. The next step is to get the total of all frequencies, i.e. $\sum f$. Then find out $\sum fd / \sum f$. Finally, the arithmetic mean is calculated by
$\bar{X} = A + \dfrac{\sum fd}{\sum f}$
using assumed mean method.
Step Deviation Method
In this case, the deviations are divided by the common factor ‘c’ which simplifies the calculation. Here we estimate
$d' = \dfrac{d}{c} = \dfrac{X - A}{c}$
in order to reduce the size of numerical figures for easier calculation. Then get fd' and $\sum fd'$. The formula for arithmetic mean using step deviation method is given as,
$\bar{X} = A + \dfrac{\sum fd'}{\sum f} \times c$
Activity
• Find the mean plot size for the data given in example 3, by using step deviation and assumed mean methods.
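The three methods can be compared on the plot-size data of Example 3. The sketch below is illustrative only; the function names are assumptions, and the assumed mean A = 200 and common factor c = 100 are the ones implied by the d' column of Table 5.2.

```python
# Plot sizes (sq. metre) and number of plots, from Example 3.
sizes = [100, 200, 300]
plots = [200, 50, 10]


def mean_direct(x, f):
    """Direct method: sum(fX) / sum(f)."""
    return sum(fi * xi for xi, fi in zip(x, f)) / sum(f)


def mean_assumed(x, f, A):
    """Assumed mean method: A + sum(fd) / sum(f), with d = X - A."""
    return A + sum(fi * (xi - A) for xi, fi in zip(x, f)) / sum(f)


def mean_step_deviation(x, f, A, c):
    """Step deviation method: A + (sum(fd') / sum(f)) * c, with d' = (X - A) / c."""
    return A + sum(fi * (xi - A) / c for xi, fi in zip(x, f)) / sum(f) * c


direct = mean_direct(sizes, plots)
assumed = mean_assumed(sizes, plots, A=200)
step = mean_step_deviation(sizes, plots, A=200, c=100)

print(round(direct, 2), round(assumed, 2), round(step, 2))
```

All three functions return the same mean; the assumed mean and step deviation methods only shrink the numbers handled in a hand calculation.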
Direct Method
Marks              0–10   10–20   20–30   30–40   40–50   50–60   60–70
No. of Students       5      12      15      25       8       3       2
TABLE 5.3
Computation of Average Marks for Exclusive Class Interval by Direct Method
Marks (x)   No. of students (f)   Mid value (m)   fm (2)×(3)   d' = (m – 35)/10   fd'
(1)         (2)                   (3)             (4)          (5)                (6)
0–10         5                     5               25          –3                 –15
10–20       12                    15              180          –2                 –24
20–30       15                    25              375          –1                 –15
30–40       25                    35              875           0                   0
Two interesting properties of A.M.
(i) The sum of deviations of items about the arithmetic mean is always equal to zero. Symbolically, $\sum (X - \bar{X}) = 0$.
(ii) The arithmetic mean is affected by extreme values. Any large value, on either end, can push it up or down.
Weighted Arithmetic Mean
Sometimes it is important to assign weights to various items according to their importance when you calculate the arithmetic mean. For example, there are two commodities, mangoes and potatoes. You are interested in
finding the average price of mangoes (P1) and potatoes (P2). The arithmetic mean will be $\dfrac{P_1 + P_2}{2}$. However, you might want to give more importance to the rise in price of potatoes (P2). To do this, you may use as ‘weights’ the share of mangoes in the budget of the consumer (W1) and the share of potatoes in the budget (W2). Now the arithmetic mean weighted by the shares in the budget would be $\dfrac{W_1 P_1 + W_2 P_2}{W_1 + W_2}$.
In general the weighted arithmetic mean is given by,
$\bar{X}_w = \dfrac{w_1 X_1 + w_2 X_2 + \dots + w_n X_n}{w_1 + w_2 + \dots + w_n} = \dfrac{\sum wX}{\sum w}$
When the prices rise, you may be interested in the rise in prices of commodities that are more important to you. You will read more about it in the discussion of Index Numbers in Chapter 8.
Activities
• Check property of arithmetic mean for the following example:
X: 4 6 8 10 12
• In the above example if mean is increased by 2, then what happens to the individual observations.
• If first three items increase by 2, then what should be the values of the last two items, so that mean remains the same.
• Replace the value 12 by 96. What happens to the arithmetic mean? Comment.
3. MEDIAN
Median is that positional value of the variable which divides the distribution into two equal parts, one part comprises all values greater than or equal to the median value and the other comprises all values less than or equal to it. The Median is the “middle” element when the data set is arranged in order of the magnitude. Since the median is determined by the position of different values, it remains unaffected if, say, the size of the largest value increases.
Computation of median
The median can be easily computed by sorting the data from smallest to largest and finding out the middle value.
Example 5
Suppose we have the following observation in a data set: 5, 7, 6, 1, 8, 10, 12, 4, and 3.
Arranging the data in ascending order you have:
1, 3, 4, 5, 6, 7, 8, 10, 12.
The “middle score” is 6, so the median is 6. Half of the scores are larger than 6 and half of the scores are smaller.
If there are even numbers in the data, there will be two observations which fall in the middle; the median is then taken as the arithmetic mean of these two middle values.
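The properties of the arithmetic mean, the weighted mean and the median can be tried out on the small data sets used in this part of the notes. The sketch below uses Python's standard `statistics` module; the prices and budget shares are hypothetical values chosen only for illustration, and the even-sized data set is obtained by dropping one observation from Example 5.

```python
from statistics import mean, median

# Data from the activity: X = 4, 6, 8, 10, 12.
x = [4, 6, 8, 10, 12]
x_bar = mean(x)

# Property (i): deviations about the mean sum to zero.
deviation_sum = sum(value - x_bar for value in x)

# Property (ii): replacing 12 by 96 pulls the mean up but leaves the median alone.
x_extreme = [4, 6, 8, 10, 96]
mean_extreme = mean(x_extreme)
median_extreme = median(x_extreme)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(wX) / sum(w)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical prices (P1 mangoes, P2 potatoes) and budget shares in per cent (W1, W2).
prices = [80, 20]
shares = [20, 80]
weighted_price = weighted_mean(prices, shares)

# Example 5: odd number of observations, the middle value is the median.
example_5 = [5, 7, 6, 1, 8, 10, 12, 4, 3]
median_odd = median(example_5)

# Even number of observations (last value dropped for illustration):
# the median is the mean of the two middle values.
median_even = median(example_5[:-1])

print(deviation_sum, mean_extreme, median_extreme)
print(weighted_price, median_odd, median_even)
```

With these hypothetical shares the weighted mean price is well below the simple mean of the two prices, because the cheaper item carries the larger weight.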
formula where N is the number of observations.
$Q_1$ = size of $\dfrac{(N + 1)}{4}$th item
$Q_3$ = size of $\dfrac{3(N + 1)}{4}$th item.
The word mode has been derived from the French word “la Mode” which signifies the most fashionable values of a distribution, because it is repeated the highest number of times in the series. Mode is the most frequently observed data value. It is denoted by Mo.
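The quartile positions (N + 1)/4 and 3(N + 1)/4 can be applied to the sorted data of Example 5. The notes only give the position formula; linear interpolation for a fractional position is an assumption made here, and the helper name `item_at` and the data used for the mode are hypothetical.

```python
from statistics import mode, quantiles


def item_at(sorted_data, position):
    """Value at a 1-based position; fractional positions are interpolated linearly (assumption)."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


data = sorted([5, 7, 6, 1, 8, 10, 12, 4, 3])
n = len(data)

q1 = item_at(data, (n + 1) / 4)       # 2.5th item
q3 = item_at(data, 3 * (n + 1) / 4)   # 7.5th item

# The standard library's "exclusive" method uses the same (N + 1) positions.
library_quartiles = quantiles(data, n=4, method="exclusive")

# Mode of a hypothetical data set: the most frequently observed value.
most_frequent = mode([1, 2, 2, 3, 4, 4, 4, 5])

print(q1, q3, library_quartiles, most_frequent)
```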
Recap
• The measure of central tendency summarises the data with a single
value, which can represent the entire data.
• Arithmetic mean is defined as the sum of the values of all observations
divided by the number of observations.
• The sum of deviations of items from the arithmetic mean is always
equal to zero.
• Sometimes, it is important to assign weights to various items
according to their importance.
• Median is the central value of the distribution in the sense that the
number of values less than the median is equal to the number greater
than the median.
• Quartiles divide the total set of values into four equal parts.
• Mode is the value which occurs most frequently.
EXERCISES
Daily Income (in Rs) 10–14 15–19 20–24 25–29 30–34 35–39
Number of workers 5 10 15 20 10 5
(Hint: compute median, lower quartile and upper quartile.)
[Ans. (a) Rs 25.11 (b) Rs 19.92 (c) Rs 29.19]
9. The following table gives production yield in kg. per hectare of wheat of
150 farms in a village. Calculate the mean, median and mode values.
Production yield (kg. per hectare)
50–53 53–56 56–59 59–62 62–65 65–68 68–71 71–74 74–77
Number of farms
3 8 14 30 36 28 16 10 5
(Ans. mean = 63.82 kg. per hectare, median = 63.67 kg. per hectare,
mode = 63.29 kg. per hectare)
CHAPTER 7: CORRELATION
Fig. 7.5: Perfect Negative Correlation
Fig. 7.6: Positive non-linear relation
• The value of the correlation coefficient lies between minus one and plus one, –1 ≤ r ≤ 1. If, in any exercise, the value of r is outside this range it indicates error in calculation.
• The magnitude of r is unaffected by the change of origin and change of scale. Given two variables X and Y let us define two new variables
$U = \dfrac{X - A}{B}; \quad V = \dfrac{Y - C}{D}$
where A and C are assumed means of X and Y respectively. B and D are common factors and of same sign. Then
$r_{xy} = r_{uv}$
This property is used to calculate correlation coefficient in a highly simplified manner, as in the step deviation method.
• If r = 0 the two variables are uncorrelated. There is no linear relation between them. However other types of relation may be there.
• If r = 1 or r = –1 the correlation is perfect and there is exact linear relation.
• A high value of r indicates strong linear relationship. Its value is said to be high when it is close to +1 or –1.
• A low value of r (close to zero) indicates a weak linear relation. But there may be a non-linear relation.
As you have read in Chapter 1, the statistical methods are no substitute for common sense. Here is another example, which highlights the need for understanding the data properly before correlation is calculated and interpreted. An epidemic spreads in some villages and the government sends a team of doctors to the affected villages. The correlation between the number of deaths and the number of doctors sent to the villages is found to be positive. Normally, the healthcare facilities provided by the doctors are expected to reduce the number of deaths showing a negative correlation. This happened due to other reasons. The data relate to a specific time period. Many of the reported deaths could be terminal cases where the doctors could do little. Moreover, the benefit of the presence of doctors becomes visible only after some time. It is also possible that the reported deaths are not due to the epidemic. A tsunami suddenly hits the state and death toll rises.
Let us illustrate the calculation of r by examining the relationship between years of schooling of farmers and the annual yield per acre.
TABLE 7.1
Calculation of r between years of schooling of farmers and annual yield
Years of         (X – X̄)   (X – X̄)²   Annual yield per acre   (Y – Ȳ)   (Y – Ȳ)²   (X – X̄)(Y – Ȳ)
Education (X)                          in ’000 Rs (Y)
 0               –6         36          4                      –3         9          18
 2               –4         16          4                      –3         9          12
 4               –2          4          6                      –1         1           2
 6                0          0         10                       3         9           0
 8                2          4         10                       3         9           6
10                4         16          8                       1         1           4
12                6         36          7                       0         0           0
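The deviation columns of Table 7.1 lead to r as the sum of cross products divided by the square root of the product of the two sums of squares. The sketch below computes it, and also checks that r is unchanged under a change of origin and scale; the function name and the particular choices A = 6, B = 2, C = 7, D = 1 are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    sum_yy = sum((yi - y_bar) ** 2 for yi in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Years of education (X) and annual yield per acre in '000 Rs (Y), from Table 7.1.
schooling = [0, 2, 4, 6, 8, 10, 12]
annual_yield = [4, 4, 6, 10, 10, 8, 7]

r_xy = pearson_r(schooling, annual_yield)

# Change of origin and scale: U = (X - A)/B, V = (Y - C)/D with B and D of the same sign.
u = [(xi - 6) / 2 for xi in schooling]
v = [(yi - 7) / 1 for yi in annual_yield]
r_uv = pearson_r(u, v)

print(round(r_xy, 3), round(r_uv, 3))
```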
U = (X – 100)/10    V = (Y – 1700)/100    U²     V²     UV
 2                   1                      4      1      2
 5                   3                     25      9     15
 9                   8                     81     64     72
12                  10                    144    100    120
13                  13                    169    169    169
ΣU = 41; ΣV = 35; ΣU² = 423;
Spearman’s rank correlation
Spearman’s rank correlation was developed by the British psychologist C.E. Spearman. It is used in the following situations:
1. Suppose we are trying to estimate the correlation between the heights and weights of students in a remote village where neither measuring rods nor weighing machines are
Total 14
Substituting these values in formula (4)
$r_s = 1 - \dfrac{6\sum D^2}{n^3 - n}$ ...(4)
A    85    60
B    60    48
C    55    49
D    65    50
E    75    55
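Formula (4) can be applied to the five surviving rows A to E. The column headings of these rows are not recoverable from the notes, so the sketch below simply treats the two numbers in each row as paired scores; that reading, the function names, and the choice of rank 1 for the largest value are assumptions. The ranking helper does not handle tied values.

```python
def ranks(values):
    """Rank 1 goes to the largest value; assumes there are no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(value) + 1 for value in values]


def spearman_rs(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(D^2) / (n^3 - n)."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Rows A to E as they appear in the notes (meaning of the columns assumed).
first_score = [85, 60, 55, 65, 75]
second_score = [60, 48, 49, 50, 55]

rs = spearman_rs(first_score, second_score)
print(round(rs, 2))
```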
Recap
• Correlation analysis studies the relation between two variables.
• Scatter diagrams give a visual presentation of the nature of
relationship between two variables.
• Karl Pearson’s coefficient of correlation r measures numerically only
linear relationship between two variables. r lies between –1 and 1.
• When the variables cannot be measured precisely Spearman’s rank
correlation can be used to measure the linear relationship
numerically.
• Repeated ranks need correction factors.
• Correlation does not mean causation. It only means
covariation.
EXERCISES
Activity
• Use all the formulae discussed here to calculate r between
India’s national income and exports taking at least ten
observations.
...if we find any association between two or more variables, we might be interested in
estimating the value of one variable for known value(s) of another variable(s)
5.1 INTRODUCTION
In business, it is often necessary to have some forecast so that the management
can take a decision regarding a product or a particular course of action. In order to make a
forecast, one has to ascertain some relationship between two or more variables relevant to a
particular situation. For example, a company is interested to know how far the demand for
television sets will increase in the next five years, keeping in mind the growth of population
in a certain town. Here, it clearly assumes that the increase in population will lead to an
increased demand for television sets. Thus, to determine the nature and extent of relationship
In the preceding lesson, we studied in some depth linear correlation between two variables.
Here we have a similar concern, the association between variables, except that we develop it
further in two respects. First, we learn how to build statistical models of relationships
between the variables to have a better understanding of their features. Second, we extend the
For this purpose, we have to use the technique - regression analysis - which forms the
In 1889, Sir Francis Galton, a cousin of Charles Darwin, published a paper on heredity,
“Natural Inheritance”. He reported his discovery that sizes of seeds of sweet pea plants
appeared to “revert” or “regress”, to the mean size in successive generations. He also reported
results of a study of the relationship between heights of fathers and heights of their sons. A
straight line was fit to the data pairs: height of father versus height of son. Here, too, he found
a “regression to mediocrity”. The heights of the sons represented a movement away from their
fathers, towards the average height. We credit Sir Galton with the idea of statistical
regression.
While most applications of regression analysis may have little to do with the
now refers to the statistical technique of modeling the relationship between two or
prediction of the unknown value of one variable from the known value(s) of the other
variable(s). It is one of the most important and widely used statistical techniques in
In this lesson we will focus only on simple regression: linear regression involving only two
variables: a dependent variable and an independent variable. Regression analysis for studying
Simple regression involves only two variables; one variable is predicted by another variable.
The variable to be predicted is called the dependent variable. The predictor is called the
independent variable, or explanatory variable. For example, when we are trying to predict
the demand for television sets on the basis of population growth, we are using the demand for
television sets as the dependent variable and the population growth as the independent or
predictor variable.
Deciding which variable is which sometimes causes problems. Often the choice is
obvious, as in case of demand for television sets and population growth because it would
make no sense to suggest that population growth could be dependent on TV demand! The
population growth has to be the independent variable and the TV demand the dependent
variable.
If we are unsure, here are some points that might be of use:
• if we have control over one of the variables then that is the independent variable. For example,
a manufacturer can decide how much to spend on advertising and expect his sales to
• if there is any lapse of time between the two variables being measured, then the latter
must depend upon the former, it cannot be the other way round
• if we want to predict the values of one variable from our knowledge of the other
The task of bringing out linear relationship consists of developing methods of fitting a
straight line, or a regression line as is often called, to the data on two variables.
The line of regression is the graphical representation of the best estimate of
one variable for any given value of the other variable. The nomenclature of the line depends
on the independent and dependent variables. If X and Y are two variables whose
relationship is to be indicated, a line that gives the best estimate of Y for any value of X is
called the Regression line of Y on X. If the dependent variable changes to X, then best estimate
For purposes of illustration as to how a straight line relationship is obtained, consider the
sample paired data on sales of each of the N = 5 months of a year and the marketing
Table 5-1
Month      Sales (Rs lac)    Marketing Expenditure (Rs thousands)
           Y                 X
April      14                10
May        17                12
June       23                15
July       21                20
August     25                23
Let Y, the sales, be the dependent variable and X, the marketing expenditure, the independent
variable. We note that for each value of independent variable X, there is a specific value of
the dependent variable Y, so that each value of X and Y can be seen as paired observations.
relationship between the two variables is linear, that is, the one which is best explained by a
straight line. A good way of doing this is to plot the data on X and Y on a graph so as to yield
a scatter diagram, as may be seen in Figure 5-1. A careful reading of the scatter diagram
reveals that:
• the overall tendency of the points is to move upward, so the relationship is positive
• the general course of movement of the various points on the diagram can be best
• there is a high degree of correlation between the variables, as the points are very close
to each other
Figure 5-1 Scatter Diagram with Line of Best Fit
If the movement of various points on the scatter diagram is best described by a straight line,
the next step is to fit a straight line on the scatter diagram. It has to be so fitted that on the
whole it lies as close as possible to every point on the scatter diagram. The necessary
requirement for meeting this condition being that the sum of the squares of the vertical
As shown in Figure 5-1, if d1, d2, ..., dN are the vertical deviations of observed Y values from
$d_1^2 + d_2^2 + \dots + d_N^2 = \sum_{j=1}^{N} d_j^2$
is the minimum. The deviations dj have to be squared to avoid negative deviations canceling
out the positive deviations. Since a straight line so fitted best approximates all the points on
the scatter diagram, it is better known as the best approximating line or the line of best fit. A
Free hand drawing is the simplest method of fitting a straight line. After a careful
inspection of the movement and spread of various points on the scatter diagram, a
straight line is drawn through these points by using a transparent ruler such that on the
whole it is closest to every point. A straight line so drawn is particularly useful when
Whereas the use of free hand drawing may yield a line nearest to the line of best fit, the major
drawback is that the slope of the line so drawn varies from person to person because of the
influence of subjectivity. Consequently, the values of the dependent variable estimated on the
basis of such a line may not be as accurate and precise as those based on the line of best fit.
The least square method of fitting a line of best fit requires minimizing the sum of the
squares of vertical deviations of each observed Y value from the fitted line. These deviations,
such as d1 and d3, are shown in Figure 5-1 and are given by Y - Yc, where Y is the observed
value and Yc the corresponding computed value given by the fitted line
$Y_c = a + bX$ …………(5.1)
The straight line relationship in Eq.(5.1), is stated in terms of two constants a and b
• The constant a is the Y-intercept; it indicates the height on the vertical axis from
where the straight line originates, representing the value of Y when X is zero.
• Constant b is a measure of the slope of the straight line; it shows the absolute change
in Y for a unit change in X. As the slope may be positive or negative, it indicates the
regression coefficient of Y on X.
Since a straight line is completely defined by its intercept a and slope b, the task of fitting the
same reduces only to the computation of the values of these two constants. Once these two
values are known, the computed Yc values against each value of X can be easily obtained by
In the method of least squares the values of a and b are obtained by solving simultaneously
$\sum Y = aN + b\sum X$ …………(5.2)
$\sum XY = a\sum X + b\sum X^2$ …………(5.2)
observations and then can be substituted in the above equations to obtain the value of a and b.
Since solving the two normal equations simultaneously for a and b may quite often be
cumbersome and time consuming, the two values can be directly obtained as
$a = \bar{Y} - b\bar{X}$ …………(5.3)
and
$b = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$ …………(5.4)
Note: Eq. (5.3) is obtained simply by dividing both sides of the first of Eqs. (5.2) by N, and Eq. (5.4) is obtained by substituting $(\bar{Y} - b\bar{X})$ in place of a in the second of Eqs. (5.2).
$a = \dfrac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - (\sum X)^2}$ …………(5.5)
and
$b = \dfrac{\bar{Y} - a}{\bar{X}}$ …………(5.6)
Note: Eq. (5.5) is obtained by substituting $\dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$ for b in Eq. (5.3), and Eq. (5.6) follows by rearranging Eq. (5.3).
Table 5-2
Computation of a and b
Y      X      XY      X²      Y²
14     10     140     100     196
17     12     204     144     289
23     15     345     225     529
21     20     420     400     441
25     23     575     529     625
∑Y = 100    ∑X = 80    ∑XY = 1684    ∑X² = 1398    ∑Y² = 2080
$a = \dfrac{100 \times 1398 - 80 \times 1684}{5 \times 1398 - (80)^2} = \dfrac{139800 - 134720}{6990 - 6400} = \dfrac{5080}{590} = 8.6101695$
and
$b = \dfrac{5 \times 1684 - 80 \times 100}{5 \times 1398 - (80)^2} = \dfrac{8420 - 8000}{6990 - 6400} = \dfrac{420}{590} = 0.7118644$
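The hand computation of a and b can be reproduced from the sales and marketing expenditure data. The following is a minimal sketch of the least squares formulas for the slope and the intercept; the function and variable names are assumptions.

```python
# Sales (Y, Rs lac) and marketing expenditure (X, Rs thousands) for the five months.
sales = [14, 17, 23, 21, 25]
spend = [10, 12, 15, 20, 23]


def fit_line(x, y):
    """Least squares intercept a and slope b of the regression line of Y on X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n   # a = Y-bar minus b times X-bar
    return a, b


a, b = fit_line(spend, sales)
print(round(a, 7), round(b, 7))
```

Computing b first and then a from the two means avoids solving the two normal equations simultaneously.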
Figure 5-2 Regression Line of Y on X
Then, to fit the line of best fit on the scatter diagram, only two computed Yc values are
needed. These can be easily obtained by substituting any two values of X in Eq. (5.1a). When
these are plotted on the diagram against their corresponding values of X, we get two points;
joining them by a straight line gives us the required line of best fit, as shown
in Figure 5-2.
We can have some important relationships for data analysis, involving other measures such as
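$Y_c = (\bar{Y} - b\bar{X}) + bX$
or $Y_c - \bar{Y} = b(X - \bar{X})$ …………(5.7)
Dividing the numerator and the denominator of Eq. (5.4) by N²,
$b = \dfrac{\dfrac{\sum XY}{N} - \left(\dfrac{\sum X}{N}\right)\left(\dfrac{\sum Y}{N}\right)}{\dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2}$
or $b = \dfrac{\dfrac{\sum XY}{N} - \bar{X}\bar{Y}}{S_x^2}$
or $b = \dfrac{Cov(X, Y)}{S_x^2}$ …………(5.8)
Since the correlation coefficient is
$r_{xy} = \dfrac{Cov(X, Y)}{S_x S_y}$
or $Cov(X, Y) = r_{xy} S_x S_y$
we have
$b = r_{xy}\dfrac{S_x S_y}{S_x^2}$
$b = r_{xy}\dfrac{S_y}{S_x}$ …………(5.9)
Substituting $r_{xy}\dfrac{S_y}{S_x}$ for b in Eq. (5.7), we get
$Y_c - \bar{Y} = r_{xy}\dfrac{S_y}{S_x}(X - \bar{X})$ …………(5.10)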
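The relation between the regression coefficient, the covariance and the correlation coefficient can be verified numerically on the sales and marketing expenditure data. In this sketch the standard deviations are computed with divisor N, which is what the covariance form of b requires; the variable names are assumptions.

```python
from math import sqrt

# Sales (Y, Rs lac) and marketing expenditure (X, Rs thousands) for the five months.
y = [14, 17, 23, 21, 25]
x = [10, 12, 15, 20, 23]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Covariance and standard deviations, all with divisor N.
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
s_x = sqrt(sum(xi ** 2 for xi in x) / n - x_bar ** 2)
s_y = sqrt(sum(yi ** 2 for yi in y) / n - y_bar ** 2)

r_xy = cov_xy / (s_x * s_y)

b_from_cov = cov_xy / s_x ** 2    # b = Cov(X, Y) / Sx^2
b_from_r = r_xy * s_y / s_x       # b = r * Sy / Sx

print(round(r_xy, 4), round(b_from_cov, 7), round(b_from_r, 7))
```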
The main objective of regression analysis is to know the nature of relationship between two
variables and to use it for predicting the most likely value of the dependent variable
corresponding to a given, known value of the independent variable. This can be done by
substituting in Eq.(5.1a) any known value of X corresponding to which the most likely
estimate of Y is to be found.
Yc = 8.61 + 0.71(15)
= 8.61 + 10.65
= 19.26
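The estimate above uses the coefficients rounded to two decimals. The sketch below repeats the prediction for X = 15 with both the rounded and the unrounded coefficients, to show how much the rounding matters; the function name `predict` is an assumption.

```python
def predict(x, a, b):
    """Estimated Y for a given X from the fitted line Yc = a + bX."""
    return a + b * x


# Coefficients rounded to two decimals, as used in the hand calculation.
rounded_estimate = predict(15, a=8.61, b=0.71)

# Unrounded least squares coefficients, a = 5080/590 and b = 420/590.
exact_estimate = predict(15, a=5080 / 590, b=420 / 590)

print(round(rounded_estimate, 2), round(exact_estimate, 2))
```

The two estimates differ only in the second decimal place here, so the rounded line is adequate for a quick forecast.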
It may be appreciated that an estimate of Y derived from a regression equation will not be
exactly the same as the Y value which may actually be observed. The difference between
estimated Yc values and the corresponding observed Y values will depend on the extent of
The closer the various paired sample points (Y, X) are clustered around the line of best fit, the
smaller the difference between the estimated Yc and observed Y values, and vice-versa. On the
whole, the lesser the scatter of the various points around, and the lesser the vertical distance
by which these deviate from the line of best fit, the more likely it is that an estimated Yc value
The estimated Yc values will coincide with the observed Y values only when all the points on the
scatter diagram fall in a straight line. If this were to be so, the sales for a given marketing
expenditure could have been estimated with 100 percent accuracy. But such a situation is too
rare to obtain. Since some of the points must lie above and some below the straight line,
perfect prediction is practically non-existent in the case of most business and economic
situations.
This means that the estimated values of one variable based on the known values of the other
variable are always bound to differ. The smaller the difference, the greater the precision of
the estimate, and vice-versa. Accordingly, the preciseness of an estimate can be obtained only
through a measure of the magnitude of error in the estimates, called the error of estimate.
$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}}$ …………(5.11)
Syx measures the average absolute amount by which observed Y values depart from the
Computation of Syx becomes a little cumbersome where the number of observations N is large.
$S_{yx} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{N}}$ …………(5.12)
values of a and b
We have
$S_{yx} = \sqrt{\frac{23.36}{5}} = \sqrt{4.67} = 2.16$
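As a check on the arithmetic, the two formulas for Syx can be evaluated side by side. The sketch below is illustrative only (the function names are not from the text). It shows that the numerator 23.36 corresponds to the rounded coefficients a = 8.61 and b = 0.71, and that with the unrounded coefficients the shortcut formula (5.12) agrees with the definition (5.11) and gives a value of about 2.01.

```python
import math

Y = [14, 17, 23, 21, 25]
X = [10, 12, 15, 20, 23]
N = len(X)

sum_y = sum(Y)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))


def syx_shortcut(a, b):
    """Eq. (5.12): sqrt((sum Y^2 - a*sum Y - b*sum XY) / N)."""
    return math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / N)


def syx_from_residuals(a, b):
    """Eq. (5.11): sqrt(sum (Y - Yc)^2 / N)."""
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / N)


# Rounded coefficients, as used in the worked computation above.
rounded = syx_shortcut(8.61, 0.71)

# Unrounded least-squares coefficients (5080/590 and 420/590).
a_full, b_full = 5080 / 590, 420 / 590
full_shortcut = syx_shortcut(a_full, b_full)
full_residual = syx_from_residuals(a_full, b_full)

print(round(rounded, 2), round(full_shortcut, 2), round(full_residual, 2))
```

The shortcut formula is exact only when a and b are the least-squares values, which is why rounding the coefficients shifts the result.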
Interpretations of Syx
A careful observation of how the standard error of estimate is computed reveals the
following:
1. Syx is a concept statistically parallel to the standard deviation Sy. The only difference
between the two is that the standard deviation measures the dispersion around the
mean, whereas the standard error of estimate measures the dispersion around the regression
line. Similar to the property of arithmetic mean, the sum of the deviations of different
2. Syx tells us the amount by which the estimated Yc values will, on an average, deviate
from the observed Y values. Hence it is an estimate of the average amount of error in
the estimated Yc values. The actual error (the residual of Y and Yc) may, however, be
smaller or larger than the average error. Theoretically, these errors follow a normal
distribution. Thus, assuming that n ≥ 30, Yc ± 1.Syx means that 68.27% of the estimates
based on the regression equation will be within 1.Syx Similarly, Yc ± 2.Syx means that
thousand being Rs 19.26 lac, one may like to know how good this estimate is. Since
Syx is estimated to be Rs 2.16 lac, it means there are about 68 chances (68.27) out of
100 that this estimate is in error by not more than Rs 2.16 lac above or below Rs
19.26 lac. That is, there are 68% chances that actual sales would fall between (19.26 -
3. Since Syx measures the closeness of the observed Y values and the estimated Yc values,
it also serves as a measure of the reliability of the estimate. The greater the closeness
between the observed and estimated values of Y, the lesser the error and,
4. Standard error of estimate Syx can also be seen as a measure of correlation insofar as it
expresses the degree of closeness of scatter of observed Y values about the regression
line. The closer the observed Y values are scattered around the regression line, the higher
same units of measurement as the data on the dependent variable. This creates
correlation. It is mainly due to this limitation that the standard error of estimate is not
So far we have considered the regression of Y on X, in the sense that Y was in the role of
dependent and X in the role of an independent variable. In their reverse position, such that X
is now the dependent and Y the independent variable, we fit a line of regression of X on Y.
Xc = a’ + b’Y …………(5.13)
Where Xc denotes the computed values of X against the corresponding values of Y. a’ is the
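$\sum XY = a' \sum Y + b' \sum Y^2$ …………(5.14)
$a' = \bar{X} - b'\bar{Y}$ …………(5.15)
and
$b' = \frac{N \sum XY - \sum X \sum Y}{N \sum Y^2 - (\sum Y)^2}$ …………(5.16)
or
$a' = \frac{\sum X \sum Y^2 - \sum Y \sum XY}{N \sum Y^2 - (\sum Y)^2}$ …………(5.17)
and
$b' = \frac{\bar{X} - a'}{\bar{Y}}$ …………(5.18)
$b' = \frac{\mathrm{Cov}(Y, X)}{S_y^2}$ …………(5.19)
$b' = r_{yx} \frac{S_x}{S_y}$ …………(5.20)
$X_c - \bar{X} = b'(Y - \bar{Y})$ …………(5.21)
$X_c - \bar{X} = r_{yx} \frac{S_x}{S_y} (Y - \bar{Y})$ …………(5.22)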
As before, once the values of a’ and b’ have been found, their substitution in Eq.(5.13) will
$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}}$ …………(5.23)
or
$S_{xy} = \sqrt{\frac{\sum X^2 - a' \sum X - b' \sum XY}{N}}$ …………(5.24)
For example, if we want to estimate the marketing expenditure to achieve a sale target of Rs
Xc = a’ + b’Y
So using Eqs. (5.17) and (5.16), and substituting the values of ∑X, ∑Y, ∑Y² and ∑XY
$a' = \frac{166400 - 168400}{10400 - 10000} = \frac{-2000}{400} = -5.00$
and
$b' = \frac{5 \times 1684 - 80 \times 100}{5 \times 2080 - (100)^2} = \frac{8420 - 8000}{10400 - 10000} = \frac{420}{400} = 1.05$
Now given that a’= -5.00 and b’=1.05, Regression equation (5.13) takes the form
Xc = -5.00 +1.05Y
Xc = -5.00+1.05x40
= -5 + 42
= 37
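The regression of X on Y can be checked the same way. The following is a small sketch, assuming the Table 5-2 data and the formulas of Eqs. (5.16) and (5.17); the variable names are illustrative.

```python
# Table 5-2 data; here X is treated as the dependent variable.
Y = [14, 17, 23, 21, 25]
X = [10, 12, 15, 20, 23]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_y2 = sum(y * y for y in Y)

denom = n * sum_y2 - sum_y ** 2                        # N*sum(Y^2) - (sum Y)^2
a_prime = (sum_x * sum_y2 - sum_y * sum_xy) / denom    # Eq. (5.17)
b_prime = (n * sum_xy - sum_x * sum_y) / denom         # Eq. (5.16)

x_at_40 = a_prime + b_prime * 40                       # Xc for Y = 40
print(a_prime, b_prime, round(x_at_40, 2))
```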
marketing.
the effect on dependent variable if there is a unit change in the independent variable. Since
for a paired data on X and Y variables, there are two regression lines: regression line of Y on X
The following are the important properties of regression coefficients that are helpful in data
analysis
1. The two regression coefficients cannot both be greater than 1 in magnitude. Either both
are below 1, or at least one of them is below 1, so
that the square root of the product of the two regression coefficients lies within the limits
±1.
$r = \pm\sqrt{b \cdot b'}$ …………(5.25)
The signs of both the regression coefficients are the same, and so the value of r will
3. The mean of both the regression coefficients is either equal to or greater than the
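$\frac{b + b'}{2} \ge r$
$U = \frac{X - A}{h}$ and $V = \frac{Y - B}{k}$
$b_{yx} = \frac{k}{h} b_{vu}$
and
$b_{xy} = \frac{h}{k} b_{uv}$
$r^2 = b \cdot b'$
$Y - \bar{Y} = r \frac{S_y}{S_x} (X - \bar{X})$ and $X - \bar{X} = r \frac{S_x}{S_y} (Y - \bar{Y})$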
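We can write the slopes of these lines, as
$b = r \frac{S_y}{S_x}$ and $b' = r \frac{S_x}{S_y}$
Measured against the X-axis, the line of Y on X has slope b and the line of X on Y has slope 1/b', so the angle θ between them satisfies
$\tan\theta = \frac{b - 1/b'}{1 + b/b'}$
$= \frac{S_x S_y}{S_x^2 + S_y^2} \left( \frac{r^2 - 1}{r} \right)$
or $\theta = \tan^{-1}\left[ \frac{S_x S_y}{S_x^2 + S_y^2} \left( \frac{r^2 - 1}{r} \right) \right]$ …………(5.26)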
Figure 5-3 Regression Lines and Coefficient of Correlation
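To see how the angle in Eq. (5.26) behaves, it can be evaluated numerically. The sketch below is illustrative and makes two assumptions that are not in the text: it reports the acute angle (the absolute value of tan θ) in degrees, and the sample values Sx = 3, Sy = 4 are arbitrary. It also cross-checks Eq. (5.26) against the slopes of the two lines measured against the X-axis.

```python
import math


def angle_between_regression_lines(r, sx, sy):
    """Acute angle in degrees from Eq. (5.26); r must be non-zero."""
    tan_theta = (sx * sy / (sx ** 2 + sy ** 2)) * ((r ** 2 - 1) / r)
    return math.degrees(math.atan(abs(tan_theta)))


def angle_from_slopes(r, sx, sy):
    """Same angle computed directly from the two slopes against the X-axis."""
    m1 = r * sy / sx        # slope of the line of Y on X
    m2 = sy / (r * sx)      # slope of the line of X on Y, i.e. 1/b'
    return math.degrees(math.atan(abs((m1 - m2) / (1 + m1 * m2))))


# Assumed sample values: Sx = 3, Sy = 4.
for r in (1.0, 0.9, 0.6, 0.2):
    print(r, round(angle_between_regression_lines(r, 3, 4), 2))
```

With perfect correlation the angle is zero and the lines coincide; as r moves toward zero the angle widens toward a right angle.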
Eq. (5.26) reveals the following:
correlation (r = -1), θ = 0, so the two regression lines will coincide, i.e. we have only
The farther the two regression lines are from each other, the lesser will be the degree of
correlation, and the nearer the two regression lines, the greater will be the degree of correlation,
If the variables are independent, i.e. r = 0, the lines of regression will cut each other at
Note: Both the regression lines cut each other at the mean value of X and the mean value of Y, i.e. at
$\bar{X}$ and $\bar{Y}$.
accounted for by the independent variable. In other words, the coefficient of determination
gives the ratio of the explained variance to the total variance. The coefficient of
determination is given by the square of the correlation coefficient, i.e. r2. Thus,
Coefficient of determination
$r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}$
$r^2 = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$ …………(5.27)
We can calculate another coefficient K2, known as coefficient of Non-Determination, which
$K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2}$ …………(5.28)
$K^2 = 1 - \frac{\text{Explained Variance}}{\text{Total Variance}}$
$= 1 - r^2$ …………(5.29)
The square root of the coefficient of non-determination, i.e. K, gives the coefficient of
alienation
$K = \pm\sqrt{1 - r^2}$ …………(5.30)
A simple algebraic operation on Eq. (5.30) brings out some interesting points about the
$\sum (Y - Y_c)^2 = N S_{yx}^2$ and $\sum (Y - \bar{Y})^2 = N S_y^2$
$K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2}$
$K^2 = \frac{N S_{yx}^2}{N S_y^2} = \frac{S_{yx}^2}{S_y^2}$
So $1 - r^2 = \frac{S_{yx}^2}{S_y^2}$
or $\frac{S_{yx}}{S_y} = \sqrt{1 - r^2}$ …………(5.31)
If the coefficient of correlation, r, is defined as the square root of the coefficient of determination
$r = \sqrt{r^2}$
$r^2 = 1 - \frac{S_{yx}^2}{S_y^2}$
$r = \sqrt{1 - \frac{S_{yx}^2}{S_y^2}}$ …………(5.32)
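The identities above can be verified numerically on the Table 5-2 data. This is a minimal sketch with illustrative variable names. It computes r² as the product of the two regression coefficients, K² as 1 − r², and confirms that the ratio of the squared standard error of estimate to the variance of Y equals K². With unrounded coefficients r² is about 0.7475; the product of the rounded coefficients, 0.71 × 1.05, gives 0.7455.

```python
Y = [14, 17, 23, 21, 25]
X = [10, 12, 15, 20, 23]
n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

# Population-style moments (divide by N), matching the text's definitions.
cov_xy = sum(x * y for x, y in zip(X, Y)) / n - mean_x * mean_y
var_x = sum(x * x for x in X) / n - mean_x ** 2
var_y = sum(y * y for y in Y) / n - mean_y ** 2

b = cov_xy / var_x            # regression coefficient of Y on X
b_prime = cov_xy / var_y      # regression coefficient of X on Y
r_squared = b * b_prime       # coefficient of determination
k_squared = 1 - r_squared     # coefficient of non-determination

a = mean_y - b * mean_x
syx_squared = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y)) / n
ratio = syx_squared / var_y   # should equal 1 - r^2

print(round(r_squared, 4), round(k_squared, 4), round(ratio, 4))
```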
On carefully observing Eq. (5.32), it will be noticed that the ratio Syx/Sy will be large if the
Eq. (5.32) also implies that Syx is generally less than Sy. The two can at the most be equal, but
Interpretations of r2:
1. Even though the coefficient of determination, whose square root measures the degree
pure number, the unit in which Syx is measured becomes irrelevant. This facilitates
comparison between the two sets of data in terms of their coefficient of determination
r2 (or the coefficient of correlation r). This was not possible in terms of Syx as the
2. The value of r2 can range between 0 and 1. When r2 = 1, all the points on the scatter
diagram fall on the regression line and the entire variations are explained by the
straight line. On the other hand, when r2 = 0, none of the points on the scatter diagram
falls on the regression line, meaning thereby that there is no relationship between the
not tell us about the direction of the relationship (whether it is positive or negative)
3. When r2 = 0.7455 (or any other value), 74.55% of the total variations in sales are
explained by the marketing expenditure used. What remains is the coefficient of non-
unexplained, which are due to factors other than the changes in the marketing
expenditure.
4. r2 provides the necessary link between regression and correlation which are the two
between the variables under study, without making a distinction between the
dependent and independent ones. Nor does it, therefore, help in predicting the value of
5. The coefficient of correlation overstates the degree of relationship and its meaning is
correlation between sales and marketing expenditure. Therefore, the coefficient of
6. The sum of r and K never adds to one, unless one of the two is zero. That is, r + K can
the relationship between the variables. If we have information on more than one variable, we
might be interested in seeing if there is any connection - any association - between them. If
we found such an association, we might again be interested in predicting the value of one
1. Correlation literally means the relationship between two or more variables that vary in
movements in the other(s). On the other hand, regression means stepping back or
returning to the average value and is a mathematical measure expressing the average
2. Correlation coefficient rxy between two variables X and Y is a measure of the direction
and degree of the linear relationship between two variables that is mutual. It is
symmetric, i.e., ryx = rxy and it is immaterial which of X and Y is dependent variable
Regression analysis aims at establishing the functional relationship between the two
(or more) variables under study and then using this relationship to predict or estimate
the value of the dependent variable for any given value of the independent variable(s).
It also reflects upon the nature of the variable, i.e., which is dependent variable and
3. Correlation need not imply cause and effect relationship between the variables under
study. However, regression analysis clearly indicates the cause and effect relationship
between ±1.
On the other hand, the regression coefficients, byx and bxy are absolute measures
representing the change in the value of the variable Y (or X), for a unit change in the
value of the variable X (or Y). Once the functional form of regression curve is known;
by substituting the value of the independent variable we can obtain the value of the
dependent variable and this value will be in the units of measurement of the
dependent variable.
5. There may be nonsense correlation between two variables that is due to pure chance
and has no practical relevance, e.g., the correlation, between the size of shoe and the
regression.
a term of 5 years and the sale of motor tyres by a firm in that territory for the same
period.
Solution: Here the dependent variable is number of tyres; dependent on motor registrations.
Hence we put motor registrations as X and sales of tyres as Y and we have to establish the
regression line of Y on X.
X      Y      dx = X − X̄      dy = Y − Ȳ      dx²      dx dy
∑X = 3,500   ∑Y = 6,500   ∑dx = 0   ∑dy = 0   ∑dx² = 27,800   ∑dx dy = 41,500
$\bar{X} = \frac{\sum X}{N} = \frac{3{,}500}{5} = 700$ and $\bar{Y} = \frac{\sum Y}{N} = \frac{6{,}500}{5} = 1{,}300$
$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{41{,}500}{27{,}800} = 1.4928$
Y = 1.4928 X + 255.04
When X = 850, the value of Y can be calculated from the above equation, by putting X = 850
in the equation.
Y = 1.4928 x 850 + 255.04
= 1523.92
= 1,524 Tyres
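The deviation-from-mean method used in this example needs only the summary totals. The sketch below is illustrative; it starts from the totals quoted in the solution and reproduces the slope, the intercept and the estimate for X = 850.

```python
# Summary totals quoted in the solution of Example 5-1.
n = 5
sum_x, sum_y = 3500, 6500
sum_dx2 = 27800        # sum of squared deviations of X from its mean
sum_dxdy = 41500       # sum of products of deviations of X and Y

mean_x = sum_x / n
mean_y = sum_y / n

b_yx = sum_dxdy / sum_dx2              # slope of the line of Y on X
intercept = mean_y - b_yx * mean_x     # from Y - mean_y = b_yx (X - mean_x)

y_at_850 = intercept + b_yx * 850
print(round(b_yx, 4), round(intercept, 2), round(y_at_850))
```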
Example 5-2
A panel of Judges A and B graded seven debaters and independently awarded the
following marks:
3 28 26
4 30 30
5 44 38
6 38 34
7 31 28
An eighth debater was awarded 36 marks by Judge A, while Judge B was not present. If
Judge B were also present, how many marks would you expect him to award to the eighth
debater, assuming that the same degree of relationship exists in their judgement?
Solution: Let us use marks from Judge A as X and those from Judge B as Y. Now we have to
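N = 7   ∑U = 0   ∑V = 17   ∑U² = 206   ∑V² = 185   ∑UV = 121
$\bar{X} = A + \frac{\sum U}{N} = 35 + \frac{0}{7} = 35$ and $\bar{Y} = B + \frac{\sum V}{N} = 30 + \frac{17}{7} = 32.43$
$b_{yx} = b_{vu} = \frac{N \sum UV - \sum U \sum V}{N \sum U^2 - (\sum U)^2} = \frac{7 \times 121 - 0 \times 17}{7 \times 206 - 0} = 0.587$
$Y - \bar{Y} = b_{yx}(X - \bar{X})$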
or Y = 0.587X + 11.87
Y = 0.587 x 36 + 11.87
= 33
Thus if Judge B were present, he would have awarded 33 marks to the eighth debater.
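The assumed-mean calculation in this example can be replayed from the totals alone. In the sketch below the origins 35 and 30 and the totals are the ones quoted in the solution; the variable names are illustrative.

```python
# Totals quoted in the solution of Example 5-2, with U = X - 35 and V = Y - 30.
n = 7
origin_x, origin_y = 35, 30
sum_u, sum_v = 0, 17
sum_u2, sum_uv = 206, 121

mean_x = origin_x + sum_u / n
mean_y = origin_y + sum_v / n

# A shift of origin leaves the regression coefficient unchanged, so b_yx = b_vu.
b_yx = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
intercept = mean_y - b_yx * mean_x

marks_b = intercept + b_yx * 36     # expected marks from Judge B for X = 36
print(round(b_yx, 3), round(intercept, 2), round(marks_b))
```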
Example 5-3
For some bivariate data, the following results were obtained.
$\bar{X} = 53.2$   $\bar{Y} = 27.9$
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
or Y = -1.5X + 107.7
Y = -1.5 x 60 + 107.7
= 17.7
$r^2 = b_{yx} b_{xy}$
= (-1.5) x (–0.2)
= 0.3
So $r = \pm\sqrt{0.3} = \pm 0.5477$
Since both the regression coefficients are negative, we assign negative value to the
correlation coefficient
r = - 0.5477
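The sign rule used here can be written as a small helper. This is a sketch, not a library routine; the function name is hypothetical. It takes the square root of the product of the two regression coefficients and gives it the sign the coefficients share, rejecting inputs whose signs differ.

```python
import math


def correlation_from_coefficients(b_yx, b_xy):
    """Return r = sign * sqrt(b_yx * b_xy), with the common sign of the coefficients."""
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    # copysign attaches the sign of b_yx to the positive square root.
    return math.copysign(math.sqrt(product), b_yx)


r = correlation_from_coefficients(-1.5, -0.2)
print(round(r, 4))
```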
Example 5-4
Write regression equations of X on Y and of Y on X for the following data
X: 45 48 50 55 65 70 75 72 80 85
Y: 25 30 35 30 40 50 45 55 60 65
Solution: We prepare the table for working out the values for the regression lines.
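X      Y      U = X-65   V = Y-45   U2      UV      V2
45     25     -20        -20        400     400     400
48     30     -17        -15        289     255     225
50     35     -15        -10        225     150     100
55     30     -10        -15        100     150     225
65     40       0         -5          0       0      25
70     50       5          5         25      25      25
75     45      10          0        100       0       0
72     55       7         10         49      70     100
80     60      15         15        225     225     225
85     65      20         20        400     400     400
∑X = 645   ∑Y = 435   ∑U = -5   ∑V = -15   ∑U² = 1813   ∑UV = 1675   ∑V² = 1725
We have,
$\bar{X} = \frac{\sum X}{N} = \frac{645}{10} = 64.5$ and $\bar{Y} = \frac{\sum Y}{N} = \frac{435}{10} = 43.5$
$b_{yx} = \frac{N \sum UV - \sum U \sum V}{N \sum U^2 - (\sum U)^2} = \frac{(10) \times 1675 - (-5) \times (-15)}{(10) \times 1813 - (-5)^2} = \frac{16675}{18105} = 0.921$
Regression equation of Y on X is
Y - 43.5 = 0.921 (X - 64.5)
or Y = 0.921X - 15.91
$b_{xy} = \frac{N \sum UV - \sum U \sum V}{N \sum V^2 - (\sum V)^2} = \frac{(10) \times 1675 - (-5) \times (-15)}{(10) \times 1725 - (-15)^2} = \frac{16675}{17025} = 0.979$
Regression equation of X on Y is
X - 64.5 = 0.979 (Y - 43.5)
or X = 0.979Y + 21.89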
Example 5-5
The lines of regression of a bivariate population are
8X – 10Y + 66 = 0
Solution: The regression lines given are
8X – 10Y + 66 = 0
Since both the lines of regression pass through the mean values, the point $(\bar{X}, \bar{Y})$ will satisfy
$8\bar{X} - 10\bar{Y} + 66 = 0$
$40\bar{X} - 18\bar{Y} - 214 = 0$
$\bar{X} = 13$ and $\bar{Y} = 17$
(ii) For correlation coefficient between X and Y, we have to calculate the values of byx and
bxy
10Y = 8X + 66
$r^2 = b_{yx} \cdot b_{xy}$
So $r = +\sqrt{9/25}$
= + 0.6
Both the values of the regression coefficients being positive, we have to consider only the
Sx = ± 3
Sy = 4/5 x 3/0.6
= 4
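The steps of this solution can be sketched in code. The helper `solve_2x2` is illustrative (plain Cramer's rule); the two lines, the choice of 8X – 10Y + 66 = 0 as the line of Y on X, and the value Sx = 3 are taken from the solution above.

```python
import math


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the two lines are parallel or identical")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


# 8X - 10Y + 66 = 0  and  40X - 18Y - 214 = 0, rewritten as a*x + b*y = c.
mean_x, mean_y = solve_2x2(8, -10, -66, 40, -18, 214)

b_yx = 8 / 10      # from 10Y = 8X + 66 (line of Y on X)
b_xy = 18 / 40     # from 40X = 18Y + 214 (line of X on Y)
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

s_x = 3                    # value used in the solution
s_y = b_yx * s_x / r       # from b_yx = r * Sy / Sx
print(mean_x, mean_y, round(r, 2), round(s_y, 2))
```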
Example 5-6
The height of a child increases at a rate given in the table below. Fit the straight line
using the method of least-square and calculate the average increase and the standard
error of estimate.
Month: 1 2 3 4 5 6 7 8 9 10
Height: 52.5 58.7 65 70.2 75.4 81.1 87.2 95.5 102.2 108.4
∑X = 55   ∑Y = 796.2   ∑X² = 385   ∑XY = 4887.5
Considering the regression line as Y = a + bX, we can obtain the values of a and b from the
above values.
$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N \sum X^2 - (\sum X)^2}$
= 45.73
$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2} = \frac{10 \times 4887.5 - 55 \times 796.2}{10 \times 385 - 55 \times 55}$
= 6.16
Y = 45.73 + 6.16X
For standard error of estimation, we note the calculated values of the variable against the
observed values,
$\sum (Y - Y_c)^2 = 10.421$
$S_{yx} = \sqrt{\frac{1}{N} \sum (Y - Y_c)^2} = \sqrt{\frac{10.421}{10}}$
= 1.02
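The whole of this example, from the sums to the standard error of estimate, can be reproduced from the month and height data. The following is a minimal sketch with illustrative variable names. The slope b is the average monthly increase in height.

```python
import math

months = list(range(1, 11))
heights = [52.5, 58.7, 65, 70.2, 75.4, 81.1, 87.2, 95.5, 102.2, 108.4]

n = len(months)
sum_x, sum_y = sum(months), sum(heights)
sum_x2 = sum(x * x for x in months)
sum_xy = sum(x * y for x, y in zip(months, heights))

denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom    # intercept
b = (n * sum_xy - sum_x * sum_y) / denom         # average increase per month

# Sum of squared residuals and the standard error of estimate.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(months, heights))
syx = math.sqrt(sse / n)

print(round(a, 2), round(b, 2), round(syx, 2))
```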
Example 5-7
Given X = 4Y+5 and Y = kX + 4 are the lines of regression of X on Y and of Y on X
If k = 1/16, find the means of the two variables and coefficient of correlation between them.
So $b_{xy} = 4$
We get $b_{yx} = k$
Now
$r^2 = b_{xy} \cdot b_{yx}$
= 4k
Since $0 \le r^2 \le 1$, we obtain $0 \le 4k \le 1$,
or $0 \le k \le \frac{1}{4}$,
Now for $k = \frac{1}{16}$,
$r^2 = 4 \times \frac{1}{16} = \frac{1}{4}$
r = + ½
When $k = \frac{1}{16}$, the regression line of Y on X becomes
$Y = \frac{1}{16} X + 4$
or X – 16Y + 64 = 0
Since the lines of regression pass through the mean values of the variables, we obtain revised
equations as
$\bar{X} - 4\bar{Y} - 5 = 0$
$\bar{X} - 16\bar{Y} + 64 = 0$
$\bar{X} = 28$ and $\bar{Y} = 5.75$
Example 5-8
A firm knows from its past experience that its monthly average expenses (X) on
advertisement are Rs 25,000 with standard deviation of Rs 25.25. Similarly, its average
monthly product sales (Y) have been Rs 45,000 with standard deviation of Rs 50.50. Given
this information and also the coefficient of correlation between sales and advertisement
50,000
(ii) the most appropriate advertisement expenditure for achieving a sales target of
Rs 80,000
$\bar{X}$ = Rs 25,000   $S_x$ = Rs 25.25
$\bar{Y}$ = Rs 45,000   $S_y$ = Rs 50.50
r = 0.75
(i) Using equation $Y_c - \bar{Y} = r \frac{S_y}{S_x} (X - \bar{X})$, the most appropriate value of sales Yc for an
$Y_c - 45{,}000 = 0.75 \times \frac{50.50}{25.25} (50{,}000 - 25{,}000)$
Yc = 45,000 + 37,500
= Rs 82,500
(ii) Using equation $X_c - \bar{X} = r \frac{S_x}{S_y} (Y - \bar{Y})$, the most appropriate value of advertisement
$X_c - 25{,}000 = 0.75 \times \frac{25.25}{50.50} (80{,}000 - 45{,}000)$
Xc = 13,125 + 25,000
= Rs 38,125
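When only the means, standard deviations and r are known, both estimates follow directly from Eqs. (5.10) and (5.22). The sketch below uses hypothetical function names and the figures of this example.

```python
def estimate_y(x, mean_x, mean_y, r, s_x, s_y):
    """Eq. (5.10): Yc = mean_y + r * (Sy / Sx) * (X - mean_x)."""
    return mean_y + r * (s_y / s_x) * (x - mean_x)


def estimate_x(y, mean_x, mean_y, r, s_x, s_y):
    """Eq. (5.22): Xc = mean_x + r * (Sx / Sy) * (Y - mean_y)."""
    return mean_x + r * (s_x / s_y) * (y - mean_y)


# Figures from Example 5-8 (all in Rs).
params = dict(mean_x=25000, mean_y=45000, r=0.75, s_x=25.25, s_y=50.50)

sales = estimate_y(50000, **params)          # part (i)
advertisement = estimate_x(80000, **params)  # part (ii)
print(sales, advertisement)
```

Note that the two estimates come from two different regression lines, so part (ii) cannot be obtained by inverting the equation used in part (i).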
regression for each bivariate distribution? How the two regression lines are useful in
(v) Coefficient of Non-determination
analysis.
X : 1 3 4 8 9 11 14
Y : 1 2 4 5 7 8 9
Hence obtain
d) X and Y
8. What are regression coefficients? Show that r2 = byx. bxy where the symbols have their
usual meanings. What can you say about the angle between the regression lines when
9. Obtain the equations of the lines of regression of Y on X from the following data.
X : 12 18 24 30 36 42 48
Y : 5.27 5.68 6.25 7.21 8.02 8.71 8.42
10. The following table gives the ages and blood pressure of 9 women.
Age (X) : 56 42 36 47 49 42 60 72 63
Blood Pressure(Y) 147 125 118 128 145 140 155 160 149
(ii) Estimate the blood pressure of a woman whose age is 45 years.
11. Given the following results for the height (X) and weight (Y) in appropriate units of
1,000 students:
Obtain the equations of the two lines of regression. Estimate the height of a student A
who weighs 200 units and also estimate the weight of the student B whose height is
60 units.
12. From the following data, find out the probable yield when the rainfall is 29”.
Rainfall Yield
Mean 25” 40 units per hectare
Standard Deviation 3” 6 units per hectare
13. A study of wheat prices at two cities yielded the following data:
City A City B
Coefficient of correlation r is 0.774. Estimate from the above data the most likely
price of wheat
14. Find out the regression equation showing the regression of capacity utilisation on
r = 0.62
Estimate the production, when capacity utilisation is 70%.
15. The following table shows the mean and standard deviation of the prices of two shares
in a stock exchange.
likely price of share A corresponding to a price of Rs 55, observed in the case of share
B.
16. Find out the regression coefficients of Y on X and of X on Y on the basis of following
data:
Variance of X = 4, Variance of Y = 9
17. Find the regression equation of X and Y and the coefficient of correlation from the
following data:
18. By using the following data, find out the two lines of regression and from them
compute the Karl Pearson’s coefficient of correlation.
∑X = 250, ∑Y = 300, ∑XY = 7900, ∑X² = 6500, ∑Y² = 10000, N = 10
19. The equations of two regression lines between two variables are expressed as
2X – 3Y = 0 and 4Y – 5X-8 = 0.
(i) Identify which of the two can be called regression line of Y on X and of X on Y.
Which of these is the lines of regression of X and Y. Find rxy and Sy when Sx = 3
(iii) Only through interpretation can the researcher better appreciate why his findings are what
they are and make others understand the real significance of his research findings.
(iv) The interpretation of the findings of exploratory research study often results in hypotheses
for experimental research and as such interpretation is involved in the transition from
exploratory to experimental research. Since an exploratory study does not have a hypothesis
to start with, the findings of such a study have to be interpreted on a post-factum basis in
which case the interpretation is technically described as ‘post factum’ interpretation.
The research report is considered a major component of the research study, for the research task remains
incomplete till the report has been presented and/or written. As a matter of fact even the most
brilliant hypothesis, the best designed and conducted research study, and the most striking
generalizations and findings are of little value unless they are effectively communicated to others.
The purpose of research is not well served unless the findings are made known to others. Research
results must invariably enter the general store of knowledge. All this explains the significance of
writing research report. There are people who do not consider writing of report as an integral part of
the research process. But the general opinion is in favour of treating the presentation of research
results or the writing of report as part and parcel of the research project. Writing of report is the last
step in a research study and requires a set of skills somewhat different from those called for in
respect of the earlier stages of research. This task should be accomplished by the researcher with
utmost care; he may seek the assistance and guidance of experts for the purpose.
(iii) Results: A detailed presentation of the findings of the study, with supporting data in the form of
tables and charts together with a validation of results, is the next step in writing the main text of the
report. This generally comprises the main body of the report, extending over several chapters. The
result section of the report should contain statistical summaries and reductions of the data rather than
the raw data. All the results should be presented in logical sequence and split into readily identifiable
sections. All relevant results must find a place in the report. But how one is to decide about what is
relevant is the basic question. Quite often guidance comes primarily from the research problem and
from the hypotheses, if any, with which the study was concerned. But ultimately the researcher must
rely on his own judgement in deciding the outline of his report. “Nevertheless, it is still necessary that
he states clearly the problem with which he was concerned, the procedure by which he worked on
the problem, the conclusions at which he arrived, and the bases for his conclusions.”5
(iv) Implications of the results: Toward the end of the main text, the researcher should again put
down the results of his research clearly and precisely. He should state the implications that flow
from the results of the study, for the general reader is interested in the implications for understanding
the human behaviour. Such implications may have three aspects as stated below:
(a) A statement of the inferences drawn from the present study which may be expected to
apply in similar circumstances.
(b) The conditions of the present study which may limit the extent of legitimate generalizations
of the inferences drawn from the study.
(c) The relevant questions that still remain unanswered or new questions raised by the study
along with suggestions for the kind of research that would provide answers for them.
It is considered a good practice to finish the report with a short conclusion which summarises and
recapitulates the main points of the study. The conclusion drawn from the study should be clearly
related to the hypotheses that were stated in the introductory section. At the same time, a forecast of
the probable future of the subject and an indication of the kind of research which needs to be done in
that particular field is useful and desirable.
(v) Summary: It has become customary to conclude the research report with a very brief summary,
restating in brief the research problem, the methodology, the major findings and the major conclusions
drawn from the research results.
ORAL PRESENTATION
At times oral presentation of the results of the study is considered effective, particularly in cases
where policy recommendations are indicated by project results. The merit of this approach lies in the
fact that it provides an opportunity for give-and-take decisions which generally lead to a better
understanding of the findings and their implications. But the main demerit of this sort of presentation
is the lack of any permanent record concerning the research details and it may be just possible that
the findings may fade away from people’s memory even before an action is taken. In order to
overcome this difficulty, a written report may be circulated before the oral presentation and referred
to frequently during the discussion. Oral presentation is effective when supplemented by various
visual devices. Use of slides, wall charts and blackboards is quite helpful in contributing to clarity and
in reducing the boredom, if any. Distributing a broad outline, with a few important tables and charts
concerning the research results, keeps the listeners attentive, as they have a ready outline on which to
focus their thinking. This very often happens in academic institutions where the researcher discusses
his research findings and policy implications with others either in a seminar or in a group discussion.
Thus, research results can be reported in more than one way, but the usual practice adopted, in
academic institutions particularly, is that of writing the Technical Report and then preparing several
research papers to be discussed at various forums in one form or the other. But in the practical field and
with problems having policy implications, the technique followed is that of writing a popular report.
Researches done on governmental account or on behalf of some major public or private organisations
are usually presented in the form of technical reports.