DT Notes Unit 1 & 2 Part 1
Unit- 1
1.1 INTRODUCTION
and scope. At the macro level, these are data on gross national product and shares of
At the micro level, individual firms, howsoever small or large, produce extensive
statistics on their operations. The annual reports of companies contain a variety of data
These data are often field data, collected by employing scientific survey techniques.
Unless regularly updated, such data are the product of a one-time effort and have limited
use beyond the situation that may have called for their collection. A student knows
physics, and others. It is a discipline, which scientifically deals with data, and is often
described as the science of data. In dealing with statistics as data, statistics has
At the outset, it may be noted that the word ‘statistics’ is used rather curiously in
two senses: plural and singular. In the plural sense, it refers to a set of figures or data. In
the singular sense, statistics refers to the whole body of tools that are used to
collect data, organise and interpret them and, finally, to draw conclusions from them.
It should be noted that both the aspects of statistics are important if the quantitative data
are to serve their purpose. If statistics, as a subject, is inadequate and consists of poor
methodology, we cannot know the right procedure to extract from the data the
information they contain. Similarly, if our data are defective, inadequate,
or inaccurate, we cannot reach the right conclusions even though our subject is well
developed.
A.L. Bowley has defined statistics as: (i) statistics is the science of counting, (ii)
statistics may rightly be called the science of averages, and (iii) statistics is the
science of the measurement of social organism regarded as a whole in all its
manifestations. Boddington defined it as: Statistics is the science of estimates and probabilities.
Further, W.I. King has defined Statistics in a wider context, the science of Statistics is
the method of judging collective, natural or social phenomena from the results obtained
Seligman explored that statistics is a science that deals with the methods of collecting,
some light on any sphere of enquiry. Spiegel defines statistics highlighting its role in
scientific method for collecting, organising, summarising, presenting and analyzing
data as well as drawing valid conclusions and making reasonable decisions on the basis
of such analysis. According to Prof. Horace Secrist, Statistics is the aggregate of facts,
systematic manner for a pre-determined purpose, and placed in relation to each other.
From the above definitions, we can highlight the major characteristics of statistics as
follows:
(i) Statistics are the aggregates of facts. It means a single figure is not statistics.
For example, national income of a country for a single year is not statistics but
(ii) Statistics are affected by a number of factors. For example, sale of a product
(iii) Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to
accurate figures.
haphazard manner, they will not be reliable and will lead to misleading
conclusions.
(vi) Lastly, Statistics should be placed in relation to each other. If one collects data
unrelated to each other, then such data will be confusing and will not lead to any
logical conclusions. Data should be comparable over time and over space.
Statistical data are the basic raw material of statistics. Data may relate to an activity of
result of the process of measuring, counting and/or observing. Statistical data, therefore,
refer to those aspects of a problem situation that can be measured, quantified, counted,
or classified. Any object, subject, phenomenon, or activity that generates data through
this process is termed as a variable. In other words, a variable is one that shows a degree
classified into two broad categories: quantitative data and qualitative data. This
Quantitative data are those that can be quantified in definite units of measurement.
Obviously, a variable may be a continuous variable or a discrete variable.
continuous variable is the one that can assume any value between any two points
precise and close to each other, yet distinguishably different. All characteristics
etc., represent continuous variables. Thus, the data recorded on these and
similar other characteristics are called continuous data. It may be noted that a
(ii) Discrete data are the values assumed by a discrete variable. A discrete variable
is the one whose outcomes are measured in fixed numbers. Such data are
essentially count data. These are derived from a process of counting, such as the
flights at an airport, and the defective items in a consignment received for sale,
characteristic is qualitative in nature when its observations are defined and noted in
terms of the presence or absence of a certain attribute in discrete numbers. These data
(i) Nominal data are the outcome of classification into two or more categories of
females), of workers according to skill (as skilled, semi-skilled, and unskilled),
undergraduates, and post-graduates), all result into nominal data. Given any
particular class and make a summation of items belonging to each class. The
(ii) Rank data, on the other hand, are the result of assigning ranks to specify order
in terms of the integers 1,2,3, ..., n. Ranks may be assigned according to the
Data sources could be seen as of two types, viz., secondary and primary. The two can
be defined as under:
(i) Secondary data: They already exist in some form: published or unpublished -
(ii) Primary data: Those data which do not already exist in any form, and thus have
to be collected for the first time from the primary source(s). By their very nature,
these data require fresh and first-time collection covering the whole population
There are two major divisions of statistics, namely descriptive statistics and inferential
statistics. The term descriptive statistics deals with collecting, summarizing, and
simplifying data, which are otherwise quite unwieldy and voluminous. It seeks to
achieve this in a manner that meaningful conclusions can be readily drawn from the
data. Descriptive statistics may thus be seen as comprising methods of bringing out and
and also makes them amenable to further discussion, analysis, and interpretations.
The first step in any scientific inquiry is to collect data relevant to the problem in hand.
When the inquiry relates to physical and/or biological sciences, data collection is
normally an integral part of the experiment itself. In fact, the very manner in which an
experiment is designed, determines the kind of data it would require and/or generate.
The problem of identifying the nature and the kind of the relevant data is thus
the case of physical sciences. In the case of social sciences, where the required data are
respondents, the problem is not that simply resolved. For one thing, designing the
questionnaire itself is a critical initial problem. For another, the number of respondents
to be accessed for data collection and the criteria for selecting them have their own
implications and importance for the quality of results obtained. Further, the data have
been collected, these are assembled, organized, and presented in the form of
appropriate tables to make them readable. Wherever needed, figures, diagrams, charts,
and graphs are also used for better presentation of the data. A useful tabular and graphic
presentation of data will require that the raw data be properly classified in accordance
with the objectives of investigation and the relational analysis to be carried out.
A well thought-out and sharp data classification facilitates easy description of the
measures of central tendency, dispersion, skewness, and kurtosis, which constitute the
essential scope of descriptive statistics. These form a large part of the subject matter
of any basic textbook on the subject, and thus they are being discussed in that order
here as well.
presenting the related data. Instead, it consists of methods that are used for drawing
of knowledge about a part of that totality. The totality of observations about which an
The part of totality, which is observed for data collection and analysis to gain
The desired information about a given population of our interest may also be collected
even by observing all the units comprising the population. This total coverage is called
census. Getting the desired value for the population through census is not always
feasible and practical for various reasons. Apart from time and money considerations
making the census operations prohibitive, observing each individual unit of the
population with reference to any data characteristic may at times involve even
destructive testing. In such cases, obviously, the only recourse available is to employ
the partial or incomplete information gathered through a sample for the purpose. This is
precisely what inferential statistics does. Thus, obtaining a particular value from the
sample information and using it for drawing an inference about the entire population
situation in which one is required to know the average body weight of all the college
students in a given cosmopolitan city during a certain year. A quick and easy way to do
this is to record the weight of only 500 students, out of a total strength of, say,
10000, or an unknown total strength, take the average, and use this average based on
incomplete weight data to represent the average body weight of all the college students.
In a different situation, one may have to repeat this exercise for some future year and
use the quick estimate of average body weight for a comparison. This may be needed,
for example, to decide whether the weight of the college students has undergone a
example, an inspection of a sample of five battery cells drawn from a given lot may
reveal that all the five cells are in perfectly good condition. This information may be
used to conclude that the entire lot is good enough to buy or not.
Since this inference is based on the examination of a sample of limited number of cells,
it is equally likely that all the cells in the lot are not in order. It is also possible that all
the items that may be included in the sample are unsatisfactory. This may be used to
conclude that the entire lot is of unsatisfactory quality, whereas the fact may indeed be
otherwise. It may, thus, be noticed that there is always a risk of an inference about a
population being incorrect when based on the knowledge of a limited sample. The
rescue in such situations lies in evaluating such risks. For this, statistics provides the
decisions taken on the basis of sample information being incorrect. This requires an
understanding of the what, why, and how of probability and probability distributions
to equip ourselves with methods of drawing statistical inferences and estimating the
degree of reliability of these inferences.
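The idea of estimating a population average from a sample, and the risk attached to it, can be illustrated with a short simulation. This is only a sketch under stated assumptions: the population of 10,000 body weights is synthetic (a normal distribution with a hypothetical mean of 60 kg and standard deviation of 8 kg), and only the sample size of 500 comes from the example in the text.

```python
import random

# Hypothetical population: 10,000 synthetic body weights in kg.
# The mean (60) and spread (8) are assumptions made purely for illustration.
rng = random.Random(42)
population = [rng.gauss(60, 8) for _ in range(10_000)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


population_mean = mean(population)

# Draw one sample of 500 students (without replacement) and estimate the mean.
sample = rng.sample(population, 500)
sample_mean = mean(sample)
sampling_error = abs(sample_mean - population_mean)

# Repeat the exercise to see how the estimate varies from sample to sample.
repeated_means = [mean(rng.sample(population, 500)) for _ in range(20)]
spread = max(repeated_means) - min(repeated_means)

print(f"population mean : {population_mean:.2f}")
print(f"sample mean     : {sample_mean:.2f}")
print(f"sampling error  : {sampling_error:.2f}")
print(f"spread of 20 sample means: {spread:.2f}")
```

Each sample gives a slightly different estimate. That variation is the risk of an incorrect inference which probability theory is used to evaluate.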
Apart from the methods comprising the scope of descriptive and inferential branches of
statistics, statistics also consists of methods of dealing with a few other issues of specific
nature. Since these methods are essentially descriptive in nature, they have been
discussed here as part of the descriptive statistics. These are mainly concerned with the
following:
(i) It often becomes necessary to examine how two paired data sets are related.
For example, we may have data on the sales of a product and the expenditure
incurred on its advertisement for a specified number of years. Given that sales
the nature of relationship between the two and quantify the degree of that
(ii) Situations occur quite often when we require averaging (or totalling) of data
example, price of cloth may be quoted per meter of length and that of wheat per
apply to such price/quantity data, special techniques needed for the purpose are
activity with a view to determining its future behaviour. For example, when
analysis of relevant sales data over time. The more complex the activity, the
more varied the data requirements. For profit maximising and future sales
planning, forecast of likely sales growth rate is crucial. This needs careful
collection and analysis of past sales data. All such concerns are taken care of
(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business
or economic activity has indeed been engaging the minds of all concerned. This
regression, correlation, and time series analyses together help develop the basic
Keeping in view the importance of inferential statistics, the scope of statistics may
under conditions of uncertainty. While the term statistical methods is often used to
statistical data are analysed, interpreted, and the inferences drawn for decision-making.
Though generic in nature and versatile in their applications, statistical methods have
come to be widely used, especially in all matters concerning business and economics.
These are also being increasingly used in biology, medicine, agriculture, psychology,
and education. The scope of application of these methods has started opening and
finds them of increasing relevance for examining the political behaviour and it is, of
course, no surprise to find even historians using statistical data, for history is essentially past
data presented in a certain actual format.
There are three major functions in any business enterprise in which the statistical
(i) The planning of operations: This may relate to either special projects or to the
(ii) The setting up of standards: This may relate to the size of employment,
achieved against the norm or target set earlier. In case the production has
fallen short of the target, it gives remedial measures so that such a deficiency
setting standards, and control, are separate, but in practice they are very much
interrelated.
Different authors have highlighted the importance of Statistics in business. For instance,
Croxton and Cowden give numerous uses of Statistics in business such as project
planning, budgetary planning and control, inventory planning and control, quality
control, marketing, production and personnel administration. Within these also they
have specified certain areas where Statistics is very relevant. Another author, Irwing
number of areas where statistics is extremely useful. These are: customer wants and
production, inspection, packaging and shipping, sales and complaints, inventory and
such, one may do no more than highlight some of the more important ones to emphasise
the relevance of statistics to the business world. In the sphere of production, for
Statistical quality control methods are used to ensure the production of quality goods.
This is achieved by identifying and rejecting defective or substandard goods. The sales targets
can be fixed on the basis of sales forecasts, which are made by using various methods
of forecasting. Analysis of sales effected against the targets set earlier would indicate
the deficiency in achievement, which may be on account of several causes: (i) targets
were too high and unrealistic, (ii) salesmen's performance has been poor, (iii) emergence
of increased competition, (iv) poor quality of the company's product, and so on. These
management. Here, one is concerned with the fixation of wage rates, incentive norms
very relevant here. On the basis of measurement of productivity, the productivity bonus
Statistical methods could also be used to ascertain the efficacy of a certain product, say,
patients. One group is given this new medicine for a specified period and the other
one is treated with the usual medicines. Records are maintained for the two groups for
the specified period. This record is then analysed to ascertain if there is any significant
difference in the recovery of the two groups. If the difference is really significant
(i) There are certain phenomena or concepts where statistics cannot be used. This
(ii) Statistics reveal the average behaviour, the normal or the general trend. An
situation may lead to a wrong conclusion and sometimes may be disastrous. For
example, one may be misguided when told that the average depth of a river
from one bank to the other is four feet, when there may be some points in
between where its depth is far more than four feet. On this understanding, one
may enter those points having greater depth, which may be hazardous.
(iii) Since statistics are collected for a particular purpose, such data may not be
(i.e., data originally collected by someone else) may not be useful for the other
person.
(iv) Statistics are not 100 per cent precise as is Mathematics or Accountancy.
(v) In statistical surveys, sampling is generally used as it is not physically possible
to cover all the units or elements comprising the universe. The results may not
based on the same size of sample but different sample units may yield different
results.
in statistics, but such a relationship does not indicate cause and effect between
the two variables. In such cases, it is the user who has to interpret the results
(vii) A major limitation of statistics is that it does not reveal all pertaining to a certain
cover. Similarly, there are some other aspects related to the problem on hand,
which are also not covered. The user of Statistics has to be well informed and
should interpret Statistics keeping in mind all other aspects having relevance on
Apart from the limitations of statistics mentioned above, there are misuses of it. Many
people, knowingly or unknowingly, use statistical data in a wrong manner. Let us see
what the main misuses of statistics are so that the same could be avoided when one has
to use statistical data. The misuse of Statistics may take several forms some of which
(i) Sources of data not given: At times, the source of data is not given. In the
absence of the source, the reader does not know how far the data are reliable.
(ii) Defective data: Another misuse is that sometimes one gives defective data.
particular point. This apart, the definition used to denote a certain phenomenon
may be defective. For example, in case of data relating to unemployed persons,
the definition may include even those who are employed, though partially. The
universe. The sample may turn out to be unrepresentative of the universe. One
may choose a sample just on the basis of convenience. He may collect the
sample.
(iv) Inadequate sample: Earlier, we have seen that a sample that is unrepresentative
of the universe is a major misuse of statistics. This apart, at times one may
a city we may find that there are 1,00,000 households. When we have to conduct
only 0.1 per cent of the universe. A survey based on such a small sample may
comparisons from the data collected. For instance, one may construct an index
of production choosing the base year where the production was much less. Then
he may compare the subsequent year's production from this low base.
Such a comparison will undoubtedly give a rosy picture of the production
attempted. Such a comparison is wrong. Likewise, when data are not properly
classified or when changes in the composition of population in the two years are
not taken into consideration, comparisons of such data would be unfair as they
For example, while making projections of population in the next five years,
one may assume a lower rate of growth though the past two years indicate
otherwise. Sometimes one may not be sure about the changes in business
environment in the near future. In such a case, one may use an assumption that
the use of wrong average. Suppose in a series there are extreme values, one is
too high while the other is too low, such as 800 and 50. The use of an
arithmetic average in such a case may give a wrong idea. Instead, harmonic
(vii) Confusion of correlation and causation: In statistics, several times one has
to examine the relationship between two variables. A close relationship between the
two variables may not establish a cause-and-effect-relationship in the sense that one
variable is the cause and the other is the effect. It should be taken as something that
measures the degree of association rather than as evidence of a causal relationship.
1.8 SUMMARY
coverage, and scope. At the macro level, these are data on gross national product and
At the micro level, individual firms, howsoever small or large, produce extensive
statistics on their operations. The annual reports of companies contain a variety of data
These data are often field data, collected by employing scientific survey techniques.
Unless regularly updated, such data are the product of a one-time effort and have limited
use beyond the situation that may have called for their collection. A student knows
physics, and others. It is a discipline, which scientifically deals with data, and is often
described as the science of data. In dealing with statistics as data, statistics has
1. Define Statistics. Explain its types, and importance to trade, commerce and
business.
4. What are the major limitations of Statistics? Explain with suitable examples.
AN OVERVIEW OF CENTRAL TENDENCY
OBJECTIVE: The present lesson imparts understanding of the calculations and main
STRUCTURE:
2.1 Introduction
2.2 Arithmetic Mean
2.3 Median
2.4 Mode
2.5 Relationships of the Mean, Median and Mode
2.6 The Best Measure of Central Tendency
2.7 Geometric Mean
2.8 Harmonic Mean
2.9 Quadratic Mean
2.10 Summary
2.11 Self-Test Questions
2.12 Suggested Readings
2.1 INTRODUCTION
The description of statistical data may be quite elaborate or quite brief depending on
two factors: the nature of data and the purpose for which the same data have been
collected. While describing data statistically or verbally, one must ensure that the
description is neither too brief nor too lengthy. The measures of central tendency enable
us to compare two or more distributions pertaining to the same time period or within
the same distribution over time. For example, the average consumption of tea in two
different territories for the same period or in a territory for two years, say, 2003 and
2.2 ARITHMETIC MEAN
Adding all the observations and dividing the sum by the number of observations
These are seven observations. Symbolically, the arithmetic mean, also called simply
mean is
$$\bar{x} = \frac{10 + 15 + 30 + 7 + 42 + 79 + 83}{7} = \frac{266}{7} = 38$$
It may be noted that the Greek letter $\mu$ is used to denote the mean of the population
and n to denote the total number of observations in a population. Thus the population
mean $\mu = \sum x / n$. The formula given above is the basic formula that forms the definition
of arithmetic mean and is used in case of ungrouped data where weights are not
involved.
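The basic formula can be checked with a few lines of Python. This is a minimal sketch using the seven observations from the worked calculation above; the function name `arithmetic_mean` is only an illustrative choice.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


# The seven observations used in the text.
observations = [10, 15, 30, 7, 42, 79, 83]
mean_value = arithmetic_mean(observations)
print(mean_value)  # 38.0
```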
In case of ungrouped data where weights are involved, our approach for calculating
Example 2.1: Suppose a student has secured the following marks in three tests:
Mid-term test 30
Laboratory 25
Final 20
The simple arithmetic mean will be $\frac{30 + 25 + 20}{3} = 25$
However, this will be wrong if the three tests carry different weights on the basis of
their relative importance. Assume that the weights assigned to the three tests are:
Mid-term test 2 points
Laboratory 3 points
Final 5 points
Solution: On the basis of this information, we can now calculate a weighted mean as
shown below:
Test          Weight (w)    Marks (x)    wx
Mid-term      2             30           60
Laboratory    3             25           75
Final         5             20           100
Total         Σw = 10                    Σwx = 235
$$\bar{x}_w = \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{60 + 75 + 100}{2 + 3 + 5} = 23.5 \text{ marks}$$
It will be seen that weighted mean gives a more realistic picture than the simple or
unweighted mean.
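The weighted mean of Example 2.1 can be reproduced as follows. This is a small sketch: the function name `weighted_mean` is illustrative, and passing equal weights gives back the simple mean, which makes the comparison explicit.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


marks = [30, 25, 20]    # mid-term, laboratory, final
weights = [2, 3, 5]     # relative importance of each test

weighted = weighted_mean(marks, weights)
simple = weighted_mean(marks, [1, 1, 1])  # equal weights give the simple mean

print(weighted)  # 23.5
print(simple)    # 25.0
```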
falling prices in the stock exchange, a stock is sold at Rs 120 per share on one day, Rs
105 on the next and Rs 90 on the third day. The investor has purchased 50 shares on the
first day, 80 shares on the second day and 100 shares on the third day. What average
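Solution:
Day    Price per Share (Rs) (x)    No. of Shares Purchased (w)    Amount Paid (wx)
1      120                         50                             6000
2      105                         80                             8400
3      90                          100                            9000
$$\text{Weighted average} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{\sum wx}{\sum w} = \frac{23400}{230} = \text{Rs } 101.74 \text{ approximately}$$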
It will be seen that if merely prices of the shares for the three days (regardless of the
number of shares purchased) were taken into consideration, then the average price
would be Rs $\frac{120 + 105 + 90}{3}$ = Rs 105.
purchased, it fails to give a correct picture. A simple average, it may be noted, is also
a weighted average where weight in each case is the same, that is, only 1. When we use
the term average alone, we always mean that it is an unweighted or simple average.
For grouped data, arithmetic mean may be calculated by applying any of the following
methods:
In the case of direct method, the formula $\bar{x} = \sum fm / n$ is used. Here m is the mid-point of
various classes, f is the frequency of each class and n is the total number of
frequencies. The calculation of arithmetic mean by the direct method is shown below.
Example 2.3: The following table gives the marks of 58 students in Statistics.
Solution:
Marks    Mid-point (m)    No. of Students (f)    fm
0-10     5                4                      20
10-20    15               8                      120
20-30    25               11                     275
30-40    35               15                     525
40-50    45               12                     540
50-60    55               6                      330
60-70    65               2                      130
Total                     n = 58                 Σfm = 1940
$$\bar{x} = \frac{\sum fm}{n} = \frac{1940}{58} = 33.45 \text{ marks, or 33 marks approximately.}$$
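The direct method of Example 2.3 can be sketched in Python as below. The representation of each class as a `(lower, upper, frequency)` tuple and the function name `grouped_mean_direct` are assumptions made for this illustration.

```python
# Each class is (lower limit, upper limit, frequency), as in Example 2.3.
marks_classes = [
    (0, 10, 4),
    (10, 20, 8),
    (20, 30, 11),
    (30, 40, 15),
    (40, 50, 12),
    (50, 60, 6),
    (60, 70, 2),
]


def grouped_mean_direct(classes):
    """Direct method: sum(f * m) / n, where m is the class mid-point."""
    n = sum(freq for _, _, freq in classes)
    total_fm = sum(freq * (lower + upper) / 2 for lower, upper, freq in classes)
    return total_fm / n


mean_marks = grouped_mean_direct(marks_classes)
print(round(mean_marks, 2))  # 33.45
```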
It may be noted that the mid-point of each class is taken as a good approximation of the
true mean of the class. This is based on the assumption that the values are distributed
fairly evenly throughout the interval. When large numbers of frequency occur, this
In the case of short-cut method, the concept of arbitrary mean is followed. The
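formula for calculation of the arithmetic mean by the short-cut method is given below:
$$\bar{x} = A + \frac{\sum fd}{n}$$
Where A = arbitrary or assumed mean
f = frequency
d = deviation from the arbitrary or assumed mean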
When the values are extremely large and/or in fractions, the use of the direct method
would be very cumbersome. In such cases, the short-cut method is preferable. This is
particularly for calculation of the product of values and their respective frequencies.
However, when calculations are not made manually but by a machine calculator, it may
not be necessary to resort to the short-cut method, as the use of the direct method may
As can be seen from the formula used in the short-cut method, an arbitrary or assumed
mean is used. The second term in the formula ($\sum fd / n$) is the correction factor for the
difference between the actual mean and the assumed mean. If the assumed mean turns
out to be equal to the actual mean, ($\sum fd / n$) will be zero. The use of the short-cut
method is based on the principle that the total of deviations taken from an actual mean
is equal to zero. As such, the deviations taken from any other figure will depend on how
the assumed mean is related to the actual mean. While one may choose any value as
assumed mean, it would be proper to avoid extreme values, that is, too small or too high
chosen.
For the figures given earlier pertaining to marks obtained by 58 students, we calculate
Example 2.4:
Marks    Mid-point (m)    f     d      fd
0-10     5                4     -30    -120
10-20    15               8     -20    -160
20-30    25               11    -10    -110
30-40    35               15    0      0
40-50    45               12    10     120
50-60    55               6     20     120
60-70    65               2     30     60
Total                     58           Σfd = -90
It may be noted that we have taken arbitrary mean as 35 and deviations from midpoints.
In other words, the arbitrary mean has been subtracted from each value of mid-point
$$\bar{x} = A + \frac{\sum fd}{n} = 35 + \frac{-90}{58} = 33.45 \text{ marks}$$
Now we take up the calculation of arithmetic mean for the same set of data using the
$$\bar{x} = A + \frac{\sum fd'}{n} \times C = 35 + \frac{-9 \times 10}{58} = 33.45 \text{ or 33 marks approximately.}$$
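That the direct, short-cut and step-deviation methods agree can be verified numerically. The sketch below uses the mid-points and frequencies of the 58 students; the function names are illustrative, A = 35 and C = 10 follow the worked example, and the second assumed mean (25) is an arbitrary choice made to show that the result does not depend on it.

```python
midpoints = [5, 15, 25, 35, 45, 55, 65]
frequencies = [4, 8, 11, 15, 12, 6, 2]


def direct_mean(m, f):
    """Direct method: sum(f * m) / n."""
    return sum(fi * mi for mi, fi in zip(m, f)) / sum(f)


def shortcut_mean(m, f, assumed_mean):
    """Short-cut method: A + sum(f * d) / n, with d = m - A."""
    total_fd = sum(fi * (mi - assumed_mean) for mi, fi in zip(m, f))
    return assumed_mean + total_fd / sum(f)


def step_deviation_mean(m, f, assumed_mean, class_width):
    """Step-deviation method: A + (sum(f * d') / n) * C, with d' = (m - A) / C."""
    total_fd_prime = sum(
        fi * (mi - assumed_mean) / class_width for mi, fi in zip(m, f)
    )
    return assumed_mean + (total_fd_prime / sum(f)) * class_width


by_direct = direct_mean(midpoints, frequencies)
by_shortcut = shortcut_mean(midpoints, frequencies, assumed_mean=35)
by_step = step_deviation_mean(midpoints, frequencies, assumed_mean=35, class_width=10)
by_other_assumed = shortcut_mean(midpoints, frequencies, assumed_mean=25)

print(round(by_direct, 2), round(by_shortcut, 2), round(by_step, 2))
```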
It will be seen that the answer in each of the three cases is the same. The step-deviation
noted that if we select a different arbitrary mean and recalculate deviations from that
Now that we have learnt how the arithmetic mean can be calculated by using different
methods, we are in a position to handle any problem where calculation of the arithmetic
mean is involved.
Example 2.6: The mean of the following frequency distribution was found to be 1.46.
Solution:
Here we are given the total number of frequencies and the arithmetic mean. We have to
determine the two frequencies that are missing. Let us assume that the frequency against
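$$1.46 = \frac{x + 2y + 140}{200}$$
$$x + 2y + 140 = 292, \quad \text{so} \quad x + 2y = 152$$
$$x + y = 200 - 86 = 114$$
Subtracting the second equation from the first:
$$(x + 2y) - (x + y) = 152 - 114, \quad \text{so} \quad y = 38$$
Therefore, x = 114 - 38 = 76
Against accident 1 : 76
Against accident 2 : 38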
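The elimination above can be reproduced in code. This is a sketch under the figures visible in the solution (total frequency 200, known frequencies summing to 86, known Σfx of 140, missing frequencies against the values 1 and 2); the function name and parameter names are hypothetical. `Fraction` is used so that 1.46 × 200 is computed exactly.

```python
from fractions import Fraction


def missing_frequencies(mean, n_total, known_freq_sum, known_fx_sum,
                        value_a=1, value_b=2):
    """Solve a*x + b*y = mean*n - known_fx_sum and x + y = n - known_freq_sum."""
    exact_mean = Fraction(str(mean))
    weighted_rhs = exact_mean * n_total - known_fx_sum  # a*x + b*y
    count_rhs = n_total - known_freq_sum                # x + y
    # Eliminate x: (b - a) * y = weighted_rhs - a * count_rhs
    y = (weighted_rhs - value_a * count_rhs) / (value_b - value_a)
    x = count_rhs - y
    return x, y


x, y = missing_frequencies(1.46, 200, known_freq_sum=86, known_fx_sum=140)
print(int(x), int(y))  # 76 38
```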
1. The sum of the deviations of the individual items from the arithmetic mean is
the arithmetic mean. Since the sum of the deviations in the positive direction
is equal to the sum of the deviations in the negative direction, the arithmetic
2. The sum of the squared deviations of the individual items from the arithmetic
mean is always minimum. In other words, the sum of the squared deviations
taken from any value other than the arithmetic mean will be higher.
3. As the arithmetic mean is based on all the items in a series, a change in the value
of any item will lead to a change in the value of the arithmetic mean.
4. In the case of highly skewed distribution, the arithmetic mean may get distorted
on account of a few items with extreme values. In such a case, it may cease
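The first two properties listed above can be checked numerically. The sketch below reuses the seven observations from Section 2.2; the helper name `sum_squared_deviations` and the comparison points are illustrative choices, not part of the original notes.

```python
values = [10, 15, 30, 7, 42, 79, 83]
mean_value = sum(values) / len(values)  # 38.0

# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - mean_value for x in values)


def sum_squared_deviations(data, centre):
    """Sum of squared deviations of the data from an arbitrary centre."""
    return sum((x - centre) ** 2 for x in data)


# Property 2: the sum of squared deviations is smallest at the mean.
at_mean = sum_squared_deviations(values, mean_value)
at_median = sum_squared_deviations(values, 30)          # 30 is the median here
at_shifted = sum_squared_deviations(values, mean_value + 1)

print(sum_of_deviations)             # 0.0
print(at_mean, at_median, at_shifted)
```

Moving the centre away from the mean by c raises the sum of squares by exactly n·c², which is why any other centre gives a larger total.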
2.3 MEDIAN
Median is defined as the value of the middle item (or the mean of the values of the
two middle items) when the data are arranged in an ascending or descending order of
ascending or descending order of magnitude, the median is the middle value if n is odd.
When n is even, the median is the mean of the two middle values.
We have to first arrange it in either ascending or descending order. These figures are
5, 7, 10, 15, 18, 19, 21, 25, 33
Now, as the series consists of an odd number of items, to find out the value of the middle
item we use the formula $\frac{n + 1}{2}$, where n is the number of items. In this case, n is 9, as such
$\frac{n + 1}{2} = 5$, that is, the size of the 5th item, 18, is the median.
Suppose the series consists of one more item, 23. We may, therefore, have to include
23 in the above series at an appropriate place, that is, between 21 and 25. Thus, the
series is now 5, 7, 10, 15, 18, 19, 21, 23, 25, 33. Applying the above formula, the
median is the size of the 5.5th item. Here, we have to take the average of the values of the 5th
and 6th items. This means an average of 18 and 19, which gives the median as 18.5.
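The positional rule for ungrouped data can be written as a short function. This is a minimal sketch; the function name `median_ungrouped` is illustrative, and the two series are the ones used in the text.

```python
def median_ungrouped(values):
    """Middle value if n is odd, mean of the two middle values if n is even."""
    if not values:
        raise ValueError("at least one observation is required")
    ordered = sorted(values)          # arrange in ascending order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # the (n + 1)/2-th item
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_series = [5, 7, 10, 15, 18, 19, 21, 25, 33]
even_series = odd_series + [23]       # sorting places 23 between 21 and 25

print(median_ungrouped(odd_series))   # 18
print(median_ungrouped(even_series))  # 18.5
```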
It may be noted that the formula $\frac{n + 1}{2}$ itself is not the formula for the median; it
merely indicates the position of the median, namely, the number of items we have to
count until we arrive at the item whose value is the median. In the case of the even
number of items in the series, we identify the two items whose values have to be
averaged to obtain the median. In the case of a grouped series, the median is calculated
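$$M = l_1 + \frac{l_2 - l_1}{f}(m - c)$$
Where M = the median
l1 = the lower limit of the class in which the median lies
l2 = the upper limit of the class in which the median lies
f = the frequency of the class in which the median lies
m = the middle item, that is, the $\frac{n + 1}{2}$th item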
c = the cumulative frequency of the class preceding the one in which the median lies
Example 2.7:
Total 143
frequency to the table. Thus, the table with the cumulative frequency is written as:
Monthly Wages (Rs)    Frequency    Cumulative Frequency
800-1,000             18           18
1,000-1,200           25           43
1,200-1,400           30           73
1,400-1,600           34           107
1,600-1,800           26           133
1,800-2,000           10           143
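$$M = l_1 + \frac{l_2 - l_1}{f}(m - c)$$
$$m = \frac{n + 1}{2} = \frac{143 + 1}{2} = 72$$
The 72nd item lies in the class 1,200-1,400, so l1 = 1200, l2 = 1400, f = 30 and c = 43. Thus
$$M = 1200 + \frac{1400 - 1200}{30}(72 - 43) = 1200 + \frac{200}{30}(29) = \text{Rs } 1393.3$$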
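The grouped-median calculation of Example 2.7 can be sketched as below. The `(lower, upper, frequency)` tuple layout and the function name `grouped_median` are assumptions made for this illustration; the position m = (n + 1)/2 follows the convention used in these notes.

```python
# Monthly wage classes from Example 2.7: (lower limit, upper limit, frequency).
wage_classes = [
    (800, 1000, 18),
    (1000, 1200, 25),
    (1200, 1400, 30),
    (1400, 1600, 34),
    (1600, 1800, 26),
    (1800, 2000, 10),
]


def grouped_median(classes):
    """M = l1 + (l2 - l1) / f * (m - c), with m = (n + 1) / 2."""
    n = sum(freq for _, _, freq in classes)
    m = (n + 1) / 2
    cumulative = 0  # c: cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        if cumulative + freq >= m:      # the m-th item falls in this class
            return lower + (upper - lower) / freq * (m - cumulative)
        cumulative += freq
    raise ValueError("no classes supplied")


median_wage = grouped_median(wage_classes)
print(round(median_wage, 1))  # 1393.3
```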
At this stage, let us introduce two other concepts viz. quartile and decile. To understand
these, we should first know that the median belongs to a general class of statistical
descriptions called fractiles. A fractile is a value below which lies a given fraction of a
set of data. In the case of the median, this fraction is one-half (1/2). Likewise, a quartile
has a fraction one-fourth (1/4). The three quartiles Q1, Q2 and Q3 are such that 25 percent
of the data fall below Q1, 25 percent fall between Q1 and Q2, 25 percent fall between
Q2 and Q3 and 25 percent fall above Q3. It will be seen that Q2 is the median. We can
use the above formula for the calculation of quartiles as well. The only difference will
be in the value of m. Let us calculate both Q1 and Q3 in respect of the table given in
Example 2.7.
$$Q_1 = l_1 + \frac{l_2 - l_1}{f}(m - c)$$
Here, m will be $\frac{n + 1}{4} = \frac{143 + 1}{4} = 36$
$$Q_1 = 1000 + \frac{1200 - 1000}{25}(36 - 18) = 1000 + \frac{200}{25}(18) = \text{Rs } 1{,}144$$
In the case of Q3, m will be $\frac{3(n + 1)}{4} = \frac{3 \times 144}{4} = 108$
$$Q_3 = 1600 + \frac{1800 - 1600}{26}(108 - 107) = 1600 + \frac{200}{26}(1) = \text{Rs } 1{,}607.69$$
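Since quartiles use the same interpolation formula with a different position m, one helper can serve for any fractile. The sketch below is an illustration only: the function name `value_at_position` and the tuple layout are assumptions, and the positions (n + 1)/4 and 3(n + 1)/4 follow the convention of the worked example.

```python
# Monthly wage classes from Example 2.7: (lower limit, upper limit, frequency).
wage_table = [
    (800, 1000, 18),
    (1000, 1200, 25),
    (1200, 1400, 30),
    (1400, 1600, 34),
    (1600, 1800, 26),
    (1800, 2000, 10),
]


def value_at_position(classes, m):
    """Interpolated value of the m-th item: l1 + (l2 - l1) / f * (m - c)."""
    cumulative = 0  # c: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= m:
            return lower + (upper - lower) / freq * (m - cumulative)
        cumulative += freq
    raise ValueError("position lies beyond the last class")


n = sum(freq for _, _, freq in wage_table)        # 143
q1 = value_at_position(wage_table, (n + 1) / 4)   # m = 36
q3 = value_at_position(wage_table, 3 * (n + 1) / 4)  # m = 108

print(round(q1, 2))  # 1144.0
print(round(q3, 2))  # 1607.69
```

Deciles and percentiles follow by passing positions such as k(n + 1)/10 or k(n + 1)/100 to the same helper.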
In the same manner, we can calculate deciles (where the series is divided into 10
parts) and percentiles (where the series is divided into 100 parts). It may be noted that
happens to be skewed. Another point that goes in favour of median is that it can be
computed when a distribution has open-end classes. Yet, another merit of median is that
when a distribution contains qualitative data, it is the only average that can be used. No
other average is suitable in case of such a distribution. Let us take a couple of examples
Example 2.8:Calculate the most suitable average for the following data:
Size of the Item Below 50 50-100 100-150 150-200 200 and above
Frequency 15 20 36 40 10
Solution: Since the data have two open-end classes-one in the beginning (below 50) and the
other at the end (200 and above), median should be the right choice as a measure of central
tendency.
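Median is the size of the (n + 1)/2 th item:
\[ \frac{n + 1}{2} = \frac{121 + 1}{2} = 61\text{st item} \]
\[ \text{Median} = l_1 + \frac{l_2 - l_1}{f}\,(m - c) = 100 + \frac{150 - 100}{36}\,(61 - 35) = 136.1 \text{ approx.} \]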
Example 2.9: The following data give the savings bank accounts balances of nine sample
(a) Find the mean and the median for these data; (b) Do these data contain an outlier? If so,
exclude this value and recalculate the mean and median. Which of these summary measures
has a greater change when an outlier is dropped?; (c) Which of these two summary measures
Solution:
\[ \text{Mean} = \frac{\text{Rs } 83{,}600}{9} = \text{Rs } 9{,}289 \]
\[ \text{Median} = \text{Size of } \frac{n + 1}{2}\text{th item} = \frac{9 + 1}{2} = 5\text{th item} \]
Arranging the data in an ascending order, we find that the median is Rs 1,800.
exclude this figure and recalculate both the mean and the median.
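\[ \text{Mean} = \text{Rs } \frac{83{,}600 - 68{,}000}{8} = \text{Rs } \frac{15{,}600}{8} = \text{Rs } 1{,}950 \]
\[ \text{Median} = \text{Size of } \frac{n + 1}{2}\text{th item} = \frac{8 + 1}{2} = 4.5\text{th item} \]
\[ = \text{Rs } \frac{1{,}500 + 1{,}800}{2} = \text{Rs } 1{,}650 \]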
It will be seen that the mean shows a far greater change than the median when the
(c) As far as these data are concerned, the median will be a more appropriate measure
than the mean.
Example 2.10: Suppose we are given the following series:
Frequency 6 12 22 37 17 8 5
We are asked to draw both types of ogive from these data and to determine the
median.
Solution:
First of all, we transform the given data into two cumulative frequency distributions,
Table A
Frequency
Less than 10 6
Less than 20 18
Less than 30 40
Less than 40 77
Less than 50 94
Less than 60 102
Less than 70 107
Table B
Frequency
More than 0 107
More than 10 101
More than 20 89
More than 30 67
More than 40 30
More than 50 13
More than 60 5
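As a check on Tables A and B, the two cumulative distributions can be generated from the class frequencies. This is only a sketch: the variable names are ours, and the class limits 0, 10, ..., 70 are read off the two tables.

```python
from itertools import accumulate

# Class frequencies from Example 2.10 (classes 0-10, 10-20, ..., 60-70)
freqs = [6, 12, 22, 37, 17, 8, 5]
upper_limits = [10, 20, 30, 40, 50, 60, 70]
lower_limits = [0, 10, 20, 30, 40, 50, 60]

# 'Less than' series: running total from the first class downwards
less_than = list(accumulate(freqs))

# 'More than' series: running total from the last class upwards
more_than = list(accumulate(reversed(freqs)))[::-1]

for limit, cf in zip(upper_limits, less_than):
    print(f"Less than {limit}: {cf}")
for limit, cf in zip(lower_limits, more_than):
    print(f"More than {limit}: {cf}")
```

Plotting `less_than` against the upper limits and `more_than` against the lower limits gives the two ogives.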
meet the X-axis at M. Thus, from the point of origin to the point at M gives the value
applying the formula, then the answer comes to 33.8, or 34, approximately. It may be
pointed out that even a single ogive can be used to determine the median. As we have
determined the median graphically, so also we can find the values of quartiles, deciles
or percentiles graphically. For example, to determine Q3, we have to take the size of the {3(n + 1)}
/4 = 81st item. From this point on the Y-axis, we can draw a perpendicular to meet the
'less than' ogive from which another straight line is to be drawn to meet the X-axis. This
point will give us the value of the upper quartile. In the same manner, other values of
1. Unlike the arithmetic mean, the median can be computed from open-ended
2. The median can also be determined graphically whereas the arithmetic mean
4. In case of the qualitative data where the items are not counted or measured but
2.4 MODE
The mode is another measure of central tendency. It is the value at the point around
which the items are most heavily concentrated. As an example, consider the following
There are ten observations in the series, in which the figure 15 occurs the maximum number
of times, namely three. The mode is therefore 15. The series given above is a discrete series; as
such, the variable cannot be in fraction. If the series were continuous, we could say that
\[ \text{Mode} = l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i \]
Where, l1 = the lower value of the class in which the mode lies
While applying the above formula, we should ensure that the class-intervals are uniform
throughout. If the class-intervals are not uniform, then they should be made uniform on
the assumption that the frequencies are evenly distributed throughout the class. In the
case of unequal class-intervals, the application of the above formula will give misleading
results.
Solution: We can see from Column (2) of the table that the maximum frequency of
12 lies in the class-interval of 60-70. This suggests that the mode lies in this class-interval.
\[ \text{Mode} = 60 + \frac{4}{4 + 3} \times 10 = 65.7 \text{ approx.} \]
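The mode formula can be sketched in Python as follows. The function name `grouped_mode` is illustrative. The neighbouring frequencies f0 = 8 and f2 = 9 are assumptions chosen only to reproduce the differences 4 and 3 that appear in the calculation above; the full frequency table is not reproduced here.

```python
def grouped_mode(l1, f0, f1, f2, i):
    """Mode of a grouped series with uniform class-intervals.

    l1: lower limit of the modal class
    f1: frequency of the modal class
    f0, f2: frequencies of the preceding and following classes
    i: width of the class-interval
    """
    return l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i


# Modal class 60-70 with frequency 12; f0 and f2 are assumed values
mode = grouped_mode(l1=60, f0=8, f1=12, f2=9, i=10)
print(round(mode, 1))
```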
In several cases, just by inspection one can identify the class-interval in which the mode
lies. One should see which the highest frequency is and then identify to which class-
interval this frequency belongs. Having done this, the formula given for calculating the
At times, it is not possible to identify by inspection the class where the mode lies. In
such cases, it becomes necessary to use the method of grouping. This method consists
of two parts:
(i) Preparation of a grouping table: A grouping table has six columns, the first
frequencies grouped in two's, starting from the top. Leaving the first frequency,
of the first three items, then second to fourth item and so on. Column 5 leaves
the first frequency and groups the remaining items in three's. Column 6 leaves
the first two frequencies and then groups the remaining in three's. Now, the
bold type.
analysis table is prepared. On the left-hand side, provide the first column for
column numbers and on the right-hand side the different possible values of
mode. The highest values marked in the grouping table are shown here by a
they represent. The last row of this table will show the number of times a
particular value has occurred in the grouping table. The highest value in the
analysis table will indicate the class-interval in which the mode lies. The
procedure of preparing both the grouping and analysis tables to locate the modal
10-20 10
20-30 18
30-40 25
40-50 26
50-60 17
60-70 4
Solution:
Grouping Table
(Each sum is entered against the first class of the group it covers.)
Size of item   Col 1   Col 2   Col 3   Col 4   Col 5   Col 6
10-20          10      28              53
20-30          18              43              69
30-40          25      51                              68
40-50          26              43      47
50-60          17      21
60-70          4
Analysis table
                 Size of item
Col. No.   10-20   20-30   30-40   40-50   50-60
1                                  1
2                          1       1
3                  1       1       1       1
4          1       1       1
5                  1       1       1
6                          1       1       1
Total      1       3       5       5       2
This is a bi-modal series as is evident from the analysis table, which shows that the two
classes 30-40 and 40-50 have occurred five times each in the grouping. In such a
Median = Size of the (n + 1)/2th item, that is, 101/2 = 50.5th item. This lies in the class 30-
40. Applying the formula for the median, as given earlier, we get
\[ \text{Median} = 30 + \frac{40 - 30}{25}\,(50.5 - 28) = 30 + 9 = 39 \]
\[ \text{Mean} = A + \frac{\sum fd'}{n} \times i = 35 + \frac{34}{100} \times 10 = 38.4 \]
\[ \text{Mode} = 3\,\text{Median} - 2\,\text{Mean} = (3 \times 39) - (2 \times 38.4) = 117 - 76.8 = 40.2 \]
This formula, Mode = 3 Median-2 Mean, is an empirical formula only. And it can
give only approximate results. As such, its frequent use should be avoided. However,
when mode is ill defined or the series is bimodal (as is the case in the present
Having discussed mean, median and mode, we now turn to the relationship amongst
these three measures of central tendency. We shall discuss the relationship assuming
(i) When a distribution is symmetrical, the mean, median and mode are the same,
In case, a distribution is
bution is skewed to the right where a large number of families have relatively
low income and a small number of families have extremely high income. In such
a case, the mean is pulled up by the extreme high incomes and the relation
among these three measures is as shown in Fig. 6.3. Here, we find that mean >
median > mode.
shown as in the figure.
(iii) Given the mean and median of a unimodal distribution, we can determine
is skewed to the left. It may be noted that the median is always in the middle
At this stage, one may ask which of these three measures of central tendency is the
best. There is no simple answer to this question. It is because these three measures
are based upon different concepts. The arithmetic mean is the sum of the values divided
by the total number of observations in the series. The median is the value of the middle
observation that divides the series into two equal parts. Mode is the value around which
the observations tend to concentrate. As such, the use of a particular measure will
largely depend on the purpose of the study and the nature of the data. For example,
television sets or different kinds of advertising, the choice should go in favour of mode.
The use of mean and median would not be proper. However, the median can sometimes
be used in the case of qualitative data when such data can be arranged in an ascending
or descending order. Let us take another example. Suppose we invite applications for a
certain vacancy in our company. A large number of candidates apply for that post. We
are now interested to know as to which age or age group has the largest concentration
of applicants. Here, obviously the mode will be the most appropriate choice. The
be influenced by some extreme values. However, the mean happens to be the most
commonly used measure of central tendency as will be evident from the discussion in
Apart from the three measures of central tendency as discussed above, there are two
other means that are used sometimes in business and economics. These are the
geometric mean and the harmonic mean. The geometric mean is more important than
the harmonic mean. We discuss below both these means. First, we take up the geometric
mean. Geometric mean is defined as the nth root of the product of n observations of a
distribution.
have to calculate the cube root of the product of these three observations; and so on.
When the number of items is large, it becomes extremely difficult to multiply the
numbers and to calculate the root. To simplify calculations, logarithms are used.
Example 2.13: If we have to find out the geometric mean of 2, 4 and 8, then we find
\[ \log GM = \frac{\sum \log x_i}{n} = \frac{1.8062}{3} = 0.60206 \]
\[ GM = \text{Antilog } 0.60206 = 4 \]
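The logarithmic method of Example 2.13 can be sketched in Python. The function name `geometric_mean` is our own, and base-10 logarithms are used to match the antilog tables in the text.

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms: antilog of the mean of the logs."""
    log_sum = sum(math.log10(x) for x in values)
    return 10 ** (log_sum / len(values))


log_sum = sum(math.log10(x) for x in [2, 4, 8])  # about 1.8062
gm = geometric_mean([2, 4, 8])
print(round(log_sum, 4), round(gm, 4))
```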
When the data are given in the form of a frequency distribution, then the geometric
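\[ \log GM = \frac{f_1 \log x_1 + f_2 \log x_2 + \dots + f_n \log x_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum f \log x}{\sum f} \]
\[ \text{Then, } GM = \text{Antilog}\left(\frac{\sum f \log x}{\sum f}\right) \]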
3. Discounting, capitalization.
Example 2.14: A person has invested Rs 5,000 in the stock market. At the end of the
first year the amount has grown to Rs 6,250; he has had a 25 percent profit. If at the end
of the second year his principal has grown to Rs 8,750, the rate of increase is 40 percent
for the year. What is the average rate of increase of his investment during the two years?
Solution:
The average rate of increase in the value of investment is therefore 1.323 - 1 = 0.323,
Example 2.15: We can also derive a compound interest formula from the above set of
Solution: Now, 1.25 x 1.40 = 1.75. This can be written as 1.75 = (1 + 0.323)^2.
Let P2 = 1.75, P0 = 1, and r = 0.323, then the above equation can be written as P2 = (1
+ r)^2 or P2 = P0 (1 + r)^2.
Where P2 is the value of investment at the end of the second year, P0 is the initial
investment and r is the rate of increase in the two years. This, in fact, is the familiar
r)^n. In our case P0 is Rs 5,000 and the rate of increase in investment is 32.3 percent. Let
us apply this formula to ascertain the value of Pn, that is, investment at the end of the
second year.
Pn = 5,000 (1 + 0.323)^2
= 5,000 x 1.75
= Rs 8,750
It may be noted that in the above example, if the arithmetic mean is used, the resultant
figure will be wrong. In this case, the average rate for the two years is (25 + 40)/2 percent
per year, which comes to 32.5. Applying this rate, we get
\[ P_n = \frac{165}{100} \times 5{,}000 = \text{Rs } 8{,}250 \]
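The point of Examples 2.14 and 2.15 can be verified numerically. The following sketch, with variable names of our own choosing, computes the average growth factor as the geometric mean of the yearly factors and confirms that compounding at this rate reproduces the final amount.

```python
import math

principal = 5000
growth_factors = [1.25, 1.40]  # 25 percent and 40 percent increases

# Geometric mean of the yearly growth factors
years = len(growth_factors)
gm_factor = math.prod(growth_factors) ** (1 / years)
average_rate = gm_factor - 1  # about 0.323

# Compounding at the geometric-mean rate recovers the actual final value
final_value = principal * gm_factor ** years
print(round(average_rate, 3), round(final_value, 2))
```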
Example 2.16: An economy has grown at 5 percent in the first year, 6 percent in the
second year, 4.5 percent in the third year, 3 percent in the fourth year and 7.5 percent
in the fifth year. What is the average rate of growth of the economy during the five
years?
Solution:
\[ GM = \text{Antilog}\left(\frac{\sum \log x}{n}\right) = \text{Antilog}\left(\frac{10.10987}{5}\right) = \text{Antilog } 2.021974 = 105.19 \]
Hence, the average rate of growth during the five-year period is 105.19 - 100 = 5.19
percent per annum. In case of a simple arithmetic average, the corresponding rate of
2.7.1 DISCOUNTING
\[ P_n = P_0 (1 + r)^n \quad \text{This can be written as} \quad P_0 = \frac{P_n}{(1 + r)^n} \]
If the future income is Pn rupees and the present rate of interest is 100 r percent, then
the present value of Pn rupees will be P0 rupees. For example, if we have a machine
that has a life of 20 years and is expected to yield a net income of Rs 50,000 per year,
and at the end of 20 years it will be obsolete and cannot be used, then the machine's
present value is
\[ \frac{50{,}000}{(1 + r)} + \frac{50{,}000}{(1 + r)^2} + \frac{50{,}000}{(1 + r)^3} + \dots + \frac{50{,}000}{(1 + r)^{20}} \]
This process of ascertaining the present value of future income by using the interest rate
is known as discounting.
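Discounting the machine's income stream can be sketched as below. The text does not state an interest rate, so r = 0.10 is purely an assumed value for illustration, and the function name `present_value` is our own.

```python
def present_value(income_per_year, r, years):
    """Sum of income / (1 + r)^t for t = 1, ..., years."""
    return sum(income_per_year / (1 + r) ** t for t in range(1, years + 1))


# Rs 50,000 per year for 20 years; the 10 percent rate is an assumption
pv = present_value(50_000, 0.10, 20)
print(round(pv, 2))
```

The result is well below the undiscounted total of Rs 10,00,000 (20 x 50,000), since income received later is worth less today.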
In conclusion, it may be said that when there are extreme values in a series, geometric
mean should be used as it is much less affected by such values. The arithmetic mean
2.7.2 ADVANTAGES OF G. M.
1. Geometric mean is based on each and every observation in the data set.
2. It is rigidly defined.
growth rates.
4. As compared to the arithmetic mean, it gives more weight to small values and
mean, it is generally less than the arithmetic mean. At times it may be equal to
understand.
2. Both computation of the geometric mean and its interpretation are rather
difficult.
In view of the limitations mentioned above, the geometric mean is not frequently
used.
The harmonic mean is defined as the reciprocal of the arithmetic mean of the
\[ HM = \text{Reciprocal of } \frac{\sum 1/x}{n} = \frac{n}{1/x_1 + 1/x_2 + 1/x_3 + \dots + 1/x_n} \]
The calculation of harmonic mean becomes very tedious when a distribution has a large
number of observations. In the case of grouped data, the harmonic mean is calculated
\[ HM = \text{Reciprocal of } \frac{\sum_{i=1}^{n} f_i \times \frac{1}{x_i}}{n} \]
or
\[ HM = \frac{n}{\sum_{i=1}^{n} f_i \times \frac{1}{x_i}} \]
Here, each reciprocal of the original figure is weighted by the corresponding frequency
(f).
The main advantage of the harmonic mean is that it is based on all observations in a
greater weight to smaller observations and less weight to the larger observations, then
the use of harmonic mean will be more suitable. As against these advantages, there
are certain limitations of the harmonic mean. First, it is difficult to understand as well
or negative. Third, it is only a summary figure, which may not be an actual observation
in the distribution.
It is worth noting that, for positive observations that are not all equal, the harmonic mean is always lower than the geometric mean,
which is lower than the arithmetic mean. This is because the harmonic mean assigns
lesser importance to higher values. Since the harmonic mean is based on reciprocals,
it becomes clear that as reciprocals of higher values are lower than those of lower
values, it is a lower average than the arithmetic mean as well as the geometric mean.
Example 2.17: Suppose we have three observations 4, 8 and 16. We are required to
calculate the harmonic mean. Reciprocals of 4, 8 and 16 are 1/4, 1/8 and 1/16, respectively.
\[ HM = \frac{n}{1/x_1 + 1/x_2 + 1/x_3} = \frac{3}{1/4 + 1/8 + 1/16} = \frac{3}{0.25 + 0.125 + 0.0625} = 6.857 \text{ approx.} \]
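The ordering of the three means can be checked on the observations of Example 2.17. This is a minimal sketch with helper names of our own; Python's standard `statistics` module offers equivalent functions, but the formulas are written out here to match the text.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # nth root of the product of n observations
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    # reciprocal of the arithmetic mean of the reciprocals
    return len(values) / sum(1 / x for x in values)


values = [4, 8, 16]
am = arithmetic_mean(values)
gm = geometric_mean(values)
hm = harmonic_mean(values)
print(round(am, 3), round(gm, 3), round(hm, 3))
```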
Frequency 20 40 30 10
Solution:
\[ HM = \frac{n}{\sum_{i=1}^{n} f_i \times \frac{1}{x_i}} = \frac{100}{20.0641} = 4.984 \text{ approx.} \]
Example 2.19: In a small company, two typists are employed. Typist A types one page
in ten minutes while typist B takes twenty minutes for the same. (i) Both are asked to
type 10 pages. What is the average time taken for typing one page? (ii) Both are asked
to type for one hour. What is the average time taken by them for typing one page?
= 15 minutes
\[ HM = \frac{60 \times 2 \text{ (minutes)}}{60/10 + 60/20 \text{ (pages)}} = \frac{120}{\frac{120 + 60}{20}} = \frac{40}{3} = 13 \text{ minutes and } 20 \text{ seconds.} \]
Example 2.20: It takes ship A 10 days to cross the Pacific Ocean; ship B takes 15
days and ship C takes 20 days. (i) What is the average number of days taken by a ship
to cross the Pacific Ocean? (ii) What is the average number of days taken by a cargo
to cross the Pacific Ocean when the ships are hired for 60 days?
Solution: Here again Q-(i) pertains to simple arithmetic mean while Q-(ii) is concerned
\[ \text{(i) } M = \frac{10 + 15 + 20}{3} = 15 \text{ days} \]
\[ \text{(ii) } HM = \frac{60 \times 3 \text{ (days)}}{60/10 + 60/15 + 60/20} = \frac{180}{\frac{360 + 240 + 180}{60}} = 13.8 \text{ days approx.} \]
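Examples 2.19 and 2.20 share one pattern: when each worker is given the same amount of time, the average time per unit of output is total time divided by total output, which is the harmonic mean. A small sketch, with a function name of our own choosing:

```python
def average_time_per_unit(times_per_unit, time_available):
    """Average time per unit when each worker works for time_available."""
    total_time = time_available * len(times_per_unit)
    total_units = sum(time_available / t for t in times_per_unit)
    return total_time / total_units


# Two typists taking 10 and 20 minutes per page, each typing for 60 minutes
typists = average_time_per_unit([10, 20], 60)

# Three ships taking 10, 15 and 20 days per crossing, each hired for 60 days
ships = average_time_per_unit([10, 15, 20], 60)
print(round(typists, 2), round(ships, 2))
```

The result does not depend on the time available (60 here), because it cancels between numerator and denominator.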
We have seen earlier that the geometric mean is the antilogarithm of the arithmetic
mean of the logarithms, and the harmonic mean is the reciprocal of the arithmetic mean
of the reciprocals. Likewise, the quadratic mean (Q) is the square root of the arithmetic
\[ Q = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}} \]
Instead of using original values, the quadratic mean can be used while averaging
deviations when the standard deviation is to be calculated. This will be used in the
Q> x >G>H provided that all the individual observations in a series are positive and
should use the same method of averaging that was employed in calculating the original
averages. Thus, we should calculate the arithmetic mean of several values of x, the
geometric mean of several values of GM, and the harmonic mean of several values of
2.10 SUMMARY
The most important objective of statistical analysis is to get one single value that
describes the characteristics of the entire mass of cumbersome data. Such a value is
2.11 SELF-TEST QUESTIONS
1. What are the desiderata (requirements) of a good average? Compare the mean,
the median and the mode in the light of these desiderata? Why averages are
2. "Every average has its own peculiar characteristics. It is difficult to say which
3. What do you understand by 'Central Tendency'? Under what conditions is the
4. The average monthly salary paid to all employees in a company was Rs 8,000.
The average monthly salaries paid to male and female employees of the
company were Rs 10,600 and Rs 7,500 respectively. Find out the percentages
Frequency 2 4 9 11 12 6 4 2
6. Calculate the mean, median and mode from the following data:
62-63 2
63-64 6
64-65 14
65-66 16
66-67 8
67-68 3
68-69 1
Total 50
After drying for two weeks, the same articles have again been weighed and
similarly classified. It is known that the median weight in the first weighing
was 20.83 gm while in the second weighing it was 17.35 gm. Some frequencies
a and b in the first weighing and x and y in the second are missing.It is known
that a = 1/3x and b = 1/2 y. Find out the values of the missing frequencies.
Class Frequencies
0-5 a x
5-10 b y
10-15 11 40
15-20 52 50
20-25 75 30
25-30 22 28
8. Cities A, B and C are equidistant from each other. A motorist travels from A to
Frequency 20 40 30 10
While coming down it runs 12 km per litre. Find its average consumption for
to and fro travel between two places situated at the two ends of 25 km long
gradient.
DISPERSION AND SKEWNESS
3.1 INTRODUCTION
In the previous chapter, we have explained the measures of central tendency. It may
be noted that these measures do not indicate the extent of dispersion or variability in a
distribution. The dispersion or variability provides us one more step in increasing our
understanding of the pattern of the data. Further, a high degree of uniformity (i.e. low
variability in the raw material, then it could not find mass production economical.
Suppose an investor is looking for a suitable equity share for investment. While
examining the movement of share prices, he should avoid those shares that are highly
fluctuating-having sometimes very high prices and at other times going very low.
Such extreme fluctuations mean that there is a high risk in the investment in shares. The
investor should, therefore, prefer those shares where risk is not so high.
The various measures of central value give us one single figure that represents the
entire data. But the average alone cannot adequately describe a set of observations,
unless all the observations are the same. It is necessary to describe the variability or
dispersion of the observations. In two or more distributions the central value may be
the same but still there can be wide disparities in the formation of distribution.
distribution.
It is clear from above that dispersion (also known as scatter, spread or variation)
measures the extent to which the items vary from some central value. Since measures
they are also called averages of the second order. An average is more meaningful
when it is examined in the light of dispersion. For example, if the average wage of the
workers of factory A is Rs. 3885 and that of factory B Rs. 3900, we cannot necessarily
conclude that the workers of factory B are better off because in factory B there may be
much greater dispersion in the distribution of wages. The study of dispersion is of great
100 100 1
100 102 2
100 103 3
100 90 5
arithmetic mean and hence there is no dispersion. In series B, only one item is perfectly
represented by the arithmetic mean and the other items vary but the variation is very
arithmetic mean and the items vary widely from one another. In series C, dispersion is
much greater compared to series B. Similarly, we may have two groups of labourers
with the same mean salary and yet their distributions may differ widely. The mean
salary may not be so important a characteristic as the variation of the items from the
mean. To the student of social affairs the mean income is not so vitally important as to
know how this income is distributed. Are a large number receiving the mean income or
are there a few with enormous incomes and millions with incomes far below the mean?
The three figures given in Box 3.1 represent frequency distributions with some of the
distributions with the same mean X̄, but with different dispersions. The two curves in
(b) represent two distributions with the same dispersion but with unequal means X̄1
and X̄2; (c) represents two distributions with unequal dispersion. The measures of
central tendency are, therefore insufficient. They must be supported and supplemented
In the present chapter, we shall be especially concerned with the measures of variability
extent to which there are differences between individual observation and some central
variation or its degree but not in the direction. For example, a measure of 6 inches below
the mean has just as much dispersion as a measure of six inches above the mean.
The literal meaning of dispersion is ‘scatteredness’. An average or measure of central
tendency gives us an idea of the concentration of the observations about the central part
of the distribution. If we know the average alone, we cannot form a complete idea about
the distribution. But with the help of dispersion, we have an idea about homogeneity or
VARIATION
the mass. When dispersion is small, the average is a typical value in the sense
that it closely represents the individual value and it is reliable in the sense that
hand, when dispersion is large, the average is not so typical, and unless the
in body temperature, pulse beat and blood pressure are the basic guides to
the causes of which are sought through inspection is basic to the control of
with regard to their variability. The study of variation may also be looked
upon as a means of determining uniformity of consistency. A high degree of
Deviation, Mean deviation, Standard Deviation, and Lorenz curve. Among them, the
first four are mathematical methods and the last one is the graphical method. These
3.5 RANGE
The simplest measure of dispersion is the range, which is the difference between the
Example 3.1: Find the range for the following three sets of data:
Set 1: 05 15 15 05 15 05 15 15 15 15
Set 2: 8 7 15 11 12 5 13 11 15 9
Set 3: 5 5 5 5 5 5 5 5 5 5
Solution: In each of these three sets, the highest number is 15 and the lowest number
is 5. Since the range is the difference between the maximum value and the minimum
value of the data, it is 10 in each case. But the range fails to give any idea about the
dispersal or spread of the series between the highest and the lowest value. This becomes
upper limit of the highest class and the lower limit of the lowest class.
Example 3.2: Find the range for the following frequency distribution:
Solution: Here, the upper limit of the highest class is 120 and the lower limit of the
lowest class is 20. Hence, the range is 120 - 20 = 100. Note that the range is not
S, where L is the largest value and S is the smallest value in a distribution. The
coefficient of range is calculated by the formula: (L-S)/ (L+S). This is the relative
measure. The coefficient of the range in respect of the earlier example having three sets
of data is: 0.5. The coefficient of range is more appropriate for purposes of comparison
Example 3.3: Calculate the coefficient of range separately for the two sets of data
given below:
Set 1 8 10 20 9 15 10 13 28
Set 2 30 35 42 50 32 49 39 33
Solution: It can be seen that the range in both the sets of data is the same:
Set 1 28 - 8 = 20
Set 2 50 - 30 = 20
Coefficient of range in set 1 is:
\[ \frac{28 - 8}{28 + 8} = 0.56 \text{ approx.} \]
Coefficient of range in set 2 is:
\[ \frac{50 - 30}{50 + 30} = 0.25 \]
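The comparison in Example 3.3 can be reproduced with a few lines of Python. The function names are our own; the data are the two sets given in the example.

```python
def value_range(data):
    """Absolute measure: L - S."""
    return max(data) - min(data)


def coefficient_of_range(data):
    """Relative measure: (L - S) / (L + S)."""
    largest, smallest = max(data), min(data)
    return (largest - smallest) / (largest + smallest)


set_1 = [8, 10, 20, 9, 15, 10, 13, 28]
set_2 = [30, 35, 42, 50, 32, 49, 39, 33]

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    print(name, value_range(data), round(coefficient_of_range(data), 2))
```

Both sets have the same range, yet the relative measure shows that set 1 varies far more in proportion to the size of its values.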
1. It is based only on two items and does not cover all the items in a distribution.
population.
3. It fails to give any idea about the pattern of distribution. This was evident from
the range.
Despite these limitations of the range, it is mainly used in situations where one wants
to quickly have some idea of the variability of a set of data. When the sample size is
very small, the range is considered quite an adequate measure of the variability. Thus, it
is widely used in quality control where a continuous check on the variability of raw
materials or finished products is needed. The range is also a suitable measure in weather
forecast. The meteorological department uses the range by giving the maximum and the
minimum temperatures. This information is quite useful to the common man, as he can
3.6 INTERQUARTILE RANGE OR QUARTILE DEVIATION
distribution than the range. Here, avoiding the 25 percent of the distribution at both
the ends uses the middle 50 percent of the distribution. In other words, the interquartile
range denotes the difference between the third quartile and the first quartile.
Many times the interquartile range is reduced in the form of semi-interquartile range
When quartile deviation is small, it means that there is a small deviation in the central
50 percent items. In contrast, if the quartile deviation is high, it shows that the central
distribution, the two quartiles, that is, Q3 and Q1 are equidistant from the median.
Symbolically,
M − Q1 = Q3 − M
However, this is seldom the case as most of the business and economic data are
asymmetrical. But, one can assume that approximately 50 percent of the observations
are contained in the interquartile range. It may be noted that interquartile range or the
\[ \text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
upper and lower quartiles. As the computation of the two quartiles has already been
3.6.1 MERITS OF QUARTILE DEVIATION
The mean deviation is also known as the average deviation. As the name implies, it is
the average of absolute amounts by which the individual items deviate from the mean.
Since the positive deviations from the mean are equal to the negative deviations, while
Symbolically,
from the mean ignoring positive and negative signs, n = the total number of
observations.
Example 3.4:
Solution:
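Size of item   Mid-point (m)   Frequency (f)   fm     d = m − x̄   f|d|
2-4            3               20              60     -2.6        52
4-6            5               40              200    -0.6        24
6-8            7               30              210    1.4         42
8-10           9               10              90     3.4         34
Total                          100             560                152
\[ \bar{x} = \frac{\sum fm}{n} = \frac{560}{100} = 5.6 \]
\[ MD(\bar{x}) = \frac{\sum f|d|}{n} = \frac{152}{100} = 1.52 \]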
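The working of Example 3.4 can be sketched in Python. The variable names are ours; the mid-points and frequencies are those in the table of the example.

```python
midpoints = [3, 5, 7, 9]     # m: mid-points of classes 2-4, 4-6, 6-8, 8-10
freqs = [20, 40, 30, 10]     # f
n = sum(freqs)

# Arithmetic mean of the grouped data: sum(f * m) / n
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Mean deviation about the mean: sum(f * |m - mean|) / n, signs ignored
mean_deviation = sum(f * abs(m - mean) for f, m in zip(freqs, midpoints)) / n
print(round(mean, 2), round(mean_deviation, 2))
```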
easy to calculate.
2. It takes into consideration each and every item in the distribution. As a result,
a change in the value of any item will have its effect on the magnitude of mean
deviation.
3. The values of extreme items have less effect on the value of the mean deviation.
2. At times it may fail to give accurate results. The mean deviation gives best
results when deviations are taken from the median instead of from the mean.
But in a series, which has wide variations in the items, median is not a
satisfactory measure.
algebraic signs when the deviations are taken from the mean.
The standard deviation is similar to the mean deviation in that here too the deviations
are measured from the mean. At the same time, the standard deviation is preferred to
the mean deviation or the quartile deviation or the range because it has desirable
mathematical properties.
Before defining the concept of the standard deviation, we introduce another concept
viz. variance.
Example 3.5:
X        X − μ             (X − μ)^2
20       20 − 18 = 2       4
15       15 − 18 = −3      9
19       19 − 18 = 1       1
24       24 − 18 = 6       36
16       16 − 18 = −2      4
14       14 − 18 = −4      16
108      Total             70
Solution:
\[ \text{Mean} = \frac{108}{6} = 18 \]
The second column shows the deviations from the mean. The third or the last column
shows the squared deviations, the sum of which is 70. The arithmetic mean of the
\[ \frac{\sum (x - \mu)^2}{N} = \frac{70}{6} = 11.67 \text{ approx.} \]
This mean of the squared deviations is known as the variance. It may be noted that
this variance is described by different terms that are used interchangeably: the variance
It is also written as
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
Although the variance is a measure of dispersion, the unit of its measurement is (points)^2.
If a distribution relates to income of families then the variance is (Rs)^2 and not rupees.
variance is (marks)^2. To overcome this inadequacy, the square root of variance is taken,
which yields a better measure of dispersion known as the standard deviation. Taking
our earlier example of individual observations, we take the square root of the variance
Symbolically, \( \sigma = \sqrt{11.67} = 3.42 \)
In applied Statistics, the standard deviation is more frequently used than the variance.
\[ \sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2} \]
We use this formula to calculate the standard deviation from the individual
Example 3.6:
X X2
20 400
15 225
19 361
24 576
16 256
14 196
108 2014
Solution:
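\[ \sum x_i^2 = 2014, \quad \sum x_i = 108, \quad N = 6 \]
\[ \sigma = \sqrt{\frac{2014}{6} - \left(\frac{108}{6}\right)^2} = \sqrt{\frac{12084 - 11664}{36}} = \sqrt{\frac{420}{36}} = \sqrt{11.67} = 3.42 \]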
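Both routes to the standard deviation, from squared deviations and from the sums of x and x^2, can be compared in a short sketch. The variable names are our own; the data are the six observations used in the worked examples.

```python
import math

data = [20, 15, 19, 24, 16, 14]
n = len(data)
mean = sum(data) / n  # 18

# Definition: variance is the mean of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / n
sd_direct = math.sqrt(variance)

# Shortcut: sqrt(sum(x^2)/N - (sum(x)/N)^2), no deviations needed
sd_shortcut = math.sqrt(sum(x * x for x in data) / n - (sum(data) / n) ** 2)
print(round(variance, 2), round(sd_direct, 2), round(sd_shortcut, 2))
```

Note that the divisor here is N, as in the text, so this is the population form of the standard deviation.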
Example 3.7:
60- 70 6
70- 80 3
80- 90 2
90-100 1
Solution:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N}} \]
Where m_i is the mid-point of the ith class interval, μ is the mean of the distribution, f_i is
the frequency of each class, N is the total number of frequency and K is the number of
classes. This formula requires that the mean be calculated and that deviations (m_i −
μ) be obtained for each class. To avoid this inconvenience, the above formula can be
modified as:
\[ \sigma = C \sqrt{\frac{\sum_{i=1}^{K} f_i d_i^2}{N} - \left(\frac{\sum_{i=1}^{K} f_i d_i}{N}\right)^2} \]
Where C is the class interval, f_i is the frequency of the ith class, d_i is the deviation
of the ith item from an assumed origin, measured in units of the class interval, and N is the total number of observations.
\[ \sigma = 10 \sqrt{\frac{231}{55} - \left(\frac{-45}{55}\right)^2} = 10 \sqrt{4.2 - 0.669421} = 18.8 \text{ marks} \]
When it becomes clear that the actual mean would turn out to be in fraction, calculating
deviations from the mean would be too cumbersome. In such cases, an assumed
mean is used and the deviations from it are calculated. While mid- point of any
class can be taken as an assumed mean, it is advisable to choose the mid-point
of that class that would make calculations least cumbersome. Guided by this
consideration, in Example 3.7 we have decided to choose 55 as the mid-point
and, accordingly, deviations have been taken from it. It will be seen from the
calculations that they are considerably simplified.
3.8.1 USES OF THE STANDARD DEVIATION
determine as to how far individual items in a distribution deviate from its mean. In a
(i) About 68 percent of the values in the population fall within ±1 standard deviation from the mean.
(ii) About 95 percent of the values will fall within ±2 standard deviations from the
mean.
(iii) About 99.7 percent of the values will fall within ±3 standard deviations from
the mean.
the same units as the original data. As such, it cannot be a suitable measure while
comparing two or more distributions. For this purpose, we should use a relative measure
which relates the standard deviation and the mean such that the standard deviation is
expressed as a percentage of mean. Thus, the specific unit in which the standard
deviation is measured is done away with and the new unit becomes percent.
Symbolically, CV (coefficient of variation) = \( \frac{\sigma}{\bar{x}} \times 100 \)
Example 3.8: In a small business firm, two typists are employed-typist A and typist
B. Typist A types out, on an average, 30 pages per day with a standard deviation of 6.
Typist B, on an average, types out 45 pages with a standard deviation of 10. Which
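Solution:
\[ \text{Coefficient of variation for A} = \frac{\sigma}{\bar{x}} \times 100 = \frac{6}{30} \times 100 = 20\% \]
and
\[ \text{Coefficient of variation for B} = \frac{\sigma}{\bar{x}} \times 100 = \frac{10}{45} \times 100 = 22.2\% \]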
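The comparison of the two typists can be sketched in Python. The function name `coefficient_of_variation` is our own.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


cv_a = coefficient_of_variation(sd=6, mean=30)   # typist A
cv_b = coefficient_of_variation(sd=10, mean=45)  # typist B
print(round(cv_a, 1), round(cv_b, 1))

# The smaller coefficient marks the more consistent typist
more_consistent = "A" if cv_a < cv_b else "B"
print(more_consistent)
```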
These calculations clearly indicate that although typist B types out more pages, there
is a greater variation in his output as compared to that of typist A. We can say this in a
different way: Though typist A's daily output is much less, he is more consistent than
two groups of data having different means, as has been the case in the above example.
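The comparison in Example 3.8 can be reproduced with a few lines of code. This is a minimal sketch; the function name `coefficient_of_variation` is an illustrative choice, not something defined in the notes.

```python
def coefficient_of_variation(mean: float, sd: float) -> float:
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Figures from Example 3.8: (mean pages per day, standard deviation).
cv_a = coefficient_of_variation(30, 6)
cv_b = coefficient_of_variation(45, 10)

print(f"CV of typist A: {cv_a:.1f}%")
print(f"CV of typist B: {cv_b:.1f}%")
print("More consistent typist:", "A" if cv_a < cv_b else "B")
```

The smaller coefficient belongs to the more consistent typist, regardless of who types more pages.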
in units of the standard deviation, is called a standardised variable. Since both the
numerator and the denominator are in the same units, a standardised variable is
independent of units used. If deviations from the mean are given in units of the standard
Through this concept of standardised variable, proper comparisons can be made
compositions differ.
Example 3.9: A student has scored 68 marks in Statistics for which the average
marks were 60 and the standard deviation was 10. In the paper on Marketing, he scored
74 marks for which the average marks were 68 and the standard deviation was
15. In which paper, Statistics or Marketing, was his relative standing higher?
the mean $\bar{x}$ in terms of standard deviation s. For Statistics, Z = (68 − 60) ÷ 10 = 0.8; for Marketing, Z = (74 − 68) ÷ 15 = 0.4.
Since the standard score is 0.8 in Statistics as compared to 0.4 in Marketing, his
Example 3.10: Convert the set of numbers 6, 7, 5, 10 and 12 into standard scores:
Solution:
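X        X²
6        36
7        49
5        25
10       100
12       144
ΣX = 40  ΣX² = 354

$\bar{x} = \frac{\sum X}{N} = \frac{40}{5} = 8$

$\sigma = \sqrt{\frac{\sum X^2}{N} - \bar{x}^2} = \sqrt{\frac{354}{5} - 8^2} = \sqrt{6.8} = 2.61$

Standard score: $Z = \frac{X - \bar{x}}{\sigma}$
(i) For X = 6: (6 − 8) ÷ 2.61 = -0.77
(ii) For X = 7: (7 − 8) ÷ 2.61 = -0.38
(iii) For X = 5: (5 − 8) ÷ 2.61 = -1.15
(iv) For X = 10: (10 − 8) ÷ 2.61 = 0.77
(v) For X = 12: (12 − 8) ÷ 2.61 = 1.53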
Thus the standard scores for 6,7,5,10 and 12 are -0.77, -0.38, -1.15, 0.77 and 1.53,
respectively.
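The conversion in Example 3.10 can be sketched in code as follows. The function name `standard_scores` is an illustrative choice; the standard deviation is computed with N in the denominator, as in the worked solution.

```python
import math


def standard_scores(values):
    """Convert raw values to standard scores (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    return [(v - mean) / sd for v in values]


scores = standard_scores([6, 7, 5, 10, 12])
print([round(z, 2) for z in scores])
```

Rounded to two decimals, the scores agree with the values obtained by hand.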
This measure of dispersion is graphical. It is known as the Lorenz curve named after
Dr. Max Lorenz. It is generally used to show the extent of concentration of income
and wealth. The steps involved in plotting the Lorenz curve are:
2. Calculate percentage for each item taking the total equal to 100.
3. Choose a suitable scale and plot the cumulative percentages of the persons and
income. Use the horizontal axis of X to depict percentages of persons and the
4. Show the line of equal distribution, which will join 0 of X-axis with 100 of Y-
axis.
5. The curve obtained in (3) above can now be compared with the straight line of
equal distribution obtained in (4) above. If the Lorenz curve is close to the line
of equal distribution, then it implies that the dispersion is much less. If, on the
contrary, the Lorenz curve is farther away from the line of equal distribution,
The Lorenz curve is a simple graphical device to show the disparities of distribution
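The plotting steps for the Lorenz curve reduce to computing cumulative percentages of persons and of income. The sketch below uses a small hypothetical income list (not data from the notes) and an illustrative function name, `lorenz_points`.

```python
def lorenz_points(incomes):
    """Return (cumulative % of persons, cumulative % of income) pairs.

    Incomes are sorted in ascending order first, so the curve lies on or
    below the line of equal distribution.
    """
    ordered = sorted(incomes)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, income in enumerate(ordered, start=1):
        running += income
        points.append((100 * i / n, 100 * running / total))
    return points


# Hypothetical incomes of four persons, for illustration only.
points = lorenz_points([4, 1, 2, 1])
for persons_pct, income_pct in points:
    print(f"{persons_pct:6.1f}% of persons hold {income_pct:6.1f}% of income")
```

On the line of equal distribution every point would have equal coordinates; the further the second coordinate falls below the first, the greater the inequality.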
Figure 3.1 shows two Lorenz curves by way of illustration. The straight line AB is a
line of equal distribution, whereas AEB shows complete inequality. Curve ACB and
As curve ACB is nearer to the line of equal distribution, it has more equitable
distribution of income than curve ADB. Assuming that these two curves are for the
same company, this may be interpreted in a different manner. Prior to taxation, the curve
ADB showed greater inequality in the income of its employees. After taxation, the
company’s data resulted in the ACB curve, which is closer to the line of equal
be repeated here that frequency distributions differ in three ways: Average value,
Variability or dispersion, and Shape. Since the first two, that is, average value and
variability or dispersion have already been discussed in previous chapters, here our
main spotlight will be on the shape of frequency distribution. Generally, there are two
distribution. Two distributions may have the same mean and standard deviation but may
differ widely in their overall appearance as can be seen from the following:
distributions.
symmetrical distribution the mean, median and mode are identical. The more
the mean moves away from the mode, the larger the asymmetry or skewness."
4. "A distribution is said to be 'skewed' when the mean and the median fall at
different points in the distribution, and the balance (or centre of gravity) is
The above definitions show that the term 'skewness' refers to lack of symmetry, i.e.,
distribution.
The concept of skewness will be clear from the following three diagrams showing a
distribution.
metrical distribution the values of mean, median and mode coincide. The spread
2. Asymmetrical Distribution. A
the mean is maximum and that of mode least-the median lies in between the two
maximum and that of mean least-the median lies in between the two. In the
positively skewed distribution the frequencies are spread out over a greater
range of values on the high-value end of the curve (the right-hand side) than
they are on the low-value end. In the negatively skewed distribution the position
is reversed, i.e. the excess tail is on the left-hand side. It should be noted that in
moderately symmetrical distributions the interval between the mean and the
median is approximately one-third of the interval between the mean and the
of skewness.
In order to ascertain whether a distribution is skewed or not the following tests may
be applied. Skewness is present if:
1. The values of mean, median and mode do not coincide.
2. When the data are plotted on a graph they do not give the normal bell-
shaped form i.e. when cut along a vertical line through the centre the two
3. The sum of the positive deviations from the median is not equal to the sum
the mode.
3. Sum of the positive deviations from the median is equal to the sum of the
negative deviations.
4. Quartiles are equidistant from the median.
mode.
There are four measures of skewness, each divided into absolute and relative measures.
The relative measure is known as the coefficient of skewness and is more frequently
used than the absolute measure of skewness. Further, when a comparison between two
The measures of skewness are: (i) Karl Pearson's measure, (ii) Bowley’s measure, (iii)
Kelly’s measure, and (iv) Moment’s measure. These measures are discussed briefly
below:
than the mode or less than the mode. If it is greater than the mode, then skewness is
positive. But when the mean is less than the mode, it is negative. The difference between
the mean and mode indicates the extent of departure from symmetry. It is measured in
measurement. It may be recalled that this observation was made in the preceding
zero, when the distribution is symmetrical. Normally, this coefficient of skewness lies
between −1 and +1. If the mean is greater than the mode, then the coefficient of skewness will
Example 3.11: Given the following data, calculate the Karl Pearson's coefficient of
Solution:
Mean: $\bar{x} = \frac{\sum X}{N} = \frac{452}{10} = 45.2$

Standard deviation: $SD = \sqrt{\frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2}$
Applying the values of mean, mode and standard deviation in the above formula,
This shows that there is a positive skewness though the extent of skewness is
marginal.
Example 3.12: From the following data, calculate the measure of skewness using the
X 10 - 20 20 - 30 30 - 40 40 - 50 50-60 60 - 70 70 - 80
f 18 30 40 55 38 20 16
Solution:
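x         MVx    dx    f     f·dx   f·dx²   cf
10 - 20   15     -3    18    -54    162     18
20 - 30   25     -2    30    -60    120     48
30 - 40   35     -1    40    -40    40      88
40 - 50   45=a   0     55    0      0       143
50 - 60   55     1     38    38     38      181
60 - 70   65     2     20    40     80      201
70 - 80   75     3     16    48     144     217
Total                  217   -28    584

a = Assumed mean = 45, MVx = Mid-value of the class, cf = Cumulative frequency, dx = Deviation from assumed
mean (in class-interval units), and i = 10

$\bar{x} = a + \frac{\sum f\,dx}{N} \times i = 45 - \frac{28}{217} \times 10 = 43.71$

$\text{Median} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$, where m = (217 + 1) ÷ 2 = 109th item

$\text{Median} = 40 + \frac{50 - 40}{55}(109 - 88) = 40 + \frac{10}{55} \times 21 = 43.82$

$SD = \sqrt{\frac{\sum f\,dx^2}{\sum f} - \left(\frac{\sum f\,dx}{\sum f}\right)^2} \times 10 = \sqrt{\frac{584}{217} - \left(\frac{-28}{217}\right)^2} \times 10 = 16.4$

Skewness = 3 (Mean − Median) = 3 (43.71 − 43.82) = 3 × (−0.11) = -0.33

Coefficient of skewness = Skewness ÷ SD = -0.33 ÷ 16.4 = -0.02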
The result shows that the distribution is negatively skewed, but the extent of skewness
is extremely negligible.
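The calculation in Example 3.12 can be retraced in code. This is a sketch that follows the same conventions as the worked solution (assumed mean 45, class width 10, median position (N + 1)/2); the variable names are illustrative.

```python
import math

# Frequency distribution from Example 3.12.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [18, 30, 40, 55, 38, 20, 16]
a, i = 45, 10  # assumed mean and class width

n = sum(freqs)
dx = [((lo + hi) / 2 - a) / i for lo, hi in classes]  # step deviations
sum_fdx = sum(f * d for f, d in zip(freqs, dx))
sum_fdx2 = sum(f * d * d for f, d in zip(freqs, dx))

mean = a + sum_fdx / n * i
sd = math.sqrt(sum_fdx2 / n - (sum_fdx / n) ** 2) * i

# Median by linear interpolation inside the class containing the m-th item.
m = (n + 1) / 2
cum = 0
for (lo, hi), f in zip(classes, freqs):
    if cum + f >= m:
        median = lo + (hi - lo) / f * (m - cum)
        break
    cum += f

coefficient = 3 * (mean - median) / sd
print(round(mean, 2), round(median, 2), round(sd, 2), round(coefficient, 2))
```

The coefficient comes out very slightly negative, in line with the conclusion that the skewness is negligible.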
Where Q3 and Q1 are upper and lower quartiles and M is the median. The value of this
skewness varies between −1 and +1. In the case of open-ended distribution as well as where
extreme values are found in the series, this measure is particularly useful. In a
symmetrical distribution, skewness is zero. This means that Q3 and Q1 are positioned
when the distribution is skewed, then Q3 − Q2 will be different from Q2 − Q1. When Q3
than Q2 − Q1, then skewness is negative. Bowley’s measure of skewness can be written
as:
comparing two distributions where the units of measurement are different. In view of
Relative Skewness = $\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)}$

= $\frac{Q_3 - Q_2 - Q_2 + Q_1}{Q_3 - Q_2 + Q_2 - Q_1}$

= $\frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$

= $\frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$
Solution:
Bowley's coefficient of skewness is: $Sk_B = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$

$Sk_B = \frac{Q_3 + 16.4 - (2 \times 24.2)}{Q_3 - 16.4}$

$-0.56 = \frac{Q_3 + 16.4 - 48.4}{Q_3 - 16.4}$

$-0.56\,Q_3 + 9.184 = Q_3 - 32$

$-1.56\,Q_3 = -41.184$

$Q_3 = \frac{-41.184}{-1.56} = 26.4$

Now, we have the values of both the upper and the lower quartiles.

Coefficient of quartile deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{26.4 - 16.4}{26.4 + 16.4} = \frac{10}{42.8} = 0.234$ approx.
data:
Value in Rs Frequency
Less than 50 40
50 - 100 80
150 – 200 60
Solution: It should be noted that the series given in the question is an open-ended series.
most appropriate measure of skewness in this case. In order to calculate the quartiles
and the median, we have to use the cumulative frequency. The table is reproduced below
Less than 50 40 40
50 - 100 80 120
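$Q_1 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$

Now m = (n + 1)/4th item = 341 ÷ 4 = 85.25, which lies in 50 - 100 class

$Q_1 = 50 + \frac{100 - 50}{80}(85.25 - 40) = 78.28$

For the median, m = (n + 1)/2th item = 341 ÷ 2 = 170.5, which lies in 100 - 150 class

$M = 100 + \frac{150 - 100}{130}(170.5 - 120) = 119.4$

$Q_3 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$

m = 3(341) ÷ 4 = 255.75

$Q_3 = 150 + \frac{200 - 150}{60}(255.75 - 250) = 154.79$

Bowley's coefficient of skewness = $\frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1} = \frac{154.79 + 78.28 - 2 \times 119.4}{154.79 - 78.28}$ = - 0.075 approx.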
This shows that there is a negative skewness, which has a very negligible magnitude.
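Bowley's coefficient needs only the two quartiles and the median, so it is easy to check by machine. The sketch below uses an illustrative function name, `bowley_skewness`, and plugs in the quartile values computed in the two worked solutions above.

```python
def bowley_skewness(q1: float, median: float, q3: float) -> float:
    """Bowley's coefficient: (Q3 + Q1 - 2M) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Open-ended series: Q1 = 78.28, M = 119.4, Q3 = 154.79.
open_ended = bowley_skewness(78.28, 119.4, 154.79)

# Earlier solution: Q1 = 16.4, M = 24.2 and the derived Q3 = 26.4.
earlier = bowley_skewness(16.4, 24.2, 26.4)

print(round(open_ended, 3), round(earlier, 2))
```

Because only quartiles enter the formula, the unknown limits of the open-ended classes never have to be guessed.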
Or, $\frac{D_1 + D_9 - 2M}{D_9 - D_1}$
Where P and D stand for percentile and decile respectively. In order to calculate the
coefficient of skewness by this formula, we have to ascertain the values of 10th, 50th
and 90th percentiles. Somehow, this measure of skewness is seldom used. All the
Class Intervals   f     cf
10 - 20           18    18
20 - 30           30    48
30 - 40           40    88
40 - 50           55    143
50 - 60           38    181
60 - 70           20    201
70 - 80           16    217
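$P_{10} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$, where m = (n + 1)/10th item = (217 + 1) ÷ 10 = 21.8th item

$P_{50}$ (median): where m = (n + 1)/2th item = (217 + 1) ÷ 2 = 109th item

With $P_{10}$ = 21.27, $P_{50}$ = 43.82 and $P_{90}$ = 67.60,

Kelley's skewness = $\frac{P_{10} + P_{90} - 2P_{50}}{P_{90} - P_{10}} = \frac{88.87 - 87.64}{46.33}$ = 0.027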
This shows that the series is positively skewed though the extent of skewness is
extremely negligible. It may be recalled that if there is a perfectly symmetrical
distribution, then the skewness will be zero. One can see that the above answer
is very close to zero.
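The percentile-based measure can be retraced in code. This is a sketch under the same convention as the worked solution, where the p-th percentile is located at the p(n + 1)/100th item; the function name `grouped_percentile` is an illustrative choice.

```python
# Frequency distribution used in the example above.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [18, 30, 40, 55, 38, 20, 16]


def grouped_percentile(classes, freqs, p):
    """Percentile of grouped data; p is a whole number between 0 and 100."""
    n = sum(freqs)
    m = p * (n + 1) / 100  # position of the required item
    cum = 0
    for (lo, hi), f in zip(classes, freqs):
        if cum + f >= m:
            # Linear interpolation inside the class containing item m.
            return lo + (hi - lo) / f * (m - cum)
        cum += f
    raise ValueError("percentile position lies beyond the last class")


p10 = grouped_percentile(classes, freqs, 10)
p50 = grouped_percentile(classes, freqs, 50)
p90 = grouped_percentile(classes, freqs, 90)
kelly = (p10 + p90 - 2 * p50) / (p90 - p10)
print(round(p10, 2), round(p50, 2), round(p90, 2), round(kelly, 3))
```

The three percentiles come out at roughly 21.27, 43.82 and 67.60, giving a coefficient close to zero.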
3.13 SUMMARY
The average value cannot adequately describe a set of observations, unless all the
observations are the same. It is necessary to describe the variability or dispersion
of the observations. In two or more distributions the central value may be the
same but still there can be wide disparities in the formation of distribution.
Therefore, we have to use the measures of dispersion.
Further, two distributions may have the same mean and standard deviation but may
distinguish between different types of distributions, we may use the measures of
skewness.
2. “Variability is not an important factor because even though the outcome is more certain,
you still have an equal chance of falling either above or below the median.
Therefore, on an average, the outcome will be the same.” Do you agree with this
3. Why is the standard deviation the most widely used measure of dispersion? Explain.
5. What are the different measures of skewness? Which one is repeatedly used?
Correlation Analysis
...if we have information on more than one variable, we might be interested in seeing if
there is any connection - any association - between them.
4.1 INTRODUCTION
Statistical methods of measures of central tendency, dispersion, skewness and kurtosis are
helpful for the purpose of comparison and analysis of distributions involving only one variable,
i.e. univariate distributions. However, describing the relationship between two or more
In many business research situations, the key to decision making lies in understanding the
relationships between two or more variables. For example, in an effort to predict the behavior
of the bond market, a broker might find it useful to know whether the interest rate of bonds is
related to the prime interest rate. While studying the effect of advertising on sales, an account
executive may find it useful to know whether there is a strong relationship between advertising
The statistical methods of Correlation (discussed in the present lesson) and Regression (to be
discussed in the next lesson) are helpful in knowing the relationship between two or more
variables which may be related in some way, like interest rate of bonds and prime interest
rate; advertising expenditure and sales; income and consumption; crop-yield and fertilizer used;
In all these cases involving two or more variables, we may be interested in seeing:
➢ if so, what form the relationship between the two variables takes;
➢ how we can make use of that relationship for predictive purposes, that is, forecasting;
and
Since these issues are interrelated, correlation and regression analysis, as two sides of a
single process, consist of methods of examining the relationship between two or more
variables. If two (or more) variables are correlated, we can use information about one (or
more) variable(s) to predict the value of the other variable(s), and can measure the error
Correlation is a measure of association between two or more variables. When two or more
exploratory research when the objective is to locate variables that might be related in some way
Correlation can be classified in several ways. The important ways of classifying correlation
are:
If both the variables move in the same direction, we say that there is a positive correlation, i.e.,
if one variable increases, the other variable also increases on an average or if one variable
On the other hand, if the variables are varying in opposite direction, we say that it is a case of
If the change in one variable is accompanied by change in another variable in a constant ratio,
X : 10 20 30 40 50
Y : 25 50 75 100 125
The ratio of change in the above example is the same. It is, thus, a case of linear correlation.
If we plot these variables on graph paper, all the points will fall on the same straight line.
On the other hand, if the amount of change in one variable does not follow a constant ratio with
of figures in either series X or series Y are changed, it would give a non-linear correlation.
The distinction amongst these three types of correlation depends upon the number of variables
involved in a study. If only two variables are involved in a study, then the correlation is said to
be simple correlation. When three or more variables are involved in a study, then it is a problem
of either partial or multiple correlation. In multiple correlation, three or more variables are
Suppose we have a problem comprising three variables X, Y and Z. X is the number of hours
studied, Y is I.Q. and Z is the number of marks obtained in the examination. In a multiple
correlation, we will study the relationship between the marks obtained (Z) and the two
variables, number of hours studied (X) and I.Q. (Y). In contrast, when we study the
relationship between X and Z, keeping an average I.Q. (Y) as constant, it is said to be a study
The correlation analysis, in discovering the nature and degree of relationship between variables,
does not necessarily imply any cause and effect relationship between the variables. Two
variables may be related to each other but this does not mean that one variable causes the
other. For example, we may find that logical reasoning and creativity are correlated, but that
does not mean that if we could increase people’s logical reasoning ability, we would produce greater
relationship. But if it is true that influencing someone’s logical reasoning ability does influence
their creativity, then the two variables must be correlated with each other. In other words,
1. The correlation may be due to chance particularly when the data pertain to a small
sample. A small sample bivariate series may show the relationship but such a
2. It is possible that both the variables are influenced by one or more other variables.
households show a positive relationship because both have increased over time. But,
this is due to rise in family incomes over the same period. In other words, the two
3. There may be another situation where both the variables may be influencing each
other so that we cannot say which is the cause and which is the effect. For example,
take the case of price and demand. The rise in price of a commodity may lead to a
decline in the demand for it. Here, price is the cause and the demand is the effect.
In yet another situation, an increase in demand may lead to a rise in price. Here, the
demand is the cause while price is the effect, which is just the reverse of the earlier
The foregoing discussion clearly shows that correlation does not indicate any causation or
has nothing to do with cause and effect relation. It only reveals co-variation between two
variables. Even when there is no cause-and-effect relationship in bivariate series and one
correlation. Obviously, this will be misleading. As such, one has to be very careful in
correlation exercises and look into other relevant factors before concluding a cause-and-effect
relationship.
Correlation Analysis is a statistical technique used to indicate the nature and degree of
relationship existing between one variable and the other(s). It is also used along with regression
analysis to measure how well the regression line explains the variations of the dependent
The commonly used methods for studying linear relationship between two variables involve
both graphic and algebraic methods. Some of the widely used methods include:
1. Scatter Diagram
2. Correlation Graph
3. Pearson’s Coefficient of Correlation
This method is also known as Dotogram or Dot diagram. Scatter diagram is one of the simplest
the variables are plotted on the graph paper by putting dots. The diagram so obtained is called
"Scatter Diagram". By studying the diagram, we can get a rough idea about the nature and degree
of relationship between two variables. The term scatter refers to the spreading of dots on the
graph. We should keep the following points in mind while interpreting correlation:
➢ if the plotted points are very close to each other, it indicates high degree of correlation.
If the plotted points are away from each other, it indicates low degree of correlation.
Figure 4-1 Scatter Diagrams
➢ if the points on the diagram reveal any trend (either upward or downward), the variables
➢ if there is an upward trend rising from lower left hand corner and going upward to the
upper right hand corner, the correlation is positive since this reveals that the values of
the two variables move in the same direction. If, on the other hand, the points depict a
downward trend from the upper left hand corner to the lower right hand corner, the
correlation is negative since in this case the values of the two variables move in the
opposite directions.
➢ in particular, if all the points lie on a straight line starting from the left bottom and going
up towards the right top, the correlation is perfect and positive, and if all the points lie
on a straight line starting from left top and coming down to right bottom, the correlation
The various diagrams of the scattered data in Figure 4-1 depict different forms of correlation.
Example 4-1
Given the following data on sales (in thousand units) and expenses (in thousand rupees) of a
Month : J F M A M J J A S O
Sales: 50 50 55 60 62 65 68 60 60 50
Expenses: 11 13 14 16 16 15 15 14 13 13
a) Make a Scatter Diagram
b) Do you think that there is a correlation between sales and expenses of the
Solution:(a) The Scatter Diagram of the given data is shown in Figure 4-2
[Figure 4-2: Scatter diagram of sales and expenses]
(b) Figure 4-2 shows that the plotted points are close to each other and reveal an upward
trend. So there is a high degree of positive correlation between sales and expenses of the firm.
This method, also known as Correlogram, is very simple. The data pertaining to two series are
plotted on a graph sheet. We can find out the correlation by examining the direction and
closeness of two curves. If both the curves drawn on the graph are moving in the same direction,
it is a case of positive correlation. On the other hand, if both the curves are moving in opposite
direction, correlation is said to be negative. If the graph does not show any definite pattern
Example 4-2
Find out graphically, if there is any correlation between price yield per plot (qtls); denoted by
Plot No.: 1 2 3 4 5 6 7 8 9 10
Y: 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3
X: 6 8 9 12 10 15 17 20 18 24
Figure 4-3 shows that the two curves move in the same direction and, moreover, they are very
close to each other, suggesting a close relationship between price yield per plot (qtls) and
Remark: Both the graphic methods, scatter diagram and correlation graph, provide a ‘feel’
for the data by providing a visual representation of the association between the variables.
These are readily comprehensible and enable us to form a fairly good, though rough, idea
of the nature and degree of the relationship between the two variables. However, these methods
are unable to quantify the relationship between them. To quantify the extent of correlation, we
A mathematical method for measuring the intensity or the magnitude of linear relationship
between two variables was suggested by Karl Pearson (1857-1936), a great British
Biometrician and Statistician, and it is by far the most widely used method in practice.
Karl Pearson’s measure, known as Pearsonian correlation coefficient between two variables X
and Y, usually denoted by r(X,Y) or rxy or simply r is a numerical measure of linear relationship
between them and is defined as the ratio of the covariance between X and Y, to the product
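Symbolically

$r_{xy} = \frac{Cov(X, Y)}{S_x \cdot S_y}$ …………(4.1)

when $(X_1, Y_1); (X_2, Y_2); \ldots; (X_N, Y_N)$ are N pairs of observations of the variables X and
Y in a bivariate distribution,

$Cov(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$ …………(4.2a)

$S_x = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$ …………(4.2b)

and $S_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}}$ …………(4.2c)

Thus by substituting Eqs. (4.2) in Eq. (4.1), we can write the Pearsonian correlation
coefficient as

$r_{xy} = \frac{\frac{1}{N}\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\frac{1}{N}\sum (X - \bar{X})^2}\,\sqrt{\frac{1}{N}\sum (Y - \bar{Y})^2}}$ …………(4.3)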
If we denote $d_x = X - \bar{X}$ and $d_y = Y - \bar{Y}$

Then $r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \sum d_y^2}}$ …………(4.3a)
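We have

$Cov(X, Y) = \frac{1}{N}\sum (X - \bar{X})(Y - \bar{Y})$

$= \frac{1}{N}\sum XY - \bar{X}\bar{Y}$

$= \frac{1}{N}\sum XY - \frac{\sum X}{N}\cdot\frac{\sum Y}{N}$

$= \frac{1}{N^2}\left[N\sum XY - \sum X \sum Y\right]$ …………(4.4)

and $S_x^2 = \frac{1}{N}\sum (X - \bar{X})^2$

$= \frac{1}{N}\sum X^2 - (\bar{X})^2$

$= \frac{1}{N}\sum X^2 - \left(\frac{\sum X}{N}\right)^2$

$= \frac{1}{N^2}\left[N\sum X^2 - \left(\sum X\right)^2\right]$ …………(4.5a)

Similarly, we have

$S_y^2 = \frac{1}{N^2}\left[N\sum Y^2 - \left(\sum Y\right)^2\right]$ …………(4.5b)

So Pearsonian correlation coefficient may be found as

$r_{xy} = \frac{\frac{1}{N^2}\left[N\sum XY - \sum X \sum Y\right]}{\sqrt{\frac{1}{N^2}\left[N\sum X^2 - (\sum X)^2\right]}\,\sqrt{\frac{1}{N^2}\left[N\sum Y^2 - (\sum Y)^2\right]}}$

or $r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}}$ …………(4.6)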
Remark: Eq. (4.3) or Eq. (4.3a) is quite convenient to apply if the means $\bar{X}$ and
$\bar{Y}$ come out to be integers. If $\bar{X}$ or/and $\bar{Y}$ is (are) fractional then the Eq. (4.3) or Eq. (4.3a) is
used provided the values of X or/and Y are small. But if X and Y assume large values, the
Thus if (i) $\bar{X}$ and $\bar{Y}$ are fractional and (ii) X and Y assume large values, the Eq. (4.3) and Eq.
(4.6) are not generally used for numerical problems. In such cases, the step deviation method
where we take the deviations of the variables X and Y from any arbitrary points is used. We
-1 ≤ r ≤1
Remarks: (i) This property provides us a check on our calculations. If in any problem, the
obtained value of r lies outside the limits ±1, this implies that there is some mistake in our
calculations.
(ii) The sign of r indicates the nature of the correlation. A positive value of r indicates positive
correlation.
(iii) The following table sums up the degrees of correlation corresponding to various
values of r:
Value of r          Degree of correlation
±1                  perfect correlation
±0.90 or more       very high degree of correlation
±0.75 to ±0.90      sufficiently high degree of correlation
±0.60 to ±0.75      moderate degree of correlation
±0.30 to ±0.60      only the possibility of a correlation
less than ±0.30     possibly no correlation
0                   absence of correlation
$U = \frac{X - A}{h}$ and $V = \frac{Y - B}{k}$
Where A, B, h and k are constants and h > 0, k > 0; then the correlation coefficient
Remark: This is one of the very important properties of the correlation coefficient and is
extremely helpful in numerical computation of r. We had already stated that Eq. (4.3) and
Eq.(4.6) become quite tedious to use in numerical problems if X and/or Y are in fractions or if
X and Y are large. In such cases we can conveniently change the origin and scale (if possible)
in X or/and Y to get new variables U and V and compute the correlation between U and V by
$r_{xy} = r_{uv} = \frac{N\sum UV - \sum U \sum V}{\sqrt{N\sum U^2 - (\sum U)^2}\,\sqrt{N\sum V^2 - (\sum V)^2}}$ …………(4.7)
3. Two independent variables are uncorrelated but the converse is not true
If X and Y are independent variables then
rxy = 0
However, the converse of the theorem is not true i.e., uncorrelated variables need not
distribution.
X : 1 2 3 -3 -2 -1
Y : 1 4 9 9 4 1
Hence in the above example the variable X and Y are uncorrelated. But if we examine
the data carefully we find that X and Y are not independent but are connected by the
relation Y = X². The above example illustrates that uncorrelated variables need not be
independent.
Remarks: One should not be confused with the words uncorrelation and independence.
rxy = 0 i.e., uncorrelation between the variables X and Y simply implies the absence of any linear
(straight line) relationship between them. They may, however, be related in some other form
other than straight line e.g., quadratic (as we have seen in the above example), logarithmic or
trigonometric form.
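The Y = X² example above can be verified directly. The sketch below evaluates the correlation coefficient from the sums of the six pairs given in the text; the function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearsonian correlation coefficient computed from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


x = [1, 2, 3, -3, -2, -1]
y = [v ** 2 for v in x]  # Y is completely determined by X

r = pearson_r(x, y)
print(r)
```

Both ΣX and ΣXY are zero here, so the numerator vanishes and r is zero even though Y depends entirely on X.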
coefficients, i.e.
The signs of both the regression coefficients are the same, and so the value of r will
This property will be dealt with in detail in the next lesson on Regression Analysis.
5. The square of Pearsonian correlation coefficient is known as the coefficient of
determination.
variable that is accounted for by the independent variable, is a much better and useful
measure for interpreting the value of r. This property will also be dealt with in detail
The correlation coefficient establishes the relationship of the two variables. After ascertaining
this level of relationship, we may be interested to find the extent up to which this coefficient is
dependable. Probable error of the correlation coefficient is such a measure of testing the
reliability of the observed value of the correlation coefficient, when we consider it as satisfying
for the two variables under consideration, then the Probable Error, denoted by PE (r) is
expressed as
or $PE(r) = 0.6745\,\frac{1 - r^2}{\sqrt{N}}$
PE(r), implying that if we take another random sample of the size N from the same
population, then the observed value of the correlation coefficient in the second sample
can be expected to lie within the limits given above, with 0.5 probability. When sample
conclusions. Hence to use the concept of PE effectively, the sample size N should be
fairly large.
correlation.
Example 4-3
Find the Pearsonian correlation coefficient between sales (in thousand units) and expenses (in
Firm: 1 2 3 4 5 6 7 8 9 10
Sales: 50 50 55 60 65 65 65 60 60 50
Expenses: 11 13 14 16 16 15 15 14 13 13
Firm    X     Y     dx    dy    dx²    dy²    dx·dy
1       50    11    -8    -3    64     9      24
2       50    13    -8    -1    64     1      8
3       55    14    -3    0     9      0      0
4       60    16    2     2     4      4      4
5       65    16    7     2     49     4      14
6       65    15    7     1     49     1      7
7       65    15    7     1     49     1      7
8       60    14    2     0     4      0      0
9       60    13    2     -1    4      1      -2
10      50    13    -8    -1    64     1      8
Total   ΣX = 580    ΣY = 140    Σdx² = 360    Σdy² = 22    Σdx·dy = 70

$\bar{X} = \frac{\sum X}{N} = \frac{580}{10} = 58$ and $\bar{Y} = \frac{\sum Y}{N} = \frac{140}{10} = 14$

$r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \sum d_y^2}} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{\sqrt{7920}} = 0.79$

The value of rxy = 0.79 indicates a high degree of positive correlation between sales and expenses.
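The deviation method used in Example 4-3 can be sketched in code as follows. The function name `pearson_r_deviations` is an illustrative choice; the data are the sales and expenses figures of the example.

```python
import math


def pearson_r_deviations(xs, ys):
    """Correlation coefficient from deviations about the two means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

r = pearson_r_deviations(sales, expenses)
print(round(r, 4))
```

The result is about 0.787, which is 70 divided by the square root of 7920.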
Example 4-4
The data on price and quantity purchased relating to a commodity for 5 months is given
below:
Find the Pearsonian correlation coefficient between prices and quantity and comment on its
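ΣX = 55, ΣY = 21, ΣX² = 609, ΣY² = 95, ΣXY = 226

$r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}}$

$r_{xy} = \frac{5 \times 226 - 55 \times 21}{\sqrt{(5 \times 609 - 55 \times 55)(5 \times 95 - 21 \times 21)}}$

$r_{xy} = \frac{1130 - 1155}{\sqrt{20 \times 34}}$

$r_{xy} = \frac{-25}{\sqrt{680}}$

$r_{xy} = -0.96$

The negative sign of r indicates negative correlation and its large magnitude indicates a very high
degree of correlation. So there is a high degree of negative correlation between prices and
quantity demanded.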
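When only the column totals are available, as in Example 4-4, the coefficient can be evaluated from the sums alone. This is a sketch; the function name `pearson_r_from_sums` is an illustrative choice.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Correlation coefficient from N and the five column totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Totals from Example 4-4 (N = 5 months).
r = pearson_r_from_sums(5, 55, 21, 609, 95, 226)
print(round(r, 2))
```

The numerator is −25 and the denominator is the square root of 680, so r evaluates to about −0.96.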
Example 4-5
Find the Pearsonian correlation coefficient from the following series of marks obtained by 10
X: 45 70 65 30 90 40 50 75 85 60
Y: 35 90 70 40 95 40 60 80 80 50
Solution:
Calculations for Coefficient of Correlation
{Using Eq. (4.7)}
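X     Y     U     V     U²    V²    UV
45    35    -3    -6    9     36    18
70    90    2     5     4     25    10
65    70    1     1     1     1     1
30    40    -6    -5    36    25    30
90    95    6     6     36    36    36
40    40    -4    -5    16    25    20
50    60    -2    -1    4     1     2
75    80    3     3     9     9     9
85    80    5     3     25    9     15
60    50    0     -3    0     9     0
ΣU = 2, ΣV = -2, ΣU² = 140, ΣV² = 176, ΣUV = 141

$U = \frac{X - 60}{5}$ and $V = \frac{Y - 65}{5}$

$r_{xy} = r_{uv} = \frac{N\sum UV - \sum U \sum V}{\sqrt{N\sum U^2 - (\sum U)^2}\,\sqrt{N\sum V^2 - (\sum V)^2}}$

$= \frac{10 \times 141 - 2 \times (-2)}{\sqrt{10 \times 140 - 2 \times 2}\,\sqrt{10 \times 176 - (-2) \times (-2)}}$

$= \frac{1410 + 4}{\sqrt{1400 - 4}\,\sqrt{1760 - 4}}$

$= \frac{1414}{\sqrt{2451376}}$

= 0.9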
So there is a high degree of positive correlation between marks obtained in Mathematics and
in Statistics.
$PE(r) = 0.6745\,\frac{1 - r^2}{\sqrt{N}}$

$PE(r) = 0.6745\,\frac{1 - (0.9)^2}{\sqrt{10}}$

PE(r) = 0.0405
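Example 4-5 relies on the fact that a change of origin and scale leaves r unchanged. The sketch below checks this for the marks data and then evaluates the probable error with r rounded to 0.9, as in the worked solution; the function names are illustrative choices.

```python
import math


def pearson_r(xs, ys):
    """Pearsonian correlation coefficient computed from raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def probable_error(r, n):
    """Probable error of r for a sample of n pairs."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


x = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
y = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]
u = [(v - 60) / 5 for v in x]  # change of origin and scale for X
w = [(v - 65) / 5 for v in y]  # change of origin and scale for Y

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, w)
pe = probable_error(0.9, 10)
print(round(r_xy, 3), round(r_uv, 3), round(pe, 4))
```

Both coefficients agree, which is why the smaller U and V values can be used to keep the arithmetic manageable.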
Sometimes we come across statistical series in which the variables under consideration are
not capable of quantitative measurement but can be arranged in serial order. This happens when
we are dealing with qualitative characteristics (attributes) such as honesty, beauty, character,
morality, etc., which cannot be measured quantitatively but can be arranged serially. In such
situations Karl Pearson’s coefficient of correlation cannot be used as such. Charles Edward
Spearman, a British Psychologist, developed a formula in 1904, which consists in obtaining the
correlation coefficient between the ranks of N individuals in the two attributes under study.
Suppose we want to find if two characteristics A, say, intelligence and B, say, beauty are related
or not. Both the characteristics are incapable of quantitative measurements but we can arrange
a group of N individuals in order of merit (ranks) w.r.t. proficiency in the two characteristics.
Let the random variables X and Y denote the ranks of the individuals in the characteristics A
and B respectively. If we assume that there is no tie, i.e., if no two individuals get the same
rank in a characteristic then, obviously, X and Y assume numerical values ranging from 1 to N.
The Pearsonian correlation coefficient between the ranks X and Y is called the rank correlation
Spearman’s rank correlation coefficient, usually denoted by ρ(Rho) is given by the equation
$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)}$ …………(4.8)
Where d is the difference between the pair of ranks of the same individual in the two
Example 4-6
Ten entries are submitted for a competition. Three judges study each entry and list the ten in
Entry: A B C D E F G H I J
Judge J1: 9 3 7 5 1 6 2 4 10 8
Judge J2: 9 1 10 4 3 8 5 2 7 6
Judge J3: 6 3 8 7 2 4 1 5 9 10
Calculate the appropriate rank correlation to help you answer the following questions:
$\rho$(J1 & J2) $= 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 48}{10(10^2 - 1)} = 1 - \frac{288}{990}$ = 1 – 0.29 = +0.71

$\rho$(J1 & J3) $= 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 26}{10(10^2 - 1)} = 1 - \frac{156}{990}$ = 1 – 0.1576 = +0.8424

$\rho$(J2 & J3) $= 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 88}{10(10^2 - 1)} = 1 - \frac{528}{990}$ = 1 – 0.53 = +0.47
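The three coefficients in Example 4-6 can be recomputed from the judges' rankings. This is a minimal sketch that assumes there are no tied ranks; the function name `spearman_rho` is an illustrative choice.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for untied ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings of entries A to J by the three judges.
j1 = [9, 3, 7, 5, 1, 6, 2, 4, 10, 8]
j2 = [9, 1, 10, 4, 3, 8, 5, 2, 7, 6]
j3 = [6, 3, 8, 7, 2, 4, 1, 5, 9, 10]

rho_12 = spearman_rho(j1, j2)
rho_13 = spearman_rho(j1, j3)
rho_23 = spearman_rho(j2, j3)
print(round(rho_12, 2), round(rho_13, 2), round(rho_23, 2))
```

The pair with the largest coefficient, judges J1 and J3, shows the closest agreement in ranking.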
Spearman’s rank correlation Eq.(4.8) can also be used even if we are dealing with variables,
which are measured quantitatively, i.e. when the actual data but not the ranks relating to two
variables are given. In such a case we shall have to convert the data into ranks. The highest
(or the smallest) observation is given the rank 1. The next highest (or the next lowest)
observation is given rank 2 and so on. It is immaterial in which way (descending or ascending)
the ranks are assigned. However, the same approach should be followed for all the variables
under consideration.
Example 4-7
Calculate the rank coefficient of correlation from the following data:
X: 75 88 95 70 60 80 81 50
Y: 120 134 150 115 110 140 142 100
Solution:
Calculations for Coefficient of Rank Correlation
{Using Eq.(4.8)}
X      Rank RX    Y      Rank RY    d = RX - RY    d²
75     5          120    5           0             0
88     2          134    4          -2             4
95     1          150    1           0             0
70     6          115    6           0             0
60     7          110    7           0             0
80     4          140    3          +1             1
81     3          142    2          +1             1
50     8          100    8           0             0
                                                   Σd² = 6
$$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 6}{8(8^2 - 1)} = 1 - \frac{36}{504} = 1 - 0.07 = +0.93$$
Hence, there is a high degree of positive correlation between X and Y
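The conversion of raw observations into ranks, followed by Eq. (4.8), can be sketched as below. The helper `ranks_descending` is a hypothetical name introduced here; it gives rank 1 to the largest value and assumes that no value is repeated, as in Example 4-7.

```python
# Sketch: rank the raw data of Example 4-7, then apply Eq. (4.8).
def ranks_descending(values):
    """Rank 1 for the largest value; assumes no repeated values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


x = [75, 88, 95, 70, 60, 80, 81, 50]
y = [120, 134, 150, 115, 110, 140, 142, 100]

rx = ranks_descending(x)
ry = ranks_descending(y)

# Sum of squared rank differences and the rank correlation coefficient.
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
n = len(x)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(rx, ry, sum_d2, round(rho, 2))
```

Ranking both series in ascending order instead would leave every d, and hence ρ, unchanged.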
Repeated Ranks
In case of attributes if there is a tie i.e., if any two or more individuals are placed together in
any classification w.r.t. an attribute or if in case of variable data there is more than one item
with the same value in either or both the series then Spearman’s Eq.(4.8) for calculating the
rank correlation coefficient breaks down, since in this case the variables X [the ranks of
individuals in characteristic A (1st series)] and Y [the ranks of individuals in characteristic B
In this case common ranks are assigned to the repeated items. These common ranks are the
arithmetic mean of the ranks, which these items would have got if they were different from
each other and the next item will get the rank next to the rank used in computing the common
rank. For example, suppose an item is repeated at rank 4. Then the common rank to be assigned
to each item is (4+5)/2, i.e., 4.5 which is the average of 4 and 5, the ranks which these
observations would have assumed if they were different. The next item will be assigned the
rank 6. If an item is repeated thrice at rank 7, then the common rank to be assigned to each
value will be (7+8+9)/3, i.e., 8 which is the arithmetic mean of 7,8 and 9 viz., the ranks these
observations would have got if they were different from each other. The next rank to be
If only a small proportion of the ranks are tied, this technique may be applied together with
$$\frac{m(m^2 - 1)}{12} \qquad (4.8a)$$
to $\sum d^2$; where m is the number of times an item is repeated. This correction factor is to be
Example 4-8
For a certain joint stock company, the prices of preference shares (X) and debentures (Y) are
given below:
X: 73.2 85.8 78.9 75.8 77.2 81.2 83.8
Y: 97.8 99.2 98.8 98.3 98.3 96.7 97.1
Use the method of rank correlation to determine the relationship between preference prices
and debentures prices.
Solution:
Calculations for Coefficient of Rank Correlation
{Using Eq. (4.8) and (4.8a)}
X       Y       Rank of X (XR)    Rank of Y (YR)    d = XR – YR    d²
73.2    97.8    7                 5                  2              4
85.8    99.2    1                 1                  0              0
78.9    98.8    4                 2                  2              4
75.8    98.3    6                 3.5                2.5            6.25
77.2    98.3    5                 3.5                1.5            2.25
81.2    96.7    3                 7                 -4              16
83.8    97.1    2                 6                 -4              16
                                                    Σd = 0         Σd² = 48.50
In this case, because the value Y = 98.3 is repeated, each occurrence is given the average of
the two ranks (3 and 4) that would have been allotted had the values been different, i.e. 3.5.
We also have to apply the correction factor $\frac{m(m^2 - 1)}{12}$ to $\sum d^2$, where m is the
number of times the value is repeated; here m = 2.
$$\rho = 1 - \frac{6\left[\sum d^2 + \frac{m(m^2 - 1)}{12}\right]}{N(N^2 - 1)} = 1 - \frac{6\left[48.5 + \frac{2(4 - 1)}{12}\right]}{7(7^2 - 1)} = 1 - \frac{6 \times 49}{7 \times 48} = 1 - 0.875 = 0.125$$
Hence, there is a very low degree of positive correlation, probably no correlation,
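The treatment of repeated ranks can be sketched in Python as follows. The helpers `average_ranks` and `tie_correction` are hypothetical names introduced for illustration: the first assigns tied values the mean of the ranks they occupy, the second adds m(m² − 1)/12 for every value repeated m times. The data are those of Example 4-8.

```python
# Sketch of rank correlation with tied values (Eq. 4.8 with correction 4.8a).
from collections import Counter


def average_ranks(values):
    """Descending ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = order.index(v) + 1          # first rank occupied by v
        count = order.count(v)              # number of repetitions, m
        ranks[v] = first + (count - 1) / 2  # mean of first .. first+count-1
    return [ranks[v] for v in values]


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    counts = Counter(values)
    return sum(m * (m ** 2 - 1) / 12 for m in counts.values() if m > 1)


x = [73.2, 85.8, 78.9, 75.8, 77.2, 81.2, 83.8]
y = [97.8, 99.2, 98.8, 98.3, 98.3, 96.7, 97.1]

rx, ry = average_ranks(x), average_ranks(y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
correction = tie_correction(x) + tie_correction(y)

n = len(x)
rho = 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))
print(sum_d2, correction, rho)
```

When several different values are repeated, one correction term is added per repeated value, which is why the sketch sums over both series.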
Remarks on Spearman’s Rank Correlation Coefficient
correlation coefficient, r, between the ranks, it can be interpreted in the same way
3. Karl Pearson’s correlation coefficient assumes that the parent population from
which sample observations are drawn is normal. If this assumption is violated then
is such a distribution-free measure, since no strict assumptions are made about the
formula. The values obtained by the two formulae, viz. Pearsonian r and Spearman’s ρ,
are generally different. The difference arises due to the fact that when ranking is
used instead of the full set of observations, there is always some loss of information.
Unless many ties exist, the coefficient of rank correlation should be only slightly
measured quantitatively but can be arranged serially. It can also be used where
6. Spearman’s formula has its limitations also. It is not practicable in the case of
bivariate frequency distribution. For N >30, this formula should not be used unless
4.3.5 CONCURRENT DEVIATION METHOD
This is a casual method of determining the correlation between two series when we are not very
serious about its precision. This is based on the signs of the deviations (i.e. the
direction of the change) of the values of the variable from its preceding value and does not take
into account the exact magnitude of the values of the variables. Thus we put a plus (+) sign,
minus (-) sign or equality (=) sign for the deviation if the value of the variable is greater than,
less than or equal to the preceding value respectively. The deviations in the values of two
variables are said to be concurrent if they have the same sign (either both deviations are positive
or both are negative or both are equal). The formula used for computing correlation coefficient
$$r_c = \pm\sqrt{\pm\frac{2c - N}{N}} \qquad (4.9)$$
Where c is the number of pairs of concurrent deviations and N is the number of pairs of
deviations. If (2c-N) is positive, we take positive sign in and outside the square root in Eq. (4.9)
and if (2c-N) is negative, we take negative sign in and outside the square root in Eq. (4.9).
Remarks: (i) It should be clearly noted that here N is not the number of pairs of observations
but it is the number of pairs of deviations and as such it is one less than the number of pairs of
observations.
“If the short time fluctuations of the time series are positively correlated or in other
words, if their deviations are concurrent, their curves would move in the same direction
and would indicate positive correlation between them”
Example 4-9
Calculate coefficient of correlation by the concurrent deviation method
Supply: 112 125 126 118 118 121 125 125 131 135
Price: 106 102 102 104 98 96 97 97 95 90
Solution:
Calculations for Coefficient of Concurrent Deviations
{Using Eq. (4.9)}
Supply (X)   Sign of deviation from   Price (Y)   Sign of deviation from   Concurrent
             preceding value (X)                  preceding value (Y)      deviations
112                                   106
125          +                        102         -
126          +                        102         =
118          -                        104         +
118          =                        98          -
121          +                        96          -
125          +                        97          +                        + (c)
125          =                        97          =                        = (c)
131          +                        95          -
135          +                        90          -
We have
Number of pairs of deviations, N =10 – 1 = 9
c = Number of concurrent deviations
= Number of deviations having like signs
=2
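Coefficient of correlation by the method of concurrent deviations is given by:
$$r_c = \pm\sqrt{\pm\frac{2c - N}{N}} = \pm\sqrt{\pm\frac{2 \times 2 - 9}{9}} = \pm\sqrt{\pm\left(\frac{-5}{9}\right)}$$
Since 2c – N = -5 (negative), we take negative sign inside and outside the square root
$$r_c = -\sqrt{-\left(\frac{-5}{9}\right)} = -\sqrt{0.5556} \approx -0.75$$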
Hence there is a fairly good degree of negative correlation between supply and price.
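The sign-counting procedure of Eq. (4.9) can be sketched as below. The function names `sign` and `concurrent_deviation` are illustrative assumptions; the sign convention (the same sign is used inside and outside the square root) follows the rule stated with Eq. (4.9), and the data are those of Example 4-9.

```python
# Sketch of the concurrent deviation method, Eq. (4.9).
import math


def sign(current, previous):
    """+1, -1 or 0 according to the direction of change."""
    return (current > previous) - (current < previous)


def concurrent_deviation(x, y):
    """Return (c, N, r_c) for two equally long series."""
    sx = [sign(cur, prev) for prev, cur in zip(x, x[1:])]
    sy = [sign(cur, prev) for prev, cur in zip(y, y[1:])]
    n = len(sx)                                  # pairs of deviations
    c = sum(1 for p, q in zip(sx, sy) if p == q)  # like signs, incl. '='
    ratio = (2 * c - n) / n
    # Same sign inside and outside the root, so the root is always real.
    r_c = math.copysign(math.sqrt(abs(ratio)), ratio)
    return c, n, r_c


supply = [112, 125, 126, 118, 118, 121, 125, 125, 131, 135]
price = [106, 102, 102, 104, 98, 96, 97, 97, 95, 90]

c, n, r_c = concurrent_deviation(supply, price)
print(c, n, round(r_c, 4))
```

Because only the direction of change is used, any pair of series with the same pattern of signs gives the same coefficient, whatever the size of the changes.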
As mentioned earlier, correlation analysis is a statistical tool, which should be properly used so
resulting in misleading conclusions. We give below some errors frequently made in the use of
correlation analysis:
reasonably sure that one variable is the cause while the other is the effect. Let us take
an example.
Suppose that we study the performance of students in their graduate examination and
their earnings after, say, three years of their graduation. We may find that these two
variables are highly and positively related. At the same time, we must not forget that
both the variables might have been influenced by some other factors such as quality of
process and so forth. If the data on these factors are available, then it is worthwhile to
that correlation explains 70 percent of the total variation in Y. The error can be seen
determination r² will be 0.49. This means that only 49 percent of the total variation in
Y is explained.
causal relationship, that is, the percentage of the change in one variable is due to the
3. Another mistake in the interpretation of the coefficient of correlation occurs when one
concludes a positive or negative relationship even though the two variables are actually
unrelated. For example, the age of students and their score in the examination have no
relation with each other. The two variables may show similar movements but there does
To sum up, one has to be extremely careful while interpreting the coefficient of correlation.
Before one concludes a causal relationship, one has to consider other relevant factors that might
have any influence on the dependent variable or on both the variables. Such an approach will
avoid many of the pitfalls in the interpretation of the coefficient of correlation. It has been
rightly said that the coefficient of correlation is not only one of the most widely used, but also
2. Explain the meaning and significance of the concept of correlation. Does correlation
always signify causal relationships between two variables? Explain with illustration
(a) Over a period of time there has been an increased financial aid to under developed
countries and also an increase in comedy act television shows. The correlation is
almost perfect.
(b) The correlation between salaries of school teachers and amount of liquor sold
4. What is a scatter diagram? How does it help in studying correlation between two
6. Draw a scatter diagram from the data given below and interpret it.
X: 10 20 30 40 50 60 70 80
Y: 32 20 24 36 40 28 38 44
X: 39 65 62 90 82 75 25 98 36 78
Y: 47 53 58 86 62 68 60 91 51 84
people at random by asking the number of advertisements read or seen in a week (X)
X: 5 10 4 0 2 7 3 6
Y: 10 12 5 2 1 3 4 8
Calculate the correlation coefficient and comment on the result.
9. Calculate coefficient of correlation between X and Y series from the following data
X: 78 89 96 69 59 79 68 61
Y: 125 137 156 112 107 136 123 108
10. In two sets of variables X and Y, with 50 observations each, the following data are
observed:
X = 10, SD of X = 3
Y = 6, SD of Y = 2 rxy = 0.3
However, on subsequent verification, it was found that one value of X (=10) and one
value of Y (= 6) were inaccurate and hence weeded out with the remaining 49 pairs of
11. Calculate coefficient of correlation r between the marks in statistics (X) and
X: 52 74 93 55 41 23 92 64 40 71
Y: 45 80 63 60 35 40 70 58 43 64
12. The coefficient of correlation between two variables X and Y is 0.48. The covariance
13. Ten entries in a painting competition were ranked by two judges as shown below:
Entry: A B C D E F G H I J
Judge I: 5 2 3 4 1 6 8 7 10 9
Judge II: 4 5 2 1 6 7 10 9 3 8
14. Calculate Spearman’s rank correlation coefficient between advertisement cost (X) and
X: 39 65 62 90 82 75 25 98 36 78
Y: 47 53 58 86 62 68 60 91 51 84
15. An examination of eight applicants for a clerical post was taken by a firm. From the
marks obtained by the applicants in the Accountancy (X) and Statistics (Y) paper,
Applicant: A B C D E F G H
X: 15 20 28 12 40 60 20 80
Y: 40 30 50 30 20 10 30 60
16. Calculate the coefficient of concurrent deviation from the following data:
17. Obtain a suitable measure of correlation from the following data regarding changes in
Month: A M J J A S O N D
A: +4 +3 +2 -1 -3 +4 -5 +1 +2
B: -2 +5 +3 -2 -1 -3 +4 -1 -3
18. The cross-classification table shows the marks obtained by 105 students in the
Marks in Statistics
50-59 4 6 8 7 25
60-69 - 10 12 13 35
70-79 16 9 20 - 45
80-89 - - - - -
Total 20 25 40 20 105
REGRESSION ANALYSIS
...if we find any association between two or more variables, we might be interested in
estimating the value of one variable for known value(s) of another variable(s)
5.1 INTRODUCTION
In business, several times it becomes necessary to have some forecast so that the management
can take a decision regarding a product or a particular course of action. In order to make a
forecast, one has to ascertain some relationship between two or more variables relevant to a
particular situation. For example, a company is interested to know how far the demand for
television sets will increase in the next five years, keeping in mind the growth of population
in a certain town. Here, it clearly assumes that the increase in population will lead to an
increased demand for television sets. Thus, to determine the nature and extent of relationship
In the preceding lesson, we studied in some depth linear correlation between two variables.
Here we have a similar concern, the association between variables, except that we develop it
further in two respects. First, we learn how to build statistical models of relationships between
the variables to have a better understanding of their features. Second, we extend the models to
For this purpose, we have to use the technique - regression analysis - which forms the subject-
In 1889, Sir Francis Galton, a cousin of Charles Darwin, published a work on heredity,
“Natural Inheritance”. He reported his discovery that sizes of seeds of sweet pea plants
appeared to “revert”, or “regress”, to the mean size in successive generations. He also reported
results of a study of the relationship between heights of fathers and heights of their sons. A
straight line was fitted to the data pairs: height of father versus height of son. Here, too, he found
a “regression to mediocrity”. The heights of the sons represented a movement away from their
fathers, towards the average height. We credit Sir Francis Galton with the idea of statistical regression.
While most applications of regression analysis may have little to do with the
now refers to the statistical technique of modeling the relationship between two or
more variables. In a general sense, regression analysis means the estimation or prediction
of the unknown value of one variable from the known value(s) of the other variable(s).
It is one of the most important and widely used statistical techniques in almost all
In this lesson we will focus only on simple regression – linear regression involving only two
variables: a dependent variable and an independent variable. Regression analysis for studying
Simple regression involves only two variables; one variable is predicted by another variable.
The variable to be predicted is called the dependent variable. The predictor is called the
independent variable, or explanatory variable. For example, when we are trying to predict
the demand for television sets on the basis of population growth, we are using the demand for
television sets as the dependent variable and the population growth as the independent or
predictor variable.
The decision as to which variable is which sometimes causes problems. Often the choice is
obvious, as in case of demand for television sets and population growth because it would make
no sense to suggest that population growth could be dependent on TV demand! The population
growth has to be the independent variable and the TV demand the dependent variable.
If we are unsure, here are some points that might be of use:
➢ if we have control over one of the variables then that is the independent. For example,
a manufacturer can decide how much to spend on advertising and expect his sales to
➢ if there is any lapse of time between the two variables being measured, then the latter
must depend upon the former; it cannot be the other way round
➢ if we want to predict the values of one variable from our knowledge of the other
The task of bringing out linear relationship consists of developing methods of fitting a
straight line, or a regression line as is often called, to the data on two variables.
The line of regression is the graphical or relational representation of the best estimate of one
variable for any given value of the other variable. The nomenclature of the line depends on the
independent and dependent variables. If X and Y are two variables whose relationship is to
be indicated, a line that gives the best estimate of Y for any value of X is called the Regression line
For purposes of illustration as to how a straight line relationship is obtained, consider the
sample paired data on sales of each of the N = 5 months of a year and the marketing expenditure
Table 5-1
Month     Sales (Rs lac)    Marketing Expenditure (Rs thousands)
          Y                 X
April     14                10
May       17                12
June      23                15
July      21                20
August    25                23
Let Y, the sales, be the dependent variable and X, the marketing expenditure, the independent
variable. We note that for each value of independent variable X, there is a specific value of
the dependent variable Y, so that each value of X and Y can be seen as paired observations.
between the two variables is linear, that is, the one which is best explained by a straight line. A
good way of doing this is to plot the data on X and Y on a graph so as to yield a scatter diagram,
as may be seen in Figure 5-1. A careful reading of the scatter diagram reveals that:
➢ the overall tendency of the points is to move upward, so the relationship is positive
➢ the general course of movement of the various points on the diagram can be best
➢ there is a high degree of correlation between the variables, as the points are very close
to each other
Figure 5-1 Scatter Diagram with Line of Best Fit
If the movement of various points on the scatter diagram is best described by a straight line,
the next step is to fit a straight line on the scatter diagram. It has to be so fitted that on the whole
it lies as close as possible to every point on the scatter diagram. The necessary
requirement for meeting this condition being that the sum of the squares of the vertical
As shown in Figure 5-1, if d1, d2, ..., dN are the vertical deviations of observed Y values from
$$d_1^2 + d_2^2 + \dots + d_N^2 = \sum_{j=1}^{N} d_j^2$$
is the minimum. The deviations dj have to be squared to avoid negative deviations canceling
out the positive deviations. Since a straight line so fitted best approximates all the points on the
scatter diagram, it is better known as the best approximating line or the line of best fit. A line
Free hand drawing is the simplest method of fitting a straight line. After a careful
inspection of the movement and spread of various points on the scatter diagram, a
straight line is drawn through these points by using a transparent ruler such that on the
whole it is closest to every point. A straight line so drawn is particularly useful when
Whereas the use of free hand drawing may yield a line nearest to the line of best fit, the major
drawback is that the slope of the line so drawn varies from person to person because of the
influence of subjectivity. Consequently, the values of the dependent variable estimated on the
basis of such a line may not be as accurate and precise as those based on the line of best fit.
The least square method of fitting a line of best fit requires minimizing the sum of the
squares of vertical deviations of each observed Y value from the fitted line. These deviations,
such as d1 and d3, are shown in Figure 5-1 and are given by Y - Yc, where Y is the observed
value and Yc the corresponding computed value given by the fitted line
$$Y_c = a + bX \qquad (5.1)$$
The straight line relationship in Eq.(5.1), is stated in terms of two constants a and b
➢ The constant a is the Y-intercept; it indicates the height on the vertical axis from
where the straight line originates, representing the value of Y when X is zero.
➢ Constant b is a measure of the slope of the straight line; it shows the absolute change in
Y for a unit change in X. As the slope may be positive or negative, it indicates the nature
coefficient of Y on X.
Since a straight line is completely defined by its intercept a and slope b, the task of fitting the
same reduces only to the computation of the values of these two constants. Once these two
values are known, the computed Yc values against each value of X can be easily obtained by
In the method of least squares the values of a and b are obtained by solving simultaneously
$$\sum Y = aN + b\sum X \qquad (5.2)$$
$$\sum XY = a\sum X + b\sum X^2 \qquad (5.2)$$
The values of the expressions $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ can be obtained from the given
observations and then substituted in the above equations to obtain the values of a and b.
Since solving the two normal equations simultaneously for a and b may quite often be
cumbersome and time consuming, the two values can be directly obtained as
$$a = \bar{Y} - b\bar{X} \qquad (5.3)$$
and
$$b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2} \qquad (5.4)$$
Note: Eq. (5.3) is obtained simply by dividing both sides of the first of Eqs. (5.2) by N and
Eq. (5.4) is obtained by substituting $(\bar{Y} - b\bar{X})$ in place of a in the second of Eqs. (5.2)
or
$$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - \left(\sum X\right)^2} \qquad (5.5)$$
and
$$b = \frac{\bar{Y} - a}{\bar{X}} \qquad (5.6)$$
Note: Eq. (5.5) is obtained by substituting $\frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2}$ for b in Eq. (5.3) and Eq.
Table 5-2
Computation of a and b
Y        X        XY       X²       Y²
14       10       140      100      196
17       12       204      144      289
23       15       345      225      529
21       20       420      400      441
25       23       575      529      625
Totals: ΣY = 100, ΣX = 80, ΣXY = 1684, ΣX² = 1398, ΣY² = 2080
$$a = \frac{100 \times 1398 - 80 \times 1684}{5 \times 1398 - (80)^2} = \frac{139800 - 134720}{6990 - 6400} = \frac{5080}{590} = 8.6101695$$
and
$$b = \frac{5 \times 1684 - 80 \times 100}{5 \times 1398 - (80)^2} = \frac{8420 - 8000}{6990 - 6400} = \frac{420}{590} = 0.7118644$$
Figure 5-2 Regression Line of Y on X
Then, to fit the line of best fit on the scatter diagram, only two computed Yc values are
needed. These can be easily obtained by substituting any two values of X in Eq. (5.1a). When
these are plotted on the diagram against their corresponding values of X, we get two points;
joining them by a straight line gives us the required line of best fit, as shown
in Figure 5-2.
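The least squares computation of a and b can be sketched in Python using Eqs. (5.4) and (5.3). The function name `fit_y_on_x` is a hypothetical helper introduced here; the data are the five monthly observations of Table 5-1.

```python
# Sketch of the least squares line of Y on X, Eqs. (5.3) and (5.4).
def fit_y_on_x(x, y):
    """Return intercept a and slope b of Yc = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Eq. (5.4)
    a = sum_y / n - b * sum_x / n                                  # Eq. (5.3)
    return a, b


x = [10, 12, 15, 20, 23]   # marketing expenditure (Rs thousands)
y = [14, 17, 23, 21, 25]   # sales (Rs lac)

a, b = fit_y_on_x(x, y)

# Two computed points are enough to draw the fitted line.
line_points = [(xv, a + b * xv) for xv in (min(x), max(x))]
print(round(a, 7), round(b, 7), line_points)
```

Any two distinct X values would serve for drawing the line; the smallest and largest observed X are used here only so that the segment spans the data.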
We can have some important relationships for data analysis, involving other measures such as
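$$Y_c = (\bar{Y} - b\bar{X}) + bX$$
$$b = \frac{\dfrac{\sum XY}{N} - \dfrac{\sum X}{N}\dfrac{\sum Y}{N}}{\dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2}$$
or
$$b = \frac{\dfrac{\sum XY}{N} - \bar{X}\bar{Y}}{S_x^2}$$
or
$$b = \frac{\mathrm{Cov}(X, Y)}{S_x^2} \qquad (5.8)$$
$$r_{xy} = \frac{\mathrm{Cov}(X, Y)}{S_x S_y}$$
or
$$\mathrm{Cov}(X, Y) = r_{xy} S_x S_y$$
$$b = r_{xy}\frac{S_x S_y}{S_x^2}$$
$$b = r_{xy}\frac{S_y}{S_x} \qquad (5.9)$$
$$Y_c - \bar{Y} = r_{xy}\frac{S_y}{S_x}(X - \bar{X})$$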
The main objective of regression analysis is to know the nature of relationship between two
variables and to use it for predicting the most likely value of the dependent variable
corresponding to a given, known value of the independent variable. This can be done by
substituting in Eq.(5.1a) any known value of X corresponding to which the most likely estimate
of Y is to be found.
Yc = 8.61 + 0.71(15)
= 8.61 + 10.65
= 19.26
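The equivalence of Eqs. (5.8) and (5.9), and the prediction step, can be checked numerically with the sketch below. All variable names are illustrative assumptions. The variances and covariance use the divisor N, as in the notes. With unrounded coefficients the estimate at X = 15 comes to about 19.29; the value 19.26 in the text results from rounding a and b to 8.61 and 0.71 before substituting.

```python
# Sketch: slope from the covariance (5.8) and from r (5.9), then a prediction.
import math

x = [10, 12, 15, 20, 23]   # marketing expenditure (Rs thousands)
y = [14, 17, 23, 21, 25]   # sales (Rs lac)
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
cov_xy = sum(p * q for p, q in zip(x, y)) / n - mean_x * mean_y
var_x = sum(p * p for p in x) / n - mean_x ** 2
var_y = sum(q * q for q in y) / n - mean_y ** 2

r_xy = cov_xy / math.sqrt(var_x * var_y)

b_from_cov = cov_xy / var_x                               # Eq. (5.8)
b_from_r = r_xy * math.sqrt(var_y) / math.sqrt(var_x)     # Eq. (5.9)

a = mean_y - b_from_cov * mean_x
y_at_15 = a + b_from_cov * 15                             # estimate of Y at X = 15
print(round(b_from_cov, 7), round(b_from_r, 7), round(y_at_15, 2))
```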
It may be appreciated that an estimate of Y derived from a regression equation will not be
exactly the same as the Y value which may actually be observed. The difference between
estimated Yc values and the corresponding observed Y values will depend on the extent of
The closer the various paired sample points (Y, X) are clustered around the line of best fit, the
smaller the difference between the estimated Yc and observed Y values, and vice-versa. On the
whole, the lesser the scatter of the various points around, and the lesser the vertical distance by
which these deviate from the line of best fit, the more likely it is that an estimated Yc value is
The estimated Yc values will coincide with the observed Y values only when all the points on the
scatter diagram fall in a straight line. If this were to be so, the sales for a given marketing
expenditure could have been estimated with 100 percent accuracy. But such a situation is too
rare to obtain. Since some of the points must lie above and some below the straight line, perfect
prediction is practically non-existent in the case of most business and economic situations.
This means that the estimated values of one variable based on the known values of the other
variable are always bound to differ. The smaller the difference, the greater the precision of
the estimate, and vice-versa. Accordingly, the preciseness of an estimate can be obtained only
through a measure of the magnitude of error in the estimates, called the error of estimate.
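A measure of the error of estimate is given by the standard error of estimate of Y on X, denoted
$$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} \qquad (5.11)$$
Syx measures the average absolute amount by which observed Y values depart from the
Computation of Syx becomes a little cumbersome where the number of observations N is large.
$$S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N}} \qquad (5.12)$$
By substituting the values of $\sum Y^2$, $\sum Y$ and $\sum XY$ from Table 5-2, and the calculated
values of a and b (rounded to a = 8.61 and b = 0.71), we have
$$S_{yx} = \sqrt{\frac{2080 - 8.61 \times 100 - 0.71 \times 1684}{5}} = \sqrt{\frac{23.36}{5}} = \sqrt{4.67} = 2.16$$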
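Both forms of the standard error of estimate, Eqs. (5.11) and (5.12), can be compared with the sketch below; the variable names are illustrative assumptions. The two formulas agree exactly only when the unrounded least squares values of a and b are used, in which case Syx is about 2.01. The figure of 2.16 in the text follows from substituting the rounded values a = 8.61 and b = 0.71 into Eq. (5.12), which is sensitive to rounding because it subtracts large, nearly equal quantities.

```python
# Sketch: standard error of estimate by definition (5.11) and shortcut (5.12).
import math

x = [10, 12, 15, 20, 23]
y = [14, 17, 23, 21, 25]
n = len(x)

a, b = 5080 / 590, 420 / 590   # unrounded coefficients from Table 5-2

# Eq. (5.11): root mean squared vertical deviation from the fitted line.
residual_ss = sum((yv - (a + b * xv)) ** 2 for xv, yv in zip(x, y))
s_yx_definition = math.sqrt(residual_ss / n)

# Eq. (5.12): the same quantity from the column totals.
sum_y = sum(y)
sum_y2 = sum(q * q for q in y)
sum_xy = sum(p * q for p, q in zip(x, y))
s_yx_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / n)

# Eq. (5.12) again, with a and b rounded to two decimals as in the text.
s_yx_rounded = math.sqrt((sum_y2 - 8.61 * sum_y - 0.71 * sum_xy) / n)

print(round(s_yx_definition, 4), round(s_yx_shortcut, 4), round(s_yx_rounded, 2))
```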
Interpretations of Syx
A careful observation of how the standard error of estimate is computed reveals the following:
1. Syx is a concept statistically parallel to the standard deviation Sy . The only difference
between the two being that the standard deviation measures the dispersion around the
mean; the standard error of estimate measures the dispersion around the regression line.
Similar to the property of arithmetic mean, the sum of the deviations of different Y
2. Syx tells us the amount by which the estimated Yc values will, on an average, deviate
from the observed Y values. Hence it is an estimate of the average amount of error in
the estimated Yc values. The actual error (the residual of Y and Yc) may, however, be
smaller or larger than the average error. Theoretically, these errors follow a normal
distribution. Thus, assuming that n ≥ 30, Yc ± 1.Syx means that 68.27% of the estimates
based on the regression equation will be within 1.Syx Similarly, Yc ± 2.Syx means that
thousand being Rs 19.26 lac, one may like to know how good this estimate is. Since Syx
is estimated to be Rs 2.16 lac, it means there are about 68 chances (68.27) out of 100
that this estimate is in error by not more than Rs 2.16 lac above or below Rs
19.26 lac. That is, there are 68% chances that actual sales would fall between (19.26 -
3. Since Syx measures the closeness of the observed Y values and the estimated Yc values,
it also serves as a measure of the reliability of the estimate. Greater the closeness
between the observed and estimated values of Y, the lesser the error and, consequently,
4. Standard error of estimate Syx can also be seen as a measure of correlation insofar as it
expresses the degree of closeness of scatter of observed Y values about the regression
line. The closer the observed Y values scattered around the regression line, the higher
same units of measurement as the data on the dependent variable. This creates problems
is mainly due to this limitation that the standard error of estimate is not generally used
So far we have considered the regression of Y on X, in the sense that Y was in the role of
dependent and X in the role of an independent variable. In their reverse position, such that X
is now the dependent and Y the independent variable, we fit a line of regression of X on Y.
Where Xc denotes the computed values of X against the corresponding values of Y. a’ is the
$$a' = \bar{X} - b'\bar{Y} \qquad (5.15)$$
and
$$b' = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - \left(\sum Y\right)^2} \qquad (5.16)$$
or
$$a' = \frac{\sum X \sum Y^2 - \sum Y \sum XY}{N\sum Y^2 - \left(\sum Y\right)^2} \qquad (5.17)$$
and
$$b' = \frac{\bar{X} - a'}{\bar{Y}} \qquad (5.18)$$
$$b' = \frac{\mathrm{Cov}(Y, X)}{S_y^2} \qquad (5.19)$$
$$b' = r_{yx}\frac{S_x}{S_y} \qquad (5.20)$$
$$X_c - \bar{X} = b'(Y - \bar{Y}) \qquad (5.21)$$
$$X_c - \bar{X} = r_{yx}\frac{S_x}{S_y}(Y - \bar{Y}) \qquad (5.22)$$
As before, once the values of a’ and b’ have been found, their substitution in Eq.(5.13) will
$$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} \qquad (5.23)$$
or
$$S_{xy} = \sqrt{\frac{\sum X^2 - a'\sum X - b'\sum XY}{N}} \qquad (5.24)$$
For example, if we want to estimate the marketing expenditure to achieve a sale target of Rs
Xc = a’ + b’Y
So using Eqs. (5.17) and (5.16), and substituting the values of $\sum X$, $\sum Y^2$, $\sum Y$ and $\sum XY$
from Table 5-2, we have
$$a' = \frac{80 \times 2080 - 100 \times 1684}{5 \times 2080 - (100)^2} = \frac{166400 - 168400}{10400 - 10000} = \frac{-2000}{400} = -5.00$$
and
$$b' = \frac{5 \times 1684 - 80 \times 100}{5 \times 2080 - (100)^2} = \frac{8420 - 8000}{10400 - 10000} = \frac{420}{400} = 1.05$$
Now given that a’ = -5.00 and b’ = 1.05, Regression equation (5.13) takes the form
Xc = -5.00 + 1.05Y
Xc = -5.00 + 1.05 × 40
= -5 + 42
= 37
marketing.
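The regression of X on Y can be sketched in the same way as that of Y on X, with the roles of the two variables interchanged. The function name `fit_x_on_y` is a hypothetical helper; the data are those of Table 5-1 and the sales value Y = 40 is the one substituted in the computation above.

```python
# Sketch of the regression line of X on Y, Eqs. (5.15) and (5.16).
def fit_x_on_y(x, y):
    """Return intercept a' and slope b' of Xc = a' + b'Y."""
    n = len(y)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_y2 = sum(q * q for q in y)
    b_prime = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)  # (5.16)
    a_prime = sum_x / n - b_prime * sum_y / n                            # (5.15)
    return a_prime, b_prime


x = [10, 12, 15, 20, 23]   # marketing expenditure (Rs thousands)
y = [14, 17, 23, 21, 25]   # sales (Rs lac)

a_prime, b_prime = fit_x_on_y(x, y)
x_at_40 = a_prime + b_prime * 40   # expenditure estimated for sales of 40
print(round(a_prime, 2), round(b_prime, 2), round(x_at_40, 2))
```

Note that Y = 40 lies well outside the observed sales range of 14 to 25, so this estimate is an extrapolation of the fitted line.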
the effect on dependent variable if there is a unit change in the independent variable. Since
for a paired data on X and Y variables, there are two regression lines: regression line of Y on X
The following are the important properties of regression coefficients that are helpful in data
analysis
1. Both regression coefficients cannot be numerically greater than 1 at the same time. Either
both coefficients are below 1, or at least one of them is below 1, so
that the square root of the product of the two regression coefficients lies within the limits
±1.
$$r = \pm\sqrt{b \cdot b'} \qquad (5.25)$$
The signs of both the regression coefficients are the same, and so the value of r will
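3. The mean of both the regression coefficients is either equal to or greater than the
$$\frac{b + b'}{2} \geq r$$
$$U = \frac{X - A}{h} \quad \text{and} \quad V = \frac{Y - B}{k}$$
$$r^2 = b \cdot b'$$
$$Y - \bar{Y} = r\frac{S_y}{S_x}(X - \bar{X}) \quad \text{and} \quad X - \bar{X} = r\frac{S_x}{S_y}(Y - \bar{Y})$$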
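We can write the regression coefficients of these lines as
$$b = r\frac{S_y}{S_x} \quad \text{and} \quad b' = r\frac{S_x}{S_y}$$
Measured against the X-axis, the line of Y on X has slope b and the line of X on Y has slope
1/b', so the angle θ between the two lines is given by
$$\tan\theta = \frac{b - \dfrac{1}{b'}}{1 + \dfrac{b}{b'}} = \frac{S_x S_y}{S_x^2 + S_y^2}\left(\frac{r^2 - 1}{r}\right)$$
or
$$\theta = \tan^{-1}\left[\frac{S_x S_y}{S_x^2 + S_y^2}\left(\frac{r^2 - 1}{r}\right)\right] \qquad (5.26)$$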
Figure 5-3 Regression Lines and Coefficient of Correlation
Eq. (5.26) reveals the following:
➢ In case of perfect positive correlation (r = +1) and in case of perfect negative correlation
(r = -1), θ = 0, so the two regression lines will coincide, i.e. we have only one line, see
The farther the two regression lines are from each other, the lesser will be the degree of
correlation, and the nearer the two regression lines, the greater will be the degree of correlation,
➢ If the variables are independent, i.e. r = 0, the lines of regression will cut each other at
Note: Both the regression lines cut each other at the mean value of X and the mean value of Y, i.e. at
$(\bar{X}, \bar{Y})$.
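The properties of the regression coefficients can be verified on the data of Table 5-1 with the sketch below. All variable names are illustrative. One assumption is made explicit: the angle between the lines is computed from their slopes against the X-axis, taken as b for the line of Y on X and 1/b' for the line of X on Y, and its magnitude is compared with the closed form of Eq. (5.26).

```python
# Sketch: r from b and b', the mean property, the meeting point, the angle.
import math

x = [10, 12, 15, 20, 23]
y = [14, 17, 23, 21, 25]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
var_x = sum(v * v for v in x) / n - mean_x ** 2
var_y = sum(v * v for v in y) / n - mean_y ** 2
cov_xy = sum(p * q for p, q in zip(x, y)) / n - mean_x * mean_y

b = cov_xy / var_x          # regression coefficient of Y on X
b_prime = cov_xy / var_y    # regression coefficient of X on Y
a = mean_y - b * mean_x
a_prime = mean_x - b_prime * mean_y

# Eq. (5.25): r carries the common sign of the two coefficients.
r = math.copysign(math.sqrt(b * b_prime), b)
mean_of_coefficients = (b + b_prime) / 2

# Meeting point of Y = a + bX and X = a' + b'Y.
x_star = (a_prime + b_prime * a) / (1 - b * b_prime)
y_star = a + b * x_star

# Angle between the lines: from the slopes, and from Eq. (5.26).
tan_from_slopes = abs((b - 1 / b_prime) / (1 + b / b_prime))
s_x, s_y = math.sqrt(var_x), math.sqrt(var_y)
tan_from_formula = abs(s_x * s_y / (var_x + var_y) * (r ** 2 - 1) / r)
theta_degrees = math.degrees(math.atan(tan_from_slopes))

print(round(r, 4), (x_star, y_star), round(theta_degrees, 2))
```

Here b' = 1.05 exceeds 1 while b is about 0.71, which illustrates that one coefficient may exceed 1 provided the product b·b' does not.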
accounted for by the independent variable. In other words, the coefficient of determination
gives the ratio of the explained variance to the total variance. The coefficient of determination
Coefficient of determination
$$r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}$$
$$r^2 = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2} \qquad (5.27)$$
We can calculate another coefficient K2, known as coefficient of Non-Determination, which
$$K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} \qquad (5.28)$$
$$K^2 = 1 - \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - r^2 \qquad (5.29)$$
The square root of the coefficient of non-determination, i.e. K gives the coefficient of
alienation
$$K = \pm\sqrt{1 - r^2} \qquad (5.30)$$
A simple algebraic operation on Eq. (5.30) brings out some interesting points about the
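$$\sum (Y - Y_c)^2 = N S_{yx}^2 \quad \text{and} \quad \sum (Y - \bar{Y})^2 = N S_y^2$$
$$K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} = \frac{N S_{yx}^2}{N S_y^2} = \frac{S_{yx}^2}{S_y^2}$$
So
$$1 - r^2 = \frac{S_{yx}^2}{S_y^2}$$
or
$$\sqrt{1 - r^2} = \frac{S_{yx}}{S_y} \qquad (5.31)$$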
If the coefficient of correlation, r, is defined as the square root of the coefficient of determination
$$r = \sqrt{r^2}$$
$$r^2 = 1 - \frac{S_{yx}^2}{S_y^2}$$
$$r = \sqrt{1 - \frac{S_{yx}^2}{S_y^2}} \qquad (5.32)$$
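The decomposition behind Eqs. (5.27) to (5.32) can be checked on the data of Table 5-1 with the sketch below; the variable names are illustrative assumptions. With unrounded coefficients r² comes to about 0.7475. The figure 0.7455 quoted later in the notes equals the product b·b' with b rounded to 0.71, which is assumed here to be how it was obtained.

```python
# Sketch: explained, unexplained and total variation for the line of Y on X.
import math

x = [10, 12, 15, 20, 23]
y = [14, 17, 23, 21, 25]
n = len(x)
mean_y = sum(y) / n

a, b = 5080 / 590, 420 / 590            # unrounded coefficients from Table 5-2
y_c = [a + b * xv for xv in x]          # computed values on the fitted line

explained = sum((yc - mean_y) ** 2 for yc in y_c)
unexplained = sum((yv - yc) ** 2 for yv, yc in zip(y, y_c))
total = sum((yv - mean_y) ** 2 for yv in y)

r2 = explained / total                  # Eq. (5.27)
k2 = unexplained / total                # Eq. (5.28)

s_yx2 = unexplained / n                 # square of the standard error of estimate
s_y2 = total / n                        # variance of Y

r = math.sqrt(1 - s_yx2 / s_y2)         # Eq. (5.32)
print(round(r2, 4), round(k2, 4), round(r, 4))
```

The check that r² and K² add to one holds only for a line fitted by least squares; for any other line the explained and unexplained parts need not sum to the total.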
On carefully observing Eq. (5.32), it will be noticed that the ratio Syx/Sy will be large if the
coefficient of determination is small, and it will be small when the coefficient of determination
is large. Thus
Eq. (5.32) also implies that Syx is generally less than Sy. The two can at the most be equal, but
Interpretations of r2:
1. Even though the coefficient of determination, whose square root measures the degree
pure number, the unit in which Syx is measured becomes irrelevant. This facilitates
comparison between the two sets of data in terms of their coefficient of determination
r2 (or the coefficient of correlation r). This was not possible in terms of Syx as the
2. The value of r2 can range between 0 and 1. When r2 = 1, all the points on the scatter
diagram fall on the regression line and the entire variations are explained by the straight
line. On the other hand, when r2 = 0, none of the points on the scatter diagram falls on
the regression line, meaning thereby that there is no relationship between the two
not tell us about the direction of the relationship (whether it is positive or negative)
3. When r2 = 0.7455 (or any other value), 74.55% of the total variations in sales are
explained by the marketing expenditure used. What remains is the coefficient of non-
unexplained, which are due to factors other than the changes in the marketing
expenditure.
4. r2 provides the necessary link between regression and correlation which are the two
related aspects of a single problem of the analysis of relationship between two variables.
variables under study, without making a distinction between the dependent and
independent ones. Nor does it, therefore, help in predicting the value of one variable for
5. The coefficient of correlation overstates the degree of relationship and its meaning is
sales and marketing expenditure. Therefore, the coefficient of determination is a more
6. The sum of r and K never adds to one, unless one of the two is zero. That is, r + K can
the relationship between the variables. If we have information on more than one variable, we
might be interested in seeing if there is any connection - any association - between them. If
we found such an association, we might again be interested in predicting the value of one
1. Correlation literally means the relationship between two or more variables that vary in
movements in the other(s). On the other hand, regression means stepping back or
returning to the average value and is a mathematical measure expressing the average
2. Correlation coefficient rxy between two variables X and Y is a measure of the direction
and degree of the linear relationship between two variables that is mutual. It is
symmetric, i.e., ryx = rxy and it is immaterial which of X and Y is dependent variable and
Regression analysis aims at establishing the functional relationship between the two (or
more) variables under study and then using this relationship to predict or estimate the
value of the dependent variable for any given value of the independent variable(s). It
also reflects upon the nature of the variables, i.e., which is the dependent variable and which
is the independent variable. Regression coefficients are not symmetric in X and Y, i.e., byx ≠ bxy.
3. Correlation need not imply cause and effect relationship between the variables under
study. However, regression analysis clearly indicates the cause and effect relationship
4. Correlation coefficient rxy is a relative measure of the linear relationship between X and
±1.
On the other hand, the regression coefficients, byx and bxy are absolute measures
representing the change in the value of the variable Y (or X), for a unit change in the
value of the variable X (or Y). Once the functional form of regression curve is known;
by substituting the value of the independent variable we can obtain the value of the
dependent variable and this value will be in the units of measurement of the dependent
variable.
5. There may be non-sense correlation between two variables that is due to pure chance
and has no practical relevance, e.g., the correlation, between the size of shoe and the
a term of 5 years and the sale of motor tyres by a firm in that territory for the same
period.
Solution: Here the dependent variable is number of tyres; dependent on motor registrations.
Hence we put motor registrations as X and sales of tyres as Y and we have to establish the
regression line of Y on X.
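Let $d_x = X - \bar{X}$ and $d_y = Y - \bar{Y}$. The working table has the columns X, Y, $d_x$, $d_y$, $d_x^2$ and $d_x d_y$, with the following totals:

$$\sum X = 3{,}500, \quad \sum Y = 6{,}500, \quad \sum d_x = 0, \quad \sum d_y = 0, \quad \sum d_x^2 = 27{,}800, \quad \sum d_x d_y = 41{,}500$$

$$\bar{X} = \frac{\sum X}{N} = \frac{3{,}500}{5} = 700 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{6{,}500}{5} = 1{,}300$$

$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{41{,}500}{27{,}800} = 1.4928$$

The regression line of Y on X is

$$Y - \bar{Y} = b_{yx}(X - \bar{X})$$
$$Y - 1{,}300 = 1.4928\,(X - 700)$$
$$Y = 1.4928X + 255.04$$

When X = 850, the value of Y can be calculated from the above equation by putting X = 850:

$$Y = 1.4928 \times 850 + 255.04 = 1{,}523.92 \approx 1{,}524 \text{ tyres}$$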
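The arithmetic of this example can be checked with a short script. The sketch below starts from the totals quoted in the solution (the individual observations are not reproduced here), and the variable and function names are illustrative only.

```python
# Totals quoted in the worked solution (N = 5 observations).
n = 5
sum_x, sum_y = 3_500, 6_500          # sums of registrations and tyre sales
sum_dx2, sum_dxdy = 27_800, 41_500   # sums of dx^2 and dx*dy

x_bar = sum_x / n                    # mean of X
y_bar = sum_y / n                    # mean of Y
b_yx = sum_dxdy / sum_dx2            # regression coefficient of Y on X
a = y_bar - b_yx * x_bar             # intercept of the line Y = a + b_yx * X

def predict_tyres(registrations: float) -> float:
    """Estimated tyre sales for a given number of motor registrations."""
    return a + b_yx * registrations

estimate = predict_tyres(850)
print(round(b_yx, 4), round(a, 2), round(estimate))
```

Running it reproduces the slope, the intercept and the estimate for X = 850 found by hand.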
Example 5-2
A panel of Judges A and B graded seven debaters and independently awarded the
following marks:
Debater   Marks by Judge A   Marks by Judge B
2         34                 39
3         28                 26
4         30                 30
5         44                 38
6         38                 34
7         31                 28
An eighth debater was awarded 36 marks by Judge A, while Judge B was not present. If
Judge B were also present, how many marks would you expect him to award to the eighth
debater, assuming that the same degree of relationship exists in their judgement?
Solution: Let us use marks from Judge A as X and those from Judge B as Y. Now we have to
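With U = X − 35 and V = Y − 30 as deviations from the assumed means,

$$\bar{X} = A + \frac{\sum U}{N} = 35 + \frac{0}{7} = 35 \quad \text{and} \quad \bar{Y} = A + \frac{\sum V}{N} = 30 + \frac{17}{7} = 32.43$$

$$b_{yx} = b_{vu} = \frac{N \sum UV - \sum U \sum V}{N \sum U^2 - (\sum U)^2} = \frac{7 \times 121 - 0 \times 17}{7 \times 206 - 0} = 0.587$$

The regression line of Y on X is

$$Y - \bar{Y} = b_{yx}(X - \bar{X}) \quad \text{or} \quad Y = 0.587X + 11.87$$

For X = 36,

$$Y = 0.587 \times 36 + 11.87 = 33$$

Thus if Judge B were present, he would have awarded 33 marks to the eighth debater.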
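A minimal sketch of the same calculation from the raw marks follows. The marks of the first debater are not listed in the table as it stands; the values 40 and 32 used below are inferred from the totals in the solution (ΣU = 0, ΣV = 17, ΣUV = 121, ΣU² = 206) and should be treated as an assumption. The helper name is illustrative.

```python
# Marks by Judge A (X) and Judge B (Y) for debaters 1 to 7.
# The first entry of each list is inferred from the quoted totals (assumption).
marks_a = [40, 34, 28, 30, 44, 38, 31]
marks_b = [32, 39, 26, 30, 38, 34, 28]

def regression_y_on_x(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

b_yx, intercept = regression_y_on_x(marks_a, marks_b)
expected_b = intercept + b_yx * 36   # Judge A gave the eighth debater 36
print(round(b_yx, 3), round(intercept, 2), round(expected_b))
```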
Example 5-3
For some bivariate data, the following results were obtained.
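$$\bar{X} = 53.2, \qquad \bar{Y} = 27.9$$

The regression line of Y on X is

$$Y - \bar{Y} = b_{yx}(X - \bar{X}) \quad \text{or} \quad Y = -1.5X + 107.7$$

For X = 60,

$$Y = -1.5 \times 60 + 107.7 = 17.7$$

$$r^2 = b_{yx} \, b_{xy} = (-1.5) \times (-0.2) = 0.3$$

So $r = \pm\sqrt{0.3} = \pm 0.5477$.
Since both the regression coefficients are negative, we assign negative value to the
correlation coefficient:

$$r = -0.5477$$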
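The sign rule used here (r takes the common sign of the two regression coefficients) can be expressed as a small helper. This is only a sketch; the function name is hypothetical, and the inputs are the values that appear in the working above.

```python
import math

def correlation_from_coefficients(b_yx: float, b_xy: float) -> float:
    """r = sign * sqrt(b_yx * b_xy), taking the common sign of the coefficients."""
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    if product > 1:
        raise ValueError("b_yx * b_xy cannot exceed 1")
    sign = -1.0 if b_yx < 0 else 1.0
    return sign * math.sqrt(product)

r = correlation_from_coefficients(-1.5, -0.2)

# Estimate of Y at X = 60 from the line through the means (53.2, 27.9).
y_at_60 = 27.9 + (-1.5) * (60 - 53.2)
print(round(r, 4), round(y_at_60, 1))
```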
Example 5-4
Write regression equations of X on Y and of Y on X for the following data
X: 45 48 50 55 65 70 75 72 80 85
Y: 25 30 35 30 40 50 45 55 60 65
Solution: We prepare the table for working out the values for the regression lines.
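X      Y      U = X-65   V = Y-45   U2      UV      V2
45     25     -20        -20        400     400     400
48     30     -17        -15        289     255     225
50     35     -15        -10        225     150     100
55     30     -10        -15        100     150     225
65     40     0          -5         0       0       25
70     50     5          5          25      25      25
75     45     10         0          100     0       0
72     55     7          10         49      70      100
80     60     15         15         225     225     225
85     65     20         20         400     400     400
Total  645    435        -5 (ΣU)    -15 (ΣV)   ΣU2 = 1,813   ΣUV = 1,675   ΣV2 = 1,725

We have,

$$\bar{X} = \frac{\sum X}{N} = \frac{645}{10} = 64.5 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{435}{10} = 43.5$$

$$b_{yx} = \frac{N \sum UV - \sum U \sum V}{N \sum U^2 - (\sum U)^2} = \frac{(10)(1675) - (-5)(-15)}{(10)(1813) - (-5)^2} = \frac{16{,}675}{18{,}105} = 0.921$$

Regression equation of Y on X is

$$Y - \bar{Y} = b_{yx}(X - \bar{X}) \quad \text{or} \quad Y = 0.921X - 15.91$$

$$b_{xy} = \frac{N \sum UV - \sum U \sum V}{N \sum V^2 - (\sum V)^2} = \frac{(10)(1675) - (-5)(-15)}{(10)(1725) - (-15)^2} = \frac{16{,}675}{17{,}025} = 0.979$$

Regression equation of X on Y is

$$X - \bar{X} = b_{xy}(Y - \bar{Y}) \quad \text{or} \quad X = 0.979Y + 21.89$$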
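The sketch below recomputes both regression lines directly from the ten (X, Y) pairs listed in the problem, which gives an independent check on the hand-worked sums. The helper name is illustrative; deviations are taken from the actual means rather than from assumed means, which leaves the slopes unchanged.

```python
xs = [45, 48, 50, 55, 65, 70, 75, 72, 80, 85]
ys = [25, 30, 35, 30, 40, 50, 45, 55, 60, 65]

def slope_and_intercept(predictor, response):
    """Least-squares line of `response` on `predictor`: (slope, intercept)."""
    n = len(predictor)
    p_bar = sum(predictor) / n
    r_bar = sum(response) / n
    s_pr = sum((p - p_bar) * (r - r_bar) for p, r in zip(predictor, response))
    s_pp = sum((p - p_bar) ** 2 for p in predictor)
    slope = s_pr / s_pp
    return slope, r_bar - slope * p_bar

b_yx, a_yx = slope_and_intercept(xs, ys)   # Y on X: Y = a_yx + b_yx * X
b_xy, a_xy = slope_and_intercept(ys, xs)   # X on Y: X = a_xy + b_xy * Y

print(f"Y = {b_yx:.3f} X + ({a_yx:.2f})")
print(f"X = {b_xy:.3f} Y + ({a_xy:.2f})")
```

Since the product of the two slopes equals r², it must lie between 0 and 1; that is a quick sanity check on any pair of regression coefficients.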
Example 5-5
The lines of regression of a bivariate population are
8X – 10Y + 66 = 0 and 40X – 18Y – 214 = 0.
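Solution: The regression lines given are

$$8X - 10Y + 66 = 0$$
$$40X - 18Y - 214 = 0$$

(i) Since both the lines of regression pass through the mean values, the point $(\bar{X}, \bar{Y})$ will satisfy both equations:

$$8\bar{X} - 10\bar{Y} + 66 = 0$$
$$40\bar{X} - 18\bar{Y} - 214 = 0$$

Solving these, $\bar{X} = 13$ and $\bar{Y} = 17$.

(ii) For correlation coefficient between X and Y, we have to calculate the values of byx and
bxy. Taking the first line as the regression of Y on X,

$$10Y = 8X + 66 \quad \Rightarrow \quad b_{yx} = \frac{8}{10} = \frac{4}{5}$$

and taking the second line as the regression of X on Y,

$$40X = 18Y + 214 \quad \Rightarrow \quad b_{xy} = \frac{18}{40} = \frac{9}{20}$$

$$r^2 = b_{yx} \cdot b_{xy} = \frac{4}{5} \times \frac{9}{20} = \frac{9}{25}$$

So $r = +\sqrt{9/25} = +0.6$.
Both the values of the regression coefficients being positive, we have to consider only the
positive value of r.

With $S_x = 3$ and $b_{yx} = r\,S_y / S_x$,

$$S_y = \frac{b_{yx}\, S_x}{r} = \frac{4}{5} \times \frac{3}{0.6} = 4$$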
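The steps of this example (means from the intersection of the two lines, r from the two slopes, then S_y) can be sketched as follows. The tuple layout and function name are illustrative choices, and the value S_x = 3 is the one used in the working above.

```python
import math

# Each line is stored as (a, b, c) for a*X + b*Y + c = 0.
line_y_on_x = (8.0, -10.0, 66.0)      # 8X - 10Y + 66 = 0
line_x_on_y = (40.0, -18.0, -214.0)   # 40X - 18Y - 214 = 0

def intersection(l1, l2):
    """Solve the two linear equations by Cramer's rule."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    x = (b1 * c2 - b2 * c1) / det
    y = (a2 * c1 - a1 * c2) / det
    return x, y

x_bar, y_bar = intersection(line_y_on_x, line_x_on_y)

b_yx = -line_y_on_x[0] / line_y_on_x[1]   # slope of Y on X: 8/10
b_xy = -line_x_on_y[1] / line_x_on_y[0]   # slope of X on Y: 18/40
r = math.sqrt(b_yx * b_xy)                # positive: both slopes are positive

s_x = 3.0
s_y = b_yx * s_x / r                      # from b_yx = r * S_y / S_x
print(x_bar, y_bar, round(r, 2), round(s_y, 2))
```

If the roles of the two lines were swapped, the product of the slopes would exceed 1, which is how the assignment of "Y on X" and "X on Y" can be verified.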
Example 5-6
The height of a child increases at a rate given in the table below. Fit the straight line
using the method of least-square and calculate the average increase and the standard
error of estimate.
Month: 1 2 3 4 5 6 7 8 9 10
Height: 52.5 58.7 65 70.2 75.4 81.1 87.2 95.5 102.2 108.4
Considering the regression line as Y = a + bX, we can obtain the values of a and b from the
above values.
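$$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N \sum X^2 - (\sum X)^2} = 45.73$$

$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2} = \frac{10 \times 4887.5 - 55 \times 796.2}{10 \times 385 - 55 \times 55} = 6.16$$

$$Y = 45.73 + 6.16X$$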
For standard error of estimation, we note the calculated values of the variable against the
observed values,
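$$S_{yx} = \sqrt{\frac{1}{N} \sum_i (Y_i - Y_{c,i})^2} = \sqrt{\frac{10.421}{10}} = 1.02$$

where $Y_{c,i}$ denotes the value computed from the regression line for the i-th month.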
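The least-squares fit and the standard error of estimate for this example can be reproduced with the short sketch below. It uses the month and height figures from the problem and the same formulas for a, b and S_yx (with divisor N); the variable names are illustrative.

```python
import math

months = list(range(1, 11))
heights = [52.5, 58.7, 65, 70.2, 75.4, 81.1, 87.2, 95.5, 102.2, 108.4]

n = len(months)
sum_x = sum(months)
sum_y = sum(heights)
sum_xy = sum(x * y for x, y in zip(months, heights))
sum_x2 = sum(x * x for x in months)

denom = n * sum_x2 - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / denom        # average increase per month
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom   # intercept

# Standard error of estimate: root mean squared residual (divisor N).
residuals = [y - (a + b * x) for x, y in zip(months, heights)]
s_yx = math.sqrt(sum(e * e for e in residuals) / n)

print(round(a, 2), round(b, 2), round(s_yx, 2))
```

The slope b is the average monthly increase in height asked for in the problem.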
Example 5-7
Given X = 4Y+5 and Y = kX + 4 are the lines of regression of X on Y and of Y on X
If k = 1/16, find the means of the two variables and coefficient of correlation between them.
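So bxy = 4
We get byx = k
Now

$$r^2 = b_{xy} \cdot b_{yx} = 4k$$

Since $0 \le r^2 \le 1$, we obtain $0 \le 4k \le 1$, or

$$0 \le k \le \frac{1}{4}$$

Now for $k = \frac{1}{16}$,

$$r^2 = 4 \times \frac{1}{16} = \frac{1}{4}, \qquad r = +\frac{1}{2}$$

When $k = \frac{1}{16}$, the regression line of Y on X becomes

$$Y = \frac{1}{16}X + 4 \quad \text{or} \quad X - 16Y + 64 = 0$$

Since the lines of regression pass through the mean values of the variables, we obtain the revised
equations as

$$\bar{X} - 4\bar{Y} - 5 = 0$$
$$\bar{X} - 16\bar{Y} + 64 = 0$$

Solving these, $\bar{X} = 28$ and $\bar{Y} = 5.75$.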
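The elimination step behind these means can be written out as follows. This is just one way to solve the pair of equations, using the symbols of the example:

```latex
\begin{aligned}
(\bar{X} - 4\bar{Y} - 5) - (\bar{X} - 16\bar{Y} + 64) &= 0 \\
12\bar{Y} - 69 &= 0 \quad\Rightarrow\quad \bar{Y} = 5.75 \\
\bar{X} &= 4\bar{Y} + 5 = 4(5.75) + 5 = 28
\end{aligned}
```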
Example 5-8
A firm knows from its past experience that its monthly average expenses (X) on advertisement
are Rs 25,000 with standard deviation of Rs 25.25. Similarly, its average monthly product sales
(Y) have been Rs 45,000 with standard deviation of Rs 50.50. Given this information and also
the coefficient of correlation between sales and advertisement expenditure as 0.75, estimate
(i) the most appropriate value of sales for an advertisement expenditure of Rs 50,000
(ii) the most appropriate advertisement expenditure for achieving a sales target of
Rs 80,000
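Solution: Given the following information:

$$\bar{X} = \text{Rs } 25{,}000, \qquad S_x = \text{Rs } 25.25$$
$$\bar{Y} = \text{Rs } 45{,}000, \qquad S_y = \text{Rs } 50.50$$
$$r = 0.75$$

(i) Using the equation $Y_c - \bar{Y} = r \frac{S_y}{S_x}(X - \bar{X})$, the most appropriate value of sales $Y_c$ for an
advertisement expenditure of Rs 50,000 is

$$Y_c - 45{,}000 = 0.75 \times \frac{50.50}{25.25} \times (50{,}000 - 25{,}000)$$
$$Y_c = 45{,}000 + 37{,}500 = \text{Rs } 82{,}500$$

(ii) Using the equation $X_c - \bar{X} = r \frac{S_x}{S_y}(Y - \bar{Y})$, the most appropriate value of advertisement
expenditure $X_c$ for a sales target of Rs 80,000 is

$$X_c - 25{,}000 = 0.75 \times \frac{25.25}{50.50} \times (80{,}000 - 45{,}000)$$
$$X_c = 13{,}125 + 25{,}000 = \text{Rs } 38{,}125$$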
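Both parts use the same formula with the roles of X and Y exchanged, which a single helper makes explicit. The sketch below assumes only the figures given in the example; the function and parameter names are hypothetical.

```python
def regression_estimate(target_mean, other_mean, r, s_target, s_other, other_value):
    """Estimate the target variable from the other one using
    target - target_mean = r * (s_target / s_other) * (other_value - other_mean)."""
    return target_mean + r * (s_target / s_other) * (other_value - other_mean)

# (i) Sales (Y) for an advertisement expenditure (X) of Rs 50,000.
sales = regression_estimate(45_000, 25_000, 0.75, 50.50, 25.25, 50_000)

# (ii) Advertisement expenditure (X) for a sales target (Y) of Rs 80,000.
advertising = regression_estimate(25_000, 45_000, 0.75, 25.25, 50.50, 80_000)

print(sales, advertising)
```

Note that part (ii) uses the regression of X on Y rather than inverting the line from part (i); the two lines coincide only when r = ±1.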
3. What is meant by ‘regression’? Why should there be in general, two lines of regression
for each bivariate distribution? How the two regression lines are useful in studying
(iv) Coefficient of Determination
(v) Coefficient of Non-determination
analysis.
X : 1 3 4 8 9 11 14
Y : 1 2 4 5 7 8 9
Hence obtain
d) X and Y
8. What are regression coefficients? Show that r2 = byx. bxy where the symbols have their
usual meanings. What can you say about the angle between the regression lines when
9. Obtain the equations of the lines of regression of Y on X from the following data.
X : 12 18 24 30 36 42 48
Y : 5.27 5.68 6.25 7.21 8.02 8.71 8.42
10. The following table gives the ages and blood pressure of 9 women.
Age (X) : 56 42 36 47 49 42 60 72 63
Blood Pressure(Y) 147 125 118 128 145 140 155 160 149
(ii) Estimate the blood pressure of a woman whose age is 45 years.
11. Given the following results for the height (X) and weight (Y) in appropriate units of
1,000 students:
Obtain the equations of the two lines of regression. Estimate the height of a student A
who weighs 200 units and also estimate the weight of the student B whose height is
60 units.
12. From the following data, find out the probable yield when the rainfall is 29”.
Rainfall Yield
Mean 25” 40 units per hectare
Standard Deviation 3” 6 units per hectare
13. A study of wheat prices at two cities yielded the following data:
City A City B
Coefficient of correlation r is 0.774. Estimate from the above data the most likely
price of wheat
14. Find out the regression equation showing the regression of capacity utilisation on
r = 0.62
Estimate the production, when capacity utilisation is 70%.
15. The following table shows the mean and standard deviation of the prices of two shares
in a stock exchange.
likely price of share A corresponding to a price of Rs 55, observed in the case of share
B.
16. Find out the regression coefficients of Y on X and of X on Y on the basis of following
data:
17. Find the regression equation of X and Y and the coefficient of correlation from the
following data:
2X – 3Y = 0 and 4Y – 5X-8 = 0.
(i) Identify which of the two can be called regression line of Y on X and of X on Y.
Which of these is the lines of regression of X and Y. Find rxy and Sy when Sx = 3
21. The regression equation of profits (X) on sales (Y) of a certain firm is 3Y – 5X +108 =
0. The average sales of the firm were Rs 44,000 and the variance of profits is 9/16th of
the variance of sales. Find the average profits and the coefficient of correlation between