DT Notes Unit 1 & 2 Part 1

Uploaded by

garvbarreja

Decision Techniques for Business

Unit- 1

1.1 INTRODUCTION

For a layman, ‘Statistics’ means numerical information expressed in quantitative terms.

This information may relate to objects, subjects, activities, phenomena, or regions of

space. As a matter of fact, data have no limits as to their reference, coverage,

and scope. At the macro level, these are data on gross national product and shares of

agriculture, manufacturing, and services in GDP (Gross Domestic Product).

At the micro level, individual firms, howsoever small or large, produce extensive

statistics on their operations. The annual reports of companies contain a variety of data

on sales, production, expenditure, inventories, capital employed, and other activities.

These data are often field data, collected by employing scientific survey techniques.

Unless regularly updated, such data are the product of a one-time effort and have limited

use beyond the situation that may have called for their collection. A student knows

statistics more intimately as a subject of study like economics, mathematics, chemistry,

physics, and others. It is a discipline, which scientifically deals with data, and is often

described as the science of data. In dealing with statistics as data, statistics has

developed appropriate methods of collecting, presenting, summarizing, and analysing

data, and thus consists of a body of these methods.

1.2 MEANING AND DEFINITIONS OF STATISTICS

In the beginning, it may be noted that the word ‘statistics’ is used rather curiously in

two senses: plural and singular. In the plural sense, it refers to a set of figures or data. In

the singular sense, statistics refers to the whole body of tools that are used to

collect data, organise and interpret them and, finally, to draw conclusions from them.

It should be noted that both the aspects of statistics are important if the quantitative data

are to serve their purpose. If statistics, as a subject, is inadequate and consists of poor

methodology, we could not know the right procedure to extract from the data the

information they contain. Similarly, if our data are defective, inadequate,

or inaccurate, we could not reach the right conclusions even though our subject is well

developed.

A.L. Bowley has defined statistics in three ways: (i) statistics is the science of counting, (ii)

statistics may rightly be called the science of averages, and (iii) statistics is the

science of measurement of social organism regarded as a whole in all its

manifestations. Boddington defined it as follows: statistics is the science of estimates and probabilities.

Further, W.I. King has defined statistics in a wider context: the science of statistics is

the method of judging collective, natural or social phenomena from the results obtained

by the analysis or enumeration or collection of estimates.

Seligman stated that statistics is a science that deals with the methods of collecting,

classifying, presenting, comparing and interpreting numerical data collected to throw

some light on any sphere of enquiry. Spiegel defines statistics, highlighting its role in

decision-making particularly under uncertainty, as follows: statistics is concerned with

scientific methods for collecting, organising, summarising, presenting and analysing

data as well as drawing valid conclusions and making reasonable decisions on the basis

of such analysis. According to Prof. Horace Secrist, Statistics is the aggregate of facts,

affected to a marked extent by multiplicity of causes, numerically expressed,

enumerated or estimated according to reasonable standards of accuracy, collected in a

systematic manner for a pre-determined purpose, and placed in relation to each other.

From the above definitions, we can highlight the major characteristics of statistics as

follows:

(i) Statistics are the aggregates of facts. It means a single figure is not statistics.

For example, national income of a country for a single year is not statistics but

the same for two or more years is statistics.

(ii) Statistics are affected by a number of factors. For example, sale of a product

depends on a number of factors such as its price, quality, competition, the

income of the consumers, and so on.

(iii) Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to

erroneous conclusions. Hence, it is necessary that conclusions must be based on

accurate figures.

(iv) Statistics must be collected in a systematic manner. If data are collected in a

haphazard manner, they will not be reliable and will lead to misleading

conclusions.

(v) Statistics must be collected for a pre-determined purpose.

(vi) Lastly, Statistics should be placed in relation to each other. If one collects data

unrelated to each other, then such data will be confusing and will not lead to any

logical conclusions. Data should be comparable over time and over space.

1.3 TYPES OF DATA AND DATA SOURCES

Statistical data are the basic raw material of statistics. Data may relate to an activity of

our interest, a phenomenon, or a problem situation under study. They arise as a

result of the process of measuring, counting and/or observing. Statistical data, therefore,

refer to those aspects of a problem situation that can be measured, quantified, counted,

or classified. Any object, subject, phenomenon, or activity that generates data through

this process is termed a variable. In other words, a variable is one that shows a degree

of variability when successive measurements are recorded. In statistics, data are

classified into two broad categories: quantitative data and qualitative data. This

classification is based on the kind of characteristics that are measured.

Quantitative data are those that can be quantified in definite units of measurement.

These refer to characteristics whose successive measurements yield quantifiable

observations. Depending on the nature of the variable observed for measurement,

quantitative data can be further categorized as continuous and discrete data.

Obviously, a variable may be a continuous variable or a discrete variable.

(i) Continuous data represent the numerical values of a continuous variable. A

continuous variable is the one that can assume any value between any two points

on a line segment, thus representing an interval of values. The values are quite

precise and close to each other, yet distinguishably different. All characteristics

such as weight, length, height, thickness, velocity, temperature, tensile strength,

etc., represent continuous variables. Thus, the data recorded on these and

similar other characteristics are called continuous data. It may be noted that a

continuous variable assumes the finest unit of measurement. Finest in the sense

that it enables measurements to the maximum degree of precision.

(ii) Discrete data are the values assumed by a discrete variable. A discrete variable

is the one whose outcomes are measured in fixed numbers. Such data are

essentially count data. These are derived from a process of counting, such as the

number of items possessing or not possessing a certain characteristic. The

number of customers visiting a departmental store every day, the incoming

flights at an airport, and the defective items in a consignment received for sale,

are all examples of discrete data.

Qualitative data refer to qualitative characteristics of a subject or an object. A

characteristic is qualitative in nature when its observations are defined and noted in

terms of the presence or absence of a certain attribute in discrete numbers. These data

are further classified as nominal and rank data.

(i) Nominal data are the outcome of classification into two or more categories of

items or units comprising a sample or a population according to some quality

characteristic. Classification of students according to sex (as males and

females), of workers according to skill (as skilled, semi-skilled, and unskilled),

and of employees according to the level of education (as matriculates,

undergraduates, and post-graduates), all result into nominal data. Given any

such basis of classification, it is always possible to assign each item to a

particular class and make a summation of items belonging to each class. The

count data so obtained are called nominal data.

(ii) Rank data, on the other hand, are the result of assigning ranks to specify order

in terms of the integers 1,2,3, ..., n. Ranks may be assigned according to the

level of performance in a test, a contest, a competition, an interview, or a

show. The candidates appearing in an interview, for example, may be assigned

ranks in integers ranging from 1 to n, depending on their performance in the

interview. Ranks so assigned can be viewed as the ordered values of a

variable involving performance as the quality characteristic.

Data sources could be seen as of two types, viz., secondary and primary. The two can

be defined as under:

(i) Secondary data: They already exist in some form, published or unpublished,

in an identifiable secondary source. They are, generally, available from

published source(s), though not necessarily in the form actually required.

(ii) Primary data: Those data which do not already exist in any form, and thus have

to be collected for the first time from the primary source(s). By their very nature,

these data require fresh and first-time collection covering the whole population

or a sample drawn from it.

1.4 TYPES OF STATISTICS

There are two major divisions of statistics, namely descriptive statistics and inferential

statistics. Descriptive statistics deals with collecting, summarizing, and

simplifying data, which are otherwise quite unwieldy and voluminous. It seeks to

achieve this in a manner that meaningful conclusions can be readily drawn from the

data. Descriptive statistics may thus be seen as comprising methods of bringing out and

highlighting the latent characteristics present in a set of numerical data. It not only

facilitates an understanding of the data and systematic reporting thereof,

but also makes them amenable to further discussion, analysis, and interpretations.

The first step in any scientific inquiry is to collect data relevant to the problem in hand.

When the inquiry relates to physical and/or biological sciences, data collection is

normally an integral part of the experiment itself. In fact, the very manner in which an

experiment is designed, determines the kind of data it would require and/or generate.

The problem of identifying the nature and the kind of the relevant data is thus

automatically resolved as soon as the design of experiment is finalized. It is possible in

the case of physical sciences. In the case of social sciences, where the required data are

often collected through a questionnaire from a number of carefully selected

respondents, the problem is not that simply resolved. For one thing, designing the

questionnaire itself is a critical initial problem. For another, the number of respondents

to be accessed for data collection and the criteria for selecting them have their own

implications and importance for the quality of results obtained. Further, once the data have

been collected, they are assembled, organized, and presented in the form of

appropriate tables to make them readable. Wherever needed, figures, diagrams, charts,

and graphs are also used for better presentation of the data. A useful tabular and graphic

presentation of data will require that the raw data be properly classified in accordance

with the objectives of investigation and the relational analysis to be carried out.

A well thought-out and sharp data classification facilitates easy description of the

hidden data characteristics by means of a variety of summary measures. These include

measures of central tendency, dispersion, skewness, and kurtosis, which constitute the

essential scope of descriptive statistics. These form a large part of the subject matter

of any basic textbook on the subject, and thus they are being discussed in that order

here as well.

Inferential statistics, also known as inductive statistics, goes beyond describing a

given problem situation by means of collecting, summarizing, and meaningfully

presenting the related data. Instead, it consists of methods that are used for drawing

inferences, or making broad generalizations, about a totality of observations on the basis

of knowledge about a part of that totality. The totality of observations about which an

inference may be drawn, or a generalization made, is called a population or a universe.

The part of totality, which is observed for data collection and analysis to gain

knowledge about the population, is called a sample.

The desired information about a given population of our interest may also be collected

by observing all the units comprising the population. This total coverage is called

a census. Getting the desired value for the population through a census is not always

feasible and practical for various reasons. Apart from time and money considerations

making the census operations prohibitive, observing each individual unit of the

population with reference to any data characteristic may at times involve even

destructive testing. In such cases, obviously, the only recourse available is to employ

the partial or incomplete information gathered through a sample for the purpose. This is

precisely what inferential statistics does. Thus, obtaining a particular value from the

sample information and using it for drawing an inference about the entire population

underlies the subject matter of inferential statistics. Consider a

situation in which one is required to know the average body weight of all the college

students in a given cosmopolitan city during a certain year. A quick and easy way to do

this is to record the weight of only 500 students, out of a total strength of, say,

10000, or an unknown total strength, take the average, and use this average based on

incomplete weight data to represent the average body weight of all the college students.

In a different situation, one may have to repeat this exercise for some future year and

use the quick estimate of average body weight for a comparison. This may be needed,

for example, to decide whether the weight of the college students has undergone a

significant change over the years compared.
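
The sampling idea described above can be tried out numerically. The following is only a minimal sketch: the population of 10,000 weights is simulated, and the mean of 60 kg, the spread of 10 kg, and the function name estimate_mean_from_sample are illustrative assumptions, not figures from any real survey.

```python
import random
from statistics import mean


def estimate_mean_from_sample(population, sample_size, seed=0):
    """Return the average of a simple random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    return mean(sample)


# Hypothetical population: 10,000 simulated body weights in kilograms.
_rng = random.Random(42)
population = [_rng.gauss(60, 10) for _ in range(10_000)]

population_mean = mean(population)                        # what a census would give
sample_mean = estimate_mean_from_sample(population, 500)  # what the sample gives

print(f"Census average : {population_mean:.2f} kg")
print(f"Sample average : {sample_mean:.2f} kg")
```

The two printed averages will generally differ slightly. That gap is the sampling error which inferential statistics sets out to quantify.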

Inferential statistics helps to evaluate the risks involved in reaching inferences or

generalizations about an unknown population on the basis of sample information. For

example, an inspection of a sample of five battery cells drawn from a given lot may

reveal that all the five cells are in perfectly good condition. This information may be

used to conclude that the entire lot is good enough to buy or not.

Since this inference is based on the examination of a sample of limited number of cells,

it is quite possible that not all the cells in the lot are in order. It is also possible that all

the items that may be included in the sample are unsatisfactory. This may be used to

conclude that the entire lot is of unsatisfactory quality, whereas the fact may indeed be

otherwise. It may, thus, be noticed that there is always a risk of an inference about a

population being incorrect when based on the knowledge of a limited sample. The

rescue in such situations lies in evaluating such risks. For this, statistics provides the

necessary methods. These centre on quantifying, in probabilistic terms, the chances of

decisions taken on the basis of sample information being incorrect. This requires an

understanding of the what, why, and how of probability and probability distributions

to equip ourselves with methods of drawing statistical inferences and estimating the

degree of reliability of these inferences.
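
To make the battery-cell risk concrete, the sketch below computes the chance that a sample of five cells contains no defective cell for several assumed lot compositions. The lot size of 100, the defective counts, and the function name prob_all_sampled_good are hypothetical choices made only for illustration. The calculation assumes cells are drawn at random without replacement.

```python
from math import comb


def prob_all_sampled_good(lot_size, defective, sample_size):
    """Probability that a random sample drawn without replacement has no defective item."""
    good = lot_size - defective
    if sample_size > good:
        return 0.0  # not enough good items to fill the sample
    # Ways to pick the sample from good items only, over all possible samples.
    return comb(good, sample_size) / comb(lot_size, sample_size)


# Hypothetical lot of 100 cells, inspected with a sample of 5.
for defective in (0, 5, 20, 50):
    p = prob_all_sampled_good(100, defective, 5)
    print(f"{defective:>2} defective cells in lot -> P(all 5 sampled are good) = {p:.3f}")
```

Under these assumptions, a lot in which one cell in five is defective would still yield a perfectly clean sample about 32 per cent of the time, which is why a good sample alone cannot guarantee a good lot.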

1.5 SCOPE OF STATISTICS

Apart from the methods comprising the scope of descriptive and inferential branches of

statistics, statistics also consists of methods of dealing with a few other issues of specific

nature. Since these methods are essentially descriptive in nature, they have been

discussed here as part of the descriptive statistics. These are mainly concerned with the

following:

(i) It often becomes necessary to examine how two paired data sets are related.

For example, we may have data on the sales of a product and the expenditure

incurred on its advertisement for a specified number of years. Given that sales

and advertisement expenditure are related to each other, it is useful to examine

the nature of relationship between the two and quantify the degree of that

relationship. As this requires the use of appropriate statistical methods, these fall

under the purview of what we call regression and correlation analysis.

(ii) Situations occur quite often when we require averaging (or totalling) of data

on prices and/or quantities expressed in different units of measurement. For

example, price of cloth may be quoted per meter of length and that of wheat per

kilogram of weight. Since ordinary methods of totalling and averaging do not

apply to such price/quantity data, special techniques needed for the purpose are

developed under index numbers.

(iii) Many a time, it becomes necessary to examine the past performance of an

activity with a view to determining its future behaviour. For example, when

engaged in the production of a commodity, monthly product sales are an

important measure of evaluating performance. This requires compilation and

analysis of relevant sales data over time. The more complex the activity, the

more varied the data requirements. For profit maximising and future sales

planning, forecast of likely sales growth rate is crucial. This needs careful

collection and analysis of past sales data. All such concerns are taken care of

under time series analysis.

(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business

or economic activity has indeed been engaging the minds of all concerned. This

is particularly important when it relates to product sales and demand, which

serve the necessary basis of production scheduling and planning. The

regression, correlation, and time series analyses together help develop the basic

methodology to do the needful. Thus, the study of methods and techniques of

obtaining the likely estimates on business/economic variables comprises the

scope of what we do under business forecasting.
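
As an illustration of point (i) above, the degree of relationship between two paired data sets is commonly summarised by a correlation coefficient. The sketch below computes Pearson's coefficient by hand. The advertising and sales figures are invented for the example, and pearson_r is a hypothetical helper name, not something defined in these notes.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mx) ** 2 for a in x)                   # variation in x
    syy = sum((b - my) ** 2 for b in y)                   # variation in y
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Hypothetical yearly figures (same currency units, five years).
advertising = [10, 12, 15, 18, 20]
sales = [100, 115, 140, 170, 185]

r = pearson_r(advertising, sales)
print(f"Correlation coefficient r = {r:.3f}")
```

A value of r close to +1 or -1 indicates a strong linear association, while a value near 0 indicates a weak one. It says nothing by itself about which variable, if either, causes the other.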

Keeping in view the importance of inferential statistics, the scope of statistics may

finally be restated as consisting of statistical methods which facilitate decision-making

under conditions of uncertainty. While the term statistical methods is often used to

cover the subject of statistics as a whole, in particular it refers to methods by which

statistical data are analysed, interpreted, and the inferences drawn for decision-making.

Though generic in nature and versatile in their applications, statistical methods have

come to be widely used, especially in all matters concerning business and economics.

These are also being increasingly used in biology, medicine, agriculture, psychology,

and education. The scope of application of these methods has started opening and

expanding in a number of social science disciplines as well. Even a political scientist

finds them of increasing relevance for examining the political behaviour and it is, of

course, no surprise to find even historians using statistical data, for history is essentially past

data presented in a certain factual format.

1.6 IMPORTANCE OF STATISTICS IN BUSINESS

There are three major functions in any business enterprise in which the statistical

methods are useful. These are as follows:

(i) The planning of operations: This may relate to either special projects or to the

recurring activities of a firm over a specified period.

(ii) The setting up of standards: This may relate to the size of employment,

volume of sales, fixation of quality norms for the manufactured product, norms

for the daily output, and so forth.

(iii) The function of control: This involves comparison of actual production

achieved against the norm or target set earlier. In case the production has

fallen short of the target, it suggests remedial measures so that such a deficiency

does not occur again.

A point worth noting is that although these three functions (planning of operations,

setting standards, and control) are separate, in practice they are very much

interrelated.

Different authors have highlighted the importance of Statistics in business. For instance,

Croxton and Cowden give numerous uses of Statistics in business such as project

planning, budgetary planning and control, inventory planning and control, quality

control, marketing, production and personnel administration. Within these also they

have specified certain areas where Statistics is very relevant. Another author, Irving

W. Burr, dealing with the place of statistics in an industrial organisation, specifies a

number of areas where statistics is extremely useful. These are: customer wants and

market research, development design and specification, purchasing,

production, inspection, packaging and shipping, sales and complaints, inventory and

maintenance, costs, management control, industrial engineering and research.

Statistical problems arising in the course of business operations are multitudinous. As

such, one may do no more than highlight some of the more important ones to emphasise

the relevance of statistics to the business world. In the sphere of production, for

example, statistics can be useful in various ways.

Statistical quality control methods are used to ensure the production of quality goods.

This is achieved by identifying and rejecting defective or substandard goods. Sales targets

can be fixed on the basis of sales forecasts, which are made by using various methods

of forecasting. Analysis of sales effected against the targets set earlier would indicate

the deficiency in achievement, which may be on account of several causes: (i) targets

were too high and unrealistic, (ii) salesmen's performance has been poor, (iii) competition

has increased, (iv) the quality of the company's product is poor, and so on. These

factors can be further investigated.

Another sphere in business where statistical methods can be used is personnel

management. Here, one is concerned with the fixation of wage rates, incentive norms

and performance appraisal of individual employees. The concept of productivity is

very relevant here. On the basis of measurement of productivity, the productivity bonus

is awarded to the workers. Comparisons of wages and productivity are undertaken in

order to ensure increases in industrial productivity.

Statistical methods could also be used to ascertain the efficacy of a certain product, say,

medicine. For example, a pharmaceutical company has developed a new medicine in

the treatment of bronchial asthma. Before launching it on commercial basis, it wants to

ascertain the effectiveness of this medicine. It undertakes an experiment involving

the formation of two comparable groups of asthma

patients. One group is given this new medicine for a specified period and the other

one is treated with the usual medicines. Records are maintained for the two groups for

the specified period. This record is then analysed to ascertain if there is any significant

difference in the recovery of the two groups. If the difference is really significant

statistically, the new medicine is commercially launched.
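
The text does not say which test the company would use to judge significance. As one possible approach, the sketch below uses a simple permutation test: it asks how often a difference in average recovery time at least as large as the observed one would arise if the group labels were shuffled at random. The recovery times, group sizes, and the function name permutation_p_value are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign patients to groups at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples


# Hypothetical days to recovery for eight patients in each group.
new_medicine = [6, 7, 5, 8, 6, 7, 5, 6]
usual_medicine = [9, 8, 10, 9, 11, 8, 9, 10]

p_value = permutation_p_value(new_medicine, usual_medicine)
print(f"Approximate p-value: {p_value:.4f}")
```

A small p-value (conventionally below 0.05) would suggest that the difference between the two groups is unlikely to be due to chance alone. A real clinical study would involve far larger groups and a pre-specified analysis plan.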

1.7 LIMITATIONS OF STATISTICS

Statistics has a number of limitations. The pertinent ones are as follows:

(i) There are certain phenomena or concepts where statistics cannot be used. This

is because these phenomena or concepts are not amenable to measurement.

For example, beauty, intelligence, courage cannot be quantified. Statistics has

no place in all such cases where quantification is not possible.

(ii) Statistics reveal the average behaviour, the normal or the general trend. The

'average' concept, if applied to an individual or a particular

situation may lead to a wrong conclusion and sometimes may be disastrous. For

example, one may be misguided when told that the average depth of a river

from one bank to the other is four feet, when there may be some points in

between where its depth is far more than four feet. On this understanding, one

may enter those points having greater depth, which may be hazardous.

(iii) Since statistics are collected for a particular purpose, such data may not be

relevant or useful in other situations or cases. For example, secondary data

(i.e., data originally collected by someone else) may not be useful for the other

person.

(iv) Statistics are not 100 per cent precise as is Mathematics or Accountancy.

Those who use statistics should be aware of this limitation.

(v) In statistical surveys, sampling is generally used as it is not physically possible

to cover all the units or elements comprising the universe. The results may not

be appropriate as far as the universe is concerned. Moreover, different surveys

based on the same size of sample but different sample units may yield different

results.

(vi) At times, association or relationship between two or more variables is studied

in statistics, but such a relationship does not indicate a 'cause and effect'

relationship. It simply shows the similarity or dissimilarity in the movement of

the two variables. In such cases, it is the user who has to interpret the results

carefully, pointing out the type of relationship obtained.

(vii) A major limitation of statistics is that it does not reveal everything pertaining to a certain

phenomenon. There is some background information that statistics does not

cover. Similarly, there are some other aspects related to the problem on hand,

which are also not covered. The user of Statistics has to be well informed and

should interpret Statistics keeping in mind all other aspects having relevance on

the given problem.

Apart from the limitations of statistics mentioned above, there are misuses of it. Many

people, knowingly or unknowingly, use statistical data in a wrong manner. Let us see

what the main misuses of statistics are so that the same could be avoided when one has

to use statistical data. The misuse of Statistics may take several forms some of which

are explained below.

(i) Sources of data not given: At times, the source of data is not given. In the

absence of the source, the reader does not know how far the data are reliable.

Further, if he wants to refer to the original source, he is unable to do so.

(ii) Defective data: Another misuse is that sometimes one gives defective data.

This may be done knowingly in order to defend one's position or to prove a

particular point. This apart, the definition used to denote a certain phenomenon

may be defective. For example, in case of data relating to unemployed persons,

the definition may include even those who are employed, though partially. The

question here is how far it is justified to include partially employed persons

amongst unemployed ones.

(iii) Unrepresentative sample: In statistics, one often has to conduct a

survey, which necessitates choosing a sample from the given population or

universe. The sample may turn out to be unrepresentative of the universe. One

may choose a sample just on the basis of convenience. He may collect the

desired information from either his friends or nearby respondents in his

neighbourhood even though such respondents do not constitute a representative

sample.

(iv) Inadequate sample: Earlier, we have seen that a sample that is unrepresentative

of the universe is a major misuse of statistics. This apart, at times one may

conduct a survey based on an extremely inadequate sample. For example, in

a city we may find that there are 1,00,000 households. When we have to conduct

a household survey, we may take a sample of merely 100 households comprising

only 0.1 per cent of the universe. A survey based on such a small sample may

not yield right information.

(v) Unfair Comparisons: An important misuse of statistics is making unfair

comparisons from the data collected. For instance, one may construct an index

of production choosing the base year where the production was much less. Then

he may compare the subsequent year's production from this low base.

Such a comparison will undoubtedly give a rosy picture of the production

though in reality it is not so. Another source of unfair comparisons could be

when one makes absolute comparisons instead of relative ones. An absolute

comparison of two figures, say, of production or export, may show a good

increase, but in relative terms it may turn out to be negligible. Another

example of unfair comparison is when the population in two cities is different,

but a comparison of overall death rates and deaths by a particular disease is

attempted. Such a comparison is wrong. Likewise, when data are not properly

classified or when changes in the composition of population in the two years are

not taken into consideration, comparisons of such data would be unfair as they

would lead to misleading conclusions.

(vi) Unwarranted conclusions: Another misuse of statistics may be on account of

unwarranted conclusions. This may be as a result of making false assumptions.

For example, while making projections of population in the next five years,

one may assume a lower rate of growth though the past two years indicate

otherwise. Sometimes one may not be sure about the changes in business

environment in the near future. In such a case, one may use an assumption that

may turn out to be wrong. Another source of unwarranted conclusion may be

the use of wrong average. Suppose in a series there are extreme values, one is

too high while the other is too low, such as 800 and 50. The use of an

arithmetic average in such a case may give a wrong idea. Instead, harmonic

mean would be proper in such a case.

(vii) Confusion of correlation and causation: In statistics, several times one has

to examine the relationship between two variables. A close relationship between the

two variables may not establish a cause-and-effect-relationship in the sense that one

variable is the cause and the other is the effect. It should be taken as something that

measures the degree of association rather than as evidence of a causal relationship.
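
Two of the numerical points made in the list above can be checked with a few lines of code. The sketch below uses the figures quoted in the text (the values 800 and 50, and a sample of 100 households out of 1,00,000). The variable names are illustrative, and the code simply computes both averages without claiming that either is the right one for every situation.

```python
from statistics import harmonic_mean, mean

# Point (vi): a series with one very high and one very low value.
values = [800, 50]
arithmetic = mean(values)          # pulled towards the extreme high value
harmonic = harmonic_mean(values)   # stays much closer to the low value

print(f"Arithmetic mean: {arithmetic}")
print(f"Harmonic mean  : {harmonic:.2f}")

# Point (iv): share of the universe covered by the sample.
households_in_city = 100_000       # written as 1,00,000 in the text
households_sampled = 100
sampling_fraction = households_sampled / households_in_city * 100

print(f"Sampling fraction: {sampling_fraction:.1f} per cent")
```

The arithmetic mean of the two values is 425, whereas the harmonic mean is about 94.1, which shows how strongly the choice of average can change the impression given by the same data.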

1.8 SUMMARY

In a summarized manner, ‘Statistics’ means numerical information expressed in

quantitative terms. As a matter of fact, data have no limits as to their reference,

coverage, and scope. At the macro level, these are data on gross national product and

shares of agriculture, manufacturing, and services in GDP (Gross Domestic Product).

At the micro level, individual firms, howsoever small or large, produce extensive

statistics on their operations. The annual reports of companies contain a variety of data

on sales, production, expenditure, inventories, capital employed, and other activities.

These data are often field data, collected by employing scientific survey techniques.

Unless regularly updated, such data are the product of a one-time effort and have limited

use beyond the situation that may have called for their collection. A student knows

statistics more intimately as a subject of study like economics, mathematics, chemistry,

physics, and others. It is a discipline, which scientifically deals with data, and is often

described as the science of data. In dealing with statistics as data, statistics has

developed appropriate methods of collecting, presenting, summarizing, and analysing

data, and thus consists of a body of these methods.

1.9 SELF-TEST QUESTIONS

1. Define Statistics. Explain its types, and importance to trade, commerce and

business.

2. “Statistics is all-pervading”. Elucidate this statement.

3. Write a note on the scope and limitations of Statistics.

4. What are the major limitations of Statistics? Explain with suitable examples.

5. Distinguish between descriptive Statistics and inferential Statistics.

AN OVERVIEW OF CENTRAL TENDENCY

OBJECTIVE: The present lesson imparts understanding of the calculations and main

properties of measures of central tendency, including mean, mode,

median, quartiles, percentiles, etc.

STRUCTURE:

2.1 Introduction
2.2 Arithmetic Mean
2.3 Median
2.4 Mode
2.5 Relationships of the Mean, Median and Mode
2.6 The Best Measure of Central Tendency
2.7 Geometric Mean
2.8 Harmonic Mean
2.9 Quadratic Mean
2.10 Summary
2.11 Self-Test Questions
2.12 Suggested Readings

2.1 INTRODUCTION

The description of statistical data may be quite elaborate or quite brief depending on

two factors: the nature of data and the purpose for which the same data have been

collected. While describing data statistically or verbally, one must ensure that the

description is neither too brief nor too lengthy. The measures of central tendency enable

us to compare two or more distributions pertaining to the same time period or within

the same distribution over time. For example, the average consumption of tea in two

different territories for the same period, or in one territory for two years, say, 2003 and

2004, can be compared by means of an average.

2.2 ARITHMETIC MEAN

Adding all the observations and dividing the sum by the number of observations

gives the arithmetic mean. Suppose we have the following observations:

10, 15, 30, 7, 42, 79 and 83

These are seven observations. Symbolically, the arithmetic mean, also called simply

the mean, is

$\bar{x} = \dfrac{\sum x}{n}$, where $\bar{x}$ is the simple mean and $n$ is the number of observations.

$\bar{x} = \dfrac{10 + 15 + 30 + 7 + 42 + 79 + 83}{7} = \dfrac{266}{7} = 38$

It may be noted that the Greek letter $\mu$ is used to denote the mean of the population

and $N$ to denote the total number of observations in a population. Thus the population

mean $\mu = \sum x / N$. The formula given above is the basic formula that forms the definition

of arithmetic mean and is used in case of ungrouped data where weights are not

involved.
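
The definition above can be checked with a short Python sketch. The variable names are illustrative only; the seven observations are the ones listed in the text.

```python
from statistics import mean

# The seven observations from the text
observations = [10, 15, 30, 7, 42, 79, 83]

# Arithmetic mean from the definition: sum of observations / number of observations
x_bar = sum(observations) / len(observations)

# The standard-library helper should agree with the hand calculation
library_mean = mean(observations)

print(x_bar, library_mean)
```

Both values agree with the hand calculation of 266 / 7.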

2.2.1 UNGROUPED DATA-WEIGHTED AVERAGE

In case of ungrouped data where weights are involved, our approach for calculating

arithmetic mean will be different from the one used earlier.

Example 2.1: Suppose a student has secured the following marks in three tests:

Mid-term test 30

Laboratory 25

Final 20

The simple arithmetic mean will be $\dfrac{30 + 25 + 20}{3} = 25$.

However, this will be wrong if the three tests carry different weights on the basis of

their relative importance. Assume that the weights assigned to the three tests are:

Mid-term test 2 points

Laboratory 3 points

Final 5 points

Solution: On the basis of this information, we can now calculate a weighted mean as

shown below:

Table 2.1: Calculation of a Weighted Mean

Type of Test    Relative Weight (w)    Marks (x)    wx

Mid-term        2                      30           60

Laboratory      3                      25           75

Final           5                      20           100

Total           Σw = 10                             Σwx = 235

$\bar{x} = \dfrac{\sum wx}{\sum w} = \dfrac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3}$

$= \dfrac{60 + 75 + 100}{2 + 3 + 5} = \dfrac{235}{10} = 23.5$ marks

It will be seen that weighted mean gives a more realistic picture than the simple or

unweighted mean.

Example 2.2: An investor is fond of investing in equity shares. During a period of

falling prices in the stock exchange, a stock is sold at Rs 120 per share on one day, Rs

105 on the next and Rs 90 on the third day. The investor has purchased 50 shares on the

first day, 80 shares on the second day and 100 shares on the third day. What average

price per share did the investor pay?

Solution:

Table 2.2: Calculation of Weighted Average Price

Day      Price per Share (Rs) (x)    No. of Shares Purchased (w)    Amount Paid (wx)

1        120                         50                             6,000

2        105                         80                             8,400

3        90                          100                            9,000

Total    -                           230                            23,400

Weighted average $= \dfrac{\sum wx}{\sum w} = \dfrac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3}$

$= \dfrac{6000 + 8400 + 9000}{50 + 80 + 100} = \dfrac{23400}{230}$ = Rs 101.7

Therefore, the investor paid an average price of Rs 101.7 per share.

It will be seen that if merely prices of the shares for the three days (regardless of the

number of shares purchased) were taken into consideration, then the average price

would be

Rs $\dfrac{120 + 105 + 90}{3} = 105$

This is an unweighted or simple average and, as it ignores the quantum of shares

purchased, it fails to give a correct picture. A simple average, it may be noted, is also

a weighted average where weight in each case is the same, that is, only 1. When we use

the term average alone, we always mean that it is an unweighted or simple average.
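
Examples 2.1 and 2.2 can be reproduced with a small helper. This is a minimal sketch: the function name `weighted_mean` is illustrative, not part of any library.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Example 2.1: marks weighted by the relative importance of each test
marks_mean = weighted_mean([30, 25, 20], [2, 3, 5])

# Example 2.2: share prices weighted by the number of shares purchased
price_mean = weighted_mean([120, 105, 90], [50, 80, 100])

# A simple average is the special case in which every weight is 1
simple_price_mean = weighted_mean([120, 105, 90], [1, 1, 1])

print(marks_mean, round(price_mean, 1), simple_price_mean)
```

Setting every weight to 1 returns the simple average, which illustrates the remark that a simple average is a weighted average with equal weights.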

2.2.2 GROUPED DATA-ARITHMETIC MEAN

For grouped data, arithmetic mean may be calculated by applying any of the following

methods:

(i) Direct method, (ii) Short-cut method, (iii) Step-deviation method

In the case of the direct method, the formula $\bar{x} = \sum fm / n$ is used. Here m is the mid-point of

various classes, f is the frequency of each class and n is the total number of

frequencies. The calculation of arithmetic mean by the direct method is shown below.

Example 2.3: The following table gives the marks of 58 students in Statistics.

Calculate the average marks of this group.

Marks No. of Students


0-10 4
10-20 8
20-30 11
30-40 15
40-50 12
50-60 6
60-70 2
Total 58

Solution:

Table 2.3: Calculation of Arithmetic Mean by Direct Method

Marks    Mid-point (m)    No. of Students (f)    fm
0-10     5                4                      20
10-20    15               8                      120
20-30    25               11                     275
30-40    35               15                     525
40-50    45               12                     540
50-60    55               6                      330
60-70    65               2                      130
Total                     n = 58                 Σfm = 1940

$\bar{x} = \dfrac{\sum fm}{n} = \dfrac{1940}{58} = 33.45$ marks, or 33 marks approximately.

It may be noted that the mid-point of each class is taken as a good approximation of the

true mean of the class. This is based on the assumption that the values are distributed

fairly evenly throughout the interval. When large numbers of frequency occur, this

assumption is usually accepted.
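
The direct method of Table 2.3 can be sketched in Python as follows. The class limits and frequencies are those of Example 2.3; the variable names are illustrative.

```python
# Class limits and frequencies from Example 2.3
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [4, 8, 11, 15, 12, 6, 2]

# Each class is represented by its mid-point m
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(freqs)                                          # total frequency
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))   # sum of f * m
grouped_mean = sum_fm / n

print(n, sum_fm, round(grouped_mean, 2))
```

The result rests on the same assumption as the text: values are spread fairly evenly within each class, so the mid-point stands in for the class mean.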

In the case of short-cut method, the concept of arbitrary mean is followed. The

formula for calculation of the arithmetic mean by the short-cut method is given below:

$\bar{x} = A + \dfrac{\sum fd}{n}$

Where A = arbitrary or assumed mean

f = frequency

d = deviation from the arbitrary or assumed mean

When the values are extremely large and/or in fractions, the use of the direct method

would be very cumbersome. In such cases, the short-cut method is preferable. This is

because the calculation work in the short-cut method is considerably reduced

particularly for calculation of the product of values and their respective frequencies.

However, when calculations are not made manually but by a machine calculator, it may

not be necessary to resort to the short-cut method, as the use of the direct method may

not pose any problem.

As can be seen from the formula used in the short-cut method, an arbitrary or assumed

mean is used. The second term in the formula ($\sum fd / n$) is the correction factor for the

difference between the actual mean and the assumed mean. If the assumed mean turns

out to be equal to the actual mean, ($\sum fd / n$) will be zero. The use of the short-cut

method is based on the principle that the total of deviations taken from an actual mean

is equal to zero. As such, the deviations taken from any other figure will depend on how

the assumed mean is related to the actual mean. While one may choose any value as

assumed mean, it would be proper to avoid extreme values, that is, too small or too high

to simplify calculations. A value apparently close to the arithmetic mean should be

chosen.

For the figures given earlier pertaining to marks obtained by 58 students, we calculate

the average marks by using the short-cut method.

Example 2.4:

Table 2.4: Calculation of Arithmetic Mean by Short-cut Method

Marks    Mid-point (m)    f     d      fd
0-10     5                4     -30    -120
10-20    15               8     -20    -160
20-30    25               11    -10    -110
30-40    35               15    0      0
40-50    45               12    10     120
50-60    55               6     20     120
60-70    65               2     30     60
Total                     58           Σfd = -90

It may be noted that we have taken the arbitrary mean as 35 and measured the deviations of the mid-points from it.

In other words, the arbitrary mean has been subtracted from each value of mid-point

and the resultant figure is shown in column d.

$\bar{x} = A + \dfrac{\sum fd}{n} = 35 + \dfrac{-90}{58}$

= 35 - 1.55 = 33.45 or 33 marks approximately.

Now we take up the calculation of arithmetic mean for the same set of data using the

step-deviation method. This is shown in Table 2.5.

Table 2.5: Calculation of Arithmetic Mean by Step-deviation Method

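Marks    Mid-point    f     d      d' = d/10    fd'
0-10     5            4     -30    -3           -12
10-20    15           8     -20    -2           -16
20-30    25           11    -10    -1           -11
30-40    35           15    0      0            0
40-50    45           12    10     1            12
50-60    55           6     20     2            12
60-70    65           2     30     3            6
Total                 58                        Σfd' = -9

$\bar{x} = A + \dfrac{\sum fd'}{n} \times C$, where C is the common factor by which the deviations were divided (here 10, the class width)

$= 35 + \dfrac{-9 \times 10}{58} = 35 - 1.55 = 33.45$ or 33 marks approximately.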

It will be seen that the answer in each of the three cases is the same. The step-deviation

method is the most convenient on account of simplified calculations. It may also be

noted that if we select a different arbitrary mean and recalculate deviations from that

figure, we would get the same answer.
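
The short-cut and step-deviation methods can be compared numerically. The sketch below uses the mid-points and frequencies of Tables 2.4 and 2.5; the function names are illustrative, and the step-deviation version assumes a common class width.

```python
# Mid-points and frequencies from Tables 2.4 and 2.5
midpoints = [5, 15, 25, 35, 45, 55, 65]
freqs = [4, 8, 11, 15, 12, 6, 2]
n = sum(freqs)

def shortcut_mean(assumed_mean):
    """x_bar = A + (sum of f*d) / n, with d = m - A."""
    sum_fd = sum(f * (m - assumed_mean) for f, m in zip(freqs, midpoints))
    return assumed_mean + sum_fd / n

def step_deviation_mean(assumed_mean, class_width):
    """x_bar = A + (sum of f*d') / n * C, with d' = (m - A) / C."""
    sum_fd_step = sum(f * (m - assumed_mean) / class_width
                      for f, m in zip(freqs, midpoints))
    return assumed_mean + sum_fd_step / n * class_width

print(round(shortcut_mean(35), 2))            # assumed mean used in the text
print(round(shortcut_mean(25), 2))            # a different assumed mean
print(round(step_deviation_mean(35, 10), 2))  # step-deviation with C = 10
```

All three calls return the same value, which illustrates that the choice of assumed mean affects only the size of the intermediate figures, not the answer.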

Now that we have learnt how the arithmetic mean can be calculated by using different

methods, we are in a position to handle any problem where calculation of the arithmetic

mean is involved.

Example 2.6: The mean of the following frequency distribution was found to be 1.46.

No. of Accidents No. of Days (frequency)


0 46
1 ?
2 ?
3 25
4 10
5 5
Total 200 days

Calculate the missing frequencies.

Solution:

Here we are given the total number of frequencies and the arithmetic mean. We have to

determine the two frequencies that are missing. Let us assume that the frequency against

1 accident is x and against 2 accidents is y. If we can establish two simultaneous

equations, then we can easily find the values of x and y.

$\text{Mean} = \dfrac{(0 \times 46) + (1 \times x) + (2 \times y) + (3 \times 25) + (4 \times 10) + (5 \times 5)}{200}$

$1.46 = \dfrac{x + 2y + 140}{200}$

$x + 2y + 140 = (200)(1.46) = 292$

$x + 2y = 152$ ... (i)

$x + y = 200 - (46 + 25 + 10 + 5)$

$x + y = 200 - 86$

$x + y = 114$ ... (ii)

Now subtracting equation (ii) from equation (i), we get

$(x + 2y) - (x + y) = 152 - 114$

$y = 38$

Substituting the value of y = 38 in equation (ii) above, x + 38 = 114

Therefore, x = 114 - 38 = 76

Hence, the missing frequencies are:

Against accident 1 : 76

Against accident 2 : 38
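
The two simultaneous equations can also be set up and solved in a few lines of Python. This is a sketch under the assumptions of Example 2.6; the variable names are illustrative.

```python
# Known frequencies (accidents -> days), total days and mean from Example 2.6
known = {0: 46, 3: 25, 4: 10, 5: 5}
total_days = 200
target_mean = 1.46

known_count = sum(known.values())                      # days already accounted for
known_sum = sum(k * f for k, f in known.items())       # accidents already accounted for

# Equation (i):  x + 2y = mean * total - known_sum
# Equation (ii): x + y  = total - known_count
rhs_i = round(target_mean * total_days) - known_sum
rhs_ii = total_days - known_count

y = rhs_i - rhs_ii    # subtracting (ii) from (i) leaves y
x = rhs_ii - y        # back-substitute into (ii)

# Check: the completed distribution should reproduce the stated mean
check_mean = (1 * x + 2 * y + known_sum) / total_days
print(x, y, check_mean)
```

The final line confirms that the recovered frequencies give back the stated mean of 1.46.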

2.2.3 CHARACTERISTICS OF THE ARITHMETIC MEAN

Some of the important characteristics of the arithmetic mean are:

1. The sum of the deviations of the individual items from the arithmetic mean is

always zero. This means $\sum (x - \bar{x}) = 0$, where x is the value of an item and $\bar{x}$ is

the arithmetic mean. Since the sum of the deviations in the positive direction

is equal to the sum of the deviations in the negative direction, the arithmetic

mean is regarded as a measure of central tendency.

2. The sum of the squared deviations of the individual items from the arithmetic

mean is always minimum. In other words, the sum of the squared deviations

taken from any value other than the arithmetic mean will be higher.

3. As the arithmetic mean is based on all the items in a series, a change in the value

of any item will lead to a change in the value of the arithmetic mean.

4. In the case of highly skewed distribution, the arithmetic mean may get distorted

on account of a few items with extreme values. In such a case, it may cease

to be the representative characteristic of the distribution.
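
The first two characteristics can be verified numerically. The sketch below reuses the seven observations from Section 2.2; the helper name `sum_sq_dev` is illustrative.

```python
observations = [10, 15, 30, 7, 42, 79, 83]
x_bar = sum(observations) / len(observations)

# Property 1: deviations from the arithmetic mean sum to zero
sum_dev = sum(x - x_bar for x in observations)

def sum_sq_dev(center):
    """Sum of squared deviations of the observations from a chosen value."""
    return sum((x - center) ** 2 for x in observations)

# Property 2: the sum of squared deviations is smallest at the mean
print(sum_dev)
print(sum_sq_dev(x_bar), sum_sq_dev(x_bar - 1), sum_sq_dev(x_bar + 1))
```

Moving the reference value away from the mean in either direction increases the sum of squared deviations.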

2.3 MEDIAN

Median is defined as the value of the middle item (or the mean of the values of the

two middle items) when the data are arranged in an ascending or descending order of

magnitude. Thus, in an ungrouped frequency distribution if the n values are arranged in

ascending or descending order of magnitude, the median is the middle value if n is odd.

When n is even, the median is the mean of the two middle values.

Suppose we have the following series:

15, 19, 21, 7, 10, 33, 25, 18 and 5

We have to first arrange it in either ascending or descending order. These figures are

arranged in an ascending order as follows:

5, 7, 10, 15, 18, 19, 21, 25, 33

Now as the series consists of an odd number of items, to find out the value of the middle

item, we use the formula $\dfrac{n + 1}{2}$,

where n is the number of items. In this case, n is 9, as such $\dfrac{n + 1}{2} = 5$, that is, the size

of the 5th item is the median. This happens to be 18.

Suppose the series consists of one more item, 23. We may, therefore, have to include

23 in the above series at an appropriate place, that is, between 21 and 25. Thus, the

series is now 5, 7, 10, 15, 18, 19, 21, 23, 25, 33. Applying the above formula, the
median is the size of 5.5th item. Here, we have to take the average of the values of 5th

and 6th item. This means an average of 18 and 19, which gives the median as 18.5.
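
The positional rule can be written as a short function. This is a minimal sketch; `median_by_position` is an illustrative name, and the result is compared with the standard-library `statistics.median`.

```python
import math
from statistics import median

def median_by_position(data):
    """Locate the (n + 1)/2-th item of the sorted data; average two items if needed."""
    if not data:
        raise ValueError("data must not be empty")
    ordered = sorted(data)
    position = (len(ordered) + 1) / 2            # 1-based position of the median
    lower = ordered[math.floor(position) - 1]    # item at or just below the position
    upper = ordered[math.ceil(position) - 1]     # item at or just above the position
    return (lower + upper) / 2

odd_series = [15, 19, 21, 7, 10, 33, 25, 18, 5]
even_series = odd_series + [23]

print(median_by_position(odd_series), median_by_position(even_series))
```

For an odd number of items the two looked-up items coincide, so the same code covers both cases.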

It may be noted that the formula $\dfrac{n + 1}{2}$ itself is not the formula for the median; it

merely indicates the position of the median, namely, the number of items we have to

count until we arrive at the item whose value is the median. In the case of the even

number of items in the series, we identify the two items whose values have to be

averaged to obtain the median. In the case of a grouped series, the median is calculated

by linear interpolation with the help of the following formula:

$M = l_1 + \dfrac{l_2 - l_1}{f}(m - c)$

Where M = the median

$l_1$ = the lower limit of the class in which the median lies

$l_2$ = the upper limit of the class in which the median lies

f = the frequency of the class in which the median lies

m = the middle item, that is, the (n + 1)/2th item, where n stands for the total number of items

c = the cumulative frequency of the class preceding the one in which the median lies

Example 2.7:

Monthly Wages (Rs) No. of Workers


800-1,000 18
1,000-1,200 25
1,200-1,400 30
1,400-1,600 34
1,600-1,800 26
1,800-2,000 10

Total 143

In order to calculate median in this case, we have to first provide cumulative

frequency to the table. Thus, the table with the cumulative frequency is written as:

Monthly Wages (Rs)    Frequency    Cumulative Frequency
800-1,000             18           18
1,000-1,200           25           43
1,200-1,400           30           73
1,400-1,600           34           107
1,600-1,800           26           133
1,800-2,000           10           143

$M = l_1 + \dfrac{l_2 - l_1}{f}(m - c)$

$m = \dfrac{n + 1}{2} = \dfrac{143 + 1}{2} = 72$

It means median lies in the class-interval Rs 1,200 - 1,400.

Now, $M = 1200 + \dfrac{1400 - 1200}{30}(72 - 43)$

$= 1200 + \dfrac{200}{30}(29)$

= Rs 1393.3

At this stage, let us introduce two other concepts viz. quartile and decile. To understand

these, we should first know that the median belongs to a general class of statistical

descriptions called fractiles. A fractile is a value below which lies a given fraction of a

set of data. In the case of the median, this fraction is one-half (1/2). Likewise, the first quartile

has a fraction one-fourth (1/4). The three quartiles Q1, Q2 and Q3 are such that 25 percent

of the data fall below Q1, 25 percent fall between Q1 and Q2, 25 percent fall between

Q2 and Q3 and 25 percent fall above Q3. It will be seen that Q2 is the median. We can

use the above formula for the calculation of quartiles as well. The only difference will

be in the value of m. Let us calculate both Q1 and Q3 in respect of the table given in

Example 2.7.

$Q_1 = l_1 + \dfrac{l_2 - l_1}{f}(m - c)$

Here, m will be $\dfrac{n + 1}{4} = \dfrac{143 + 1}{4} = 36$

$Q_1 = 1000 + \dfrac{1200 - 1000}{25}(36 - 18)$

$= 1000 + \dfrac{200}{25}(18)$

= Rs 1,144

In the case of $Q_3$, m will be $\dfrac{3(n + 1)}{4} = \dfrac{3 \times 144}{4} = 108$

$Q_3 = 1600 + \dfrac{1800 - 1600}{26}(108 - 107)$

$= 1600 + \dfrac{200}{26}(1)$

= Rs 1,607.7 approx.
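
The interpolation formula is the same for the median and the quartiles; only the item number m changes. The sketch below applies it to the wage data of Example 2.7. The function name `grouped_fractile` is illustrative.

```python
def grouped_fractile(classes, freqs, m):
    """Value of the m-th item by linear interpolation: l1 + (l2 - l1) / f * (m - c)."""
    cumulative = 0  # c: cumulative frequency of the classes before the current one
    for (l1, l2), f in zip(classes, freqs):
        if f > 0 and m <= cumulative + f:
            return l1 + (l2 - l1) / f * (m - cumulative)
        cumulative += f
    raise ValueError("m exceeds the total frequency")

# Monthly wages (Rs) and number of workers from Example 2.7
classes = [(800, 1000), (1000, 1200), (1200, 1400),
           (1400, 1600), (1600, 1800), (1800, 2000)]
freqs = [18, 25, 30, 34, 26, 10]
n = sum(freqs)

median_wage = grouped_fractile(classes, freqs, (n + 1) / 2)
q1_wage = grouped_fractile(classes, freqs, (n + 1) / 4)
q3_wage = grouped_fractile(classes, freqs, 3 * (n + 1) / 4)

print(round(median_wage, 1), round(q1_wage, 1), round(q3_wage, 1))
```

Deciles and percentiles follow the same pattern with m set to the appropriate multiple of (n + 1)/10 or (n + 1)/100.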

In the same manner, we can calculate deciles (where the series is divided into 10

parts) and percentiles (where the series is divided into 100 parts). It may be noted that

unlike arithmetic mean, median is not affected at all by extreme values, as it is a

positional average. As such, median is particularly useful when a distribution

happens to be skewed. Another point that goes in favour of median is that it can be

computed when a distribution has open-end classes. Yet, another merit of median is that

when a distribution contains qualitative data, it is the only average that can be used. No

other average is suitable in case of such a distribution. Let us take a couple of examples

to illustrate what has been said in favour of median.

Example 2.8: Calculate the most suitable average for the following data:

Size of the Item Below 50 50-100 100-150 150-200 200 and above

Frequency 15 20 36 40 10

Solution: Since the data have two open-end classes-one in the beginning (below 50) and the

other at the end (200 and above), median should be the right choice as a measure of central

tendency.

Table 2.6: Computation of Median

Size of Item Frequency Cumulative Frequency


Below 50 15 15
50-100 20 35
100-150 36 71
150-200 40 111
200 and above 10 121

Median is the size of the $\dfrac{n + 1}{2}$th item

$= \dfrac{121 + 1}{2}$ = 61st item

Now, the 61st item lies in the 100-150 class

$\text{Median} = l_1 + \dfrac{l_2 - l_1}{f}(m - c)$

$= 100 + \dfrac{150 - 100}{36}(61 - 35)$

= 100 + 36.11 = 136.11 approx.

Example 2.9: The following data give the savings bank accounts balances of nine sample

households selected in a survey. The figures are in rupees.

745 2,000 1,500 68,000 461 549 3750 1800 4795

(a) Find the mean and the median for these data; (b) Do these data contain an outlier? If so,

exclude this value and recalculate the mean and median. Which of these summary measures

has a greater change when an outlier is dropped?; (c) Which of these two summary measures

is more appropriate for this series?

Solution:

(a) Mean = Rs $\dfrac{745 + 2,000 + 1,500 + 68,000 + 461 + 549 + 3,750 + 1,800 + 4,795}{9}$

$= \dfrac{\text{Rs } 83,600}{9}$ = Rs 9,289

Median = size of the $\dfrac{n + 1}{2}$th item

$= \dfrac{9 + 1}{2}$ = 5th item

Arranging the data in an ascending order, we find that the median is Rs 1,800.

(b) An item of Rs 68,000 is excessively high. Such a figure is called an 'outlier'. We

exclude this figure and recalculate both the mean and the median.

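Mean = Rs $\dfrac{83,600 - 68,000}{8}$

= Rs $\dfrac{15,600}{8}$ = Rs 1,950

Median = size of the $\dfrac{n + 1}{2}$th item

$= \dfrac{8 + 1}{2}$ = 4.5th item, that is, the average of the 4th and 5th items

= Rs $\dfrac{1,500 + 1,800}{2}$ = Rs 1,650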

It will be seen that the mean shows a far greater change than the median when the

outlier is dropped from the calculations.

(c) As far as these data are concerned, the median will be a more appropriate measure
than the mean.
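
The effect of the outlier in Example 2.9 can be reproduced with the standard-library `statistics` module. This is a brief sketch; the variable names are illustrative.

```python
from statistics import mean, median

# Savings bank balances (Rs) from Example 2.9
balances = [745, 2000, 1500, 68000, 461, 549, 3750, 1800, 4795]

# Drop the single excessively high value identified in the text
without_outlier = [b for b in balances if b != 68000]

mean_change = mean(balances) - mean(without_outlier)
median_change = median(balances) - median(without_outlier)

print(round(mean(balances)), median(balances))
print(mean(without_outlier), median(without_outlier))
print(round(mean_change), median_change)
```

The mean shifts by several thousand rupees while the median shifts by only Rs 150, which is the sense in which the median is the more robust summary here.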

Further, we can determine the median graphically as follows:

Example 2.10: Suppose we are given the following series:

Class interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70

Frequency 6 12 22 37 17 8 5

We are asked to draw both types of ogive from these data and to determine the

median.

Solution:

First of all, we transform the given data into two cumulative frequency distributions,

one based on ‘less than’ and another on ‘more than’ methods.

Table A: 'Less than' cumulative frequencies

Class limit      Cumulative frequency
Less than 10     6
Less than 20     18
Less than 30     40
Less than 40     77
Less than 50     94
Less than 60     102
Less than 70     107

Table B: 'More than' cumulative frequencies

Class limit      Cumulative frequency
More than 0      107
More than 10     101
More than 20     89
More than 30     67
More than 40     30
More than 50     13
More than 60     5

It may be noted that the point of intersection of the two ogives gives the

value of the median. From this point of intersection A, we draw a straight line to

meet the X-axis at M. Thus, from the point of origin to the point at M gives the value

of the median, which comes to 34, approximately. If we calculate the median by

applying the formula, then the answer comes to 33.8, or 34, approximately. It may be

pointed out that even a single ogive can be used to determine the median. As we have

determined the median graphically, so also we can find the values of quartiles, deciles

or percentiles graphically. For example, to determine Q3 we have to take the size of the {3(n + 1)}/4

= 81st item. From this point on the Y-axis, we can draw a perpendicular to meet the

'less than' ogive from which another straight line is to be drawn to meet the X-axis. This

point will give us the value of the upper quartile. In the same manner, other values of

Q1 and deciles and percentiles can be determined.
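
The graphical construction can be imitated numerically. The sketch below builds the two cumulative series of Tables A and B and finds where the straight-line segments of the two ogives cross. The function name `ogive_intersection` is illustrative, and the ogives are assumed to be drawn as straight lines between class boundaries.

```python
from itertools import accumulate

# Class boundaries and frequencies from Example 2.10
bounds = [0, 10, 20, 30, 40, 50, 60, 70]
freqs = [6, 12, 22, 37, 17, 8, 5]
n = sum(freqs)

# Cumulative counts at each boundary: below it ('less than') and at or above it ('more than')
less_than = [0] + list(accumulate(freqs))
more_than = [n - c for c in less_than]

def ogive_intersection(bounds, less_than, more_than):
    """Return the x-value where the two piecewise-linear ogives cross."""
    for i in range(len(bounds) - 1):
        d0 = less_than[i] - more_than[i]          # gap at the left boundary
        d1 = less_than[i + 1] - more_than[i + 1]  # gap at the right boundary
        if d0 <= 0 <= d1 and d1 != d0:
            t = -d0 / (d1 - d0)                   # fraction of the class width
            return bounds[i] + t * (bounds[i + 1] - bounds[i])
    return None

crossing = ogive_intersection(bounds, less_than, more_than)
print(less_than[1:], more_than[:-1], round(crossing, 2))
```

The crossing lies near 33.6, slightly below the 33.8 given by the formula, because the ogives cross at a cumulative count of n/2 whereas the formula uses the (n + 1)/2th item. Both round to 34, in line with the graphical reading.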

2.3.1 CHARACTERISTICS OF THE MEDIAN

1. Unlike the arithmetic mean, the median can be computed from open-ended

distributions. This is because it is located in the median class-interval, which

would not be an open-ended class.

2. The median can also be determined graphically whereas the arithmetic mean

cannot be ascertained in this manner.

3. As it is not influenced by the extreme values, it is preferred in case of a

distribution having extreme values.

4. In case of the qualitative data where the items are not counted or measured but

are scored or ranked, it is the most appropriate measure of central tendency.

2.4 MODE

The mode is another measure of central tendency. It is the value at the point around

which the items are most heavily concentrated. As an example, consider the following

series: 8,9, 11, 15, 16, 12, 15,3, 7, 15

There are ten observations in the series, wherein the figure 15 occurs the maximum number

of times, namely three. The mode is therefore 15. The series given above is a discrete series; as

such, the variable cannot be in fraction. If the series were continuous, we could say that

the mode is approximately 15, without further computation.

In the case of grouped data, mode is determined by the following formula:

$\text{Mode} = l_1 + \dfrac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$

Where, $l_1$ = the lower value of the class in which the mode lies

$f_1$ = the frequency of the class in which the mode lies

$f_0$ = the frequency of the class preceding the modal class

$f_2$ = the frequency of the class succeeding the modal class

$i$ = the class-interval of the modal class

While applying the above formula, we should ensure that the class-intervals are uniform

throughout. If the class-intervals are not uniform, then they should be made uniform on

the assumption that the frequencies are evenly distributed throughout the class. In the

case of unequal class-intervals, the application of the above formula will give misleading

results.

Example 2.11: Let us take the following frequency distribution:

Class intervals (1) Frequency (2)


30-40 4
40-50 6
50-60 8
60-70 12
70-80 9
80-90 7
90-100 4
We have to calculate the mode in respect of this series.

Solution: We can see from Column (2) of the table that the maximum frequency of

12 lies in the class-interval of 60-70. This suggests that the mode lies in this class-

interval. Applying the formula given earlier, we get:


$\text{Mode} = 60 + \dfrac{12 - 8}{(12 - 8) + (12 - 9)} \times 10$

$= 60 + \dfrac{4}{4 + 3} \times 10$

= 65.7 approx.
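
The grouped-mode formula can be sketched as follows. The function name `grouped_mode` is illustrative; the sketch assumes uniform class-intervals and a single highest frequency that can be found by inspection, as in Example 2.11.

```python
def grouped_mode(classes, freqs):
    """Mode = l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i for the modal class."""
    k = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                   # preceding class, if any
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0      # succeeding class, if any
    l1, l2 = classes[k]
    denominator = (f1 - f0) + (f1 - f2)
    if denominator == 0:
        raise ValueError("mode is ill defined for these frequencies")
    return l1 + (f1 - f0) / denominator * (l2 - l1)

# Frequency distribution from Example 2.11
classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [4, 6, 8, 12, 9, 7, 4]

mode_value = grouped_mode(classes, freqs)
print(round(mode_value, 1))
```

When two classes tie for the highest frequency, `max` simply picks the first, so the result should not be relied on in that case.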

In several cases, just by inspection one can identify the class-interval in which the mode

lies. One should see which the highest frequency is and then identify to which class-

interval this frequency belongs. Having done this, the formula given for calculating the

mode in a grouped frequency distribution can be applied.

At times, it is not possible to identify by inspection the class where the mode lies. In

such cases, it becomes necessary to use the method of grouping. This method consists

of two parts:

(i) Preparation of a grouping table: A grouping table has six columns, the first

column showing the frequencies as given in the problem. Column 2 shows

frequencies grouped in two's, starting from the top. Leaving the first frequency,

column 3 shows frequencies grouped in two's. Column 4 shows the frequencies

grouped in three's, starting from the top. Column 5 leaves

the first frequency and groups the remaining items in three's. Column 6 leaves

the first two frequencies and then groups the remaining in three's. Now, the

maximum total in each column is marked and shown either in a circle or in a

bold type.

(ii) Preparation of an analysis table: After having prepared a grouping table, an

analysis table is prepared. On the left-hand side, provide the first column for

column numbers and on the right-hand side the different possible values of

mode. The highest values marked in the grouping table are shown here by a

bar or by simply entering 1 in the relevant cell corresponding to the values

they represent. The last row of this table will show the number of times a

particular value has occurred in the grouping table. The highest value in the

analysis table will indicate the class-interval in which the mode lies. The

procedure of preparing both the grouping and analysis tables to locate the modal

class will be clear by taking an example.

Example 2.12: The following table gives some frequency data:

Size of Item Frequency

10-20 10
20-30 18
30-40 25
40-50 26
50-60 17
60-70 4

Solution:
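Grouping Table

Size of item    Col 1    Col 2    Col 3    Col 4    Col 5    Col 6
10-20           10       28                53
20-30           18                43                69
30-40           25       51                                  68
40-50           26                43       47
50-60           17       21
60-70           4

Each group total is entered against the first class of the group it covers. The maximum totals in columns 1 to 6 are 26, 51, 43 (twice), 53, 69 and 68 respectively.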

Analysis table

                Size of item
Col. No.    10-20    20-30    30-40    40-50    50-60
1                                      1
2                             1        1
3                    1        1        1        1
4           1        1        1
5                    1        1        1
6                             1        1        1
Total       1        3        5        5        2
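
The grouping and analysis tables can be generated mechanically. The sketch below is one way to do so, assuming the six-column layout described above (single frequencies, pairs from the top, pairs leaving the first, threes from the top, threes leaving the first, threes leaving the first two). The function name `grouping_analysis` is illustrative.

```python
def grouping_analysis(freqs):
    """Count, for each class, how often it belongs to a maximal group across six columns."""
    n = len(freqs)
    counts = [0] * n
    # (group size, number of leading frequencies left out) for columns 1 to 6
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, offset in columns:
        groups = [(start, sum(freqs[start:start + size]))
                  for start in range(offset, n - size + 1, size)]
        if not groups:
            continue
        best = max(total for _, total in groups)
        for start, total in groups:
            if total == best:                       # every tied maximum is marked
                for k in range(start, start + size):
                    counts[k] += 1
    return counts

# Frequencies for classes 10-20, 20-30, ..., 60-70 from Example 2.12
freqs = [10, 18, 25, 26, 17, 4]
totals = grouping_analysis(freqs)
print(totals)
```

The counts for the first five classes reproduce the Total row of the analysis table, and the tie between the classes 30-40 and 40-50 is what marks the series as bi-modal.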

This is a bi-modal series as is evident from the analysis table, which shows that the two

classes 30-40 and 40-50 have occurred five times each in the grouping. In such a

situation, we may have to determine mode indirectly by applying the following formula:

Mode = 3 median - 2 mean

Median = Size of the (n + 1)/2th item, that is, 101/2 = 50.5th item. This lies in the class 30-40.

Applying the formula for the median, as given earlier, we get

$\text{Median} = 30 + \dfrac{40 - 30}{25}(50.5 - 28)$

= 30 + 9 = 39

Now, arithmetic mean is to be calculated. This is shown in the following table.

Class-interval    Frequency (f)    Mid-point    d      d' = d/10    fd'
10-20             10               15           -20    -2           -20
20-30             18               25           -10    -1           -18
30-40             25               35           0      0            0
40-50             26               45           10     1            26
50-60             17               55           20     2            34
60-70             4                65           30     3            12
Total             100                                               Σfd' = 34

Deviations are taken from the arbitrary mean A = 35.

$\text{Mean} = A + \dfrac{\sum fd'}{n} \times i = 35 + \dfrac{34}{100} \times 10$

= 38.4

Mode = 3 median - 2 mean

= (3 x 39) - (2 x 38.4)

= 117 - 76.8

= 40.2

This formula, Mode = 3 Median-2 Mean, is an empirical formula only. And it can

give only approximate results. As such, its frequent use should be avoided. However,

when mode is ill defined or the series is bimodal (as is the case in the present

example) it may be used.
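
The empirical relation can be checked against the figures of Example 2.12. This is a minimal sketch; `empirical_mode` is an illustrative name, and the result is only an approximation, as the text cautions.

```python
def empirical_mode(mean_value, median_value):
    """Empirical approximation: mode = 3 * median - 2 * mean."""
    return 3 * median_value - 2 * mean_value

# Median by interpolation in the class 30-40 (c = 28, f = 25, m = 50.5)
median_value = 30 + (40 - 30) / 25 * (50.5 - 28)

# Mean by the step-deviation method (A = 35, sum of fd' = 34, n = 100, i = 10)
mean_value = 35 + 34 / 100 * 10

mode_value = empirical_mode(mean_value, median_value)
print(round(median_value, 1), round(mean_value, 1), round(mode_value, 1))
```

The approximate mode falls on the boundary region between the two tied classes, 30-40 and 40-50, which is consistent with the bi-modal pattern found by grouping.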

2.5 RELATIONSHIPS OF THE MEAN, MEDIAN AND MODE

Having discussed mean, median and mode, we now turn to the relationship amongst

these three measures of central tendency. We shall discuss the relationship assuming

that there is a unimodal frequency distribution.

(i) When a distribution is symmetrical, the mean, median and mode are the same,

as is shown in the following figure.

In case a distribution is skewed to the right, then mean > median > mode.

Generally, income distribution is skewed to the right where a large number of families have relatively

low income and a small number of families have extremely high income. In such

a case, the mean is pulled up by the extreme high incomes and the relation

among these three measures is as shown in Fig. 6.3. Here, we find that mean >

median > mode.

(ii) When a distribution is skewed to the left, then mode > median > mean.

This is because here the mean is pulled down below the median by

extremely low values. This is shown in the figure.

(iii) Given the mean and median of a unimodal distribution, we can determine

whether it is skewed to the right or left. When mean > median, it is skewed to the

right; when median > mean, it is skewed to the left. It may be noted that, in such

moderately skewed unimodal distributions, the median typically lies between the mean and the mode.

2.6 THE BEST MEASURE OF CENTRAL TENDENCY

At this stage, one may ask as to which of these three measures of central tendency is the

best. There is no simple answer to this question. It is because these three measures

are based upon different concepts. The arithmetic mean is the sum of the values divided

by the total number of observations in the series. The median is the value of the middle

observation that divides the series into two equal parts. Mode is the value around which

the observations tend to concentrate. As such, the use of a particular measure will

largely depend on the purpose of the study and the nature of the data. For example,

when we are interested in knowing the consumers' preferences for different brands of

television sets or different kinds of advertising, the choice should go in favour of mode.

The use of mean and median would not be proper. However, the median can sometimes

be used in the case of qualitative data when such data can be arranged in an ascending

or descending order. Let us take another example. Suppose we invite applications for a

certain vacancy in our company. A large number of candidates apply for that post. We

are now interested to know as to which age or age group has the largest concentration

of applicants. Here, obviously the mode will be the most appropriate choice. The

arithmetic mean may not be appropriate as it may

be influenced by some extreme values. However, the mean happens to be the most

commonly used measure of central tendency as will be evident from the discussion in

the subsequent chapters.

2.7 GEOMETRIC MEAN

Apart from the three measures of central tendency as discussed above, there are two

other means that are used sometimes in business and economics. These are the

geometric mean and the harmonic mean. The geometric mean is more important than

the harmonic mean. We discuss below both these means. First, we take up the geometric

mean. Geometric mean is defined as the nth root of the product of n observations of a

distribution.

Symbolically, $GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$. If we have only two observations, say, 4 and

16, then $GM = \sqrt{4 \times 16} = \sqrt{64} = 8$. Similarly, if there are three observations, then we

have to calculate the cube root of the product of these three observations; and so on.

When the number of items is large, it becomes extremely difficult to multiply the

numbers and to calculate the root. To simplify calculations, logarithms are used.

Example 2.13: If we have to find out the geometric mean of 2, 4 and 8, then we find

$\log GM = \dfrac{\sum \log x_i}{n}$

$= \dfrac{\log 2 + \log 4 + \log 8}{3}$

$= \dfrac{0.3010 + 0.6021 + 0.9031}{3}$

$= \dfrac{1.8062}{3} = 0.60206$

GM = Antilog 0.60206

= 4
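
The logarithmic route to the geometric mean can be sketched in Python. The function name `geometric_mean` is illustrative here (it is defined locally, not imported), and base-10 logarithms are used to mirror the worked example.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the base-10 logarithms of the values."""
    if not values:
        raise ValueError("values must not be empty")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    log_mean = sum(math.log10(v) for v in values) / len(values)
    return 10 ** log_mean

gm_three = geometric_mean([2, 4, 8])   # Example 2.13
gm_two = geometric_mean([4, 16])       # the two-observation case in the text

print(round(gm_three, 4), round(gm_two, 4))
```

Working with logarithms avoids forming a very large product when the number of items is large, which is the practical reason given in the text.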

When the data are given in the form of a frequency distribution, then the geometric

mean can be obtained by the formula:

$\log GM = \dfrac{f_1 \log x_1 + f_2 \log x_2 + \dots + f_n \log x_n}{f_1 + f_2 + \dots + f_n} = \dfrac{\sum f \log x}{\sum f}$

Then, $GM = \text{Antilog}\left(\dfrac{\sum f \log x}{\sum f}\right)$

The geometric mean is most suitable in the following three cases:

1. Averaging rates of change.

2. The compound interest formula.

3. Discounting, capitalization.

Example 2.14: A person has invested Rs 5,000 in the stock market. At the end of the

first year the amount has grown to Rs 6,250; he has had a 25 percent profit. If at the end

of the second year his principal has grown to Rs 8,750, the rate of increase is 40 percent

for the year. What is the average rate of increase of his investment during the two years?

Solution:

$GM = \sqrt{1.25 \times 1.40} = \sqrt{1.75} = 1.323$

The average rate of increase in the value of investment is therefore 1.323 - 1 = 0.323,

which if multiplied by 100, gives the rate of increase as 32.3 percent.

Example 2.15: We can also derive a compound interest formula from the above set of

data. This is shown below:

Solution: Now, 1.25 x 1.40 = 1.75. This can be written as $1.75 = (1 + 0.323)^2$.

Let $P_2 = 1.75$, $P_0 = 1$, and $r = 0.323$, then the above equation can be written as

$P_2 = (1 + r)^2$ or $P_2 = P_0 (1 + r)^2$.

Where P2 is the value of investment at the end of the second year, P0 is the initial

investment and r is the rate of increase in the two years. This, in fact, is the familiar

compound interest formula. This can be written in a generalised form as Pn = P0(1 +

r)n. In our case Po is Rs 5,000 and the rate of increase in investment is 32.3 percent. Let

us apply this formula to ascertain the value of Pn, that is, investment at the end of the

second year.

Pn = 5,000 (1 + 0.323)2

= 5,000 x 1.75

= Rs 8,750

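It may be noted that in the above example, if the arithmetic mean is used, the resultant figure will be wrong. In this case, the average rate for the two years is $\frac{25 + 40}{2}$ = 32.5 percent per year. Applying this rate for two years without compounding (a total increase of 65 percent), we get $P_n = \frac{165}{100}$ × 5,000 = Rs 8,250.

This is obviously wrong, as the figure should have been Rs 8,750. Compounding at 32.5 percent does not help either: 5,000 × $(1.325)^2$ is approximately Rs 8,778.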
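The comparison in Examples 2.14 and 2.15 can be checked numerically. The following is a minimal sketch, not part of the original notes; the function name `average_growth_rate` is our own choice for illustration. It shows that compounding at the geometric-mean rate reproduces Rs 8,750, while compounding at the arithmetic-mean rate overshoots.

```python
import math

def average_growth_rate(factors):
    """Geometric mean of growth factors minus 1 (1.25 means +25 percent)."""
    product = math.prod(factors)
    return product ** (1 / len(factors)) - 1

principal = 5000
factors = [1.25, 1.40]          # yearly growth factors from Example 2.14
years = len(factors)

gm_rate = average_growth_rate(factors)               # about 0.323
am_rate = sum(f - 1 for f in factors) / years        # 0.325

final_gm = principal * (1 + gm_rate) ** years        # matches the true value
final_am = principal * (1 + am_rate) ** years        # overshoots

print(round(gm_rate, 4), round(final_gm, 2))
print(round(am_rate, 4), round(final_am, 2))
```

The geometric mean is the only constant yearly rate that leads to the same final amount as the actual sequence of yearly rates.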

Example 2.16: An economy has grown at 5 percent in the first year, 6 percent in the

second year, 4.5 percent in the third year, 3 percent in the fourth year and 7.5 percent

in the fifth year. What is the average rate of growth of the economy during the five

years?

Solution:

| Year | Rate of Growth (percent) | Value at the end of the Year x (in Rs) | Log x |
|---|---|---|---|
| 1 | 5 | 105 | 2.02119 |
| 2 | 6 | 106 | 2.02531 |
| 3 | 4.5 | 104.5 | 2.01912 |
| 4 | 3 | 103 | 2.01284 |
| 5 | 7.5 | 107.5 | 2.03141 |
| | | Σ log x = | 10.10987 |

$$GM = \text{Antilog}\left(\frac{\sum \log x}{n}\right) = \text{Antilog}\left(\frac{10.10987}{5}\right) = \text{Antilog } 2.021974 = 105.19$$

Hence, the average rate of growth during the five-year period is 105.19 - 100 = 5.19

percent per annum. In case of a simple arithmetic average, the corresponding rate of

growth would have been 5.2 percent per annum.
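The logarithmic method used in Examples 2.13 and 2.16 can be sketched in a few lines. This is an illustrative sketch only; the function name `geometric_mean_via_logs` is an assumption of ours, and base-10 logarithms are used to mirror the log tables in the text.

```python
import math

def geometric_mean_via_logs(values):
    """Antilog of the arithmetic mean of the base-10 logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

# Year-end values per Rs 100 at the start of each year (Example 2.16)
year_end_values = [105, 106, 104.5, 103, 107.5]
gm = geometric_mean_via_logs(year_end_values)
average_growth_percent = gm - 100

print(round(gm, 2), round(average_growth_percent, 2))
print(round(geometric_mean_via_logs([2, 4, 8]), 4))   # Example 2.13
```

The guard against zero or negative values reflects the limitation noted later in the text: the logarithm is undefined for such items, so the geometric mean cannot be calculated.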

2.7.1 DISCOUNTING

The compound interest formula given above was $P_n = P_0 (1 + r)^n$. This can be written as

$$P_0 = \frac{P_n}{(1 + r)^n}$$

This may be expressed as follows:

If the future income is $P_n$ rupees and the present rate of interest is 100r percent, then the present value of $P_n$ rupees will be $P_0$ rupees. For example, if we have a machine that has a life of 20 years and is expected to yield a net income of Rs 50,000 per year, and at the end of 20 years it will be obsolete and cannot be used, then the machine's present value is

$$\frac{50,000}{(1 + r)} + \frac{50,000}{(1 + r)^2} + \frac{50,000}{(1 + r)^3} + \dots + \frac{50,000}{(1 + r)^{20}}$$

This process of ascertaining the present value of future income by using the interest rate

is known as discounting.
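The discounting of the machine's income stream can be sketched as follows. The text does not state an interest rate, so the 10 percent used here is purely an assumed figure for illustration, and `present_value` is a hypothetical helper name.

```python
def present_value(cash_flows, rate):
    """Discount year-end cash flows (year 1, 2, ...) back to today."""
    return sum(cf / (1 + rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

# Machine yielding Rs 50,000 at the end of each of 20 years.
# The 10 percent rate is an assumption, not a figure from the text.
machine_income = [50_000] * 20
pv = present_value(machine_income, rate=0.10)

print(round(pv, 2))
```

Each term of the sum is one application of $P_0 = P_n / (1 + r)^n$, so a higher interest rate always gives a lower present value for the same income stream.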

In conclusion, it may be said that when there are extreme values in a series, geometric

mean should be used as it is much less affected by such values. The arithmetic mean

in such cases will give misleading results.


Before we close our discussion on the geometric mean, we should be aware of its

advantages and limitations.

2.7.2 ADVANTAGES OF G. M.

1. Geometric mean is based on each and every observation in the data set.

2. It is rigidly defined.

3. It is more suitable while averaging ratios and percentages as also in calculating

growth rates.

4. As compared to the arithmetic mean, it gives more weight to small values and

less weight to large values. As a result of this characteristic of the geometric

mean, it is generally less than the arithmetic mean. At times it may be equal to

the arithmetic mean.

5. It is capable of algebraic manipulation. If the geometric means of two or more series are known along with their respective frequencies, then a combined geometric mean can be calculated by using logarithms.

2.7.3 LIMITATIONS OF G.M.

1. As compared to the arithmetic mean, geometric mean is difficult to

understand.

2. Both computation of the geometric mean and its interpretation are rather

difficult.

3. When there is a negative item in a series or one or more observations have

zero value, then the geometric mean cannot be calculated.

In view of the limitations mentioned above, the geometric mean is not frequently

used.

2.8 HARMONIC MEAN

The harmonic mean is defined as the reciprocal of the arithmetic mean of the

reciprocals of individual observations. Symbolically,

$$HM = \frac{n}{1/x_1 + 1/x_2 + 1/x_3 + \dots + 1/x_n} = \text{Reciprocal of } \frac{\sum (1/x_i)}{n}$$

The calculation of harmonic mean becomes very tedious when a distribution has a large number of observations. In the case of grouped data, the harmonic mean is calculated by using the following formula:

$$HM = \text{Reciprocal of } \left[\frac{1}{n} \sum_{i} f_i \left(\frac{1}{x_i}\right)\right]$$

or

$$HM = \frac{n}{\sum_{i} f_i \left(\frac{1}{x_i}\right)}$$

Where n is the total number of observations.

Here, each reciprocal of the original figure is weighted by the corresponding frequency

(f).

The main advantage of the harmonic mean is that it is based on all observations in a

distribution and is amenable to further algebraic treatment. When we desire to give

greater weight to smaller observations and less weight to the larger observations, then

the use of harmonic mean will be more suitable. As against these advantages, there

are certain limitations of the harmonic mean. First, it is difficult to understand as well

as difficult to compute. Second, it cannot be calculated if any of the observations is zero

or negative. Third, it is only a summary figure, which may not be an actual observation

in the distribution.

It is worth noting that the harmonic mean is never higher than the geometric mean, which in turn is never higher than the arithmetic mean; the three are equal only when all the observations are identical. This is because the harmonic mean assigns lesser importance to higher values. Since the harmonic mean is based on reciprocals, and the reciprocals of higher values are lower than those of lower values, it is a lower average than the arithmetic mean as well as the geometric mean.

Example 2.17: Suppose we have three observations 4, 8 and 16. We are required to calculate the harmonic mean. Reciprocals of 4, 8 and 16 are $\frac{1}{4}$, $\frac{1}{8}$ and $\frac{1}{16}$ respectively.

$$HM = \frac{n}{1/x_1 + 1/x_2 + 1/x_3} = \frac{3}{1/4 + 1/8 + 1/16} = \frac{3}{0.25 + 0.125 + 0.0625} = 6.857 \text{ approx.}$$

Example 2.18: Consider the following series:

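| Class-interval | 2-4 | 4-6 | 6-8 | 8-10 |
|---|---|---|---|---|
| Frequency | 20 | 40 | 30 | 10 |

Solution:

Let us set up the table as follows:

| Class-interval | Mid-value (x) | Frequency (f) | Reciprocal of mid-value (1/x) | f × 1/x |
|---|---|---|---|---|
| 2-4 | 3 | 20 | 0.3333 | 6.6660 |
| 4-6 | 5 | 40 | 0.2000 | 8.0000 |
| 6-8 | 7 | 30 | 0.1429 | 4.2870 |
| 8-10 | 9 | 10 | 0.1111 | 1.1111 |
| Total | | 100 | | 20.0641 |

$$HM = \frac{n}{\sum f_i \left(\frac{1}{x_i}\right)} = \frac{100}{20.0641} = 4.984 \text{ approx.}$$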
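Both harmonic mean formulas, for individual observations and for grouped data, can be covered by one small function. This is a sketch under our own naming (`harmonic_mean` with an optional `freqs` argument is an assumption for illustration); for grouped data the class mid-values stand in for the observations, as in Example 2.18.

```python
def harmonic_mean(values, freqs=None):
    """Total frequency divided by the frequency-weighted sum of reciprocals."""
    if freqs is None:
        freqs = [1] * len(values)          # ungrouped data: each value once
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs strictly positive values")
    total = sum(freqs)
    weighted_reciprocals = sum(f / v for v, f in zip(values, freqs))
    return total / weighted_reciprocals

print(round(harmonic_mean([4, 8, 16]), 3))                            # Example 2.17
print(round(harmonic_mean([3, 5, 7, 9], freqs=[20, 40, 30, 10]), 3))  # Example 2.18
```

Small differences in the last decimal place compared with the hand calculation come from the rounding of the reciprocals in the table.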

Example 2.19: In a small company, two typists are employed. Typist A types one page

in ten minutes while typist B takes twenty minutes for the same. (i) Both are asked to

type 10 pages. What is the average time taken for typing one page? (ii) Both are asked

to type for one hour. What is the average time taken by them for typing one page?

Solution: Here Q-(i) is on arithmetic mean while Q-(ii) is on harmonic mean.

(i) $M = \frac{(10 \times 10) + (20 \times 10) \text{ minutes}}{10 \times 2 \text{ pages}} = \frac{300}{20}$ = 15 minutes

(ii) $HM = \frac{60 \times 2 \text{ minutes}}{60/10 + 60/20 \text{ pages}} = \frac{120}{\frac{120 + 60}{20}} = \frac{40}{3}$ = 13 minutes and 20 seconds.

Example 2.20: It takes ship A 10 days to cross the Pacific Ocean; ship B takes 15

days and ship C takes 20 days. (i) What is the average number of days taken by a ship

to cross the Pacific Ocean? (ii) What is the average number of days taken by a cargo

to cross the Pacific Ocean when the ships are hired for 60 days?

Solution: Here again Q-(i) pertains to simple arithmetic mean while Q-(ii) is concerned

with the harmonic mean.

(i) $M = \frac{10 + 15 + 20}{3}$ = 15 days

(ii) $HM = \frac{60 \times 3 \text{ days}}{60/10 + 60/15 + 60/20} = \frac{180}{\frac{360 + 240 + 180}{60}}$ = 13.8 days approx.
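Examples 2.19 and 2.20 turn on one question: is the amount of work fixed, or the amount of time? A minimal sketch of that choice follows; the function name and its `equal` parameter are hypothetical, introduced here only to make the distinction explicit.

```python
def average_time_per_unit(times_per_unit, equal="work"):
    """Average time per unit of output across several workers.

    equal="work": each worker completes the same number of units,
                  so the arithmetic mean applies.
    equal="time": each worker is given the same amount of time,
                  so the harmonic mean applies.
    """
    n = len(times_per_unit)
    if equal == "work":
        return sum(times_per_unit) / n
    if equal == "time":
        return n / sum(1 / t for t in times_per_unit)
    raise ValueError("equal must be 'work' or 'time'")

typists = [10, 20]       # minutes per page (Example 2.19)
ships = [10, 15, 20]     # days per crossing (Example 2.20)

print(average_time_per_unit(typists, "work"), average_time_per_unit(typists, "time"))
print(average_time_per_unit(ships, "work"), round(average_time_per_unit(ships, "time"), 1))
```

When time is fixed, the faster worker produces more units, so the faster rate carries more weight and the average time per unit falls below the arithmetic mean.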

2.9 QUADRATIC MEAN

We have seen earlier that the geometric mean is the antilogarithm of the arithmetic

mean of the logarithms, and the harmonic mean is the reciprocal of the arithmetic mean

of the reciprocals. Likewise, the quadratic mean (Q) is the square root of the arithmetic

mean of the squares. Symbolically,

$$Q = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}}$$

Instead of using original values, the quadratic mean can be used while averaging

deviations when the standard deviation is to be calculated. This will be used in the

next chapter on dispersion.

2.9.1 Relative Position of Different Means

The relative position of different means will always be:

$Q > \bar{x} > G > H$, provided that all the individual observations in a series are positive and not all of them are the same.
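The ordering of the four means can be verified on any set of positive observations. The sketch below uses the observations 4, 8 and 16 from Example 2.17; the function name `four_means` is our own.

```python
import math

def four_means(values):
    """Return (quadratic, arithmetic, geometric, harmonic) means."""
    n = len(values)
    quadratic = math.sqrt(sum(v * v for v in values) / n)
    arithmetic = sum(values) / n
    geometric = math.exp(sum(math.log(v) for v in values) / n)
    harmonic = n / sum(1 / v for v in values)
    return quadratic, arithmetic, geometric, harmonic

q, am, gm, hm = four_means([4, 8, 16])
print(round(q, 3), round(am, 3), round(gm, 3), round(hm, 3))
print(q > am > gm > hm)
```

If every observation is the same, all four means coincide with that common value, which is why the strict inequality needs the proviso stated above.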

2.9.2 Composite Average or Average of Means

Sometimes, we may have to calculate an average of several averages. In such cases, we

should use the same method of averaging that was employed in calculating the original

averages. Thus, we should calculate the arithmetic mean of several values of $\bar{x}$, the

geometric mean of several values of GM, and the harmonic mean of several values of

HM. It will be wrong if we use some other average in averaging of means.

2.10 SUMMARY

The most important objective of statistical analysis is to get one single value that describes the characteristics of the entire mass of cumbersome data. Such a value, known as the central value or average, serves this purpose.

2.11 SELF-TEST QUESTIONS
1. What are the desiderata (requirements) of a good average? Compare the mean, the median and the mode in the light of these desiderata. Why are averages called measures of central tendency?

2. "Every average has its own peculiar characteristics. It is difficult to say which

average is the best." Explain with examples.

3. What do you understand by 'Central Tendency'? Under what conditions is the

median more suitable than other measures of central tendency?

4. The average monthly salary paid to all employees in a company was Rs 8,000.

The average monthly salaries paid to male and female employees of the

company were Rs 10,600 and Rs 7,500 respectively. Find out the percentages

of males and females employed by the company.

5. Calculate the arithmetic mean from the following data:

Class 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89

Frequency 2 4 9 11 12 6 4 2

6. Calculate the mean, median and mode from the following data:

Height in Inches Number of Persons

62-63 2
63-64 6
64-65 14
65-66 16
66-67 8
67-68 3
68-69 1
Total 50

7. A number of particular articles have been classified according to their weights.

After drying for two weeks, the same articles have again been weighed and

similarly classified. It is known that the median weight in the first weighing was 20.83 gm while in the second weighing it was 17.35 gm. Some frequencies, a and b in the first weighing and x and y in the second, are missing. It is known that a = 1/3 x and b = 1/2 y. Find out the values of the missing frequencies.

| Class | First Weighing (frequency) | Second Weighing (frequency) |
|---|---|---|
| 0-5 | a | x |
| 5-10 | b | y |
| 10-15 | 11 | 40 |
| 15-20 | 52 | 50 |
| 20-25 | 75 | 30 |
| 25-30 | 22 | 28 |

8. Cities A, B and C are equidistant from each other. A motorist travels from A to B at 30 km/h, from B to C at 40 km/h and from C to A at 50 km/h. Determine his average speed for the entire trip.

9. Calculate the harmonic mean from the following data:

| Class-Interval | 2-4 | 4-6 | 6-8 | 8-10 |
|---|---|---|---|---|
| Frequency | 20 | 40 | 30 | 10 |

10. A vehicle, when climbing up a gradient, consumes petrol at 8 km per litre. While coming down it runs 12 km per litre. Find its average consumption for to and fro travel between two places situated at the two ends of a 25 km long gradient.

DISPERSION AND SKEWNESS

3.1 INTRODUCTION

In the previous chapter, we have explained the measures of central tendency. It may

be noted that these measures do not indicate the extent of dispersion or variability in a

distribution. The dispersion or variability provides us one more step in increasing our

understanding of the pattern of the data. Further, a high degree of uniformity (i.e. low

degree of dispersion) is a desirable quality. If a business faces a high degree of variability in its raw material, it may not find mass production economical.

Suppose an investor is looking for a suitable equity share for investment. While

examining the movement of share prices, he should avoid those shares that are highly fluctuating, having sometimes very high prices and at other times going very low.

Such extreme fluctuations mean that there is a high risk in the investment in shares. The

investor should, therefore, prefer those shares where risk is not so high.

3.2 MEANING AND DEFINITIONS OF DISPERSION

The various measures of central value give us one single figure that represents the

entire data. But the average alone cannot adequately describe a set of observations,

unless all the observations are the same. It is necessary to describe the variability or

dispersion of the observations. In two or more distributions the central value may be

the same but still there can be wide disparities in the formation of distribution.

Measures of dispersion help us in studying this important characteristic of a

distribution.

Some important definitions of dispersion are given below:

1. "Dispersion is the measure of the variation of the items." -A.L. Bowley


2. "The degree to which numerical data tend to spread about an average value is called the variation or dispersion of the data." -Spiegel

3. "Dispersion or spread is the degree of the scatter or variation of the variable about a central value." -Brooks & Dick

4. "The measurement of the scatterness of the mass of figures in a series about an average is called measure of variation or dispersion." -Simpson & Kafka

It is clear from above that dispersion (also known as scatter, spread or variation)

measures the extent to which the items vary from some central value. Since measures

of dispersion give an average of the differences of various items from an average,

they are also called averages of the second order. An average is more meaningful

when it is examined in the light of dispersion. For example, if the average wage of the

workers of factory A is Rs. 3885 and that of factory B Rs. 3900, we cannot necessarily

conclude that the workers of factory B are better off because in factory B there may be

much greater dispersion in the distribution of wages. The study of dispersion is of great

significance in practice as could well be appreciated from the following example:

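| | Series A | Series B | Series C |
|---|---|---|---|
| | 100 | 100 | 1 |
| | 100 | 105 | 489 |
| | 100 | 102 | 2 |
| | 100 | 103 | 3 |
| | 100 | 90 | 5 |
| Total | 500 | 500 | 500 |
| $\bar{x}$ | 100 | 100 | 100 |
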
Since the arithmetic mean is the same in all three series, one is likely to conclude that these series are alike in nature. But a close examination shall reveal that the distributions differ widely from one another. In series A, each and every item is perfectly represented by the arithmetic mean, or in other words none of the items of series A deviates from the arithmetic mean, and hence there is no dispersion. In series B, only one item is perfectly represented by the arithmetic mean and the other items vary, but the variation is very small as compared to series C. In series C, not a single item is represented by the arithmetic mean and the items vary widely from one another. In series C, dispersion is much greater compared to series B. Similarly, we may have two groups of labourers with the same mean salary and yet their distributions may differ widely. The mean salary may not be so important a characteristic as the variation of the items from the mean. To the student of social affairs the mean income is not so vitally important as to know how this income is distributed. Are a large number receiving the mean income, or are there a few with enormous incomes and millions with incomes far below the mean?
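The contrast between Series A, B and C can be made concrete with a short sketch. It uses two simple measures of spread, the range and the average absolute distance from the mean, both of which are treated formally later in this chapter; the helper names are our own.

```python
series = {
    "A": [100, 100, 100, 100, 100],
    "B": [100, 105, 102, 103, 90],
    "C": [1, 489, 2, 3, 5],
}

def mean(values):
    return sum(values) / len(values)

def mean_absolute_deviation(values):
    """Average absolute distance of the items from their mean."""
    m = mean(values)
    return sum(abs(v - m) for v in values) / len(values)

for name, values in series.items():
    spread = max(values) - min(values)
    print(name, mean(values), spread, mean_absolute_deviation(values))
```

All three series report the same mean, yet the spread is zero for A, small for B and very large for C, which is exactly the information the average alone hides.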

[Box 3.1: frequency curves, diagrams (a), (b) and (c)]

The three figures given in Box 3.1 represent frequency distributions with some of these characteristics. The two curves in diagram (a) represent two distributions with the same mean $\bar{X}$ but with different dispersions. The two curves in (b) represent two distributions with the same dispersion but with unequal means $\bar{X}_1$ and $\bar{X}_2$; (c) represents two distributions with unequal dispersion. The measures of central tendency are, therefore, insufficient. They must be supported and supplemented with other measures.

In the present chapter, we shall be especially concerned with the measures of variability

or spread or dispersion. A measure of variation or dispersion is one that measures the

extent to which there are differences between individual observation and some central

or average value. In measuring variation we shall be interested in the amount of the

variation or its degree but not in the direction. For example, a measure of 6 inches below

the mean has just as much dispersion as a measure of six inches above the mean.

The literal meaning of dispersion is ‘scatteredness’. An average, or measure of central tendency, gives us an idea of the concentration of the observations about the central part of the distribution. If we know the average alone, we cannot form a complete idea about the distribution. But with the help of dispersion, we have an idea about the homogeneity or heterogeneity of the distribution.

3.3 SIGNIFICANCE AND PROPERTIES OF MEASURING VARIATION

Measures of variation are needed for four basic purposes:

1. Measures of variation point out as to how far an average is representative of

the mass. When dispersion is small, the average is a typical value in the sense

that it closely represents the individual value and it is reliable in the sense that

it is a good estimate of the average in the corresponding universe. On the other

hand, when dispersion is large, the average is not so typical, and unless the

sample is very large, the average may be quite unreliable.

2. Another purpose of measuring dispersion is to determine nature and cause of

variation in order to control the variation itself. In matters of health variations

in body temperature, pulse beat and blood pressure are the basic guides to

diagnosis. Prescribed treatment is designed to control their variation. In industrial production, efficient operation requires control of quality variation, the causes of which are sought through inspection. Inspection is basic to the control of the causes of variation. In the social sciences, a special problem requiring the measurement of variability is the measurement of "inequality" in the distribution of income or wealth.

3. Measures of dispersion enable a comparison to be made of two or more series

with regard to their variability. The study of variation may also be looked upon as a means of determining uniformity or consistency. A high degree of

variation would mean little uniformity or consistency whereas a low degree of

variation would mean great uniformity or consistency.

4. Many powerful analytical tools in statistics, such as correlation analysis, the testing of hypotheses, analysis of variance, statistical quality control and regression analysis, are based on measures of variation of one kind or another.

A good measure of dispersion should possess the following properties:

1. It should be simple to understand.

2. It should be easy to compute.

3. It should be rigidly defined.

4. It should be based on each and every item of the distribution.

5. It should be amenable to further algebraic treatment.

6. It should have sampling stability.

7. Extreme items should not unduly affect it.

3.4 MEASURES OF DISPERSION

There are five measures of dispersion: Range, Inter-quartile range or Quartile

Deviation, Mean deviation, Standard Deviation, and Lorenz curve. Among them, the

first four are mathematical methods and the last one is the graphical method. These

are discussed in the ensuing paragraphs with suitable examples.

3.5 RANGE

The simplest measure of dispersion is the range, which is the difference between the

maximum value and the minimum value of data.

Example 3.1: Find the range for the following three sets of data:

Set 1: 05 15 15 05 15 05 15 15 15 15

Set 2: 8 7 15 11 12 5 13 11 15 9

Set 3: 5 5 5 5 5 5 5 5 5 5

Solution: In each of these three sets, the highest number is 15 and the lowest number

is 5. Since the range is the difference between the maximum value and the minimum

value of the data, it is 10 in each case. But the range fails to give any idea about the dispersal or spread of the series between the highest and the lowest value. This becomes evident from the above data.

In a frequency distribution, range is calculated by taking the difference between the

upper limit of the highest class and the lower limit of the lowest class.

Example 3.2: Find the range for the following frequency distribution:

Size of Item Frequency


20- 40 7
40- 60 11
60- 80 30
80-100 17
100-120 5
Total 70

Solution: Here, the upper limit of the highest class is 120 and the lower limit of the

lowest class is 20. Hence, the range is 120 - 20 = 100. Note that the range is not

influenced by the frequencies. Symbolically, the range is calculated by the formula L - S, where L is the largest value and S is the smallest value in a distribution. The coefficient of range is calculated by the formula (L - S)/(L + S). This is the relative measure. The coefficient of range in respect of the earlier example having three sets of data is (15 - 5)/(15 + 5) = 0.5. The coefficient of range is more appropriate for purposes of comparison, as will be evident from the following example:

Example 3.3: Calculate the coefficient of range separately for the two sets of data

given below:

Set 1 8 10 20 9 15 10 13 28

Set 2 30 35 42 50 32 49 39 33

Solution: It can be seen that the range in both the sets of data is the same:

Set 1 28 - 8 = 20

Set 2 50 - 30 = 20

Coefficient of range in Set 1 is:

$$\frac{28 - 8}{28 + 8} = 0.56 \text{ approx.}$$

Coefficient of range in Set 2 is:

$$\frac{50 - 30}{50 + 30} = 0.25$$
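The range and the coefficient of range from Example 3.3 can be computed as follows. This is a small illustrative sketch; the function names are our own.

```python
def value_range(values):
    """Absolute measure: largest value minus smallest value (L - S)."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """Relative measure: (L - S) / (L + S)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)

set_1 = [8, 10, 20, 9, 15, 10, 13, 28]
set_2 = [30, 35, 42, 50, 32, 49, 39, 33]

for data in (set_1, set_2):
    print(value_range(data), round(coefficient_of_range(data), 4))
```

Both sets have the same absolute range, but relative to the size of the values the first set is more than twice as variable as the second.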

3.5.1 LIMITATIONS OF RANGE


There are some limitations of range, which are as follows:

1. It is based only on two items and does not cover all the items in a distribution.

2. It is subject to wide fluctuations from sample to sample based on the same

population.

3. It fails to give any idea about the pattern of distribution. This was evident from the data given in Examples 3.1 and 3.3.

4. Finally, in the case of open-ended distributions, it is not possible to compute

the range.

Despite these limitations, the range is mainly used in situations where one wants to quickly have some idea of the variability of a set of data. When the sample size is very small, the range is considered a quite adequate measure of the variability. Thus, it

is widely used in quality control where a continuous check on the variability of raw

materials or finished products is needed. The range is also a suitable measure in weather

forecast. The meteorological department uses the range by giving the maximum and the

minimum temperatures. This information is quite useful to the common man, as he can

know the extent of possible variation in the temperature on a particular day.

3.6 INTERQUARTILE RANGE OR QUARTILE DEVIATION

The interquartile range or the quartile deviation is a better measure of variation in a distribution than the range. Here, the 25 percent of the distribution at each end is left out and only the middle 50 percent of the distribution is used. In other words, the interquartile range denotes the difference between the third quartile and the first quartile.

Symbolically, interquartile range = $Q_3 - Q_1$

Many times the interquartile range is reduced in the form of semi-interquartile range

or quartile deviation as shown below:

Semi-interquartile range or Quartile deviation = $(Q_3 - Q_1)/2$

When quartile deviation is small, it means that there is a small deviation in the central 50 percent items. In contrast, if the quartile deviation is high, it shows that the central 50 percent items have a large variation. It may be noted that in a symmetrical distribution, the two quartiles, that is, $Q_3$ and $Q_1$, are equidistant from the median. Symbolically,

$M - Q_1 = Q_3 - M$

However, this is seldom the case as most of the business and economic data are

asymmetrical. But, one can assume that approximately 50 percent of the observations

are contained in the interquartile range. It may be noted that interquartile range or the

quartile deviation is an absolute measure of dispersion. It can be changed into a relative

measure of dispersion as follows:

$$\text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

The computation of a quartile deviation is very simple, involving the computation of

upper and lower quartiles. As the computation of the two quartiles has already been

explained in the preceding chapter, it is not attempted here.
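For individual observations, the three quartile-based measures can be sketched with Python's standard library. Several conventions exist for locating quartiles; the sketch assumes the convention that places $Q_1$ and $Q_3$ at the (n + 1)/4-th and 3(n + 1)/4-th ordered items, which may or may not match the method of the preceding chapter. The data are Set 1 of Example 3.3, reused purely for illustration.

```python
import statistics

def quartile_measures(values):
    """Return (Q1, Q3, interquartile range, quartile deviation, coefficient of QD)."""
    # method="exclusive" uses positions based on (n + 1) and
    # interpolates between neighbouring ordered items.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1, q3, iqr, iqr / 2, iqr / (q3 + q1)

data = [8, 10, 20, 9, 15, 10, 13, 28]
q1, q3, iqr, qd, coeff = quartile_measures(data)

print(q1, q3, iqr, qd, round(coeff, 4))
```

Because only $Q_1$ and $Q_3$ enter the calculation, replacing the largest item 28 by any larger number leaves all three measures unchanged.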

3.6.1 MERITS OF QUARTILE DEVIATION

Quartile deviation has the following merits:

1. As compared to range, it is considered a superior measure of dispersion.

2. In the case of open-ended distribution, it is quite suitable.

3. Since it is not influenced by the extreme values in a distribution, it is

particularly suitable in highly skewed or erratic distributions.

3.6.2 LIMITATIONS OF QUARTILE DEVIATION

1. Like the range, it fails to cover all the items in a distribution.

2. It is not amenable to mathematical manipulation.

3. It varies widely from sample to sample based on the same population.

4. Since it is a positional average, it is not considered as a measure of dispersion.

It merely shows a distance on scale and not a scatter around an average.

In view of the above-mentioned limitations, the interquartile range or the quartile

deviation has a limited practical utility.

3.7 MEAN DEVIATION

The mean deviation is also known as the average deviation. As the name implies, it is

the average of absolute amounts by which the individual items deviate from the mean.

Since the sum of the positive deviations from the mean is equal to the sum of the negative deviations, the signed deviations always add up to zero. While computing the mean deviation, we therefore ignore positive and negative signs. Symbolically,

$$MD = \frac{\sum |x|}{n}$$

where MD = mean deviation, |x| = deviation of an item from the mean ignoring positive and negative signs, and n = the total number of observations.

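Example 3.4: Calculate the mean deviation for the following distribution:

| Size of Item | Frequency |
|---|---|
| 2-4 | 20 |
| 4-6 | 40 |
| 6-8 | 30 |
| 8-10 | 10 |

Solution:

| Size of Item | Mid-points (m) | Frequency (f) | fm | d from $\bar{x}$ | f\|d\| |
|---|---|---|---|---|---|
| 2-4 | 3 | 20 | 60 | -2.6 | 52 |
| 4-6 | 5 | 40 | 200 | -0.6 | 24 |
| 6-8 | 7 | 30 | 210 | 1.4 | 42 |
| 8-10 | 9 | 10 | 90 | 3.4 | 34 |
| Total | | 100 | 560 | | 152 |

$$\bar{x} = \frac{\sum fm}{n} = \frac{560}{100} = 5.6$$

$$MD(\bar{x}) = \frac{\sum f|d|}{n} = \frac{152}{100} = 1.52$$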
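The calculation in Example 3.4 can be reproduced with a short sketch. The function name `grouped_mean_deviation` is our own; as in the worked solution, the class mid-points stand in for the individual items.

```python
def grouped_mean_deviation(midpoints, freqs):
    """Mean and mean deviation about the mean for a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * m for m, f in zip(midpoints, freqs)) / n
    total_abs_dev = sum(f * abs(m - mean) for m, f in zip(midpoints, freqs))
    return mean, total_abs_dev / n

mean, md = grouped_mean_deviation([3, 5, 7, 9], [20, 40, 30, 10])
print(round(mean, 2), round(md, 2))
```

Without the `abs` call the weighted deviations would cancel out to zero, which is why the signs have to be ignored.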

3.7.1 MERITS OF MEAN DEVIATION


1. A major advantage of mean deviation is that it is simple to understand and

easy to calculate.

2. It takes into consideration each and every item in the distribution. As a result,

a change in the value of any item will have its effect on the magnitude of mean

deviation.

3. The values of extreme items have less effect on the value of the mean deviation.

4. As deviations are taken from a central value, it is possible to have meaningful

comparisons of the formation of different distributions.

3.7.2 LIMITATIONS OF MEAN DEVIATION


1. It is not capable of further algebraic treatment.

2. At times it may fail to give accurate results. The mean deviation gives best

results when deviations are taken from the median instead of from the mean.

But in a series, which has wide variations in the items, median is not a

satisfactory measure.

3. Strictly on mathematical considerations, the method is wrong as it ignores the

algebraic signs when the deviations are taken from the mean.

In view of these limitations, it is seldom used in business studies. A better measure

known as the standard deviation is more frequently used.

3.8 STANDARD DEVIATION

The standard deviation is similar to the mean deviation in that here too the deviations

are measured from the mean. At the same time, the standard deviation is preferred to

the mean deviation or the quartile deviation or the range because it has desirable

mathematical properties.

Before defining the concept of the standard deviation, we introduce another concept

viz. variance.

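Example 3.5:

| X | X − μ | (X − μ)² |
|---|---|---|
| 20 | 20 − 18 = 2 | 4 |
| 15 | 15 − 18 = −3 | 9 |
| 19 | 19 − 18 = 1 | 1 |
| 24 | 24 − 18 = 6 | 36 |
| 16 | 16 − 18 = −2 | 4 |
| 14 | 14 − 18 = −4 | 16 |
| Total 108 | | 70 |

Solution:

Mean = $\frac{108}{6}$ = 18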

The second column shows the deviations from the mean. The third or the last column shows the squared deviations, the sum of which is 70. The arithmetic mean of the squared deviations is:

$$\frac{\sum (x - \mu)^2}{N} = \frac{70}{6} = 11.67 \text{ approx.}$$

This mean of the squared deviations is known as the variance. It may be noted that

this variance is described by different terms that are used interchangeably: the variance

of the distribution X; the variance of X; the variance of the distribution; and

just simply, the variance.

Symbolically, $Var(X) = \frac{\sum (x - \mu)^2}{N}$

It is also written as $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

where $\sigma^2$ (called sigma squared) is used to denote the variance.

Although the variance is a measure of dispersion, the unit of its measurement is (points)$^2$. If a distribution relates to income of families then the variance is (Rs)$^2$ and not rupees. Similarly, if another distribution pertains to marks of students, then the unit of variance is (marks)$^2$. To overcome this inadequacy, the square root of variance is taken, which yields a better measure of dispersion known as the standard deviation. Taking our earlier example of individual observations, we take the square root of the variance:

SD or $\sigma = \sqrt{\text{Variance}} = \sqrt{11.67}$ = 3.42 points

Symbolically, $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

In applied Statistics, the standard deviation is more frequently used than the variance. This can also be written as:

$$\sigma = \sqrt{\frac{\sum x_i^2}{N} - \left(\frac{\sum x_i}{N}\right)^2}$$

We use this formula to calculate the standard deviation from the individual observations given earlier.
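The definitional formula and the shortcut form $\sigma = \sqrt{\sum x_i^2 / N - (\sum x_i / N)^2}$ give the same answer, as the following sketch illustrates on the six observations used in this section. The function names are our own.

```python
import math

def std_by_definition(values):
    """Variance as the mean squared deviation from the mean, and its square root."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    return variance, math.sqrt(variance)

def std_by_shortcut(values):
    """Same quantities from the sum of squares and the sum of values."""
    n = len(values)
    variance = sum(x * x for x in values) / n - (sum(values) / n) ** 2
    return variance, math.sqrt(variance)

data = [20, 15, 19, 24, 16, 14]
print(std_by_definition(data))
print(std_by_shortcut(data))
```

Both functions divide by N, the full number of observations, as in the text. In Python's standard library this corresponds to `statistics.pstdev`, not `statistics.stdev`, which divides by N - 1.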

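Example 3.6:

| X | X² |
|---|---|
| 20 | 400 |
| 15 | 225 |
| 19 | 361 |
| 24 | 576 |
| 16 | 256 |
| 14 | 196 |
| Total 108 | 2014 |

Solution:

$\sum x_i^2 = 2014$, $\sum x_i = 108$, N = 6

$$\sigma = \sqrt{\frac{2014}{6} - \left(\frac{108}{6}\right)^2} = \sqrt{\frac{2014}{6} - \frac{11664}{36}} = \sqrt{\frac{12084 - 11664}{36}} = \sqrt{\frac{420}{36}} = \sqrt{11.67}$$

$\sigma$ = 3.42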

Example 3.7:

The following distribution relates to marks obtained by students in an examination. Calculate the standard deviation.

| Marks | Number of Students |
|---|---|
| 0-10 | 1 |
| 10-20 | 3 |
| 20-30 | 6 |
| 30-40 | 10 |
| 40-50 | 12 |
| 50-60 | 11 |
| 60-70 | 6 |
| 70-80 | 3 |
| 80-90 | 2 |
| 90-100 | 1 |
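Solution:

| Marks | Frequency (f) | Mid-points | Deviations (d)/10 = d' | fd' | fd'² |
|---|---|---|---|---|---|
| 0-10 | 1 | 5 | -5 | -5 | 25 |
| 10-20 | 3 | 15 | -4 | -12 | 48 |
| 20-30 | 6 | 25 | -3 | -18 | 54 |
| 30-40 | 10 | 35 | -2 | -20 | 40 |
| 40-50 | 12 | 45 | -1 | -12 | 12 |
| 50-60 | 11 | 55 | 0 | 0 | 0 |
| 60-70 | 6 | 65 | 1 | 6 | 6 |
| 70-80 | 3 | 75 | 2 | 6 | 12 |
| 80-90 | 2 | 85 | 3 | 6 | 18 |
| 90-100 | 1 | 95 | 4 | 4 | 16 |
| Total | 55 | | | -45 | 231 |
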
In the case of a frequency distribution where the individual values are not known, we use the mid-points of the class intervals. Thus, the formula used for calculating the standard deviation is as given below:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N}}$$

where $m_i$ is the mid-point of the ith class interval, $\mu$ is the mean of the distribution, $f_i$ is the frequency of each class, N is the total number of frequencies and K is the number of classes. This formula requires that the mean $\mu$ be calculated and that the deviations $(m_i - \mu)$ be obtained for each class. To avoid this inconvenience, the above formula can be modified as:

$$\sigma = C \sqrt{\frac{\sum_{i=1}^{K} f_i d_i^2}{N} - \left(\frac{\sum_{i=1}^{K} f_i d_i}{N}\right)^2}$$

where C is the class interval, $f_i$ is the frequency of the ith class, $d_i$ is the deviation of the ith class mid-point from an assumed origin, measured in units of the class interval, and N is the total number of observations.

Applying this formula to the table given earlier,

$$\sigma = 10 \sqrt{\frac{231}{55} - \left(\frac{-45}{55}\right)^2} = 10 \sqrt{4.2 - 0.669421} = 18.8 \text{ marks}$$


When it becomes clear that the actual mean would turn out to be a fraction, calculating deviations from the mean would be too cumbersome. In such cases, an assumed mean is used and the deviations from it are calculated. While the mid-point of any class can be taken as an assumed mean, it is advisable to choose the mid-point of that class that would make calculations least cumbersome. Guided by this consideration, in Example 3.7 we have decided to choose 55 as the assumed mean and, accordingly, deviations have been taken from it. It will be seen from the calculations that they are considerably simplified.
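The assumed-mean (step-deviation) calculation of Example 3.7 can be sketched as below. The function and parameter names are our own choices for illustration; the data are the mid-points and frequencies from the worked table.

```python
import math

def grouped_std_step_deviation(midpoints, freqs, assumed_mean, class_width):
    """Standard deviation of grouped data by the step-deviation method."""
    n = sum(freqs)
    # d' = (mid-point - assumed mean) / class width
    steps = [(m - assumed_mean) / class_width for m in midpoints]
    sum_fd = sum(f * d for f, d in zip(freqs, steps))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, steps))
    return class_width * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

midpoints = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
freqs = [1, 3, 6, 10, 12, 11, 6, 3, 2, 1]

sigma = grouped_std_step_deviation(midpoints, freqs, assumed_mean=55, class_width=10)
print(round(sigma, 1))
```

The choice of assumed mean affects only the convenience of the arithmetic: calling the function with `assumed_mean=45` returns the same standard deviation.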
3.8.1 USES OF THE STANDARD DEVIATION

The standard deviation is a frequently used measure of dispersion. It enables us to

determine as to how far individual items in a distribution deviate from its mean. In a

symmetrical, bell-shaped curve:

(i) About 68 percent of the values in the population fall within ±1 standard deviation from the mean.

(ii) About 95 percent of the values will fall within ±2 standard deviations from the mean.

(iii) About 99.7 percent of the values will fall within ±3 standard deviations from the mean.
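These percentages can be checked for the normal curve, which is the standard example of a symmetrical, bell-shaped distribution. The sketch below assumes normality; for other bell-shaped curves the figures hold only approximately. The helper name `share_within` is our own.

```python
from statistics import NormalDist

def share_within(k, dist=NormalDist()):
    """Proportion of a normal population within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

The proportions do not depend on the particular mean or standard deviation, so the same shares apply to, say, `NormalDist(mu=45, sigma=10)`.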

The standard deviation is an absolute measure of dispersion as it measures variation in the same units as the original data. As such, it is not a suitable measure for comparing two or more distributions that are expressed in different units or have widely different means. For this purpose, we should use a relative measure of dispersion. One such measure is the coefficient of variation, which expresses the standard deviation as a percentage of the mean. Thus, the specific unit in which the standard deviation is measured is done away with and the new unit becomes percent.

Symbolically, CV (coefficient of variation) = (σ / μ) × 100

Example 3.8: In a small business firm, two typists are employed: typist A and typist B. Typist A types, on average, 30 pages per day with a standard deviation of 6. Typist B types, on average, 45 pages per day with a standard deviation of 10. Which typist shows greater consistency in his output?

Solution: Coefficient of variation for A = (σ / μ) × 100 = (6 / 30) × 100 = 20%

Coefficient of variation for B = (σ / μ) × 100 = (10 / 45) × 100 = 22.2%

These calculations clearly indicate that although typist B types out more pages, there

is a greater variation in his output as compared to that of typist A. We can say this in a

different way: Though typist A's daily output is much less, he is more consistent than

typist B. The usefulness of the coefficient of variation becomes clear in comparing

two groups of data having different means, as has been the case in the above example.
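The comparison in Example 3.8 can be reproduced in a few lines. This is a small sketch; the function name `coefficient_of_variation` is chosen here for illustration only.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


cv_a = coefficient_of_variation(6, 30)    # typist A: mean 30 pages, SD 6
cv_b = coefficient_of_variation(10, 45)   # typist B: mean 45 pages, SD 10

# The smaller coefficient of variation indicates the more consistent output.
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```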

3.8.2 STANDARDISED VARIABLE, STANDARD SCORES


The variable Z = (x − x̄)/s or (x − μ)/σ, which measures the deviation from the mean

in units of the standard deviation, is called a standardised variable. Since both the

numerator and the denominator are in the same units, a standardised variable is

independent of units used. If deviations from the mean are given in units of the standard

deviation, they are said to be expressed in standard units or standard scores.

Through this concept of standardised variable, proper comparisons can be made

between individual observations belonging to two different distributions whose

compositions differ.

Example 3.9: A student has scored 68 marks in Statistics for which the average

marks were 60 and the standard deviation was 10. In the paper on Marketing, he scored

74 marks for which the average marks were 68 and the standard deviation was

15. In which paper, Statistics or Marketing, was his relative standing higher?

Solution: The standardised variable Z = (x − x̄) ÷ s measures the deviation of x from the mean x̄ in terms of the standard deviation s.

For Statistics, Z = (68 − 60) ÷ 10 = 0.8

For Marketing, Z = (74 − 68) ÷ 15 = 0.4

Since the standard score is 0.8 in Statistics as compared to 0.4 in Marketing, his

relative standing was higher in Statistics.

Example 3.10: Convert the set of numbers 6, 7, 5, 10 and 12 into standard scores:

Solution:

X        X²
6        36
7        49
5        25
10       100
12       144
ΣX = 40  ΣX² = 354

x̄ = ΣX ÷ N = 40 ÷ 5 = 8

$$\sigma = \sqrt{\frac{\sum X^2}{N} - \bar{x}^2} = \sqrt{\frac{354}{5} - 8^2} = \sqrt{\frac{354 - 320}{5}} = \sqrt{6.8} \approx 2.61$$

$$Z = \frac{x - \bar{x}}{\sigma} = \frac{6 - 8}{2.61} = -0.77 \text{ (standard score)}$$

Applying this formula to the other values:

(i) (7 − 8) ÷ 2.61 = −0.38

(ii) (5 − 8) ÷ 2.61 = −1.15

(iii) (10 − 8) ÷ 2.61 = 0.77

(iv) (12 − 8) ÷ 2.61 = 1.53

Thus the standard scores for 6,7,5,10 and 12 are -0.77, -0.38, -1.15, 0.77 and 1.53,

respectively.
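The conversion in Example 3.10 can be sketched as follows. The helper `standard_scores` is an illustrative name; it uses the population standard deviation (dividing by N), which is the convention the worked solution follows.

```python
from statistics import mean, pstdev


def standard_scores(values):
    """Convert each value to (value - mean) / population standard deviation."""
    centre = mean(values)
    spread = pstdev(values)  # divides by N, as in the worked solution
    return [(v - centre) / spread for v in values]


scores = standard_scores([6, 7, 5, 10, 12])
print([round(z, 2) for z in scores])
```

A useful check on any set of standard scores is that they sum to zero, because deviations from the mean always cancel out.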

3.9 LORENZ CURVE

This measure of dispersion is graphical. It is known as the Lorenz curve named after

Dr. Max Lorenz. It is generally used to show the extent of concentration of income

and wealth. The steps involved in plotting the Lorenz curve are:

1. Convert a frequency distribution into a cumulative frequency table.

2. Calculate percentage for each item taking the total equal to 100.

3. Choose a suitable scale and plot the cumulative percentages of the persons and income. Use the horizontal axis (X) to depict percentages of persons and the vertical axis (Y) to depict percentages of income.

4. Show the line of equal distribution, which joins the origin (0 percent on both axes) to the point representing 100 percent on both axes.

5. The curve obtained in (3) above can now be compared with the straight line of equal distribution obtained in (4) above. If the Lorenz curve is close to the line of equal distribution, the dispersion is small. If, on the contrary, the Lorenz curve is farther away from the line of equal distribution, the dispersion is considerable.

The Lorenz curve is a simple graphical device to show the disparities of distribution

in any phenomenon. It is used in business and economics to represent inequalities in

income, wealth, production, savings, and so on.
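The plotting steps for a Lorenz curve can be sketched in code. The example below is a simplified illustration that starts from a list of individual incomes rather than a frequency table; the function name `lorenz_points` and the sample figures are assumptions made for this sketch.

```python
def lorenz_points(incomes):
    """Return (cumulative % of persons, cumulative % of income) pairs, poorest first."""
    ordered = sorted(incomes)
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]          # every Lorenz curve starts at the origin
    running = 0
    for k, income in enumerate(ordered, start=1):
        running += income
        points.append((100 * k / n, 100 * running / total))
    return points


equal = lorenz_points([10, 10, 10, 10])   # lies on the line of equal distribution
unequal = lorenz_points([1, 1, 1, 7])     # bends away from that line
print(equal)
print(unequal)
```

When every person has the same income, each point has equal X and Y values, which is exactly the line of equal distribution. The further the Y values fall below the X values, the greater the inequality.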

Figure 3.1 shows two Lorenz curves by way of illustration. The straight line AB is a

line of equal distribution, whereas AEB shows complete inequality. Curve ACB and

curve ADB are the Lorenz curves.

Figure 3.1: Lorenz Curve

As curve ACB is nearer to the line of equal distribution, it shows a more equitable distribution of income than curve ADB. Assuming that these two curves are for the same company, this may be interpreted in a different manner. Prior to taxation, the curve ADB showed greater inequality in the income of its employees. After taxation, the company's data resulted in curve ACB, which is closer to the line of equal distribution. In other words, as a result of taxation, the inequality has been reduced.

3.10 SKEWNESS: MEANING AND DEFINITIONS

In the above paragraphs, we have discussed frequency distributions in detail. It may

be repeated here that frequency distributions differ in three ways: Average value,

Variability or dispersion, and Shape. Since the first two, that is, average value and

variability or dispersion have already been discussed in previous chapters, here our

main spotlight will be on the shape of frequency distribution. Generally, there are two

comparable characteristics called skewness and kurtosis that help us to understand a

distribution. Two distributions may have the same mean and standard deviation but may

differ widely in their overall appearance as can be seen from the following:

In both these distributions the values of the mean and the standard deviation are the same (X̄ = 15, σ = 5). But it does not imply that the distributions are alike in nature. The distribution on the left-hand side is a symmetrical one, whereas the distribution on the right-hand side is asymmetrical or skewed. Measures of skewness help us to distinguish between different types of distributions.

Some important definitions of skewness are as follows:

1. "When a series is not symmetrical it is said to be asymmetrical or skewed."

-Croxton & Cowden.

2. "Skewness refers to the asymmetry or lack of symmetry in the shape of a

frequency distribution." -Morris Hamburg.

3. "Measures of skewness tell us the direction and the extent of skewness. In

symmetrical distribution the mean, median and mode are identical. The more

the mean moves away from the mode, the larger the asymmetry or skewness."

-Simpson & Kafka

4. "A distribution is said to be 'skewed' when the mean and the median fall at

different points in the distribution, and the balance (or centre of gravity) is

shifted to one side or the other-to left or right." -Garrett

The above definitions show that the term 'skewness' refers to a lack of symmetry, i.e., when a distribution is not symmetrical (or is asymmetrical) it is called a skewed distribution.

The concept of skewness will be clear from the following three diagrams showing a

symmetrical distribution, a positively skewed distribution and a negatively skewed

distribution.

1. Symmetrical Distribution. It is clear from diagram (a) that in a symmetrical distribution the values of mean, median and mode coincide. The spread of the frequencies is the same on both sides of the centre point of the curve.

2. Asymmetrical Distribution. A distribution which is not symmetrical is called a skewed distribution. Such a distribution could be either positively skewed or negatively skewed, as is clear from diagrams (b) and (c).

3. Positively Skewed Distribution. In a positively skewed distribution the value of the mean is the greatest and that of the mode the least; the median lies in between the two, as is clear from diagram (b).

4. Negatively Skewed Distribution. In a negatively skewed distribution the value of the mode is the greatest and that of the mean the least; the median lies in between the two. The shape of the negatively skewed distribution is shown in diagram (c).

In the positively skewed distribution the frequencies are spread out over a greater range of values on the high-value end of the curve (the right-hand side) than they are on the low-value end. In the negatively skewed distribution the position is reversed, i.e. the excess tail is on the left-hand side. It should be noted that in moderately skewed distributions the interval between the mean and the median is approximately one-third of the interval between the mean and the mode. It is this relationship which provides a means of measuring the degree of skewness.
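The ordering of mean, median and mode described above can be seen on a small data set. The numbers below are made up for illustration and are not taken from these notes; the left-skewed series is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]    # long tail on the high-value side
left_skewed = [15 - v for v in right_skewed]       # mirror image: long tail on the low side

for label, data in (("positive skew", right_skewed), ("negative skew", left_skewed)):
    print(label, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

In the first series the mean (4.5) exceeds the median (3), which exceeds the mode (2). In the mirrored series the order is reversed.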

3.11 TESTS OF SKEWNESS

In order to ascertain whether a distribution is skewed or not the following tests may
be applied. Skewness is present if:
1. The values of mean, median and mode do not coincide.

2. When the data are plotted on a graph they do not give the normal bell-

shaped form i.e. when cut along a vertical line through the centre the two

halves are not equal.

3. The sum of the positive deviations from the median is not equal to the sum

of the negative deviations.

4. Quartiles are not equidistant from the median.

5. Frequencies are not equally distributed at points of equal deviation from

the mode.

On the contrary, when skewness is absent, i.e. in case of a symmetrical distribution,

the following conditions are satisfied:

1. The values of mean, median and mode coincide.

2. Data when plotted on a graph give the normal bell-shaped form.

3. Sum of the positive deviations from the median is equal to the sum of the

negative deviations.

4. Quartiles are equidistant from the median.

5. Frequencies are equally distributed at points of equal deviations from the

mode.
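Three of the tests listed above (mean versus median, balance of deviations about the median, and equidistance of the quartiles) lend themselves to a quick numerical check. This is a rough sketch: the helper name `skewness_checks`, the sample data, and the choice of the `inclusive` quartile convention are all assumptions of this illustration.

```python
from math import isclose
from statistics import mean, median, quantiles


def skewness_checks(data):
    """Return True for each test that signals the presence of skewness."""
    centre = median(data)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    above = sum(v - centre for v in data if v > centre)   # sum of positive deviations
    below = sum(centre - v for v in data if v < centre)   # sum of negative deviations
    return {
        "mean differs from median": not isclose(mean(data), centre, abs_tol=1e-9),
        "deviations from median unbalanced": not isclose(above, below, abs_tol=1e-9),
        "quartiles not equidistant": not isclose(q3 - centre, centre - q1, abs_tol=1e-9),
    }


symmetric_result = skewness_checks([1, 2, 3, 4, 5])
skewed_result = skewness_checks([1, 2, 2, 3, 10])
print(symmetric_result)
print(skewed_result)
```

For the symmetrical series every check is False; for the skewed series every check is True.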

3.12 MEASURES OF SKEWNESS

There are four measures of skewness, each divided into absolute and relative measures.

The relative measure is known as the coefficient of skewness and is more frequently

used than the absolute measure of skewness. Further, when a comparison between two

or more distributions is involved, it is the relative measure of skewness, which is used.

The measures of skewness are: (i) Karl Pearson's measure, (ii) Bowley's measure, (iii) Kelly's measure, and (iv) the measure based on moments. These measures are discussed briefly below:

3.12.1 KARL PEARSON'S MEASURE

The formula for measuring skewness as given by Karl Pearson is as follows:

Skewness = Mean − Mode

$$\text{Coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

In case the mode is indeterminate, the coefficient of skewness is:

$$Sk_P = \frac{\text{Mean} - (3\,\text{Median} - 2\,\text{Mean})}{\text{Standard deviation}} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$

Now this formula is equal to the earlier one.

$$\frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

Or 3 Mean − 3 Median = Mean − Mode

Or Mode = Mean − 3 Mean + 3 Median

Or Mode = 3 Median − 2 Mean

The direction of skewness is determined by ascertaining whether the mean is greater

than the mode or less than the mode. If it is greater than the mode, then skewness is
positive. But when the mean is less than the mode, it is negative. The difference between

the mean and mode indicates the extent of departure from symmetry. It is measured in

standard deviation units, which provide a measure independent of the unit of

measurement. It may be recalled that this observation was made in the preceding

chapter while discussing standard deviation. The value of coefficient of skewness is

zero when the distribution is symmetrical. Normally, this coefficient of skewness lies

between −1 and +1. If the mean is greater than the mode, then the coefficient of skewness will

be positive, otherwise negative.

Example 3.11: Given the following data, calculate Karl Pearson's coefficient of skewness: Σx = 452, Σx² = 24270, Mode = 43.7 and N = 10

Solution:

Pearson's coefficient of skewness is:

$$Sk_P = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

$$\text{Mean } (\bar{x}) = \frac{\sum x}{N} = \frac{452}{10} = 45.2$$

$$\text{SD } (\sigma) = \sqrt{\frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2} = \sqrt{\frac{24270}{10} - \left(\frac{452}{10}\right)^2} = \sqrt{2427 - (45.2)^2} = 19.59$$

Applying the values of mean, mode and standard deviation in the above formula,

$$Sk_P = \frac{45.2 - 43.7}{19.59} = 0.08$$

This shows that there is a positive skewness though the extent of skewness is

marginal.
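The calculation in Example 3.11 can be reproduced directly from the summary totals. This is a short sketch; the function names are illustrative, and the second function is the variant used when the mode is indeterminate.

```python
from math import sqrt


def pearson_skewness(mean, mode, std_dev):
    """Karl Pearson's coefficient: (Mean - Mode) / Standard deviation."""
    return (mean - mode) / std_dev


def pearson_skewness_from_median(mean, median, std_dev):
    """Variant for an indeterminate mode: 3 (Mean - Median) / Standard deviation."""
    return 3 * (mean - median) / std_dev


# Totals given in Example 3.11
n, sum_x, sum_x2, mode = 10, 452, 24270, 43.7
mean = sum_x / n
std_dev = sqrt(sum_x2 / n - mean ** 2)
sk_p = pearson_skewness(mean, mode, std_dev)
print(round(mean, 1), round(std_dev, 2), round(sk_p, 2))
```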

Example 3.12: From the following data, calculate the measure of skewness using the

mean, median and standard deviation:

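X:  10 - 20   20 - 30   30 - 40   40 - 50   50 - 60   60 - 70   70 - 80
f:  18        30        40        55        38        20        16

Solution:

x          Mid-value   dx    f      fdx    fdx²   cf
10 - 20    15          -3    18     -54    162    18
20 - 30    25          -2    30     -60    120    48
30 - 40    35          -1    40     -40    40     88
40 - 50    45 = a      0     55     0      0      143
50 - 60    55          1     38     38     38     181
60 - 70    65          2     20     40     80     201
70 - 80    75          3     16     48     144    217
Total                        217    -28    584

a = Assumed mean = 45, cf = Cumulative frequency, dx = Deviation from the assumed mean in class-interval units, and i = 10 (the class interval)

$$\bar{x} = a + \frac{\sum f d_x}{N} \times i = 45 - \frac{28}{217} \times 10 = 43.71$$

$$\text{Median} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$

Where m = (N + 1)/2th item = (217 + 1)/2 = 109th item, which lies in the 40 - 50 class; l1 and l2 are the lower and upper limits of that class, f1 is its frequency and c is the cumulative frequency of the preceding class.

$$\text{Median} = 40 + \frac{50 - 40}{55}(109 - 88) = 40 + \frac{10}{55} \times 21 = 43.82$$

$$\text{SD} = \sqrt{\frac{\sum f d_x^2}{\sum f} - \left(\frac{\sum f d_x}{\sum f}\right)^2} \times 10 = \sqrt{\frac{584}{217} - \left(\frac{-28}{217}\right)^2} \times 10 = \sqrt{2.69 - 0.016} \times 10 = 16.4$$

Skewness = 3 (Mean − Median)

= 3 (43.71 − 43.82)

= 3 × (−0.11)

= −0.33

Coefficient of skewness = Skewness ÷ SD = −0.33 ÷ 16.4 = −0.02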

The result shows that the distribution is negatively skewed, but the extent of skewness

is extremely negligible.
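The figures in Example 3.12 can be cross-checked with a short script. The sketch below works from the class mid-values directly instead of using the assumed-mean shortcut, which gives the same mean and standard deviation; the helper name `grouped_median` is illustrative, and the median position (N + 1)/2 follows the convention used in the solution.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class in Example 3.12
classes = [(10, 20, 18), (20, 30, 30), (30, 40, 40), (40, 50, 55),
           (50, 60, 38), (60, 70, 20), (70, 80, 16)]

n = sum(f for _, _, f in classes)
mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n
variance = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes) / n
std_dev = sqrt(variance)


def grouped_median(classes, n):
    """Interpolate the median inside the class that contains the (n + 1)/2-th item."""
    position = (n + 1) / 2
    cumulative = 0
    for lo, hi, f in classes:
        if cumulative + f >= position:
            return lo + (hi - lo) / f * (position - cumulative)
        cumulative += f
    raise ValueError("position lies beyond the last class")


median = grouped_median(classes, n)
coefficient = 3 * (mean - median) / std_dev
print(round(mean, 2), round(median, 2), round(std_dev, 1), round(coefficient, 2))
```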

3.12.2 Bowley's Measure

Bowley developed a measure of skewness which is based on quartile values. The formula for measuring skewness is:

$$\text{Skewness} = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$

Where Q3 and Q1 are the upper and lower quartiles and M is the median. The value of this skewness varies between −1 and +1. This measure is particularly useful in the case of open-ended distributions and where extreme values are found in the series. In a symmetrical distribution, skewness is zero. This means that Q3 and Q1 are positioned equidistantly from Q2, that is, the median. In symbols, Q3 − Q2 = Q2 − Q1. In contrast, when the distribution is skewed, Q3 − Q2 will be different from Q2 − Q1. When Q3 − Q2 exceeds Q2 − Q1, skewness is positive. As against this, when Q3 − Q2 is less than Q2 − Q1, skewness is negative. Bowley's measure of skewness can be written as:

Skewness = (Q3 − Q2) − (Q2 − Q1) or Q3 − Q2 − Q2 + Q1

Or Q3 + Q1 − 2Q2 (2Q2 is 2M)

However, this is an absolute measure of skewness. As such, it cannot be used while comparing two distributions where the units of measurement are different. In view of this limitation, Bowley suggested a relative measure of skewness as given below:

$$\text{Relative Skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{Q_3 - Q_2 - Q_2 + Q_1}{Q_3 - Q_2 + Q_2 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$

Example 3.13: For a distribution, Bowley’s coefficient of skewness is - 0.56,

Q1=16.4 and Median=24.2. What is the coefficient of quartile deviation?

Solution:

Bowley's coefficient of skewness is:

$$Sk_B = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$

Substituting the values in the above formula,

$$-0.56 = \frac{Q_3 + 16.4 - (2 \times 24.2)}{Q_3 - 16.4} = \frac{Q_3 + 16.4 - 48.4}{Q_3 - 16.4}$$

or −0.56 (Q3 − 16.4) = Q3 − 32

or −0.56 Q3 + 9.184 = Q3 − 32

or −0.56 Q3 − Q3 = −32 − 9.184

or −1.56 Q3 = −41.184

$$Q_3 = \frac{-41.184}{-1.56} = 26.4$$

Now, we have the values of both the upper and the lower quartiles.

$$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{26.4 - 16.4}{26.4 + 16.4} = \frac{10}{42.8} = 0.234 \text{ approx.}$$
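The algebra in Example 3.13 amounts to solving Bowley's formula for Q3. Rearranging Sk(Q3 − Q1) = Q3 + Q1 − 2M gives Q3 = (2M − Q1(1 + Sk)) / (1 − Sk). The sketch below applies that rearrangement; the function name `upper_quartile_from_bowley` is made up for this illustration.

```python
def upper_quartile_from_bowley(sk_b: float, q1: float, median: float) -> float:
    """Solve Sk_B = (Q3 + Q1 - 2M) / (Q3 - Q1) for Q3."""
    return (2 * median - q1 * (1 + sk_b)) / (1 - sk_b)


q1, median, sk_b = 16.4, 24.2, -0.56
q3 = upper_quartile_from_bowley(sk_b, q1, median)
coeff_quartile_deviation = (q3 - q1) / (q3 + q1)
print(round(q3, 1), round(coeff_quartile_deviation, 3))
```

Substituting the recovered Q3 back into Bowley's formula should return the original coefficient of −0.56, which is a convenient check on the arithmetic.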

Example 3.14: Calculate an appropriate measure of skewness from the following

data:
Value in Rs       Frequency
Less than 50      40
50 - 100          80
100 - 150         130
150 - 200         60
200 and above     30

Solution: It should be noted that the series given in the question is an open-ended series.

As such, Bowley's coefficient of skewness, which is based on quartiles, would be the

most appropriate measure of skewness in this case. In order to calculate the quartiles

and the median, we have to use the cumulative frequency. The table is reproduced below

with the cumulative frequency.

Value in Rs       Frequency    Cumulative Frequency
Less than 50      40           40
50 - 100          80           120
100 - 150         130          250
150 - 200         60           310
200 and above     30           340

$$Q_1 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$

Now m = (n + 1)/4th item = 341/4 = 85.25, which lies in the 50 - 100 class.

$$Q_1 = 50 + \frac{100 - 50}{80}(85.25 - 40) = 78.28$$

For the median, m = (n + 1)/2th item = 341/2 = 170.5, which lies in the 100 - 150 class.

$$M = 100 + \frac{150 - 100}{130}(170.5 - 120) = 119.4$$

$$Q_3 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$

m = 3(341) ÷ 4 = 255.75, which lies in the 150 - 200 class.

$$Q_3 = 150 + \frac{200 - 150}{60}(255.75 - 250) = 154.79$$

Bowley's coefficient of skewness is:

$$\frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1} = \frac{154.79 + 78.28 - (2 \times 119.4)}{154.79 - 78.28} = \frac{-5.73}{76.51} = -0.075 \text{ approx.}$$

This shows that there is a negative skewness, which has a very negligible magnitude.
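Example 3.14 can be verified with a small script. In this sketch the open ends of the series are represented by `None`, which is a choice made for this illustration; the quartiles and the median all fall in closed classes, so the missing limits are never needed. The positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 follow the convention used in the solution.

```python
def value_at_position(classes, position):
    """Interpolate the value of the item at `position` inside the class containing it."""
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            if lower is None or upper is None:
                raise ValueError("item falls in an open-ended class")
            return lower + (upper - lower) / freq * (position - cumulative)
        cumulative += freq
    raise ValueError("position lies beyond the last class")


# (lower limit, upper limit, frequency); None marks an open end
classes = [(None, 50, 40), (50, 100, 80), (100, 150, 130),
           (150, 200, 60), (200, None, 30)]
n = sum(freq for _, _, freq in classes)

q1 = value_at_position(classes, (n + 1) / 4)
med = value_at_position(classes, (n + 1) / 2)
q3 = value_at_position(classes, 3 * (n + 1) / 4)
bowley = (q3 + q1 - 2 * med) / (q3 - q1)
print(round(q1, 2), round(med, 1), round(q3, 2), round(bowley, 3))
```

This is why a quartile-based measure suits an open-ended series: measures built on the mean and standard deviation would need a mid-value for the open classes.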

3.12.3 Kelly's Measure

Kelly developed another measure of skewness, which is based on percentiles. The

formula for measuring skewness is as follows:

$$\text{Coefficient of skewness} = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}}$$

Or,

$$\frac{D_1 + D_9 - 2M}{D_9 - D_1}$$

Where P and D stand for percentile and decile respectively. In order to calculate the coefficient of skewness by this formula, we have to ascertain the values of the 10th, 50th and 90th percentiles. In practice, this measure of skewness is seldom used. All the same, we give an example to show how it can be calculated.

Example 3.15: Use Kelly's measure to calculate skewness.

Class Intervals    f     cf
10 - 20            18    18
20 - 30            30    48
30 - 40            40    88
40 - 50            55    143
50 - 60            38    181
60 - 70            20    201
70 - 80            16    217

Solution: Now we have to calculate P10, P50 and P90.

$$P_{10} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$

where m = (n + 1)/10th item = (217 + 1)/10 = 21.8th item

This lies in the 20 - 30 class.

$$P_{10} = 20 + \frac{30 - 20}{30}(21.8 - 18) = 20 + \frac{10 \times 3.8}{30} = 21.27 \text{ approx.}$$

P50 (median): where m = (n + 1)/2th item = (217 + 1)/2 = 109th item

This lies in the class 40 - 50. Applying the above formula:

$$P_{50} = 40 + \frac{50 - 40}{55}(109 - 88) = 40 + \frac{10 \times 21}{55} = 43.82 \text{ approx.}$$

P90: here m = 90 (217 + 1)/100th item = 196.2th item

This lies in the class 60 - 70. Applying the above formula:

$$P_{90} = 60 + \frac{70 - 60}{20}(196.2 - 181) = 60 + \frac{10 \times 15.2}{20} = 67.6 \text{ approx.}$$

Kelly's skewness

$$Sk_K = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}} = \frac{67.6 - (2 \times 43.82) + 21.27}{67.6 - 21.27} = \frac{88.87 - 87.64}{46.33} = 0.027$$

This shows that the series is positively skewed though the extent of skewness is
extremely negligible. It may be recalled that if there is a perfectly symmetrical
distribution, then the skewness will be zero. One can see that the above answer
is very close to zero.
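The percentile calculations in Example 3.15 follow the same interpolation pattern each time, so they can be sketched with one helper. The name `grouped_percentile` is an illustrative choice, and the position p(n + 1)/100 follows the convention used in the solution.

```python
def grouped_percentile(classes, p):
    """Interpolate the p-th percentile of a grouped frequency distribution."""
    n = sum(freq for _, _, freq in classes)
    position = p * (n + 1) / 100          # item number of the p-th percentile
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            return lower + (upper - lower) / freq * (position - cumulative)
        cumulative += freq
    raise ValueError("position lies beyond the last class")


# (lower limit, upper limit, frequency) for each class in Example 3.15
classes = [(10, 20, 18), (20, 30, 30), (30, 40, 40), (40, 50, 55),
           (50, 60, 38), (60, 70, 20), (70, 80, 16)]

p10 = grouped_percentile(classes, 10)
p50 = grouped_percentile(classes, 50)
p90 = grouped_percentile(classes, 90)
kelly = (p90 - 2 * p50 + p10) / (p90 - p10)
print(round(p10, 2), round(p50, 2), round(p90, 1), round(kelly, 3))
```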
3.13 SUMMARY
The average value cannot adequately describe a set of observations, unless all the
observations are the same. It is necessary to describe the variability or dispersion
of the observations. In two or more distributions the central value may be the
same but still there can be wide disparities in the formation of distribution.
Therefore, we have to use the measures of dispersion.
Further, two distributions may have the same mean and standard deviation but may

differ widely in their overall appearance in terms of symmetry and skewness. To

distinguish between different types of distributions, we may use the measures of

skewness.

3.14 SELF TEST QUESTIONS


1. What do you mean by dispersion? What are the different measures of dispersion?

2. “Variability is not an important factor because even though the outcome is more certain,

you still have an equal chance of falling either above or below the median.

Therefore, on an average, the outcome will be the same.” Do you agree with this

statement? Give reasons for your answer.

3. Why is the standard deviation the most widely used measure of dispersion? Explain.

4. Define skewness and Dispersion.

5. What are the different measures of skewness? Which one is most frequently used?

6. "Measures of dispersion and skewness are complementary to one another in understanding a frequency distribution." Elucidate the statement.

Correlation Analysis

...if we have information on more than one variables, we might be interested in seeing if
there is any connection - any association - between them.

4.1 INTRODUCTION
Statistical measures of central tendency, dispersion, skewness and kurtosis are helpful for the comparison and analysis of distributions involving only one variable, i.e. univariate distributions. However, describing the relationship between two or more variables is another important part of statistics.

In many business research situations, the key to decision making lies in understanding the

relationships between two or more variables. For example, in an effort to predict the behavior

of the bond market, a broker might find it useful to know whether the interest rate of bonds is

related to the prime interest rate. While studying the effect of advertising on sales, an account

executive may find it useful to know whether there is a strong relationship between advertising

dollars and sales dollars for a company.

The statistical methods of Correlation (discussed in the present lesson) and Regression (to be discussed in the next lesson) are helpful in knowing the relationship between two or more variables which may be related in some way, like the interest rate of bonds and the prime interest rate; advertising expenditure and sales; income and consumption; crop yield and fertilizer used; heights and weights; and so on.

In all these cases involving two or more variables, we may be interested in seeing:

➢ if there is any association between the variables;

➢ if there is an association, is it strong enough to be useful;

➢ if so, what form the relationship between the two variables takes;

➢ how we can make use of that relationship for predictive purposes, that is, forecasting;

and

➢ how good such predictions will be.

Since these issues are interrelated, correlation and regression analysis, as two sides of a single process, consist of methods of examining the relationship between two or more variables. If two (or more) variables are correlated, we can use information about one (or more) variable(s) to predict the value of the other variable(s), and can measure the error of estimation, which is a job of regression analysis.

4.2 WHAT IS CORRELATION?

Correlation is a measure of association between two or more variables. When two or more variables vary in sympathy, so that movement in one tends to be accompanied by corresponding movements in the other variable(s), they are said to be correlated.

“The correlation between variables is a measure of the nature and degree of

association between the variables”.

As a measure of the degree of relatedness of two variables, correlation is widely used in

exploratory research when the objective is to locate variables that might be related in some way

to the variable of interest.

4.2.1 TYPES OF CORRELATION

Correlation can be classified in several ways. The important ways of classifying correlation

are:

(i) Positive and negative,

(ii) Linear and non-linear (curvilinear) and

(iii) Simple, partial and multiple.

Positive and Negative Correlation

If both the variables move in the same direction, we say that there is a positive correlation, i.e.,

if one variable increases, the other variable also increases on an average or if one variable

decreases, the other variable also decreases on an average.

On the other hand, if the variables are varying in opposite direction, we say that it is a case of

negative correlation; e.g., movements of demand and supply.

Linear and Non-linear (Curvilinear) Correlation

If the change in one variable is accompanied by change in another variable in a constant ratio,

it is a case of linear correlation. Observe the following data:

X : 10 20 30 40 50
Y : 25 50 75 100 125

The ratio of change in the above example is the same. It is, thus, a case of linear correlation.

If we plot these variables on graph paper, all the points will fall on the same straight line.
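The constant-ratio condition can be checked numerically. The sketch below computes the ratio of the change in Y to the change in X between consecutive observations; the helper name `change_ratios` and the altered series `y_changed` are assumptions introduced for this illustration.

```python
def change_ratios(x, y):
    """Ratio of the change in y to the change in x between consecutive observations."""
    return [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]


def is_linear(x, y):
    """True when every consecutive change follows the same ratio."""
    ratios = change_ratios(x, y)
    return all(abs(r - ratios[0]) < 1e-9 for r in ratios)


x = [10, 20, 30, 40, 50]
y = [25, 50, 75, 100, 125]          # the series from the text
y_changed = [25, 50, 80, 100, 125]  # one figure altered, so the ratio no longer holds

print(change_ratios(x, y), is_linear(x, y))
print(change_ratios(x, y_changed), is_linear(x, y_changed))
```

For the data in the text every ratio is 2.5, so the points lie on one straight line. Altering a single figure breaks the constant ratio.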

On the other hand, if the amount of change in one variable does not follow a constant ratio with

the change in another variable, it is a case of non-linear or curvilinear correlation. If a couple

of figures in either series X or series Y are changed, it would give a non-linear correlation.

Simple, Partial and Multiple Correlation

The distinction amongst these three types of correlation depends upon the number of variables

involved in a study. If only two variables are involved in a study, then the correlation is said to

be simple correlation. When three or more variables are involved in a study, then it is a problem

of either partial or multiple correlation. In multiple correlation, three or more variables are

studied simultaneously. But in partial correlation we consider only two variables influencing

each other while the effect of the other variable(s) is held constant.

Suppose we have a problem comprising three variables X, Y and Z. X is the number of hours

studied, Y is I.Q. and Z is the number of marks obtained in the examination. In a multiple

correlation, we will study the relationship between the marks obtained (Z) and the two

variables, number of hours studied (X) and I.Q. (Y). In contrast, when we study the

relationship between X and Z, keeping an average I.Q. (Y) as constant, it is said to be a study

involving partial correlation.

In this lesson, we will study linear correlation between two variables.

4.2.2 CORRELATION DOES NOT NECESSARILY MEAN CAUSATION

Correlation analysis, in discovering the nature and degree of relationship between variables, does not necessarily imply any cause-and-effect relationship between the variables. Two variables may be related to each other, but this does not mean that one variable causes the other. For example, we may find that logical reasoning and creativity are correlated, but that does not mean that if we could increase people's logical reasoning ability, we would produce greater creativity. We need to conduct an actual experiment to unequivocally demonstrate a causal relationship. But if it is true that influencing someone's logical reasoning ability does influence their creativity, then the two variables must be correlated with each other. In other words, causation always implies correlation; however, the converse is not true.

Let us see some situations-

1. The correlation may be due to chance particularly when the data pertain to a small

sample. A small sample bivariate series may show the relationship but such a

relationship may not exist in the universe.

2. It is possible that both the variables are influenced by one or more other variables.

For example, expenditure on food and entertainment for a given number of

households show a positive relationship because both have increased over time. But,

this is due to rise in family incomes over the same period. In other words, the two

variables have been influenced by another variable - increase in family incomes.

3. There may be another situation where both the variables may be influencing each

other so that we cannot say which is the cause and which is the effect. For example,

take the case of price and demand. The rise in price of a commodity may lead to a

decline in the demand for it. Here, price is the cause and the demand is the effect.

In yet another situation, an increase in demand may lead to a rise in price. Here, the

demand is the cause while price is the effect, which is just the reverse of the earlier

situation. In such situations, it is difficult to identify which variable is causing the effect on which variable, as both are influencing each other.

The foregoing discussion clearly shows that correlation does not indicate any causation or functional relationship. The correlation coefficient is merely a mathematical relationship and has nothing to do with a cause-and-effect relation. It only reveals co-variation between two variables. When there is no cause-and-effect relationship in a bivariate series and one nevertheless interprets the relationship as causal, such a correlation is called spurious or nonsense correlation. Obviously, this will be misleading. As such, one has to be very careful in correlation exercises and look into other relevant factors before concluding a cause-and-effect relationship.

4.3 CORRELATION ANALYSIS

Correlation Analysis is a statistical technique used to indicate the nature and degree of

relationship existing between one variable and the other(s). It is also used along with regression

analysis to measure how well the regression line explains the variations of the dependent

variable with the independent variable.

The commonly used methods for studying linear relationship between two variables involve
both graphic and algebraic methods. Some of the widely used methods include:
1. Scatter Diagram

2. Correlation Graph

3. Pearson’s Coefficient of Correlation

4. Spearman’s Rank Correlation

5. Concurrent Deviation Method

4.3.1 SCATTER DIAGRAM

This method is also known as Dotogram or Dot diagram. Scatter diagram is one of the simplest

methods of diagrammatic representation of a bivariate distribution. Under this method, both

the variables are plotted on the graph paper by putting dots. The diagram so obtained is called "Scatter Diagram". By studying the diagram, we can have a rough idea about the nature and degree of relationship between the two variables. The term scatter refers to the spreading of dots on the graph. We should keep the following points in mind while interpreting correlation:

➢ if the plotted points are very close to each other, it indicates a high degree of correlation. If the plotted points are away from each other, it indicates a low degree of correlation.

➢ if the points on the diagram reveal any trend (either upward or downward), the variables are said to be correlated, and if no trend is revealed, the variables are uncorrelated.

➢ if there is an upward trend rising from the lower left-hand corner and going upward to the upper right-hand corner, the correlation is positive, since this reveals that the values of the two variables move in the same direction. If, on the other hand, the points depict a downward trend from the upper left-hand corner to the lower right-hand corner, the correlation is negative, since in this case the values of the two variables move in opposite directions.

➢ in particular, if all the points lie on a straight line starting from the left bottom and going up towards the right top, the correlation is perfect and positive, and if all the points lie on a straight line starting from the left top and coming down to the right bottom, the correlation is perfect and negative.

Figure 4-1 Scatter Diagrams

The various diagrams of the scattered data in Figure 4-1 depict different forms of correlation.

Example 4-1

Given the following data on sales (in thousand units) and expenses (in thousand rupees) of a firm for 10 months:

Month:     J    F    M    A    M    J    J    A    S    O
Sales:     50   50   55   60   62   65   68   60   60   50
Expenses:  11   13   14   16   16   15   15   14   13   13

a) Make a Scatter Diagram

b) Do you think that there is a correlation between sales and expenses of the firm? Is it positive or negative? Is it high or low?

Solution: (a) The Scatter Diagram of the given data is shown in Figure 4-2.

Figure 4-2 Scatter Diagram (axes: Sales and Expenses)

(b) Figure 4-2 shows that the plotted points are close to each other and reveal an upward trend. So there is a high degree of positive correlation between sales and expenses of the firm.
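The points of the scatter diagram in Example 4-1 can be prepared in code, together with a crude numerical check of the trend. This is only a rough sketch: comparing average expenses in low-sales and high-sales months is a heuristic introduced here for illustration, not a measure of correlation and not a method taken from these notes.

```python
from statistics import mean, median

sales = [50, 50, 55, 60, 62, 65, 68, 60, 60, 50]      # thousand units
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]   # thousand rupees

# Each (sales, expenses) pair is one dot on the scatter diagram.
points = sorted(zip(sales, expenses))

# Crude trend check: split the months at the median level of sales.
cut = median(sales)
low_sales_expenses = [e for s, e in points if s < cut]
high_sales_expenses = [e for s, e in points if s >= cut]
trend = "upward" if mean(high_sales_expenses) > mean(low_sales_expenses) else "downward or none"
print(points)
print(mean(low_sales_expenses), round(mean(high_sales_expenses), 2), trend)
```

Months with higher sales show higher average expenses, which agrees with the upward trend read off the diagram.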

4.3.2 CORRELATION GRAPH

This method, also known as the correlogram, is very simple. The data pertaining to two series are

plotted on a graph sheet. We can find out the correlation by examining the direction and

closeness of the two curves. If both the curves drawn on the graph are moving in the same direction,

it is a case of positive correlation. On the other hand, if the curves are moving in opposite

directions, correlation is said to be negative. If the graph does not show any definite pattern

on account of erratic fluctuations in the curves, then it shows an absence of correlation.

Example 4-2
Find out graphically whether there is any correlation between yield per plot (qtls), denoted by

Y, and quantity of fertilizer used (kg), denoted by X.

Plot No.: 1 2 3 4 5 6 7 8 9 10
Y: 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3

X: 6 8 9 12 10 15 17 20 18 24

Solution: The Correlogram of the given data is shown in Figure 4-3


Figure 4-3 Correlation Graph (curves of X and Y)

Figure 4-3 shows that the two curves move in the same direction and, moreover, they are very

close to each other, suggesting a close relationship between yield per plot (qtls) and

quantity of fertilizer used (kg).

Remark: Both the graphic methods, scatter diagram and correlation graph, provide a ‘feel’

for the data by giving a visual representation of the association between the variables.

These are readily comprehensible and enable us to form a fairly good, though rough, idea

of the nature and degree of the relationship between the two variables. However, these methods

are unable to quantify the relationship between them. To quantify the extent of correlation, we

make use of algebraic methods - which calculate correlation coefficient.

4.3.3 PEARSON’S COEFFICIENT OF CORRELATION

A mathematical method for measuring the intensity or the magnitude of linear relationship

between two variables was suggested by Karl Pearson (1857-1936), a great British

Biometrician and Statistician, and it is by far the most widely used method in practice.

Karl Pearson’s measure, known as the Pearsonian correlation coefficient between two variables X

and Y, usually denoted by r(X,Y), rxy or simply r, is a numerical measure of linear relationship

between them and is defined as the ratio of the covariance between X and Y to the product

of the standard deviations of X and Y.

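Symbolically,

$$r_{xy} = \frac{\mathrm{Cov}(X,Y)}{S_x \, S_y} \qquad (4.1)$$

When $(X_1,Y_1), (X_2,Y_2), \ldots, (X_N,Y_N)$ are N pairs of observations of the variables X and

Y in a bivariate distribution,

$$\mathrm{Cov}(X,Y) = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{N} \qquad (4.2a)$$

$$S_x = \sqrt{\frac{\sum (X-\bar{X})^2}{N}} \qquad (4.2b)$$

and

$$S_y = \sqrt{\frac{\sum (Y-\bar{Y})^2}{N}} \qquad (4.2c)$$

Thus by substituting Eqs. (4.2) in Eq. (4.1), we can write the Pearsonian correlation
coefficient as

$$r_{xy} = \frac{\frac{1}{N}\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\frac{1}{N}\sum (X-\bar{X})^2}\,\sqrt{\frac{1}{N}\sum (Y-\bar{Y})^2}}$$

$$r_{xy} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \, \sum (Y-\bar{Y})^2}} \qquad (4.3)$$

If we denote $d_x = X - \bar{X}$ and $d_y = Y - \bar{Y}$, then

$$r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \, \sum d_y^2}} \qquad (4.3a)$$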

We can further simplify the calculations of Eqs. (4.2).

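We have

$$\mathrm{Cov}(X,Y) = \frac{1}{N}\sum (X-\bar{X})(Y-\bar{Y})$$

$$= \frac{1}{N}\sum XY - \bar{X}\,\bar{Y}$$

$$= \frac{1}{N}\sum XY - \frac{\sum X}{N}\cdot\frac{\sum Y}{N}$$

$$= \frac{1}{N^2}\left[N\sum XY - \sum X \sum Y\right] \qquad (4.4)$$

and

$$S_x^2 = \frac{1}{N}\sum (X-\bar{X})^2$$

$$= \frac{1}{N}\sum X^2 - (\bar{X})^2$$

$$= \frac{1}{N}\sum X^2 - \left(\frac{\sum X}{N}\right)^2$$

$$= \frac{1}{N^2}\left[N\sum X^2 - \left(\sum X\right)^2\right] \qquad (4.5a)$$

Similarly, we have

$$S_y^2 = \frac{1}{N^2}\left[N\sum Y^2 - \left(\sum Y\right)^2\right] \qquad (4.5b)$$

So the Pearsonian correlation coefficient may be found as

$$r_{xy} = \frac{\frac{1}{N^2}\left[N\sum XY - \sum X \sum Y\right]}{\sqrt{\frac{1}{N^2}\left[N\sum X^2 - \left(\sum X\right)^2\right]}\,\sqrt{\frac{1}{N^2}\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

or

$$r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - \left(\sum X\right)^2}\,\sqrt{N\sum Y^2 - \left(\sum Y\right)^2}} \qquad (4.6)$$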

Remark: Eq. (4.3) or Eq. (4.3a) is quite convenient to apply if the means $\bar{X}$ and

$\bar{Y}$ come out to be integers. If $\bar{X}$ or/and $\bar{Y}$ is (are) fractional, then Eq. (4.3) or Eq. (4.3a) is

quite cumbersome to apply, since the computations of $\sum (X-\bar{X})^2$, $\sum (Y-\bar{Y})^2$ and

$\sum (X-\bar{X})(Y-\bar{Y})$ are quite time consuming and tedious. In such a case Eq. (4.6) may be

used provided the values of X or/and Y are small. But if X and Y assume large values, the

calculation of $\sum X^2$, $\sum Y^2$ and $\sum XY$ is again quite time consuming.

Thus if (i) $\bar{X}$ and $\bar{Y}$ are fractional and (ii) X and Y assume large values, Eq. (4.3) and Eq.

(4.6) are not generally used for numerical problems. In such cases, the step deviation method,

where we take the deviations of the variables X and Y from any arbitrary points, is used. We

will discuss this method in the properties of correlation coefficient.
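As a rough illustration, Eq. (4.3a) and Eq. (4.6) can be sketched in Python and checked against each other. The function names `pearson_r_deviations` and `pearson_r_sums` are illustrative choices and not part of the text; the data are the sales and expenses figures of Example 4-1.

```python
from math import sqrt

def pearson_r_deviations(xs, ys):
    """Eq. (4.3a): r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def pearson_r_sums(xs, ys):
    """Eq. (4.6): r computed from the raw sums, without the means."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Data of Example 4-1
sales = [50, 50, 55, 60, 62, 65, 68, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

r_dev = pearson_r_deviations(sales, expenses)
r_sum = pearson_r_sums(sales, expenses)
print(round(r_dev, 4), round(r_sum, 4))  # the two formulas should agree
```

Both functions are algebraically the same measure; which one is more convenient by hand depends on whether the means are whole numbers, as the remark above explains.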

4.3.3.1 Properties of Pearsonian Correlation Coefficient

The following are important properties of Pearsonian correlation coefficient:

1. Pearsonian correlation coefficient cannot exceed 1 numerically. In other words it lies

between –1 and +1. Symbolically,

-1 ≤ r ≤1

Remarks: (i) This property provides us a check on our calculations. If in any problem, the

obtained value of r lies outside the limits ±1, this implies that there is some mistake in our

calculations.

(ii) The sign of r indicates the nature of the correlation. A positive value of r indicates positive

correlation, whereas a negative value indicates negative correlation. r = 0 indicates absence of

correlation.

(iii) The following table sums up the degrees of correlation corresponding to various

values of r:

Value of r          Degree of correlation
±1                  perfect correlation
±0.90 or more       very high degree of correlation
±0.75 to ±0.90      sufficiently high degree of correlation
±0.60 to ±0.75      moderate degree of correlation
±0.30 to ±0.60      only the possibility of a correlation
less than ±0.30     possibly no correlation
0                   absence of correlation

2. Pearsonian Correlation coefficient is independent of the change of origin and scale.

Mathematically, if given variables X and Y are transformed to new variables U and V

by change of origin and scale, i. e.

$$U = \frac{X - A}{h} \quad \text{and} \quad V = \frac{Y - B}{k}$$

Where A, B, h and k are constants and h > 0, k > 0; then the correlation coefficient

between X and Y is same as the correlation coefficient between U and V i.e.,

r(X,Y) = r(U, V) => rxy = ruv

Remark: This is one of the very important properties of the correlation coefficient and is

extremely helpful in numerical computation of r. We had already stated that Eq. (4.3) and

Eq.(4.6) become quite tedious to use in numerical problems if X and/or Y are in fractions or if

X and Y are large. In such cases we can conveniently change the origin and scale (if possible)

in X or/and Y to get new variables U and V and compute the correlation between U and V by

the Eq. (4.7)

$$r_{xy} = r_{uv} = \frac{N\sum UV - \sum U \sum V}{\sqrt{N\sum U^2 - \left(\sum U\right)^2}\,\sqrt{N\sum V^2 - \left(\sum V\right)^2}} \qquad (4.7)$$

3. Two independent variables are uncorrelated but the converse is not true

If X and Y are independent variables then

rxy = 0

However, the converse of the theorem is not true i.e., uncorrelated variables need not

necessarily be independent. As an illustration consider the following bivariate

distribution.

X : 1 2 3 -3 -2 -1
Y : 1 4 9 9 4 1

For this distribution, value of r will be 0.

Hence in the above example the variable X and Y are uncorrelated. But if we examine

the data carefully we find that X and Y are not independent but are connected by the

relation $Y = X^2$. The above example illustrates that uncorrelated variables need not be

independent.

Remarks: One should not confuse uncorrelatedness with independence.

rxy = 0, i.e., uncorrelatedness of the variables X and Y, simply implies the absence of any linear

(straight line) relationship between them. They may, however, be related in some other form

other than straight line e.g., quadratic (as we have seen in the above example), logarithmic or

trigonometric form.

4. Pearsonian coefficient of correlation is the geometric mean of the two regression

coefficients, i.e.

$$r_{xy} = \pm\sqrt{b_{xy} \cdot b_{yx}}$$

The signs of both the regression coefficients are the same, and so the value of r will

also have the same sign.

This property will be dealt with in detail in the next lesson on Regression Analysis.

5. The square of Pearsonian correlation coefficient is known as the coefficient of

determination.

The coefficient of determination, which measures the percentage variation in the dependent

variable that is accounted for by the independent variable, is a much better and more useful

measure for interpreting the value of r. This property will also be dealt with in detail

in the next lesson.
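Properties 1, 2, 3 and 5 can be checked numerically. The following is a minimal sketch, assuming a helper `pearson_r` (a hypothetical name) built on the raw-sums formula of Eq. (4.6). The origin and scale constants (60 and 5 for X, 14 and 2 for Y) are arbitrary choices made for illustration; the quadratic data are those of Property 3.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearsonian r from raw sums, as in Eq. (4.6)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sum_xy - sum_x * sum_y) / (
        sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    )

# Property 2: change of origin and scale (h > 0, k > 0) leaves r unchanged.
sales = [50, 50, 55, 60, 62, 65, 68, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]
u = [(x - 60) / 5 for x in sales]      # A = 60, h = 5 (arbitrary)
v = [(y - 14) / 2 for y in expenses]   # B = 14, k = 2 (arbitrary)
r_xy = pearson_r(sales, expenses)
r_uv = pearson_r(u, v)

# Property 3: Y = X^2 is a perfect (non-linear) dependence, yet r = 0.
x_q = [1, 2, 3, -3, -2, -1]
y_q = [x * x for x in x_q]
r_quadratic = pearson_r(x_q, y_q)

# Property 5: the coefficient of determination is simply r squared.
determination = r_xy ** 2

print(round(r_xy, 4), round(r_uv, 4), r_quadratic, round(determination, 4))
```

The zero value for the quadratic data arises because the positive and negative products of X and Y cancel exactly, which is why r only detects straight-line association.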

4.3.3.2 Probable Error of Correlation Coefficient

The correlation coefficient measures the relationship between the two variables. After ascertaining

this level of relationship, we may be interested to find the extent up to which this coefficient is

dependable. The probable error of the correlation coefficient is a measure for testing the

reliability of the observed value of the correlation coefficient, when we consider it as satisfying

the conditions of random sampling.

If r is the observed value of the correlation coefficient in a sample of N pairs of observations

for the two variables under consideration, then the Probable Error, denoted by PE (r) is

expressed as

$$PE(r) = 0.6745 \; SE(r)$$

or

$$PE(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{N}}$$

There are two main functions of probable error:

1. Determination of limits: The limits of population correlation coefficient are r ±

PE(r), implying that if we take another random sample of the size N from the same

population, then the observed value of the correlation coefficient in the second sample

can be expected to lie within the limits given above, with 0.5 probability. When sample

size N is small, the concept or value of PE may lead to wrong

conclusions. Hence to use the concept of PE effectively, the sample size N should be

fairly large.

2. Interpretation of 'r': The interpretation of 'r' based on PE is as under:

➢ If r < PE(r), there is no evidence of correlation, i.e. a case of insignificant

correlation.

➢ If r > 6 PE(r), correlation is significant. If r < 6 PE(r), it is insignificant.

➢ If the probable error is small, correlation exists where r > 0.5
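The probable error and the rules of interpretation above can be sketched as follows. The function names are hypothetical, and comparing the absolute value of r with PE(r) is an assumption made here so that negative coefficients are handled as well; the text states the rules for r itself.

```python
from math import sqrt

def probable_error(r, n):
    """PE(r) = 0.6745 * (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

def pe_limits(r, n):
    """Limits r - PE(r) and r + PE(r) for the population coefficient."""
    pe = probable_error(r, n)
    return r - pe, r + pe

def interpret(r, n):
    """Apply the PE-based rules; abs(r) is used (an assumption) for negative r."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "no evidence of correlation"
    if abs(r) > 6 * pe:
        return "significant"
    return "not significant"

# Illustrative values: r = 0.9 observed in a sample of N = 10 pairs
pe = probable_error(0.9, 10)
low, high = pe_limits(0.9, 10)
print(round(pe, 4), round(low, 4), round(high, 4), interpret(0.9, 10))
```

As the text cautions, these limits are only dependable when N is fairly large, so the small N used here is purely for arithmetic illustration.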

Example 4-3
Find the Pearsonian correlation coefficient between sales (in thousand units) and expenses (in

thousand rupees) of the following 10 firms:

Firm: 1 2 3 4 5 6 7 8 9 10
Sales: 50 50 55 60 65 65 65 60 60 50
Expenses: 11 13 14 16 16 15 15 14 13 13

Solution: Let sales of a firm be denoted by X and expenses be denoted by Y

Calculations for Coefficient of Correlation

{Using Eq. (4.3) or (4.3a)}

Firm     X      Y     dx = X − X̄    dy = Y − Ȳ    dx²    dy²    dx·dy
1        50     11        -8            -3         64      9      24
2        50     13        -8            -1         64      1       8
3        55     14        -3             0          9      0       0
4        60     16         2             2          4      4       4
5        65     16         7             2         49      4      14
6        65     15         7             1         49      1       7
7        65     15         7             1         49      1       7
8        60     14         2             0          4      0       0
9        60     13         2            -1          4      1      -2
10       50     13        -8            -1         64      1       8
Total   580    140                                360     22      70

$$\bar{X} = \frac{\sum X}{N} = \frac{580}{10} = 58 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{140}{10} = 14$$

Applying Eq. (4.3a), we have the Pearsonian coefficient of correlation

$$r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \, \sum d_y^2}} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{\sqrt{7920}} = \frac{70}{88.99} \approx 0.79$$

The value of rxy ≈ 0.79 indicates a high degree of positive correlation between sales and expenses.
Example 4-4
The data on price and quantity purchased relating to a commodity for 5 months is given

below:

Month : January February March April May


Prices(Rs): 10 10 11 12 12
Quantity(Kg): 5 6 4 3 3

Find the Pearsonian correlation coefficient between prices and quantity and comment on its

sign and magnitude.

Solution: Let price of the commodity be denoted by X and quantity be denoted by Y

Calculations for Coefficient of Correlation

{Using Eq. (4.6)}

Month     X     Y     X²     Y²     XY
1        10     5    100     25     50
2        10     6    100     36     60
3        11     4    121     16     44
4        12     3    144      9     36
5        12     3    144      9     36
Total    55    21    609     95    226

Applying Eq. (4.6), we have the Pearsonian coefficient of correlation

$$r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - \left(\sum X\right)^2}\,\sqrt{N\sum Y^2 - \left(\sum Y\right)^2}}$$

$$r_{xy} = \frac{5 \times 226 - 55 \times 21}{\sqrt{(5 \times 609 - 55 \times 55)(5 \times 95 - 21 \times 21)}}$$

$$r_{xy} = \frac{1130 - 1155}{\sqrt{20 \times 34}} = \frac{-25}{\sqrt{680}} = \frac{-25}{26.08}$$

$$r_{xy} \approx -0.96$$

The negative sign of r indicates negative correlation and its large magnitude indicates a very high

degree of correlation. So there is a high degree of negative correlation between prices and

quantity demanded.
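The column totals and the final ratio of Example 4-4 can be reproduced with a few lines of Python. This is only a sketch of the arithmetic of Eq. (4.6); the variable names are illustrative.

```python
from math import sqrt

price = [10, 10, 11, 12, 12]   # X
quantity = [5, 6, 4, 3, 3]     # Y

n = len(price)
sum_x = sum(price)
sum_y = sum(quantity)
sum_x2 = sum(x * x for x in price)
sum_y2 = sum(y * y for y in quantity)
sum_xy = sum(x * y for x, y in zip(price, quantity))

# Eq. (4.6): numerator and denominator built from the column totals
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(numerator, round(r, 2))
```

Evaluated this way, the ratio −25/√680 comes to roughly −0.96, a very high degree of negative correlation.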

Example 4-5
Find the Pearsonian correlation coefficient from the following series of marks obtained by 10

students in a class test in mathematics (X) and in Statistics (Y):

X: 45 70 65 30 90 40 50 75 85 60
Y: 35 90 70 40 95 40 60 80 80 50

Also calculate the Probable Error.

Solution:
Calculations for Coefficient of Correlation
{Using Eq. (4.7)}

X       Y       U      V      U²     V²     UV
45      35     -3     -6       9     36     18
70      90      2      5       4     25     10
65      70      1      1       1      1      1
30      40     -6     -5      36     25     30
90      95      6      6      36     36     36
40      40     -4     -5      16     25     20
50      60     -2     -1       4      1      2
75      80      3      3       9      9      9
85      80      5      3      25      9     15
60      50      0     -3       0      9      0
Total           2     -2     140    176    141

We have defined the variables U and V as

$$U = \frac{X - 60}{5} \quad \text{and} \quad V = \frac{Y - 65}{5}$$

Applying Eq. (4.7),

$$r_{xy} = r_{uv} = \frac{N\sum UV - \sum U \sum V}{\sqrt{N\sum U^2 - \left(\sum U\right)^2}\,\sqrt{N\sum V^2 - \left(\sum V\right)^2}}$$

$$= \frac{10 \times 141 - 2 \times (-2)}{\sqrt{10 \times 140 - 2 \times 2}\,\sqrt{10 \times 176 - (-2) \times (-2)}}$$

$$= \frac{1410 + 4}{\sqrt{1400 - 4}\,\sqrt{1760 - 4}}$$

$$= \frac{1414}{\sqrt{2451376}}$$

$$= 0.9$$

So there is a high degree of positive correlation between marks obtained in Mathematics and

in Statistics.

Probable Error, denoted by PE(r), is given as

$$PE(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{N}}$$

$$PE(r) = 0.6745 \times \frac{1 - (0.9)^2}{\sqrt{10}}$$

$$PE(r) = 0.0405$$

Since r is more than six times PE(r), the value of r is highly significant.
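The step deviation working of Example 4-5 can be verified with a short sketch. The helper name `pearson_from_sums` is hypothetical; the origins 60 and 65 and the common scale 5 are the ones used in the example. The check also confirms that r computed on U and V equals r computed on the original marks.

```python
from math import sqrt

def pearson_from_sums(a, b):
    """Eq. (4.6)/(4.7): correlation from the raw sums of two series."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_a2 = sum(p * p for p in a)
    sum_b2 = sum(q * q for q in b)
    sum_ab = sum(p * q for p, q in zip(a, b))
    return (n * sum_ab - sum_a * sum_b) / (
        sqrt(n * sum_a2 - sum_a ** 2) * sqrt(n * sum_b2 - sum_b ** 2)
    )

maths = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]   # X
stats = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]   # Y

# Step deviations; every mark is a multiple of 5, so the division is exact.
u = [(x - 60) // 5 for x in maths]
v = [(y - 65) // 5 for y in stats]

totals = (
    sum(u),
    sum(v),
    sum(p * p for p in u),
    sum(q * q for q in v),
    sum(p * q for p, q in zip(u, v)),
)
r_uv = pearson_from_sums(u, v)
r_xy = pearson_from_sums(maths, stats)
print(totals, round(r_uv, 4), round(r_xy, 4))
```

Working with U and V keeps every intermediate number small, which is the whole point of the method when calculating by hand.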

4.3.4 SPEARMAN’S RANK CORRELATION

Sometimes we come across statistical series in which the variables under consideration are

not capable of quantitative measurement but can be arranged in serial order. This happens when

we are dealing with qualitative characteristics (attributes) such as honesty, beauty, character,

morality, etc., which cannot be measured quantitatively but can be arranged serially. In such

situations Karl Pearson’s coefficient of correlation cannot be used as such. Charles Edward

Spearman, a British Psychologist, developed a formula in 1904, which consists in obtaining the

correlation coefficient between the ranks of N individuals in the two attributes under study.

Suppose we want to find if two characteristics A, say, intelligence and B, say, beauty are related

or not. Both the characteristics are incapable of quantitative measurements but we can arrange

a group of N individuals in order of merit (ranks) w.r.t. proficiency in the two characteristics.

Let the random variables X and Y denote the ranks of the individuals in the characteristics A

and B respectively. If we assume that there is no tie, i.e., if no two individuals get the same

rank in a characteristic then, obviously, X and Y assume numerical values ranging from 1 to N.

The Pearsonian correlation coefficient between the ranks X and Y is called the rank correlation

coefficient between the characteristics A and B for the group of individuals.

Spearman’s rank correlation coefficient, usually denoted by ρ (Rho), is given by the equation

$$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} \qquad (4.8)$$

where d is the difference between the pair of ranks of the same individual in the two

characteristics and N is the number of pairs.

Example 4-6
Ten entries are submitted for a competition. Three judges study each entry and list the ten in

rank order. Their rankings are as follows:

Entry: A B C D E F G H I J
Judge J1: 9 3 7 5 1 6 2 4 10 8
Judge J2: 9 1 10 4 3 8 5 2 7 6
Judge J3: 6 3 8 7 2 4 1 5 9 10

Calculate the appropriate rank correlation to help you answer the following questions:

(i) Which pair of judges agrees the most?


(ii) Which pair of judges disagrees the most?
Solution:
Calculations for Coefficient of Rank Correlation
{Using Eq. (4.8)}

         Rank by Judges          Difference in Ranks
Entry    J1    J2    J3     d(J1&J2)   d²    d(J1&J3)   d²    d(J2&J3)   d²
A         9     9     6        0        0       +3       9       +3       9
B         3     1     3       +2        4        0       0       -2       4
C         7    10     8       -3        9       -1       1       +2       4
D         5     4     7       +1        1       -2       4       -3       9
E         1     3     2       -2        4       -1       1       +1       1
F         6     8     4       -2        4       +2       4       +4      16
G         2     5     1       -3        9       +1       1       +4      16
H         4     2     5       +2        4       -1       1       -3       9
I        10     7     9       +3        9       +1       1       -2       4
J         8     6    10       +2        4       -2       4       -4      16
Total                              Σd² = 48         Σd² = 26         Σd² = 88

$$\rho(J1 \,\&\, J2) = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 48}{10(10^2 - 1)} = 1 - \frac{288}{990} = 1 - 0.29 = +0.71$$

$$\rho(J1 \,\&\, J3) = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 26}{10(10^2 - 1)} = 1 - \frac{156}{990} = 1 - 0.1576 = +0.8424$$

$$\rho(J2 \,\&\, J3) = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 88}{10(10^2 - 1)} = 1 - \frac{528}{990} = 1 - 0.53 = +0.47$$

So (i) Judges J1 and J3 agree the most

(ii) Judges J2 and J3 disagree the most
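The three pairwise coefficients of Example 4-6 can be computed in one pass. The sketch below assumes untied ranks, as Eq. (4.8) requires; `spearman_rho` and the dictionary layout are illustrative names only.

```python
from itertools import combinations

def spearman_rho(rank_a, rank_b):
    """Eq. (4.8): rho = 1 - 6 * sum(d^2) / (N (N^2 - 1)), valid without ties."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judges = {
    "J1": [9, 3, 7, 5, 1, 6, 2, 4, 10, 8],
    "J2": [9, 1, 10, 4, 3, 8, 5, 2, 7, 6],
    "J3": [6, 3, 8, 7, 2, 4, 1, 5, 9, 10],
}

# One coefficient per pair of judges
results = {
    (a, b): spearman_rho(judges[a], judges[b])
    for a, b in combinations(judges, 2)
}
most_agree = max(results, key=results.get)
most_disagree = min(results, key=results.get)

for pair, rho in results.items():
    print(pair, round(rho, 2))
print("agree most:", most_agree, "disagree most:", most_disagree)
```

The largest coefficient identifies the pair whose rankings are closest, and the smallest identifies the pair that differs most.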

Spearman’s rank correlation Eq.(4.8) can also be used even if we are dealing with variables,

which are measured quantitatively, i.e. when the actual data but not the ranks relating to two

variables are given. In such a case we shall have to convert the data into ranks. The highest

(or the smallest) observation is given the rank 1. The next highest (or the next lowest)

observation is given rank 2 and so on. It is immaterial in which way (descending or ascending)

the ranks are assigned. However, the same approach should be followed for all the variables

under consideration.

Example 4-7
Calculate the rank coefficient of correlation from the following data:
X: 75 88 95 70 60 80 81 50
Y: 120 134 150 115 110 140 142 100

Solution:
Calculations for Coefficient of Rank Correlation
{Using Eq.(4.8)}
X      Rank RX     Y      Rank RY    d = RX − RY    d²
75        5       120        5            0          0
88        2       134        4           -2          4
95        1       150        1            0          0
70        6       115        6            0          0
60        7       110        7            0          0
80        4       140        3           +1          1
81        3       142        2           +1          1
50        8       100        8            0          0
                                               Σd² = 6

$$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 6}{8(8^2 - 1)} = 1 - \frac{36}{504} = 1 - 0.07 = +0.93$$

Hence, there is a high degree of positive correlation between X and Y.
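Converting actual values into ranks, as done in Example 4-7, can be sketched as below. The helper `ranks_descending` is a hypothetical name and assumes that no value is repeated; rank 1 goes to the highest observation in both series.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(rank_a, rank_b):
    """Eq. (4.8) for untied ranks."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

x = [75, 88, 95, 70, 60, 80, 81, 50]
y = [120, 134, 150, 115, 110, 140, 142, 100]

rank_x = ranks_descending(x)
rank_y = ranks_descending(y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
rho = spearman_rho(rank_x, rank_y)
print(rank_x, rank_y, sum_d2, round(rho, 2))
```

Ranking both series in ascending order instead would give the same coefficient, provided the same direction is used for X and Y.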

Repeated Ranks

In case of attributes if there is a tie i.e., if any two or more individuals are placed together in

any classification w.r.t. an attribute or if in case of variable data there is more than one item

with the same value in either or both the series then Spearman’s Eq.(4.8) for calculating the

rank correlation coefficient breaks down, since in this case the variables X [the ranks of

individuals in characteristic A (1st series)] and Y [the ranks of individuals in characteristic B

(2nd series)] do not take the values from 1 to N.

In this case common ranks are assigned to the repeated items. These common ranks are the

arithmetic mean of the ranks, which these items would have got if they were different from

each other and the next item will get the rank next to the rank used in computing the common

rank. For example, suppose an item is repeated at rank 4. Then the common rank to be assigned

to each item is (4+5)/2, i.e., 4.5 which is the average of 4 and 5, the ranks which these

observations would have assumed if they were different. The next item will be assigned the

rank 6. If an item is repeated thrice at rank 7, then the common rank to be assigned to each

value will be (7+8+9)/3, i.e., 8 which is the arithmetic mean of 7,8 and 9 viz., the ranks these

observations would have got if they were different from each other. The next rank to be

assigned will be 10.

If only a small proportion of the ranks are tied, this technique may be applied together with

Eq. (4.8). If a large proportion of ranks are tied, it is advisable to apply an adjustment or a

correction factor to Eq. (4.8) as explained below:

“In Eq. (4.8), add the factor

$$\frac{m(m^2 - 1)}{12} \qquad (4.8a)$$

to $\sum d^2$, where m is the number of times an item is repeated. This correction factor is to be

added for each repeated value in both the series”.

Example 4-8
For a certain joint stock company, the prices of preference shares (X) and debentures (Y) are
given below:
X: 73.2 85.8 78.9 75.8 77.2 81.2 83.8
Y: 97.8 99.2 98.8 98.3 98.3 96.7 97.1

Use the method of rank correlation to determine the relationship between preference share prices
and debenture prices.
Solution:
Calculations for Coefficient of Rank Correlation
{Using Eq. (4.8) and (4.8a)}
X       Y       Rank of X (XR)    Rank of Y (YR)    d = XR − YR    d²
73.2    97.8          7                 5                2          4
85.8    99.2          1                 1                0          0
78.9    98.8          4                 2                2          4
75.8    98.3          6                 3.5              2.5        6.25
77.2    98.3          5                 3.5              1.5        2.25
81.2    96.7          3                 7               -4         16
83.8    97.1          2                 6               -4         16
                                                      Σd = 0    Σd² = 48.50

In this case, due to repeated values of Y, we have to apply ranking as average of 2 ranks, which

could have been allotted, if they were different values. Thus ranks 3 and 4 have been allotted

as 3.5 to both the values of Y = 98.3. Now we also have to apply the correction factor

$\frac{m(m^2 - 1)}{12}$ to $\sum d^2$, where m is the number of times the value is repeated; here m = 2.

$$\rho = 1 - \frac{6\left[\sum d^2 + \frac{m(m^2 - 1)}{12}\right]}{N(N^2 - 1)}$$

$$= 1 - \frac{6\left[48.5 + \frac{2(4 - 1)}{12}\right]}{7(7^2 - 1)}$$

$$= 1 - \frac{6 \times 49}{7 \times 48}$$

$$= 0.125$$

Hence, there is a very low degree of positive correlation, probably no correlation,

between preference share prices and debenture prices.
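The treatment of repeated ranks can be sketched in Python as follows. The function names are hypothetical; the logic follows the text: tied values share the arithmetic mean of the ranks they would otherwise occupy, and the factor m(m² − 1)/12 of Eq. (4.8a) is added to Σd² once for each repeated value in either series.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 for the largest value; tied values share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    counts = Counter(values)
    # A block of m tied values occupies ranks first .. first + m - 1,
    # whose arithmetic mean is first + (m - 1) / 2.
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times (Eq. 4.8a)."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho_with_ties(xs, ys):
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n * n - 1))

# Data of Example 4-8
preference = [73.2, 85.8, 78.9, 75.8, 77.2, 81.2, 83.8]
debenture = [97.8, 99.2, 98.8, 98.3, 98.3, 96.7, 97.1]

rank_y = average_ranks(debenture)
rho = spearman_rho_with_ties(preference, debenture)
print(rank_y, rho)
```

With no repeated values the correction term is zero and the function reduces to Eq. (4.8).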

Remarks on Spearman’s Rank Correlation Coefficient

1. We always have Σd = 0, which provides a check for numerical calculations.


2. Since Spearman’s rank correlation coefficient, ρ, is nothing but Karl Pearson’s

correlation coefficient, r, between the ranks, it can be interpreted in the same way

as Karl Pearson’s correlation coefficient.

3. Karl Pearson’s correlation coefficient assumes that the parent population from

which sample observations are drawn is normal. If this assumption is violated then

we need a measure which is distribution free (or non-parametric). Spearman’s ρ

is such a distribution-free measure, since no strict assumptions are made about the

form of the population from which sample observations are drawn.

4. Spearman’s formula is easy to understand and apply as compared to Karl Pearson’s

formula. The values obtained by the two formulae, viz. Pearsonian r and Spearman’s

ρ, are generally different. The difference arises due to the fact that when ranking is

used instead of the full set of observations, there is always some loss of information.

Unless many ties exist, the coefficient of rank correlation should be only slightly

lower than the Pearsonian coefficient.

5. Spearman’s formula is the only formula to be used for finding correlation

coefficient if we are dealing with qualitative characteristics, which cannot be

measured quantitatively but can be arranged serially. It can also be used where

actual data are given. In case of extreme observations, Spearman’s formula is

preferred to Pearson’s formula.

6. Spearman’s formula has its limitations also. It is not practicable in the case of

bivariate frequency distribution. For N >30, this formula should not be used unless

the ranks are given.

4.3.5 CONCURRENT DEVIATION METHOD

This is a casual method of determining the correlation between two series when we are not very

serious about its precision. This is based on the signs of the deviations (i.e. the

direction of the change) of the values of the variable from its preceding value and does not take

into account the exact magnitude of the values of the variables. Thus we put a plus (+) sign,

minus (-) sign or equality (=) sign for the deviation if the value of the variable is greater than,

less than or equal to the preceding value respectively. The deviations in the values of two

variables are said to be concurrent if they have the same sign (either both deviations are positive

or both are negative or both are equal). The formula used for computing correlation coefficient

rc by this method is given by

$$r_c = \pm\sqrt{\pm\left(\frac{2c - N}{N}\right)} \qquad (4.9)$$

Where c is the number of pairs of concurrent deviations and N is the number of pairs of

deviations. If (2c-N) is positive, we take positive sign in and outside the square root in Eq. (4.9)

and if (2c-N) is negative, we take negative sign in and outside the square root in Eq. (4.9).

Remarks: (i) It should be clearly noted that here N is not the number of pairs of observations

but it is the number of pairs of deviations and as such it is one less than the number of pairs of

observations.

(ii) Coefficient of concurrent deviations is primarily based on the following principle:

“If the short time fluctuations of the time series are positively correlated or in other
words, if their deviations are concurrent, their curves would move in the same direction
and would indicate positive correlation between them”
Example 4-9

Calculate coefficient of correlation by the concurrent deviation method

Supply: 112 125 126 118 118 121 125 125 131 135
Price: 106 102 102 104 98 96 97 97 95 90

Solution:
Calculations for Coefficient of Concurrent Deviations
{Using Eq. (4.9)}

Supply (X)   Sign of deviation from    Price (Y)   Sign of deviation from    Concurrent
             preceding value (X)                   preceding value (Y)       deviations
112                                    106
125                   +                102                  -
126                   +                102                  =
118                   -                104                  +
118                   =                 98                  -
121                   +                 96                  -
125                   +                 97                  +                  + (c)
125                   =                 97                  =                  = (c)
131                   +                 95                  -
135                   +                 90                  -

We have
Number of pairs of deviations, N = 10 – 1 = 9
c = Number of concurrent deviations
= Number of deviations having like signs
= 2
Coefficient of correlation by the method of concurrent deviations is given by:

$$r_c = \pm\sqrt{\pm\left(\frac{2c - N}{N}\right)} = \pm\sqrt{\pm\left(\frac{2 \times 2 - 9}{9}\right)} = \pm\sqrt{\pm\left(\frac{-5}{9}\right)}$$

Since 2c – N = -5 (negative), we take the negative sign inside and outside the square root:

$$r_c = -\sqrt{-\left(\frac{-5}{9}\right)} = -\sqrt{0.5556} \approx -0.75$$

Hence there is a fairly good degree of negative correlation between supply and price.
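The counting of concurrent deviations and the sign convention of Eq. (4.9) can be sketched as follows. The function names are illustrative; a deviation of zero corresponds to the equality (=) sign used in the table.

```python
from math import sqrt

def change_sign(previous, current):
    """+1 for a rise, -1 for a fall, 0 for no change from the preceding value."""
    return (current > previous) - (current < previous)

def concurrent_deviation(xs, ys):
    """Return (r_c, c, N) following Eq. (4.9)."""
    signs_x = [change_sign(a, b) for a, b in zip(xs, xs[1:])]
    signs_y = [change_sign(a, b) for a, b in zip(ys, ys[1:])]
    n = len(signs_x)                      # pairs of deviations, one fewer than observations
    c = sum(1 for a, b in zip(signs_x, signs_y) if a == b)
    ratio = (2 * c - n) / n
    magnitude = sqrt(abs(ratio))          # same sign is taken inside and outside the root
    return (magnitude if ratio >= 0 else -magnitude), c, n

# Data of Example 4-9
supply = [112, 125, 126, 118, 118, 121, 125, 125, 131, 135]
price = [106, 102, 102, 104, 98, 96, 97, 97, 95, 90]

r_c, c, n = concurrent_deviation(supply, price)
print(c, n, round(r_c, 3))
```

Because only the direction of each change is used, very different magnitudes of change in the two series produce the same coefficient, which is why the method is suitable only for a quick, rough assessment.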

4.4 LIMITATIONS OF CORRELATION ANALYSIS

As mentioned earlier, correlation analysis is a statistical tool, which should be properly used so

that correct results can be obtained. Sometimes, it is indiscriminately used by management,

resulting in misleading conclusions. We give below some errors frequently made in the use of

correlation analysis:

1. Correlation analysis cannot determine cause-and-effect relationship. One should not

assume that a change in Y variable is caused by a change in X variable unless one is

reasonably sure that one variable is the cause while the other is the effect. Let us take

an example.

Suppose that we study the performance of students in their graduate examination and

their earnings after, say, three years of their graduation. We may find that these two

variables are highly and positively related. At the same time, we must not forget that

both the variables might have been influenced by some other factors such as quality of

teachers, economic and social status of parents, effectiveness of the interviewing

process and so forth. If the data on these factors are available, then it is worthwhile to

use multiple correlation analysis instead of bivariate one.

2. Another mistake that occurs frequently is on account of misinterpretation of the

coefficient of correlation. Suppose in one case r = 0.7, it will be wrong to interpret

that correlation explains 70 percent of the total variation in Y. The error can be seen

easily when we calculate the coefficient of determination. Here, the coefficient of

determination r² will be 0.49. This means that only 49 percent of the total variation in

Y is explained.

Similarly, the coefficient of determination is misinterpreted if it is also used to indicate

causal relationship, that is, the percentage of the change in one variable is due to the

change in another variable.

3. Another mistake in the interpretation of the coefficient of correlation occurs when one

concludes a positive or negative relationship even though the two variables are actually

unrelated. For example, the age of students and their score in the examination have no

relation with each other. The two variables may show similar movements but there does

not seem to be a common link between them.

To sum up, one has to be extremely careful while interpreting the coefficient of correlation.

Before one concludes a causal relationship, one has to consider other relevant factors that might

have any influence on the dependent variable or on both the variables. Such an approach will

avoid many of the pitfalls in the interpretation of the coefficient of correlation. It has been

rightly said that the coefficient of correlation is not only one of the most widely used, but also

one of the most widely abused statistical measures.

4.5 SELF-ASSESSMENT QUESTIONS


1. “Correlation and Regression are two sides of the same coin”. Explain.

2. Explain the meaning and significance of the concept of correlation. Does correlation

always signify causal relationships between two variables? Explain with illustration.

On what basis can the following correlations be criticized?

(a) Over a period of time there has been an increased financial aid to under developed

countries and also an increase in comedy act television shows. The correlation is

almost perfect.

(b) The correlation between salaries of school teachers and amount of liquor sold

during the period 1940 – 1980 was found to be 0.96

3. Write short notes on the following

(a) Spurious correlation

(b) Positive and negative correlation

(c) Linear and non-linear correlation

(d) Simple, multiple and partial correlation

4. What is a scatter diagram? How does it help in studying correlation between two

variables, in respect of both its nature and extent?

5. Write short notes on the following

(a) Karl Pearson’s coefficient of correlation

(b) Probable Error

(c) Spearman’s Rank Correlation Coefficient

(d) Coefficient of Concurrent Deviation

6. Draw a scatter diagram from the data given below and interpret it.

X: 10 20 30 40 50 60 70 80
Y: 32 20 24 36 40 28 38 44

7. Calculate Karl Pearson’s coefficient of correlation between expenditure on advertising

(X) and sales (Y) from the data given below:

X: 39 65 62 90 82 75 25 98 36 78
Y: 47 53 58 86 62 68 60 91 51 84

8. To study the effectiveness of an advertisement a survey is conducted by calling

people at random by asking the number of advertisements read or seen in a week (X)

and the number of items purchased (Y) in that week.

X: 5 10 4 0 2 7 3 6
Y: 10 12 5 2 1 3 4 8

Calculate the correlation coefficient and comment on the result.

9. Calculate coefficient of correlation between X and Y series from the following data

and calculate its probable error also.

X: 78 89 96 69 59 79 68 61
Y: 125 137 156 112 107 136 123 108

10. In two sets of variables X and Y, with 50 observations each, the following data are

observed:

Mean of X = 10, SD of X = 3

Mean of Y = 6, SD of Y = 2, rxy = 0.3

However, on subsequent verification, it was found that one value of X (=10) and one

value of Y (= 6) were inaccurate and hence weeded out with the remaining 49 pairs of

values. How is the original value of rxy = 0.3 affected?

11. Calculate coefficient of correlation r between the marks in statistics (X) and

Accountancy (Y) of 10 students from the following:

X: 52 74 93 55 41 23 92 64 40 71
Y: 45 80 63 60 35 40 70 58 43 64

Also determine the probable error of r.

12. The coefficient of correlation between two variables X and Y is 0.48. The covariance

is 36. The variance of X is 16. Find the standard deviation of Y.

13. Ten entries in a painting competition were ranked by two judges as shown below:

Entry: A B C D E F G H I J
Judge I: 5 2 3 4 1 6 8 7 10 9
Judge II: 4 5 2 1 6 7 10 9 3 8

Find the coefficient of rank correlation.

14. Calculate Spearman’s rank correlation coefficient between advertisement cost (X) and

sales (Y) from the following data:

X: 39 65 62 90 82 75 25 98 36 78
Y: 47 53 58 86 62 68 60 91 51 84

15. An examination of eight applicants for a clerical post was taken by a firm. From the

marks obtained by the applicants in the Accountancy (X) and Statistics (Y) paper,

compute rank coefficient of correlation.

Applicant: A B C D E F G H
X: 15 20 28 12 40 60 20 80
Y: 40 30 50 30 20 10 30 60

16. Calculate the coefficient of concurrent deviation from the following data:

Year: 1993 1994 1995 1996 1997 1998 1999 2000


Supply: 160 164 172 182 166 170 178 192
Price: 222 280 260 224 266 254 230 190

17. Obtain a suitable measure of correlation from the following data regarding changes in

price index of the shares A and B during nine months of a year:

Month: A M J J A S O N D
A: +4 +3 +2 -1 -3 +4 -5 +1 +2
B: -2 +5 +3 -2 -1 -3 +4 -1 -3

18. The cross-classification table shows the marks obtained by 105 students in the

subjects of Statistics and Finance:

                         Marks in Statistics
Marks in Finance    50-54    55-59    60-64    65-74    Total
50-59                 4        6        8        7       25
60-69                 -       10       12       13       35
70-79                16        9       20        -       45
80-89                 -        -        -        -        -
Total                20       25       40       20      105

Find the coefficient of correlation between marks obtained in two subjects.

REGRESSION ANALYSIS

...if we find any association between two or more variables, we might be interested in
estimating the value of one variable for known value(s) of another variable(s)

5.1 INTRODUCTION
In business, several times it becomes necessary to have some forecast so that the management

can take a decision regarding a product or a particular course of action. In order to make a

forecast, one has to ascertain some relationship between two or more variables relevant to a

particular situation. For example, a company is interested to know how far the demand for

television sets will increase in the next five years, keeping in mind the growth of population

in a certain town. Here, it clearly assumes that the increase in population will lead to an

increased demand for television sets. Thus, to determine the nature and extent of relationship

between these two variables becomes important for the company.

In the preceding lesson, we studied in some depth linear correlation between two variables.

Here we have a similar concern, the association between variables, except that we develop it

further in two respects. First, we learn how to build statistical models of relationships between

the variables to have a better understanding of their features. Second, we extend the models to

consider their use in forecasting.

For this purpose, we use the technique of regression analysis, which forms the subject matter of this lesson.

5.2 WHAT IS REGRESSION?

In 1889, Sir Francis Galton, a cousin of Charles Darwin, published a work on heredity, “Natural Inheritance”. He reported his discovery that the sizes of seeds of sweet pea plants appeared to “revert”, or “regress”, to the mean size in successive generations. He also reported the results of a study of the relationship between the heights of fathers and the heights of their sons. A straight line was fitted to the data pairs: height of father versus height of son. Here, too, he found a “regression to mediocrity”: the heights of the sons represented a movement away from those of their fathers, towards the average height. We credit Galton with the idea of statistical regression.

While most applications of regression analysis may have little to do with the

“regression to the mean” discovered by Galton, the term “regression” remains. It

now refers to the statistical technique of modeling the relationship between two or

more variables. In general sense, regression analysis means the estimation or prediction

of the unknown value of one variable from the known value(s) of the other variable(s).

It is one of the most important and widely used statistical techniques in almost all

sciences - natural, social or physical.

In this lesson we will focus only on simple regression, that is, linear regression involving only two variables: a dependent variable and an independent variable. Regression analysis for studying more than two variables at a time is known as multiple regression.

5.2.1 INDEPENDENT AND DEPENDENT VARIABLES

Simple regression involves only two variables; one variable is predicted by another variable.

The variable to be predicted is called the dependent variable. The predictor is called the

independent variable, or explanatory variable. For example, when we are trying to predict

the demand for television sets on the basis of population growth, we are using the demand for

television sets as the dependent variable and the population growth as the independent or

predictor variable.

Deciding which variable is which sometimes causes problems. Often the choice is

obvious, as in case of demand for television sets and population growth because it would make

no sense to suggest that population growth could be dependent on TV demand! The population

growth has to be the independent variable and the TV demand the dependent variable.

If we are unsure, here are some points that might be of use:

➢ if we have control over one of the variables, then that one is the independent variable. For example, a manufacturer can decide how much to spend on advertising and expect his sales to depend upon how much he spends

➢ if there is any lapse of time between the two variables being measured, then the one measured later must depend upon the one measured earlier; it cannot be the other way round

➢ if we want to predict the values of one variable from our knowledge of the other variable, the variable to be predicted must be the dependent one

5.3 LINEAR REGRESSION

The task of bringing out linear relationship consists of developing methods of fitting a

straight line, or a regression line as is often called, to the data on two variables.

The line of regression is the graphical representation of the best estimate of one variable for any given value of the other variable. The nomenclature of the line depends on which variable is independent and which is dependent. If X and Y are the two variables whose relationship is to be indicated, the line that gives the best estimate of Y for any given value of X is called the regression line of Y on X. If X is instead the dependent variable, the line that gives the best estimate of X for any given value of Y is called the regression line of X on Y.

5.3.1 REGRESSION LINE OF Y ON X

For purposes of illustration as to how a straight line relationship is obtained, consider the

sample paired data on sales of each of the N = 5 months of a year and the marketing expenditure

incurred in each month, as shown in Table 5-1

Table 5-1
Month      Sales (Rs lac)    Marketing Expenditure (Rs thousands)
           Y                 X
April      14                10
May        17                12
June       23                15
July       21                20
August     25                23

Let Y, the sales, be the dependent variable and X, the marketing expenditure, the independent

variable. We note that for each value of independent variable X, there is a specific value of

the dependent variable Y, so that each value of X and Y can be seen as paired observations.

5.3.1.1 Scatter Diagram

Before obtaining a straight-line relationship, it is necessary to check whether the relationship between the two variables is linear, that is, whether it is best explained by a straight line. A good way of doing this is to plot the data on X and Y on a graph so as to yield a scatter diagram, as may be seen in Figure 5-1. A careful reading of the scatter diagram reveals that:

➢ the overall tendency of the points is to move upward, so the relationship is positive

➢ the general course of movement of the various points on the diagram can be best

explained by a straight line

➢ there is a high degree of correlation between the variables, as the points lie close to a straight line

Figure 5-1 Scatter Diagram with Line of Best Fit

5.3.1.2 Fitting a Straight Line on the Scatter Diagram

If the movement of various points on the scatter diagram is best described by a straight line,

the next step is to fit a straight line on the scatter diagram. It has to be so fitted that on the whole

it lies as close as possible to every point on the scatter diagram. The necessary requirement for meeting this condition is that the sum of the squares of the vertical deviations of the observed Y values from the straight line is a minimum.

As shown in Figure 5-1, if d1, d2, ..., dN are the vertical deviations of the observed Y values from the straight line, fitting a straight line requires that

$$ d_1^2 + d_2^2 + \dots + d_N^2 = \sum_{j=1}^{N} d_j^2 $$

is the minimum. The deviations dj have to be squared to avoid negative deviations canceling

out the positive deviations. Since a straight line so fitted best approximates all the points on the

scatter diagram, it is better known as the best approximating line or the line of best fit. A line

of best fit can be fitted by means of:

1. Free hand drawing method, and

2. Least square method

Free Hand Drawing:

Free hand drawing is the simplest method of fitting a straight line. After a careful

inspection of the movement and spread of various points on the scatter diagram, a

straight line is drawn through these points by using a transparent ruler such that on the

whole it is closest to every point. A straight line so drawn is particularly useful when

future approximations of the dependent variable are promptly required.

Whereas the use of free hand drawing may yield a line nearest to the line of best fit, the major

drawback is that the slope of the line so drawn varies from person to person because of the

influence of subjectivity. Consequently, the values of the dependent variable estimated on the

basis of such a line may not be as accurate and precise as those based on the line of best fit.

Least Square Method:

The least square method of fitting a line of best fit requires minimizing the sum of the

squares of vertical deviations of each observed Y value from the fitted line. These deviations,

such as d1 and d3, are shown in Figure 5-1 and are given by Y - Yc, where Y is the observed value and Yc the corresponding computed value given by the fitted line

$$ Y_c = a + bX_i \tag{5.1} $$

for the ith value of X.

The straight-line relationship in Eq. (5.1) is stated in terms of two constants, a and b:

➢ The constant a is the Y-intercept; it indicates the height on the vertical axis from

where the straight line originates, representing the value of Y when X is zero.

➢ Constant b is a measure of the slope of the straight line; it shows the absolute change in

Y for a unit change in X. As the slope may be positive or negative, it indicates the nature

of relationship between Y and X. Accordingly, b is also known as the regression

coefficient of Y on X.

Since a straight line is completely defined by its intercept a and slope b, the task of fitting the

same reduces only to the computation of the values of these two constants. Once these two

values are known, the computed Yc values against each value of X can be easily obtained by

substituting X values in the linear equation.

In the method of least squares, the values of a and b are obtained by solving simultaneously the following pair of normal equations:

$$ \begin{aligned} \sum Y &= aN + b\sum X \\ \sum XY &= a\sum X + b\sum X^2 \end{aligned} \tag{5.2} $$

The values of the expressions $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ can be obtained from the given observations and then substituted in the above equations to obtain the values of a and b.

Since solving the two normal equations simultaneously for a and b may quite often be cumbersome and time consuming, the two values can be obtained directly as

$$ a = \bar{Y} - b\bar{X} \tag{5.3} $$

and

$$ b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2} \tag{5.4} $$

Note: Eq. (5.3) is obtained simply by dividing both sides of the first of Eqs. (5.2) by N, and Eq. (5.4) is obtained by substituting $(\bar{Y} - b\bar{X})$ in place of a in the second of Eqs. (5.2).

Instead of directly computing b, we may first compute the value of a as

$$ a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - \left(\sum X\right)^2} \tag{5.5} $$

and

$$ b = \frac{\bar{Y} - a}{\bar{X}} \tag{5.6} $$

Note: Eq. (5.5) is obtained by substituting $\dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2}$ for b in Eq. (5.3), and Eq. (5.6) is obtained simply by rearranging Eq. (5.3).

Table 5-2
Computation of a and b

  Y      X      XY     X^2    Y^2
 14     10     140     100    196
 17     12     204     144    289
 23     15     345     225    529
 21     20     420     400    441
 25     23     575     529    625

$\sum Y = 100$, $\sum X = 80$, $\sum XY = 1684$, $\sum X^2 = 1398$, $\sum Y^2 = 2080$

So, using Eqs. (5.5) and (5.4),

$$ a = \frac{100 \times 1398 - 80 \times 1684}{5 \times 1398 - (80)^2} = \frac{139800 - 134720}{6990 - 6400} = \frac{5080}{590} = 8.6101695 $$

and

$$ b = \frac{5 \times 1684 - 80 \times 100}{5 \times 1398 - (80)^2} = \frac{8420 - 8000}{6990 - 6400} = \frac{420}{590} = 0.7118644 $$

Now, given a = 8.61 and b = 0.71, the regression Eq. (5.1) takes the form

$$ Y_c = 8.61 + 0.71X \tag{5.1a} $$
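The arithmetic above can be reproduced in a few lines of code. The following Python sketch is an illustration added for checking the calculation, not a prescribed method; the function name fit_y_on_x and the variable names are assumptions made for this example. It applies Eq. (5.4) for b and then Eq. (5.3) for a to the data of Table 5-1.

```python
# Least-squares line of Y on X, Yc = a + b*X, for the data of Table 5-1.
# fit_y_on_x is a hypothetical helper name chosen for this illustration.

def fit_y_on_x(xs, ys):
    """Return (a, b) computed from the sums used in Table 5-2."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Eq. (5.4)
    a = sum_y / n - b * sum_x / n                                  # Eq. (5.3)
    return a, b


marketing = [10, 12, 15, 20, 23]  # X, Rs thousands
sales = [14, 17, 23, 21, 25]      # Y, Rs lac

a, b = fit_y_on_x(marketing, sales)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Run as written, this should print a = 8.6102, b = 0.7119, which round to the values 8.61 and 0.71 used in the text.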

Figure 5-2 Regression Line of Y on X

Then, to draw the line of best fit on the scatter diagram, only two computed Yc values are needed. These can be easily obtained by substituting any two values of X in Eq. (5.1a). When these are plotted on the diagram against their corresponding values of X, we get two points; joining them by a straight line gives us the required line of best fit, as shown in Figure 5-2.

Some Important Relationships

We can derive some important relationships for data analysis, involving other measures such as $\bar{X}$, $\bar{Y}$, $S_x$, $S_y$ and the correlation coefficient $r_{xy}$.

Substituting $\bar{Y} - b\bar{X}$ [from Eq. (5.3)] for a in Eq. (5.1),

$$ Y_c = (\bar{Y} - b\bar{X}) + bX $$

or

$$ Y_c - \bar{Y} = b(X - \bar{X}) \tag{5.7} $$

Dividing the numerator and denominator of Eq. (5.4) by $N^2$, we get

$$ b = \frac{\dfrac{\sum XY}{N} - \left(\dfrac{\sum X}{N}\right)\left(\dfrac{\sum Y}{N}\right)}{\dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2} $$

or

$$ b = \frac{\dfrac{\sum XY}{N} - \bar{X}\bar{Y}}{S_x^2} $$

or

$$ b = \frac{\mathrm{Cov}(X, Y)}{S_x^2} \tag{5.8} $$

We know that the coefficient of correlation $r_{xy}$ is given by

$$ r_{xy} = \frac{\mathrm{Cov}(X, Y)}{S_x S_y} $$

or

$$ \mathrm{Cov}(X, Y) = r_{xy} S_x S_y $$

So Eq. (5.8) becomes

$$ b = \frac{r_{xy} S_x S_y}{S_x^2} $$

$$ b = r_{xy}\frac{S_y}{S_x} \tag{5.9} $$

Substituting $r_{xy}\dfrac{S_y}{S_x}$ for b in Eq. (5.7), we get

$$ Y_c - \bar{Y} = r_{xy}\frac{S_y}{S_x}(X - \bar{X}) \tag{5.10} $$

These are important relationships for data analysis.
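As a numerical check on Eqs. (5.8) and (5.9), the short Python sketch below computes the covariance, the standard deviations and the correlation coefficient for the data of Table 5-1 and confirms that both expressions give the same slope. It is an illustration only; the variable names are assumptions, and all moments divide by N, as in the text.

```python
from math import sqrt

xs = [10, 12, 15, 20, 23]  # marketing expenditure X (Rs thousands)
ys = [14, 17, 23, 21, 25]  # sales Y (Rs lac)
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Moments computed with divisor N, matching the definitions used in the text.
cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
s_x = sqrt(sum(x * x for x in xs) / n - mean_x ** 2)
s_y = sqrt(sum(y * y for y in ys) / n - mean_y ** 2)

r_xy = cov_xy / (s_x * s_y)
b_from_cov = cov_xy / s_x ** 2  # Eq. (5.8)
b_from_r = r_xy * s_y / s_x     # Eq. (5.9)

print(f"Cov = {cov_xy:.2f}, Sx^2 = {s_x ** 2:.2f}, Sy = {s_y:.2f}")
print(f"r = {r_xy:.4f}, b via (5.8) = {b_from_cov:.4f}, b via (5.9) = {b_from_r:.4f}")
```

Both routes should give b of about 0.7119, the same value obtained from Eq. (5.4), with r close to 0.865.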

5.3.1.3 Predicting an Estimate and its Preciseness

The main objective of regression analysis is to know the nature of relationship between two

variables and to use it for predicting the most likely value of the dependent variable

corresponding to a given, known value of the independent variable. This can be done by

substituting in Eq.(5.1a) any known value of X corresponding to which the most likely estimate

of Y is to be found.

For example, the estimate of Y (i.e. Yc), corresponding to X = 15 is

Yc = 8.61 + 0.71(15)

= 8.61 + 10.65

= 19.26

It may be appreciated that an estimate of Y derived from a regression equation will not be

exactly the same as the Y value which may actually be observed. The difference between

estimated Yc values and the corresponding observed Y values will depend on the extent of

scatter of various points around the line of best fit.

The more closely the paired sample points (Y, X) cluster around the line of best fit, the smaller the difference between the estimated Yc and observed Y values, and vice-versa. On the whole, the lesser the scatter of the various points around the line of best fit, and the lesser the vertical distance by which they deviate from it, the more likely it is that an estimated Yc value is close to the corresponding observed Y value.

The estimated Yc values will coincide with the observed Y values only when all the points on the scatter diagram fall on a straight line. If this were so, the sales for a given marketing expenditure could be estimated with 100 percent accuracy. But such a situation is too rare to obtain. Since some of the points must lie above and some below the straight line, perfect prediction is practically non-existent in the case of most business and economic situations.

This means that the estimated values of one variable based on the known values of the other variable are always bound to differ from the observed values. The smaller the difference, the greater the precision of the estimate, and vice-versa. Accordingly, the preciseness of an estimate can be assessed only through a measure of the magnitude of error in the estimates, called the error of estimate.

5.3.1.4 Error of Estimate

A measure of the error of estimate is given by the standard error of estimate of Y on X, denoted as Syx and defined as

$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} \tag{5.11} $$

Syx measures the typical (root-mean-square) amount by which the observed Y values depart from the corresponding computed Yc values.

Computation of Syx becomes a little cumbersome where the number of observations N is large. In such cases Syx may be computed directly by using the equation:

$$ S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N}} \tag{5.12} $$

By substituting the values of $\sum Y^2$, $\sum Y$ and $\sum XY$ from Table 5-2, and the calculated values of a and b, we have

$$ S_{yx} = \sqrt{\frac{2080 - 8.61 \times 100 - 0.71 \times 1684}{5}} = \sqrt{\frac{2080 - 861 - 1195.64}{5}} = \sqrt{\frac{23.36}{5}} = \sqrt{4.67} = 2.16 $$
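The shortcut formula subtracts large, nearly equal quantities, so its result is sensitive to how a and b are rounded. The Python sketch below, an illustration with assumed variable names rather than part of the original notes, computes Syx three ways for the data of Table 5-1: from the residuals as in Eq. (5.11), from the shortcut Eq. (5.12) with unrounded a and b, and from Eq. (5.12) with a = 8.61 and b = 0.71 as in the text.

```python
from math import sqrt

xs = [10, 12, 15, 20, 23]  # marketing expenditure X (Rs thousands)
ys = [14, 17, 23, 21, 25]  # sales Y (Rs lac)
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Unrounded least-squares coefficients, Eqs. (5.4) and (5.3).
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Eq. (5.11): root-mean-square of the residuals Y - Yc.
syx_definition = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)

# Eq. (5.12): shortcut formula with unrounded coefficients.
syx_shortcut = sqrt((sum_y2 - a * sum_y - b * sum_xy) / n)

# Eq. (5.12) with the two-decimal coefficients used in the text.
syx_rounded = sqrt((sum_y2 - 8.61 * sum_y - 0.71 * sum_xy) / n)

print(f"definition: {syx_definition:.2f}")
print(f"shortcut:   {syx_shortcut:.2f}")
print(f"rounded:    {syx_rounded:.2f}")
```

With unrounded coefficients both formulas agree at about 2.01, whereas the two-decimal coefficients give the 2.16 shown above. Carrying more decimal places for a and b is therefore advisable when the shortcut formula is used.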

Interpretations of Syx

A careful observation of how the standard error of estimate is computed reveals the following:

1. Syx is a concept statistically parallel to the standard deviation Sy. The only difference between the two is that the standard deviation measures the dispersion around the mean, whereas the standard error of estimate measures the dispersion around the regression line. Similar to the property of the arithmetic mean, the sum of the deviations of the different Y values from their corresponding estimated Yc values is equal to zero. That is,

$$ \sum (Y_i - \bar{Y}) = \sum (Y_i - Y_c) = 0, \quad i = 1, 2, \dots, N. $$

2. Syx tells us the amount by which the estimated Yc values will, on an average, deviate from the observed Y values. Hence it is an estimate of the average amount of error in the estimated Yc values. The actual error (the residual of Y and Yc) may, however, be smaller or larger than the average error. Theoretically, these errors follow a normal distribution. Thus, assuming that N ≥ 30, about 68.27% of the observed Y values will lie within Yc ± 1 Syx. Similarly, about 95.45% of them will lie within Yc ± 2 Syx.

Further, for the estimated value of sales against marketing expenditure of Rs 15

thousand being Rs 19.26 lac, one may like to know how good this estimate is. Since Syx

is estimated to be Rs 2.16 lac, it means there are about 68 chances (68.27) out of 100

that this estimate is in error by not more than Rs 2.16 lac above or below Rs

19.26 lac. That is, there are 68% chances that actual sales would fall between (19.26 -

2.16) = Rs 17.10 lac and (19.26 + 2.16) = Rs 21.42 lac.

3. Since Syx measures the closeness of the observed Y values and the estimated Yc values,

it also serves as a measure of the reliability of the estimate. Greater the closeness

between the observed and estimated values of Y, the lesser the error and, consequently,

the more reliable the estimate. And vice-versa.

4. Standard error of estimate Syx can also be seen as a measure of correlation insofar as it

expresses the degree of closeness of scatter of observed Y values about the regression

line. The more closely the observed Y values are scattered around the regression line, the higher the correlation between the two variables.

A major difficulty in using Syx as a measure of correlation is that it is expressed in the

same units of measurement as the data on the dependent variable. This creates problems

in situations requiring comparison of two or more sets of data in terms of correlation. It

is mainly due to this limitation that the standard error of estimate is not generally used

as a measure of correlation. However, it does serve as the basis of evolving the

coefficient of determination, denoted as r2, which provides an alternate method of

obtaining a measure of correlation.

5.3.2 REGRESSION LINE OF X ON Y

So far we have considered the regression of Y on X, in the sense that Y was in the role of

dependent and X in the role of an independent variable. In their reverse position, such that X

is now the dependent and Y the independent variable, we fit a line of regression of X on Y.

The regression equation in this case will be

$$ X_c = a' + b'Y \tag{5.13} $$

where Xc denotes the computed values of X against the corresponding values of Y, a’ is the X-intercept and b’ is the slope of the straight line.

The two normal equations to solve for a’ and b’ are

$$ \begin{aligned} \sum X &= a'N + b'\sum Y \\ \sum XY &= a'\sum Y + b'\sum Y^2 \end{aligned} \tag{5.14} $$

The values of a’ and b’ can also be obtained directly:

$$ a' = \bar{X} - b'\bar{Y} \tag{5.15} $$

and

$$ b' = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - \left(\sum Y\right)^2} \tag{5.16} $$

or

$$ a' = \frac{\sum X \sum Y^2 - \sum Y \sum XY}{N\sum Y^2 - \left(\sum Y\right)^2} \tag{5.17} $$

and

$$ b' = \frac{\bar{X} - a'}{\bar{Y}} \tag{5.18} $$

$$ b' = \frac{\mathrm{Cov}(Y, X)}{S_y^2} \tag{5.19} $$

$$ b' = r_{yx}\frac{S_x}{S_y} \tag{5.20} $$

So, the regression equation of X on Y may also be written as

$$ X_c - \bar{X} = b'(Y - \bar{Y}) \tag{5.21} $$

$$ X_c - \bar{X} = r_{yx}\frac{S_x}{S_y}(Y - \bar{Y}) \tag{5.22} $$

As before, once the values of a’ and b’ have been found, their substitution in Eq. (5.13) will enable us to get an estimate of X corresponding to a known value of Y.

The standard error of estimate of X on Y, i.e. Sxy, will be

$$ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} \tag{5.23} $$

or

$$ S_{xy} = \sqrt{\frac{\sum X^2 - a'\sum X - b'\sum XY}{N}} \tag{5.24} $$

For example, if we want to estimate the marketing expenditure to achieve a sale target of Rs

40 lac, we have to obtain regression line of X on Y i. e.

Xc = a’ + b’Y

So, using Eqs. (5.17) and (5.16), and substituting the values of $\sum X$, $\sum Y^2$, $\sum Y$ and $\sum XY$ from Table 5-2, we have

$$ a' = \frac{80 \times 2080 - 100 \times 1684}{5 \times 2080 - (100)^2} = \frac{166400 - 168400}{10400 - 10000} = \frac{-2000}{400} = -5.00 $$

and

$$ b' = \frac{5 \times 1684 - 80 \times 100}{5 \times 2080 - (100)^2} = \frac{8420 - 8000}{10400 - 10000} = \frac{420}{400} = 1.05 $$

Now, given that a’ = -5.00 and b’ = 1.05, the regression equation (5.13) takes the form

$$ X_c = -5.00 + 1.05Y $$

So when Y = 40 (Rs lac), the corresponding X value is

$$ X_c = -5.00 + 1.05 \times 40 = -5 + 42 = 37 $$

That is to achieve a sale target of Rs 40 lac, there is a need to spend Rs 37 thousand on

marketing.
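The regression of X on Y can be checked in the same way as the regression of Y on X. The Python sketch below is an illustration only; the function name fit_x_on_y and the variable names are assumptions. It applies Eqs. (5.16) and (5.15) to the data of Table 5-1 and then estimates the marketing expenditure for a sales target of Rs 40 lac.

```python
# Least-squares line of X on Y, Xc = a' + b'*Y, for the data of Table 5-1.
# fit_x_on_y is a hypothetical helper name chosen for this illustration.

def fit_x_on_y(xs, ys):
    """Return (a_prime, b_prime) for the regression line of X on Y."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_y2 = sum(y * y for y in ys)
    b_prime = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)  # Eq. (5.16)
    a_prime = sum_x / n - b_prime * sum_y / n                            # Eq. (5.15)
    return a_prime, b_prime


marketing = [10, 12, 15, 20, 23]  # X, Rs thousands
sales = [14, 17, 23, 21, 25]      # Y, Rs lac

a_prime, b_prime = fit_x_on_y(marketing, sales)
target_sales = 40  # Rs lac
required_spend = a_prime + b_prime * target_sales

print(f"a' = {a_prime:.2f}, b' = {b_prime:.2f}")
print(f"Estimated marketing expenditure: Rs {required_spend:.0f} thousand")
```

Note that the target of Rs 40 lac lies well outside the observed sales range of Rs 14 lac to Rs 25 lac, so the estimate is an extrapolation and should be read with corresponding caution.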

5.4 PROPERTIES OF REGRESSION COEFFICIENTS


As explained earlier, the slope of regression line is called the regression coefficient. It tells

the effect on dependent variable if there is a unit change in the independent variable. Since

for a paired data on X and Y variables, there are two regression lines: regression line of Y on X

and regression line of X on Y, so we have two regression coefficients:

a. Regression coefficient of Y on X, denoted by byx [b in Eq.(5.1)]

b. Regression coefficient of X on Y, denoted by bxy [b’ in Eq.(5.13)]

The following are the important properties of regression coefficients that are helpful in data

analysis

1. Both regression coefficients cannot be greater than 1 in absolute value at the same time. If one of them exceeds 1, the other must be below 1, so that the square root of their product, which equals r, lies within the limits ±1.

2. The coefficient of correlation is the geometric mean of the regression coefficients, i.e.

$$ r = \pm\sqrt{b \cdot b'} \tag{5.25} $$

The signs of both the regression coefficients are the same, and the value of r takes that same sign.

3. The arithmetic mean of the two regression coefficients is either equal to or greater than the coefficient of correlation in absolute value, i.e.

$$ \left|\frac{b + b'}{2}\right| \geq |r| $$

4. Regression coefficients are independent of change of origin but not of change of scale. Mathematically, if the given variables X and Y are transformed to new variables U and V by a change of origin and scale, i.e.

$$ U = \frac{X - A}{h} \quad \text{and} \quad V = \frac{Y - B}{k} $$

where A, B, h and k are constants, h > 0, k > 0, then

Regression coefficient of Y on X = k/h (Regression coefficient of V on U)

$$ b_{yx} = \frac{k}{h}\, b_{vu} $$

and

Regression coefficient of X on Y = h/k (Regression coefficient of U on V)

$$ b_{xy} = \frac{h}{k}\, b_{uv} $$

5. The coefficient of determination is the product of the two regression coefficients, i.e.

$$ r^2 = b \cdot b' $$
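These properties can be verified numerically on the data of Table 5-1. The Python sketch below is an illustration only: the helper name regression_coefficient and the constants A, B, h and k are hypothetical values chosen for the example, not figures from the text.

```python
from math import sqrt

def regression_coefficient(predictor, response):
    """Slope of the least-squares line of `response` on `predictor`."""
    n = len(predictor)
    sum_p, sum_q = sum(predictor), sum(response)
    sum_pq = sum(p * q for p, q in zip(predictor, response))
    sum_p2 = sum(p * p for p in predictor)
    return (n * sum_pq - sum_p * sum_q) / (n * sum_p2 - sum_p ** 2)


xs = [10, 12, 15, 20, 23]  # marketing expenditure X
ys = [14, 17, 23, 21, 25]  # sales Y

b_yx = regression_coefficient(xs, ys)  # b, regression coefficient of Y on X
b_xy = regression_coefficient(ys, xs)  # b', regression coefficient of X on Y

# Properties 2 and 5: r is the geometric mean of b and b'.
# The positive root is taken because both coefficients are positive here.
r = sqrt(b_yx * b_xy)

# Property 4: change of origin and scale (A, B, h, k are hypothetical choices).
A, h = 15, 5
B, k = 20, 2
us = [(x - A) / h for x in xs]
vs = [(y - B) / k for y in ys]
b_vu = regression_coefficient(us, vs)
b_uv = regression_coefficient(vs, us)

print(f"b = {b_yx:.4f}, b' = {b_xy:.4f}, r = {r:.4f}, r^2 = {b_yx * b_xy:.4f}")
print(f"mean of b and b' = {(b_yx + b_xy) / 2:.4f}")
print(f"(k/h) * b_vu = {k / h * b_vu:.4f}, (h/k) * b_uv = {h / k * b_uv:.4f}")
```

Here b’ exceeds 1 while b is below 1, the mean of the two coefficients is larger than r, and the rescaled coefficients reproduce b and b’, in line with properties 1, 3 and 4.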

5.5 REGRESSION LINES AND COEFFICIENT OF CORRELATION


The two regression lines indicate the nature and extent of correlation between the variables.

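The two regression lines can be represented as

$$ Y - \bar{Y} = r\frac{S_y}{S_x}(X - \bar{X}) \quad \text{and} \quad X - \bar{X} = r\frac{S_x}{S_y}(Y - \bar{Y}) $$

We can write the regression coefficients of these lines as

$$ b = r\frac{S_y}{S_x} \quad \text{and} \quad b' = r\frac{S_x}{S_y} $$

Measured against the X-axis, the line of Y on X has slope b and the line of X on Y has slope 1/b’. If $\theta$ is the angle between these lines, then

$$ \tan\theta = \frac{b - \dfrac{1}{b'}}{1 + \dfrac{b}{b'}} = \frac{S_x S_y}{S_x^2 + S_y^2}\left(\frac{r^2 - 1}{r}\right) $$

or

$$ \theta = \tan^{-1}\left[\frac{S_x S_y}{S_x^2 + S_y^2}\left(\frac{r^2 - 1}{r}\right)\right] \tag{5.26} $$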
Figure 5-3 Regression Lines and Coefficient of Correlation
Eq. (5.26) reveals the following:

➢ In case of perfect positive correlation (r = +1) and in case of perfect negative correlation (r = -1), $\theta$ = 0, so the two regression lines coincide, i.e. we have only one line; see (a) and (b) in Figure 5-3.

➢ The farther the two regression lines are from each other, the lesser the degree of correlation; the nearer the two regression lines, the greater the degree of correlation; see (c) and (d) in Figure 5-3.

➢ If the variables are uncorrelated, i.e. r = 0, the lines of regression cut each other at right angles; see (g) in Figure 5-3.

Note: Both the regression lines cut each other at the mean value of X and the mean value of Y, i.e. at the point ($\bar{X}$, $\bar{Y}$).
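The behaviour described above can be illustrated numerically. The Python sketch below evaluates the angle of Eq. (5.26) for the data of Table 5-1 and for the limiting cases r = ±1 and r = 0. The function name angle_between_regression_lines is an assumption made for this example, and the function returns the magnitude of the angle in degrees, ignoring its sign.

```python
from math import atan, degrees, sqrt

def angle_between_regression_lines(r, s_x, s_y):
    """Magnitude of the angle (in degrees) between the two regression lines."""
    if r == 0:
        return 90.0  # tan(theta) is unbounded, so the lines are at right angles
    tan_theta = (s_x * s_y) / (s_x ** 2 + s_y ** 2) * (r ** 2 - 1) / r  # Eq. (5.26)
    return abs(degrees(atan(tan_theta)))


# Moments for Table 5-1, using the sums from Table 5-2 with N = 5.
n = 5
mean_x, mean_y = 80 / n, 100 / n
s_x = sqrt(1398 / n - mean_x ** 2)
s_y = sqrt(2080 / n - mean_y ** 2)
r = (1684 / n - mean_x * mean_y) / (s_x * s_y)

theta = angle_between_regression_lines(r, s_x, s_y)
print(f"r = {r:.4f}, angle between the lines = {theta:.2f} degrees")
print(angle_between_regression_lines(1.0, s_x, s_y))  # perfect correlation
print(angle_between_regression_lines(0.0, s_x, s_y))  # no correlation
```

For the marketing data, where r is about 0.865, the two lines are roughly 8 degrees apart, which is consistent with a fairly high degree of correlation.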

5.6 COEFFICIENT OF DETERMINATION


Coefficient of determination gives the percentage variation in the dependent variable that is

accounted for by the independent variable. In other words, the coefficient of determination

gives the ratio of the explained variance to the total variance. The coefficient of determination

is given by the square of the correlation coefficient, i.e. r2. Thus,

Coefficient of determination

$$ r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}} $$

$$ r^2 = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2} \tag{5.27} $$

We can calculate another coefficient $K^2$, known as the coefficient of non-determination, which is the ratio of the unexplained variance to the total variance.

$$ K^2 = \frac{\text{Unexplained Variance}}{\text{Total Variance}} $$

$$ K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} \tag{5.28} $$

$$ K^2 = 1 - \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - r^2 \tag{5.29} $$

The square root of the coefficient of non-determination, i.e. K, gives the coefficient of alienation:

$$ K = \pm\sqrt{1 - r^2} \tag{5.30} $$

Relation Between Syx and r:

A simple algebraic operation on Eq. (5.30) brings out some interesting points about the relation between Syx and r. Thus, since

$$ \sum (Y - Y_c)^2 = N S_{yx}^2 \quad \text{and} \quad \sum (Y - \bar{Y})^2 = N S_y^2 $$

we have the coefficient of non-determination

$$ K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} = \frac{N S_{yx}^2}{N S_y^2} = \frac{S_{yx}^2}{S_y^2} $$

So

$$ 1 - r^2 = \frac{S_{yx}^2}{S_y^2} $$

or

$$ \sqrt{1 - r^2} = \frac{S_{yx}}{S_y} \tag{5.31} $$

If the coefficient of correlation, r, is defined as the square root of the coefficient of determination,

$$ r = \sqrt{r^2} = \sqrt{1 - \frac{S_{yx}^2}{S_y^2}} \tag{5.32} $$

On carefully observing Eq. (5.32), it will be noticed that the ratio Syx/Sy will be large if the coefficient of determination is small, and it will be small when the coefficient of determination is large. Thus

✓ if $r^2$ = r = 0, Syx/Sy = 1, which means that Syx = Sy.

✓ if $r^2$ = r = 1, Syx/Sy = 0, which means that Syx = 0.

✓ when r = 0.865, Syx/Sy = $\sqrt{1 - 0.865^2}$ = 0.502, which means that Syx is 50.2% of Sy.

Eq. (5.32) also implies that Syx is generally less than Sy. The two can at the most be equal, but only in the extreme situation when r = 0.
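The decomposition behind Eqs. (5.27) to (5.32) can be checked on the data of Table 5-1. The Python sketch below is an illustration with assumed variable names; it splits the total variation of Y into explained and unexplained parts and confirms that the ratio Syx/Sy equals the square root of 1 - r². It uses unrounded coefficients, so r² comes out near 0.7475 rather than the 0.7455 obtained when b is first rounded to 0.71.

```python
from math import sqrt

xs = [10, 12, 15, 20, 23]  # marketing expenditure X (Rs thousands)
ys = [14, 17, 23, 21, 25]  # sales Y (Rs lac)
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares line of Y on X, in deviation form.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
fitted = [a + b * x for x in xs]

total = sum((y - mean_y) ** 2 for y in ys)                  # total variation
explained = sum((yc - mean_y) ** 2 for yc in fitted)        # explained variation
unexplained = sum((y - yc) ** 2 for y, yc in zip(ys, fitted))

r2 = explained / total    # Eq. (5.27)
k2 = unexplained / total  # Eq. (5.28)

s_y = sqrt(total / n)
s_yx = sqrt(unexplained / n)

print(f"r^2 = {r2:.4f}, K^2 = {k2:.4f}, r^2 + K^2 = {r2 + k2:.4f}")
print(f"Syx/Sy = {s_yx / s_y:.4f}, sqrt(1 - r^2) = {sqrt(1 - r2):.4f}")
```

The last line shows the two sides of Eq. (5.31) agreeing at roughly 0.50, that is, Syx is about half of Sy for these data.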

Interpretations of r2:

1. Even though the coefficient of determination, whose square root measures the degree of correlation, is based on Syx, it is expressed as 1 - ($S_{yx}^2 / S_y^2$). As it is a dimensionless pure number, the unit in which Syx is measured becomes irrelevant. This facilitates comparison between two sets of data in terms of their coefficient of determination $r^2$ (or the coefficient of correlation r). This was not possible in terms of Syx, as the units of measurement could be different.

2. The value of $r^2$ can range between 0 and 1. When $r^2$ = 1, all the points on the scatter diagram fall on the regression line and the entire variation is explained by the straight line. On the other hand, when $r^2$ = 0, the regression line explains none of the variation in Y, meaning thereby that there is no linear relationship between the two variables. However, being always non-negative, the coefficient of determination does not tell us about the direction of the relationship (whether it is positive or negative) between the two variables.

3. When r2 = 0.7455 (or any other value), 74.55% of the total variations in sales are

explained by the marketing expenditure used. What remains is the coefficient of non-

determination K2 (= 1 - r2) = 0.2545. It means 25.45% of the total variations remain

unexplained, which are due to factors other than the changes in the marketing

expenditure.

4. r2 provides the necessary link between regression and correlation which are the two

related aspects of a single problem of the analysis of relationship between two variables.

Unlike regression, correlation quantifies the degrees of relationship between the

variables under study, without making a distinction between the dependent and

independent ones. Nor does it, therefore, help in predicting the value of one variable for

a given value of the other.

5. The coefficient of correlation overstates the degree of relationship, and its meaning is not as explicit as that of the coefficient of determination. The coefficient of correlation r = 0.865, as compared to $r^2$ = 0.7455, suggests a higher degree of correlation between sales and marketing expenditure. Therefore, the coefficient of determination is a more objective measure of the degree of relationship.

6. The sum of r and K never adds to one, unless one of the two is zero. That is, r + K can

be unity either when there is no correlation or when there is perfect correlation.

Except in these two extreme situations, (r + K) > 1.

5.7 CORRELATION ANALYSIS VERSUS REGRESSION ANALYSIS


Correlation and Regression are the two related aspects of a single problem of the analysis of

the relationship between the variables. If we have information on more than one variable, we

might be interested in seeing if there is any connection - any association - between them. If we find such an association, we might again be interested in predicting the value of one variable for the given and known values of other variable(s).

1. Correlation literally means the relationship between two or more variables that vary in

sympathy so that the movements in one tend to be accompanied by the corresponding

movements in the other(s). On the other hand, regression means stepping back or

returning to the average value and is a mathematical measure expressing the average

relationship between the two variables.

2. Correlation coefficient rxy between two variables X and Y is a measure of the direction

and degree of the linear relationship between two variables that is mutual. It is

symmetric, i.e., ryx = rxy and it is immaterial which of X and Y is dependent variable and

which is independent variable.

Regression analysis aims at establishing the functional relationship between the two (or more) variables under study and then using this relationship to predict or estimate the value of the dependent variable for any given value of the independent variable(s). It also reflects upon the nature of the variables, i.e., which is the dependent variable and which is the independent variable. Regression coefficients are not symmetric in X and Y, i.e., byx ≠ bxy.

3. Correlation need not imply a cause and effect relationship between the variables under study. Regression analysis, however, presupposes a direction of dependence between the variables, although a fitted regression line does not by itself prove causation. The variable corresponding to cause is taken as the independent variable and the variable corresponding to effect is taken as the dependent variable.

4. Correlation coefficient rxy is a relative measure of the linear relationship between X and

Y and is independent of the units of measurement. It is a pure number lying between

±1.

On the other hand, the regression coefficients, byx and bxy are absolute measures

representing the change in the value of the variable Y (or X), for a unit change in the

value of the variable X (or Y). Once the functional form of regression curve is known;

by substituting the value of the independent variable we can obtain the value of the

dependent variable and this value will be in the units of measurement of the dependent

variable.

5. There may be non-sense correlation between two variables that is due to pure chance

and has no practical relevance, e.g., the correlation, between the size of shoe and the

intelligence of a group of individuals. There is no such thing like non-sense regression.

5.8 SOLVED PROBLEMS


Example 5-1
The following table shows the number of motor registrations in a certain territory for

a term of 5 years and the sale of motor tyres by a firm in that territory for the same

period.

Year    Motor Registrations    No. of Tyres Sold
1       600                    1,250
2       630                    1,100
3       720                    1,300
4       750                    1,350
5       800                    1,500
Find the regression equation to estimate the sale of tyres when the motor registration

is known. Estimate sale of tyres when registration is 850.

Solution: Here the dependent variable is number of tyres; dependent on motor registrations.

Hence we put motor registrations as X and sales of tyres as Y and we have to establish the

regression line of Y on X.

Calculations of values for the regression equation are given below:

X       Y        dx = X - X̄    dy = Y - Ȳ    dx²       dx dy

600     1,250    -100          -50           10,000    5,000
630     1,100    -70           -200          4,900     14,000
720     1,300    20            0             400       0
750     1,350    50            50            2,500     2,500
800     1,500    100           200           10,000    20,000

ΣX = 3,500    ΣY = 6,500    Σdx = 0    Σdy = 0    Σdx² = 27,800    Σdx dy = 41,500

$$\bar{X} = \frac{\sum X}{N} = \frac{3{,}500}{5} = 700 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{6{,}500}{5} = 1{,}300$$

byx = Regression coefficient of Y on X

$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{41{,}500}{27{,}800} = 1.4928$$

Now we can use these values for the regression line

Y - Ȳ = byx (X - X̄)

or Y – 1300 = 1.4928 (X - 700)

Y = 1.4928 X + 255.04

When X = 850, the value of Y can be calculated from the above equation, by putting X = 850

in the equation.

Y = 1.4928 x 850 + 255.04

= 1523.92

= 1,524 Tyres
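
The deviation method used in Example 5-1 can be reproduced in a few lines. This is only a sketch: the function name `regression_y_on_x` and the variable names are assumptions made for illustration, and the data are the five pairs from the table above.

```python
def regression_y_on_x(xs, ys):
    """Slope and intercept of the regression line of Y on X,
    computed from deviations about the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_dx2 = sum((x - mean_x) ** 2 for x in xs)
    byx = sum_dxdy / sum_dx2
    intercept = mean_y - byx * mean_x
    return byx, intercept

registrations = [600, 630, 720, 750, 800]
tyres_sold = [1250, 1100, 1300, 1350, 1500]

byx, intercept = regression_y_on_x(registrations, tyres_sold)
estimate = intercept + byx * 850   # estimated tyre sales for 850 registrations

print(round(byx, 4), round(intercept, 2), round(estimate))
# prints: 1.4928 255.04 1524
```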

Example 5-2
A panel of Judges A and B graded seven debaters and independently awarded the

following marks:

Debater    Marks by A    Marks by B

1          40            32
2          34            39
3          28            26
4          30            30
5          44            38
6          38            34
7          31            28

An eighth debater was awarded 36 marks by Judge A, while Judge B was not present. If

Judge B were also present, how many marks would you expect him to award to the eighth

debater, assuming that the same degree of relationship exists in their judgement?

Solution: Let us use marks from Judge A as X and those from Judge B as Y. Now we have to

work out the regression line of Y on X from the calculation below:

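Debater    X     Y     U = X - 35    V = Y - 30    U²    V²    UV

1          40    32    5             2             25    4     10
2          34    39    -1            9             1     81    -9
3          28    26    -7            -4            49    16    28
4          30    30    -5            0             25    0     0
5          44    38    9             8             81    64    72
6          38    34    3             4             9     16    12
7          31    28    -4            -2            16    4     8

N = 7    ΣU = 0    ΣV = 17    ΣU² = 206    ΣV² = 185    ΣUV = 121

Here A denotes the assumed mean (35 for X and 30 for Y).

$$\bar{X} = A + \frac{\sum U}{N} = 35 + \frac{0}{7} = 35 \quad \text{and} \quad \bar{Y} = A + \frac{\sum V}{N} = 30 + \frac{17}{7} = 32.43$$

$$b_{yx} = b_{vu} = \frac{N \sum UV - (\sum U)(\sum V)}{N \sum U^2 - (\sum U)^2} = \frac{7 \times 121 - 0 \times 17}{7 \times 206 - 0} = 0.587$$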

Hence the regression equation can be written as

Y - Ȳ = byx (X - X̄)

Y – 32.43 = 0.587 (X-35)

or Y = 0.587X + 11.87

When X = 36 (awarded by Judge A)

Y = 0.587 x 36 + 11.87

= 33

Thus if Judge B were present, he would have awarded 33 marks to the eighth debater.
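
The assumed-mean shortcut of Example 5-2 can be sketched as follows. The function name `shortcut_regression` is hypothetical; the data are the seven pairs of marks from the table. The second call illustrates that the regression coefficient does not depend on which assumed means are chosen.

```python
def shortcut_regression(xs, ys, assumed_x, assumed_y):
    """Regression coefficient of Y on X from deviations U, V taken
    about assumed means, plus the actual means of X and Y."""
    n = len(xs)
    u = [x - assumed_x for x in xs]
    v = [y - assumed_y for y in ys]
    sum_u, sum_v = sum(u), sum(v)
    sum_uv = sum(ui * vi for ui, vi in zip(u, v))
    sum_u2 = sum(ui * ui for ui in u)
    byx = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
    mean_x = assumed_x + sum_u / n
    mean_y = assumed_y + sum_v / n
    return byx, mean_x, mean_y

marks_a = [40, 34, 28, 30, 44, 38, 31]
marks_b = [32, 39, 26, 30, 38, 34, 28]

byx, mean_x, mean_y = shortcut_regression(marks_a, marks_b, 35, 30)
estimate = mean_y + byx * (36 - mean_x)   # expected marks from Judge B

# A different choice of assumed means (here 0 and 0) gives the same slope.
byx_other, _, _ = shortcut_regression(marks_a, marks_b, 0, 0)

print(round(byx, 3), round(mean_y, 2), round(estimate))
# prints: 0.587 32.43 33
```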

Example 5-3
For some bivariate data, the following results were obtained.

Mean value of variable X = 53.2

Mean value of variable Y = 27.9

Regression coefficient of Y on X = - 1.5

Regression coefficient of X on Y = - 0.2

What is the most likely value of Y, when X = 60?

What is the coefficient of correlation between X and Y?

Solution: Given data indicate

X̄ = 53.2    Ȳ = 27.9

byx = -1.5    bxy = -0.2

To obtain value of Y for X = 60, we establish the regression line of Y on X,

Y - Ȳ = byx (X - X̄)

Y – 27.9 = -1.5 (X-53.2)

or Y = -1.5X + 107.7

Putting value of X = 60, we obtain

Y = -1.5 x 60 + 107.7

= 17.7

Coefficient of correlation between X and Y is given by the G.M. of byx and bxy

r² = byx x bxy

= (-1.5) x (-0.2)

= 0.3

So r = ± √0.3 = ± 0.5477

Since both the regression coefficients are negative, we assign negative value to the

correlation coefficient

r = - 0.5477
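
The sign rule used in Example 5-3 can be written as a small helper. This is a sketch only; the function name `r_from_coefficients` is an assumption. It takes the square root of the product of the two regression coefficients and gives it their common sign, rejecting inputs that cannot be a valid pair.

```python
from math import sqrt, copysign

def r_from_coefficients(byx, bxy):
    """Correlation coefficient as the geometric mean of the two
    regression coefficients, carrying their common sign."""
    product = byx * bxy
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    if product > 1 + 1e-12:
        raise ValueError("product of regression coefficients cannot exceed 1")
    return copysign(sqrt(product), byx)

print(round(r_from_coefficients(-1.5, -0.2), 4))
# prints: -0.5477
```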

Example 5-4
Write regression equations of X on Y and of Y on X for the following data

X: 45 48 50 55 65 70 75 72 80 85
Y: 25 30 35 30 40 50 45 55 60 65

Solution: We prepare the table for working out the values for the regression lines.

X     Y     U = X - 65    V = Y - 45    U²     UV     V²

45    25    -20           -20           400    400    400
48    30    -17           -15           289    255    225
50    35    -15           -10           225    150    100
55    30    -10           -15           100    150    225
65    40    0             -5            0      0      25
70    50    5             5             25     25     25
75    45    10            0             100    0      0
72    55    7             10            49     70     100
80    60    15            15            225    225    225
85    65    20            20            400    400    400

ΣX = 645    ΣY = 435    ΣU = -5    ΣV = -15    ΣU² = 1,813    ΣUV = 1,675    ΣV² = 1,725

We have,

$$\bar{X} = \frac{\sum X}{N} = \frac{645}{10} = 64.5 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{435}{10} = 43.5$$

$$b_{yx} = \frac{N \sum UV - (\sum U)(\sum V)}{N \sum U^2 - (\sum U)^2} = \frac{10 \times 1675 - (-5) \times (-15)}{10 \times 1813 - (-5)^2} = \frac{16750 - 75}{18130 - 25} = \frac{16675}{18105} = 0.921$$

Regression equation of Y on X is

Y - Ȳ = byx (X - X̄)

Y - 43.5 = 0.921 (X - 64.5)

or Y = 0.921X - 15.9

Similarly bxy can be calculated as

$$b_{xy} = \frac{N \sum UV - (\sum U)(\sum V)}{N \sum V^2 - (\sum V)^2} = \frac{10 \times 1675 - (-5) \times (-15)}{10 \times 1725 - (-15)^2} = \frac{16750 - 75}{17250 - 225} = \frac{16675}{17025} = 0.979$$

Regression equation of X on Y will be

X - X̄ = bxy (Y - Ȳ)

X - 64.5 = 0.979 (Y - 43.5)

or X = 0.979Y + 21.9
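
Hand tabulation of U, V and their products is easy to get wrong, so it can help to recompute both regression lines directly from the raw data of Example 5-4. The sketch below does this from deviations about the actual means; the function name `regression_lines` and the dictionary keys are assumptions made for illustration.

```python
def regression_lines(xs, ys):
    """Both regression lines from raw data.
    Y on X: Y = a_yx + byx * X.  X on Y: X = a_xy + bxy * Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    byx = sxy / sxx
    bxy = sxy / syy
    return {
        "byx": byx, "a_yx": mean_y - byx * mean_x,
        "bxy": bxy, "a_xy": mean_x - bxy * mean_y,
    }

x_values = [45, 48, 50, 55, 65, 70, 75, 72, 80, 85]
y_values = [25, 30, 35, 30, 40, 50, 45, 55, 60, 65]

lines = regression_lines(x_values, y_values)
for name, value in lines.items():
    print(name, round(value, 3))
```

Because the two coefficients come from the same sum of cross products, their product always lies between 0 and 1, which is a quick sanity check on any hand calculation.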

Example 5-5
The lines of regression of a bivariate population are

8X – 10Y + 66 = 0

40X – 18Y = 214

The variance of X is 9. Find

(i) The mean value of X and Y

(ii) Correlation coefficient between X and Y

(iii) Standard deviation of Y

Solution: The regression lines given are

8X – 10Y + 66 = 0

40X – 18Y = 214

(i) Since both the lines of regression pass through the mean values, the point (X̄, Ȳ) will satisfy

both the equations.

Hence these equations can be written as

8X̄ - 10Ȳ + 66 = 0

40X̄ - 18Ȳ - 214 = 0

Solving these two equations for X̄ and Ȳ, we obtain

X̄ = 13 and Ȳ = 17

(ii) For correlation coefficient between X and Y, we have to calculate the values of byx and

bxy

Rewriting the equations

10Y = 8X + 66

byx = + 8/10 = + 4/5

Similarly, 40X = 18Y + 214

bxy = 18/40 = 9/20

By these values, we can now work out the correlation coefficient.

r² = byx x bxy

= 4/5 x 9/20 = 9/25

So r = ± √(9/25)

= ± 0.6

Both the values of the regression coefficients being positive, we have to consider only the

positive value of the correlation coefficient. Hence r = 0.6

(iii) We have been given variance of X, i.e., Sx² = 9

Sx = ± 3

We consider Sx = 3 as SD is always positive

Since byx = r Sy / Sx, we have Sy = byx Sx / r

Substituting the values of byx, r and Sx we obtain,

$$S_y = \frac{b_{yx} S_x}{r} = \frac{(4/5) \times 3}{0.6} = 4$$
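
The three parts of Example 5-5 can be traced in code. This is a sketch under the same reading as the solution (first line taken as Y on X, second as X on Y); the helper name `solve_2x2` is an assumption. It also shows why the opposite reading is rejected: the product of the coefficients would exceed 1.

```python
from math import sqrt

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("lines are parallel or identical")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# 8X - 10Y = -66 and 40X - 18Y = 214 intersect at the means.
mean_x, mean_y = solve_2x2(8, -10, -66, 40, -18, 214)

# First line read as Y on X, second as X on Y.
byx, bxy = 8 / 10, 18 / 40

# The opposite reading would give coefficients 10/8 and 40/18;
# their product exceeds 1, so it cannot equal r squared.
swapped_product = (10 / 8) * (40 / 18)

r = sqrt(byx * bxy)      # positive root: both coefficients are positive
s_x = sqrt(9)            # variance of X is 9
s_y = byx * s_x / r      # from byx = r * Sy / Sx

print(mean_x, mean_y, round(r, 2), round(s_y, 2), round(swapped_product, 2))
# prints: 13.0 17.0 0.6 4.0 2.78
```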

Example 5-6
The height of a child increases at a rate given in the table below. Fit the straight line

using the method of least-square and calculate the average increase and the standard

error of estimate.

Month: 1 2 3 4 5 6 7 8 9 10
Height: 52.5 58.7 65 70.2 75.4 81.1 87.2 95.5 102.2 108.4

Solution: For Regression calculations, we draw the following table

Month (X)    Height (Y)    X²     XY

1            52.5          1      52.5
2            58.7          4      117.4
3            65.0          9      195.0
4            70.2          16     280.8
5            75.4          25     377.0
6            81.1          36     486.6
7            87.2          49     610.4
8            95.5          64     764.0
9            102.2         81     919.8
10           108.4         100    1084.0

ΣX = 55    ΣY = 796.2    ΣX² = 385    ΣXY = 4887.5

Considering the regression line as Y = a + bX, we can obtain the values of a and b from the

above values.

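$$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N \sum X^2 - (\sum X)^2} = \frac{796.2 \times 385 - 55 \times 4887.5}{10 \times 385 - 55 \times 55} = 45.73$$

$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2} = \frac{10 \times 4887.5 - 55 \times 796.2}{10 \times 385 - 55 \times 55} = 6.16$$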

Hence the regression line can be written as

Y = 45.73 + 6.16X

The slope b = 6.16 is the average increase in height per month.

For standard error of estimation, we note the calculated values of the variable against the

observed values,

When X = 1, Y1 = 45.73 + 6.16 = 51.89

for X = 2, Y2 = 45.73 + 6.16 x 2 = 58.05

Other values for X = 3 to X = 10 are calculated and are tabulated as follows:

Month (X)    Height (Y)    Yi        Y - Yi    (Y - Yi)²

1            52.5          51.89     0.61      0.372
2            58.7          58.05     0.65      0.423
3            65.0          64.21     0.79      0.624
4            70.2          70.37     -0.17     0.029
5            75.4          76.53     -1.13     1.277
6            81.1          82.69     -1.59     2.528
7            87.2          88.85     -1.65     2.723
8            95.5          95.01     0.49      0.240
9            102.2         101.17    1.03      1.061
10           108.4         107.33    1.07      1.145

Σ(Y - Yi)² = 10.421

Standard error of estimation

$$S_{yx} = \sqrt{\frac{1}{N} \sum (Y - Y_i)^2} = \sqrt{\frac{10.421}{10}} = 1.02$$
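
The least-squares fit and the standard error of Example 5-6 can be sketched as below. The function names `fit_line` and `standard_error` and the `ddof` parameter are assumptions made for illustration. With `ddof=0` the sum of squared residuals is divided by N, as in the solution above; many texts divide by N - 2 instead, which `ddof=2` reproduces.

```python
from math import sqrt

def fit_line(xs, ys):
    """Intercept a and slope b of Y = a + bX from the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

def standard_error(xs, ys, a, b, ddof=0):
    """Root of the mean squared residual; divisor is N - ddof."""
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (n - ddof))

months = list(range(1, 11))
heights = [52.5, 58.7, 65.0, 70.2, 75.4, 81.1, 87.2, 95.5, 102.2, 108.4]

a, b = fit_line(months, heights)
se_n = standard_error(months, heights, a, b)            # divisor N
se_n2 = standard_error(months, heights, a, b, ddof=2)   # divisor N - 2

print(round(a, 2), round(b, 2), round(se_n, 2), round(se_n2, 2))
```

Here the unrounded a and b are used for the residuals, so the sum of squares is marginally smaller than the hand value obtained with a = 45.73 and b = 6.16; the standard error still rounds to 1.02.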
Example 5-7
Given X = 4Y+5 and Y = kX + 4 are the lines of regression of X on Y and of Y on X

respectively. If k is positive, prove that it cannot exceed ¼.

If k = 1/16, find the means of the two variables and coefficient of correlation between them.

Solution: Line X = 4Y + 5 is regression line of X on Y

So bxy = 4

Similarly from regression line of Y on X , Y = kX + 4,

We get byx = k

Now

r² = bxy x byx

= 4k

Since 0 ≤ r² ≤ 1, we obtain 0 ≤ 4k ≤ 1,

$$\text{or} \quad 0 \le k \le \frac{1}{4}$$

Now for k = 1/16,

$$r^2 = 4 \times \frac{1}{16} = \frac{1}{4}$$

r = ± ½

= ½ since byx and bxy are positive

When k = 1/16, the regression line of Y on X becomes

$$Y = \frac{1}{16} X + 4$$

Or X – 16Y + 64 = 0

Since the lines of regression pass through the mean values of the variables, we obtain revised

equations as

X̄ - 4Ȳ - 5 = 0

X̄ - 16Ȳ + 64 = 0

Solving these two equations, we get

X̄ = 28 and Ȳ = 5.75
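
The argument of Example 5-7 can be checked with exact fractions. This is a sketch; the function name `analyse` is an assumption. It uses r² = 4k for the bound on k and substitutes X = 4Y + 5 into Y = kX + 4 to find the means.

```python
from fractions import Fraction
from math import sqrt

def analyse(k):
    """For X = 4Y + 5 (X on Y) and Y = kX + 4 (Y on X),
    return (r, mean_x, mean_y) for a positive k."""
    k = Fraction(k)
    r_squared = 4 * k                  # r^2 = bxy * byx with bxy = 4, byx = k
    if not 0 <= r_squared <= 1:
        raise ValueError("k must satisfy 0 <= k <= 1/4")
    if 1 - 4 * k == 0:
        raise ValueError("at k = 1/4 these two lines are parallel; no means exist")
    # Both lines pass through the means: Y = k(4Y + 5) + 4.
    mean_y = (5 * k + 4) / (1 - 4 * k)
    mean_x = 4 * mean_y + 5
    r = sqrt(r_squared)                # positive root: both coefficients positive
    return r, mean_x, mean_y

r, mean_x, mean_y = analyse(Fraction(1, 16))
print(r, mean_x, float(mean_y))
# prints: 0.5 28 5.75
```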

Example 5-8
A firm knows from its past experience that its monthly average expenses (X) on advertisement

are Rs 25,000 with standard deviation of Rs 25.25. Similarly, its average monthly product sales

(Y) have been Rs 45,000 with standard deviation of Rs 50.50. Given this information and also

the coefficient of correlation between sales and advertisement expenditure as 0.75, estimate

(i) the most appropriate value of sales against an advertisement expenditure of Rs

50,000

(ii) the most appropriate advertisement expenditure for achieving a sales target of

Rs 80,000

Solution: Given the following

X̄ = Rs 25,000    Sx = Rs 25.25

Ȳ = Rs 45,000    Sy = Rs 50.50

r = 0.75

(i) Using the equation

$$Y_c - \bar{Y} = r \frac{S_y}{S_x} (X - \bar{X}),$$

the most appropriate value of sales Yc for an advertisement expenditure X = Rs 50,000 is

$$Y_c - 45{,}000 = 0.75 \times \frac{50.50}{25.25} \times (50{,}000 - 25{,}000)$$

Yc = 45,000 + 37,500

= Rs 82,500

(ii) Using the equation

$$X_c - \bar{X} = r \frac{S_x}{S_y} (Y - \bar{Y}),$$

the most appropriate value of advertisement expenditure Xc for achieving a sales target Y = Rs 80,000 is

$$X_c - 25{,}000 = 0.75 \times \frac{25.25}{50.50} \times (80{,}000 - 45{,}000)$$

Xc = 13,125 + 25,000

= Rs 38,125
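
When only means, standard deviations and r are known, as in Example 5-8, both estimates follow from the two regression equations in summary form. The sketch below uses hypothetical function names `predict_y` and `predict_x`.

```python
def predict_y(x, mean_x, mean_y, s_x, s_y, r):
    """Estimate Y from X using the regression line of Y on X."""
    return mean_y + r * (s_y / s_x) * (x - mean_x)

def predict_x(y, mean_x, mean_y, s_x, s_y, r):
    """Estimate X from Y using the regression line of X on Y."""
    return mean_x + r * (s_x / s_y) * (y - mean_y)

# Advertisement expenditure (X) and sales (Y), figures in rupees.
stats = dict(mean_x=25000, mean_y=45000, s_x=25.25, s_y=50.50, r=0.75)

sales_estimate = predict_y(50000, **stats)
advert_estimate = predict_x(80000, **stats)

print(sales_estimate, advert_estimate)
# prints: 82500.0 38125.0
```

Note that the two functions are not inverses of each other: feeding the estimated sales of Rs 82,500 into `predict_x` does not return Rs 50,000, because the lines of Y on X and X on Y differ unless r = ±1.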

1.8 SELF-ASSESSMENT QUESTIONS


1. Explain clearly the concept of Regression. Explain with suitable examples its role in

dealing with business problems.

2. What do you understand by linear regression?

3. What is meant by ‘regression’? Why should there be, in general, two lines of regression

for each bivariate distribution? How are the two regression lines useful in studying

correlation between two variables?

4. Why is the regression line known as line of best fit?

5. Write short notes on

(i) Regression Coefficients

(ii) Regression Equations

(iii) Standard Error of Estimate

(iv) Coefficient of Determination

(v) Coefficient of Non-determination

6. Distinguish clearly between correlation and regression as concepts used in statistical

analysis.

7. Fit a least-square line to the following data:

(i) Using X as independent variable

(ii) Using X as dependent variable

X : 1 3 4 8 9 11 14
Y : 1 2 4 5 7 8 9

Hence obtain

c) The regression coefficients of Y on X and of X on Y

d) X̄ and Ȳ

e) Coefficient of correlation between X and Y

f) What is the estimated value of Y when X = 10 and of X when Y = 5?

8. What are regression coefficients? Show that r2 = byx. bxy where the symbols have their

usual meanings. What can you say about the angle between the regression lines when

(i) r = 0, (ii) r = 1 (iii) r increases from 0 to 1?

9. Obtain the equation of the line of regression of Y on X from the following data.

X : 12 18 24 30 36 42 48
Y : 5.27 5.68 6.25 7.21 8.02 8.71 8.42

Estimate the most probable value of Y, when X = 40.

10. The following table gives the ages and blood pressure of 9 women.

Age (X)            : 56   42   36   47   49   42   60   72   63
Blood Pressure (Y) : 147  125  118  128  145  140  155  160  149

Find the correlation coefficient between X and Y.

(i) Determine the least square regression equation of Y on X.

(ii) Estimate the blood pressure of a woman whose age is 45 years.

11. Given the following results for the height (X) and weight (Y) in appropriate units of

1,000 students:

X̄ = 68, Ȳ = 150, Sx = 2.5, Sy = 20 and r = 0.6.

Obtain the equations of the two lines of regression. Estimate the height of a student A

who weighs 200 units and also estimate the weight of the student B whose height is

60 units.

12. From the following data, find out the probable yield when the rainfall is 29”.

Rainfall Yield
Mean 25” 40 units per hectare
Standard Deviation 3” 6 units per hectare

Correlation coefficient between rainfall and production = 0.8.

13. A study of wheat prices at two cities yielded the following data:

                      City A      City B

Average Price         Rs 2.463    Rs 2.797
Standard Deviation    Rs 0.326    Rs 0.207

Coefficient of correlation r is 0.774. Estimate from the above data the most likely

price of wheat

(i) at City A corresponding to the price of Rs 2.334 at City B

(ii) at City B corresponding to the price of Rs 3.052 at City A

14. Find out the regression equation showing the regression of capacity utilisation on

production from the following data:

Average Standard Deviation


Production (in lakh units) 35.6 10.5
Capacity Utilisation (in percentage) 84.8 8.5

r = 0.62

Estimate the production, when capacity utilisation is 70%.

15. The following table shows the mean and standard deviation of the prices of two shares

in a stock exchange.

Share Mean (in Rs) Standard Deviation (in Rs)


A Ltd. 39.5 10.8
B Ltd. 47.5 16.0
If the coefficient of correlation between the prices of two shares is 0.42, find the most

likely price of share A corresponding to a price of Rs 55, observed in the case of share

B.

16. Find out the regression coefficients of Y on X and of X on Y on the basis of following

data:

ΣX = 50, X̄ = 5, ΣY = 60, Ȳ = 6, ΣXY = 350

Variance of X = 4, Variance of Y = 9

17. Find the regression equation of X on Y and the coefficient of correlation from the

following data:

ΣX = 60, ΣY = 40, ΣXY = 1150, ΣX² = 4160, ΣY² = 1720 and N = 10.


18. By using the following data, find out the two lines of regression and from them

compute the Karl Pearson’s coefficient of correlation.

ΣX = 250, ΣY = 300, ΣXY = 7900, ΣX² = 6500, ΣY² = 10000, N = 10

19. The equations of two regression lines between two variables are expressed as

2X – 3Y = 0 and 4Y – 5X-8 = 0.

(i) Identify which of the two can be called regression line of Y on X and of X on Y.

(ii) Find X̄ and Ȳ and correlation coefficient r from the equations

20. If the two lines of regression are

4X - 5Y + 30 = 0 and 20X – 9Y – 107 = 0

Which of these is the line of regression of X on Y? Find rxy and Sy when Sx = 3.

21. The regression equation of profits (X) on sales (Y) of a certain firm is 3Y – 5X + 108 =

0. The average sales of the firm were Rs 44,000 and the variance of profits is 9/16th of

the variance of sales. Find the average profits and the coefficient of correlation between

the sales and profits.
