
Chapter-2: Data Processing

INTRODUCTION
In lay terms, data are a collection of facts, such as measurements, observations or simply descriptions of things. Data are also the characteristics or
information, usually numerical, that are collected through observation. In a more technical sense, data are the set of values of qualitative or quantitative
variables about one or more persons or objects.

The numerical values collected through measurements, observations or descriptions need to be processed to become useful and valuable
information. In other words, the collected data need to be compiled, analysed and presented through different graphical methods to make them understandable
to all. For this, the following measures are used:

1. Measures of Central Tendency
2. Measures of Dispersions
3. Measures of Relationships

The measures of central tendency provide a value that is an ideal representative of a set of observations. The measures of dispersion take into account the
internal variations of the data, often around a measure of central tendency. The measures of relationship, on the other hand, provide the degree of association
between two or more related phenomena, such as rainfall and the incidence of floods, or fertilizer consumption and the yield of crops.
1. Measures of Central tendency
I. Introduction
One of the most important objectives of statistical analysis is to get one single value that describes the characteristics of an entire mass of unwieldy
data. Such a value is called the central value, an “average”, or the expected value of the variable. The word average is very commonly used in day-to-day
conversation. For example, we often talk about the average boy in a class, the average height of an Indian, average income, average marks of the class, etc. When
we say that “he is an average student”, we mean that he is neither a very good nor a very bad student, just a mediocre one. In statistics, however, the term
average has a different meaning.
An average in statistics is not just “mediocre”; it is the single value that represents a group of values. Such a value is significant because it depicts the
characteristics of the whole group. Since an average represents the entire data, it lies somewhere between the two extremes, i.e. the largest item and the
smallest item. For this reason, an average is frequently referred to as a “measure of central tendency”.

II. Objectives of Averaging
The main objectives of studying averages are:
A. To get the single value that describes the characteristics of the entire group.
B. To facilitate comparisons.

III. Requisites of a good average
A. Easy to understand
B. Simple to compute
C. Based on all the items
D. Not unduly affected by extreme observations
E. Rigidly defined
F. Capable of further algebraic treatment
G. Sampling stability
IV. Types of Averages:
The following are the types of averages:
A. Arithmetic mean:
i. Simple Arithmetic mean
ii. Weighted Arithmetic mean
B. Median
C. Mode
D. Geometric mean
E. Harmonic mean
Apart from these, there are other less important averages like moving averages, progressive averages, etc. These averages have a very limited field of
application and are therefore not so popular.
A. Arithmetic mean
The most popular and widely used measure for representing the entire data by one value is what most laymen call an “average” and what statisticians
call the arithmetic mean. Its value is obtained by adding together all the items and dividing this total by the number of items. The arithmetic mean may
be either
i. Simple Arithmetic mean
ii. Weighted Arithmetic mean

Both the simple arithmetic mean and the weighted arithmetic mean are calculated for 3 types of observations:

i. Individual observations
ii. Discrete series observations
a. Direct method.
b. Short-cut method.
iii. Continuous series observations
a. Direct method.
b. Short-cut method.
Computing Arithmetic Mean
Calculation of Arithmetic Mean- Individual Observations (To calculate the simple arithmetic mean from the data given below,
arrange the observations in a table, find the sum of all the observations and then divide that sum by the total number of
individuals or items. One has been done for you. Try doing the rest.)

Q. The following table gives the monthly income of 10 employees in an office:

Income (Rs.) 14780, 15760, 26690, 27750, 24840, 24920, 16100, 17810, 27050, 26950

Calculate the Arithmetic mean of the income by Direct Method and Short-cut method.

Soln.

Direct Method
Calculation of Arithmetic mean

Employees     Monthly Income (Rs.) (X)
1             14780
2             15760
3             26690
4             27750
5             24840
6             24920
7             16100
8             17810
9             27050
10            26950
N = 10        ΣX = 222650

x̄ = ΣX / N
x̄ = 222650 / 10
x̄ = 22265

The average income of the employees is Rs. 22265.
Short-cut Method

x̄ = A + Σd / N ; where A = assumed mean, d = deviation of each item from the assumed mean, i.e. (X − A), N = number of observations.

Employees     Monthly Income (Rs.) (X)     d = (X − A), A = 22000
1             14780                        −7220
2             15760                        −6240
3             26690                        +4690
4             27750                        +5750
5             24840                        +2840
6             24920                        +2920
7             16100                        −5900
8             17810                        −4190
9             27050                        +5050
10            26950                        +4950
N = 10        ΣX = 222650                  Σd = 2650

A = 22000 (it is taken as 22000 here because ΣX/N = 222650/10 = 22265, which rounds to 22000 at the nearest thousand), Σd = 2650, N = 10

x̄ = 22000 + 2650 / 10
x̄ = 22000 + 265
x̄ = 22265

The average income of the employees is Rs. 22265.
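The two methods above can be checked with a short Python sketch. The function names are illustrative only and are not part of the original notes; the data are the ten incomes from the question.

```python
incomes = [14780, 15760, 26690, 27750, 24840, 24920, 16100, 17810, 27050, 26950]

def mean_direct(values):
    """Direct method: sum of the items divided by the number of items."""
    return sum(values) / len(values)

def mean_shortcut(values, assumed_mean):
    """Short-cut method: A + (sum of deviations from A) / N."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)

direct = mean_direct(incomes)
shortcut = mean_shortcut(incomes, 22000)
print(direct, shortcut)
```

Any assumed mean gives the same answer; a value close to the true mean only keeps the deviations small when working by hand.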
Calculation of Arithmetic Mean- Discrete series observations
In the discrete series, Arithmetic mean may be computed by applying
i. Direct Method
ii. Short-cut Method

Calculation
Q. From the following data of marks obtained by 60 students of a class, calculate the arithmetic mean using the Direct Method as well as the Shortcut
method.

Marks 20 30 40 50 60 70
No. Of Students 8 12 20 10 6 4
Soln.
Direct Method
Let the marks be X and the number of students be f.

Marks (X)     No. of students (f)     f.X
20            8                       160
30            12                      360
40            20                      800
50            10                      500
60            6                       360
70            4                       280
              Σf or N = 60            Σf.X = 2460

x̄ = ΣfX / N ; where f = frequency, X = the variable in the question, N = total number of observations or Σf.

x̄ = 2460 / 60
x̄ = 41
Short-cut Method

Marks (X)     No. of students (f)     d = (X − A), here A = 40     f.d
20            8                       −20                          −160
30            12                      −10                          −120
40            20                      0                            0
50            10                      +10                          +100
60            6                       +20                          +120
70            4                       +30                          +120
              N = 60                                               Σf.d = 60

x̄ = A + Σfd / N ; where A = assumed mean (it can be any of the values of X, or any other value whether it exists in the data or not;
the final answer is the same in every case, but the nearer the assumed mean is to the actual mean, the lighter the calculations),
d = deviation, i.e. (X − A), N = total number of observations or Σf.

x̄ = 40 + 60 / 60
x̄ = 40 + 1
x̄ = 41
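As a cross-check of the discrete-series calculation, here is a minimal Python sketch. The function names are hypothetical; the marks and frequencies are those given in the question.

```python
marks = [20, 30, 40, 50, 60, 70]
students = [8, 12, 20, 10, 6, 4]

def frequency_mean_direct(xs, fs):
    """Direct method for a discrete series: sum(f * X) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)

def frequency_mean_shortcut(xs, fs, assumed_mean):
    """Short-cut method: A + sum(f * d) / sum(f), with d = X - A."""
    total_fd = sum(f * (x - assumed_mean) for x, f in zip(xs, fs))
    return assumed_mean + total_fd / sum(fs)

mean_a = frequency_mean_direct(marks, students)
mean_b = frequency_mean_shortcut(marks, students, 40)
print(mean_a, mean_b)
```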
Calculation of Arithmetic Mean- Continuous series observations
In the Continuous series, Arithmetic mean may be computed by applying
i. Direct Method
ii. Short-cut Method

Calculation
Q. From the following data compute arithmetic mean by direct and Short-cut Method

Marks 0 -10 10 – 20 20 - 30 30 – 40 40 – 50 50 - 60
No. Of Students 5 10 25 30 20 10

Soln.
Direct Method

Marks (X)     No. of students (f)     Mid-points (m)     f.m
0 - 10        5                       5                  25
10 - 20       10                      15                 150
20 - 30       25                      25                 625
30 - 40       30                      35                 1050
40 - 50       20                      45                 900
50 - 60       10                      55                 550
              Σf or N = 100                              Σf.m = 3300

x̄ = Σfm / N ; where f = frequency, m = mid-point of each class, N = total number of observations or Σf.

x̄ = 3300 / 100
x̄ = 33

Therefore, the average mark of the students is 33.


Short-cut Method

Marks (X)     No. of students (f)     Mid-points (m)     d = (m − 35)     f.d
0 - 10        5                       5                  −30              −150
10 - 20       10                      15                 −20              −200
20 - 30       25                      25                 −10              −250
30 - 40       30                      35                 0                0
40 - 50       20                      45                 +10              +200
50 - 60       10                      55                 +20              +200
              Σf or N = 100                                               Σf.d = −200

x̄ = A + Σfd / N ; where A = assumed mean (it can be any of the mid-points, or any other value whether it exists in the data or not;
the final answer is the same in every case, but the nearer the assumed mean is to the actual mean, the lighter the calculations),
d = deviation of the mid-point from the assumed mean, i.e. (m − A), N = total number of observations or Σf.

x̄ = 35 + (−200) / 100
x̄ = 35 + (−2)
x̄ = 33
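For a continuous series the only extra step is replacing each class by its mid-point. The sketch below assumes each class is stored as a (lower limit, upper limit) pair; the variable names are illustrative.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [5, 10, 25, 30, 20, 10]

# Mid-point of each class: (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(freqs)

# Direct method: sum(f * m) / N
grouped_mean_direct = sum(f * m for m, f in zip(midpoints, freqs)) / n

# Short-cut method with assumed mean A = 35: A + sum(f * (m - A)) / N
assumed = 35
grouped_mean_shortcut = assumed + sum(f * (m - assumed) for m, f in zip(midpoints, freqs)) / n

print(grouped_mean_direct, grouped_mean_shortcut)
```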
B. Median
The median by definition refers to the middle value in a distribution. One-half of the items in the distribution have a value equal to
or smaller than the median, and the other half have a value equal to or larger than the median. The
median is simply the 50th percentile, the value below which 50% of the values in the sample fall. It splits the observations into two halves.

As distinct from the arithmetic mean, which is calculated from the value of each item in the series, the median is known as a positional average.
The term “position” refers to the place of a value in the series. The place of the median in a series is such that an equal number of items lie on
either side of it.

Since its location is based on position, when the number of observations is odd the median is the value at the middle of the
series, whereas when the number of observations is even there is no single middle value, and the median is taken to be the
arithmetic mean of the two middlemost items. Thus, when N is odd, the median is an actual value with the remainder of the series in two equal
parts on either side of it, but when N is even, the median is a derived figure, i.e. half the sum of the two middle values.

i. Merits
a. It is especially useful in the case of open-end classes, since only the position and not the values of the items must be known. The median is also
recommended if the distribution has unequal classes, since it is easier to compute than the mean.
b. Extreme values do not affect the median as strongly as they do the mean.
c. In markedly skewed distributions, such as income distributions or price distributions, where the arithmetic mean would be distorted by
extreme values, the median is especially useful. Consequently, the median income may for some purposes be regarded as a more representative
figure, for half the income earners must be receiving at least the median income and the other half at most the median income.
d. It is the most appropriate average in dealing with qualitative data, i.e. where ranks are given or where the items are not
counted or measured but are scored.
e. The value of the median can be determined graphically, whereas the value of the mean cannot be ascertained graphically.
f. Perhaps the greatest advantage of the median, however, is that it actually does indicate what many people incorrectly believe
the arithmetic mean indicates. The median indicates the value of the middle item in the distribution. This is a clear-cut meaning and makes the
median a measure that can be easily explained.
ii. Limitations
a. For calculating the median, it is necessary to arrange the data; other averages do not need any arrangement.
b. Since it is a positional average, its value is not determined by each and every observation.
c. It is not capable of algebraic treatment.
d. The value of the median is affected more by sampling fluctuations than the value of the arithmetic mean.
e. The median, in some cases, cannot be computed as exactly as the mean. When the number of items included in a series of data is even, the
median is determined approximately as the mid-point of the two middle items.
f. It is erratic if the number of items is small.
Computing Median
Calculation of Median- Individual series observations
Q. From the following data of the wages of 7 workers, compute the median wage.
Wages (in Rs.) 14100 14150 16080 17120 15200 16160 17400

Soln.
A. Arrange the data in ascending or descending order; here we arrange the data in ascending order.
Sl. No. 1 2 3 4 5 6 7
Wages (in Rs.) 14100 14150 15200 16080 16160 17120 17400

B. Arrange the data in a table and calculate the median.

Calculation of median

Sl. No.     Wages arranged in ascending order
1           14100
2           14150
3           15200
4           16080
5           16160
6           17120
7           17400

Median = size of the ((N + 1) / 2)th item
Median = size of the ((7 + 1) / 2)th item = (8 / 2)th item = 4th item
Median = Rs. 16080.00

Interpretation
We thus find that the median is the middlemost item: 3 persons get a wage of less than Rs. 16080 and an equal number, i.e. 3 persons, get more than Rs. 16080.
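The positional rule can be sketched in Python as follows. The function name is illustrative; it also covers the even-N case described earlier, where the median is half the sum of the two middle items.

```python
def median_of_items(values):
    """Median of individual observations: sort, then pick the middle item."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd N: the ((N + 1) / 2)th item, which is index `middle` when counting from 0
        return ordered[middle]
    # Even N: arithmetic mean of the two middlemost items
    return (ordered[middle - 1] + ordered[middle]) / 2

wages = [14100, 14150, 16080, 17120, 15200, 16160, 17400]
median_wage = median_of_items(wages)
print(median_wage)
```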
Calculation of Median- Discrete series observations
Q. From the following data of the income of some individuals, find the median income.
Income (in Rs.) 15000 15500 16800 18000 18500 17800
No. of Persons 24 26 20 16 6 30

Soln.
A. Arrange the data in ascending or descending order; here we arrange the data in ascending order.
Income (in Rs.) 15000 15500 16800 17800 18000 18500
No. of Persons 24 26 20 30 16 6

B. Arrange the data in a table and calculate the median.

Calculation of median
Income (in Rs.)     No. of Persons     C.f.
15000               24                 24
15500               26                 50
16800               20                 70
17800               30                 100
18000               16                 116
18500               6                  122

Median = size of the ((N + 1) / 2)th item
Median = size of the ((122 + 1) / 2)th item = (123 / 2)th item = 61.5th item

Size of the 61.5th item = 16800 (there is no C.f. equal to 61.5, so we take the first C.f. that is equal to or greater than 61.5, which is 70; the 51st to the
70th items all have the value 16800)

Interpretation
The number of items (122) is even, so the median lies between the 61st and the 62nd items; both of these fall in the income group of Rs. 16800, so the median is Rs. 16800. This group is the
3rd from the top in the table: there are 2 income groups earning less and 3 income groups earning more than the median income group.
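The cumulative-frequency lookup can be sketched as below. The function name is hypothetical, and the rule it applies (take the first cumulative frequency that reaches the (N + 1)/2 position) is the convention used in this worked example, not a general-purpose median routine.

```python
from itertools import accumulate

def discrete_median(values, freqs):
    """Median of a discrete series via cumulative frequencies."""
    pairs = sorted(zip(values, freqs))          # arrange in ascending order of value
    n = sum(f for _, f in pairs)
    position = (n + 1) / 2                      # the ((N + 1) / 2)th item
    running = accumulate(f for _, f in pairs)   # cumulative frequencies
    for (value, _), cf in zip(pairs, running):
        if cf >= position:                      # first C.f. that reaches the position
            return value
    raise ValueError("empty series")

incomes = [15000, 15500, 16800, 18000, 18500, 17800]
persons = [24, 26, 20, 16, 6, 30]
median_income = discrete_median(incomes, persons)
print(median_income)
```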
Calculation of Median- Continuous series observations
Q. From the following data of marks of some students, find the median marks of the students.
Marks 45-50 40-45 35-40 30-35 25-30 20-25 15-20 10-15 5-10
No. of students 10 15 26 30 42 31 24 15 7

Soln.

A. Arrange the data in ascending or descending order; here we arrange the data in ascending order.
Marks 5-10 10-15 15-20 20-25 25-30 30-35 35-40 40-45 45-50
No. of students 7 15 24 31 42 30 26 15 10

B. Arrange the data in a table and calculate the median.

Calculation of median

Marks     No. of Students     C.f.
5-10      7                   7
10-15     15                  22
15-20     24                  46
20-25     31                  77
25-30     42                  119
30-35     30                  149
35-40     26                  175
40-45     15                  190
45-50     10                  200

Median = size of the (N / 2)th item
Median = size of the (200 / 2)th item = 100th item

The median lies in the class 25-30. (There is no C.f. equal to 100, so we take the first class whose C.f. is equal to or greater than 100; that C.f. is
119, so its class, 25-30, is the median class.)

Median = L + ((N/2 − C.f.) / f) * i ; where L = lower limit of the median class, i.e. the class in which the middle item of the distribution lies,
C.f. = cumulative frequency of the class preceding the median class, i.e. the sum of the frequencies of all the classes lower than the median class,
f = frequency of the median class, i = class interval of the median class.

Median = 25 + ((200/2 − 77) / 42) * 5
Median = 25 + ((100 − 77) / 42) * 5
Median = 25 + (23 / 42) * 5
Median = 25 + 2.74
Median = 27.74

The median mark of the students is 27.74.
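The interpolation formula can be sketched in Python as follows, assuming each class is stored as a (lower limit, upper limit) pair in ascending order. The function name is illustrative.

```python
def grouped_median(classes, freqs):
    """Median of a continuous series: L + ((N/2 - C.f.) / f) * i."""
    n = sum(freqs)
    target = n / 2                  # position of the (N / 2)th item
    cf_before = 0                   # C.f. of the classes below the current one
    for (lower, upper), f in zip(classes, freqs):
        if cf_before + f >= target:
            # This is the median class: interpolate inside it
            return lower + (target - cf_before) / f * (upper - lower)
        cf_before += f
    raise ValueError("empty distribution")

classes = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30),
           (30, 35), (35, 40), (40, 45), (45, 50)]
freqs = [7, 15, 24, 31, 42, 30, 26, 15, 10]
median_mark = grouped_median(classes, freqs)
print(round(median_mark, 2))
```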
C. Mode
The mode or the modal value is that value in a series of observations which occurs with the greatest frequency. For example, the mode of the
series 3, 5, 8, 5, 4, 5, 9, 3 is 5, since this value occurs more often than any of the others. The mode is often said to be that value which occurs most often in
the data, that is, with the highest frequency. While this statement is quite helpful in interpreting the mode, it cannot safely be applied to every distribution,
because of the vagaries of sampling. Even fairly large samples drawn from a statistical population with a single well-defined mode may exhibit very erratic
fluctuations in this average if the mode is defined as that exact value in the ungrouped data of each sample which occurs most frequently. Rather, it should
be thought of as the value about which the items are most closely concentrated. It is the value which has the greatest frequency density in its immediate
neighbourhood. For this reason it is also called the most typical or fashionable value of a distribution.

Merits (look at the pictures sent in the group and write down the merits and limitations)
i. ....
ii. ....
iii. ....
iv. ....
v. ....

Limitations
i. ....
ii. ....
iii. ....
iv. ....
v. ....
Computing Mode
Calculation of Mode- Individual series observations
Q. Calculate the mode from the following data of the marks obtained by 10 students.
10, 27, 24, 12, 27, 27, 20, 18, 15, 30

Soln.

i. Arrange the items in ascending or descending order.

10, 12, 15, 18, 20, 24, 27, 27, 27, 30

ii. Create a statistical table for the calculation of mode.

Calculation of mode

Size of the item     No. of times it occurs
10                   1
12                   1
15                   1
18                   1
20                   1
24                   1
27                   3
30                   1

Since the number 27 occurs the maximum number of times, i.e. 3, the modal mark is 27.
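The tally table above is exactly what Python's `collections.Counter` builds. The following is a small sketch using the marks from the question; the variable names are illustrative.

```python
from collections import Counter

marks = [10, 27, 24, 12, 27, 27, 20, 18, 15, 30]

# Tally how many times each mark occurs
tally = Counter(marks)

# most_common(1) returns the (value, count) pair with the highest count.
# If several values tie for the highest count, only one of them is returned,
# so a series with more than one mode needs a separate check.
modal_mark, occurrences = tally.most_common(1)[0]
print(modal_mark, occurrences)
```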
Calculation of Mode- Continuous series observations

Q. Calculate the mode from the following data.

Marks No. of Students


0-10 3
10-20 5
20-30 7
30-40 10
40-50 12
50-60 15
60-70 12
70-80 6
80-90 2
90-100 8

Soln.

i. Convert to Cumulative frequency.

Cumulative Frequency
Marks No. of Students (f)
(C.f.)
0-10 3 3
10-20 5 8
20-30 7 15
30-40 10 25
40-50 12 37
50-60 15 52
60-70 12 64
70-80 6 70
80-90 2 72
90-100 8 80

ii. Since this is a continuous series, we have to see which class has the highest frequency among all the classes. The class with the highest
frequency is the modal class.

By inspection, the class 50 – 60 has the highest frequency (15) and hence is the modal class for the entire data set.

Mₒ = L + ((f1 − fₒ) / (2f1 − fₒ − f2)) * i ; where L = lower limit of the modal class, f1 = frequency of the modal class, fₒ = frequency of the class preceding the
modal class, f2 = frequency of the class succeeding the modal class, i = class interval of the modal class.

Mₒ = 50 + ((15 − 12) / (2 * 15 − 12 − 12)) * 10
Mₒ = 50 + (3 / 6) * 10
Mₒ = 50 + 5
Mₒ = 55
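A minimal Python sketch of the modal-class formula follows. The function name is hypothetical, and treating a missing neighbouring class as frequency 0 is an assumption made here for the edge cases, not something the notes specify.

```python
def grouped_mode(classes, freqs):
    """Mode of a continuous series: L + ((f1 - f0) / (2*f1 - f0 - f2)) * i."""
    k = max(range(len(freqs)), key=lambda idx: freqs[idx])   # index of the modal class
    lower, upper = classes[k]
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                # class preceding the modal class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0   # class succeeding the modal class
    # Note: the denominator is zero when f1 == f0 == f2; that case is not handled here.
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [3, 5, 7, 10, 12, 15, 12, 6, 2, 8]
mode_value = grouped_mode(classes, freqs)
print(mode_value)
```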
2. Measures of Dispersions
I. Introduction
The various measures of central value discussed in the previous portion give us one single figure that represents the entire data. But the average alone
cannot adequately describe a set of observations unless all the observations are the same. It is necessary to describe the variability or the dispersion of
the observations. In two or more distributions the central value may be the same, yet there can be wide disparities in the formation of the distributions.
Measures of dispersion help us in studying this important characteristic of a distribution.

Dispersion measures the extent to which the items vary from some central value. Since measures of dispersion give an average of the differences of
the various items from an average, they are also known as averages of the 2nd order.

II. Significance of measuring Variation


Measures of variation are needed for four basic purposes:
A. To determine the reliability of an average.
B. To serve as a basis for the control of the variability.
C. To compare two or more series with regard to their variability.
D. To facilitate the use of other statistical measures.

III. Properties of good Measure of Variation


A. It should be easy to understand.
B. It should be easy to compute.
C. It should be rigidly defined.
D. It should be based on each and every item of the distribution.
E. It should be amenable to further algebraic treatment.
F. It should have sampling stability.
G. It should not be unduly affected by extreme items.
IV. Methods of studying Variations
A. The Range
B. The Inter-quartile Range and the Semi-interquartile Range or Quartile Deviation
C. The Mean Deviation or Average Deviation
D. The Standard Deviation
E. The Variance
F. The Lorenz curve
A. The Range
The range is the simplest method of studying dispersion. It is defined as the difference between the value of the smallest item and the largest Item
included in the distribution.
The range is calculated by the following method:
Range = L − S ; where L = largest item and S = smallest item

The relative measure corresponding to the range is called the co-efficient of range. It is calculated by the following method:
Co-efficient of range = (L − S) / (L + S)

Merits

i. It is the simplest method and the easiest to compute.
ii. It takes minimum time to calculate the value of the range. Hence, if one is interested in getting a quick rather than an accurate
picture of variability, one may compute the range.

Limitations

i. It is not based on each and every item of the distribution.
ii. It is subject to fluctuations of considerable magnitude from sample to sample.
iii. The range cannot tell us anything about the character of the distribution within the two extreme observations.

Uses
Despite serious limitations, range is useful in the following cases:

i. Quality control.
ii. Fluctuations in the share prices.
iii. Weather forecast
iv. Everyday life.
Calculation of Range – Individual
Q. The following are the prices of the shares of XYZ Co. Ltd. from Monday to Saturday:

Day           Price (Rs.)
Monday        200
Tuesday       210
Wednesday     208
Thursday      160
Friday        220
Saturday      250

Calculate the range and its co-efficient.

Soln.

I. Range = L − S

Here, L = 250 and S = 160

Range = 250 − 160
Range = 90

Therefore the range is Rs. 90.

II. Co-efficient of range = (L − S) / (L + S)

Co-efficient of range = (250 − 160) / (250 + 160)
Co-efficient of range = 90 / 410
Co-efficient of range = 0.22
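The same calculation in a few lines of Python, as a sketch; the dictionary layout and variable names are illustrative.

```python
prices = {
    "Monday": 200, "Tuesday": 210, "Wednesday": 208,
    "Thursday": 160, "Friday": 220, "Saturday": 250,
}

largest = max(prices.values())    # L
smallest = min(prices.values())   # S

price_range = largest - smallest                            # Range = L - S
coefficient_of_range = price_range / (largest + smallest)   # (L - S) / (L + S)

print(price_range, round(coefficient_of_range, 2))
```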


B. The Inter-Quartile Range and the Semi-interquartile Range
i. Inter-Quartile Range

The range discussed above has certain limitations. It is based on the two extreme items and it fails to take into account the scatter within the range. From
this there is reason to believe that if the dispersion of the extreme items is discarded, the measure would be more instructive. For this purpose another
measure called the interquartile range has been developed. It includes the middle 50% of the distribution, i.e. one quarter of the observations at the
lower end and one quarter at the upper end are excluded from computing the interquartile range. In other words, the interquartile
range represents the difference between the 3rd quartile and the 1st quartile.

It is represented as: Interquartile range = Q3 − Q1

ii. Semi-Interquartile range

The interquartile range is very often reduced to the form of the semi-interquartile range or Quartile Deviation by dividing it by 2. It gives the average
amount by which the two quartiles differ from the median. In a symmetrical distribution, the two quartiles are equidistant from the median. Exactly 50% of
the observations lie below the median, which is why it is also known as Q2. When the Q.D. is small, it indicates high uniformity or small variation among the central 50% of the
items, and a high Q.D. means that the variation among the central items is large.

The Q.D. is calculated by the formula:

Q.D. = (Q3 − Q1) / 2

The Q.D. is an absolute measure of dispersion. The relative measure of dispersion is the co-efficient of Q.D., which is calculated by

Co-efficient of Q.D. = ((Q3 − Q1) / 2) / ((Q3 + Q1) / 2) = (Q3 − Q1) / (Q3 + Q1)

Merits
i. It has a special utility in measuring variation in the case of open-end distributions, or distributions in which the data may be ranked but not measured
quantitatively.

ii. It is also useful in measuring variation in erratic or badly skewed distributions, where the other measures of dispersion would be warped by extreme
values. It is not affected by extreme values.
Limitations
i. Q.D. ignores 50% of the observations, i.e. the first 25% and the last 25%.
ii. It is not capable of mathematical manipulation.
iii. Its value is very much affected by sampling fluctuations.
iv. It is not a true measure of dispersion, as it does not show the scatter around an average but rather a distance on a scale.
Calculation of Quartile Deviation – Individual series observation

Q. Find out the Quartile Deviation and its co-efficient from the following data.
Roll no. 1 2 3 4 5 6 7
Marks 20 28 40 12 30 15 50

Soln.

Calculation of the Quartile Deviation

I. Arrangement of marks.

Marks 12 15 20 28 30 40 50

II. Calculating Q1

Q1 = size of the ((N + 1) / 4)th item
Q1 = size of the ((7 + 1) / 4)th item
Q1 = size of the 2nd item

The size of the 2nd item is 15. Thus Q1 = 15.

III. Calculating Q3

Q3 = size of the (3(N + 1) / 4)th item
Q3 = size of the (3 * (7 + 1) / 4)th item
Q3 = size of the (24 / 4)th item
Q3 = size of the 6th item

The size of the 6th item is 40. Thus Q3 = 40.

IV. Calculating the Quartile Deviation

Q.D. = (Q3 − Q1) / 2
Q.D. = (40 − 15) / 2
Q.D. = 12.5

V. Co-efficient of Q.D.

Co-efficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)
Co-efficient of Q.D. = (40 − 15) / (40 + 15)
Co-efficient of Q.D. = 25 / 55 = 0.455
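The quartile positions used above can be sketched in Python. The helper `item_at` is hypothetical; its linear interpolation for fractional positions is an assumption added for completeness, since in this example both positions are whole numbers.

```python
def item_at(ordered, position):
    """Value at a 1-based position in sorted data (linear interpolation if fractional)."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

marks = sorted([20, 28, 40, 12, 30, 15, 50])
n = len(marks)

q1 = item_at(marks, (n + 1) / 4)        # ((N + 1) / 4)th item
q3 = item_at(marks, 3 * (n + 1) / 4)    # (3(N + 1) / 4)th item

quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, round(coefficient_qd, 3))
```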


Calculation of Quartile Deviation – Continuous series observation

Q. Calculate the Q.D. and the co-efficient of Q.D. from the following data.

Wages (in Rs./day) 33-35 35-37 37-39 39-41 41-43


No. of Wage earners 14 62 99 18 7

Soln.

Calculation of Q.D. and its Co-efficient

Wages (in Rs./day) f C.f.


33-35 14 14
35-37 62 76
37-39 99 175
39-41 18 193
41-43 7 200

I. Calculating Q1

Q1 = size of the (N / 4)th item = (200 / 4)th item = 50th item

Q1 lies in the class 35-37. (There is no C.f. equal to 50, so we take the first class whose C.f. is equal to or greater than 50; that C.f. is 76, so
35-37 is the quartile class. Its frequency is 62 and the C.f. of the class preceding it is 14.)

Q1 = L + ((N/4 − C.f.) / f) * i ; where L = 35, N/4 = 50, C.f. = 14, f = 62, i = 2

Q1 = 35 + ((200/4 − 14) / 62) * 2
Q1 = 35 + ((50 − 14) / 62) * 2
Q1 = 35 + (36 / 62) * 2
Q1 = 35 + 0.5806 * 2
Q1 = 35 + 1.1613
Q1 = 36.1613

II. Calculating Q3

Q3 = size of the (3N / 4)th item = (3 * 200 / 4)th item = (600 / 4)th item = 150th item

Q3 lies in the class 37-39. (There is no C.f. equal to 150, so we take the first class whose C.f. is equal to or greater than 150; that C.f. is 175, so
37-39 is the quartile class. Its frequency is 99 and the C.f. of the class preceding it is 76.)

Q3 = L + ((3N/4 − C.f.) / f) * i ; where L = 37, 3N/4 = 150, C.f. = 76, f = 99, i = 2

Q3 = 37 + ((3 * 200/4 − 76) / 99) * 2
Q3 = 37 + ((150 − 76) / 99) * 2
Q3 = 37 + (74 / 99) * 2
Q3 = 37 + 0.7475 * 2
Q3 = 37 + 1.4949
Q3 = 38.4949

III. Calculating the Quartile Deviation

Q.D. = (Q3 − Q1) / 2
Q.D. = (38.4949 − 36.1613) / 2
Q.D. = 2.3336 / 2
Q.D. = 1.1668

IV. Co-efficient of Q.D.

Co-efficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)
Co-efficient of Q.D. = (38.4949 − 36.1613) / (38.4949 + 36.1613)
Co-efficient of Q.D. = 2.3336 / 74.6562
Co-efficient of Q.D. = 0.0313
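The quartile formulas for a continuous series can be checked with the sketch below. The function name is hypothetical and classes are assumed to be (lower limit, upper limit) pairs in ascending order. A useful sanity check is that a quartile must fall inside its own class, so Q1 here has to lie between 35 and 37.

```python
def grouped_quantile(classes, freqs, fraction):
    """Interpolated quantile: L + ((fraction * N - C.f.) / f) * i."""
    n = sum(freqs)
    target = fraction * n           # N/4 for Q1, 3N/4 for Q3
    cf_before = 0                   # C.f. of the classes below the current one
    for (lower, upper), f in zip(classes, freqs):
        if cf_before + f >= target:
            return lower + (target - cf_before) / f * (upper - lower)
        cf_before += f
    raise ValueError("empty distribution")

classes = [(33, 35), (35, 37), (37, 39), (39, 41), (41, 43)]
freqs = [14, 62, 99, 18, 7]

q1 = grouped_quantile(classes, freqs, 0.25)
q3 = grouped_quantile(classes, freqs, 0.75)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(round(q1, 4), round(q3, 4), round(quartile_deviation, 4), round(coefficient_qd, 4))
```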
C. The Mean Deviation
The two measures of dispersion discussed before, namely the Range and the Q.D., are not measures of dispersion in the true sense, as they do not show the scatter
around an average. To study the formation of a distribution, we should take deviations from an average. The other two measures, the Mean Deviation and the
Standard Deviation, help us in achieving this goal.

The Mean deviation is also known as the Average deviation. It is the average difference between the items in a distribution and the median or mean of that
series. Theoretically there is an advantage in taking the deviations from the median because the sum of the deviations of the items from the median is
minimum when signs are ignored. However in practice, the arithmetic mean is more frequently used in calculating the value of average deviation and this is
the reason why it is also called mean deviation.

Merits
i. It is simple to understand and easy to compute.
ii. It is based on each and every item of the data. Consequently change in the value of any item would change the value of the mean deviation.
iii. It is less affected by the values of extreme items than the standard Deviation

Limitations
i. The greatest drawback is that it ignores the algebraic signs while taking the deviations of the items. This is mathematically wrong and makes it
non-algebraic.
ii. This method will not give us the accurate result. The reason is that the mean deviation gives us the best result when deviations are taken from
median. But median is not a satisfactory measure when the degree of variability is high. And if we compute the mean deviation from the mean
that is also not desirable because the sum of the deviations from mean (ignoring the signs) is greater than the sum of the deviations from
median (ignoring the signs). If the Mean Deviation is computed from the mode that is also not scientific because the value of mode cannot
always be determined.
iii. It is not capable of further algebraic treatment.
iv. It is rarely used in sociological studies.
Calculation of Mean Deviation – Individual series observation

Q. Calculate the Mean Deviation and its co-efficient of the income earned by a group of individuals.

Group (Rs.) 14000 14800 15200 16000 18800

Soln.

Mean Deviation = (1 / N) * Σ│X − A│
Or Mean Deviation = (1 / N) * Σ│D│
Or Mean Deviation = Σ│D│ / N

I. Computing the median of the series

Median = size of the ((N + 1) / 2)th item = ((5 + 1) / 2)th item = (6 / 2)th item = 3rd item

Size of the 3rd item = 15200

II. Computing the mean deviation of the series

Sl. No.     Income (Rs.)     Deviation (D) from the median value, i.e. 15200     │D│ (ignoring the signs)
1           14000            −1200                                               1200
2           14800            −400                                                400
3           15200            0                                                   0
4           16000            +800                                                800
5           18800            +3600                                               3600
N = 5                                                                            Σ│D│ = 6000

Mean Deviation = Σ│D│ / N ; │D│ = deviation from the median ignoring signs

Mean Deviation = 6000 / 5 = 1200

Interpretation
This means that the average deviation of the individual incomes from the median income is Rs. 1200.

Co-efficient of Mean Deviation

Co-efficient of Mean Deviation = Mean Deviation / Median = 1200 / 15200 = 0.0789
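A short Python sketch of the mean deviation from the median, using the five incomes from the question; the variable names are illustrative. `statistics.median` is the standard-library median function.

```python
from statistics import median

incomes = [14000, 14800, 15200, 16000, 18800]

med = median(incomes)                                  # middle item of the sorted series
absolute_deviations = [abs(x - med) for x in incomes]  # |D|, signs ignored

mean_deviation = sum(absolute_deviations) / len(incomes)
coefficient_md = mean_deviation / med

print(med, mean_deviation, round(coefficient_md, 4))
```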
Calculation of Mean Deviation – Discrete series observation

Q. Calculate the Mean Deviation and its co-efficient from the following series.

X 10 11 12 13 14
f 3 12 18 12 3
Soln.

Mean Deviation = (1 / N) * Σf│X − A│
Or Mean Deviation = (1 / N) * Σf│D│
Or Mean Deviation = Σf│D│ / N

X      f      C.f.     Deviation (D) from the median value, i.e. (X − A)     │D│ (ignoring the signs)     f│D│
10     3      3        −2                                                    2                            6
11     12     15       −1                                                    1                            12
12     18     33       0                                                     0                            0
13     12     45       +1                                                    1                            12
14     3      48       +2                                                    2                            6
       N = 48                                                                                             Σf│D│ = 36

I. Computing the median of the series

Median = size of the ((N + 1) / 2)th item = ((48 + 1) / 2)th item = (49 / 2)th item = 24.5th item

Size of the 24.5th item (in X) = 12, since the first C.f. equal to or greater than 24.5 is 33; hence median = 12

II. Computing the mean deviation of the series

Mean Deviation = Σf│D│ / N ; │D│ = deviation from the median ignoring signs

Mean Deviation = 36 / 48 = 0.75

Interpretation
This means that the average deviation from the median is 0.75.

Co-efficient of Mean Deviation

Co-efficient of Mean Deviation = Mean Deviation / Median = 0.75 / 12 = 0.0625
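For a discrete series each absolute deviation is weighted by its frequency. A minimal sketch follows; the variable names are illustrative, and the median is located with the same cumulative-frequency rule as in the worked example.

```python
from itertools import accumulate

xs = [10, 11, 12, 13, 14]
fs = [3, 12, 18, 12, 3]

n = sum(fs)
position = (n + 1) / 2    # the ((N + 1) / 2)th item

# Median: the first value whose cumulative frequency reaches the position
median_x = next(x for x, cf in zip(xs, accumulate(fs)) if cf >= position)

# Mean deviation: sum(f * |X - median|) / N
weighted_abs_dev = sum(f * abs(x - median_x) for x, f in zip(xs, fs))
mean_deviation = weighted_abs_dev / n

print(median_x, weighted_abs_dev, mean_deviation)
```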
Calculation of Mean Deviation –Continuous series observation

Q. Calculate the Mean Deviation and its co-efficient from the following series.

Size         0-10    10-20    20-30    30-40    40-50    50-60    60-70
Frequency    7       12       18       25       16       14       8

Soln.

Size     Frequency (f)    C.f.    M.P. (m) = (UCL + LCL)/2    D = (m − A), here A = 35.2    │D│ (ignoring the signs)    f│D│
0-10     7                7       5                           -30.2                         30.2                        211.4
10-20    12               19      15                          -20.2                         20.2                        242.4
20-30    18               37      25                          -10.2                         10.2                        183.6
30-40    25               62      35                          -0.2                          0.2                         5.0
40-50    16               78      45                          +9.8                          9.8                         156.8
50-60    14               92      55                          +19.8                         19.8                        277.2
60-70    8                100     65                          +29.8                         29.8                        238.4
N = 100                                                                                                                 Σf│D│ = 1314.8

I. Computing the median of the series

Location of Median class = size of the (N/2)th item = 100/2 = 50th item

Median class = 30-40, frequency of the median class (f) = 25, C.f. of the median class = 62, C.f. of the class preceding the median class = 37, class interval (i) = 10

Median = L + ((N/2 − C.f.) / f) × i, where C.f. is the cumulative frequency of the class preceding the median class

Median = 30 + ((100/2 − 37) / 25) × 10
Median = 30 + ((50 − 37) / 25) × 10
Median = 30 + (13/25) × 10
Median = 30 + 0.52 × 10
Median = 30 + 5.2
Median = 35.2

II. Computing the mean deviation of the series

Mean Deviation = Σf│D│ / N ; │D│ = Deviation from the median ignoring signs

Mean Deviation = 1314.8/100 = 13.148
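For a continuous series the median is interpolated inside the median class before the deviations are taken. The sketch below follows that order; it assumes equal class widths, and the names `grouped_median`, `lowers` and `mids` are illustrative only.

```python
def grouped_median(lower_limits, width, freqs):
    """Median = L + ((N/2 - C.f.) / f) * i, where C.f. is the cumulative
    frequency of the class preceding the median class."""
    n = sum(freqs)
    cumulative = 0  # C.f. of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("empty series")


lowers = [0, 10, 20, 30, 40, 50, 60]   # lower class limits
fs = [7, 12, 18, 25, 16, 14, 8]
mids = [lo + 5 for lo in lowers]       # mid-points (LCL + UCL) / 2

median = grouped_median(lowers, 10, fs)
mean_dev = sum(f * abs(m - median) for m, f in zip(mids, fs)) / sum(fs)

print(round(median, 1), round(mean_dev, 3))
```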
D. The Standard Deviation
The concept of the Standard Deviation was introduced by Karl Pearson in 1893. It is by far the most widely used measure of dispersion. Its significance
lies in the fact that it is free from the defects from which the earlier methods suffer and also satisfies most of the properties of a good measure of dispersion. It
is also known as the root mean square deviation because it is the square root of the mean of the squared deviations from the arithmetic mean. The
Standard Deviation is denoted by the small Greek letter σ (read as sigma). The Standard Deviation measures absolute dispersion: the greater the standard
deviation, the greater the magnitude of the deviations of the values from their mean. A small Standard Deviation means a high degree of uniformity of
the observations as well as homogeneity of the series; a large Standard Deviation means the opposite. Hence the Standard Deviation is extremely useful in judging the
representativeness of the mean.

Merits
i. It is the best measure because of its mathematical characteristics.
ii. It is based on each and every item of the distribution.
iii. It is amenable to algebraic treatment and is less affected by fluctuations of the sampling than most other measures of dispersions.
iv. It is possible to calculate the combined Standard deviation of two or more groups. This is not possible with any other measure.
v. For comparing the variability of two or more distributions, the co-efficient of variation is considered the most appropriate measure, and it is based on the mean
and the standard deviation.
vi. Standard deviation is used for advanced statistical work.

Limitations
i. As compared to other measures of dispersion, the Standard Deviation is difficult to compute.
ii. It gives more weight to extreme items and less weight to items near the mean. This is because the squares of large deviations are
proportionately greater than the squares of comparatively small deviations.
Calculation of Standard Deviation – Individual series observation (Actual mean method)
Q. From the given data, calculate the Standard Deviation.

Sl. No.    X            x = (X − x̄), x̄ = 264.1    x²
1          240          -24.1                      580.81
2          260          -4.1                       16.81
3          290          +25.9                      670.81
4          245          -19.1                      364.81
5          255          -9.1                       82.81
6          288          +23.9                      571.21
7          272          +7.9                       62.41
8          263          -1.1                       1.21
9          277          +12.9                      166.41
10         251          -13.1                      171.61
N = 10     ΣX = 2641                               Σx² = 2688.90

i. Calculating Actual mean (x̄)

ΣX = 2641
N = 10
x̄ = ΣX/N = 2641/10 = 264.1

ii. Calculating Standard Deviation:

σ = √(Σx²/N)

σ = √(2688.90/10)

σ = √268.89
σ = 16.398
Calculation of Standard Deviation – Individual series observation (Assumed mean method)
Q. From the given data, calculate the Standard Deviation.

Sl. No.    X            d = (X − A), A = 264    d²
1          240          -24                     576
2          260          -4                      16
3          290          +26                     676
4          245          -19                     361
5          255          -9                      81
6          288          +24                     576
7          272          +8                      64
8          263          -1                      1
9          277          +13                     169
10         251          -13                     169
N = 10     ΣX = 2641    Σd = +1                 Σd² = 2689

Here, A = 264 is the assumed mean.

σ = √( Σd²/N − (Σd/N)² )

σ = √( 2689/10 − (1/10)² )

σ = √(268.9 − 0.01)

σ = √268.89

σ = 16.398
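The actual mean and assumed mean formulas are algebraically equivalent, so they should return the same value for any choice of A. A short sketch that checks this on the ten observations (the function names are illustrative, not from the text):

```python
import math


def sd_actual_mean(xs):
    """sigma = sqrt(sum(x^2) / N), with x = X - mean."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / n)


def sd_assumed_mean(xs, a):
    """sigma = sqrt(sum(d^2)/N - (sum(d)/N)^2), with d = X - A."""
    n = len(xs)
    d = [x - a for x in xs]
    return math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)


data = [240, 260, 290, 245, 255, 288, 272, 263, 277, 251]

sigma_actual = sd_actual_mean(data)
sigma_assumed = sd_assumed_mean(data, a=264)
sigma_other_a = sd_assumed_mean(data, a=250)  # any other assumed mean

print(round(sigma_actual, 3), round(sigma_assumed, 3), round(sigma_other_a, 3))
```

Choosing A close to the actual mean only keeps the hand arithmetic small; it does not change the result.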
Calculation of Standard Deviation – Discrete series observation (Actual mean method)
Q. From the given data, calculate the Standard Deviation.

X    3.5    4.5    5.5    6.5    7.5    8.5    9.5
f    3      7      22     60     85     32     8

Soln. Calculation of the Standard Deviation

X          f      f.X              x = (X − x̄), x̄ = 7.0898    x²         fx²
3.5        3      10.5             -3.5898                     12.8867    38.6601
4.5        7      31.5             -2.5898                     6.7071     46.9497
5.5        22     121              -1.5898                     2.5275     55.605
6.5        60     390              -0.5898                     0.3479     20.874
7.5        85     637.5            +0.4102                     0.1683     14.3055
8.5        32     272              +1.4102                     1.9887     63.6384
9.5        8      76               +2.4102                     5.8091     46.4728
N = 217           Σf.X = 1538.5                                           Σfx² = 286.5055

i. Calculating Actual mean (x̄)

x̄ = ΣfX/N ; where f = frequency, X = the variable in the question, N = total number of observations or Σf.

x̄ = 1538.5/217

x̄ = 7.0898

ii. Calculating Standard Deviation (σ)

σ = √(Σfx²/N)

σ = √(286.5055/217)

σ = √1.3203

σ = 1.1490
Calculation of Standard Deviation – Discrete series observation (Assumed mean method)
Q. From the given data, calculate the Standard Deviation.

X    3.5    4.5    5.5    6.5    7.5    8.5    9.5
f    3      7      22     60     85     32     8

Soln. Calculation of the Standard Deviation

X          f      d = (X − A)    d²    f.d             fd²
3.5        3      -3             9     -9              27
4.5        7      -2             4     -14             28
5.5        22     -1             1     -22             22
6.5        60     0              0     0               0
7.5        85     +1             1     +85             85
8.5        32     +2             4     +64             128
9.5        8      +3             9     +24             72
N = 217                                Σf.d = +128     Σfd² = 362

Here, A = 6.5 (any value of X, or any other value whether or not it occurs in the data, can be taken as the assumed mean and the final
answer will still be the same. However, the nearer the assumed mean is to the actual mean, the lighter the calculations).

σ = √( Σfd²/N − (Σfd/N)² )

σ = √( 362/217 − (128/217)² )

σ = √(1.668 − 0.348) = 1.149


Calculation of Standard Deviation – Discrete series observation (Step-deviation method)
Q. From the given data, calculate the Standard Deviation.

Salaries (in Rs. ‘000)    45    50    55    60    65    70    75    80
Number of persons         3     5     8     7     9     7     4     7

Soln. Calculation of the Standard Deviation

Salaries (X)    No. of Persons (f)    d = (X − 60)/5    d²    f.d            fd²
45              3                     -3                9     -9             27
50              5                     -2                4     -10            20
55              8                     -1                1     -8             8
60              7                     0                 0     0              0
65              9                     +1                1     +9             9
70              7                     +2                4     +14            28
75              4                     +3                9     +12            36
80              7                     +4                16    +28            112
N = 50                                                        Σf.d = +36     Σfd² = 240

Here, X = Salaries in ‘000, Number of persons = Frequencies (f), A = 60 (any value of X, or any other value whether or not it occurs in the
data, can be taken as the assumed mean and the final answer will still be the same. However, the nearer the assumed mean is to the actual
mean, the lighter the calculations), d = (X − A)/i, i = interval between two successive values of X = 5.

σ = √( Σfd²/N − (Σfd/N)² ) × i

σ = √( 240/50 − (36/50)² ) × 5

σ = √(4.8 − 0.5184) × 5 = 10.35
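The step-deviation formula can be sketched as one small function. This is only an illustration under the same assumptions as the worked example (equally spaced values of X); the names `sd_step_deviation`, `a` and `i` are chosen here for readability.

```python
import math


def sd_step_deviation(values, freqs, a, i):
    """sigma = sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2) * i, with d = (X - A) / i."""
    n = sum(freqs)
    d = [(x - a) / i for x in values]
    sum_fd = sum(f * v for f, v in zip(freqs, d))
    sum_fd2 = sum(f * v * v for f, v in zip(freqs, d))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * i


salaries = [45, 50, 55, 60, 65, 70, 75, 80]   # in Rs. '000
persons = [3, 5, 8, 7, 9, 7, 4, 7]

sigma = sd_step_deviation(salaries, persons, a=60, i=5)
sigma_other_a = sd_step_deviation(salaries, persons, a=65, i=5)

print(round(sigma, 2), round(sigma_other_a, 2))
```

Setting `i=1` reduces the function to the plain assumed mean method.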


Calculation of Standard Deviation – Continuous series observation (Step-deviation method)
Q. From the given data, calculate the Standard Deviation.

Wages (in Rs. ‘000)    0-10    10-20    20-30    30-40    40-50    50-60
Number of workers      12      17       23       39       16       3

Soln. Calculation of the Standard Deviation

Wages (in Rs. ‘000)    Mid-points (m) = (UCL + LCL)/2    No. of workers (f)    d = (m − 35)/10    d²    f.d            fd²
0-10                   5                                 12                    -3                 9     -36            108
10-20                  15                                17                    -2                 4     -34            68
20-30                  25                                23                    -1                 1     -23            23
30-40                  35                                39                    0                  0     0              0
40-50                  45                                16                    +1                 1     +16            16
50-60                  55                                3                     +2                 4     +6             12
N = 110                                                                                                 Σf.d = -71     Σfd² = 227

Here, X = Mid-points of the wage classes (in Rs. ‘000), Number of workers = Frequencies (f), A = 35 (any mid-point, or any other value whether
or not it occurs in the data, can be taken as the assumed mean and the final answer will still be the same. However, the nearer the
assumed mean is to the actual mean, the lighter the calculations; a convenient choice is whichever of the two middle-most values of X has the higher
frequency), d = (X − A)/i, i = interval between two mid-points = 10.

σ = √( Σfd²/N − (Σfd/N)² ) × i

σ = √( 227/110 − (−71/110)² ) × 10

σ = √(2.064 − 0.417) × 10 = 12.83

Co-efficient of Variation

x̄ = A + (Σfd/N) × i = 35 + (−71/110) × 10 = 35 + (−0.64545 × 10) = 35 − 6.4545 = 28.5455, σ = 12.83

Co-efficient of Variation = (σ/x̄) × 100

Co-efficient of Variation = (12.83/28.5455) × 100 = 0.4495 × 100 = 44.95
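As a cross-check, the mean, standard deviation and co-efficient of variation of the wage distribution can be computed directly from the mid-points, without the step-deviation shortcut. The sketch below is illustrative; the variable names are assumptions of the example.

```python
import math

lowers = [0, 10, 20, 30, 40, 50]       # lower class limits, Rs. '000
workers = [12, 17, 23, 39, 16, 3]
mids = [lo + 5 for lo in lowers]       # mid-points (LCL + UCL) / 2

n = sum(workers)
mean = sum(f * m for f, m in zip(workers, mids)) / n
variance = sum(f * (m - mean) ** 2 for f, m in zip(workers, mids)) / n
sigma = math.sqrt(variance)
cv = sigma / mean * 100                # co-efficient of variation, in percent

print(round(mean, 4), round(sigma, 2), round(cv, 2))
```

Because σ is not rounded to 12.83 before the division, the co-efficient printed here can differ from the hand calculation in the second decimal place.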
D. Variance
Both the Variance and the Standard Deviation are measures of variability in a population. The two measures are closely related, as is clear from the formula:
Variance = σ². Variance is the average of the squared deviations from the arithmetic mean, and the Standard Deviation is the square root of the
variance.
Calculation of Variance
Q. From the given data, calculate the Variance.

Marks Obtained     10-14    14-18    18-22    22-26    26-30    30-34    34-38    38-42    42-46    46-50    50-54    54-58
No. of Students    2        4        4        8        12       16       10       8        4        6        2        4

Soln. Calculation of the Variance

Marks obtained (X)    Mid-points (m) = (UCL + LCL)/2    No. of Students (f)    d = (m − 32)/4    d²    f.d            fd²
10-14                 12                                2                      -5                25    -10            50
14-18                 16                                4                      -4                16    -16            64
18-22                 20                                4                      -3                9     -12            36
22-26                 24                                8                      -2                4     -16            32
26-30                 28                                12                     -1                1     -12            12
30-34                 32                                16                     0                 0     0              0
34-38                 36                                10                     +1                1     +10            10
38-42                 40                                8                      +2                4     +16            32
42-46                 44                                4                      +3                9     +12            36
46-50                 48                                6                      +4                16    +24            96
50-54                 52                                2                      +5                25    +10            50
54-58                 56                                4                      +6                36    +24            144
N = 80                                                                                                 Σf.d = +30     Σfd² = 562

Here, m = Mid-points of the classes of marks, Number of students = Frequencies (f), Assumed mean (A) = 32 (any mid-point, or any other value
whether or not it occurs in the data, can be taken as the assumed mean and the final answer will still be the same.
However, the nearer the assumed mean is to the actual mean, the lighter the calculations), d = (m − A)/i, i = interval between two mid-points = 4.

Variance = σ² = ( Σfd²/N − (Σfd/N)² ) × i² = ( 562/80 − (30/80)² ) × 4² = (7.025 − 0.141) × 16 = 6.884 × 16 = 110.144
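The variance calculation differs from the standard deviation only in that no square root is taken and the step factor is squared. A minimal sketch using the marks data (variable names are illustrative):

```python
lowers = list(range(10, 58, 4))        # class lower limits 10, 14, ..., 54
students = [2, 4, 4, 8, 12, 16, 10, 8, 4, 6, 2, 4]
mids = [lo + 2 for lo in lowers]       # mid-points of the 4-mark classes

a, i = 32, 4                           # assumed mean and class interval
n = sum(students)
d = [(m - a) / i for m in mids]        # step deviations
sum_fd = sum(f * v for f, v in zip(students, d))
sum_fd2 = sum(f * v * v for f, v in zip(students, d))

variance = (sum_fd2 / n - (sum_fd / n) ** 2) * i ** 2
std_dev = variance ** 0.5              # the standard deviation is its square root

print(n, sum_fd, sum_fd2, round(variance, 2), round(std_dev, 3))
```

The unrounded result is 110.15; the hand calculation gives 110.144 because 0.140625 was rounded to 0.141 before multiplying by 16.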
E. Lorenz Curve
The Lorenz curve was devised by Max O. Lorenz, an economic statistician. It is a graphic method of studying dispersion. The curve was first used
by him to measure the distribution of wealth and income. Its use now extends to the distribution of profits, wages, turnover, etc.
Still, the most common use of this curve is in the study of the degree of inequality in the distribution of income and wealth between different
countries or between different periods of time. It is a cumulative percentage curve in which the percentage of items is plotted against the percentage of
other things such as wealth, profits, turnover, etc.

Steps for constructing the Lorenz curve

i. The sizes of the items (the values of the variable) and their frequencies are both cumulated. Taking the grand total (the last cumulative
value) of each as 100, every cumulative value is expressed as a percentage.
ii. On the X-axis, mark a scale from 0 to 100 for the percentages of the cumulative frequencies.
iii. On the Y-axis, mark a scale from 0 to 100 for the percentages of the cumulated values of the variable.
iv. Draw a diagonal line joining O (0, 0) with point P (100, 100) as shown in the diagram below. The line OP will make an angle of 45° with the Y-axis and is
called the line of equal distribution. Any point on this diagonal shows the same percentage on X as on Y.
v. Plot the percentages of the cumulated values of the variable (Y) against the percentages of the corresponding cumulated frequencies (X)
for the given distribution and join these points with a smooth freehand curve. For any given distribution, this curve will never cross the
line of equal distribution OP. It will always lie below OP unless the distribution is uniform, in which case it will coincide with OP. The greater the
variability, the greater is the distance of the curve from OP.
Calculation of Lorenz Curve
Q. From the given data, construct the Lorenz Curve.

Profits earned (Rs. ‘000)    No. of Companies in area A
6                            6
25                           11
60                           13
84                           14
105                          15
150                          17
170                          10
400                          14

Soln. Area A

Profits (Rs. ‘000)    Cumulative Profit    Cumulative Percentage (profit)    No. of Companies    Cumulative no. of companies    Cumulative Percentage (companies)
6                     6                    0.6                               6                   6                              6
25                    31                   3.1                               11                  17                             17
60                    91                   9.1                               13                  30                             30
84                    175                  17.5                              14                  44                             44
105                   280                  28.0                              15                  59                             59
150                   430                  43.0                              17                  76                             76
170                   600                  60.0                              10                  86                             86
400                   1000                 100.0                             14                  100                            100
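The cumulative percentages in the table are the points of the Lorenz curve. A small sketch that derives them and checks that the curve does not rise above the line of equal distribution (plotting is left out; `cumulative_percent` is a name made up for this example):

```python
from itertools import accumulate


def cumulative_percent(values):
    """Cumulate the values and express each running total as a percentage
    of the grand total."""
    total = sum(values)
    return [100 * c / total for c in accumulate(values)]


profits = [6, 25, 60, 84, 105, 150, 170, 400]    # Rs. '000
companies = [6, 11, 13, 14, 15, 17, 10, 14]

x = [0.0] + cumulative_percent(companies)   # X-axis: % of companies
y = [0.0] + cumulative_percent(profits)     # Y-axis: % of profits
lorenz_points = list(zip(x, y))

# On or below the diagonal OP means y <= x at every plotted point.
on_or_below_diagonal = all(py <= px for px, py in lorenz_points)

for px, py in lorenz_points:
    print(f"{px:6.1f} {py:6.1f}")
```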
Page for Graphical Construction of Lorenz Curve
3. Measures of Correlation
The measures of central tendency and of dispersion discussed so far relate to only one variable. In practice, however, we
come across a large number of problems that involve two or more variables. If the quantities vary in such a way that movements in
one are accompanied by movements in the other, they are said to be correlated. The degree of relationship between the variables under consideration is measured
through correlation analysis. The measure of correlation, called the correlation coefficient or the correlation index, summarizes in one figure
the direction and the degree of correlation. Correlation analysis refers to the techniques used in measuring the closeness of the relationship
between the variables. Thus correlation is the statistical device which helps in analysing the co-variation of two or more variables.
The problem of analysing the relation between different series should be broken down in three steps:
i. Determining whether the relation exists and, if it does, measuring it.
ii. Testing whether it is significant.
iii. Establishing the cause and effect relation, if any.

Need to study correlation analysis


The study of correlation is of immense use in practical life for the following reasons:

i. Most variables show some kind of relationship. With the help of correlation analysis, we can measure in one figure the degree
of the relationship existing between the variables.
ii. Once we know that two variables are closely related, we can estimate the value of one variable given the value of the other. This is done
with the help of regression analysis.
iii. Correlation analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others
depend, may reveal to the economist the connections by which disturbances spread, and may suggest the paths through which stabilising
forces may become effective.

Types of correlation
These are the important ways by which correlation can be classified:

i. Positive and Negative.


Whether the correlation is positive or negative depends upon the direction of change of the variables. If both the variables vary in the same
direction, i.e. if one variable increases and the other also increases (on an average), the correlation is said to be positive. Negative correlation
occurs when one variable increases and the other variable decreases. This is also known as an inverse relationship.
ii. Simple, partial and multiple.
In simple correlation, only two variables are studied, whereas in partial and multiple correlation, three or more variables are involved. In
partial correlation, more than two variables are recognised but the relationship between only two of them is studied, the others being held constant; in multiple correlation,
three or more variables are studied simultaneously.

iii. Linear and non-linear.


If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, the correlation is said
to be linear. Example:
X: 10 20 30 40 50
Y: 70 140 210 280 350
If the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, the correlation is said
to be non-linear. Example:
If we double the amount of rainfall, the production of grains will not necessarily double.

Methods of studying correlation


The various methods of studying the correlation are given below:

i. Scatter Diagram Method


ii. Graphic method
iii. Karl Pearson’s Co-efficient of correlation
iv. Spearman’s Rank Correlation coefficient method
v. Concurrent Deviation method
vi. Methods of least squares.

Here we will discuss only Spearman’s rank correlation.


A. Spearman’s Rank correlation Coefficient
The Karl Pearson’s method is based on the assumption that the population being studied is normally distributed. When it is known that the population
is not normal, or when the shape of the distribution is not known, there is
need for a measure of correlation that involves no assumption about the parameters of the population.
It is possible to avoid making any assumptions about the population being studied by ranking the observations according to size and basing the
calculation on the ranks rather than upon the original observations. It does not matter which way the items are ranked; item number one may be
the largest or the smallest. Using the ranks rather than the actual observations gives the coefficient of rank correlation.
This method of finding out the covariability, or the lack of it, between two variables was developed by the British psychologist Charles Edward Spearman in
1904. This measure is especially useful when quantitative measures for certain factors cannot be fixed but the individuals in the group can be arranged
in order, thereby obtaining for each individual a number indicating their rank in the group. Spearman’s rank correlation coefficient is defined as:

R = 1 − (6 × ΣD²) / (N × (N² − 1))   Or   R = 1 − (6 × ΣD²) / (N³ − N)

Where R = rank co-efficient of correlation, D = difference of rank between paired items in the two series, and N = number of paired items.

Features of Spearman’s Correlation coefficient


i. The sum of the differences of the ranks between two variables shall be zero. Symbolically, ΣD = 0.
ii. Spearman’s Correlation coefficient is distribution-free or non-parametric because no strict assumptions are made about the form of the
population from which the sample observations are drawn.
iii. Spearman’s correlation coefficient is nothing but Karl Pearson’s coefficient between the ranks. Hence it can be interpreted in the same
manner as the Pearsonian correlation coefficient.

Merits
i. This method is simpler to understand and easier to apply compared to the Karl Pearson’s method. The answers obtained by this method and
by the Karl Pearson’s method applied to the ranks will be the same provided no value is repeated, i.e. all items are different.

ii. Where the data are of a qualitative nature like honesty, efficiency, intelligence, etc., this method can be used with great advantage. For
example, the workers of two factories can be ranked in order of their efficiency and the degree of correlation can be established by applying
this method.

iii. This is the only method that can be used where we are given the ranks and not the actual data.

iv. Even where the actual data are given, rank method can be applied for ascertaining correlation.

v. Rank correlation is very useful when the data are non-normally distributed.
Limitations
This method is however associated with few limitations like:

i. This method cannot be used for finding out the correlation in a grouped distribution.
ii. Where the number of items exceeds 30 the calculations become quite tedious and require a lot of time. Therefore, this method should not be
applied where N exceeds 30 unless we are given the ranks and not the actual values of the variables.

In correlation we have two types of problems:

i. When ranks are given.


ii. When ranks are not given.

i. When Ranks are given


Where actual ranks are given to us the steps required for computing rank correlation are:
a) Take the differences of the two ranks, i.e. (R₁ − R₂), and denote these differences by D.
b) Square these differences and obtain the total ΣD².
c) Apply the formula R = 1 − (6 × ΣD²) / (N³ − N)

ii. When Ranks are not given


When we are given the actual data and not the ranks, it will be necessary to assign ranks. The Ranks can be assigned by taking either the highest value
as 1 or the lowest value as 1. But whether we start with the lowest value or the highest value, we must follow the same method in case of both the
variables.
Calculation of Spearman’s Rank correlation coefficient (when rank is given)
Q. Two ladies were asked to rank 7 different types of lipsticks. The ranks given by them have been given as follows.

Lipstick A B C D E F G
Neelu 2 1 4 3 5 7 6
Neena 1 3 2 4 5 6 7
Calculate the Spearman’s Rank correlation coefficient.

Soln: Calculation of Spearman’s Rank correlation coefficient

Lipstick    R₁ (X, Neelu)    R₂ (Y, Neena)    D = (R₁ − R₂)    D²
A           2                1                +1               1
B           1                3                -2               4
C           4                2                +2               4
D           3                4                -1               1
E           5                5                0                0
F           7                6                +1               1
G           6                7                -1               1
                                                               ΣD² = 12

R = 1 − (6 × ΣD²) / (N³ − N) = 1 − (6 × 12) / (7³ − 7) = 1 − 72/336 = 1 − 0.214 = 0.786
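When the ranks are already given, the three steps above reduce to a few lines. A minimal sketch (the function name is illustrative):

```python
def spearman_from_ranks(r1, r2):
    """R = 1 - 6 * sum(D^2) / (N^3 - N), with D = R1 - R2 for each pair."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


neelu = [2, 1, 4, 3, 5, 7, 6]   # ranks for lipsticks A to G
neena = [1, 3, 2, 4, 5, 6, 7]

r = spearman_from_ranks(neelu, neena)
print(round(r, 3))
```

Identical rankings give R = +1 and exactly reversed rankings give R = −1, which is a quick way to sanity-check the function.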
Calculation of Spearman’s Rank correlation coefficient (when rank is not given)
Q. Calculate the spearman’s coefficient of correlation between the marks assigned to 10 students by judges X and Y in a certain competitive test as shown
below

Sl. No.             1     2     3     4     5     6     7     8     9     10
Marks by judge X    52    53    42    60    45    41    37    38    25    27
Marks by judge Y    65    68    43    38    77    48    35    30    25    50

Soln: Calculation of Spearman’s Rank correlation coefficient

The pairs are arranged in ascending order of the marks given by Judge X, and in each series the lowest mark is ranked 1.

Sl. No.    x (Judge X)    Rx    y (Judge Y)    Ry    D = (Rx − Ry)    D²
9          25             1     25             1     0                0
10         27             2     50             7     -5               25
7          37             3     35             3     0                0
8          38             4     30             2     +2               4
6          41             5     48             6     -1               1
3          42             6     43             5     +1               1
5          45             7     77             10    -3               9
1          52             8     65             8     0                0
2          53             9     68             9     0                0
4          60             10    38             4     +6               36
                                                                      ΣD² = 76

R = 1 − (6 × ΣD²) / (N³ − N) = 1 − (6 × 76) / (10³ − 10) = 1 − 456/990 = 1 − 0.4606 = 0.5394
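When only the marks are given, the ranks have to be assigned first. The sketch below ranks both series from the lowest value upward and then applies the formula; it assumes no value is repeated within a series, and it uses the marks as they appear in the solution table. The function names are illustrative.

```python
def rank_ascending(values):
    """Rank 1 for the smallest value. Assumes all values are distinct."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(x, y):
    """Rank both series the same way, then R = 1 - 6*sum(D^2)/(N^3 - N)."""
    n = len(x)
    rx, ry = rank_ascending(x), rank_ascending(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


judge_x = [52, 53, 42, 60, 45, 41, 37, 38, 25, 27]
judge_y = [65, 68, 43, 38, 77, 48, 35, 30, 25, 50]

r = spearman(judge_x, judge_y)
print(rank_ascending(judge_x), rank_ascending(judge_y), round(r, 4))
```

Ranking both series from the highest value downward instead would give the same coefficient, as long as the same direction is used for both.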
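Calculation of Spearman’s Rank correlation coefficient (when ranks are equal)
Q. Calculate the Spearman’s coefficient from the given data

Candidate    1     2     3     4     5     6     7     8
judge X      20    22    28    23    30    30    23    24
judge Y      28    24    24    25    26    27    32    30

Soln: Calculation of Spearman’s Rank correlation coefficient

Items with equal values are each given the average of the ranks they would otherwise occupy.

Candidate    Judge X    Rx     Judge Y    Ry     D = (Rx − Ry)    D²
1            20         1      28         6      -5               25
2            22         2      24         1.5    +0.5             0.25
3            28         6      24         1.5    +4.5             20.25
4            23         3.5    25         3      +0.5             0.25
5            30         7.5    26         4      +3.5             12.25
6            30         7.5    27         5      +2.5             6.25
7            23         3.5    32         8      -4.5             20.25
8            24         5      30         7      -2               4
                                                                  ΣD² = 88.50

R = 1 − 6 × [ ΣD² + (1/12)(m₁³ − m₁) + (1/12)(m₂³ − m₂) + ⋯ ] / (N³ − N)

Where m₁, m₂, … are the numbers of items sharing a common rank. Here 23 and 30 are each repeated twice in X and 24 is repeated twice in Y, so m = 2
for each of the three groups and each correction is (1/12)(2³ − 2) = 0.5.

R = 1 − 6 × [ 88.50 + (1/12)(2³ − 2) + (1/12)(2³ − 2) + (1/12)(2³ − 2) ] / (8³ − 8)

R = 1 − 6 × (88.50 + 0.5 + 0.5 + 0.5) / 504

R = 1 − 540/504 = 1 – 1.071 = -0.071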
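The tie correction can be sketched as follows. This is an illustrative implementation of the formula used in the solution, not a library routine: tied values receive the average of the ranks they span, and (m³ − m)/12 is added to ΣD² for every group of m tied values in either series.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 for the smallest value; tied values share the average rank."""
    first_position = {}
    for position, v in enumerate(sorted(values), start=1):
        first_position.setdefault(v, position)
    counts = Counter(values)
    # A group of m equal values occupies m consecutive positions.
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def spearman_tied(x, y):
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum((m ** 3 - m) / 12
                     for series in (x, y)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


judge_x = [20, 22, 28, 23, 30, 30, 23, 24]
judge_y = [28, 24, 24, 25, 26, 27, 32, 30]

r = spearman_tied(judge_x, judge_y)
print(average_ranks(judge_x), average_ranks(judge_y), round(r, 3))
```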
