Chapter 2 Data Processing
INTRODUCTION
In lay terms, data is a collection of facts, such as measurements, observations or simply descriptions of things. Data are also the characteristics or
information, usually numerical, collected through observation. In a more technical sense, data are the set of values of qualitative or quantitative variables
about one or more persons or objects.
The numerical values collected through measurement, observation or description need to be processed to become useful and valuable
information. In other words, the collected data need to be compiled, analysed and presented through graphical methods so that they are understandable
to all. Three kinds of measures are used for this: measures of central tendency, measures of dispersion and measures of relationship.
The measures of central tendency provide a value that is an ideal representative of a set of observations. The measures of dispersion take into account the
internal variations of the data, often around a measure of central tendency. The measures of relationship, on the other hand, provide the degree of association
between two or more related phenomena, such as rainfall and the incidence of floods, or fertilizer consumption and the yield of crops.
1. Measures of Central tendency
I. Introduction
One of the most important objectives of statistical analysis is to get one single value that describes the characteristics of the entire mass of unwieldy
data. Such a value is called the central value, an “average”, or the expected value of the variable. The word average is very commonly used in day-to-day
conversation. For example, we often talk about the average boy in a class, the average height of an Indian, average income, the average marks of a class, etc. When
we say that ‘he is an average student’, we mean that he is neither a very good nor a very bad student, just a mediocre one. In statistics, however, the term average
has a different meaning.
An average in statistics is not merely mediocre; it is the single value that represents a group of values. Such a value is significant because it depicts the
characteristics of the whole group. Since an average represents the entire data, it lies somewhere between the two extremes, i.e. the largest item and the
smallest item. For this reason, an average is frequently referred to as a “measure of central tendency”.
II. Objectives of Averaging
The main objectives of the studies of averages:
A. To get the single value that describes the characteristics of the entire group.
B. To facilitate comparisons.
Both the simple arithmetic mean and the weighted arithmetic mean are calculated for three types of observations: individual series, discrete series and continuous series.
Income (Rs.) 14780, 15760, 26690, 27750, 24840, 24920, 16100, 17810, 27050, 26950
Calculate the Arithmetic mean of the income by Direct Method and Short-cut method.
Soln.
Direct Method
Calculation of Arithmetic mean
Employees    Monthly Income (Rs.) (X)
1            14780
2            15760
3            26690
4            27750
5            24840
6            24920
7            16100
8            17810
9            27050
10           26950
N = 10       ΣX = 222650
x̄ = ΣX / N
x̄ = 222650 / 10
x̄ = 22265
The average income of the employees is Rs. 22265.
Short-cut Method
x̄ = A + Σd / N
Employees    Monthly Income (Rs.) (X)    d = (X − A) = (X − 22000)
1            14780                       -7220
2            15760                       -6240
3            26690                       +4690
4            27750                       +5750
5            24840                       +2840
6            24920                       +2920
7            16100                       -5900
8            17810                       -4190
9            27050                       +5050
10           26950                       +4950
N = 10       ΣX = 222650                 Σd = 2650
A = 22000 (22000 is taken here because ΣX/N = 222650/10 = 22265, which rounds to 22000 at the nearest thousand), Σd = 2650, N = 10
x̄ = 22000 + 2650 / 10
x̄ = 22000 + 265
x̄ = 22265
The average income of the employees is Rs. 22265.
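The two calculations above can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative and are not part of the original notes.

```python
# Monthly incomes (Rs.) of the 10 employees from the worked example.
incomes = [14780, 15760, 26690, 27750, 24840, 24920, 16100, 17810, 27050, 26950]


def mean_direct(values):
    """Direct method: mean = sum of X divided by N."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed_mean):
    """Short-cut method: mean = A + (sum of d) / N, where d = X - A."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


direct = mean_direct(incomes)
shortcut = mean_shortcut(incomes, assumed_mean=22000)
print(direct, shortcut)  # both print 22265.0
```

Changing `assumed_mean` to any other value leaves the result unchanged, which is the point made in the note on choosing A.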
Calculation of Arithmetic Mean- Discrete series observations
In the discrete series, Arithmetic mean may be computed by applying
i. Direct Method
ii. Indirect Method
Calculation
Q. From the following data of marks obtained by 60 students of a class, calculate the arithmetic mean using the Direct Method as well as the Shortcut
method.
Marks 20 30 40 50 60 70
No. Of Students 8 12 20 10 6 4
Soln.
Direct Method
Let the marks be X and the number of students be f.
x̄ = ΣfX / N; where f = frequency, X = the variable in question, N = total number of observations, i.e. Σf.
ΣfX = (20 × 8) + (30 × 12) + (40 × 20) + (50 × 10) + (60 × 6) + (70 × 4) = 2460
x̄ = 2460 / 60
x̄ = 41
Short-cut Method
x̄ = A + Σfd / N; where A = assumed mean (any value, whether or not it occurs among the values of X, may be taken as the assumed mean and the final
answer will still be the same; however, the nearer the assumed mean is to the actual mean, the lighter the calculations), d = deviation, i.e. (X − A),
N = total number of observations, i.e. Σf.
Taking A = 40, Σfd = 60.
x̄ = 40 + 60 / 60
x̄ = 40 + 1
x̄ = 41
Calculation of Arithmetic Mean- Continuous series observations
In the Continuous series, Arithmetic mean may be computed by applying
i. Direct Method
ii. Indirect Method
Calculation
Q. From the following data compute arithmetic mean by direct and Short-cut Method
Marks 0 -10 10 – 20 20 - 30 30 – 40 40 – 50 50 - 60
No. Of Students 5 10 25 30 20 10
Soln.
Direct Method
x̄ = Σfm / N; where f = frequency, m = mid-point of the various classes, N = total number of observations, i.e. Σf.
The mid-points of the classes are 5, 15, 25, 35, 45 and 55, so Σfm = 3300 and N = 100.
x̄ = 3300 / 100
x̄ = 33
Short-cut Method
x̄ = A + Σfd / N; where A = assumed mid-point value (any value, whether or not it occurs among the mid-points, may be taken as the assumed mean and the
final answer will still be the same; however, the nearer the assumed mean is to the actual mean, the lighter the calculations), d = deviation,
i.e. (m − A), N = total number of observations, i.e. Σf.
Taking A = 35, Σfd = −200.
x̄ = 35 + (−200 / 100)
x̄ = 35 + (−2)
x̄ = 33
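For a continuous series the only extra step is replacing each class by its mid-point. The sketch below, with illustrative function names that are not from the notes, reproduces both methods for the marks data of this question.

```python
# Class limits and frequencies from the marks question (0-10 ... 50-60).
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
students = [5, 10, 25, 30, 20, 10]


def grouped_mean(classes, freqs):
    """Direct method: mean = sum of f*m / N, with m the class mid-point."""
    midpoints = [(low + high) / 2 for low, high in classes]
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)


def grouped_mean_shortcut(classes, freqs, assumed_mid):
    """Short-cut method: mean = A + sum of f*d / N, with d = m - A."""
    midpoints = [(low + high) / 2 for low, high in classes]
    sum_fd = sum(f * (m - assumed_mid) for f, m in zip(freqs, midpoints))
    return assumed_mid + sum_fd / sum(freqs)


direct = grouped_mean(classes, students)
shortcut = grouped_mean_shortcut(classes, students, assumed_mid=35)
print(direct, shortcut)  # both print 33.0
```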
B. Median
The median by definition refers to the middle value in a distribution. One-half of the items in the distribution have a value equal to or smaller
than the median, and the other half have a value equal to or larger than it. The median is thus the 50th percentile: the value below which 50% of
the values in the sample fall. It splits the observations into two halves.
As distinct from the arithmetic mean, which is calculated from the value of every item in the series, the median is known as a positional average.
The term “position” refers to the place of a value in the series. The place of the median in a series is such that an equal number of items lie on
either side of it.
Since its location is based on position, with an odd number of observations the median is the value at the middle of the series. With an even
number of observations there is no single middle position, so the median is taken to be the arithmetic mean of the two middlemost items. Thus,
when N is odd, the median is an actual value, with the remainder of the series in two equal parts on either side of it; when N is even, the median
is a derived figure, i.e. half the sum of the two middle values.
i. Merits
a. It is especially useful in the case of open-end classes, since only the position and not the values of the items must be known. The median is also
recommended if the distribution has unequal classes, since it is easier to compute than the mean.
b. Extreme values do not affect the median as strongly as they do the mean.
c. In markedly skewed distributions, such as income distributions or price distributions, where the arithmetic mean would be distorted by
extreme values, the median is especially useful. Consequently, the median income may for some purposes be regarded as a more representative
figure, for half the income earners must be receiving at least the median income and half do not.
d. It is the most appropriate average for dealing with qualitative data, i.e. where ranks are given or where the items are not counted or
measured but are scored.
e. The value of the median can be determined graphically, whereas the value of the mean cannot be ascertained graphically.
f. Perhaps the greatest advantage of the median, however, is that it actually does indicate what many people incorrectly believe
the arithmetic mean indicates. The median indicates the value of the middle item in the distribution. This is a clear-cut meaning and makes the
median a measure that can be easily explained.
ii. Limitations
a. For calculating median, it is necessary to arrange the data; other averages do not need any arrangement.
b. Since it is a positional average, its value is not determined by each and every observation.
c. It is not capable of algebraic treatment.
d. The value of median is affected more by sampling fluctuations than the value of the arithmetic mean.
e. The median, in some cases, cannot be computed exactly as the mean. When the number of items included in a series of data is even, the
median is determined approximately as the mid-point of the two middle items.
f. It is erratic if the number of items is small.
Computing Median
Calculation of Median- Individual series observations
Q. from the following data of the wages of 7 workers, compute the median wage.
Wages (in Rs.) 14100 14150 16080 17120 15200 16160 17400
Soln.
A. Arrange the data set in ascending or descending order; here we arrange the data in ascending order.
Sl. No.    Wages arranged in ascending order (Rs.)
1          14100
2          14150
3          15200
4          16080
5          16160
6          17120
7          17400
Median = size of the ((N + 1) / 2)th item.
Median = (7 + 1) / 2 = 8 / 2 = 4th item.
= Rs. 16080.00
Interpretation
We thus find that the median is the middlemost item: 3 persons get a wage less than Rs. 16080 and equal number, i.e. 3 persons, get more than Rs. 16080.
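The odd/even rule for an individual series can be sketched in Python as follows. The function name is illustrative; Python's standard `statistics.median` applies the same rule.

```python
def median_individual(values):
    """Median of an individual series: middle item, or mean of the two middle items."""
    ordered = sorted(values)          # step A: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd N: the ((N + 1) / 2)th item
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: half the sum of the middle two


wages = [14100, 14150, 16080, 17120, 15200, 16160, 17400]
print(median_individual(wages))  # 16080
```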
Calculation of Median- Discrete series observations
Q. from the following data of the income of some individuals, find the median of the income group.
Income (in Rs.) 15000 15500 16800 18000 18500 17800
No. of Persons 24 26 20 16 6 30
Soln.
A. Arrangement of the data set into Ascending or descending order, here we will arrange the data in ascending order.
Income (in Rs.)    15000    15500    16800    17800    18000    18500
No. of Persons     24       26       20       30       16       6
C.f.               24       50       70       100      116      122
Median = size of the ((N + 1) / 2)th item = (122 + 1) / 2 = 61.5th item.
Size of the 61.5th item = 16800 (the cumulative frequency first reaches or exceeds 61.5 at 70, so the value belonging to that cumulative frequency,
i.e. 16800, is taken).
Interpretation
The total number of items here is even (N = 122), so the median lies between the 61st and the 62nd items, and both carry the value 16800. The median income is the 3rd from
the top in the table: there are 2 income groups earning less and 3 income groups earning more than the median income group, i.e. 16800.
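The cumulative-frequency lookup for a discrete series can be sketched as below. The function name is illustrative, and averaging the two middle items when N is even is an assumption carried over from the rule for individual series.

```python
import math
from itertools import accumulate


def median_discrete(values, freqs):
    """Median of a discrete series located through cumulative frequencies."""
    pairs = sorted(zip(values, freqs))                 # arrange values in ascending order
    cum = list(accumulate(f for _, f in pairs))        # cumulative frequencies (C.f.)
    n = cum[-1]

    def item(k):
        # Value of the k-th item: first value whose C.f. reaches k.
        for (value, _), cf in zip(pairs, cum):
            if cf >= k:
                return value

    position = (n + 1) / 2
    # For even N the position is fractional, so average the two neighbouring items.
    return (item(math.floor(position)) + item(math.ceil(position))) / 2


income = [15000, 15500, 16800, 18000, 18500, 17800]
persons = [24, 26, 20, 16, 6, 30]
print(median_discrete(income, persons))  # 16800.0
```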
Calculation of Median- Continuous series observations
Q. from the following data of marks of some students, find the median marks of the students.
Marks 45-50 40-45 35-40 30-35 25-30 20-25 15-20 10-15 5-10
No. of students 10 15 26 30 42 31 24 15 7
Soln.
A. Arrange the data set in ascending or descending order; here we arrange the data in ascending order.
Marks    No. of students (f)    C.f.
5-10     7                      7
10-15    15                     22
15-20    24                     46
20-25    31                     77
25-30    42                     119
30-35    30                     149
35-40    26                     175
40-45    15                     190
45-50    10                     200
Median = size of the (N / 2)th item.
Median = 200 / 2 = 100th item.
The median lies in the class 25-30. (The cumulative frequency first reaches or exceeds 100 at 119, so the class belonging to that cumulative
frequency is taken as the median class.)
Median = L + ((N/2 − C.f.) / f) × i; where L = lower limit of the median class, i.e. the class in which the middle item of the distribution lies,
C.f. = cumulative frequency of the class preceding the median class, i.e. the sum of the frequencies of all the classes lower than the median
class, f = frequency of the median class, i = class interval of the median class.
Median = 25 + ((200/2 − 77) / 42) × 5
Median = 25 + ((100 − 77) / 42) × 5
Median = 25 + (23 / 42) × 5
Median = 25 + 2.74
Median = 27.74
The median mark of the students is 27.74.
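The interpolation within the median class can be sketched as a short Python function. The name `grouped_median` is illustrative; the classes are assumed to be given in ascending order, as in step A.

```python
def grouped_median(classes, freqs):
    """Median = L + ((N/2 - C.f.) / f) * i for a continuous series."""
    target = sum(freqs) / 2
    cf_before = 0                      # C.f. of the classes below the current one
    for (lower, upper), f in zip(classes, freqs):
        if cf_before + f >= target:    # first class whose C.f. reaches N/2
            return lower + (target - cf_before) / f * (upper - lower)
        cf_before += f


classes = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30),
           (30, 35), (35, 40), (40, 45), (45, 50)]
students = [7, 15, 24, 31, 42, 30, 26, 15, 10]
print(round(grouped_median(classes, students), 2))  # 27.74
```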
C. Mode
The mode or the modal value is that value in a series of observations which occurs with the greatest frequency. For example, the mode of the
series 3, 5, 8, 5, 4, 5, 9, 3 would be 5, since this value occurs more often than any of the others. The mode is often said to be the value which occurs most often in
the data, that is, with the highest frequency. While this statement is quite helpful in interpreting the mode, it cannot safely be applied to every distribution,
because of the vagaries of sampling. Even fairly large samples drawn from a statistical population with a single well-defined mode may exhibit very erratic
fluctuations in this average if the mode is defined as the exact value in the ungrouped data of each sample which occurs most frequently. Rather, it should
be thought of as the value about which the items are most closely concentrated. It is the value which has the greatest frequency density in its immediate
neighbourhood. For this reason it is also called the most typical or fashionable value of a distribution.
Merits (look at the pictures sent in the group and write down the merits and limitations)
i. ....
ii. ....
iii. ....
iv. ....
v. ....
Limitations
i. ....
ii. ....
iii. ....
iv. ....
v. ....
Computing Mode
Calculation of Mode- Individual series observations
Q. Calculate the mode from the following data of the marks obtained by 10 students.
10, 27, 24, 12, 27, 27, 20, 18, 15, 30
Soln.
Calculation of mode
Since the number 27 occurs the maximum number of times, i.e. 3 times, the modal mark is 27.
Calculation of Mode- Continuous series observations
Soln.
Marks    No. of Students (f)    Cumulative Frequency (C.f.)
0-10 3 3
10-20 5 8
20-30 7 15
30-40 10 25
40-50 12 37
50-60 15 52
60-70 12 64
70-80 6 70
80-90 2 72
90-100 8 80
ii. Since this is a continuous series, we have to see which class has the highest frequency among all the class groups. The class with the highest
frequency is the modal class.
By inspection, the class 50 – 60 has the highest frequency and hence is the modal class for the entire data set.
Mₒ = L + ((f1 − fₒ) / (2f1 − fₒ − f2)) × i; where L = lower limit of the modal class, f1 = frequency of the modal class, fₒ = frequency of the class
preceding the modal class, f2 = frequency of the class succeeding the modal class, i = class interval of the modal class.
Mₒ = 50 + ((15 − 12) / (2 × 15 − 12 − 12)) × 10
Mₒ = 50 + (3 / 6) × 10
Mₒ = 50 + 5
Mₒ = 55
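The same formula can be sketched in Python. The function name is illustrative; as an assumption for this sketch, a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0, and ties for the highest frequency resolve to the first such class.

```python
def grouped_mode(classes, freqs):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * i for a continuous series."""
    k = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    lower, upper = classes[k]
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                   # class preceding the modal class
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0      # class succeeding the modal class
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)


classes = [(low, low + 10) for low in range(0, 100, 10)]   # 0-10, 10-20, ..., 90-100
students = [3, 5, 7, 10, 12, 15, 12, 6, 2, 8]
print(grouped_mode(classes, students))  # 55.0
```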
2. Measures of Dispersions
I. Introduction
The various measures of central value discussed in the previous section give us one single figure that represents the entire data. But the average alone
cannot adequately describe a set of observations unless all the observations are the same. It is necessary to describe the variability, or dispersion, of
the observations. In two or more distributions the central value may be the same, yet there can be wide disparities in the formation of the distributions.
Measures of dispersion help us in studying this important characteristic of a distribution.
Dispersion measures the extent to which the items vary from some central value. Since measures of dispersion give an average of the differences of the
various items from an average, they are also known as averages of the 2nd order.
The relative measure corresponding to the range is called the co-efficient of range. It is calculated as follows:
Co-efficient of range = (L − S) / (L + S); where L = largest item, S = smallest item.
Merits
Limitations
Uses
Despite serious limitations, range is useful in the following cases:
i. Quality control.
ii. Fluctuations in the share prices.
iii. Weather forecast
iv. Everyday life.
Calculation of Range – Individual
Q. the following are the prices of the shares of XYZ Co. Ltd. from Monday to Saturday:
Day          Price (Rs.)
Monday       200
Tuesday      210
Wednesday    208
Thursday     160
Friday       220
Saturday     250
I. Range = L − S
Range = 250 − 160
Range = 90
II. Co-efficient of range = (L − S) / (L + S)
Co-efficient of range = (250 − 160) / (250 + 160)
Co-efficient of range = 90 / 410
Co-efficient of range = 0.22
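A short Python check of the share-price example follows; the variable names are illustrative only.

```python
# Share prices (Rs.) of XYZ Co. Ltd. from Monday to Saturday.
prices = {"Monday": 200, "Tuesday": 210, "Wednesday": 208,
          "Thursday": 160, "Friday": 220, "Saturday": 250}

largest = max(prices.values())    # L
smallest = min(prices.values())   # S

price_range = largest - smallest                                 # Range = L - S
coefficient_of_range = (largest - smallest) / (largest + smallest)

print(price_range)                       # 90
print(round(coefficient_of_range, 4))    # 0.2195
```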
The range discussed above has certain limitations. It is based on the two extreme items and fails to take into account the scatter within the range. There
is therefore reason to believe that if the extreme items are discarded, the resulting measure will be more instructive. For this purpose another
measure, called the interquartile range, has been developed. It covers the middle 50% of the distribution, i.e. one quarter of the observations at the
lower end and one quarter at the upper end are excluded from its computation. In other words, the interquartile
range is the difference between the 3rd quartile and the 1st quartile.
The interquartile range is very often reduced to the semi-interquartile range, or Quartile Deviation (Q.D.), by dividing it by 2. It gives the average
amount by which the two quartiles differ from the median. In a symmetrical distribution, the two quartiles are equidistant from the median. The median divides
the observations into two halves of 50% each, which is why it is also known as Q2. A small Q.D. indicates high uniformity, or small variation, among the central 50%
of the items; a high Q.D. means that the variation among the central items is large.
Q.D. = (Q3 − Q1) / 2
The Q.D. is an absolute measure of dispersion. The corresponding relative measure is the co-efficient of Q.D., which is calculated as:
Co-efficient of Q.D. = ((Q3 − Q1) / 2) / ((Q3 + Q1) / 2) = (Q3 − Q1) / (Q3 + Q1)
Merits
i. It has a special utility in measuring variation in the case of open-end distributions, or distributions in which the data may be ranked but not measured
quantitatively.
ii. It is also useful in measuring the erratic or badly skewed distributions, where the other measures of dispersions would be warped by extreme
values. It is not affected by extreme values.
Limitations
i. Q.D. ignores the 50 % of the observations, i.e. 1st 25 % and last 25%.
ii. It is not capable of mathematical manipulation.
iii. Its value is very much affected by the sampling fluctuations.
iv. It is not a true measure of dispersion as it does not show the scatter around an average but rather a distance on a scale.
Calculation of Quartile Deviation – Individual series observation
Q. Find out the Quartile Deviation and its co-efficient from the following data.
Roll no. 1 2 3 4 5 6 7
Marks 20 28 40 12 30 15 50
Soln.
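I. Arrangement of marks in ascending order.
Marks 12 15 20 28 30 40 50
II. Calculating Q1
Q1 = size of the ((N + 1) / 4)th item
Q1 = size of the ((7 + 1) / 4)th item = 2nd item
So the size of the 2nd item is 15. Thus Q1 = 15.
III. Calculating Q3
Q3 = size of the (3(N + 1) / 4)th item
Q3 = size of the (3(7 + 1) / 4)th item = 6th item
So the size of the 6th item is 40. Thus Q3 = 40.
IV. Calculating the Quartile Deviation
Q.D. = (Q3 − Q1) / 2
Q.D. = (40 − 15) / 2 = 12.5
V. Calculating the co-efficient of Q.D.
Co-efficient of Q.D. = (Q3 − Q1) / (Q3 + Q1) = (40 − 15) / (40 + 15) = 25 / 55 = 0.45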
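The quartile positions for an individual series can be sketched in Python as below. The helper names are illustrative, and interpolating between neighbouring items when a position is fractional is an assumption that the seven-item example does not exercise.

```python
def item_at(ordered, position):
    """Size of the item at a 1-based position, interpolating if it is fractional."""
    whole = int(position)
    fraction = position - whole
    value = ordered[whole - 1]
    if fraction:
        value += fraction * (ordered[whole] - ordered[whole - 1])
    return value


def quartile_deviation(values):
    """Return Q1, Q3, Q.D. and the co-efficient of Q.D. for an individual series."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = item_at(ordered, (n + 1) / 4)
    q3 = item_at(ordered, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


marks = [20, 28, 40, 12, 30, 15, 50]
q1, q3, qd, coefficient = quartile_deviation(marks)
print(q1, q3, qd, round(coefficient, 4))  # 15 40 12.5 0.4545
```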
Q. Calculate the Q.D. and the co-efficient of Q.D. from the following data.
Soln.
200
Size of the Quartile 1 class 4
− 14 Size of the Quartile 3 class
𝑄. 1 = 35 + ∗2
62
𝑁 200 th 3𝑁
Q.1= size of the 4 th item = =50 item Q.3 = size of the th item
4 4
50− 14
𝑄. 1 = 35 + ∗2
62 3∗200
The Q.1 lies in the class 35-37 (since the Q.3 = size of the th item
4
location of Q.1 class is 50th item and there is no Q.1. = 35 +
64
∗2
62 600
50 in the observations so the nearest value i.e. Q.3= size of the th item.
4
62 is taken as the frequency of the quartile Q.1= 35+ (1.0323)*2
class and 76 is taken as the C.f. of the quartile Q.3. = size of the 150th item
class.) Q.1 = 35 + 2.064
The Q3 lies in the class 37-39 (since the
𝑛
− 𝐶.𝑓 𝑛
Q.1. = 37.064 location of Q.3 class is 150th item and there is
4
𝑄. 1 = 𝐿 + ∗ 𝑖 ; L = 35, 4 =50, C.f. = no 150 in the observations so the nearest value
𝑓
𝑄. 𝐷. = 0.71545
C. The Mean Deviation
The two measures of dispersion discussed above, namely the range and the Q.D., are not measures of dispersion in the true sense, as they do not show the scatter
around an average. To study the formation of a distribution, however, we should take deviations from an average. The remaining two measures, the mean deviation
and the standard deviation, help us achieve this.
The mean deviation is also known as the average deviation. It is the average difference between the items in a distribution and the median or mean of that
series. Theoretically there is an advantage in taking the deviations from the median, because the sum of the deviations of the items from the median is
a minimum when signs are ignored. In practice, however, the arithmetic mean is more frequently used in calculating the average deviation, which is
why it is also called the mean deviation.
Merits
i. It is simple to understand and easy to compute.
ii. It is based on each and every item of the data. Consequently change in the value of any item would change the value of the mean deviation.
iii. It is less affected by the values of extreme items than the standard Deviation
Limitations
i. The greatest drawback is that it ignores the algebraic signs while taking the deviations of the items. This is mathematically wrong and makes it
non-algebraic.
ii. This method will not give us the accurate result. The reason is that the mean deviation gives us the best result when deviations are taken from
median. But median is not a satisfactory measure when the degree of variability is high. And if we compute the mean deviation from the mean
that is also not desirable because the sum of the deviations from mean (ignoring the signs) is greater than the sum of the deviations from
median (ignoring the signs). If the Mean Deviation is computed from the mode that is also not scientific because the value of mode cannot
always be determined.
iii. It is not capable of further algebraic treatment.
iv. It is rarely used in sociological studies.
Calculation of Mean Deviation – Individual series observation
Q. Calculate the Mean Deviation and its co-efficient of the income earned by a group of individuals.
Soln. Mean Deviation = (1/n) × Σ|X − A|
Or Mean Deviation = (1/n) × Σ|D|
Or Mean Deviation = Σ|D| / N
Q. Calculate the Mean Deviation and its co-efficient from the following series.
X    10    11    12    13    14
f    3     12    18    12    3
Soln.
Mean Deviation = (1/n) × Σf|X − A|
Or Mean Deviation = (1/n) × Σf|D|
Or Mean Deviation = Σf|D| / N
I. Computing the median of the series
Median = size of the ((N + 1) / 2)th item = (48 + 1) / 2 = 24.5th item
Size of the 24.5th item (in the X) = 12, hence median = 12
II. Computing the mean deviation of the series
Mean Deviation = Σf|D| / N = 36 / 48 = 0.75
Interpretation
This means that the average deviation from the median is 0.75.
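The discrete-series calculation can be sketched in Python as follows. The function name is illustrative, and the co-efficient is computed as mean deviation divided by the median, which is the usual definition but is not written out in these notes.

```python
def mean_deviation(values, freqs, centre):
    """Mean deviation = sum of f*|X - A| / N, taken about the given centre A."""
    n = sum(freqs)
    return sum(f * abs(x - centre) for x, f in zip(values, freqs)) / n


x = [10, 11, 12, 13, 14]
f = [3, 12, 18, 12, 3]
median = 12                               # size of the 24.5th item, found above

md = mean_deviation(x, f, centre=median)
coefficient = md / median                 # assumed definition: M.D. / median

print(md, coefficient)  # 0.75 0.0625
```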
Calculation of Mean Deviation –Continuous series observation
Q. Calculate the Mean Deviation and its co-efficient from the following series.
I. Computing the median of the series
Location of median class = size of the (N / 2)th item = 100 / 2 = 50th item
Median class = 30-40, frequency of the median class = 25, C.f. of the median class = 62, C.f. of the class preceding the median class = 37
Median = L + ((N/2 − C.f.) / f) × i
Median = 30 + ((100/2 − 37) / 25) × 10
Median = 30 + ((50 − 37) / 25) × 10
Median = 30 + (13 / 25) × 10
Median = 30 + 0.52 × 10
Median = 30 + 5.2
Median = 35.2
II. Computing the mean deviation of the series
Mean Deviation = Σf|D| / N; |D| = deviation from the median ignoring signs
Mean Deviation = 1314.8 / 100 = 13.148
D. The Standard Deviation
The concept of the standard deviation was introduced by Karl Pearson in 1893. It is by far the most widely used measure of dispersion. Its significance
lies in the fact that it is free from the defects from which the earlier methods suffer and satisfies most of the properties of a good measure of dispersion. It
is also known as the root mean square deviation, because it is the square root of the mean of the squared deviations from the arithmetic mean. The
standard deviation is denoted by the small Greek letter σ (read as sigma). The standard deviation measures absolute dispersion: the greater the standard
deviation, the greater the magnitude of the deviations of the values from their mean. A small standard deviation means a high degree of uniformity of
the observations as well as homogeneity of the series, and a large standard deviation means the opposite. Hence the standard deviation is extremely useful in judging the
representativeness of the mean.
Merits
i. It is the best measure because of its mathematical characteristics.
ii. It is based on each and every item of the distribution.
iii. It is amenable to algebraic treatment and is less affected by fluctuations of the sampling than most other measures of dispersions.
iv. It is possible to calculate the combined Standard deviation of two or more groups. This is not possible with any other measure.
v. For comparing the variability of two or more distributions co-efficient of variations is considered as the most appropriate and this is based on mean
and standard deviation.
vi. Standard deviation is used for advanced statistical work.
Limitations
i. As compared to other measures of dispersions, the Standard deviation is difficult to measure.
ii. It gives more weight to extreme items and less weight to items near mean. It is because of the fact that the squares of the deviations which are big in
size would be proportionately greater than the squares of those deviations which are comparatively small.
Calculation of Standard Deviation – Individual series observation (Actual mean method)
Q. from the given question, calculate the standard Deviation.
Sl. No.    X    X̄ = ΣX / N    x = (X − X̄)    x²
σ = √(Σx² / N)
σ = √(2461 / 10)
σ = √246.1
σ = 15.69
Calculation of Standard Deviation – Individual series observation (Assumed mean method)
Q. from the given question, calculate the standard Deviation.
σ = √(Σd²/N − (Σd/N)²)
σ = √(2689/10 − (1/10)²)
σ = √(268.9 − 0.01)
σ = 16.398
Calculation of Standard Deviation –Discrete series observation (Actual mean method)
Q. from the given question, calculate the standard Deviation.
X    3.5    4.5    5.5    6.5    7.5    8.5    9.5
f    3      7      22     60     85     32     8
Working with deviations taken from the assumed mean A = 6.5:
X      f     d = (X − 6.5)    d²    fd     fd²
3.5    3     -3               9     -9     27
4.5    7     -2               4     -14    28
5.5    22    -1               1     -22    22
6.5    60    0                0     0      0
7.5    85    1                1     +85    85
8.5    32    2                4     +64    128
9.5    8     3                9     +24    72
N = 217                             Σfd = +128    Σfd² = 362
Here, A = 6.5 (any value, whether or not it occurs among the values of X, may be taken as the assumed mean and the final answer will still be the same;
however, the nearer the assumed mean is to the actual mean, the lighter the calculations).
σ = √(Σfd²/N − (Σfd/N)²)
σ = √(362/217 − (128/217)²)
σ = √(1.668 − 0.348)
σ = 1.149
Salaries (Rs. ‘000)    45    50    55    60    65    70    75    80
Number of persons      3     5     8     7     9     7     4     7
Salaries (X)    No. of Persons (f)    d = (X − 60)/5    d²    fd     fd²
45              3                     -3                9     -9     27
50              5                     -2                4     -10    20
55              8                     -1                1     -8     8
60              7                     0                 0     0      0
65              9                     +1                1     +9     9
70              7                     +2                4     +14    28
75              4                     +3                9     +12    36
80              7                     +4                16    +28    112
N = 50                                                        Σfd = +36    Σfd² = 240
Here, X = mid-points = salaries in ‘000, number of persons = frequencies (f), A = 60 (any value, whether or not it occurs among the values of X, may be
taken as the assumed mean and the final answer will still be the same; however, the nearer the assumed mean is to the actual mean, the lighter the
calculations), d = (X − A) / i, i = interval between two mid-points.
σ = √(Σfd²/N − (Σfd/N)²) × i
σ = √(240/50 − (36/50)²) × 5
σ = √(4.8 − 0.5184) × 5
σ = 2.0692 × 5
σ = 10.35
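The step-deviation formula can be sketched in Python and cross-checked against the standard library. The function name is illustrative. The seventh salary is taken as 75, an assumption based on its step deviation d = +3 in the working table; the cross-check uses `statistics.pstdev`, the population standard deviation.

```python
import math
import statistics


def sd_step_deviation(values, freqs, assumed_mean, step):
    """sigma = sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2) * i, with d = (X - A) / i."""
    n = sum(freqs)
    d = [(x - assumed_mean) / step for x in values]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * step


salaries = [45, 50, 55, 60, 65, 70, 75, 80]   # 75 assumed for the seventh row (d = +3)
persons = [3, 5, 8, 7, 9, 7, 4, 7]

sigma = sd_step_deviation(salaries, persons, assumed_mean=60, step=5)

# Cross-check: expand the frequency table and use the library routine.
expanded = [x for x, f in zip(salaries, persons) for _ in range(f)]
check = statistics.pstdev(expanded)

print(round(sigma, 2), round(check, 2))  # 10.35 10.35
```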
Number of workers 12 17 23 39 16 3
Here, X = mid-points of the classes, number of workers = frequencies (f), A = 35 (any value, whether or not it occurs among the values of X, may be
taken as the assumed mean and the final answer will still be the same; however, the nearer the assumed mean is to the actual mean, the lighter the
calculations; a convenient choice is whichever of the two middlemost values of X has the higher frequency), d = (X − A) / i, i = interval between
two mid-points.
σ = √(Σfd²/N − (Σfd/N)²) × i
σ = √(227/110 − (−71/110)²) × 10
σ = √(2.064 − 0.417) × 10 = 12.83
Co-efficient of Variation
x̄ = A + (Σfd / N) × i = 35 + (−71/110) × 10 = 35 + (−0.6454 × 10) = 35 + (−6.4545) = 28.5454, σ = 12.83
Co-efficient of Variation = (σ / x̄) × 100
Co-efficient of Variation = (12.83 / 28.5454) × 100 = 0.449459 × 100 = 44.9459
E. Variance
Both the variance and the standard deviation are measures of variability in a population. The two measures are closely related, as is clear from the formula:
Variance = σ². Variance is the average squared deviation from the arithmetic mean, or in simpler words the average of the squared differences
from the mean, and the standard deviation is the square root of the variance.
Calculation of Variance
Q. from the given question, calculate the Variance.
Marks Obtained 10-14 14-18 18-22 22-26 26-30 30-34 34-38 38-42 42-46 46-50 50-54 54-58
No. of Student 2 4 4 8 12 16 10 8 4 6 2 4
Here, X = mid-points of the classes of marks, number of students = frequencies (f), assumed mid-point (A) = 60 (any value, whether or not it occurs
among the values of X, may be taken as the assumed mean and the final answer will still be the same; however, the nearer the assumed mean is to the
actual mean, the lighter the calculations), d = (X − A) / i, i = interval between two mid-points.
i. The sizes of the items and their frequencies are both cumulated (a cumulative total is taken for both the class values and the frequencies). Taking the grand total (the
last cumulative value) of each as 100, the various cumulative values are expressed as percentages.
ii. On the X-axis, running from 0 to 100, take the cumulative percentages of the frequencies.
iii. On the Y-axis, running from 0 to 100, take the cumulative percentages of the values of the variable.
iv. Draw a diagonal line joining O (0, 0) with the point P (100, 100) as shown in the diagram below. The line OP makes an angle of 45° with the Y-axis and is
called the line of equal distribution. Any point on this diagonal has the same percentage on X as on Y.
v. Plot the cumulative percentages of the values of the variable (Y) against the corresponding cumulative percentages of the frequencies (X)
for the given distribution and join these points with a smooth freehand curve. For any given distribution, this curve never crosses the
line of equal distribution OP. It always lies below OP unless the distribution is uniform, in which case it coincides with OP. The greater the
variability, the greater the distance of the curve from OP.
Calculation of Lorenz Curve
Q. from the given question, calculate the Lorenz Curve.
Area A
Profits (Rs. ‘000)    No. of Companies
6                     6
25                    11
60                    13
84                    14
105                   15
150                   17
170                   10
400                   14
Area A
Profits (Rs. ‘000)    Cumulative Profit    Cumulative Percentage    No. of Companies    Cumulative no. of companies    Cumulative Percentage
6                     6                    0.6                      6                   6                              6
25                    31                   3.1                      11                  17                             17
60                    91                   9.1                      13                  30                             30
84                    175                  17.5                     14                  44                             44
105                   280                  28.0                     15                  59                             59
150                   430                  43.0                     17                  76                             76
170                   600                  60.0                     10                  86                             86
400                   1000                 100.0                    14                  100                            100
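The cumulative percentages in the table can be reproduced with a short Python sketch. The function name is illustrative; the points are printed rather than plotted, so that the sketch needs no plotting library.

```python
from itertools import accumulate


def cumulative_percentages(values):
    """Cumulate the values and express each running total as a percentage of the grand total."""
    total = sum(values)
    return [100 * running / total for running in accumulate(values)]


profits = [6, 25, 60, 84, 105, 150, 170, 400]      # Rs. '000, Area A
companies = [6, 11, 13, 14, 15, 17, 10, 14]

profit_pct = cumulative_percentages(profits)        # Y-axis values
company_pct = cumulative_percentages(companies)     # X-axis values

# Points of the Lorenz curve as (X, Y); every Y is at or below X,
# so the curve lies on or below the line of equal distribution.
for x, y in zip(company_pct, profit_pct):
    print(round(x, 1), round(y, 1))
```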
Page for Graphical Construction of Lorenz Curve
3. Measures of Correlation
The measures of central tendency and dispersion discussed so far relate to only one variable. In practice, however, we
come across a large number of problems involving two or more variables. If two quantities vary in such a way that movements in
one are accompanied by movements in the other, they are said to be correlated. The degree of relationship between the variables under consideration is measured
through correlation analysis. The measure of correlation, called the correlation coefficient or correlation index, summarizes in one figure
the direction and the degree of correlation. Correlation analysis refers to the techniques used in measuring the closeness of the relationship
between the variables. Thus correlation is a statistical device which helps in analysing the co-variation of two or more variables.
The problem of analysing the relation between different series should be broken down in three steps:
i. Determining whether the relation exists and, if it does, measuring it.
ii. Testing whether it is significant.
iii. Establishing the cause and effect relation, if any.
i. Most variables show some kind of relationship. With the help of correlation analysis, we can measure in one figure the degree
of the relationship existing between the variables.
ii. Once we know that two variables are closely related, we can estimate the value of one variable given the value of the other. This is done
with the help of regression analysis.
iii. Correlation analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others
depend, may reveal to the economist the connections by which disturbances spread, and may suggest the paths through which stabilising
forces may become effective.
Types of correlation
These are the important ways by which correlation can be classified:
Merits
i. This method is simpler to understand and easier to apply compared to the Karl Pearson’s methods. The answers obtained by this method and
the Karl Pearson’s method will be same provided no value is repeated i.e. all items are different.
ii. Where the data are of a qualitative nature like honesty, efficiency, intelligence, etc., this method can be used with great advantage. For
example, the workers of two factories can be ranked in order of their efficiency and the degree of correlation can be established by applying
this method.
iii. This is the only method that can be used where we are given the ranks and not the actual data.
iv. Even where the actual data are given, rank method can be applied for ascertaining correlation.
v. Rank correlation is very useful when the data are non-normally distributed.
Limitations
This method is however associated with few limitations like:
i. This method cannot be used for finding out the correlation in a grouped distribution.
ii. Where the number of items exceeds 30 the calculations become quite tedious and require a lot of time. Therefore, this method should not be
applied where N exceeds 30 unless we are given the ranks and not the actual values of the variables.
Lipstick A B C D E F G
Neelu 2 1 4 3 5 7 6
Neena 1 3 2 4 5 6 7
Calculate the Spearman’s Rank correlation coefficient.
X (R₁)    Y (R₂)    D = (R₁ − R₂)    D²
2         1         +1               1
1         3         -2               4
4         2         +2               4
3         4         -1               1
5         5         0                0
7         6         +1               1
6         7         -1               1
                                     ΣD² = 12
R = 1 − (6 × ΣD²) / (N³ − N)
R = 1 − (6 × 12) / (7³ − 7)
R = 1 − 72 / 336
R = 1 − 0.214 = 0.786
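When the ranks are already given, the coefficient is a one-line computation. The sketch below uses an illustrative function name and assumes there are no tied ranks.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """R = 1 - 6 * sum(D^2) / (N^3 - N), where D is the difference between paired ranks."""
    n = len(ranks_1)
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


neelu = [2, 1, 4, 3, 5, 7, 6]   # ranks given to lipsticks A-G
neena = [1, 3, 2, 4, 5, 6, 7]

print(round(spearman_from_ranks(neelu, neena), 3))  # 0.786
```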
Calculation of Spearman’s Rank correlation coefficient (when rank is not given)
Q. Calculate the spearman’s coefficient of correlation between the marks assigned to 10 students by judges X and Y in a certain competitive test as shown
below
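Sl. No.             1     2     3     4     5     6     7     8     9     10
Marks by judge X    52    53    42    60    45    41    37    38    25    27
Marks by judge Y    65    68    43    38    77    48    35    30    25    50
In the working table, x lists the marks by judge X in ascending order, y is the mark given by judge Y to the same student, and Rx and Ry are their ranks.
Marks by Judge X    x     Rx    Marks by Judge Y    y     Ry    D = Rx − Ry    D²
52                  25    1     65                  25    1     0              0
53                  27    2     68                  50    7     -5             25
42                  37    3     43                  35    3     0              0
60                  38    4     38                  30    2     2              4
45                  41    5     77                  48    6     -1             1
41                  42    6     48                  43    5     1              1
37                  45    7     35                  77    10    -3             9
38                  52    8     30                  65    8     0              0
25                  53    9     25                  68    9     0              0
27                  60    10    50                  38    4     6              36
                                                                               ΣD² = 76
R = 1 − (6 × ΣD²) / (N³ − N)
R = 1 − (6 × 76) / (10³ − 10)
R = 1 − 456 / 990
R = 1 − 0.461 = 0.539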
Candidate 1 2 3 4 5 6 7 8
judge X 20 22 28 23 30 30 23 24
judge Y 28 24 24 25 26 27 32 30
𝛴𝐷²= 88.50
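R = 1 − 6 × [ΣD² + (1/12)(m₁³ − m₁) + (1/12)(m₂³ − m₂) + ⋯] / (N³ − N); where m₁, m₂, … = the number of items sharing each tied rank.
R = 1 − 6 × [88.50 + (1/12)(2³ − 2) + (1/12)(2³ − 2) + (1/12)(2³ − 2)] / (8³ − 8)
R = 1 − 6 × (88.50 + 0.5 + 0.5 + 0.5) / 504
R = 1 − 540 / 504
R = 1 − 1.071 = −0.071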
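The tied-rank version can be sketched in Python as follows. The function names are illustrative, and ranking in ascending order with tied values sharing the average of their positions is an assumption; ranking both series in descending order instead gives the same ΣD².

```python
from collections import Counter


def average_ranks(values):
    """Rank in ascending order; tied values share the average of the positions they occupy."""
    counts = Counter(values)
    first_position = {}
    for position, value in enumerate(sorted(values), start=1):
        first_position.setdefault(value, position)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def spearman_with_ties(x, y):
    """R = 1 - 6 * [sum(D^2) + sum((m^3 - m) / 12)] / (N^3 - N)."""
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    # One correction term per group of m tied items; untied values (m = 1) add 0.
    correction = sum((m ** 3 - m) / 12 for series in (x, y) for m in Counter(series).values())
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


judge_x = [20, 22, 28, 23, 30, 30, 23, 24]
judge_y = [28, 24, 24, 25, 26, 27, 32, 30]

print(round(spearman_with_ties(judge_x, judge_y), 3))  # -0.071
```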