3 - Measures of Variation

The document discusses measures of variation and dispersion in data sets, emphasizing the importance of understanding variability alongside central tendency. It covers various measures such as range, interquartile range, variance, standard deviation, and coefficient of variation, providing examples and calculations for clarity. Additionally, it explains the five-number summary and the identification of outliers using quartiles and IQR.

Measures of Variation / Dispersion/ Spread


• The arithmetic mean is a concise way of presenting data, but it is inadequate on its own because it gives no indication of its reliability.
• Two data sets can have the same mean and still be quite different with respect to the variation among the values within each data set.

Data 1   Data 2
49       20
50       50
51       80

Comparing means, both data sets look the same (mean = 50), but they are quite different in terms of the variability among the values within each data set.

• Measures of variation measure the variation
present among the values in a data set with a
single number so measures of variation are
summary measures of spread of values in the
data.
• A measure of central tendency along with a
measure of dispersion gives an adequate
description of statistical data.

Measures of Variability
• Distance Based Measures of Spread
– Range
– Interquartile Range
• Centre Based Measures of Spread
– Variance
– Standard Deviation
– Coefficient of Variation
Range
The range of a data set is the difference between the largest and smallest data values.

$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$

Example: The following data sets give the plant height (in cm) for two varieties of wheat, Pak 81 and LU 26.

Pak 81   80 82 83 85 86 88 89 90
LU 26    50 81 83 85 85 87 88 90

Range (Pak 81) = 90 - 80 = 10 cm
Range (LU 26) = 90 - 50 = 40 cm
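
The range calculation above can be sketched in a few lines of Python. This is only an illustration; the function name `data_range` is an assumption and not part of the original slides.

```python
# Plant heights (cm) for the two wheat varieties in the example above.
pak_81 = [80, 82, 83, 85, 86, 88, 89, 90]
lu_26 = [50, 81, 83, 85, 85, 87, 88, 90]


def data_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


print("Range Pak 81:", data_range(pak_81), "cm")
print("Range LU 26:", data_range(lu_26), "cm")
```

A single low value (50) is enough to make the range of LU 26 four times that of Pak 81.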
Dot Plots for Both Data Sets

[Figure: Dot plot of height (cm) for the two varieties, LU 26 and Pak 81]
Disadvantages of the Range

• Ignores the intermediate observations

[Figure: Two dot plots on an axis from 7 to 12 with different intermediate values; in both, Range = 12 - 7 = 5]

• Sensitive to outliers

1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5
Range = 5 - 1 = 4

1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 120
Range = 120 - 1 = 119

InterQuartile Range (IQR)
• The interquartile range of a data set is the difference
between the third quartile (largest quartile) and the
first quartile(smallest quartile).
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.

Interquartile range( IQR) = Q3 - Q1

(a) Not affected by the presence of outliers in the data
(b) Based on the middle 50% of the observations

IQR
Example: The following data sets give the plant height (in cm) for two varieties of wheat, Pak 81 and LU 26.

Pak 81   80 82 83 85 86 88 89 90
LU 26    50 81 83 85 85 87 88 90

Range (Pak 81) = 90 - 80 = 10 cm
Range (LU 26) = 90 - 50 = 40 cm

IQR (Pak 81) = 88.75 - 82.25 = 6.50 cm
IQR (LU 26) = 87.75 - 81.50 = 6.25 cm

The low value 50 inflates the range of LU 26 but has almost no effect on its IQR.
Each quarter between the minimum, Q1, the median (Q2), Q3 and the maximum holds 25% of the data:

Variety   Minimum   Q1      Median (Q2)   Q3      Maximum   IQR = Q3 - Q1
Pak 81    80        82.25   85.50         88.75   90        88.75 - 82.25 = 6.50 cm
LU 26     50        81.50   85            87.75   90        87.75 - 81.50 = 6.25 cm
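
The slides do not state how the quartiles are located. The values shown agree with placing the p-th quartile at position p(n + 1) in the ordered data and interpolating linearly, which is what `method="exclusive"` does in Python's `statistics.quantiles`. The sketch below rests on that assumption; the helper names `quartiles` and `iqr` are illustrative.

```python
import statistics

# Plant heights (cm) for the two wheat varieties.
pak_81 = [80, 82, 83, 85, 86, 88, 89, 90]
lu_26 = [50, 81, 83, 85, 85, 87, 88, 90]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the (n + 1) position rule."""
    q1, q2, q3 = statistics.quantiles(sorted(values), n=4, method="exclusive")
    return q1, q2, q3


def iqr(values):
    """Return the interquartile range Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


for name, heights in (("Pak 81", pak_81), ("LU 26", lu_26)):
    print(name, quartiles(heights), "IQR =", iqr(heights))
```

Other quartile conventions (for example `method="inclusive"`) give slightly different values for small samples.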
Centre Based Measures
Idea: Select a single value as the reference value (the ideal reference value is the mean of the data), take the deviations of the values from the mean, and sum these deviations.
Example: The following data give the yield per plot of three wheat varieties A, B and C. Compare the yield performance of the three varieties.

XA   XB   XC
40   30   10
50   50   50
60   70   90

XA XB XC X A  XA X B  XB  X C  XC 

40 30 10 -10 -20 -40


50 50 50 0 0 0
60 70 90 10 20 40
Problem:- Sum of deviations from mean is always 0
 (X  X ) 0
(Due to cancelling sign problem) regardless of the spread of
values in the data. Hence deviation of values from mean can
not be used as a measure of spread in the data
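
The cancellation is not special to this example. A short derivation, using only the definition of the mean $\bar{X} = \sum X / n$, shows why:

$$\sum (X - \bar{X}) = \sum X - n\bar{X} = \sum X - n \cdot \frac{\sum X}{n} = 0$$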
[Figure: Dot plot of yield for varieties A, B and C]
Solution: Square the deviations and then sum the squared deviations, $\sum (X - \bar{X})^2$, to get rid of the sign cancellation problem.

$X_A$   $X_B$   $X_C$   $(X_A - \bar{X}_A)^2$   $(X_B - \bar{X}_B)^2$   $(X_C - \bar{X}_C)^2$
40      30      10      100                     400                     1600
50      50      50      0                       0                       0
60      70      90      100                     400                     1600
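Variance: the average of the squared deviations from the mean

$$S_A^2 = \frac{\sum (X_A - \bar{X}_A)^2}{n} = \frac{200}{3} = 66.67 \text{ kg}^2$$

$$S_B^2 = \frac{\sum (X_B - \bar{X}_B)^2}{n} = \frac{800}{3} = 266.67 \text{ kg}^2$$

$$S_C^2 = \frac{\sum (X_C - \bar{X}_C)^2}{n} = \frac{3200}{3} = 1066.67 \text{ kg}^2$$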
Alternative formula (desk formula) for variance

        X     X²
        40    1600
        50    2500
        60    3600
Total   150   7700

$$S^2 = \frac{\sum (X_A - \bar{X}_A)^2}{n} = 66.67$$

$$S^2 = \frac{1}{n}\left(\sum X^2 - \frac{(\sum X)^2}{n}\right) = \frac{1}{3}\left(7700 - \frac{(150)^2}{3}\right) = 66.67$$

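
As a check on the arithmetic, the definitional formula and the desk formula can be compared numerically. The sketch below divides by n, as the slides do; the function names are illustrative and not from the original.

```python
# Yield per plot (kg) for the three wheat varieties.
yields = {"A": [40, 50, 60], "B": [30, 50, 70], "C": [10, 50, 90]}


def variance_definition(values):
    """Average of the squared deviations from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_desk(values):
    """Desk formula: (sum of X^2 - (sum of X)^2 / n) / n."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / n


for name, values in yields.items():
    print(name,
          round(variance_definition(values), 2),
          round(variance_desk(values), 2))
```

Both functions return the same value for each variety; the desk formula only needs the totals of X and X², which is why it is convenient for hand calculation.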
Variance
• The variance is a measure of variability that
utilizes all the data.
• It is based on the squared difference between
the value of each observation (xi) and the
mean of the data
• The variance is denoted by $S^2$.
Problem With Variance
Variance measures the variation in the data in the square of the units of measurement, so it is difficult to interpret directly.

Solution: Take the positive square root of the variance, known as the standard deviation and denoted by S. It has the same units as the measurements themselves.
Standard Deviation
• The standard deviation of a data set is the positive
square root of the variance.
• It is measured in the same units as the data, so it is easier than the variance to compare with the mean.
• The standard deviation is denoted by S.

$$S_A = \sqrt{66.67} = 8.16 \text{ kg}$$
$$S_B = \sqrt{266.67} = 16.33 \text{ kg}$$
$$S_C = \sqrt{1066.67} = 32.66 \text{ kg}$$
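
These three values can be reproduced with the standard library. `statistics.pstdev` divides by n, which matches the variance formula used on these slides; the dictionary name `sd` is just for illustration.

```python
import statistics

# Yield per plot (kg) for the three wheat varieties.
yields = {"A": [40, 50, 60], "B": [30, 50, 70], "C": [10, 50, 90]}

# Population standard deviation (divisor n) for each variety.
sd = {name: statistics.pstdev(values) for name, values in yields.items()}

for name, value in sd.items():
    print(f"S_{name} = {value:.2f} kg")
```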
Coefficient of Variation (CV)
• Shows relative variability, that is, variability relative to the magnitude of the data (variation relative to the mean)
• Always expressed as a percentage (%)
• Unit-free measure of variation
• Can be used to compare two or more sets of data
– measured in different units
– measured in the same units but with different average size

$$CV = \frac{S}{\bar{X}} \times 100$$
Coefficient of Variation
The following data give the length (in inches) and weight (in kg) for a sample of 10 fish of the same species after using a particular type of fish feed.

Fish          1     2     3     4     5     6     7     8     9     10
Weight (kg)   1.8   1.9   2.1   2.4   2.5   2.6   2.7   2.8   3.1   3.2
Length (in)   11    12    12    13    15    15    16    17    18    18

Which characteristic, weight or length, is relatively more variable?

         Standard deviation (S)   Mean           CV (%)
Weight   0.472 kg                 2.51 kg        18.82
Length   2.584 inches             14.70 inches   17.58

Weight has the larger CV, so weight is relatively more variable than length.
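
The table can be reproduced with a short sketch. One assumption to flag: the standard deviations shown (0.472 and 2.584) correspond to the sample formula with divisor n - 1, so the code uses `statistics.stdev` rather than the divisor-n formula from the variance slides. The function name is illustrative.

```python
import statistics

weight_kg = [1.8, 1.9, 2.1, 2.4, 2.5, 2.6, 2.7, 2.8, 3.1, 3.2]
length_in = [11, 12, 12, 13, 15, 15, 16, 17, 18, 18]


def coefficient_of_variation(values):
    """Return CV = S / mean * 100, with S the sample SD (divisor n - 1)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


print(f"CV weight: {coefficient_of_variation(weight_kg):.2f}%")
print(f"CV length: {coefficient_of_variation(length_in):.2f}%")
```

Because CV is unit-free, the comparison is meaningful even though weight is in kg and length is in inches.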
Example: The following data give the height (in feet) of sugarcane and wheat plants after applying a particular type of fertilizer. Compare which crop's plants are more variable in height.

Sugarcane   10.3   12.1   12.3   12.5   12.6   12.8   13.0   13.2
Wheat       3.1    3.5    3.8    3.9    4.0    4.3    4.5    4.9

[Figure: Dot plot of height (feet) by crop, Sugarcane and Wheat]

            SD           Mean         CV
Sugarcane   0.90 feet    12.23 feet   7.37%
Wheat       0.567 feet   4.0 feet     14.21%

Sugarcane has the larger standard deviation, but wheat has the larger CV, so wheat plants are relatively more variable in height.
Five Number Summary
The five number summary of a data set consists of
1. Minimum value
2. Maximum value
3. Q1
4. Q2
5. Q3
The graph of the five number summary is called a Box-Whisker Plot.

Five Number Summary
Example: The following data set shows the marks obtained by 20 students.
25 41 27 32 43 66 35 31 15 5
34 26 32 37 16 30 38 30 20 21
Determine the five number summary.
The ordered array of the data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

1. Minimum value   5
2. Maximum value   66
3. Q1              22
4. Q2              30.5
5. Q3              36.5
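
A minimal sketch of the five number summary for the ordered array above. It assumes the quartiles are located by the (n + 1) position rule with linear interpolation (`method="exclusive"` in `statistics.quantiles`), which reproduces Q1 = 22, Q2 = 30.5 and Q3 = 36.5; the function name is illustrative.

```python
import statistics

# Ordered array of the 20 marks.
marks = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
         31, 32, 32, 34, 35, 37, 38, 41, 43, 66]


def five_number_summary(values):
    """Return minimum, Q1, Q2, Q3 and maximum as a dictionary."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return {"min": ordered[0], "q1": q1, "q2": q2, "q3": q3,
            "max": ordered[-1]}


print(five_number_summary(marks))
```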
Construction of Box Whisker Plot
1. Start the box at Q1 and end it at Q3
2. Within the box, draw a line to represent Q2
3. Draw the lower whisker from the minimum value up to Q1
4. Draw the upper whisker from Q3 up to the maximum value

For the marks data:
1. Q1 = 22.0, Q3 = 36.5
2. Q2 = 30.5
3. Minimum value = 5.0
4. Maximum value = 66.0

[Figure: Box-whisker plot of the marks data, vertical axis from 0 to 70]
Interpretation of Box-Whisker Plot

A Box-Whisker Plot is useful to identify:
• From the upper and lower whiskers: the maximum and minimum values in the data
• From the line within the box, i.e. Q2: the average size of the data
• From the length of the graph: Range = Max - Min
• From the length of the box, i.e. Q3 - Q1 = IQR: the variability in the data (a longer box indicates more variability)
• From the position of the line within the box: the shape of the data
Line at the centre of the box: symmetrical
Line above the centre of the box: negatively skewed

[Figure: Box-whisker plot of the marks data, vertical axis from 0 to 70]
Outliers
Outliers are values that fall well outside the overall pattern of the data. An outlier may be
• The result of a measurement or recording error
• A member of a different population than the rest of the sample
• Simply an unusual extreme value

Professor John Wilder Tukey suggested a method for defining outliers. We can use the quartiles and the IQR = Q3 - Q1 to identify the outliers.

Determine Inner and Outer Fences
If Q1 = 22.0, Q2 = 30.5 and Q3 = 36.5, then IQR = Q3 - Q1 = 14.5.
The inner fences and outer fences are defined as follows:

Inner fences:
$$\text{Lower inner fence} = Q_1 - 1.5\,\text{IQR} = 0.25$$
$$\text{Upper inner fence} = Q_3 + 1.5\,\text{IQR} = 58.25$$

Outer fences:
$$\text{Lower outer fence} = Q_1 - 3\,\text{IQR} = -21.5$$
$$\text{Upper outer fence} = Q_3 + 3\,\text{IQR} = 80.0$$

That is, in a box and whisker diagram the inner fences are constructed below and above the box at a distance of 1.5 times the IQR, and the outer fences are constructed below and above the box at a distance of 3 times the IQR.

Identification of Suspected and Sure Outliers
1. The values that lie within the inner fences are normal values
2. The values that lie outside the inner fences, but inside the outer fences, are possible / suspected / mild outliers
3. The values that lie outside the outer fences are sure outliers

In the marks data, only 66 is a mild outlier.

Question: If the mark were 96 instead of 66, could this value be considered an outlier?

Plot each suspected outlier with an asterisk and each sure outlier with a hollow dot.

[Figure: Box-whisker plot of the marks data, vertical axis from 0 to 80, with 66 marked by an asterisk]
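
The fence rules can be sketched as a small classifier. The function names and labels below are illustrative, and a value lying exactly on a fence is treated as inside it, which is an assumption since the slides do not cover that case.

```python
def tukey_fences(q1, q3):
    """Return inner (1.5 x IQR) and outer (3 x IQR) fences."""
    iqr = q3 - q1
    return {
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }


def classify(value, fences):
    """Label a value as normal, mild outlier or sure outlier."""
    if value < fences["lower_outer"] or value > fences["upper_outer"]:
        return "sure outlier"
    if value < fences["lower_inner"] or value > fences["upper_inner"]:
        return "mild outlier"
    return "normal"


# Quartiles of the marks data: Q1 = 22.0, Q3 = 36.5.
fences = tukey_fences(22.0, 36.5)
print(fences)
for mark in (43, 66, 96):
    print(mark, "->", classify(mark, fences))
```

With these fences, 66 lies between 58.25 and 80.0 and is a mild outlier, while a mark of 96 would lie beyond the upper outer fence.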
Example: A study of the effects of smoking on sleep
patterns is conducted. The measure observed is the
time, in minutes, that it takes to fall asleep. These
data are obtained:
Smokers:
69.3 56.0 22.1 47.6 53.2 48.1 52.7 34.4
60.2 43.8 23.2 13.8
Nonsmokers:
28.6 25.1 26.4 34.9 29.8 28.4 38.5 30.2
30.6 31.8 41.6 21.1 36.0 37.9 13.9
Graphical Analysis

[Figure: Dot plot of time and boxplot of time (minutes) by habit, Non-Smoker and Smoker]
Statistical Analysis
Statistics

Habit        Mean    StDev   CoefVar   Median
Non-Smoker   30.32   7.13    23.51     30.20
Smoker       43.70   16.93   38.74     47.85
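
The statistics table can be reproduced from the raw data. The sketch below assumes the sample standard deviation (divisor n - 1), which is what matches the StDev column; the function and key names are illustrative.

```python
import statistics

# Time to fall asleep, in minutes.
smokers = [69.3, 56.0, 22.1, 47.6, 53.2, 48.1, 52.7, 34.4,
           60.2, 43.8, 23.2, 13.8]
nonsmokers = [28.6, 25.1, 26.4, 34.9, 29.8, 28.4, 38.5, 30.2,
              30.6, 31.8, 41.6, 21.1, 36.0, 37.9, 13.9]


def summarize(values):
    """Return mean, sample SD, coefficient of variation (%) and median."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {"mean": mean, "stdev": sd, "coef_var": sd / mean * 100,
            "median": statistics.median(values)}


for habit, times in (("Non-Smoker", nonsmokers), ("Smoker", smokers)):
    row = summarize(times)
    print(habit, {key: round(value, 2) for key, value in row.items()})
```

Smokers take longer on average to fall asleep and are also more variable, both in absolute terms (StDev) and relative to the mean (CoefVar).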
Standard Variable
• A variable that has mean 0 and variance 1 is called a standard variable
• Values of a standard variable are called standard scores
• Standard scores are unit-less
• Construction

$$Z = \frac{\text{Variable} - \text{Mean of variable}}{\text{Standard deviation of variable}}$$

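        $X$   $(X - \bar{X})^2$   $Z$       $(Z - \bar{Z})^2$
        3     25                  -1.3624   1.8561
        6     4                   -0.5450   0.2970
        11    9                   0.8174    0.6682
        12    16                  1.0899    1.1879
Total   32    54                  0         4.009

$$\bar{X} = \frac{\sum X}{n} = \frac{32}{4} = 8$$

$$S_x^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{54}{4} = 13.5, \qquad S_x = 3.67$$

$$Z = \frac{X - \bar{X}}{S_x} = \frac{X - 8}{3.67}$$

$$\bar{Z} = \frac{\sum Z}{n} = 0, \qquad S_z^2 = \frac{\sum (Z - \bar{Z})^2}{n} = \frac{4.009}{4} \approx 1$$

Variable Z has mean 0 and variance 1, so Z is a standard variable.

Standard score at X = 3:

$$Z = \frac{X - \bar{X}}{S_x} = \frac{3 - 8}{3.67} = -1.3624$$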
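
The standardization in this example can be checked with a short sketch. It uses the divisor-n standard deviation (`statistics.pstdev`), as the worked example does; the function name `standardize` is illustrative. The scores differ from the table in the third decimal place because the slides round S to 3.67 before dividing.

```python
import statistics

x = [3, 6, 11, 12]


def standardize(values):
    """Return z scores using the mean and the divisor-n standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(value - mean) / sd for value in values]


z = standardize(x)
print([round(score, 4) for score in z])
print("mean of Z:", round(statistics.mean(z), 10))
print("variance of Z:", round(statistics.pvariance(z), 10))
```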
Using z scores to evaluate performance
The industry in which sales rep Mr. Bilal works has mean annual sales = $2,500 with standard deviation = $500.
The industry in which sales rep Mr. Perviz works has mean annual sales = $4,800 with standard deviation = $600.
Last year Mr. Bilal's sales were $4,000 and Mr. Perviz's sales were $6,000.
Which of the representatives would you hire if you had one sales position to fill?

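Standard Units

                              Mr. Bilal   Mr. Perviz
Industry mean ($\bar{X}$)     $2,500      $4,800
Industry SD (S)               $500        $600
Sales last year (X)           $4,000      $6,000

$$Z_B = \frac{X_B - \bar{X}_B}{S_B} = \frac{4{,}000 - 2{,}500}{500} = 3$$

$$Z_P = \frac{X_P - \bar{X}_P}{S_P} = \frac{6{,}000 - 4{,}800}{600} = 2$$

Mr. Bilal is the best choice: his sales are 3 standard deviations above his industry mean, compared with 2 for Mr. Perviz.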


Example: The following data give the performance of different batsmen in ODI and T-20 matches. Evaluate in which format (ODI or T-20) Babar Azam performs better compared with the other batsmen.

Batsman          ODI     T-20
Imam             42      25
Hadir            54      34
Azhar            30      16
Shafiq           45      41
Asif             62      36
Babar Azam       55      40
Mean             48      32
SD               10.41   8.851
Z (Babar Azam)   0.673   0.904

Babar Azam's z score is higher in T-20 (0.904) than in ODI (0.673), so relative to the other batsmen he performs better in T-20.
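
The z scores in the last row can be reproduced as follows. The SD values in the table correspond to the divisor-n formula, so the sketch uses `statistics.pstdev`; the dictionary layout and the helper name `z_for` are illustrative.

```python
import statistics

scores = {
    "ODI": {"Imam": 42, "Hadir": 54, "Azhar": 30, "Shafiq": 45,
            "Asif": 62, "Babar Azam": 55},
    "T-20": {"Imam": 25, "Hadir": 34, "Azhar": 16, "Shafiq": 41,
             "Asif": 36, "Babar Azam": 40},
}


def z_for(player, format_scores):
    """Return the player's z score within one format (divisor-n SD)."""
    values = list(format_scores.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return (format_scores[player] - mean) / sd


z_odi = z_for("Babar Azam", scores["ODI"])
z_t20 = z_for("Babar Azam", scores["T-20"])
print(f"ODI z = {z_odi:.3f}, T-20 z = {z_t20:.3f}")
```

Comparing raw scores would favour ODI (55 against 40), but the z scores account for the lower mean and spread of T-20 scores.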
