
Class 5.2 B Business Statistics Measures of Dispersion

The document discusses measures of dispersion, including range, standard deviation, and variance. It provides definitions and formulas for calculating each measure. Range is defined as the difference between the highest and lowest values. Standard deviation calculates the average amount of variation from the mean. Variance is the square of the standard deviation and reflects the degree of spread in the data. The document explains how to calculate each measure by hand and why sample sizes use n-1 rather than n to provide less biased estimates of population values.

Uploaded by

Priya Chugh

CLASS 5.2 B
BUSINESS STATISTICS
MEASURES OF DISPERSION

RESEARCH SCHOLAR
PRIYA CHUGH
• Variability describes how far apart data points lie from
each other and from the center of a distribution.
• Along with measures of central tendency, measures of
variability give you descriptive statistics that summarize
your data.
• Variability is also referred to as spread, scatter or
dispersion.
Why does variability matter?
While the central tendency, or average, tells you where most of your points lie, variability summarizes how far
apart they are. This is important because it tells you whether the points tend to be clustered around the center
or more widely spread out.
Low variability is ideal because it means that you can better predict information about the population based on
sample data. High variability means that the values are less consistent, so it’s harder to make predictions.
Data sets can have the same central tendency but different levels of variability or vice versa. If you know only
the central tendency or the variability, you can’t say anything about the other aspect. Both of them together
give you a complete picture of your data.

Example: Variability in normal distributions. You are investigating the amounts of time spent on phones daily by 3 different
groups of people.
•Sample A: high school students,
•Sample B: college students,
•Sample C: adult full-time employees.

All three of your samples have the same average phone use, at 195 minutes or 3 hours and 15 minutes. This is the x-axis value
where the peaks of the curves lie.
Although the data follow a normal distribution, each sample has a different spread. Sample A has the largest variability while
Sample C has the smallest variability.
Range
The range tells you the spread of your data from the lowest to the highest value in the distribution. It’s the
easiest measure of variability to calculate.
To find the range, simply subtract the lowest value from the highest value in the data set.

Range example: You have 8 data points from a sample data set.
Data (minutes): 72, 110, 134, 190, 238, 287, 305, 324


The highest value (H) is 324 and the lowest (L) is 72.
R = H – L
R = 324 – 72 = 252
The range of your data is 252 minutes.

Because only 2 numbers are used, the range doesn’t give you any information about the
distribution of values. It’s best used in combination with other measures.
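The range calculation above can be sketched in a few lines of Python. This is only an illustration using the eight phone-use values from the example; the variable names are made up for this sketch.

```python
# Phone-use data (minutes) from the range example.
data = [72, 110, 134, 190, 238, 287, 305, 324]

highest = max(data)            # H
lowest = min(data)             # L
data_range = highest - lowest  # R = H - L

print(f"R = {highest} - {lowest} = {data_range} minutes")
```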
Standard deviation
Average deviation of every item from mean
• The standard deviation is the average amount of variability in your dataset.
• It tells you, on average, how far each score lies from the mean.
• The larger the standard deviation, the more variable the data set is.
• There are six steps for finding the standard deviation by hand:
1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by n – 1 (for a sample) or N (for a population).
6. Find the square root of the number you found.
Standard deviation example
Step 1: Data (minutes)   Step 2: Deviation from mean   Steps 3 + 4: Squared deviation
72                       72 – 207.5 = -135.5           18360.25
110                      110 – 207.5 = -97.5           9506.25
134                      134 – 207.5 = -73.5           5402.25
190                      190 – 207.5 = -17.5           306.25
238                      238 – 207.5 = 30.5            930.25
287                      287 – 207.5 = 79.5            6320.25
305                      305 – 207.5 = 97.5            9506.25
324                      324 – 207.5 = 116.5           13572.25
Mean = 207.5             Sum = 0                       Sum of squares = 63904
Standard deviation example
Because you’re dealing with a sample, you use n – 1.
n – 1 = 7
Variance: 63904 / 7 = 9129.14
Standard deviation: s = √9129.14 ≈ 95.55
The standard deviation of your data is about 95.55 minutes. This means that, on average, each score deviates from the mean by
roughly 95.55 minutes.
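The six steps can also be traced in code. The following is a minimal sketch, assuming the eight phone-use values are treated as a sample (so the divisor is n – 1); the variable names are chosen for this illustration only.

```python
import math

data = [72, 110, 134, 190, 238, 287, 305, 324]

# Step 1: find the mean.
n = len(data)
mean = sum(data) / n

# Step 2: deviation of each score from the mean.
deviations = [x - mean for x in data]

# Step 3: square each deviation.
squared = [d ** 2 for d in deviations]

# Step 4: add up the squared deviations.
sum_of_squares = sum(squared)

# Step 5: divide by n - 1 because the data are a sample.
variance = sum_of_squares / (n - 1)

# Step 6: take the square root.
std_dev = math.sqrt(variance)

print(mean, sum_of_squares, round(variance, 2), round(std_dev, 2))
```

Printing the intermediate values makes it easy to check each row of a hand calculation against the code.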

Standard deviation formula for populations
If you have data from the entire population, use the population standard deviation formula:
σ = √[ Σ(X – μ)² / N ]
Formula explanation:
•σ = population standard deviation
•∑ = sum of…
•X = each value
•μ = population mean
•N = number of values in the population
Standard deviation formula for samples
If you have data from a sample, use the sample
standard deviation formula:
s = √[ Σ(X – x̅)² / (n – 1) ]
Formula explanation:
•s = sample standard deviation
•∑ = sum of…
•X = each value
•x̅ = sample mean
•n = number of values in the sample
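To see how much the choice of divisor matters, the two formulas can be compared on the same numbers. This sketch uses Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n – 1; applying both to the phone-use data is purely illustrative, since in practice you pick one depending on whether the data are a population or a sample.

```python
import statistics

data = [72, 110, 134, 190, 238, 287, 305, 324]

# Treats the data as the whole population: divides by N.
sigma = statistics.pstdev(data)

# Treats the data as a sample: divides by n - 1.
s = statistics.stdev(data)

print(f"population formula: {sigma:.2f}")
print(f"sample formula:     {s:.2f}")
```

The sample formula always gives the larger of the two values, because the same sum of squares is divided by a smaller number.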
Why use n – 1 for sample standard deviation?

• Samples are used to make statistical inferences about the population that they
came from.
• When you have population data, you can get an exact value for population
standard deviation. Since you collect data from every population member, the
standard deviation reflects the precise amount of variability in your distribution,
the population.
• But when you use sample data, your sample standard deviation is always used
as an estimate of the population standard deviation. Using n in this formula
tends to give you a biased estimate that consistently underestimates variability.
• Reducing the sample n to n – 1 makes the standard deviation artificially large,
giving you a conservative estimate of variability.
• While this is not an unbiased estimate, it is a less biased estimate of standard
deviation: it is better to overestimate rather than underestimate variability in
samples.
Variance
• The variance is the average of squared deviations from the mean. A
deviation from the mean is how far a score lies from the mean.
• Variance is the square of the standard deviation.
• Variance reflects the degree of spread in the data set. The more
spread the data, the larger the variance is in relation to the mean.
• Variance example: To get the variance, square the standard deviation.
• s ≈ 95.55 (95.5465 before rounding)
• s² = 95.5465 × 95.5465 ≈ 9129.14
• The variance of your data is 9129.14 (in squared minutes).
• To find the variance by hand, perform all of the steps for standard
deviation except for the final step.
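The relationship between the two measures can be checked numerically. This is a small sketch using `statistics.variance` and `statistics.stdev` from the Python standard library, both of which use the sample (n – 1) divisor; the data are the phone-use values used earlier.

```python
import statistics

data = [72, 110, 134, 190, 238, 287, 305, 324]

sample_variance = statistics.variance(data)  # sum of squares / (n - 1)
sample_std_dev = statistics.stdev(data)      # square root of the variance

print(round(sample_variance, 2))
print(round(sample_std_dev ** 2, 2))  # squaring s recovers the variance
```

Squaring a rounded standard deviation gives a slightly different number, so it is safer to round only at the end.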
Variance formula for samples
s² = Σ(X – x̄)² / (n – 1)
Formula explanation:
•s² = sample variance
•Σ = sum of…
•Χ = each value
•x̄ = sample mean
•n = number of values in the sample
Variance formula for populations
σ² = Σ(X – μ)² / N
Formula explanation:
•σ² = population variance
•Σ = sum of…
•Χ = each value
•μ = population mean
•Ν = number of values in the population
Biased versus unbiased estimates of variance
• An unbiased estimate in statistics is one that doesn’t consistently give you either
high values or low values – it has no systematic bias.
• Just like for standard deviation, there are different formulas for population and
sample variance. But while there is no unbiased estimate for standard deviation,
there is one for sample variance.
• If the sample variance formula used the sample n, the sample variance would be
biased. Reducing the sample n to n – 1 makes the variance artificially larger.
• In this case, bias is not only lowered but totally removed. The sample variance
formula gives completely unbiased estimates of variance.
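The claim about bias can be checked exactly on a tiny case. The sketch below assumes a hypothetical population {1, 2, 3} and lists every equally likely sample of size 2 drawn with replacement; it then averages the variance estimates obtained with divisor n and with divisor n – 1. Exact fractions are used so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical toy population
N = len(population)
mu = Fraction(sum(population), N)
pop_variance = sum((x - mu) ** 2 for x in population) / N


def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor


n = 2
samples = list(product(population, repeat=n))  # all 9 equally likely samples

avg_with_n = sum(sample_variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)

print("population variance:", pop_variance)
print("average estimate dividing by n:", avg_with_n)
print("average estimate dividing by n - 1:", avg_with_n_minus_1)
```

In this toy case the n – 1 version averages out to exactly the population variance, while the n version averages to only half of it.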
Mean Deviation Definition
The mean deviation is defined as a statistical measure that is used to calculate the average deviation from the mean
value of the given data set. The mean deviation of the data values can be easily calculated using the below procedure. 
Step 1: Find the mean value for the given data values
Step 2: Now, subtract the mean value from each of the data values given (Note: Ignore the minus symbol)
Step 3: Now, find the mean of those values obtained in step 2.
Mean Deviation Formula
The formula to calculate the mean deviation for the given data set is given below.
Mean Deviation = [Σ |X – µ|]/N
Here, 
Σ represents the addition of values
X represents each value in the data set
µ represents the mean of the data set
N represents the number of data values
| | represents the absolute value, which ignores the “-” symbol
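For a frequency distribution in which each value xi occurs with frequency fi, N = Σ fi (summed over i = 1, …, n).
If the central tendency is the mean (x̄), then
Mean Deviation = [Σ fi |xi – x̄|]/N
In the case of the median (M),
Mean Deviation = [Σ fi |xi – M|]/N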
Mean Deviation Examples
Example 1:
Determine the mean deviation for the data values 5, 3,7, 8, 4, 9.
Solution:
Given data values are 5, 3, 7, 8, 4, 9.
We know the procedure to calculate the mean deviation.
First, find the mean for the given data:
Mean, µ = (5 + 3 + 7 + 8 + 4 + 9)/6
µ = 36/6
µ = 6
Therefore, the mean value is 6.
Now, subtract the mean from each data value, and ignore the minus symbol, if any:
|5 – 6| = 1
|3 – 6| = 3
|7 – 6| = 1
|8 – 6| = 2
|4 – 6| = 2
|9 – 6| = 3
Now, the obtained data set is 1, 3, 1, 2, 2, 3.
Finally, find the mean value for the obtained data set.
Therefore, the mean deviation is
= (1 + 3 + 1 + 2 + 2 + 3)/6
= 12/6
= 2
Hence, the mean deviation for 5, 3,7, 8, 4, 9 is 2.
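The same three-step procedure can be written as a short Python sketch. The variable names are illustrative; `abs()` plays the role of "ignore the minus symbol".

```python
data = [5, 3, 7, 8, 4, 9]

# Step 1: mean of the data values.
mean = sum(data) / len(data)

# Step 2: absolute deviation of each value from the mean.
abs_deviations = [abs(x - mean) for x in data]

# Step 3: mean of the absolute deviations.
mean_deviation = sum(abs_deviations) / len(data)

print(mean, abs_deviations, mean_deviation)
```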
Example 2:
In a foreign language class, there are 4 languages, and the
frequencies of students learning the language and the
frequency of lectures per week are given as:

Language                     Sanskrit  Spanish  French  English
No. of students (xi)         6         5        9       12
Frequency of lectures (fi)   5         7        4       9
Calculate the mean deviation about the mean for the given
data.
Solution: The following table gives us a tabular
representation of data and the calculations
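The calculation table for this example did not survive in the text, so the steps can be sketched in code instead. This is a hedged reconstruction, assuming the usual frequency-weighted form of the mean deviation about the mean, Σ fi |xi – x̄| / N with N = Σ fi; the variable names are made up for the illustration.

```python
x = [6, 5, 9, 12]  # xi: number of students per language
f = [5, 7, 4, 9]   # fi: frequency of lectures per week

# N is the total frequency.
N = sum(f)

# Frequency-weighted mean: sum(fi * xi) / N.
mean = sum(fi * xi for xi, fi in zip(x, f)) / N

# Frequency-weighted absolute deviations: fi * |xi - mean|.
weighted_abs_dev = [fi * abs(xi - mean) for xi, fi in zip(x, f)]

mean_deviation = sum(weighted_abs_dev) / N

print(f"N = {N}, mean = {mean:.2f}, mean deviation = {mean_deviation:.4f}")
```

Under these assumptions the mean is 209/25 = 8.36 and the mean deviation about the mean is 70.64/25, roughly 2.83.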
When data are symmetrically distributed, the left-hand side and the right-hand side of the distribution contain the same number of
observations. (If the data set has 90 values, then the left-hand side has 45 observations and the right-hand side has
45 observations.) But what if the data are not symmetrically distributed? Such data are called asymmetrical, and this is where
skewness comes into the picture.
Shape of data: Skewness and Kurtosis
“Skewness essentially measures the symmetry of the distribution, while kurtosis determines the heaviness of the
distribution tails.”
Understanding the shape of the data is a crucial step. It helps to identify where most of the information lies
and to analyze the outliers in a given data set. The following slides cover the types of skewness and kurtosis and how to
analyze the shape of a given data set.
 
Skewness 
In statistics, skewness is the degree of asymmetry observed in a probability distribution, that is, how far it deviates from the
symmetrical normal distribution (bell curve) in a given set of data.
The normal distribution serves as the reference for skewness. Normally distributed data are symmetrically
distributed. A symmetrical distribution has zero skewness, as all measures of central tendency lie in the
middle.
Types of skewness
1. Positively skewed or right-skewed
In statistics, a positively skewed distribution is a
distribution in which the tail spreads out on the
right side. Unlike symmetrically distributed data,
where all measures of central tendency (mean, median,
and mode) equal each other, in positively skewed data
the measures disperse. It is the skewness that is positive
rather than negative or zero, not the mean, median, or
mode themselves.
In a positively skewed distribution, the mean is greater than
the median. Most of the data are concentrated on the
lower (left-hand) side, and a few large values stretch
the tail to the right. These large values pull the mean
above the median, which is the middle value, while the
mode is the most frequent value, at the peak of the curve.
Typically, mode < median < mean.
Extreme positive skewness is not desirable in a
distribution, as a high level of skewness can cause
misleading results. Data transformations
help to bring skewed data closer to a normal
distribution. For positively skewed distributions, the
best-known transformation is the log transformation, which
replaces each value in the data set with its natural
logarithm (the values must be positive).
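The effect of a log transformation can be illustrated on a small made-up data set. In this sketch the data values are hypothetical, and skewness is measured with Pearson's second coefficient, 3 × (mean – median) / standard deviation, using the sample standard deviation; other skewness measures would give different numbers.

```python
import math
import statistics


def pearson_second_skewness(values):
    """3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


raw = [1, 2, 3, 4, 100]              # hypothetical right-skewed data
logged = [math.log(x) for x in raw]  # natural log of each value

skew_raw = pearson_second_skewness(raw)
skew_logged = pearson_second_skewness(logged)

print(f"before log: {skew_raw:.2f}")
print(f"after log:  {skew_logged:.2f}")
```

For these values the coefficient drops after the transformation but stays positive, so the log pulls the long right tail in without removing the skew entirely.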
2. Negatively skewed or left-skewed
A negatively skewed distribution is the direct reverse of
a positively skewed distribution. In statistics, a negatively
skewed distribution is one in which
the tail spreads out on the left side.
In a negatively skewed distribution, the mean is less than the
median. Most of the data are concentrated on the higher
(right-hand) side, and a few small values stretch the tail
to the left. It is the skewness that is negative rather
than positive or zero, not the mean, median, or mode themselves.
The median is the middle value and the mode is the most frequent value, at the peak;
because of the unbalanced distribution, the median is higher
than the mean. Typically, mean < median < mode.
Calculate the skewness coefficient of the sample
Pearson’s first coefficient of skewness
Subtract the mode from the mean, then divide the difference
by the standard deviation:
Sk1 = (Mean – Mode) / Standard deviation
Pearson’s first coefficient of skewness is useful if the
data have a strong mode.
But if the data have a weak mode or several modes,
Pearson’s first coefficient is not preferred, and Pearson’s
second coefficient may be superior, as it does not rely on
the mode.

Pearson’s second coefficient of skewness
Subtract the median from the mean, multiply the difference by 3, and divide the product by
the standard deviation:
Sk2 = 3 × (Mean – Median) / Standard deviation
If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.
If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are
slightly skewed.
If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data are
extremely skewed.
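Both Pearson coefficients and the rule of thumb above can be sketched as follows. The data set is hypothetical, the sample standard deviation is used as the divisor, and how the boundary values (exactly 0.5 or 1) are classified is an assumption of this sketch, since the thresholds do not say which side they belong to.

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 5, 6, 11]  # hypothetical data with a clear mode

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
std_dev = statistics.stdev(data)  # sample standard deviation

sk1 = (mean - mode) / std_dev        # Pearson's first coefficient
sk2 = 3 * (mean - median) / std_dev  # Pearson's second coefficient


def describe(skewness):
    """Apply the rule-of-thumb thresholds; boundary handling is assumed."""
    size = abs(skewness)
    if size <= 0.5:
        return "nearly symmetrical"
    if size <= 1:
        return "slightly skewed"
    return "extremely skewed"


print(f"Sk1 = {sk1:.3f} ({describe(sk1)})")
print(f"Sk2 = {sk2:.3f} ({describe(sk2)})")
```

Here mean = 4, median = 3 and mode = 2, so both coefficients are positive and fall in the "slightly skewed" band.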

Kurtosis
Kurtosis refers to the degree of presence of outliers in the distribution.
Kurtosis is a statistical measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
In finance, kurtosis is used as a measure of financial risk.
A large kurtosis is associated with a high level of risk for
an investment because it indicates that there are high
probabilities of extremely large and extremely small
returns. On the other hand, a small kurtosis signals a
moderate level of risk because the probabilities of
extreme returns are relatively low
Excess Kurtosis
Excess kurtosis compares the kurtosis of a distribution with that of the normal distribution, whose kurtosis is 3:
Excess kurtosis = Kurtosis – 3
Excess kurtosis can be positive (leptokurtic distribution), negative (platykurtic distribution), or near zero
(mesokurtic distribution). Types of excess kurtosis:
1. Leptokurtic or heavy-tailed distribution (kurtosis more than normal distribution).
2. Mesokurtic (kurtosis same as the normal distribution).
3. Platykurtic or short-tailed distribution (kurtosis less than normal distribution).

Leptokurtic (kurtosis > 3)
A leptokurtic distribution has long, heavy tails, which means there is a greater chance of outliers.
Positive values of excess kurtosis indicate that the distribution is peaked and possesses thick tails. An extremely positive
kurtosis indicates a distribution where more of the values are located in the tails of the distribution than would be the case
for a normal distribution.
Platykurtic (kurtosis < 3)
A platykurtic distribution has thin, short tails and is stretched around the center,
which means most of the data points lie in close
proximity to the mean. A platykurtic distribution is flatter
(less peaked) when compared with the normal
distribution.
Mesokurtic (kurtosis = 3)
A mesokurtic distribution has the same kurtosis as the normal distribution, which
means its excess kurtosis is near 0. Mesokurtic distributions
are moderate in breadth, and their curves have a medium peak
height.
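The three categories can be told apart numerically. The slides do not give a formula for kurtosis, so this sketch assumes the common moment-based definition (fourth central moment divided by the squared second central moment, both computed with divisor n) and subtracts 3 to get excess kurtosis. The `tolerance` used to decide what counts as "near zero" and both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (assumed definition, divisor n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


def classify(excess, tolerance=0.1):
    """Label a distribution by its excess kurtosis; tolerance is arbitrary."""
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]                         # evenly spread values
heavy = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]     # mostly central, two extremes

for name, values in [("flat", flat), ("heavy", heavy)]:
    ek = excess_kurtosis(values)
    print(f"{name}: excess kurtosis = {ek:.3f} ({classify(ek)})")
```

Statistical packages often apply small-sample corrections, so their kurtosis values may differ somewhat from this simple version.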
Summary
Skewness is a measure of the symmetry or asymmetry of a
data distribution, and kurtosis measures whether the data are
heavy-tailed or light-tailed relative to a normal distribution. Data
can be positively skewed (tail stretched towards the right
side) or negatively skewed (tail stretched towards the left
side).
When data are skewed, the tail region may act as an
outlier for the statistical model, and outliers
adversely affect a model’s performance,
especially for regression-based models. Some statistical
models, such as tree-based models, are robust to outliers, but
relying on them limits the possibility of trying other models. So there is
a need to transform skewed data so that they are close
enough to a normal distribution.
Excess kurtosis can be positive (leptokurtic distribution),
negative (platykurtic distribution), or near zero
(mesokurtic distribution): a leptokurtic distribution has kurtosis
greater than the normal distribution, a mesokurtic distribution
has the same kurtosis as the normal distribution, and a platykurtic
distribution has kurtosis less than the normal distribution.
The coefficient of variation (CV)
The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation in relation to
its mean.
It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics. It is
also known as the relative standard deviation (RSD).
How to Calculate the Coefficient of Variation
Calculating the coefficient of variation involves a simple ratio: take the standard deviation and divide it by the mean.
CV = Standard deviation / Mean
Higher values indicate that the standard deviation is relatively large compared to the mean.
For example, a pizza restaurant measures its delivery time in minutes. The mean delivery time is 20 minutes and the standard
deviation is 5 minutes.
Interpreting the Coefficient of Variation
For the pizza delivery example, the coefficient of variation is 0.25. This value tells you the relative size of the standard deviation
compared to the mean.  Analysts often report the coefficient of variation as a percentage. In this example, the standard deviation is
25% the size of the mean.
If the value equals one or 100%, the standard deviation equals the mean. Values less than one indicate that the standard deviation
is smaller than the mean (typical), while values greater than one occur when the S.D. is greater than the mean.

In general, higher values represent a greater degree of relative variability.
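The pizza delivery figures make a compact worked example. This sketch simply applies the ratio of standard deviation to mean; the function name is made up for the illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variability: standard deviation divided by the mean."""
    return std_dev / mean


# Pizza delivery example: mean 20 minutes, standard deviation 5 minutes.
cv = coefficient_of_variation(std_dev=5, mean=20)

print(f"CV = {cv} ({cv:.0%})")
```

Because the minutes cancel in the ratio, the result has no unit, which is what makes it comparable across variables measured on different scales.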


Absolute versus Relative Measures of Variability
The standard deviation, interquartile range, and range are absolute measures of
variability. They use the variable’s unit of measurement to describe the variability.
For the five minute standard deviation in the pizza delivery example, we know that the typical delivery occurs five minutes
before or after the mean delivery time.
The interquartile range is the third quartile (Q3) minus the first quartile (Q1). This gives us the range of the
middle half of a data set.
Interquartile range example: To find the interquartile range of your 8 data points, you first find the values at Q1 and Q3.
Multiply the number of values in the data set (8) by 0.25 for the 25th percentile (Q1) and by 0.75 for the 75th percentile (Q3).
Q1 position: 0.25 x 8 = 2
Q3 position: 0.75 x 8 = 6
Q1 is the value in the 2nd position, which is 110. Q3 is the value in the 6th position, which is 287.
IQR = Q3 – Q1
IQR = 287 – 110 = 177
The interquartile range of your data is 177 minutes.
Just like the range, the interquartile range uses only 2 values in its calculation. But the IQR is less affected by
outliers: the 2 values come from the middle half of the data set, so they are unlikely to be extreme scores.
The IQR gives a consistent measure of variability for skewed as well as normal distributions.
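The position-based method from the interquartile range example can be sketched as below. It assumes that 0.25 × n and 0.75 × n are whole numbers, as they are for these 8 values; other data sizes need a rule for fractional positions, and library routines that interpolate between values may return somewhat different quartiles.

```python
data = sorted([72, 110, 134, 190, 238, 287, 305, 324])
n = len(data)

# Positions are counted from 1, as in the worked example.
q1_position = int(0.25 * n)
q3_position = int(0.75 * n)

# Python lists are 0-indexed, so subtract 1 from each position.
q1 = data[q1_position - 1]
q3 = data[q3_position - 1]

iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr} minutes")
```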
Five-number summary
Every distribution can be organized using a five-number summary:
•Lowest value
•Q1: 25th percentile
•Q2: the median
•Q3: 75th percentile
•Highest value (Q4)
These five-number summaries can be easily visualized using box and whisker plots.
Box and whisker plot example: For each of our samples, the horizontal lines in a box show Q1, the median and Q3, while the
whiskers at the end show the highest and lowest values.
Quartile deviation is one of the measures of dispersion. Before getting into a deeper understanding, let’s recall
quartiles and how we can define them. Quartiles are the three values, Q1, Q2 and Q3, that divide an ordered list of numerical
data into four quarters. The middle quartile, Q2, measures the central point of the distribution
(this is referred to as the median). The lower quartile, Q1,
is the midpoint of the half of the data set that falls below the median, and the upper quartile, Q3, is the midpoint of the
remaining half, which falls above the median. Thus, the quartiles represent the distribution or dispersion of the given data set.

Quartile Deviation in Statistics can be defined as the statistic that measures the dispersion. Here, the Dispersion is the
state of getting dispersed or spread. Statistical dispersion means the extent to which numerical data is likely to vary
about an average value. In other words, dispersion helps to understand the distribution of the data.

Quartile Deviation Definition


The Quartile Deviation can be defined mathematically as half of the difference between the upper and lower quartile.
Here, quartile deviation can be represented as QD; Q3 denotes the upper quartile and Q1 indicates the lower quartile.
Quartile Deviation is also known as the Semi Interquartile range.
Quartile Deviation Formula
Suppose Q1 is the lower quartile, Q2 is the median, and Q3 is the upper quartile for the given data set, then its quartile
deviation can be calculated using the following formula.
QD = (Q3 – Q1)/2
In the next section, you will learn how to calculate these quartiles for both ungrouped and grouped data separately.
Quartile Deviation for Ungrouped Data
For an ungrouped data, quartiles can be obtained using
the following formulas,
Q1 = [(n+1)/4]th item
Q2 = [(n+1)/2]th item
Q3 = [3(n+1)/4]th item
Where n represents the total number of observations in
the given data set.
Also, Q2 is the median of the given data set, Q1 is the
median of the lower half of the data set and Q3 is the
median of the upper half of the data set.
Before estimating the quartiles, we have to arrange the
given data values in ascending order. If the value of n is
even, we can follow the same procedure as for finding the
median: average the two middle values.
Quartile Deviation for Grouped Data
For grouped data, we can find the quartiles using the formula,
Qr = l1 + [(r(N/4) – c)/f] × (l2 – l1)
Here,
Qr = the rth quartile
l1 = the lower limit of the quartile class
l2 = the upper limit of the quartile class
f = the frequency of the quartile class
c = the cumulative frequency of the class preceding the
quartile class
N = Number of observations in the given data set
Quartile Deviation Example
Let’s understand the quartile deviation of ungrouped and grouped data with the help of examples given below.
Example 1:
Find the quartiles and quartile deviation of the following data:
17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28
Solution:
Given data:
17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28
Ascending order of the given data is:
2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48
Number of data values = n = 16
Q2 = Median of the given data set
Since n is even, median = (1/2)[(n/2)th observation + (n/2 + 1)th observation]
= (1/2)[8th observation + 9th observation]
= (10 + 14)/2
= 24/2
= 12
Q2 = 12
Now, lower half of the data is:
2, 5, 7, 7, 8, 8, 10, 10 (even number of observations)
Q1 = Median of lower half of the data
= (1/2)[4th observation + 5th observation]
= (7 + 8)/2
= 15/2
= 7.5
Also, the upper half of the data is:
14, 15, 17, 18, 24, 27, 28, 48 (even number of
observations)
Q3= Median of upper half of the data
= (1/2)[4th observation + 5th observation]
= (18 + 24)/2
= 42/2
= 21
Quartile deviation = (Q3 – Q1)/2
= (21 – 7.5)/2
= 13.5/2
= 6.75
Therefore, the quartile deviation for the given data set is
6.75.
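The median-of-halves approach used in this example can be sketched with the standard `statistics` module. For an odd number of observations this sketch leaves the middle value out of both halves, which is one common convention and an assumption here, since the example only covers an even n.

```python
import statistics

data = [17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28]

ordered = sorted(data)
n = len(ordered)
half = n // 2

lower_half = ordered[:half]      # values below the median position
upper_half = ordered[n - half:]  # values above the median position

q1 = statistics.median(lower_half)
q2 = statistics.median(ordered)
q3 = statistics.median(upper_half)

quartile_deviation = (q3 - q1) / 2
print(q1, q2, q3, quartile_deviation)
```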
Example 2:
Calculate the quartile deviation for the following distribution.

Class       0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
Frequency   5     3      4      3      3      4      7      9      7      8
Solution:
Let us calculate the cumulative frequency for the given distribution of data.

Class      Frequency   Cumulative Frequency
0 – 10     5           5
10 – 20    3           5 + 3 = 8
20 – 30    4           8 + 4 = 12
30 – 40    3           12 + 3 = 15
40 – 50    3           15 + 3 = 18
50 – 60    4           18 + 4 = 22
60 – 70    7           22 + 7 = 29
70 – 80    9           29 + 9 = 38
80 – 90    7           38 + 7 = 45
90 – 100   8           45 + 8 = 53
Here, N = 53
We know that,
Qr = l1 + [(r(N/4) – c)/f] × (l2 – l1)
Finding Q1:
r=1
N/4 = 53/4 = 13.25
Thus, Q1 lies in the interval 30 – 40.
In this case, quartile class = 30 – 40
l1 = the lower limit of the quartile class = 30
l2 = the upper limit of the quartile class = 40
f = the frequency of the quartile class = 3
c = the cumulative frequency of the class preceding the quartile class = 12
Now, by substituting these values in the formula we get:
Q1 = 30 + [(13.25 – 12)/3] × (40 – 30)
= 30 + (1.25/3) × 10
= 30 + (12.5/3)
= 30 + 4.167
= 34.167
Finding Q3:
r=3
3N/4 = 3 × 13.25 = 39.75
Thus, Q3 lies in the interval 80 – 90.
In this case, quartile class = 80 – 90
l1 = the lower limit of the quartile class = 80
l2 = the upper limit of the quartile class = 90
f = the frequency of the quartile class = 7
c = the cumulative frequency of the class preceding the
quartile class = 38
Now, by substituting these values in the formula we get:
Q3 = 80 + [(39.75 – 38)/7] × (90 – 80)
= 80 + (1.75/7) × 10
= 80 + (17.5/7)
= 80 + 2.5
= 82.5
Finally, the quartile deviation = (Q3 – Q1)/2
QD = (82.5 – 34.167)/2
= 48.333/2
= 24.1665
Hence, the quartile deviation of the given distribution is
24.167 (approximately).
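The grouped-data calculation can be sketched as a small function. It locates the quartile class as the first class whose cumulative frequency reaches r × N/4 and then interpolates within it, mirroring the substitutions in the worked example; the function and variable names are hypothetical.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
frequencies = [5, 3, 4, 3, 3, 4, 7, 9, 7, 8]


def grouped_quartile(r, classes, frequencies):
    """Return the r-th quartile (r = 1, 2 or 3) of grouped data."""
    total = sum(frequencies)  # N
    target = r * total / 4    # r(N/4)
    cumulative = 0            # c: cumulative frequency before the class
    for (lower, upper), f in zip(classes, frequencies):
        if cumulative + f >= target:
            # l1 + [(r(N/4) - c) / f] * (l2 - l1)
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("r must be 1, 2 or 3")


q1 = grouped_quartile(1, classes, frequencies)
q3 = grouped_quartile(3, classes, frequencies)
quartile_deviation = (q3 - q1) / 2

print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, QD = {quartile_deviation:.3f}")
```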
Interquartile range
The interquartile range gives you the spread of the
middle of your distribution.
For any distribution that’s ordered from low to high, the
interquartile range contains half of the values. While the
first quartile (Q1) contains the first 25% of values, the
fourth quartile (Q4) contains the last 25% of values.
THANK YOU
