Descriptive Stats Part2

Uploaded by

khadija kanwal
Chapter Three: Numerical Measures of the Data

3-2 Measures of Dispersion (Variation)

o Measures of dispersion describe the spread or variability in the data.
❑ Learning objectives
◦ The range of a variable
◦ The variance of a variable
◦ The standard deviation of a variable
◦ Use the Empirical Rule
❑ Comparing two sets of data
❑ The measures of central tendency (mean, median, mode) capture the differences between the “average” or “typical” values of two sets of data.
❑ The measures of dispersion in this section capture the differences in how far “spread out” the data values are.

3-2
Statistics103110

Variability provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
o It tells how meaningful the measures of central tendency are.
o It helps to see which scores are outliers (extreme scores).
Why do we Study Dispersion?
A direct comparison of two sets of data based only on two
measures of central tendency such as the mean and the
median can be misleading since an average does not tell us
anything about the spread of the data.
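Comparison of two outdoor paints: 6 gallons of each brand have been tested, and the data obtained show how long (in months) each brand will last before fading.
Brand A: 10 60 50 30 40 20
Brand B: 35 45 30 35 40 25
Calculate the mean for each brand:
Brand A: (10 + 60 + 50 + 30 + 40 + 20)/6 = 210/6 = 35 months
Brand B: (35 + 45 + 30 + 35 + 40 + 25)/6 = 210/6 = 35 months
The means are equal, yet the values for Brand A are far more spread out than those for Brand B.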
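
The paint comparison can be checked with a short Python sketch. It uses only the standard library; the helper name `summarize` and the variable names are illustrative and not part of the slides.

```python
import statistics

# Months until fading for six gallons of each brand (data from the example)
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def summarize(values):
    """Return (mean, range, sample standard deviation) for a list of numbers."""
    mean = statistics.mean(values)
    value_range = max(values) - min(values)
    sd = statistics.stdev(values)  # sample SD, divides by n - 1
    return mean, value_range, sd

mean_a, range_a, sd_a = summarize(brand_a)
mean_b, range_b, sd_b = summarize(brand_b)

print(f"Brand A: mean={mean_a}, range={range_a}, sd={sd_a:.2f}")
print(f"Brand B: mean={mean_b}, range={range_b}, sd={sd_b:.2f}")
```

Both brands share the same mean, so only the range and standard deviation reveal that Brand A is the less consistent paint.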


Measures of dispersion are:
1. The range,
2. The interquartile range,
3. The variance and standard deviation,
4. The coefficient of variation.
The range (R) of a variable is the difference between the largest data value and the smallest data value:
R = highest value – lowest value.
Properties of the range
1. Only two values are used in the calculation.
2. It is influenced by extreme values.
3. It is easy to compute and understand.


Example
▪ Compute the range of 6, 1, 2, 6, 11, 7, 3, 3
▪ The largest value is 11
▪ The smallest value is 1
▪ Subtracting the two: 11 – 1 = 10, so the range is 10

The relative measure of the range is called the coefficient of range:

$$\text{Coeff. of Range} = \frac{H - L}{H + L}$$

where H is the highest value and L is the lowest value.
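
As a minimal sketch, the range and the coefficient of range for the example data can be computed as follows. The variable names are illustrative assumptions.

```python
# Data from the range example
data = [6, 1, 2, 6, 11, 7, 3, 3]

highest = max(data)
lowest = min(data)

data_range = highest - lowest                              # R = H - L
coeff_of_range = (highest - lowest) / (highest + lowest)   # (H - L) / (H + L)

print(data_range)                # 10
print(round(coeff_of_range, 4))  # 0.8333
```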

The variance of a variable

The variance is based on the deviations from the mean:
$(x_i - \mu)$ for populations
$(x_i - \bar{x})$ for samples
So that positive and negative differences are treated alike and do not cancel out, we square the deviations:
$(x_i - \mu)^2$ for populations
$(x_i - \bar{x})^2$ for samples


The population variance of a variable is the sum of the squared deviations of the data values from the mean, divided by the number of values in the population:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

where
X = individual value
μ = population mean
N = population size

The population variance is represented by σ².

Standard deviation: the square root of the variance.

$$\sigma = \sqrt{\sigma^2}$$

Example #1
Calculate the population variance from the following 5
observations: 50, 55, 45, 60, 40.

Solution:
Use the following data for the calculation of population variance.

There are a total of 5 observations. Hence, N=5.


µ=(50+55+45+60+40)/5 =250/5 =50
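
The remaining steps of this calculation follow directly from the population variance formula. The sketch below works them through in Python; the variable names are illustrative.

```python
import math

# The five observations from Example #1, treated as a whole population
observations = [50, 55, 45, 60, 40]

n = len(observations)
mu = sum(observations) / n                           # population mean
squared_deviations = [(x - mu) ** 2 for x in observations]
population_variance = sum(squared_deviations) / n    # divide by N, not N - 1
population_sd = math.sqrt(population_variance)

print(mu)                        # 50.0
print(squared_deviations)        # [0.0, 25.0, 25.0, 100.0, 100.0]
print(population_variance)       # 50.0
print(round(population_sd, 4))   # 7.0711
```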

Standard Deviation


Properties of the variance and standard deviation

1. The standard deviation is the typical, or approximate average, distance from the mean.
2. If it is small, the scores are clustered close to the mean; if it is large, they are scattered far from the mean.
3. It describes how variable or spread out the scores are.
4. It is strongly influenced by extreme scores.
5. The measurement units of the variance are the square of the original units, while the units of the standard deviation are the same as those of the original data.
6. All values are used in the calculation.
7. The variance and standard deviation are always greater than or equal to zero. They are equal to zero only if all observations are the same.

The sample variance of a variable is the sum of the squared deviations of the data values from the mean, divided by one less than the number in the sample.
The sample variance is represented by s²:
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Sample standard deviation (s):
$$s = \sqrt{s^2}$$
We say that this statistic has n – 1 degrees of freedom.
Example: Find the variance and standard deviation for the following sample: 16, 19, 15, 15, 14.
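
One way to work this example is sketched below: the variance is computed by hand from the definition and then cross-checked against Python's standard `statistics` module. The variable names are illustrative.

```python
import math
import statistics

sample = [16, 19, 15, 15, 14]

n = len(sample)
x_bar = sum(sample) / n                                   # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)        # sum of squared deviations
sample_variance = sum_sq_dev / (n - 1)                    # n - 1 degrees of freedom
sample_sd = math.sqrt(sample_variance)

print(round(x_bar, 2), round(sample_variance, 2), round(sample_sd, 4))

# Cross-check with the standard library, which also divides by n - 1
assert math.isclose(sample_variance, statistics.variance(sample))
assert math.isclose(sample_sd, statistics.stdev(sample))
```

With these five values the mean is 15.8, the variance is 3.7, and the standard deviation is about 1.92.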


Symbols for Standard Deviation

                                Sample      Population
Textbook                        s           σ
Some graphics calculators       Sx          σx
Some non-graphics calculators   xσn-1       xσn

Articles in professional journals and reports often use SD for standard deviation and VAR for variance.

Sample Variance for Grouped and Ungrouped Data
For grouped data, use the class midpoints for the observed value in the different classes.
For ungrouped data, use the same formula with the class midpoints, Xm, replaced with the actual observed X value.
Example:
Find the variance and SD for the following data set:
2, 3, 4, 5, 2, 2, 2, 3, 2, 4, 3, 2, 5, 2, 3, 3, 4, 2, 5, 4, 4, 3, 3, 2, 5, 2
Sample variance of ungrouped data

Step one: put the data in an ungrouped frequency table.

Value (x)   Frequency (f)   x²    f·x   f·x²
2           10              4     20    40
3           7               9     21    63
4           5               16    20    80
5           4               25    20    100
Total       26                    81    283

$$s^2 = \frac{n\sum f x^2 - \left(\sum f x\right)^2}{n(n-1)} = \frac{26(283) - 81^2}{26(26-1)} = \frac{797}{650} = 1.2262$$

$$s = \sqrt{1.2262} = 1.1073$$
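
The frequency-table calculation above can be reproduced with a short sketch. It builds the table with `collections.Counter` and applies the same shortcut formula; the variable names are illustrative.

```python
import math
from collections import Counter

data = [2, 3, 4, 5, 2, 2, 2, 3, 2, 4, 3, 2, 5, 2, 3, 3, 4, 2, 5, 4, 4, 3, 3, 2, 5, 2]

# Ungrouped frequency table: value -> frequency
freq = Counter(data)

n = sum(freq.values())                              # total frequency
sum_fx = sum(f * x for x, f in freq.items())        # sum of f*x
sum_fx2 = sum(f * x ** 2 for x, f in freq.items())  # sum of f*x^2

# Shortcut formula: s^2 = [n*sum(f x^2) - (sum(f x))^2] / [n(n - 1)]
s2 = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = math.sqrt(s2)

print(sorted(freq.items()))        # [(2, 10), (3, 7), (4, 5), (5, 4)]
print(n, sum_fx, sum_fx2)          # 26 81 283
print(round(s2, 4), round(s, 4))   # 1.2262 1.1073
```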
Sample variance of grouped data

Example: Find the variance and SD for the frequency distribution of the data representing the number of miles that 20 runners ran during one week.

Class    Freq. (f)   Midpoint (xm)   f·xm   xm²    f·xm²
5-11     1           8               8      64     64
11-17    2           14              28     196    392
17-23    3           20              60     400    1200
23-29    5           26              130    676    3380
29-35    4           32              128    1024   4096
35-41    3           38              114    1444   4332
41-47    2           44              88     1936   3872
Total    20                          556           17336

$$s^2 = \frac{n\sum f x_m^2 - \left(\sum f x_m\right)^2}{n(n-1)} = \frac{20(17336) - 556^2}{20(20-1)} = \frac{37584}{380} = 98.905$$

$$s = \sqrt{98.905} = 9.95$$
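
For grouped data the same formula is applied to the class midpoints. The following sketch recomputes the runners example; representing each class as a `((lower, upper), frequency)` pair is an assumption made for illustration.

```python
import math

# ((lower limit, upper limit), frequency) for each class in the runners example
classes = [((5, 11), 1), ((11, 17), 2), ((17, 23), 3), ((23, 29), 5),
           ((29, 35), 4), ((35, 41), 3), ((41, 47), 2)]

# Each class is represented by its midpoint xm
table = [((lo + hi) / 2, f) for (lo, hi), f in classes]

n = sum(f for _, f in table)                      # total frequency
sum_fxm = sum(f * xm for xm, f in table)          # sum of f*xm
sum_fxm2 = sum(f * xm ** 2 for xm, f in table)    # sum of f*xm^2

s2 = (n * sum_fxm2 - sum_fxm ** 2) / (n * (n - 1))
s = math.sqrt(s2)

print(n, sum_fxm, sum_fxm2)        # 20 556.0 17336.0
print(round(s2, 3), round(s, 2))   # 98.905 9.95
```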


Interpretation and Uses of the Standard Deviation
The standard deviation is used to measure the spread of the data. A small standard deviation indicates that the data are clustered close to the mean, so the mean is representative of the data. A large standard deviation indicates that the data are spread out from the mean, so the mean is less representative of the data.

Coefficient of Variation (C.V.)

The relative measure of the standard deviation is the coefficient of variation, which is defined to be the standard deviation divided by the mean. The result is expressed as a percentage.

$$\text{C.V.} = \frac{\sigma}{\mu} \cdot 100\% \quad \text{or} \quad \text{C.V.} = \frac{s}{\bar{x}} \cdot 100\%$$

Important note:
The coefficient of variation should only be computed for data measured on a ratio scale.
See the following example.

Example:

To see why the coefficient of variation should not be applied to interval-level data, compare the same set of temperatures in Celsius and Fahrenheit:
Celsius: [0, 10, 20, 30, 40]
Fahrenheit: [32, 50, 68, 86, 104]

The CV of the first set is 15.81/20 = 0.79. For the second set (which are the same temperatures) it is 28.46/68 = 0.42.
So the coefficient of variation does not have any meaning for data on an interval scale.
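
The temperature comparison can be reproduced with a small sketch. The helper `coefficient_of_variation` is a hypothetical name; it returns the ratio s / mean rather than a percentage, to match the figures quoted in the example.

```python
import statistics

celsius = [0, 10, 20, 30, 40]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]   # the same temperatures in °F

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (as a ratio, not a percent)."""
    return statistics.stdev(values) / statistics.mean(values)

cv_c = coefficient_of_variation(celsius)
cv_f = coefficient_of_variation(fahrenheit)

print(round(cv_c, 2))   # 0.79
print(round(cv_f, 2))   # 0.42
```

The two results differ even though the temperatures are identical, because the zero point of each scale is arbitrary.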

Advantages
The coefficient of variation is useful because the standard deviation of data must always be understood in the context of the mean of the data.
The coefficient of variation is a unitless (dimensionless) number. So when comparing data sets with different units or widely different means, one should use the coefficient of variation for comparison instead of the standard deviation.
Disadvantages
When the mean value is near zero, the coefficient of variation is sensitive to small changes in the mean, limiting its usefulness.

Example: Data about the annual salary (000's) and age of CEOs in a number of firms have been collected. The means and standard deviations are as follows:

          Mean     SD
Salary    404.2    220.5
Age       51.47    8.92

• Which distribution has more dispersion? Is direct comparison appropriate?
Salary and age are measured in different units, and the means show that there is also a significant difference in magnitude. Direct comparison is not appropriate.

          Mean     SD       C.V.
Salary    404.2    220.5    54.55%
Age       51.47    8.92     17.33%

Comparing the C.V.s, we can now see clearly that the dispersion or variability relative to the mean is greater for CEO annual salary than for age.


Measures of position:
Measures of position are used to locate the relative position of a data value in the data set.

1- Standard Scores
To compare values measured in different units, a z-score is obtained for each value and the z-scores are then compared.
A z-score, or standard score, for each value is obtained by

For a sample:
$$z = \frac{x - \bar{x}}{s}$$
or
For a population:
$$z = \frac{x - \mu}{\sigma}$$

The z-score represents the number of standard deviations that a data value falls above or below the mean.

Standard scores (or z-scores) specify the exact location of a score within a distribution relative to the mean.
• The sign (+ or −) tells whether the score is above or below the mean.
• The numerical value tells the distance from the mean in terms of standard deviations.

E.g., a z-score of −1.3 tells us that the raw score fell 1.3 standard deviations below the mean.

A raw score is the original, untransformed score.
To make them more meaningful, raw scores can be converted to z-scores.

Characteristics of Standard Scores

1. The shape of the distribution of standard scores is the same as the shape of the distribution of raw scores (the only thing that changes is the units on the x-axis).
2. The mean of a set of standard scores = 0.
3. The standard deviation of a set of standard scores = 1.
4. A standard score greater than +3 or less than −3 is an extreme score, or an outlier.

Example: A student scored 65 on a statistics exam that had a mean of 50 and a standard deviation of 10. Compute the z-score.
z = (65 – 50)/10 = 1.5.
That is, the score of 65 is 1.5 standard deviations above the mean.
Above, since the z-score is positive.
Assume that this student scored 70 on a math exam that had a mean of 80 and a standard deviation of 5. Compute the z-score.
z = (70 – 80)/5 = –2.
That is, the score of 70 is 2 standard deviations below the mean.
Below, since the z-score is negative.

Example: A student scored 65 on a calculus test that had a mean of 50 and an SD of 10. She scored 30 on a statistics test with a mean of 25 and a variance of 25. Compare her relative positions on the two tests.

A variance of 25 gives a standard deviation of √25 = 5 for the statistics test.

$$z_{\text{Cal}} = \frac{x - \bar{x}}{s} = \frac{65 - 50}{10} = 1.5$$

$$z_{\text{Stat}} = \frac{30 - 25}{5} = 1.0$$

Since the z-score for calculus is larger, her relative position in the calculus class is higher than her relative position in the statistics class.
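
The z-score examples can be verified with a few lines of Python. The function name `z_score` and the variable names are illustrative assumptions.

```python
import math

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Statistics exam: score 65, mean 50, SD 10
z_exam = z_score(65, 50, 10)
# Math exam: score 70, mean 80, SD 5
z_math = z_score(70, 80, 5)

# Calculus vs. statistics test: the statistics test is described by its variance,
# so the standard deviation is the square root of 25
z_calculus = z_score(65, 50, 10)
z_statistics = z_score(30, 25, math.sqrt(25))

print(z_exam, z_math)             # 1.5 -2.0
print(z_calculus, z_statistics)   # 1.5 1.0
print("higher relative position:", "calculus" if z_calculus > z_statistics else "statistics")
```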

2. Quartiles
Quartiles divide the data set into 4 groups.
Quartiles are denoted by Q1, Q2, and Q3.
The median is the same as Q2.

Finding the Quartiles
Procedure: Let Qk be the kth quartile and n the sample size.
Step 1: Arrange the data in order.
Step 2: Compute c = ((n+1)k)/4.
Step 3: If c is not a whole number, round it down to the nearest whole number and use the value halfway between x_c and x_(c+1).
Step 4: If c is a whole number, then x_c, the value in position c, is the required quartile.

Example:
For the following data set: 2, 3, 5, 6, 8, 10, 12
Find Q1 and Q3.
n = 7, so for Q1 we have c = ((7+1) × 1)/4 = 2.
Hence the value of Q1 is the 2nd value.
Thus Q1 for the data set is 3.
For Q3 we have c = ((7+1) × 3)/4 = 6.
Hence the value of Q3 is the 6th value.
Thus Q3 for the data set is 10.

Example: Find Q1 and Q3 for the following data set:
2, 3, 5, 6, 8, 10, 12, 15, 18.
Note: the data set is already ordered.
n = 9, so for Q1 we have c = ((9+1) × 1)/4 = 2.5.
Hence the value of Q1 is halfway between the 2nd value and the 3rd value:
$$Q_1 = \frac{3 + 5}{2} = 4$$
For Q3 we have c = ((9+1) × 3)/4 = 7.5.
Hence the value of Q3 is halfway between the 7th value and the 8th value:
$$Q_3 = \frac{12 + 15}{2} = 13.5$$
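
The quartile procedure based on c = (n+1)k/4 can be sketched in Python as below. The function name `quartile` is hypothetical, and the sketch assumes that a non-whole c is rounded down before taking the halfway value, which is the reading that reproduces both worked examples. Note that other textbooks and software packages use different quartile conventions and may give slightly different values.

```python
import math

def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) using position c = (n + 1) * k / 4."""
    ordered = sorted(data)                 # Step 1: arrange the data in order
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 data values")
    c = (n + 1) * k / 4                    # Step 2: compute the position
    if c == int(c):
        return ordered[int(c) - 1]         # whole number: take the c-th value (1-based)
    lower = math.floor(c)                  # otherwise round down ...
    return (ordered[lower - 1] + ordered[lower]) / 2   # ... and go halfway to the next value

first = [2, 3, 5, 6, 8, 10, 12]
second = [2, 3, 5, 6, 8, 10, 12, 15, 18]

print(quartile(first, 1), quartile(first, 3))     # 3 10
print(quartile(second, 1), quartile(second, 3))   # 4.0 13.5
```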

Example:
For the following data set: 2, 3, 5, 6, 8, 10, 12
Find Q1 and Q3.
The median for the above data is 6.
The median of the lower group of data (the values less than the median) is 3.
So the value of Q1 is the 2nd value, which means that Q1 = 3.
The median of the upper group of data (the values greater than the median) is 10.
So the value of Q3 is the 6th value, which means that Q3 = 10.

The Interquartile Range (IQR)

The interquartile range is IQR = Q3 – Q1.
The interquartile range (IQR), also called the midspread, middle fifty, or inner 50% data range, is a measure of statistical dispersion (variation), equal to the difference between the third and first quartiles.

Outliers
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
To determine whether a data value can be considered as an outlier:
Step 1: Compute Q1 and Q3.
Step 2: Find the IQR = Q3 – Q1.
Step 3: Compute (1.5)(IQR).
Step 4: Compute Q1 – (1.5)(IQR) and Q3 + (1.5)(IQR). These are called the lower fence and the upper fence.
Step 5: Compare the data value (say X) with the lower and upper fences. If X < lower fence or X > upper fence, then X is considered an outlier.

Example
Given the data set 5, 6, 12, 13, 15, 18, 22, 50, can the value of 50 be considered as an outlier?
Q1 = 9, Q3 = 20, IQR = 11. Verify.
(1.5)(IQR) = (1.5)(11) = 16.5.
9 – 16.5 = –7.5 and 20 + 16.5 = 36.5.
The value of 50 is outside the range (–7.5 to 36.5), hence 50 is an outlier.
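
The fence calculation for this example is sketched below. The quartiles are taken as given in the example (Q1 = 9, Q3 = 20) rather than recomputed, and the variable names are illustrative.

```python
data = [5, 6, 12, 13, 15, 18, 22, 50]

# Quartiles as stated in the example
q1, q3 = 9, 20

iqr = q3 - q1                     # Step 2
step = 1.5 * iqr                  # Step 3
lower_fence = q1 - step           # Step 4
upper_fence = q3 + step

# Step 5: any value outside the fences is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(iqr, step)                  # 11 16.5
print(lower_fence, upper_fence)   # -7.5 36.5
print(outliers)                   # [50]
```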

A measure of dispersion tells us about the variation of the data set.
Skewness tells us about the direction of variation of the data set.
Definition:
Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
Coefficient of Skewness
A unitless number that measures the degree and direction of asymmetry of a distribution.
There are several ways of measuring skewness:
Pearson's coefficient of skewness
$$sk_2 = \frac{3(\text{mean} - \text{median})}{s}$$
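
A minimal sketch of Pearson's coefficient, 3(mean − median)/s, is given below. The function name `pearson_sk2` and both data sets are assumptions chosen for illustration: the first is a made-up symmetric set, and the second reuses the values from the outlier example, which has one large value in its right tail.

```python
import statistics

def pearson_sk2(values):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)   # sample standard deviation
    return 3 * (mean - median) / s

symmetric = [1, 2, 3, 4, 5]                     # hypothetical symmetric data
right_tailed = [5, 6, 12, 13, 15, 18, 22, 50]   # illustrative data with a long right tail

sk_symmetric = pearson_sk2(symmetric)
sk_right = pearson_sk2(right_tailed)

print(sk_symmetric)   # 0.0, because the mean and median coincide
print(sk_right > 0)   # True, because the mean exceeds the median
```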
Skewness in statistics represents an imbalance and asymmetry from the mean of a data distribution. If you look at a normal data distribution using a bell curve, the curve will be perfectly symmetrical. Now, this doesn't happen all that often! In order to fully understand when a data distribution is imperfect and skewed, let's look at a normal data distribution and symmetrical bell curve.

In a normal data distribution, the mean is directly in the middle (and top point) of the bell curve. Imagine that Mrs. Thomas wanted to teach her high school statistics class on the first day about data distributions, standard deviations, and bell curves. She asks her 16-student class to secretly divulge their summer job incomes. Each student provides Mrs. Thomas with a piece of paper with their income. She rounds each income level to the nearest 500 and makes a chart.

[Chart: summer job incomes of the 16 students, rounded to the nearest 500]

Now that we see the data on a chart, we can see that four of the students made about $2,000 in total over the summer. If we find the mean, we see that it is $2,000. The mode and median in this data distribution also happen to be $2,000. In a normal data distribution and perfectly symmetrical bell curve, the median and mean are always the same value. Take a look at the graph of the data, which represents a normal bell curve (no skewness at all!).

[Graph: bell curve of the income data]


The Empirical (Normal) Rule

For any bell-shaped distribution:
Approximately 68% of the data values will fall within one standard deviation of the mean.
Approximately 95% will fall within two standard deviations of the mean.
Approximately 99.7% will fall within three standard deviations of the mean.
μ ± 1σ = 68%     μ ± 2σ = 95%     μ ± 3σ = 99.7%

[Figure: bell curve illustrating the Empirical (Normal) Rule, with the horizontal axis marked at μ − 3σ, μ − 2σ, μ − 1σ, μ, μ + 1σ, μ + 2σ, μ + 3σ and the areas μ ± 1σ = 68%, μ ± 2σ = 95%, μ ± 3σ = 99.7%]
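
To make the rule concrete, the sketch below lists the three intervals for a mean of 50 and a standard deviation of 10. These values are borrowed from the earlier exam example purely for illustration, and the sketch assumes the scores are bell shaped, which the slides do not state. The function name is hypothetical.

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (lower bound, upper bound, approximate % of data)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}

# Assumed bell-shaped distribution with mean 50 and standard deviation 10
intervals = empirical_rule_intervals(mean=50, sd=10)

for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% of values lie between {low} and {high} (mean ± {k} SD)")
```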

What is a Box Plot

The box plot is a graphical representation of data.
To construct a box plot, first obtain the 5-number summary
{ Min, Q1, M, Q3, Max }
When the data set contains a small number of values, a box plot is used to graphically represent the data set. These plots involve five values: the minimum value (the smallest value which is not an outlier), the first quartile, the median, the third quartile, and the maximum value (the largest value which is not an outlier).

The box plot is useful in analyzing small data sets that do not lend themselves easily to histograms. Because of the small size of a box plot, it is easy to display and compare several box plots in a small space.
A box plot is a good alternative or complement to a histogram and is usually better for showing several simultaneous comparisons.

How to use it:

Collect and arrange data. Collect the data and arrange it into an ordered set from lowest value to highest.
Calculate the median. M = median = Q2
Calculate the first quartile (Q1).
Calculate the third quartile (Q3).
Calculate the interquartile range (IQR). This range is the difference between the third and first quartile values (Q3 − Q1).
Obtain the maximum. This is the largest data value that is less than or equal to the third quartile plus 1.5 × IQR:
Q3 + [(Q3 − Q1) × 1.5]
Obtain the minimum. This is the smallest data value that is greater than or equal to the first quartile minus 1.5 × IQR:
Q1 − [(Q3 − Q1) × 1.5]
Draw and label the axes of the graph. The scale of the horizontal axis must be large enough to encompass the greatest value of the data sets.
Draw the box plots. Construct the box, insert median points, and attach maximum and minimum. Identify outliers (values outside the upper and lower fences) with asterisks.
The box plot can provide answers to the following questions:
1. Does the location differ between subgroups?
2. Does the variation differ between subgroups?
3. Are there any outliers?

Example 1: Failure times of industrial machines (in hours)

32.56 42.02 47.26 50.25 59.03 60.17 61.56 62.16 62.84 63.29
63.52 65.52 66.54 68.71 70.60 71.27 76.33 80.37 82.87

5-number summary: { 32.56, 59.03, 63.29, 70.60, 82.87 }

The final product: a simple box plot. Only quartile information is displayed.
[Figure: box plot of the failure times]
A mathematical rule designates “outliers.” These are plotted using special symbols.
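
The 5-number summary of the failure times can be reproduced with the sketch below. It assumes the quartile convention c = (n+1)k/4 described earlier in the chapter and then applies the 1.5 × IQR fences; the helper name `quartile` and the variable names are illustrative.

```python
import math

failure_times = [32.56, 42.02, 62.84, 63.29, 47.26, 50.25, 59.03, 60.17, 61.56, 62.16,
                 63.52, 65.52, 66.54, 68.71, 70.60, 71.27, 76.33, 80.37, 82.87]

def quartile(ordered, k):
    """k-th quartile of already sorted data, using position c = (n + 1) * k / 4."""
    c = (len(ordered) + 1) * k / 4
    if c == int(c):
        return ordered[int(c) - 1]                      # c-th value, 1-based
    lower = math.floor(c)
    return (ordered[lower - 1] + ordered[lower]) / 2    # halfway between neighbours

ordered = sorted(failure_times)
q1, median, q3 = (quartile(ordered, k) for k in (1, 2, 3))
summary = (ordered[0], q1, median, q3, ordered[-1])

# Fences from the 1.5 x IQR rule
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(summary)    # (32.56, 59.03, 63.29, 70.6, 82.87)
print(outliers)   # [32.56]
```

Under these assumptions the smallest failure time, 32.56, falls below the lower fence, so a box plot that marks outliers would plot it with a special symbol.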

Now find the interquartile range (IQR). The interquartile range is the difference between the upper quartile and the lower quartile. In this case the IQR = 87 − 52 = 35. The IQR is a very useful measurement because it is less influenced by extreme values: it limits the range to the middle 50% of the values.
35 is the interquartile range.
Begin to draw the box plot graph.

Example 2
Consider two datasets:
A1 = {0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}
A2 = {-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}

Notice that both datasets are approximately balanced around zero; evidently the mean in both cases is "near" zero. However, there is substantially more variation in A2, which ranges approximately from -6 to 6, whereas A1 ranges approximately from -2½ to 2½.
Below find box plots. Notice the difference in scales: since the box plot is displaying the full range of variation, the y-range must be expanded.

[Figure: box plots of A1 and A2]


Information Obtained from a Box Plot

1. If the median is near the center of the box, the distribution is approximately symmetric.
2. If the median falls to the left of the center of the box, the distribution is positively skewed.
3. If the median falls to the right of the center of the box, the distribution is negatively skewed.

Similarly:
1. If the lines are about the same length, the distribution is approximately symmetric.
2. If the right line is longer than the left line, the distribution is positively skewed.
3. If the left line is longer than the right line, the distribution is negatively skewed.