
Lecture 3


Uploaded by

esthermwende103

STA 101/SMA 140 Introduction to Probability and Statistics

February 26, 2021

1 Descriptive Statistics
1.1 Median
Definition: The median is the middle value of a data set after the data have been arranged in ascending or descending order. If the sample size is $n$, the median is located as follows:

• $n$ odd: the median is the value in the $\left(\frac{n+1}{2}\right)^{\text{th}}$ position of the ordered data.
• $n$ even: the two middle values are in the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n+2}{2}\right)^{\text{th}}$ positions of the ordered data, and the median is their average.
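Example 1:
Data: 2, 1, 1, 2, 3, 2, 3, 3, 4, 6, 4, 1, 2, 3, 2, 3, 5, 4, 4, 5
Median:
Ascending order: 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6.
Here $n = 20$ is even, so there are 2 values at the middle:
\[ \left(\frac{n}{2}\right)^{\text{th}} \text{ term} \implies \left(\frac{20}{2}\right)^{\text{th}} \text{ term} = 10^{\text{th}} \text{ term} = 3 \]
\[ \left(\frac{n+2}{2}\right)^{\text{th}} \text{ term} \implies \left(\frac{22}{2}\right)^{\text{th}} \text{ term} = 11^{\text{th}} \text{ term} = 3 \]
\[ \implies \text{Median} = \text{Average} = \frac{3 + 3}{2} = 3 \]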

The same result can be read from the cumulative frequency (cf) table:

x      f    cf
1      3     3
2      5     8
3      5    13
4      4    17
5      2    19
6      1    20
Total 20

The median lies between the 10th and 11th values. In the cf column, both positions fall after the 8th position and not beyond the 13th position. Thus both values are 3.
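The positional rule used in this example can be checked with a short Python sketch. It is only an illustration, not part of the original notes; the helper names `value_at_position` and `median_from_frequency_table` are assumptions introduced here.

```python
from collections import Counter
from itertools import accumulate
from statistics import median

# Raw data of Example 1.
data = [2, 1, 1, 2, 3, 2, 3, 3, 4, 6, 4, 1, 2, 3, 2, 3, 5, 4, 4, 5]


def value_at_position(values, cum_freqs, position):
    """Return the value whose cumulative frequency first reaches `position` (1-based)."""
    for value, cf in zip(values, cum_freqs):
        if position <= cf:
            return value
    raise ValueError("position exceeds the number of observations")


def median_from_frequency_table(values, freqs):
    """Median of ungrouped data given as sorted distinct values and their frequencies."""
    cum_freqs = list(accumulate(freqs))
    n = cum_freqs[-1]
    if n % 2 == 1:
        # n odd: single middle value at position (n + 1) / 2.
        return value_at_position(values, cum_freqs, (n + 1) // 2)
    # n even: average the values at positions n / 2 and (n + 2) / 2.
    lower = value_at_position(values, cum_freqs, n // 2)
    upper = value_at_position(values, cum_freqs, (n + 2) // 2)
    return (lower + upper) / 2


# Build the x, f, cf table from the raw data.
counts = Counter(data)
values = sorted(counts)
freqs = [counts[v] for v in values]

print(list(zip(values, freqs, accumulate(freqs))))  # rows of (x, f, cf)
print(median(data))                                 # median from the raw data
print(median_from_frequency_table(values, freqs))   # median from the table
```

Both routes use the same positions, so the raw-data median and the table-based median agree.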
Example 2:
Data: 10, 11, 10, 12, 11, 10, 11, 11, 11, 12, 13, 11, 12, 12, 13.
Cumulative frequency distribution table:
x f cf
10 3 3
11 6 9
12 4 13
13 2 15
15
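Since $n = 15$ is odd, the median position is the
\[ \left(\frac{n+1}{2}\right)^{\text{th}} = \left(\frac{16}{2}\right)^{\text{th}} = 8^{\text{th}} \text{ position} \]
In the cf column, the 8th position falls after the 3rd position and not beyond the 9th position, so it corresponds to $x = 11$. This
\[ \implies \text{Median} = 11 \]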

For grouped data, the median can be estimated by two methods:

• Graphically, from an ogive
• By the iterative (interpolation) formula
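Example 3:
Class    f    cf
0-4      1     1
5-9     14    15
10-14   23    38
15-19   21    59
20-24   15    74
25-29    6    80

The middle position is the
\[ \left(\frac{n}{2}\right)^{\text{th}} = \left(\frac{80}{2}\right)^{\text{th}} = 40^{\text{th}} \text{ position} \]
In the cf column, the 40th position falls after 38 and not beyond 59, so the median lies in the 15-19 class.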
Before estimating the median, we first need to make the data continuous (exclusive class intervals), just like we do for the mode.
The new classes now become:
Class        f    cf
-0.5-4.5     1     1
4.5-9.5     14    15
9.5-14.5    23    38
14.5-19.5   21    59
19.5-24.5   15    74
24.5-29.5    6    80

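Iteratively, the median is given by
\[ \text{Median} = L_m + \left(\frac{\frac{1}{2}N - cf_m}{f_m}\right) \times i \]
where $L_m$ is the lower boundary of the median class, $cf_m$ is the cumulative frequency of the class just before the median class, $f_m$ is the frequency of the median class, $i$ is the class width and $N$ is the total frequency. For this example,
\[ \text{Median} = 14.5 + \left(\frac{40 - 38}{21}\right) \times 5 = 14.5 + 0.48 = 14.98 \]

The median can also be estimated from the ogive, a graph of cumulative frequency (cf) against the variable, drawn for the continuous classes.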
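Reading a value off an ogive amounts to linear interpolation between the plotted points (class boundary, cumulative frequency). The following Python sketch illustrates this for Example 3. It assumes the continuous class boundaries run from −0.5 to 29.5 in steps of 5, and the helper name `read_ogive` is hypothetical, introduced only for this illustration.

```python
from bisect import bisect_left
from itertools import accumulate

# Continuous class boundaries and class frequencies of Example 3
# (the last upper boundary is assumed to be 29.5).
boundaries = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5, 29.5]
freqs = [1, 14, 23, 21, 15, 6]

# Ogive points: cumulative frequency reached at each class boundary.
cum_freqs = [0] + list(accumulate(freqs))
ogive_points = list(zip(boundaries, cum_freqs))


def read_ogive(target_cf):
    """Value of the variable at `target_cf`, by linear interpolation on the ogive."""
    if not 0 < target_cf <= cum_freqs[-1]:
        raise ValueError("target_cf must lie in (0, N]")
    # First boundary whose cumulative frequency reaches the target.
    k = bisect_left(cum_freqs, target_cf)
    x0, x1 = boundaries[k - 1], boundaries[k]
    c0, c1 = cum_freqs[k - 1], cum_freqs[k]
    return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)


total = cum_freqs[-1]
print(ogive_points)
print(round(read_ogive(total / 2), 2))  # median read at the N/2 position
```

Because the ogive is drawn with straight segments between boundaries, this reading coincides with the iterative formula for the median.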
Other estimates:

Quartiles Data divided into 4 equal parts. The dividing values are the 1st quartile (the lower quartile), the 2nd quartile (the median) and the 3rd quartile (the upper quartile).
[Figure: the data divided into four parts, labelled 1st, 2nd, 3rd and 4th]

Quintiles Data divided into 5 equal parts.

Deciles Data divided into 10 equal parts.
[Figure: the same data that was divided into 4 parts for the quartiles, now divided into 10 parts labelled 1 to 10]

Percentiles Data divided into 100 equal parts.

The lower quartile The value separating the 1st quarter of the data from the other data values. It is estimated by applying the median concept:
\[ \text{Lower Quartile} = L_{LQ} + \left(\frac{\frac{1}{4}N - cf_{LQ}}{f_{LQ}}\right) \times i \]
From the above example, the lower quartile (LQ) position becomes
\[ \frac{1}{4}(80) = 20^{\text{th}} \text{ position} \]
which is above 15 and below 38 in the cf column. It therefore lies in the 9.5 − 14.5 class. Thus,
\[ LQ = 9.5 + \left(\frac{20 - 15}{23}\right) \times 5 = 9.5 + 1.09 = 10.59 \]
The upper quartile The value separating the first 3 quarters of the data from the last quarter of the data values, estimated as
\[ \text{Upper Quartile} = L_{UQ} + \left(\frac{\frac{3}{4}N - cf_{UQ}}{f_{UQ}}\right) \times i \]
From the above example, the upper quartile (UQ) position becomes
\[ \frac{3}{4}(80) = 60^{\text{th}} \text{ position} \]
which is above 59 and below 74 in the cf column. It therefore lies in the 19.5 − 24.5 class. Thus
\[ UQ = 19.5 + \left(\frac{60 - 59}{15}\right) \times 5 = 19.5 + 0.33 = 19.83 \]

Inter-quartile range The difference between the upper quartile and the lower quartile,
\[ IQR = UQ - LQ \]
From the above example, it then implies that
\[ IQR = UQ - LQ = 19.83 - 10.59 = 9.24 \]

Semi inter-quartile range or Quartile deviation Half the inter-quartile range,
\[ \text{Semi-IQR} = \frac{UQ - LQ}{2} \]
which, for the above example, gives
\[ \text{Semi-IQR} = \frac{1}{2}(19.83 - 10.59) = \frac{1}{2}(9.24) = 4.62 \]
Characteristics of Quartile Deviation:
1. The quartile deviation does not take into account the extreme points of the distribution. It considers the dispersion or spread of only the central 50% of the data.
2. If the scale of the data is changed, the quartile deviation also changes in the same ratio.
3. It is the best measure of dispersion for open-ended distributions (those whose extreme classes are open-ended).
4. It is less affected by sampling fluctuations in the data set than the range.
5. Since it depends solely on the central values of the distribution, the result is affected drastically if these values are abnormal or inaccurate in an experiment.
Coefficient of Quartile Deviation This is a relative measure of dispersion based on quartiles, formally defined as
\[ \text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\% \]
where $Q_1$ is the lower quartile and $Q_3$ is the upper quartile. It is a ratio of two quantities of the same dimension, hence unit-less. It is a good measure of dispersion for comparing 2 data sets with different units.
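As an illustration that is not part of the original notes, the definition can be applied to the quartiles estimated for the grouped example, $Q_1 = LQ = 10.59$ and $Q_3 = UQ = 19.83$. Working with these two-decimal values gives roughly:

```latex
\[
\text{Coefficient of Quartile Deviation}
  = \frac{19.83 - 10.59}{19.83 + 10.59} \times 100\%
  = \frac{9.24}{30.42} \times 100\%
  \approx 30.4\%
\]
```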

Deciles Data divided into 10 parts; e.g. the 6th decile becomes
\[ 6^{\text{th}} \text{ decile} = L_{6\text{th d}} + \left(\frac{\frac{6}{10}N - cf_{6\text{th d}}}{f_{6\text{th d}}}\right) \times i \]
From the above example, the 6th decile position becomes
\[ \frac{6}{10}(80) = 48^{\text{th}} \text{ position} \]
which is above 38 and below 59 in the cf column. It therefore lies in the 14.5 − 19.5 class. Thus
\[ 6^{\text{th}} \text{ decile} = 14.5 + \left(\frac{48 - 38}{21}\right) \times 5 = 14.5 + 2.38 = 16.88 \]

Percentiles Data divided into 100 parts; e.g. the 65th percentile becomes
\[ 65^{\text{th}} \text{ percentile} = L_{65\text{th p}} + \left(\frac{\frac{65}{100}N - cf_{65\text{th p}}}{f_{65\text{th p}}}\right) \times i \]
From the above example, the 65th percentile position becomes
\[ \frac{65}{100}(80) = 52^{\text{nd}} \text{ position} \]
which is above 38 and below 59 in the cf column. It therefore lies in the 14.5 − 19.5 class. Thus
\[ 65^{\text{th}} \text{ percentile} = 14.5 + \left(\frac{52 - 38}{21}\right) \times 5 = 14.5 + 3.33 = 17.83 \]

Note: All these statistics could also be estimated from an ogive.
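The median, quartile, decile and percentile formulas all share one pattern: find the class containing the required position, then interpolate within it. A minimal Python sketch of that pattern follows, assuming the continuous classes of Example 3 with class width 5. The function name `grouped_quantile` and its parameters `k` and `parts` are illustrative assumptions, not notation from the notes.

```python
from itertools import accumulate

# Continuous classes of Example 3: lower class boundaries, frequencies, class width.
LOWER_BOUNDARIES = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5]
FREQS = [1, 14, 23, 21, 15, 6]
WIDTH = 5


def grouped_quantile(k, parts, lower_boundaries=LOWER_BOUNDARIES,
                     freqs=FREQS, width=WIDTH):
    """Estimate the k-th of `parts` quantiles as L + ((k/parts) * N - cf) / f * i."""
    if not 0 < k <= parts:
        raise ValueError("k must satisfy 0 < k <= parts")
    cum_freqs = list(accumulate(freqs))
    total = cum_freqs[-1]
    position = k * total / parts
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, f, cf in zip(lower_boundaries, freqs, cum_freqs):
        if position <= cf:
            return lower + (position - cf_before) / f * width
        cf_before = cf
    raise ValueError("position lies beyond the last class")


median = grouped_quantile(1, 2)
lq = grouped_quantile(1, 4)
uq = grouped_quantile(3, 4)
decile_6 = grouped_quantile(6, 10)
percentile_65 = grouped_quantile(65, 100)

print(f"median = {median:.2f}, LQ = {lq:.2f}, UQ = {uq:.2f}")
print(f"6th decile = {decile_6:.2f}, 65th percentile = {percentile_65:.2f}")
print(f"IQR = {uq - lq:.4f}, semi-IQR = {(uq - lq) / 2:.4f}")
```

The sketch keeps full precision, so its inter-quartile range (about 9.25) differs slightly in the last digit from 19.83 − 10.59 = 9.24, which is obtained by subtracting the two-decimal quartiles.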

2 Measures of Dispersion
These statistics measure the following:

• Overall spread of items: The Range


• Spread about the mean: Concerned with measuring the distance between the items and their common mean. We have the mean deviation and the standard deviation.

• Central percentage spread of items: These measures are linked to the median, e.g. the 10 to 90 percentile range, the quartile deviation, etc.
• Coefficient of Variation

2.1 Range
Definition:
The numerical difference between the highest and the smallest values of the items in a given data set or distribution;
\[ \text{Range} = \text{Max.} - \text{Min.} \]

Characteristics of the range:

1. It is the simplest and easiest measure of dispersion to calculate.
2. It is the poorest measure of dispersion since it takes only two values into account (the smallest and the largest), and these may be outliers (extreme values) in the given data set.

3. It is widely used in day-to-day discussions because of its simplicity, e.g. the range of prices of the same commodity in different places, the range of delivery times of different items ordered, or the range of manpower available on a production line at different times.
4. It has no natural partner among the measures of location and so cannot be used in further statistical work.
5. It is a widely used measure of dispersion in practice, e.g. in meteorological departments when discussing the temperatures of different places in the world.
6. It is widely used in quality control, alongside sample mean control charts. The chart for the ranges of samples enables a check to be kept on the variability of production.
2.2 The Mean Deviation
Definition:
A measure of dispersion that gives the average absolute difference (i.e. ignoring the minus sign) between each item and the mean.
It is a much more representative measure than the range, since all item values are taken into account.
e.g. Suppose an assembly line produced 3, 10, 5, 2 defective products. Then the mean number of defectives is $\bar{x} = \frac{3+10+5+2}{4} = 5$. The absolute differences between each value and the mean become

3 − 5 = −2 (2, ignoring minus)
10 − 5 = 5
5 − 5 = 0
2 − 5 = −3 (3, ignoring minus)

Thus the
\[ \text{Mean deviation} = \frac{2+5+0+3}{4} = 2.5 \]
Formulae for Mean Deviation:
For a set:
\[ \text{Mean deviation, } Md = \frac{\sum |x - \bar{x}|}{n} \]
For a frequency distribution:
\[ \text{Mean deviation, } Md = \frac{\sum f|x - \bar{x}|}{\sum f} \]

2.2.1 Mean Deviation for a frequency distribution


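From the previous Example 3, taking $x$ as the class mid-point (note that $23 \times 12 = 276$, so $\sum fx = 1225$),
\[ \text{Mean} = \frac{\sum fx}{\sum f} = \frac{1225}{80} = 15.3125 \]
then,
class    f    x    fx     $|x - \bar{x}|$   $f|x - \bar{x}|$
0-4      1    2      2    13.3125     13.3125
5-9     14    7     98     8.3125    116.375
10-14   23   12    276     3.3125     76.1875
15-19   21   17    357     1.6875     35.4375
20-24   15   22    330     6.6875    100.3125
25-29    6   27    162    11.6875     70.125
Total   80        1225               411.75
Thus,
\[ \text{Mean deviation} = \frac{\sum f|x - \bar{x}|}{\sum f} = \frac{411.75}{80} = 5.146875 \approx 5.15 \]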
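The table arithmetic for the mean deviation is easy to get wrong by hand, so a short Python sketch is a useful check. It assumes the class mid-points 2, 7, ..., 27 and the frequencies of Example 3. The function names `frequency_mean` and `mean_deviation` are illustrative, not from the notes. Computed directly, $23 \times 12 = 276$, so $\sum fx = 1225$.

```python
# Class mid-points and frequencies of Example 3.
midpoints = [2, 7, 12, 17, 22, 27]
freqs = [1, 14, 23, 21, 15, 6]


def frequency_mean(xs, fs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


def mean_deviation(xs, fs=None):
    """Mean absolute deviation about the mean; every frequency is 1 if `fs` is omitted."""
    if fs is None:
        fs = [1] * len(xs)
    mean = frequency_mean(xs, fs)
    return sum(f * abs(x - mean) for x, f in zip(xs, fs)) / sum(fs)


print(sum(f * x for x, f in zip(midpoints, freqs)))  # sum of the fx column
print(frequency_mean(midpoints, freqs))              # grouped mean
print(mean_deviation(midpoints, freqs))              # grouped mean deviation
print(mean_deviation([3, 10, 5, 2]))                 # assembly-line example
```

The same function covers both formulae: a plain data set is simply a frequency distribution in which every frequency is 1.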

Characteristics of the Mean Deviation
1. A good representative measure of dispersion that is not difficult to understand. It is used to compare variability between distributions of like nature.
2. It can be complicated and awkward to calculate if the mean is anything other than a whole number.
3. The mean deviation is virtually impossible to handle theoretically because of the modulus sign. Hence, it is not used in more advanced analysis.

2.3 Standard Deviation


Definition:
“The root of the mean of the squares of deviations from the common mean” of a set of values.
NB: If the mean is not a whole number, the calculations can involve some awkward decimal arithmetic.

2.3.1 Steps in Calculating Standard deviation


• Step 1: Calculate the mean;
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{\sum fx}{\sum f} \]

• Step 2: Calculate the sum of the squares of deviations of items from the mean
• Step 3: Divide the sum by the number of items and take the square root;
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n} \implies s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \]
for a data set; while for a distribution, the value becomes,
\[ s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f} \implies s = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}} \]

This formula can be re-arranged to avoid the awkward subtraction from a decimal mean to become
\[ s^2 = \frac{\sum x^2}{n} - \bar{x}^2 \implies s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} \]
for a data set, and
\[ s^2 = \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2 \implies s = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2} \]
for a frequency distribution.
From the previous example, then
class    f    x    fx    $x^2$   $fx^2$
0-4      1    2      2      4       4
5-9     14    7     98     49     686
10-14   23   12    276    144    3312
15-19   21   17    357    289    6069
20-24   15   22    330    484    7260
25-29    6   27    162    729    4374
Total   80        1225          21705
The variance becomes
\[
\begin{aligned}
s^2 &= \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2 \\
    &= \frac{21705}{80} - \left(\frac{1225}{80}\right)^2 \\
    &= 271.3125 - 15.3125^2 \\
    &= 271.3125 - 234.4727 \\
    &= 36.8398
\end{aligned}
\]
\[ s = \sqrt{36.8398} = 6.07 \]
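The definitional and the re-arranged variance formulas should always give the same answer, which makes them a convenient cross-check on the column totals. The Python sketch below assumes the mid-points and frequencies of the grouped example. The function names are illustrative. Summing the columns directly gives $\sum fx = 1225$ and $\sum fx^2 = 21705$.

```python
from math import isclose, sqrt

# Class mid-points and frequencies of the grouped example.
midpoints = [2, 7, 12, 17, 22, 27]
freqs = [1, 14, 23, 21, 15, 6]


def variance_definition(xs, fs):
    """Variance as the mean of squared deviations: sum(f * (x - mean)^2) / sum(f)."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / n


def variance_shortcut(xs, fs):
    """Re-arranged form: sum(f * x^2) / sum(f) - (sum(f * x) / sum(f))^2."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * x * x for x, f in zip(xs, fs)) / n - mean ** 2


sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
variance = variance_shortcut(midpoints, freqs)
std_dev = sqrt(variance)

print(sum_fx, sum_fx2)  # column totals of fx and fx^2
print(round(variance, 4), round(std_dev, 2))
print(isclose(variance, variance_definition(midpoints, freqs)))  # both forms agree
```

A standard deviation of about 6 is plausible for data that range from 0 to 29; a value larger than the whole range would signal an arithmetic slip.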

Characteristics of the Standard Deviation


1. It is the natural partner of the arithmetic mean, both ‘by definition’ (the formula for the standard deviation contains the mean) and in ‘further statistical work’ (the normal distribution function has the mean µ and the standard deviation σ as its parameters).
2. It is regarded as truly representative of the data, since all the data values are taken into account in its calculation.
3. For distributions that are not too skewed,

• almost all the items ($\cong 99\%$) should lie within 3 standard deviations of the mean; $\bar{x} \pm 3s$
• 95% of the items should lie within 2 standard deviations of the mean; $\bar{x} \pm 2s$
• 50% of the items should lie within 0.67 standard deviations of the mean; $\bar{x} \pm \frac{2}{3}s$

2.4 The Coefficient of Variation


Used when comparing 2 different distributions with regard to variability, for example the variation of certain critical dimensions of different outputs.
The standard deviation is used as a measure for comparison only when the units in the distributions are the same and the respective means are roughly comparable.
When comparing distributions with respect to variability, the coefficient of variation is much more appropriate.
\[ \text{Coefficient of Variation} = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \]
Example:
Over a period of one month, the daily number of components produced by two comparable machines was measured:

Machine A: mean = 242.8; std dev = 20.5
Machine B: mean = 281.3; std dev = 23.0

The coefficient of variation for Machine A $= \frac{20.5}{242.8} \times 100\% = 8.4\%$
The coefficient of variation for Machine B $= \frac{23.0}{281.3} \times 100\% = 8.2\%$
Thus, although the std dev for Machine B is higher in absolute terms, the dispersion for Machine A is higher in relative terms.
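The comparison of the two machines can be reproduced with a few lines of Python. This is only a sketch; the function name `coefficient_of_variation` and the dictionary layout are assumptions made for the illustration.

```python
def coefficient_of_variation(mean, std_dev):
    """Coefficient of variation as a percentage: (standard deviation / mean) * 100."""
    return std_dev / mean * 100


# (mean, standard deviation) of daily output for the two machines in the example.
machines = {"A": (242.8, 20.5), "B": (281.3, 23.0)}

for name, (mean, std_dev) in machines.items():
    cv = coefficient_of_variation(mean, std_dev)
    print(f"Machine {name}: CV = {cv:.1f}%")

# The machine with the larger CV has the larger relative dispersion.
most_variable = max(machines, key=lambda m: coefficient_of_variation(*machines[m]))
print(f"Higher relative dispersion: Machine {most_variable}")
```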

2.5 Pearson’s Measure of Skewness


Definition:
\[ Psk = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]
OR
\[ Psk = \frac{3 \times (\text{mean} - \text{median})}{\text{standard deviation}} \]

Note that:
$Psk < 0$ shows there is left or negative skew
$Psk = 0$ signifies no skew (mean = mode for a symmetric distribution)
$Psk > 0$ means there is right or positive skew
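Example:
Determine the skewness for the previous example.
The modal class is the one with the highest frequency (23), i.e. the 9.5 − 14.5 class, so
\[
\begin{aligned}
\text{Mode} &= 9.5 + \left(\frac{23 - 14}{(23 - 14) + (23 - 21)}\right) \times 5 \\
            &= 9.5 + 0.8182 \times 5 \\
            &= 9.5 + 4.091 \\
            &= 13.591
\end{aligned}
\]
From the class mid-points, $\text{Mean} = \frac{\sum fx}{\sum f} = \frac{1225}{80} = 15.3125$ and $\text{Std dev} = \sqrt{\frac{21705}{80} - 15.3125^2} = 6.07$.

Thus
\[ Psk = \frac{15.3125 - 13.591}{6.07} = \frac{1.7215}{6.07} = +0.28 \]
This demonstrates a modest degree of right skew. This can be confirmed by inspecting the frequency distribution table.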
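Both versions of Pearson's measure can be computed side by side for the grouped example. The Python sketch below assumes the continuous classes start at −0.5 with class width 5, and it uses the interpolation formulas for the mode and the median. All function names are illustrative assumptions rather than notation from the notes.

```python
from itertools import accumulate
from math import sqrt

# Continuous classes of the grouped example.
LOWER_BOUNDARIES = [-0.5, 4.5, 9.5, 14.5, 19.5, 24.5]
FREQS = [1, 14, 23, 21, 15, 6]
WIDTH = 5
MIDPOINTS = [lower + WIDTH / 2 for lower in LOWER_BOUNDARIES]  # 2, 7, ..., 27
N = sum(FREQS)


def grouped_mean():
    return sum(f * x for x, f in zip(MIDPOINTS, FREQS)) / N


def grouped_std_dev():
    mean = grouped_mean()
    return sqrt(sum(f * x * x for x, f in zip(MIDPOINTS, FREQS)) / N - mean ** 2)


def grouped_mode():
    """Mode = L + d1 / (d1 + d2) * i, where d1 and d2 are the drops to the neighbouring classes."""
    m = FREQS.index(max(FREQS))
    before = FREQS[m - 1] if m > 0 else 0
    after = FREQS[m + 1] if m + 1 < len(FREQS) else 0
    d1, d2 = FREQS[m] - before, FREQS[m] - after
    return LOWER_BOUNDARIES[m] + d1 / (d1 + d2) * WIDTH


def grouped_median():
    """Median = L + (N/2 - cf) / f * i for the class containing the N/2 position."""
    cf_before = 0
    for lower, f, cf in zip(LOWER_BOUNDARIES, FREQS, accumulate(FREQS)):
        if N / 2 <= cf:
            return lower + (N / 2 - cf_before) / f * WIDTH
        cf_before = cf


mean, sd = grouped_mean(), grouped_std_dev()
psk_mode = (mean - grouped_mode()) / sd
psk_median = 3 * (mean - grouped_median()) / sd

print(f"mode = {grouped_mode():.3f}, median = {grouped_median():.3f}")
print(f"Psk (mode form) = {psk_mode:+.2f}, Psk (median form) = {psk_median:+.2f}")
```

The two forms need not give the same number, but for this distribution both are positive, which agrees with the longer right-hand tail in the frequency table.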
