Numerical Summary Measures

The document discusses numerical summary measures, focusing on central tendency and dispersion. It defines measures such as mean, median, mode, quartiles, and percentiles, explaining their calculations and properties. Additionally, it highlights the importance of choosing appropriate measures based on data types and variability.

Mekdes W.(MPH)
Numerical summary measures
• A numerical summary measure is a single number that quantifies a characteristic of a distribution of values.
– Measures of central tendency (location)
– Measures of dispersion (variability)

A. Measures of central location

• A measure of central tendency (MCT) is a univariate statistic that indicates, in one manner or another,
– the average or typical observed value of a variable in a data set, or
– put otherwise, the center of the frequency distribution of the data.
Cont’d…
• These measures summarize, in a single number, the point around which the data tend to cluster.
• The term “number crunching” is used to illustrate this aspect of data description.
• The measures described here are the mean, the median and the mode.
1. Mean
• The mean is the sum of the observations divided by the number of observations.
• The mean is defined if and only if the variable is at least interval in nature (i.e., interval or ratio).
Reading assignment
• Read on the different types of mean:
– arithmetic mean
– weighted mean
– geometric mean (GM)
– harmonic mean (HM)
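a) Ungrouped data
• If x_1, x_2, ..., x_n are n observed values, then the arithmetic mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$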
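As a rough illustration of the definition, the sketch below computes an arithmetic mean in Python. The helper name `arithmetic_mean` is an assumption made for this example, and the 11 ages are borrowed from the median example later in these slides.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("the mean is undefined for an empty data set")
    return sum(values) / len(values)


# Ages taken from the median example in this deck (n = 11).
ages = [20, 20, 19, 22, 24, 27, 27, 27, 34, 21, 20]
mean_age = arithmetic_mean(ages)
print(round(mean_age, 2))  # 261 / 11, about 23.73
```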
b) Grouped data
• It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

• where,
k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval

Example. Compute the mean age of 169 subjects from the grouped data.

Class interval   Mid-point (mi)   Frequency (fi)   mi·fi
[10-19]          14.5               4                58.0
[20-29]          24.5              66              1617.0
[30-39]          34.5              47              1621.5
[40-49]          44.5              36              1602.0
[50-59]          54.5              12               654.0
[60-69]          64.5               4               258.0
Total                             169              5810.5

Mean = 5810.5/169 = 34.38 years
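The grouped-data calculation can be checked with a few lines of Python. This is only a sketch: the function name `grouped_mean` is an assumption for illustration, and the mid-points and frequencies are copied from the age table.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("each class needs one mid-point and one frequency")
    total_mf = sum(m * f for m, f in zip(midpoints, frequencies))
    n = sum(frequencies)
    return total_mf / n


# Class mid-points and frequencies from the age table (n = 169).
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))
mean_age = grouped_mean(midpoints, frequencies)
print(sum_mf, sum(frequencies))  # 5810.5 169
print(round(mean_age, 2))        # 34.38
```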


Properties of the arithmetic mean
• For a given set of data there is one and only one arithmetic mean (uniqueness).
• It is easy to calculate and understand (simple).
• It is a poor measure of central location if the underlying distribution is not normal (not Gaussian), for example when it is markedly skewed.
• It is influenced by each and every value in the data set and is therefore affected by extreme values (outliers).
• In grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.
Median
• With the observations arranged in increasing or decreasing order, the median is defined as the middle observation.

a) Ungrouped data
• If the number of observations n is odd, the median is the [(n+1)/2]th observation.
• If n is even, the median is the average of the two middle values, the (n/2)th and the [(n/2)+1]th, i.e.

$$\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$$
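Cont’d…
Example: Find the median of the following 11 observations.
• Data: 20 20 19 22 24 27 27 27 34 21 20
• Ordered: 19 20 20 20 21 22 24 27 27 27 34
• n = 11 is odd, so the median is the [(11+1)/2]th = 6th ordered observation, which is 22.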
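The odd and even rules for ungrouped data can be sketched in Python as follows. The function name `median` is chosen for this illustration only; the second data set is the one from the inter-quartile range example later in the slides.

```python
def median(values):
    """Middle observation of the ordered data (average of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty data set")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the [(n+1)/2]th observation (1-based) is index n // 2 (0-based).
        return ordered[mid]
    # Even n: average of the (n/2)th and [(n/2)+1]th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = [20, 20, 19, 22, 24, 27, 27, 27, 34, 21, 20]   # n = 11
even_example = [18, 59, 24, 42, 21, 23, 24, 32]              # n = 8
print(median(odd_example))   # 22
print(median(even_example))  # 24.0
```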
b) Grouped data
• We assume that the values within a class interval are evenly distributed through the interval.
– The first step is to locate the class interval that contains the median.
– Find n/2 and take the first class interval whose cumulative frequency is at least n/2.
Median for grouped data…
To find a unique median value, use the following formula:

$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

• where,
• L_m = lower true class boundary of the interval containing the median
• F_c = cumulative frequency of the interval immediately preceding the median class interval
• f_m = frequency of the interval containing the median
• W = class interval width
• n = total number of observations
Example. Compute the median age of 169 subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
[10-19]          14.5               4                 4
[20-29]          24.5              66                70
[30-39]          34.5              47               117
[40-49]          44.5              36               153
[50-59]          54.5              12               165
[60-69]          64.5               4               169
Total                             169

• n/2 = 84.5 falls in the 3rd class interval, [30-39]
• True class boundaries of this interval: lower = 29.5, upper = 39.5
• Frequency of the class (f_m) = 47
• F_c = 70
• (n/2 – F_c) = 84.5 – 70 = 14.5
• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33 years
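The grouped-median steps can be traced in code. The sketch below is an illustration under stated assumptions: the function name `grouped_median` is made up for this example, and the lower true class boundaries (9.5, 19.5, ...) are inferred from the 29.5 boundary used in the worked example.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, freq in zip(lower_boundaries, frequencies):
        if cumulative + freq >= half:
            # First class whose cumulative frequency reaches n/2 is the median class.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("frequencies must contain at least one positive count")


# Assumed lower true class boundaries for [10-19], [20-29], ..., [60-69].
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_boundaries, frequencies, width=10)
print(round(median_age, 2))  # 32.59
```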
Properties of median
• There is only one median for a given set of data (uniqueness).
• The median is easy to calculate.
• The median is a positional average and hence is not sensitive to very large or very small values.
• The median is a better measure of central tendency than the mean when the distribution is skewed (not normal).
• It can be calculated even when the data have open-ended class intervals.


Quartiles
• If the ranked data are divided into four equal parts, we speak of quartiles.
• The median divides the data into two equal parts.

a) The first quartile (Q1): 25% of all the ranked observations are less than Q1. [25th percentile]
b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. [50th percentile] The second quartile is the median.
c) The third quartile (Q3): 75% of all the ranked observations are less than Q3. [75th percentile]
Percentiles
• Percentiles divide the ranked data into 100 equal parts.
• Commonly used percentiles:
– 10, 20, ..., 90% (deciles)
– 20, 40, 60, 80% (quintiles)
– 25, 50, 75% (quartiles)
– 33.3, 66.7% (tertiles)

– P0: the minimum.
– P25: 25% of the sample values are less than or equal to this value. It is the 1st quartile (25th percentile), given by the 0.25(n+1)th observation.
– P50: 50% of the sample values are less than or equal to this value. It is the 2nd quartile (50th percentile), given by the 0.5(n+1)th observation.
– P75: 75% of the sample values are less than or equal to this value. It is the 3rd quartile (75th percentile), given by the 0.75(n+1)th observation.
– P100: the maximum.
Class exercise
1. The following data set gives birth weights in grams. Find the 10th and 90th percentiles.
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146

Solution (n = 20, data already ranked)
• 10th percentile = 0.1(20+1) = 2.1th value, taken as the average of the 2nd and 3rd values = (2581+2759)/2 = 2670 g
• 90th percentile = 0.9(20+1) = 18.9th value, taken as the average of the 18th and 19th values = (3609+3649)/2 = 3629 g
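The positional rule used in this solution can be sketched as below. The function name `percentile_by_position` is hypothetical, and the code follows the slide's convention of averaging the two neighbouring values whenever the p(n+1) position is fractional. Library routines usually interpolate instead, so they may return slightly different numbers.

```python
import math


def percentile_by_position(values, p):
    """Value at the p*(n+1)th ranked position, with p given as a fraction (0.10 for the 10th percentile).

    A fractional position is resolved by averaging the two adjacent ranked
    values, as in the class exercise (an assumption of this sketch, not
    linear interpolation).
    """
    ordered = sorted(values)
    n = len(ordered)
    position = min(max(p * (n + 1), 1), n)  # keep the 1-based position within 1..n
    nearest = round(position)
    if math.isclose(position, nearest):
        return ordered[nearest - 1]
    lower = math.floor(position)
    return (ordered[lower - 1] + ordered[lower]) / 2


birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

p10 = percentile_by_position(birth_weights, 0.10)
p90 = percentile_by_position(birth_weights, 0.90)
print(p10, p90)  # 2670.0 3629.0
```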
Mode
• It is the value that occurs most often.
• Most distributions have one peak and are described as unimodal.
• Some distributions have more than one mode:
– Unimodal: a distribution with one mode.
– Bimodal: a distribution with two modes.
– Trimodal: a distribution with three modes.


Mode…
• The mode of grouped data usually refers to the modal class, i.e. the class with the highest frequency.
• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
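A small Python sketch can make both cases concrete. The function name `modes` is an assumption for this example; it returns every value tied for the highest count and, as a convention chosen here, an empty list when no value repeats. The ungrouped ages come from the median example, and the grouped data from the age table.

```python
from collections import Counter


def modes(values):
    """All values that occur most often; empty list if no value is repeated."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # convention of this sketch: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)


ages = [20, 20, 19, 22, 24, 27, 27, 27, 34, 21, 20]
print(modes(ages))  # [20, 27], so this sample is bimodal

# Grouped data: the modal class is the one with the highest frequency,
# and its mid-point serves as the single-value mode.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]
modal_index = frequencies.index(max(frequencies))
modal_midpoint = midpoints[modal_index]
print(modal_midpoint)  # 24.5, the mid-point of [20-29]
```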
Properties of mode
• It is not affected by extreme values.
• Often its value is not unique (more than one mode is possible).
• The main drawback of the mode is that it often does not exist (for example, when no value is repeated); it is therefore not a good summary of the majority of the data.
Cont’d
• Given a continuous frequency curve:
– the mode is the value of the variable under the highest
point of the frequency curve (the point with the greatest
density of observed values).
Considerations for Choosing a Measure of Central Tendency
• For a nominal variable, the mode is the only measure that can be used.
• For ordinal variables, the mode and the median may be used. The median provides more information.
• For interval-ratio variables, the mode, median, and mean may all be calculated. The mean provides the most information about the distribution, but the median is preferred if the distribution is skewed.
Descriptive statistics
Measures of dispersion

Consider the following two sets of data:

A: 177, 193, 195, 209, 226    Mean = 200
B: 192, 197, 200, 202, 209    Mean = 200

• Two or more data sets may have the same mean and/or median and still differ greatly in how spread out their values are.
• MCTs alone do not describe the variability or spread of the values.
Measure of dispersion
• Measures that quantify the variation or dispersion of a set of data around its central location.
• Dispersion refers to the variety exhibited by the values of the data.
• The amount of dispersion is small when the values are close together.
• If all the values are the same, there is no dispersion.
1. Range (R)
• The difference between the largest and smallest observations in a
data set.

• Range = Maximum value – Minimum value

• Example –

– Data values: 5, 9, 12, 16, 23, 34, 37, 42

– Range = 42-5 = 37
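As a quick check, the range can be computed in Python. The helper name `value_range` is an assumption for this sketch. Alongside the example above, it is applied to the two data sets A and B from the introduction to dispersion, which share a mean of 200 but have very different ranges.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    if not values:
        raise ValueError("the range is undefined for an empty data set")
    return max(values) - min(values)


data = [5, 9, 12, 16, 23, 34, 37, 42]
set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

print(value_range(data))   # 37
print(value_range(set_a))  # 49
print(value_range(set_b))  # 17
```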
Properties of range
• It is the simplest, crude measure of dispersion and is easily understood.
• It takes into account only two values, which makes it a poor measure of dispersion.
• It is very sensitive to extreme observations.


2. Inter-quartile range (IQR)
• Indicates the spread of the middle 50% of the observations, and is used together with the median.

IQR = Q3 - Q1

Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.

IQR = 10.2 kg – 8.8 kg = 1.4 kg

i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.
Example 2
• Given the following data set (age of patients):

18, 59, 24, 42, 21, 23, 24, 32

• Find the inter-quartile range.
• Solution: ordered data (n = 8): 18 21 23 24 24 32 42 59
• 1st quartile = {(n+1)/4}th = (2.25)th value, taken as the average of the 2nd and 3rd values = (21 + 23)/2 = 22
• 3rd quartile = {3/4 (n+1)}th = (6.75)th value, taken as the average of the 6th and 7th values = (32 + 42)/2 = 37
• Hence, IQR = 37 - 22 = 15
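The quartile positions in this solution can be reproduced with a short sketch. The names `value_at_position` and `iqr` are hypothetical, and the averaging of the two neighbouring values at a fractional position mirrors the slide's convention rather than the interpolation used by most statistical software.

```python
import math


def value_at_position(ordered, position):
    """Ranked value at a 1-based position; a fractional position averages its two neighbours."""
    nearest = round(position)
    if math.isclose(position, nearest):
        return ordered[nearest - 1]
    lower = math.floor(position)
    return (ordered[lower - 1] + ordered[lower]) / 2


def iqr(values):
    """Return (Q1, Q3, IQR) using the 0.25(n+1) and 0.75(n+1) positions."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch needs at least 3 observations")
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3, q3 - q1


ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1, q3, spread = iqr(ages)
print(q1, q3, spread)  # 22.0 37.0 15.0
```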
Properties of IQR:
• It encloses the central 50% of the observations.
• It is not based on all observations but only on two specific values.
• It is important in selecting cut-off points in the formulation of clinical standards.
• Since it excludes the lowest and highest 25% of values, it is not affected by extreme values.
• It is less sensitive to the size of the sample.


Sample variance (S²), with the standard deviation SD = √S²:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 – 1) = 120.22
SD = √S² = √120.22 = 10.96 years

Class interval   mi     fi    (mi – Mean)   (mi – Mean)²   (mi – Mean)²·fi
10-19            14.5     4   -19.88         395.21          1580.86
20-29            24.5    66    -9.88          97.61          6442.55
30-39            34.5    47     0.12           0.0144           0.68
40-49            44.5    36    10.12         102.41          3686.92
50-59            54.5    12    20.12         404.81          4857.77
60-69            64.5     4    30.12         907.21          3628.86
Total                   169                                 20197.63

Entries are rounded to two decimals; the total is computed before rounding.
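The grouped variance and SD can be recomputed directly from the mid-points and frequencies. The sketch below is illustrative only: the function name `grouped_variance` is an assumption, and it weights each squared deviation by its class frequency and divides by n - 1, as in the example. It works from the unrounded mean, so the last digit can differ slightly from a hand calculation that rounds at each step.

```python
import math


def grouped_variance(midpoints, frequencies):
    """Sample variance of grouped data: sum(f_i * (m_i - mean)^2) / (n - 1)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    squared_dev = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return squared_dev / (n - 1)


midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

variance = grouped_variance(midpoints, frequencies)
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 2))  # 120.22 10.96
```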
Properties of SD
• It has the advantage of being expressed in the same units of measurement as the mean.
• It is the most widely used measure of dispersion, largely because of the properties of the theoretical normal curve.
• However, if the units of measurement of the variables in two data sets are not the same, then their variability cannot be compared by comparing the values of SD.
Coefficient of variation (CV)
• When two data sets have different units of measurement, the CV should be used as the measure of dispersion.
• It is the best measure for comparing the variability of two series or sets of observations.
• Data with a smaller coefficient of variation are considered more consistent.
• CV is the ratio of the SD to the mean, multiplied by 100:

$$CV = \frac{S}{\bar{x}} \times 100$$

              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0

“Cholesterol is more variable than systolic blood pressure”
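The two CV values in the table can be verified with a minimal sketch. The function name `coefficient_of_variation` is an assumption for this example; it simply applies CV = (SD / mean) × 100.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; only meaningful for a non-zero mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_sbp = coefficient_of_variation(sd=15, mean=130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)  # cholesterol

print(round(cv_sbp, 1), round(cv_cholesterol, 1))  # 11.5 20.0
```

Because the CV is unit-free, the two results can be compared directly even though the original measurements are in different units.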
Skewed distributions
• Skewness: If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores.
• Based on the type of skewness, distributions can be:

A. Positively skewed distribution: occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.
B. Negatively skewed distribution: occurs when the majority of scores are at the right end of the curve and a few extremely small scores are scattered at the left end.
C. Symmetrical distribution: it is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.
Mean, Median & Mode
Which measures to use?
• When the distribution is symmetric, summarize the data using means and standard deviations.
• When the data are skewed, it is preferable to use the median and IQR as summary statistics.
• Unlike means and standard deviations, the median and IQR are not easily influenced by extreme values in a skewed distribution.

• Remark:
• The mean and median of a symmetric distribution coincide.
• When a distribution is skewed to the right, its mean is larger than its median.
• When a distribution is skewed to the left, its mean is smaller than its median.
Fig. 2(a). Symmetric distribution: Mean = Median = Mode
Fig. 2(b). Distribution skewed to the right: Mean > Median > Mode
Fig. 2(c). Distribution skewed to the left: Mean < Median < Mode
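The relationship between the mean and the median under skewness can be illustrated with the patient ages from the inter-quartile range example, which have a long right tail (one patient aged 59). This is a sketch using Python's standard `statistics` module; a single small sample is only suggestive, not a proof of the general rule.

```python
import statistics

# Patient ages from IQR Example 2; the value 59 stretches the right tail.
ages = [18, 59, 24, 42, 21, 23, 24, 32]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

print(mean_age, median_age, mode_age)  # 30.375 24.0 24

# In this right-skewed sample the mean is pulled above the median.
print(mean_age > median_age)  # True
```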


Any question?
