Chapter 3
Summarization of Data
03/09/2021 Tesfa S. 1
Measures of Central Tendency/ Measures of Location
1. Arithmetic Mean/simple Mean
The mean for Grouped data can be computed as follows:
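As a sketch of the standard grouped-data formula, assuming $f_i$ is the frequency and $X_{ci}$ the class mark of the $i$th class (the symbols used for the grouped variance later in this chapter), with $k$ introduced here only to denote the number of classes and $n = \sum f_i$:

```latex
\bar{x} = \frac{\sum_{i=1}^{k} f_i \, X_{ci}}{\sum_{i=1}^{k} f_i}
        = \frac{1}{n} \sum_{i=1}^{k} f_i \, X_{ci}
```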
Properties of the arithmetic mean
Uniqueness: For a given set of data there is only one arithmetic
mean.
Simplicity: The mean is easily understood and easy to compute.
Center of gravity: The algebraic sum of the deviations of the
given values from their arithmetic mean is always zero, i.e.
∑(xi − x̄) = 0. So the mean is the center of gravity of the data set.
Sensitivity: Since every value in a set of data enters
into the computation of the mean, it is greatly affected by
extreme values.
In a skewed distribution, the mean is therefore an undesirable
measure of central tendency.
Example 1
Consider the birth weights (in kg) of 10 newborn
children at the University of Gondar Hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Then the average birth weight is the sum of the observations divided by their number:
x̄ = 25.72/10 = 2.572 kg ≈ 2.57 kg
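The calculation for this example can be checked with a minimal Python sketch. It also illustrates two of the properties listed earlier: the deviations from the mean sum to zero, and a single extreme value shifts the mean. The value 9.0 kg is a hypothetical outlier added only for illustration; it is not part of the data.

```python
from statistics import mean

# Birth weights (kg) from Example 1.
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

x_bar = mean(weights)

# Center of gravity: deviations from the mean sum to zero
# (up to floating-point rounding).
deviation_sum = sum(x - x_bar for x in weights)

# Sensitivity: one hypothetical extreme value (9.0 kg, not in the data)
# pulls the mean upward.
x_bar_outlier = mean(weights + [9.0])

print(round(x_bar, 3))            # 2.572
print(abs(deviation_sum) < 1e-9)  # True
print(round(x_bar_outlier, 3))    # 3.156
```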
Example 2…
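2. Median
The median is the middle value of the data arranged in ascending order.
Median = the ((n+1)/2)th observation, if n is odd
Median = the average of the (n/2)th and (n/2 + 1)th observations, if n is even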
Properties of Median
Uniqueness: There is only one median for a
given set of data.
Simplicity: The median is easy to compute.
Insensitivity: The median is a positional average. In
contrast to the mean, the median is not
influenced to the same extent by extreme
values.
Example:
Consider the weights (in kg) of 10 newborn
children at the University of Gondar Hospital within a
month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Find the median of the data.
First arrange the data in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
As n = 10 is even, the median is the average of the two
middle observations (the 5th and 6th):
Median = (2.43 + 2.51)/2 = 2.47 kg
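The same steps can be sketched in Python. The second call adds a hypothetical extreme value (9.0 kg, not part of the data) to illustrate that the median moves far less than the mean would.

```python
from statistics import median

# Newborn weights (kg) from the example above.
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

# median() sorts internally and, for an even count,
# averages the two middle observations.
med = median(weights)

# Hypothetical outlier, added only to illustrate insensitivity.
med_outlier = median(weights + [9.0])

print(round(med, 2))  # 2.47
print(med_outlier)    # 2.51
```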
Median for grouped data
$\text{Median} = LCB + \left(\frac{n/2 - F_c}{f_c}\right) W$
where:
LCB = lower class boundary of the median class
Fc = cumulative frequency just before the median
class
fc = frequency of the median class
W = class width and n = number of observations.
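A minimal sketch of the grouped-median calculation, assuming equal class widths. The frequency table in it is hypothetical (it is not the table from this chapter's example), and the function and parameter names are illustrative.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: LCB + ((n/2 - Fc) / fc) * W."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("need at least one observation")
    half = n / 2
    cumulative = 0  # Fc: cumulative frequency before the current class
    for lcb, fc in zip(lower_boundaries, frequencies):
        # The median class is the first class whose cumulative
        # frequency reaches n/2.
        if cumulative + fc >= half:
            return lcb + (half - cumulative) / fc * width
        cumulative += fc


# Hypothetical table: five classes of width 10, n = 100.
lower_boundaries = [0, 10, 20, 30, 40]
frequencies = [15, 25, 25, 20, 15]

result = grouped_median(lower_boundaries, frequencies, width=10)
print(result)  # 24.0
```

Here n/2 = 50, the cumulative frequencies are 15, 40, 65, so the third class is the median class with LCB = 20, Fc = 40 and fc = 25.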
Example 1
As we can see from the distribution, the first class to
contain the 120th observation (n/2 = 120) is the class
with cumulative frequency 155, since 120 ≤ 155. So
the median class is the 4th class.
3. Mode
The mode is the value that appears most frequently.
It can be obtained by counting the number of appearances of
each observation in the list.
It is important for summarising nominal/categorical data.
Disadvantages:
With a small number of observations, there may be no mode.
In addition, there may sometimes be more than one
mode.
Example
a. 22, 66, 69, 70, 73 (no modal value)
b. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal
value = 3.0 kg)
c. 1, 2, 2, 3, 3, 1, 4, 7, 9 (modal values = 1, 2, 3)
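The counting procedure can be sketched in Python. The helper name `modes` is illustrative, and treating data in which every value occurs exactly once as having no mode is a convention chosen here to match example a.

```python
from collections import Counter


def modes(data):
    """Return all modal values in sorted order, or [] if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: reported as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([22, 66, 69, 70, 73]))                                 # []
print(modes([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]))   # [3.0]
print(modes([1, 2, 2, 3, 3, 1, 4, 7, 9]))                          # [1, 2, 3]
```

In the third list the values 1, 2 and 3 each occur twice, so all three are modes.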
Properties of Mode
It is not affected by extreme values
It can be calculated for distributions with open end classes
Quartiles
Quartiles are quantiles that divide the distribution into four equal parts.
The 25th percentile demarcates the first quartile (Q1),
the median or 50th percentile demarcates the second quartile (Q2),
the 75th percentile demarcates the third quartile (Q3), and
the 100th percentile demarcates the fourth quartile (Q4).
Percentiles
The pth percentile is the value Vp such that p percent of the sample points are less
than or equal to Vp.
Percentiles are less sensitive to outliers and are not greatly affected by the sample size.
A different definition is needed for the pth percentile, depending on whether
np/100 is an integer or not.
With the sample arranged in ascending order, the pth percentile is defined by
1. The (k+1)th observation if np/100 is not an integer (where k is the largest integer
less than np/100)
2. The average of the (np/100)th and (np/100 + 1)th observations if np/100 is an
integer.
NB. To calculate an interpolated percentile value, multiply the difference between the two
adjacent ordered observations by the fractional part of the position and add the result to the lower observation.
Example 1
Suppose the sample consists of birth weights (in grams) of all live born
infants born at a private hospital in a city, during a 1-week period. This
sample is shown in the following table:
3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609, 3314,
3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834
Compute the 10th and 90th percentile for the birth weight data.
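The percentile definition given earlier can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes p is a whole number between 1 and 99 so that the integer test on np/100 is exact.

```python
def percentile(data, p):
    """pth percentile using the two-case definition above (p integer, 1..99)."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    xs = sorted(data)               # ascending order
    k, remainder = divmod(len(xs) * p, 100)
    if remainder == 0:
        # np/100 is an integer: average the kth and (k+1)th observations.
        return (xs[k - 1] + xs[k]) / 2
    # Otherwise take the (k+1)th observation (0-based index k).
    return xs[k]


birth_weights = [3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200,
                 3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834]

p10 = percentile(birth_weights, 10)
p90 = percentile(birth_weights, 90)
print(p10)  # 2670.0
print(p90)  # 3629.0
```

Statistical packages often use interpolation rules that differ from this definition, so their results for the same data may not match exactly.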
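Example 1…
Solution: n = 20.
10th percentile: np/100 = 20 × 10/100 = 2 is an integer, so take the average of the 2nd and
3rd observations in ascending order: (2581 + 2759)/2 = 2670 g.
90th percentile: np/100 = 20 × 90/100 = 18 is an integer, so take the average of the 18th and
19th observations in ascending order: (3609 + 3649)/2 = 3629 g.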
Positively skewed distribution: occurs when the majority of
scores are at the left end of the curve and a few extremely large
scores are scattered at the right end.
Negatively skewed distribution: occurs when the majority of
scores are at the right end of the curve and a few small
scores are scattered at the left end.
Kurtosis
Kurtosis refers to the appearance of the peak of a curve, as well as
its tails, relative to a normal distribution.
Data distributions with high kurtosis generally exhibit high and steep
peaks near the mean, with heavier tails;
data with low kurtosis exhibit broader and flatter peaks than the
normal distribution.
A Gaussian (normal) curve has zero skewness and zero excess kurtosis (a kurtosis of 3).
NB. A change in kurtosis alters the shape of the distribution while the
variance of the data remains unchanged.
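Skewness and kurtosis can be quantified from the central moments of the data. The sketch below uses the simple moment-based (population) estimators, which is an assumption made here for illustration; statistical packages often apply small-sample corrections and will give slightly different values. Both small data sets are hypothetical.

```python
def skewness_and_excess_kurtosis(data):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(data)
    x_bar = sum(data) / n
    m2 = sum((x - x_bar) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - x_bar) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - x_bar) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3            # zero for a normal curve
    return skew, excess_kurtosis


# Hypothetical data sets, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # a few large values at the right end

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, _ = skewness_and_excess_kurtosis(right_skewed)

print(skew_sym)        # 0.0 (symmetric)
print(skew_right > 0)  # True (positively skewed)
```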
2. Measures of Dispersion/ Variation
Moreover, two or more data sets may have the same mean
and/or median but still differ greatly in how spread out their values are.
1. RANGE:
It is the difference between the largest and smallest
observations in the data:
R = largest value − smallest value
For the birth-weight data above, the range can be computed by arranging all observations in
ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
R = Maximum − Minimum = 3.25 − 1.98 = 1.27 kg
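As a quick check in Python (sorting is not actually required, since only the largest and smallest values are used):

```python
# Newborn weights (kg) from the example above.
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

data_range = max(weights) - min(weights)
print(round(data_range, 2))  # 1.27
```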
The usefulness of the range is limited. Because it takes into
account only two values, it wastes information and is a poor
measure of dispersion.
The extreme values may also be unreliable; that is, they are the
most likely to be faulty.
The main advantage of the range is the simplicity of its
computation.
2. The interquartile range (IQR):
It reflects the variability among the middle 50 percent of the
observations in a data set.
It is the difference between the third and the first quartiles: IQR = Q3 − Q1.
To compute it:
First sort the data in ascending order
Find the first quartile (Q1)
Find the third quartile (Q3)
Then calculate the difference Q3 − Q1
Example
Given the following data set (ages of patients), find the interquartile
range.
18, 59, 24, 42, 21, 23, 24, 32
1. Sort the data from lowest to highest:
18, 21, 23, 24, 24, 32, 42, 59
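One way to finish this example is sketched below. It assumes the quartiles are taken as the 25th and 75th percentiles under the percentile definition given earlier in this chapter; the worked steps from the original slide are not preserved, and other quartile conventions can give different values. The function name `percentile` is illustrative.

```python
def percentile(data, p):
    """pth percentile (p integer, 1..99) using the chapter's two-case rule."""
    xs = sorted(data)
    k, remainder = divmod(len(xs) * p, 100)
    if remainder == 0:
        return (xs[k - 1] + xs[k]) / 2  # average of kth and (k+1)th values
    return xs[k]                        # (k+1)th value


ages = [18, 59, 24, 42, 21, 23, 24, 32]

q1 = percentile(ages, 25)  # average of the 2nd and 3rd sorted values
q3 = percentile(ages, 75)  # average of the 6th and 7th sorted values
iqr = q3 - q1

print(q1, q3, iqr)  # 22.0 37.0 15.0
```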
Example …
While the interquartile range eliminates the problem of
outliers, it creates another problem: it ignores
half of the data.
The solution to both problems is to measure variability
from the center of the distribution.
3. Variance
Variance measures how far, on average, scores deviate or differ from the
mean.
Mathematically, the sample variance is
defined as:
$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
4. Standard Deviation
The sample and population standard deviations are denoted by S
and σ (by convention), respectively.
The standard deviation (S.D.) is just the positive square root of the
variance.
It expresses exactly the same information as the variance, but
rescaled to be in the same units as the mean.
It is the best measure of dispersion for normally distributed data.
Mathematically:
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Standard Deviation:
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145
Example 1
Find the variance and standard deviation of the
above distribution.
Solution
The mean of the sample is 125 m².
Variance (sample): $S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(101-125)^2 + (105-125)^2 + \dots + (145-125)^2}{15 - 1}$
$= \frac{2502}{14}$
$= 178.71\ \text{m}^4$
Hence, the standard deviation is
$S = \sqrt{178.71}$
$= 13.37\ \text{m}^2$
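The figures in this example can be verified with Python's standard library, whose `variance` and `stdev` functions use the sample (n − 1) denominator. The variable name `areas` is an assumption based on the m² unit used in the solution.

```python
from statistics import mean, stdev, variance

# The 15 observations from the example (units of m², per the solution).
areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]

x_bar = mean(areas)    # 125
s2 = variance(areas)   # sample variance: 2502 / 14
s = stdev(areas)       # positive square root of the sample variance

print(x_bar)           # 125
print(round(s2, 2))    # 178.71
print(round(s, 2))     # 13.37
```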
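Variance for grouped frequency distribution
$S^2 = \frac{\sum_{i} f_i (X_{ci} - \bar{x})^2}{n - 1}$
where
fi = frequency of the ith class
Xci = class mark of the ith class
n = total number of observations in the sample, and x̄ = mean of the grouped data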
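A minimal sketch of the grouped variance, assuming the usual form in which each squared deviation of a class mark from the grouped mean is weighted by its class frequency and the sum is divided by n − 1. The frequency table is hypothetical and the function name is illustrative.

```python
def grouped_mean_and_variance(class_marks, frequencies):
    """Grouped mean and sample variance from class marks and frequencies."""
    n = sum(frequencies)
    x_bar = sum(f * x for f, x in zip(frequencies, class_marks)) / n
    s2 = sum(f * (x - x_bar) ** 2
             for f, x in zip(frequencies, class_marks)) / (n - 1)
    return x_bar, s2


# Hypothetical frequency table: class marks and their frequencies (n = 100).
class_marks = [5, 15, 25, 35, 45]
frequencies = [15, 25, 25, 20, 15]

x_bar, s2 = grouped_mean_and_variance(class_marks, frequencies)
print(x_bar)          # 24.5
print(round(s2, 2))   # 166.41
```

The grouped standard deviation is the square root of this value.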
Example 3
S =
5. Coefficient of variation
The SD is an absolute measure of the deviation of
observations around their mean and is expressed in
the same units as the data.
Because of this, the standard deviation cannot be used directly
to compare the variability of data sets measured in different
units or with very different means.
The coefficient of variation, a unit-free relative measure, is
often used for such comparisons.
The coefficient of variation (CV) is defined by:
$CV = \frac{S}{\bar{x}} \times 100\%$
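A short sketch of the CV, applied to the 15 observations used in the standard deviation example. The rescaled copy of the data is hypothetical and is included only to show that the CV does not depend on the unit of measurement.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(data) / mean(data) * 100


areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]

cv = coefficient_of_variation(areas)

# Hypothetical rescaling (same data in a unit 100 times smaller):
# the SD and the mean scale together, so the CV is unchanged.
cv_rescaled = coefficient_of_variation([x * 100 for x in areas])

print(round(cv, 2))  # 10.69
print(abs(cv - cv_rescaled) < 1e-9)  # True
```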
Summary Cont---
Data type vs Measure of central tendency and
dispersion
Describing Data: Summary
1. Do the data have order?
No (nominal data): use a frequency table (and mode).
Yes: go to 2.
2. Does the measure have the interval property?
No: use medians and IQR.
Yes: plot/compare mean and median, then go to 3.
3. Are the data skewed?
Yes: use medians and IQR (or consider data transformation).
No: use mean and standard deviation.
Exercise
1. What general measures are used to describe
frequency distributions for quantitative data?
2. What are the most commonly used measures of central
tendency?
3. Which is a more stable indicator of central tendency,
the median or the mean?
4. What is the relationship among mean, median, and
mode in a symmetric frequency distribution?
5. What are the measures of dispersion commonly used
in biostatistics?
6. Name three terms that are used to describe the
shape of frequency distributions.
QUIZ
1. Which measure of central tendency is appropriate for skewed
data?
A. Mean B. Median C. Mode D. Range
2. The exam scores out of 10 for seven students were: 7, 6, 5, 7, 4,
2, 3. Based on the information given, calculate:
A. Median B. Mode C. Range D. IQR E. SD