Chapter 3

This document summarizes various measures of central tendency including the arithmetic mean, median, and mode. It provides definitions and examples of calculating each measure for both raw and grouped data. Additional concepts covered include properties of the measures, quartiles, and percentiles. Formulas are given for calculating the median of grouped data and percentiles. An example is provided for finding the 10th and 90th percentiles of a birth weight data set.

Uploaded by

ABAY

Chapter 3

Summarization of Data

03/09/2021 Tesfa S. 1
Measures of Central Tendency/ Measures of Location

 “Central tendency”: the tendency of statistical data to concentrate around certain values.
 The measures of central tendency include:
 the arithmetic mean,
 the median, and
 the mode.

1. Arithmetic Mean/simple Mean

 It is the sum of all observations divided by the number of observations.
 It is usually denoted by µ (population mean) or x̄ (sample mean).
 Let X1, X2, ..., XN be N measurements obtained from N subjects. The mean of these ungrouped measurements is defined as:

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

The mean for grouped data can be computed as follows:

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$

 where k = the number of classes,
 Xi = the class mark of the ith class, and
 fi = the frequency of the ith class.

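The grouped-mean formula can be sketched in a few lines of Python. The function name `grouped_mean` and the three classes below are hypothetical illustrations, not data from these slides.

```python
def grouped_mean(class_marks, frequencies):
    """Mean of grouped data: sum(f_i * X_i) / sum(f_i)."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("at least one observation is required")
    return sum(f * x for x, f in zip(class_marks, frequencies)) / total


# Hypothetical classes 10-19, 20-29, 30-39 with class marks 14.5, 24.5, 34.5.
class_marks = [14.5, 24.5, 34.5]
frequencies = [5, 10, 5]

print(grouped_mean(class_marks, frequencies))  # 24.5
```

Because every observation in a class is replaced by the class mark, the result is only an approximation of the mean of the raw data.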
Properties of the arithmetic mean
 Uniqueness: For a given set of data there is only one arithmetic
mean
 Simplicity: The mean is easily understood and easy to compute
 Center of gravity: The algebraic sum of the deviations of the given values from their arithmetic mean is always zero, i.e. ∑(xi − x̄) = 0. So the mean is the center of gravity of the given data set.
 Sensitivity: Since each and every value in a set of data enters into the computation of the mean, it is greatly affected by extreme values.
 In a skewed distribution, it is an undesirable measure of central tendency.

Example 1
 Consider the birth weights (in kg) of 10 newborn children at the University of Gondar hospital:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Then the average birth weight can be computed as:

$$\bar{X} = \frac{2.51 + 3.01 + \dots + 2.43}{10} = \frac{25.72}{10} = 2.572 \text{ kg}$$

 For example, if the infant weighing 1.98 kg in the above data had instead been a premature infant weighing 0.50 kg, the arithmetic mean of the sample would be reduced to 2.424 kg.

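The calculation in Example 1, and the sensitivity of the mean to a single extreme value, can be checked with a short Python sketch. The variable names are illustrative only.

```python
from statistics import mean

# Birth weights (kg) of the 10 newborns in Example 1.
weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

# Same data, with the 1.98 kg infant replaced by a 0.50 kg premature infant.
with_premature = [0.50 if w == 1.98 else w for w in weights]

print(round(mean(weights), 3))         # 2.572
print(round(mean(with_premature), 3))  # 2.424
```

Changing one observation out of ten moves the mean by about 0.15 kg.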
Example 2

 Now, let us compute the grouped mean for the grouped frequency distribution given below:
 The grouped frequency distribution for current age of women

Example 2…

 Where: fi = frequency of the ith class
Xc = the class mid-point
n = total sample size
 Thus, the two means (the mean from the raw data and the mean from the grouped data) are almost the same.
 Hence, we can say that the grouped frequency distribution represents the raw data well.

2. Median
 An alternative measure of central location, perhaps second in
popularity to the arithmetic mean.
 Suppose there are n observations in a sample. If these observations
are ordered from smallest to largest, then the median is defined as
follows:
 The median is a value such that at least half of the observations are less than or equal to it and at least half of the observations are greater than or equal to it.
 Median means middle: the median is the middle of a set of data that has been put into rank order.
 To find the median of a data set:
 Arrange the data in ascending order.
 Find the middle observation of this ordered data.

Median…

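 If the number of observations n is ODD, then the median is the middle data point.

$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation}$$

 If the number of observations n is EVEN, then the median is the average of the two values around the middle.

$$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ observation}}{2}$$
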
Properties of Median
 Uniqueness: There is only one median for a
given set of data
 Simplicity: Median is easy to compute
 Insensitivity: The median is a positional average. In contrast to the mean, the median is not influenced to the same extent by extreme values.

Example:
 Consider the weights (in kg) of 10 newborn children at the University of Gondar hospital within a month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Find the median of the data.
 First arrange the data in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
 As n = 10 is even, we take the two middle observations (the 5th and 6th), and the median is their average: (2.43 + 2.51)/2 = 2.47 kg.

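The odd/even rule for the median can be sketched as follows. This is a minimal illustration; the function name `median_of` is an assumption and the standard library's `statistics.median` does the same job.

```python
def median_of(values):
    """Median by the odd/even rule applied to the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle observation.
        return ordered[mid]
    # Even n: the average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


weights = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]
print(round(median_of(weights), 2))  # 2.47
```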
Median…

Median for grouped data:


 The median for grouped data is defined by:

$$\text{Median} = LCB + \left(\frac{\frac{n}{2} - F_c}{f_c}\right) W$$

 Where:
LCB = lower class boundary of the median class
Fc = cumulative frequency just before the median class
fc = frequency of the median class
W = class width, and n = number of observations.

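A minimal Python sketch of the grouped-median calculation is given below, assuming equal class widths and taking the median class to be the first class whose cumulative frequency reaches n/2. The function name and the four classes are hypothetical, not the data of the slides.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median = LCB + ((n/2 - Fc) / fc) * W for equal-width classes."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("at least one observation is required")
    cumulative = 0  # Fc: cumulative frequency before the current class
    for lcb, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= n / 2:
            # First class whose cumulative frequency reaches n/2.
            return lcb + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


# Hypothetical classes with boundaries 9.5-19.5, 19.5-29.5, 29.5-39.5, 39.5-49.5.
lower_boundaries = [9.5, 19.5, 29.5, 39.5]
frequencies = [3, 5, 8, 4]

print(grouped_median(lower_boundaries, frequencies, width=10))  # 32.0
```

Here n/2 = 10, the cumulative frequencies are 3, 8, 16, 20, so the third class is the median class and the median is 29.5 + (10 − 8)/8 × 10 = 32.0.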
Example 1

 As we can see from the distribution, the 120th observation first falls in the class with cumulative frequency 155, since 120 does not exceed 155. So, the median class is the 4th class.

3. Mode
 The mode is the value that appears most frequently.
 It can be obtained by counting the number of appearances of each observation in the list.
 It is important for summarising nominal/categorical types of data.
 Disadvantages:
 With a small number of observations, there may be no mode.
 In addition, there may sometimes be more than one mode.
 Example
a. 22, 66, 69, 70, 73 (no modal value)
b. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg)
c. 1, 2, 2, 3, 3, 1, 4, 7, 9 (modal values = 1, 2, 3; each occurs twice)

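Counting appearances can be sketched with `collections.Counter`. The helper name `modes` is hypothetical, and it follows the convention of the examples above that a data set in which no value repeats has no mode.

```python
from collections import Counter


def modes(values):
    """Return all modal values, or [] when no value occurs more than once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([22, 66, 69, 70, 73]))                                # []
print(modes([1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]))  # [3.0]
print(modes([1, 2, 2, 3, 3, 1, 4, 7, 9]))                         # [1, 2, 3]
```

In the last list the values 1, 2 and 3 each occur twice, so all three are modal.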
Properties of Mode
 It is not affected by extreme values.
 It can be calculated for distributions with open-ended classes.
 Its value may not be unique.
 The main drawback of the mode is that it often does not exist.

Quartiles
 Quartiles are quantiles that divide the distribution into four equal parts.
 The 25th percentile demarcates the first quartile (Q1),
 the median or 50th percentile demarcates the second quartile (Q2),
 the 75th percentile demarcates the third quartile (Q3), and
 the 100th percentile demarcates the fourth quartile (Q4).

Percentiles

 The pth percentile is the value Vp such that p percent of the sample points are less than or equal to Vp.
 Percentiles are less sensitive to outliers and are not greatly affected by the sample size.
 A different definition is needed for the pth percentile, depending on whether np/100 is an integer or not.
With the sample sorted in ascending order, the pth percentile is defined by:
1. The (k+1)th ordered value if np/100 is not an integer (where k is the largest integer less than np/100).
2. The average of the (np/100)th and (np/100 + 1)th ordered values if np/100 is an integer.
NB. To obtain an interpolated value at a fractional position, multiply the difference between the two neighbouring observations by the fractional part and add the result to the lower observation.

Example 1
Suppose the sample consists of the birth weights (in grams) of all live-born infants born at a private hospital in a city during a 1-week period. The sample is shown below:
3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609, 3314,
3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834
Compute the 10th and 90th percentiles for the birth weight data.

Sorting the data from smallest to largest gives:
2069 2581 2759 2834 2838 2841 3031 3101 3200 3245
3248 3260 3265 3314 3323 3484 3541 3609 3649 4146

Example 1…

Solution: n = 20; p = 10 and 90
Since np/100 = 20×10/100 = 2 and 20×90/100 = 18 are integers, the 10th and 90th percentiles are defined by:
10th percentile = the average of the 2nd and 3rd ordered values = (2581 + 2759)/2 = 2670 g
90th percentile = the average of the 18th and 19th ordered values = (3609 + 3649)/2 = 3629 g

 We would estimate that 80 percent of birth weights fall between 2670 g and 3629 g, which gives us an overall feel for the spread of the distribution.

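The two-case percentile definition can be sketched in Python as below. The function name `percentile` is illustrative, and p is assumed to be a whole number strictly between 0 and 100 so that the test "np/100 is an integer" can be done exactly with integer arithmetic.

```python
def percentile(values, p):
    """pth percentile (integer p, 0 < p < 100) by the two-case definition."""
    if not values:
        raise ValueError("at least one observation is required")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        # np/100 is an integer k: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # np/100 is not an integer: take the (k+1)-th ordered value (0-based index k).
    return ordered[k]


birth_weights = [3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200,
                 3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834]

print(percentile(birth_weights, 10))  # 2670.0
print(percentile(birth_weights, 90))  # 3629.0
```

Library routines such as `statistics.quantiles` use interpolation rules that can give slightly different values from this definition.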
Skewness
 Skewness is a lack of symmetry in a distribution, caused by the presence of extremely low or extremely high observations.
 Based on the type of skewness, distributions can be:
 Symmetrical distribution: It is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.
 If the distribution is symmetric and has only one mode, all three measures (mean, median, and mode) are the same, an example being the normal distribution.

Positively skewed distribution: Occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.

For positively skewed distributions (where the upper, or right, tail of the distribution is longer than the lower, or left, tail), the measures are ordered as follows:
mode < median < mean.

Negatively skewed distribution: Occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.

For negatively skewed distributions (where the lower, or left, tail of the distribution is longer than the upper, or right, tail), the reverse ordering occurs:

mean < median < mode.

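The ordering of the three measures under skewness can be illustrated with two small made-up data sets (hypothetical values, chosen so that one is the mirror image of the other). The ordering is a rule of thumb that holds for typical unimodal skewed data, and this sketch only shows it for these particular values.

```python
from statistics import mean, median, mode

# Hypothetical data: most values are small, a few are large (long right tail).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same data (long left tail).
left_skewed = [16 - x for x in right_skewed]

for label, data in [("positive", right_skewed), ("negative", left_skewed)]:
    print(label, mode(data), median(data), mean(data))
# positive 2 3.0 4.6
# negative 14 13.0 11.4
```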
Kurtosis
 Kurtosis refers to the appearance of the peak of a curve, as well as of its tails, relative to a normal distribution.
 Data distributions with high kurtosis generally exhibit high and steep peaks near the mean, with heavier tails;
 data with low kurtosis exhibit broader and flatter peaks than the normal distribution.
 A Gaussian (normal) curve has zero skewness and zero excess kurtosis (kurtosis measured relative to the normal distribution).
NB. Kurtosis describes the shape of a distribution, not its spread: distributions with the same variance can differ in kurtosis.

2. Measures of Dispersion/ Variation

 Measures of dispersion or variability give us information about the spread of the scores: how closely the rest of the data fall about the central value of our distribution.

 Moreover, two or more data sets may have the same mean and/or median but still be quite different.

 Thus, to have a clear picture of the data, one needs a measure of dispersion or variability (scatteredness) amongst the observations in the set.

 Consider the following three datasets:
Dataset 1: 7, 7, 7, 7, 7, 7; mean = 7, s.d. = 0
Dataset 2: 6, 7, 7, 7, 7, 8; mean = 7, s.d. = 0.63
Dataset 3: 3, 2, 7, 8, 9, 13; mean = 7, s.d. = 4.05
Which one is more scattered?
 Two treatments to prolong the life of a diseased individual:
- Drug A: average survival 1.6 years
- Drug B: average survival 1.1 years
Is drug A better?

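The three datasets above share the same mean but differ in spread, which a short Python check makes concrete. The sketch assumes the sample standard deviation (divisor n − 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

datasets = {
    "Dataset 1": [7, 7, 7, 7, 7, 7],
    "Dataset 2": [6, 7, 7, 7, 7, 8],
    "Dataset 3": [3, 2, 7, 8, 9, 13],
}

for name, values in datasets.items():
    # Same mean everywhere; only the standard deviation separates them.
    print(f"{name}: mean = {mean(values):.1f}, s.d. = {stdev(values):.2f}")
# Dataset 1: mean = 7.0, s.d. = 0.00
# Dataset 2: mean = 7.0, s.d. = 0.63
# Dataset 3: mean = 7.0, s.d. = 4.05
```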
1. RANGE:
 It is the difference between the largest and smallest observations in the data:
R = largest value − smallest value in the data set

EXAMPLE: Consider the weights (in kg) of 10 newborn children at the University of Gondar hospital within a month:
2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.

 Then the range can be computed by arranging all observations in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
 Range = maximum − minimum = 3.25 − 1.98 = 1.27 kg
 It wastes information: it takes no account of the values between the two extremes.

 The usefulness of the range is limited. The fact that it takes into account only two values causes it to be a poor measure of dispersion.
 The main advantage of using the range is the simplicity of its computation.
 The extreme values may be unreliable; that is, they are the most likely to be faulty.

2. The interquartile range (IQR):
 It reflects the variability among the middle 50 percent of the observations in a data set.
 It is the difference between the third and the first quartiles: IQR = Q3 − Q1.
 To compute it:
 Sort the data in ascending order.
 Find the first quartile (Q1).
 Find the third quartile (Q3).
 Then calculate the difference Q3 − Q1.

Example
Given the following data set (ages of patients), find the interquartile range.
18, 59, 24, 42, 21, 23, 24, 32
1. Sort the data from lowest to highest:
18 21 23 24 24 32 42 59
2. Find the bottom and top quarters of the data.
3. Find the difference (interquartile range) between the two quartiles.

Example …

 1st quartile = the {1/4 (n+1)}th observation = the (2.25)th observation = 21 + (23 − 21) × 0.25 = 21.5
 3rd quartile = the {3/4 (n+1)}th observation = the (6.75)th observation = 32 + (42 − 32) × 0.75 = 39.5
Hence, IQR = 39.5 − 21.5 = 18
i.e. the middle 50% of the patients' ages lie between 21.5 and 39.5.
 The IQR is a preferable measure to the range for skewed data.
 It can also be computed for distributions with open-ended classes.

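The (n+1)-position rule with linear interpolation used in this example can be sketched as follows. The helper name `quantile_n_plus_1` is hypothetical, and positions outside the data are clamped to the first or last observation as an added assumption.

```python
def quantile_n_plus_1(values, q):
    """Value at position q*(n+1) in the sorted data, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    # 1-based, possibly fractional position, clamped to the range [1, n].
    position = min(max(q * (n + 1), 1), n)
    lower = int(position)
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    # Lower observation plus the fractional part of the gap to the next one.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1 = quantile_n_plus_1(ages, 0.25)
q3 = quantile_n_plus_1(ages, 0.75)
print(q1, q3, q3 - q1)  # 21.5 39.5 18.0
```

Other quartile conventions exist, so software defaults may report slightly different quartiles for the same data.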
 While the interquartile range eliminates the problem of outliers, it creates another problem in that you are eliminating half of your data.
 The solution to both problems is to measure variability from the center of the distribution.

3. Variance

Variance:

 Variance measures how far, on average, scores deviate or differ from the mean.

 To compute the variance, we start by computing the deviation of each observation from the mean.
 By the property of the mean, the sum of the deviations of the observations from the mean is zero.

Variance:

 Hence, to avoid this problem, we take the square of each deviation from the mean.
 Thus the variance is defined as the sum of the squared deviations of the observations from the mean, divided by the total number of observations.
 Mathematically, the formula for the population variance is defined as:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

 Mathematically, the formula for the sample variance is defined as:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

4. Standard Deviation
Standard Deviation:
 The sample and population standard deviations are denoted by S and σ (by convention), respectively.
 The standard deviation (S.D.) is just the positive square root of the variance.
 It expresses exactly the same information as the variance, but re-scaled to be in the same units as the mean.
 It is the best measure of dispersion for normally distributed data.
 Mathematically, the population standard deviation is:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

Standard Deviation:

 The sample standard deviation can be defined as:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

 Example 1: The areas of surfaces sprayable with DDT in a sample of 15 houses were measured as follows (in m²):

101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145

Example 1
 Find the variance and standard deviation of the above distribution.
 Solution
The mean of the sample is 125 m².
Variance (sample) = s² = Σ(xi − x̄)²/(n − 1) = {(101 − 125)² + (105 − 125)² + … + (145 − 125)²} / (15 − 1)
= 2502/14
= 178.71 m⁴
Hence, the standard deviation
= √178.71
= 13.37 m²

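The steps of this solution (mean, squared deviations, division by n − 1, square root) can be traced in Python. The variable names are illustrative only.

```python
from math import sqrt

# Sprayable surface areas (m^2) of the 15 houses in Example 1.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

n = len(areas)
x_bar = sum(areas) / n                                    # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in areas)         # sum of squared deviations
variance = sum_sq_dev / (n - 1)                           # sample variance
sd = sqrt(variance)                                       # sample standard deviation

print(x_bar, sum_sq_dev, round(variance, 2), round(sd, 2))  # 125.0 2502.0 178.71 13.37
```

The same results come from `statistics.variance(areas)` and `statistics.stdev(areas)`, which also use the n − 1 divisor.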
Variance for grouped frequency distribution

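 In a grouped frequency distribution, the variance is computed as:

$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_{ci} - \bar{x})^2}{n - 1}$$

 Where:
fi = frequency of the ith class
Xci = class mark of the ith class
x̄ = mean of the grouped data
k = number of classes
n = total number of observations in the sample
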
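A minimal sketch of the grouped-variance calculation is shown below, assuming the sample form with divisor n − 1. The function name and the three classes are hypothetical, not data from these slides.

```python
def grouped_variance(class_marks, frequencies):
    """Sample variance of grouped data: sum(f_i * (Xc_i - mean)^2) / (n - 1)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Grouped mean: frequency-weighted average of the class marks.
    mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n
    squared = sum(f * (x - mean) ** 2 for x, f in zip(class_marks, frequencies))
    return squared / (n - 1)


# Hypothetical classes with class marks 14.5, 24.5, 34.5.
class_marks = [14.5, 24.5, 34.5]
frequencies = [5, 10, 5]

print(round(grouped_variance(class_marks, frequencies), 2))  # 52.63
```

Here the grouped mean is 24.5 and the weighted sum of squared deviations is 1000, so the variance is 1000/19.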
Example 3

 Consider the previous data on time spent by college students on leisure activities.

5. Coefficient of variation
 The SD is an absolute measure of the deviation of observations around their mean and is expressed in the same unit as the data.
 Because of this, the standard deviation cannot be used directly to compare the variability of different data sets.
 The coefficient of variation is often used for comparison purposes.
 The coefficient of variation (CV) is defined by:

$$CV = \frac{S}{\bar{x}}$$

(often multiplied by 100 and expressed as a percentage)

 The CV is most useful in comparing the variability of several different samples, each with a different mean.

Coefficient of variation…
 The CV is a relative measure, free from the unit of measurement.
 Example

Weights of newborn elephants (kg)    Weights of newborn mice (kg)
929   853                            0.72   0.42
878   939                            0.63   0.31
895   972                            0.59   0.38
937   841                            0.79   0.96
801   826                            1.06   0.89

n = 10, x̄ = 887.1                    n = 10, x̄ = 0.68
s = 56.50                            s = 0.255
CV = 0.0637                          CV = 0.375

Mice show greater birth-weight variation.

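The comparison in this example can be reproduced with a short Python sketch. The helper name `coefficient_of_variation` is illustrative, and it uses the sample standard deviation divided by the mean, without multiplying by 100.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sample standard deviation / mean (a unit-free ratio)."""
    return stdev(values) / mean(values)


elephants = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

print(round(coefficient_of_variation(elephants), 4))  # 0.0637
print(round(coefficient_of_variation(mice), 3))       # 0.378
```

The value for mice is slightly above 0.375 because the code divides by the unrounded mean 0.675 rather than 0.68. Either way the CV for mice is roughly six times that for elephants, even though their standard deviation is far smaller in absolute terms.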
When to use the coefficient of variation

 When the comparison groups have very different means (the CV is suitable as it expresses the standard deviation relative to its corresponding mean).

 When different units of measurement are involved, e.g. group 1 is measured in mm and group 2 in gm (the CV is suitable for comparison as it is unit-free).

 In such cases, the standard deviation should not be used for comparison.

Summary Cont---
Data type vs. measure of central tendency and dispersion

Data type    Central tendency    Measure of dispersion
Nominal      Mode                IQR
Ordinal      Median              Range
Interval     Mean                SD
Ratio        Mean                SD

Describing Data: Summary

1. Do the data have order?
No: nominal data. Use a frequency table (and mode).
Yes: go to question 2.
2. Plot/compare the mean and median. Are the data skewed?
Yes: use medians and IQR (or consider data transformation).
No: go to question 3.
3. Does the measure have the interval property?
No: use medians and IQR.
Yes: use mean and standard deviation.

Exercise
1. What general measures are used to describe
frequency distributions for quantitative data?
2. What are most commonly used measures of central
tendency?
3. Which is a more stable indicator of central tendency,
the median or the mean?
4. What is the relationship among mean, median, and
mode in a symmetric frequency distribution?
5. What are the measures of dispersion commonly used
in biostatistics?
6. Name three terms that are used to describe the shape of frequency distributions.

THANK YOU

QUIZ
1. Which measure of central tendency is appropriate for skewed data?
A. Mean B. Median C. Mode D. Range
2. The exam scores out of 10 for seven students were: 7, 6, 5, 7, 4, 2, 3. Based on the information given, calculate the:
A. Median B. Mode C. Range D. IQR E. SD