Biostat Lecture 3-1

The document discusses various methods for organizing and presenting data collected from surveys, including ordering data numerically or categorically, displaying data in frequency distributions and tables, and using graphical representations like diagrams. It provides examples of how to construct frequency distributions, tables with one or two variables, and guidelines for effective table construction. The goal is to condense and simplify raw data to make patterns and relationships more evident.

Uploaded by

ODAA TUBE
Copyright
© All Rights Reserved

Lecture 3

Methods of Data Processing, Organization, Presentation and Summarization

1
Methods of data organization and presentation

• The data collected in a survey are called raw data.

• In most cases, useful information is not immediately evident from the mass of unsorted data.

• Collected data need to be organized so as to condense the information they contain and show patterns of variation clearly.

2
Precise methods of analysis can be decided upon only when the characteristics of the data are understood.

To this end, different techniques of data organization and presentation, such as ordered arrays, tables and diagrams, are used.

3
In general, data can be summarized and organized through:

1. Frequency Distributions

2. Graphical Representations

3. Measures of Central Tendency

4. Measures of variability

4
Frequency Distributions
o To make data easier to appreciate and to allow quick comparisons, it is often useful to arrange them in a table or in one of several graphical forms.

o When analyzing voluminous data collected from, say, a health center's records, it is useful to put them into compact tables.

o Quite often, data are presented in a meaningful way by preparing a frequency distribution.

o If this is not done, the raw data will convey little meaning and any pattern in them may go undetected.

5
Array
An array (ordered array) is a serial arrangement of numerical data in ascending or descending order.

It shows the range over which the items are spread and gives an idea of their general distribution.

It is very difficult to prepare when the sample size is large.

Hence it is an appropriate way of presentation only when the data set is small (usually fewer than 20 observations).

6
Ordered Array
12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67
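
As a minimal sketch in Python (the variable names are illustrative and not part of the lecture), the values shown above can be arranged into an ordered array with the built-in `sorted` function, after which the range is read off the two ends:

```python
# Values from the ordered array above, typed in row by row.
values = [12, 19, 27, 36, 42, 59,
          15, 22, 31, 39, 43, 61,
          17, 23, 31, 41, 44, 65,
          18, 26, 34, 41, 54, 67]

ordered = sorted(values)                 # ascending order
smallest, largest = ordered[0], ordered[-1]
spread = largest - smallest              # range over which the items are spread

print(ordered)
print(f"min = {smallest}, max = {largest}, range = {spread}")
```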

7
• The actual summarization and organization of data starts from the frequency distribution.

• Frequency distribution: a table that lists each of the possible values the data can assume, along with the number of times each value occurs.

8
• For nominal and ordinal data, frequency distributions are often used as a summary.
• Example:

• The percentage of times that each value occurs, or the relative frequency, is often listed.

• Tables make it easier to see how the data are distributed.

9
• For both discrete and continuous data, the
values are grouped into non-overlapping
intervals, usually of equal width.

10
a) Qualitative variable: Count the number of cases in
each category.

- Example 1: The intensive care unit (ICU) type of 25 patients entering the ICU at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other

11
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)

Medical     12             0.48
Surgical     6             0.24
Cardiac      5             0.20
Other        2             0.08

Total       25             1.00
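
The counting step behind this table can be sketched in Python. The raw list of 25 records is not given in the lecture, so the list below is a hypothetical reconstruction from the counts in the table; `collections.Counter` then tallies each category and the relative frequency is each count divided by the total.

```python
from collections import Counter

# Hypothetical raw records, rebuilt from the counts in the table above.
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2

counts = Counter(icu_types)                       # frequency of each category
n = sum(counts.values())                          # total number of patients
relative = {category: count / n for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category:<10}{count:>4}{relative[category]:>8.2f}")
print(f"{'Total':<10}{n:>4}{sum(relative.values()):>8.2f}")
```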

12
Example 2:
A study was conducted to assess the characteristics of a
group of 234 smokers by collecting data on gender and
other variables.
Gender, 1 = male, 2 = female

Gender        Frequency (n)    Relative Frequency
Male (1)      110              47.0%
Female (2)    124              53.0%
Total         234              100%

13
b) Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed in
one, and only one, of the intervals.

- The first consideration is how many intervals to include.

14
For a continuous variable (e.g. age), the frequency distribution of the individual ages is not very informative.

15
• We “see more” in frequencies of age values in “groupings”. Here, 10-year groupings make sense.
• This gives a grouped-data frequency distribution.

16
To determine the number of class intervals and the
corresponding width, we may use:

Sturges' rule:

$$K = 1 + 3.322\,\log_{10} n$$

$$W = \frac{L - S}{K}$$

where
K = number of class intervals
n = no. of observations
W = width of the class interval
L = the largest value
S = the smallest value
17
Example:
Leisure time (hours) per week for 40 college students:

23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10 19
27 29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27 15
21 25 16

K = 1 + 3.322 (log 40) = 6.32 ≈ 6

Maximum value = 38, Minimum value = 10

Width = (38 - 10)/6 = 4.67 ≈ 5
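
The calculation above can be checked with a short Python sketch. The variable names are illustrative; rounding K to the nearest integer and rounding the width up are assumptions chosen to match the values used in this example.

```python
import math

# Leisure time (hours) per week for the 40 students listed above.
leisure_hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
                 14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
                 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

n = len(leisure_hours)
k_exact = 1 + 3.322 * math.log10(n)      # Sturges' rule
k = round(k_exact)                       # number of class intervals

largest, smallest = max(leisure_hours), min(leisure_hours)
width_exact = (largest - smallest) / k   # W = (L - S) / K
width = math.ceil(width_exact)           # rounded up to a convenient width

print(f"n = {n}, K = {k_exact:.2f} -> {k}, W = {width_exact:.2f} -> {width}")
```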

18
Time       Frequency   Relative    Cumulative
(Hours)                Frequency   Relative Frequency
10-14        5         0.125       0.125
15-19       11         0.275       0.400
20-24       12         0.300       0.700
25-29        7         0.175       0.875
30-34        3         0.075       0.950
35-39        2         0.050       1.00

Total       40         1.00
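
A table like this can be built programmatically. The sketch below is one possible approach and not the lecture's own procedure: it assumes the first class starts at 10 and uses 6 classes of width 5, as in the example, and counts how many values fall between the lower and upper limit of each class.

```python
# Leisure time (hours) per week for the 40 students in the example.
leisure_hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
                 14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
                 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

first_lower, width, k = 10, 5, 6         # class layout used in the example
n = len(leisure_hours)

rows = []
cumulative = 0
for i in range(k):
    lower = first_lower + i * width
    upper = lower + width - 1            # e.g. 10-14, 15-19, ...
    freq = sum(lower <= x <= upper for x in leisure_hours)
    cumulative += freq
    rows.append((f"{lower}-{upper}", freq, freq / n, cumulative / n))

for label, freq, rel, cum_rel in rows:
    print(f"{label:<8}{freq:>4}{rel:>8.3f}{cum_rel:>8.3f}")
```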

19
• Cumulative frequency: the sum of the frequencies of a class and of all the classes below it.

• Cumulative relative frequency: the proportion (or percentage) of the total number of observations that have a value either in that interval or below it.

• Mid-point: the value that lies midway between the lower and the upper limits of a class.

20
• True limits: the limits that make the intervals of a continuous variable continuous in both directions

• Used for smoothing the class intervals

• Subtract 0.5 from the lower limit and add 0.5 to the upper limit

21
Time       True limit     Mid-point   Frequency
(Hours)
10-14       9.5 – 14.5    12            5
15-19      14.5 – 19.5    17           11
20-24      19.5 – 24.5    22           12
25-29      24.5 – 29.5    27            7
30-34      29.5 – 34.5    32            3
35-39      34.5 – 39.5    37            2

Total                                  40

22
Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity by age, 1989

Age group         Cases
(years)      Number    Percent
0-14            230      0.5
15-19          4378      9.9
20-24         10405     23.6
25-29          9610     21.8
30-34          8648     19.6
35-44          6901     15.7
45-54          2631      6.0
>54            1278      2.9
Total         44081    100

23
Two Variable Table
• Primary and secondary cases of syphilis morbidity by age and sex, 1989

Age group     Number of cases
(years)       Male     Female    Total
0-14            40       190       230
15-19         1710      2668      4378
20-24         5120      5285     10405
25-29         5304      4306      9610
30-34         5537      3111      8648
35-44         5004      1897      6901
45-54         2144       487      2631
>54           1147       131      1278
Total        26006     18075     44081
24
Tables can also be used to present three or more variables.

Variable          Frequency (n)    Percent
Sex
  Male
  Female
Age (yrs)
  15-19
  20-24
  25-29
Religion
  Christian
  Muslim
Occupation
  Student
  Farmer
  Merchant
25
Guidelines for constructing tables
• Keep them simple,
• Limit the number of variables to three or less,
• All tables should be self-explanatory,
• Include clear title telling what, when and where,
• Clearly label the rows and columns,
• State clearly the unit of measurement used,
• Explain codes and abbreviations in the foot-note,
• Show totals,
• If data is not original, indicate the source in foot-note.

26
Diagrammatic Representation

• Pictorial representations of numerical data

27
Importance of diagrammatic representation:

1. Diagrams are more attractive than mere figures.
2. They give a quick overall impression of the data.
3. They are easier to remember than mere figures.
4. They facilitate comparison.
5. They help in understanding patterns and trends.
28
• Well-designed graphs can be a powerful means of communicating a great deal of information.

• Poorly designed graphs not only convey the message ineffectively, they are often misleading.

29
Limitations of Diagrammatic Representation
1. Diagrammatic representation is used only for purposes of comparison. It is not to be used when comparison is either not possible or not necessary.
2. Diagrammatic representation is not an alternative to tabulation. It only strengthens the textual exposition of a subject, and cannot serve as a complete substitute for statistical data.
3. It can give only an approximate idea; where greater accuracy is needed, diagrams are not suitable.
4. Diagrams fail to bring small differences to light.
30
Construction of graphs
• The choice of a particular form among the different possibilities depends on personal preference and/or the type of data.
• Bar charts and pie charts are commonly used for qualitative or quantitative discrete data.
• Histograms and frequency polygons are used for quantitative continuous data.

31
There are, however, commonly accepted general rules for constructing graphs:
1. Every graph should be self-explanatory and as simple as possible.
2. Titles are usually placed below the graph and should answer the questions: What? Where? When? How classified?
3. Legends or keys should be used to differentiate variables if more than one is shown.
4. The axis labels should be placed so that they read from the left side and from the bottom.
5. The units into which the scale is divided should be clearly indicated.
6. The numerical scale representing frequency must start at zero, or a break in the line should be shown.
32
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together (leave space
between bars)
• The different bars should be separated by equal
distances
• All the bars should rest on the same line called
the base
• Label both axes clearly

33
Specific types of graphs include:

Nominal, ordinal data:
• Bar graph
• Pie chart

Quantitative data:
• Histogram
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph

• Others

34
1. Bar Chart
• Bar diagrams are used to represent and compare the frequency distributions of discrete variables and attributes or categorical series.

• When we represent data using a bar diagram, all the bars must have equal width and the distance between bars must be equal.

• There are different types of bar diagrams; the most important ones are:

35
A. Simple bar chart:
• It is a one-dimensional diagram in which the bar
represents the whole of the magnitude.

• The height or length of each bar indicates the size (frequency) of the figure represented.

36
[Figure: simple bar chart; y-axis: Number of Children (0 to 90); x-axis: Immunization Status (Not immunized, Partially immunized, Fully immunized)]

Fig 1. Immunization status of children in x District, Jan 2014

37
B. Multiple bar chart
In this type of chart the component figures are shown as separate bars adjoining each other.
The height of each bar represents the actual value of the component figure.
It depicts the distributional pattern of more than one variable.
– Example of a multiple bar diagram: consider data on the immunization status of women by marital status.

38
Fig. 2 TT Immunization status by marital status of women 15-49
years, Asendabo town, 1996
39
There’s no reason why the bar chart can’t be
plotted horizontally instead of vertically.

[Figure: horizontal multiple bar chart; y-axis: Type of source (CHA, HC, Reading, Training, Campaign, Anti FGMC, CAT); x-axis: Percent (0 to 50); separate bars for female and male]

Figure 1. Source of information on the complications of FGM and participation in RH programs, Jijiga, 2004*.
* FGMC = female genital mutilation committee; CAT = community action team; HC = health centre; CHA = community health agent

40
Example: Construct a bar chart for the following data.

Distribution of patients in hospital by source of referral


Source of referral         No. of patients    Relative freq. (%)
Other hospital                    97                 5.1
General practitioner             769                40.3
Out-patient department           623                32.7
Casualty                         256                13.4
Other                            161                 8.5
Total                           1906               100.0

41
Distribution of patients in hospital X by source of referral, 1999

[Figure: simple bar chart; y-axis: No. of patients (0 to 800); x-axis: Source of referral; bar heights: Other hospital 97, GP 769, OPD 623, Casualty 256, Other 161]

42
C. Component (sub-divided) Bar Diagram
Bars are sub-divided into the component parts of the figure.
These diagrams are constructed when each total is built up from two or more component figures.
They can be of two kinds:
I) Actual Component Bar Diagrams: the overall height of the bars and the individual component lengths represent actual figures.
Example of an actual component bar diagram: the above data can also be presented as below.

43
44
II) Percentage Component Bar Diagram

• The individual component lengths represent the percentage that each component forms of the overall total.

• Note that a series of such bars will all have the same total height, i.e., 100 percent.
o Example of a percentage component bar diagram

45
46
2. Pie chart
• Shows the relative frequency for each category by
dividing a circle into sectors, the angles of which are
proportional to the relative frequency.
• Used for a single categorical variable
• Use percentage distributions

47
Steps to construct a pie-chart
• Construct a frequency table

• Change the frequency into percentage (P)

• Change the percentages into degrees, where:

degree = (P / 100) × 360°

• Draw a circle and divide it accordingly


48
Example: Distribution of deaths for females, in England
and Wales, 1989.

Cause of death            No. of deaths
Circulatory system        100 000
Neoplasm                   70 000
Respiratory system         30 000
Injury and poisoning        6 000
Digestive system           10 000
Others                     20 000
Total                     236 000
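
The pie-chart steps can be sketched in Python for this table: each frequency is turned into a percentage of the total and then into a sector angle out of 360 degrees. The dictionary and variable names are illustrative only.

```python
# Number of deaths by cause, from the table above.
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
percentages = {cause: 100 * count / total for cause, count in deaths.items()}
angles = {cause: pct / 100 * 360 for cause, pct in percentages.items()}

for cause in deaths:
    print(f"{cause:<22}{percentages[cause]:>6.1f}%{angles[cause]:>8.1f} degrees")
```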

49
Distribution of cause of death for females, in England and Wales, 1989

[Figure: pie chart with sectors Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Others 8%, Digestive system 4%, Injury and poisoning 3%]

50
3. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned into
graphs.

• To construct a histogram, we draw the interval boundaries on a horizontal line and the frequencies on a vertical line.

• Non-overlapping intervals that cover all of the data values must be used.

51
• Bars are drawn over the intervals so that the area of each bar is proportional to the frequency of observations in the interval.

• With intervals of equal width, this means the bar heights are proportional to the frequencies.

52
Example: Distribution of the age of women at the time of marriage

Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number         11      36      28      13       7       3       2

[Figure: histogram titled "Age of women at the time of marriage"; y-axis: No. of women (0 to 40); x-axis: Age group, with true class limits 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5]

53
Histogram for the ages of 2087 mothers with <5 children, Adami Tulu, 2003

[Figure: histogram; x-axis: N1AGEMOTH (15.0 to 55.0); y-axis: frequency (0 to 700); Mean = 27.6, Std. Dev = 6.13, N = 2087]

54
Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective groups are lost and difficult to reconstruct

• The other graphic display (stem-and-leaf plot) overcomes these problems

55
4. Stem-and-Leaf Plot
• A quick way to organize data that gives a visual impression similar to a histogram while retaining much more detail on the data.

• Similar to a histogram, it serves the same purpose and reveals the presence or absence of symmetry.
• Most effective with relatively small data sets.

• Not suitable for reports and other communications, but helps researchers to understand the nature of their data.
56
Steps to construct Stem-and-Leaf Plots
1. Separate each data point into stem and leaf components
• Stem = consists of one or more of the initial digits
of the measurement
• Leaf = consists of the rightmost digit
The stem of the number 483, for example, is 48 and the
leaf is 3.
2. Write the smallest stem in the data set in the
upper left-hand corner of the plot

57
Steps …

3. Write the second stem (first stem +1) below the first
stem
4. Continue with the remaining stems until you reach
the largest stem in the data set
5. Draw a vertical bar to the right of the column of
stems
6. For each number in the data set, find the appropriate
stem and write the leaf to the right of the vertical
bar

58
Example: 3031, 3101, 3265, 3260, 3245, 3200, 3248, 3323, 3314, 3484, 3541, 3649 (BWT in g)

Stem | Leaf              Number
30   | 31                1
31   | 01                1
32   | 65 60 45 00 48    5
33   | 23 14             2
34   | 84                1
35   | 41                1
36   | 49                1
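
The construction steps can be sketched in Python. The function `stem_and_leaf` and its `leaf_digits` parameter are hypothetical names introduced here; the example plot above uses the first two digits as the stem and the last two as the leaf, so the sketch is called with `leaf_digits=2`.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_digits=1):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    divisor = 10 ** leaf_digits
    plot = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, divisor)   # split into stem and leaf
        plot[stem].append(leaf)

    lines = []
    # Walk from the smallest to the largest stem, one row per stem.
    for stem in range(min(plot), max(plot) + 1):
        leaves = " ".join(f"{leaf:0{leaf_digits}d}" for leaf in plot.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Birth weights (g) from the example above.
birth_weights = [3031, 3101, 3265, 3260, 3245, 3200, 3248,
                 3323, 3314, 3484, 3541, 3649]

lines = stem_and_leaf(birth_weights, leaf_digits=2)
print("\n".join(lines))
```

Leaves are kept in the order the data were listed, as in the table above; sorting each row first would give an ordered stem-and-leaf plot.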
59
Percentiles (Quartiles)
• Suppose that 50% of a cohort survived at least 4
years.
• This also means that 50% survived at most 4 years.
• We say 4 years is the median.
• The median is also called the 50th percentile
• We write: P50 = 4 years.

60
• Similarly we could speak of other percentiles:
– P0: The minimum
– P25 (the 25th percentile): 25% of the sample values are less than or equal to this value. 1st Quartile
– P50: 50% of the sample values are less than or equal to this value. 2nd Quartile
– P75: 75% of the sample values are less than or equal to this value. 3rd Quartile
– P100: The maximum
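
One simple way to compute these values is the nearest-rank rule, sketched below in Python. This is only one of several percentile conventions in use (statistical software often interpolates instead), and the function name is hypothetical. The data are the 24 values from the ordered-array slide earlier in the lecture.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest observation with at least p% of the sample at or below it."""
    ordered = sorted(values)
    if p == 0:
        return ordered[0]                          # P0 is the minimum
    rank = math.ceil(p / 100 * len(ordered))       # 1-based position
    return ordered[rank - 1]

data = [12, 15, 17, 18, 19, 22, 23, 26, 27, 31, 31, 34,
        36, 39, 41, 41, 42, 43, 44, 54, 59, 61, 65, 67]

for p in (0, 25, 50, 75, 100):
    print(f"P{p} = {percentile_nearest_rank(data, p)}")
```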
61
It is possible to estimate the values of percentiles from a
cumulative frequency polygon.

62
5. Scatter plot
• Most studies in medicine involve measuring more than
one characteristic, and graphs displaying the relationship
between two characteristics are common in literature.

• When both variables are qualitative, we can use a multiple bar graph.

• When one of the characteristics is qualitative and the other is quantitative, the data can be displayed in box-and-whisker plots.

63
• For two quantitative variables we use bivariate plots (also called scatter plots or scatter diagrams).

• In the study on percentage saturation of bile, information was collected on the age of each patient to see whether a relationship existed between the two measures.

64
• A scatter diagram is constructed by drawing X- and Y-axes.
• Each dot represents a pair of values measured for a single study subject.

Age and percentage saturation of bile for women patients in hospital Z, 1998

[Figure: scatter plot; x-axis: Age (0 to 80); y-axis: Saturation of bile (0 to 160)]

65
• The graph suggests the possibility of a positive
relationship between age and percentage
saturation of bile in women.

66
6. Line graph
• Useful for assessing the trend of a particular situation over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along the horizontal axis, and
• Values of the quantity being studied are marked on the vertical axis.
• Values for each category are connected by a continuous line.
• Sometimes two or more lines are drawn on the same graph using the same scale so that they are comparable.

67
No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003

[Figure: line graph; y-axis: No. of confirmed malaria cases (0 to 2100); x-axis: Months (Jan to Dec); lines: Positive, P. falciparum, P. vivax]

68
A line graph can also be used to depict the relationship between two continuous variables, like a scatter diagram.

• The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, for patients with normal fat absorption and with fat malabsorption.

69
Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999

[Figure: line graph; y-axis: Blood zidovudine concentration (0 to 8); x-axis: Time since administration (min.), 10 to 360; lines: Fat malabsorption, Normal fat absorption]

70
Exercise
• Evaluate whether the following graphs are good or bad, and discuss the points that make them good or bad

71
MMRatio per 100,000 live births by age of woman; Giza, Egypt 1984

[Figure: graph; y-axis: MMR per 100,000 LB (0 to 1200); x-axis: Age (15-19 to 45-49); legend: MMR per 100,000 LB]

72
1. The title of the graph tells the reader the
content of the graph. For example:
• the statistic presented (MMRatio);
• the second dimension of the graph (age of
woman on the x axis);
• the metric (per 100,000 live births);
• the source of the data (Giza, Egypt);
• The date (1984);

73
2. The Y axis is labeled (MMR per 100,000
LB);
3. The X axis is labeled (age of woman);
4. The legend is given (_______= MMR);
5. The source of the information is provided
(Kane et al)

74
Maternal Mortality: Countries X, Y and Z since 1850

[Figure: graph with an unlabeled y-axis (0 to 900); series: Sweden, UK, USA]

75
• The Y axis is not labeled;
• The title does not give you the statistic presented in
the graph (Maternal Mortality is not a statistic). This
is particularly problematic when the Y axis is also not
labeled;
• Neither the title nor the Y axis identify the metric (per
100,000 live births).
• The X axis is not labeled – but this is not so serious
when the categories are so obvious and when the
second dimension (year) has been identified in the
graph title.

76
Remember:

A graph is a tool. It is not an artwork to hang above your sofa!

It is more important that it is easy to interpret correctly than that it is pretty!

[Figure: bar chart; x-axis: Antepartum, Intrapartum, Postpartum; series: Pre-eclampsia, Eclampsia]

77
Numerical Summary Measures

Single numbers which quantify the characteristics of a distribution of values

• Measures of central tendency (location)

• Measures of dispersion

78
• A frequency distribution gives a general picture of the distribution of a variable

• But it does not directly indicate the average value or the spread of the values

79
Measures of Central Tendency (MCT)
• On the scale of values of a variable there is a certain point around which the largest number of items tend to cluster.

• Since this point is usually in the centre of the distribution, the tendency of statistical data to concentrate around a certain value is called “central tendency”.

• The various methods of determining the point about which the observations tend to concentrate are called measures of central tendency (MCT).
80
• The objective of calculating an MCT is to determine a single figure which may be used to represent the whole data set.

• In that sense it is an even more compact description of the statistical data than the frequency distribution.

• Since an MCT represents the entire data, it facilitates comparison within one group or between groups of data.

81
[Figure: bar chart titled "Position"; x-axis: classes 0-9 to 90-99; y-axis: 0 to 20]

82
Characteristics of a good MCT
A MCT is good or satisfactory if it possesses the
following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of values
as possible
4. It should have a definite value
5. It should not be subjected to complicated and tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling

83
• The most common measures of central
tendency include:
– Arithmetic Mean

– Median

– Mode

– Others

84
1. Arithmetic Mean
A. Ungrouped Data
• The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location. It is usually denoted by $\bar{x}$.
• It is the sum of all the observations divided by the total number of observations.

85
The Summation Notation

86
87
The heart rates for n=10 patients were as follows (beats
per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these
patients?
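
A minimal Python sketch of this calculation, applying the definition directly (sum of the observations divided by their number); the variable names are illustrative:

```python
# Heart rates (beats per minute) of the 10 patients listed above.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

total = sum(heart_rates)            # sum of all observations
n = len(heart_rates)                # number of observations
mean_rate = total / n               # arithmetic mean

print(f"sum = {total}, n = {n}, mean = {mean_rate} beats per minute")
```

Note how the single low value of 40 pulls the mean below most of the other observations.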

88
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

where,
k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval

89
Example. Compute the mean age of 169 subjects from the grouped data.

Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5               4                58.0
20-29            24.5              66              1617.0
30-39            34.5              47              1621.5
40-49            44.5              36              1602.0
50-59            54.5              12               654.0
60-69            64.5               4               258.0
Total            __               169              5810.5

Mean = 5810.5/169 = 34.38 years
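
The grouped-mean calculation can be reproduced with a short Python sketch. Each class is written as a hypothetical (lower limit, upper limit, frequency) tuple, and the mid-point is taken as the average of the two limits, as in the table.

```python
# (lower limit, upper limit, frequency) for each class interval in the table.
classes = [(10, 19, 4), (20, 29, 66), (30, 39, 47),
           (40, 49, 36), (50, 59, 12), (60, 69, 4)]

total_f = sum(f for _, _, f in classes)                        # sum of f_i
sum_mf = sum((lo + hi) / 2 * f for lo, hi, f in classes)       # sum of m_i * f_i
grouped_mean = sum_mf / total_f

print(f"sum(f) = {total_f}, sum(m*f) = {sum_mf}, mean = {grouped_mean:.2f} years")
```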

90
The mean can be thought of as a “balancing
point”, “center of gravity”

91
When the data are skewed, the mean is “dragged” in
the direction of the skewness

• It is possible in extreme cases for all but one of the sample points
to be on one side of the arithmetic mean & in this case, the mean is
a poor measure of central location or does not reflect the center of
the sample.

92
Properties of the Arithmetic Mean.
• For a given set of data there is one and only one
arithmetic mean (uniqueness).

• Easy to calculate and understand (simple).

• Influenced by each and every value in a data set

• Greatly affected by the extreme values.

• In the case of grouped data, if any class interval is open-ended, the arithmetic mean cannot be calculated.
93
2. Median
a) Ungrouped data
• The median is the value which divides the data set into
two equal parts.

• If the number of values is odd, the median will be the middle value when all values are arranged in order of magnitude.

• When the number of observations is even, there is no single middle value but two middle observations.
• In this case the median is the mean of these two middle observations, when all observations have been arranged in the order of their magnitude.
94
95
96
97
• The median is a better description (than the mean) of
the majority when the distribution is skewed
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
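
The comparison can be checked with Python's standard `statistics` module. The second call, on a list with an even number of values, is an added illustration (not from the lecture) of the rule that the median is then the mean of the two middle observations.

```python
import statistics

scores = [14, 89, 93, 95, 96]

mean_value = statistics.mean(scores)       # pulled down by the outlying 14
median_value = statistics.median(scores)   # middle value of the ordered data

# Even number of observations: the median is the mean of the two middle values.
even_median = statistics.median([14, 89, 93, 95])

print(f"mean = {mean_value}, median = {median_value}, even-n median = {even_median}")
```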

98
b) Grouped data
• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.

• The first step is to locate the class interval that contains the median, using the following procedure.

• Find n/2 and identify the first class interval whose cumulative frequency is at least n/2.

• Then, use the following formula.



99
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

where,
L_m = lower true class boundary of the interval containing the median
F_c = cumulative frequency of the interval just before the median class interval (the one listed above it in the table)
f_m = frequency of the interval containing the median
W = class interval width
n = total number of observations
100
Example. Compute the median age of 169
subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5               4                 4
20-29            24.5              66                70
30-39            34.5              47               117
40-49            44.5              36               153
50-59            54.5              12               165
60-69            64.5               4               169
Total                             169

101
• n/2 = 84.5, which falls in the 3rd class interval (30-39)
• Lower true limit = 29.5, Upper true limit = 39.5
• Frequency of the class = 47
• (n/2 – Fc) = 84.5 - 70 = 14.5

• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33
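
The procedure can be sketched as a small Python function. The name `grouped_median` and its arguments are hypothetical; the sketch assumes equal-width classes with integer limits, so that the true lower limit is the stated lower limit minus 0.5.

```python
def grouped_median(classes, width):
    """Median of grouped data; classes is a list of (lower limit, frequency)."""
    n = sum(freq for _, freq in classes)
    half = n / 2
    cumulative = 0                       # F_c: cumulative frequency so far
    for lower, freq in classes:
        if cumulative + freq >= half:    # first class whose cum. freq reaches n/2
            true_lower = lower - 0.5     # L_m: lower true class boundary
            return true_lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("classes must contain at least one observation")

# (lower limit, frequency) for the age data in the example.
age_classes = [(10, 4), (20, 66), (30, 47), (40, 36), (50, 12), (60, 4)]

median_age = grouped_median(age_classes, width=10)
print(f"median = {median_age:.2f} years")
```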

102
Properties of the median
• There is only one median for a given set of data
(uniqueness)

• The median is easy to calculate

• Median is a positional average and hence it is insensitive to very large or very small values

• Median can be calculated even in the case of open-end intervals

• It is determined mainly by the middle points and is less sensitive to the remaining data points (weakness).

103
Quartiles
• Just as the median is the value above and below which half of the data lie, one can define measures above or below which other fractional parts of the data lie.

• The median divides the data into two equal parts

• If the data are divided into four equal parts, we speak of quartiles.

104
a) The first quartile (Q1): 25% of all the ranked observations are less than Q1.

b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. The second quartile is the median.

c) The third quartile (Q3): 75% of all the ranked observations are less than Q3.
105
Percentiles
• Percentiles simply divide the data into 100 equal parts.

• Percentiles are less sensitive to outliers and not greatly affected by the sample size (n).

106
3. Mode
• The mode is the most frequently occurring value among
all the observations in a set of data.

• It is not influenced by extreme values.

• It is possible to have more than one mode or no mode.

• It is not a good summary of the majority of the data.

107
[Figure: frequency chart (y-axis: N, 0 to 20) with three peaks, each labelled "Mode". Source: T. Ancelle, D. Coulombie]
108
a) Ungrouped data
• It is a value which occurs most frequently in a
set of values.
• If all the values are different there is no
mode, on the other hand, a set of values may
have more than one mode.

109
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
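
The three cases can be reproduced with a short Python sketch. The helper `modes` is a hypothetical function written for this illustration; it returns an empty list when every value occurs only once, matching the "no mode" convention used above.

```python
from collections import Counter

def modes(values):
    """Return all modes in ascending order, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                        # all values different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])

print(unimodal, bimodal, no_mode)
```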

110
b) Grouped data
• To find the mode of grouped data, we
usually refer to the modal class, where the
modal class is the class interval with the
highest frequency.
• If a single value for the mode of grouped
data must be specified, it is taken as the
mid-point of the modal class interval.

111
$$\hat{x} = L + w\left(\frac{f_2}{f_0 + f_2}\right)$$

where
L = lower boundary of the modal class
f_0 = the frequency of the class next below the modal class in value
f_2 = the frequency of the class next above the modal class in value
w = length of the interval of the modal class

112
113
Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with open-end classes
• Often its value is not unique
• The main drawback of the mode is that often it does not exist

114
4. Geometric mean (GM)
• Mainly used for many types of laboratory data, specifically data in the form of concentrations of one substance in another
• Example: the minimum inhibitory concentration of penicillin in urine for N. gonorrhoeae in 74 patients

Concentration (µg/ml)   Frequency
0.03125                 21
0.0625                   6
0.1250                   8
0.250                   19
0.50                    17
1.0                      3

115
If x_1, x_2, ..., x_n are n positive observed values, then

$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$

and

$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}.$$

The geometric mean is generally used with data measured on a logarithmic scale, such as titers of anti-neutrophil immunoglobulin G.

116
Example:
logGM = [21log(0.03125) + 6log(0.0625) +
8log(0.125) + 19log(0.25) + 17log(0.5)
+ 3log(1.0)]/74 = -0.846
The GM = the antilogarithm of -0.846 = 0.143
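
The same calculation as a Python sketch, using base-10 logarithms and weighting each concentration by its frequency (the dictionary layout is an illustrative choice):

```python
import math

# Minimum inhibitory concentration (µg/ml) -> number of patients.
mic_counts = {0.03125: 21, 0.0625: 6, 0.125: 8, 0.25: 19, 0.5: 17, 1.0: 3}

n = sum(mic_counts.values())
log_gm = sum(freq * math.log10(conc) for conc, freq in mic_counts.items()) / n
gm = 10 ** log_gm                        # antilogarithm of the mean log

print(f"n = {n}, log GM = {log_gm:.3f}, GM = {gm:.3f} µg/ml")
```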

117
5. Harmonic mean (HM)
• Just as the geometric mean is based on an arithmetic mean of logarithms, the harmonic mean is based on an arithmetic mean of reciprocals.
• Pertains to rates and time
• We define it as the reciprocal of the arithmetic mean of the reciprocals of the given numbers.

118
If the given numbers are x_1, x_2, ..., x_n, then

$$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}}$$
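
As a small illustration in Python, with made-up numbers that are not from the lecture: if the same distance is covered once at 40 km/h and once at 60 km/h, the average speed over the whole trip is the harmonic mean of the two speeds. The sketch computes it from the definition and compares it with `statistics.harmonic_mean` from the standard library.

```python
import statistics

# Hypothetical speeds (km/h) over two legs of equal distance.
speeds = [40, 60]

n = len(speeds)
hm_by_definition = 1 / (sum(1 / x for x in speeds) / n)   # reciprocal of mean reciprocal
hm_library = statistics.harmonic_mean(speeds)

print(f"harmonic mean = {hm_by_definition:.1f} km/h (library: {hm_library:.1f})")
```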

119
6. Weighted mean (WM)
• In a weighted mean, separate outcomes have separate influences.

• The influence attached to an outcome is the weight.

• A familiar example is the calculation of a course grade as a weighted average of scores on separate outcomes.
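
A course-grade calculation of this kind might look like the following Python sketch. The scores and weights are hypothetical, chosen only for illustration; the weighted mean is the sum of weight times score divided by the sum of the weights.

```python
# Hypothetical course components: (score out of 100, weight).
components = {
    "assignment": (80, 0.2),
    "mid-term exam": (90, 0.3),
    "final exam": (70, 0.5),
}

weighted_sum = sum(score * weight for score, weight in components.values())
total_weight = sum(weight for _, weight in components.values())
weighted_mean = weighted_sum / total_weight

print(f"course grade = {weighted_mean:.1f}")
```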

120
Example:

121
Which measure of central tendency is best with a given
set of data?

• Two factors are important in making this decision:
– The scale of measurement (type of data)
– The shape of the distribution of the observations

122
• The mean can be used for discrete and
continuous data
• The median is appropriate for discrete and
continuous data as well, but can also be used
for ordinal data
• The mode can be used for all types of data,
but may be especially useful for nominal and
ordinal measurements
• For discrete or continuous data, the “modal
class” can be used

123
• The geometric mean is used primarily for
observations measured on a logarithmic
scale.
• Harmonic mean is a suitable MCT when the
data pertains to rates and time.
• Weighted mean is commonly used in the
calculation of mean for different outcomes.

124
(a) Symmetric and unimodal distribution —
Mean, median, and mode should all be
approximately the same

Mean, Median & Mode

125
(b) Bimodal — Mean and median should be
about the same, but may take a value that is
unlikely to occur; two modes might be best

126
(c) Skewed to the right (positively skewed): the
mean is sensitive to extreme values, so the median
might be more appropriate
[Figure: positively skewed distribution with the mode, median, and mean marked]

127
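(d) Skewed to the left (negatively skewed): as in (c),
the mean is sensitive to extreme values, so the median
might be more appropriate
[Figure: negatively skewed distribution with the mode, median, and mean marked]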
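
The effect of skewness on the three measures can be seen in a small example. The data below are hypothetical, with one extreme value on the right.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: one extreme value pulls the mean upward
values = [2, 3, 3, 4, 5, 6, 40]

m_mean = mean(values)
m_median = median(values)
m_mode = mode(values)

print(m_mode, m_median, m_mean)
```

Here the mode (3) and median (4) stay near the bulk of the data, while the mean (9) is pulled toward the extreme value.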

128
Measures of Dispersion
Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200

Two or more sets may have the same mean and/or median but they
may be quite different.
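
A quick check of sets A and B, as a minimal sketch: the means agree, but the range and the sample standard deviation show how differently the values are spread.

```python
from statistics import mean, stdev

data_a = [177, 193, 195, 209, 226]
data_b = [192, 197, 200, 202, 209]

mean_a, mean_b = mean(data_a), mean(data_b)
range_a = max(data_a) - min(data_a)
range_b = max(data_b) - min(data_b)

print("A:", mean_a, range_a, round(stdev(data_a), 2))
print("B:", mean_b, range_b, round(stdev(data_b), 2))
```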

129
These two distributions have the same mean,
median, and mode

130
• MCT are not enough to give a clear
understanding about the distribution of the
data.

• We need to know something about the
variability or spread of the values: whether
they tend to be clustered close together, or
spread out over a broad range

131
Measures of Dispersion
• Measures that quantify the variation or dispersion of
a set of data from its central location

• Dispersion refers to the variety exhibited by the
values of the data.

• The amount is small when the values are close
together.

• If all the values are the same, there is no dispersion.

132
Measures of Dispersion
Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measures of Scatter”

133
• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others

134
1. Range (R)
• The difference between the largest and smallest
observations in a sample.

• Range = Maximum value – Minimum value

• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• Data sets with a higher range exhibit more
variability
135
Properties of range
• It is the simplest crude measure and can be
easily understood
• It takes into account only two values, which
makes it a poor measure of dispersion
• Very sensitive to extreme observations
• The larger the sample size, the larger the
range tends to be
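
The sensitivity of the range to a single extreme observation can be sketched as follows. The first data set is the one from the range example; the added value of 120 is hypothetical.

```python
def value_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)

data = [5, 9, 12, 16, 23, 34, 37, 42]
with_extreme = data + [120]  # hypothetical extreme observation

r_original = value_range(data)
r_extreme = value_range(with_extreme)

print(r_original, r_extreme)
```

One added observation moves the range from 37 to 115, although the other eight values are unchanged.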

136
2. Interquartile range (IQR)
• Indicates the spread of the middle 50% of the
observations, and is used with the median

IQR = Q3 - Q1

• Example: Suppose the first and third quartiles for
weights of girls 12 months of age are 8.8 kg and 10.2
kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2
kg.

137
The two quartiles (Q3 and Q1) form the basis of the
box-and-whisker plot.
[Figure: box-and-whisker plots for Variables A, B, and C; vertical axis from 0 to 10]

138
Properties of IQR:
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two
specific values
• It is important in selecting cut-off points in the
formulation of clinical standards
• Since it excludes the lowest and highest 25% values,
it is not affected by extreme values
• Less sensitive to the size of the sample
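
A minimal sketch of an IQR calculation, using the data values from the range example. Software packages use slightly different quartile conventions; this sketch assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, so other tools may give slightly different quartiles.

```python
from statistics import quantiles

data = [5, 9, 12, 16, 23, 34, 37, 42]

# Cut points that split the data into four equal parts: Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```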

139
3. Quartile deviation (QD)

$$QD = \frac{Q_3 - Q_1}{2}$$

140
4. Coefficient of quartile deviation (CQD)

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

• CQD is a unitless quantity and is
useful for comparing the variability among the
middle 50% of observations.
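
Applied to the quartiles from the infant-weight example (Q1 = 8.8 kg, Q3 = 10.2 kg), the two quantities can be sketched as below. The function names are illustrative.

```python
def quartile_deviation(q1, q3):
    """Half of the interquartile range."""
    return (q3 - q1) / 2

def coefficient_of_quartile_deviation(q1, q3):
    """IQR divided by the sum of the quartiles (no units)."""
    return (q3 - q1) / (q3 + q1)

q1, q3 = 8.8, 10.2  # kg

qd = quartile_deviation(q1, q3)
cqd = coefficient_of_quartile_deviation(q1, q3)

print(round(qd, 2), round(cqd, 4))
```

QD keeps the unit of the data (about 0.7 kg), while CQD (about 0.074) has no unit and can be compared across variables.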

141
5. Mean deviation (MD)
• Mean deviation is the average of the absolute
deviations taken from a central value, generally
the mean or median.
• Consider a set of n observations $x_1, x_2, \ldots, x_n$.
Then:

$$MD = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - A \rvert$$

• ‘A’ is a central value (arithmetic mean or
median).
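
A minimal sketch of the mean deviation, computed for data set A (177, 193, 195, 209, 226) about both its mean and its median. The function name is illustrative.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(x - center) for x in values) / len(values)

data_a = [177, 193, 195, 209, 226]

md_about_mean = mean_deviation(data_a, mean(data_a))      # A = mean = 200
md_about_median = mean_deviation(data_a, median(data_a))  # A = median = 195

print(md_about_mean, md_about_median)
```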
142
Properties of mean deviation:
• MD removes one main objection to the earlier
measures, in that it involves every value

• It is not affected much by extreme values

• Its main drawback is that the algebraic negative signs of the
deviations are ignored, which is
mathematically unsound

143
6. Variance (σ², s²)
• The main objection of mean deviation, that
the negative signs are ignored, is removed by
taking the square of the deviations from the
mean.

• The variance is the average of the squares of
the deviations taken from the mean.

144
• The deviations are squared because the sum of the
deviations of the individual observations of a
sample about the sample mean is always 0:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

• The variance can be thought of as an average
of squared deviations

145
• Variance is used to measure the dispersion of
values relative to the mean.
• When values are close to their mean (narrow
range) the dispersion is less than when there
is scattering over a wide range.
– Population variance = σ²
– Sample variance = S²

146
a) Ungrouped data
• Let $X_1, X_2, \ldots, X_N$ be the measurements on N
population units; then:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

is the population mean.

147
A sample variance is calculated for a sample of individual values
($X_1, X_2, \ldots, X_n$) and uses the sample
mean $\bar{x}$ rather than the population mean µ:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$

148
Degrees of freedom
• In computing the variance there are (n-1)
degrees of freedom because only (n-1) of the
deviations are independent from each other
• The last one can always be calculated from the
others automatically.
• This is because the sum of the deviations from
their mean (Xi-Mean) must add to zero.
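
The sketch below works through the sample variance of data set A (177, 193, 195, 209, 226) with the n − 1 divisor and confirms that the deviations sum to zero. The result is compared with `statistics.variance`, which also divides by n − 1, and with `statistics.pvariance`, which divides by n.

```python
from statistics import variance, pvariance

data_a = [177, 193, 195, 209, 226]
n = len(data_a)
x_bar = sum(data_a) / n

deviations = [x - x_bar for x in data_a]
sum_of_deviations = sum(deviations)          # always 0 about the sample mean
sum_of_squares = sum(d ** 2 for d in deviations)

s2 = sum_of_squares / (n - 1)                # sample variance, n - 1 degrees of freedom
sigma2 = sum_of_squares / n                  # divisor N, as for a whole population

print(sum_of_deviations, s2, sigma2)
print(variance(data_a), pvariance(data_a))
```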

149
b) Grouped data

$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$

where
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals
150
Properties of Variance:
• The main disadvantage of variance is that
its unit is the square of the unit of the
original measurement values
• The variance gives more weight to the
extreme values as compared to those
which are near to the mean value, because the
difference is squared in variance.
• The drawbacks of variance are overcome
by the standard deviation.

151
7. Standard deviation (σ, s)
• It is the square root of the variance.
• This produces a measure having the same
scale as that of the individual values.

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$

152
• Following are the survival times of n=11
patients after heart transplant surgery.

• The survival time for the “ith” patient is
represented as Xi for i= 1, …, 11.

• Calculate the sample variance and SD.

153
154
Example. Compute the variance and SD of the age of 169 subjects from
the grouped data.
Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 − 1) = 120.22
SD = √S² = √120.22 = 10.96 years

Class interval   mi     fi    (mi − Mean)   (mi − Mean)²   (mi − Mean)² fi
10-19            14.5   4     -19.88        395.2144       1580.8576
20-29            24.5   66    -9.88         97.6144        6442.5504
30-39            34.5   47    0.12          0.0144         0.6768
40-49            44.5   36    10.12         102.4144       3686.9184
50-59            54.5   12    20.12         404.8144       4857.7728
60-69            64.5   4     30.12         907.2144       3628.8576
Total                   169                 1907.2864      20197.6336
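
The grouped-data calculation can be reproduced from the class mid-points and frequencies alone. This is a minimal sketch using the grouped formula with divisor Σf − 1; carried at full precision it gives a mean of about 34.38 years, a variance of about 120.22, and an SD of about 10.96 years.

```python
import math

# Class mid-points (years) and frequencies from the table
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Frequency-weighted sum of squared deviations from the grouped mean
sum_sq = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, frequencies))

s2 = sum_sq / (n - 1)
sd = math.sqrt(s2)

print(n, round(grouped_mean, 2), round(s2, 2), round(sd, 2))
```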

155
Properties of SD
• The SD has the advantage of being expressed in the
same units of measurement as the mean

• SD is considered to be the best measure of dispersion
and is used widely because of the properties of the
theoretical normal curve.

• However, if the units of measurement of the variables in
two data sets are not the same, then their variability can’t
be compared by comparing the values of SD.

156
SD Vs Standard Error (SE)
• SD describes the variability among individual values
in a given data set
• SE is used to describe the variability among
sample means obtained from one sample
to another

• We interpret the SE of the mean to indicate that another
similarly conducted study may give a mean that
lies within ± SE of the observed mean.
157
Standard Error
• SD is about the variability of individuals

• SE is used to describe the variability in the
means of repeated samples taken from the
same population.

• For example, imagine 5,000 samples, each of the same size n=11. This would
produce 5,000 sample means. This new collection has its own pattern of
variability. We describe this new pattern of variability using the SE, not the
SD.
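
The thought experiment with 5,000 samples can be simulated. The sketch below assumes a hypothetical normally distributed population (mean 161, SD 50) and the standard formula SE = SD/√n; neither the population nor the seed comes from the lecture.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

mu, sigma = 161, 50   # hypothetical population mean and SD
n = 11                # size of each sample
n_samples = 5000      # number of repeated samples

# Draw repeated samples and record the mean of each one
sample_means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(n_samples)
]

observed_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = sigma / math.sqrt(n)          # SE = SD / sqrt(n)

print(round(observed_se, 2), round(theoretical_se, 2))
```

The spread of the 5,000 sample means should be close to SD/√n and much smaller than the SD of the individual values.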

158
Example: The heart transplant surgery
n=11, SD=168.89 days, Mean=161 days
• What happens if we repeat the study? What will our next mean
be? Will it be close? How different will it be? Focus here is on the
generalizability of the study findings.
• The behavior of the mean from one replication of the study to the
next replication is referred to as the sampling distribution of the
mean.
• We can also have a sampling distribution of the median or the SD

• Here SE = SD/√n = 168.89/√11 ≈ 50.9 days.

• We interpret this to mean that a similarly conducted study might
produce an average survival time that is near 161 days, ±50.9
days.

159
8. Coefficient of variation (CV)
• When two data sets have different units of
measurement, or their means differ
sufficiently in size, the CV should be used as
a measure of dispersion.
• It is the best measure for comparing the
variability of two sets of
observations.
• A data set with a smaller coefficient of variation is
considered more consistent.
160
• CV is the ratio of the SD to the mean, multiplied by 100.

$$CV = \frac{S}{\bar{x}} \times 100$$

              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0

• “Cholesterol is more variable than systolic blood
pressure”
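
The two CV values in the table can be reproduced with a one-line function. This is a minimal sketch; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """SD expressed as a percentage of the mean."""
    return sd / mean * 100

cv_sbp = coefficient_of_variation(15, 130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)  # cholesterol, mg/dl

print(round(cv_sbp, 1), round(cv_cholesterol, 1))
```

Because the units cancel, the two percentages can be compared directly even though the variables are measured on different scales.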

161
NOTE:
• The range often appears with the median as a
numerical summary measure
• The IQR is used with the median as well
• The SD is used with the mean
• For nominal and ordinal data, a table or graph
is often more effective than any numerical
summary measure

162
