Biostat Lecture 3-1

The document discusses various methods for organizing and presenting data collected from surveys, including ordering data numerically or categorically, displaying data in frequency distributions and tables, and using graphical representations like diagrams. It provides examples of how to construct frequency distributions, tables with one or two variables, and guidelines for effective table construction. The goal is to condense and simplify raw data to make patterns and relationships more evident.

Uploaded by

ODAA TUBE


Lecture 3

Methods of Data Processing,

Organization, presentation and
summarization

1
Methods of data organization and presentation

• The data collected in a survey are called raw data.

• In most cases, useful information is not immediately evident from the mass of unsorted data.
• Collected data need to be organized so as to condense the information they contain and show patterns of variation clearly.

2
Precise methods of analysis can be decided upon only when the characteristics of the data are understood.

To this end, different techniques of data organization and presentation, such as ordered arrays, tables and diagrams, are used.

3
Generally Summarizing and organizing data can
be achieved through:

1. Frequency Distributions

2. Graphical Representations

3. Measures of Central Tendency

4. Measures of variability

4
Frequency Distributions
o For data to be more easily appreciated and to draw
quick comparisons, it is often useful to arrange the data
in the form of a table, or in one of a number of different
graphical forms.

o When analyzing voluminous data collected from say, a

health center's records, it is quite useful to put them
into compact tables.

o Quite often, the presentation of data in a meaningful

way is done by preparing a frequency distribution.

o If this is not done, the raw data will convey little meaning, and any pattern in them may go undetected.
5
Array
Array (ordered array) is a serial arrangement of
numerical data in an ascending or descending order.

This will enable us to know the range over which the

items are spread and will also get an idea of their
general distribution.

An ordered array is very difficult to construct and read when the sample size is large.

Hence it is an appropriate way of presentation only when the data set is small (usually fewer than 20 observations).
6
Ordered Array
12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67

7
• The actual summarization and organization of data
starts from frequency distribution.

• Frequency distribution: A table which has a list of

each of the possible values that the data can assume
along with the number of times each value occurs.

8
• For nominal and ordinal data, frequency distributions
are often used as a summary.
• Example:

• The % of times that each value occurs, or the relative

frequency, is often listed

• Tables make it easier to see how the data are distributed

9
• For both discrete and continuous data, the
values are grouped into non-overlapping
intervals, usually of equal width.

10
a) Qualitative variable: Count the number of cases in
each category.

- Example1: The intensive care unit type of 25 patients

entering ICU at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other

11
ICU Type    Frequency      Relative Frequency
            (How often)    (Proportionately often)

Medical     12             0.48
Surgical     6             0.24
Cardiac      5             0.20
Other        2             0.08

Total       25             1.00
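
A frequency table like the one above can be produced with a few lines of code. The sketch below is illustrative only: the raw list `icu_types` is a hypothetical reconstruction from the published counts, and the function name `frequency_table` is an assumption, not part of the lecture.

```python
from collections import Counter

# Hypothetical raw records rebuilt from the counts in the ICU table.
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2


def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    n = len(values)
    return {category: (f, f / n) for category, f in counts.items()}


table = frequency_table(icu_types)
for category, (f, rel) in table.items():
    print(f"{category:<10}{f:>4}{rel:>8.2f}")
```

The relative frequencies always add up to 1, which is a quick check that no observation was dropped.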

12
Example 2:
A study was conducted to assess the characteristics of a
group of 234 smokers by collecting data on gender and
other variables.
Gender, 1 = male, 2 = female

Gender Frequency (n) Relative Frequency

Male (1) 110 47.0%
Female (2) 124 53.0%
Total 234 100%

13
b) Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed in
one, and only one, of the intervals.

- The first consideration is how many intervals to

include

14
For a continuous variable (e.g. age), the frequency distribution
of the individual ages is not so
interesting.

15
• We “see more” in
frequencies of age
values in “groupings”.
Here, 10 year groupings
make sense.
• Grouped data
frequency distribution

16
To determine the number of class intervals and the
corresponding width, we may use:

Sturges' rule:

$$K = 1 + 3.322 \log_{10} n$$

$$W = \frac{L - S}{K}$$

where
K = number of class intervals
n = no. of observations
W = width of the class interval
L = the largest value
S = the smallest value
17
Example:
Leisure time (hours) per week for 40 college students:

23 24 18 14 20 36 24 26 23 21 16 15 19 20 22 14 13 10 19
27 29 22 38 28 34 32 23 19 21 31 16 28 19 18 12 27 15
21 25 16

K = 1 + 3.322 (log 40) = 6.32 ≈ 6

Maximum value = 38, Minimum value = 10

Width = (38-10)/6 = 4.67 ≈ 5
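
The calculation above can be checked with a short script. This is a minimal sketch, assuming base-10 logarithms and rounding K to the nearest whole number before computing the width; the function name `sturges` is hypothetical.

```python
import math

# Leisure time (hours) per week for the 40 college students in the example.
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 10, 19,
    27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15,
    21, 25, 16,
]


def sturges(values):
    """Return (K, W): suggested number of classes and class width."""
    n = len(values)
    k = 1 + 3.322 * math.log10(n)
    width = (max(values) - min(values)) / round(k)
    return k, width


k, width = sturges(leisure_hours)
print(f"K = {k:.2f}, W = {width:.2f}")
```

In practice the width is rounded up to a convenient number (here 5) so that the classes cover the whole range of the data.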

18
Time                    Relative     Cumulative Relative
(Hours)    Frequency    Frequency    Frequency

10-14       5           0.125        0.125
15-19      11           0.275        0.400
20-24      12           0.300        0.700
25-29       7           0.175        0.875
30-34       3           0.075        0.950
35-39       2           0.050        1.00

Total      40           1.00
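
The grouped table can be rebuilt from the raw leisure-time values. The sketch below assumes integer data, a first class starting at 10, a width of 5 and 6 classes (the values chosen in the example); `grouped_distribution` is a hypothetical helper name.

```python
leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 10, 19,
    27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15,
    21, 25, 16,
]


def grouped_distribution(values, start, width, k):
    """Return rows of (lower, upper, freq, rel. freq, cum. rel. freq)."""
    n = len(values)
    rows, cumulative = [], 0
    for i in range(k):
        lower = start + i * width
        upper = lower + width - 1  # integer class limits, e.g. 10-14
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append((lower, upper, freq, freq / n, cumulative / n))
    return rows


rows = grouped_distribution(leisure_hours, start=10, width=5, k=6)
for lower, upper, freq, rel, cum_rel in rows:
    print(f"{lower}-{upper}  {freq:>3}  {rel:.3f}  {cum_rel:.3f}")
```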

19
• Cumulative frequency: The number of observations with values in a given class or in any class below it, obtained by adding the frequencies of successive classes.

• Cumulative relative frequency: The percentage of the

total number of observations that have a value either in
that interval or below it.

• Mid-point: The value of the interval which lies midway

between the lower and the upper limits of a class.

20
• True limits: Are those limits that make an
interval of a continuous variable continuous in
both directions

• Used to close the gaps between adjacent class intervals

• Obtained by subtracting 0.5 (half the unit of measurement) from the lower limit and adding 0.5 to the upper limit

21
Time
(Hours)    True limit     Mid-point    Frequency

10-14       9.5 – 14.5    12            5
15-19      14.5 – 19.5    17           11
20-24      19.5 – 24.5    22           12
25-29      24.5 – 29.5    27            7
30-34      29.5 – 34.5    32            3
35-39      34.5 – 39.5    37            2

Total                                  40
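
The true limits and mid-points in the table follow mechanically from the class limits. A minimal sketch, assuming the data are recorded to the nearest whole unit (so half a unit is 0.5); the function name is hypothetical.

```python
def true_limits_and_midpoint(lower, upper, unit=1):
    """Return ((true lower, true upper), mid-point) for one class interval."""
    half = unit / 2
    return (lower - half, upper + half), (lower + upper) / 2


classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
for lower, upper in classes:
    limits, midpoint = true_limits_and_midpoint(lower, upper)
    print(f"{lower}-{upper}: true limits {limits}, mid-point {midpoint}")
```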

22
Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity by
age, 1989
Age group         Cases
(years)      Number    Percent

0-14            230      0.5
15-19          4378     10.0
20-24         10405     23.6
25-29          9610     21.8
30-34          8648     19.6
35-44          6901     15.7
45-54          2631      6.0
>54            1278      2.9
Total         44081    100
23
Two Variable Table
• Primary and secondary cases of syphilis morbidity
by age and sex, 1989
Age group         Number of cases
(years)      Male     Female    Total

0-14            40       190      230
15-19         1710      2668     4378
20-24         5120      5285    10405
25-29         5304      4306     9610
30-34         5537      3111     8648
35-44         5004      1897     6901
45-54         2144       487     2631
>54           1147       131     1278
Total        26006     18075    44081
24
Tables can also be used to present three or more variables.

Variable Frequency (n) Percent

Sex
Male
Female
Age (yrs)
15-19
20-24
25-29
Religion
Christian
Muslim
Occupation
Student
Farmer
Merchant
25
Guidelines for constructing tables
• Keep them simple,
• Limit the number of variables to three or fewer,
• All tables should be self-explanatory,
• Include clear title telling what, when and where,
• Clearly label the rows and columns,
• State clearly the unit of measurement used,
• Explain codes and abbreviations in a footnote,
• Show totals,
• If the data are not original, indicate the source in a footnote.

26
Diagrammatic Representation

• Pictorial representations of numerical data

27
Importance of diagrammatic representation:

1. Diagrams have greater attraction than

mere figures.
2. They give quick overall impression of the
data.
3. They have greater memorizing value than
mere figures.
4. They facilitate comparison
5. Used to understand patterns and trends
28
• Well designed graphs can be powerful means
of communicating a great deal of information

• When graphs are poorly designed, they not only convey the message ineffectively but are often misleading.

29
Limitations of Diagrammatic Representation
1. The technique of diagrammatic representation is
made use only for purposes of comparison. It is not
to be used when comparison is either not possible
or is not necessary.
2. Diagrammatic representation is not an alternative
to tabulation. It only strengthens the textual
exposition of a subject, and cannot serve as a
complete substitute for statistical data.
3. It can give only an approximate idea and as such
where greater accuracy is needed diagrams will not
be suitable.
4. They fail to bring to light small differences
30
Construction of graphs
• The choice of a particular form among the different possibilities depends on personal preference and/or the type of data.
• Bar charts and pie charts are commonly used for qualitative or quantitative discrete data.
• Histograms and frequency polygons are used for quantitative continuous data.

31
There are, however, general rules that are commonly
accepted about construction of graphs:
1.Every graph should be self-explanatory and as simple as
possible.
2.Titles are usually placed below the graph and should answer the questions: What? Where? When? How classified?
3.Legends or keys should be used to differentiate variables
if more than one is shown.
4.The axis labels should be placed to read from the left side
and from the bottom.
5.The units into which the scale is divided should be
clearly indicated.
6.The numerical scale representing frequency must start at
zero or a break in the line should be shown.
32
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together (leave space
between bars)
• The different bars should be separated by equal
distances
• All the bars should rest on the same line called
the base
• Label both axes clearly

33
Specific types of graphs include:
• For nominal and ordinal data:
– Bar graph
– Pie chart
• For quantitative data:
– Histogram
– Stem-and-leaf plot
– Box plot
– Scatter plot
– Line graph
– Others

34
1. Bar Chart
• Bar diagrams are used to represent and compare the frequency distributions of discrete variables and attributes or categorical series.

• When we represent data using a bar diagram, all the bars must have equal width and the distance between bars must be equal.

• There are different types of bar diagrams; the most important ones are:

35
A. Simple bar chart:
• It is a one-dimensional diagram in which the bar
represents the whole of the magnitude.

• The height or length of each bar indicates the

size (frequency) of the figure represented

36
[Figure: simple bar chart. Vertical axis: number of children (0 to 90); horizontal axis: immunization status (not immunized, partially immunized, fully immunized).]

Fig 1. Immunization status of children in x District, Jan 2014

37
B. Multiple bar chart
In this type of chart the component figures are shown as separate bars adjoining (touching) each other.
The height of each bar represents the actual value of the component figure.
It depicts the distributional pattern of more than one variable.
– Example of a multiple bar diagram: consider data on the immunization status of women by marital status.

38
Fig. 2 TT Immunization status by marital status of women 15-49
years, Asendabo town, 1996
39
There’s no reason why the bar chart can’t be
plotted horizontally instead of vertically.

[Figure: horizontal multiple bar chart. Vertical axis: type of source (CHA, reading, training, campaign, anti-FGMC, CAT); horizontal axis: percent (0 to 50); separate bars for females and males.]

Figure 1. Source of information on the complications of FGM and participation in RH programs, Jijiga, 2004*.
* FGMC = female genital mutilation committee; CAT = community action team; HC = health centre; CHA = community health agent

40
Example: Construct a bar chart for the following data.

Distribution of patients in hospital by source of referral

Source of referral No. of patients Relative freq.
Other hospital 97 5.1
General practitioner 769 40.3
Out-patient department 623 32.7
Casualty 256 13.4
Other 161 8.5
Total 1 906 100.0

41
Distribution of patients in hospital X by source of referral, 1999

[Figure: bar chart. Vertical axis: number of patients (0 to 800); horizontal axis: source of referral. Bar heights: other hospital 97, GP 769, OPD 623, casualty 256, other 161.]

42
C. Component ( sub-divided) Bar Diagram
Bars are sub-divided into component parts of the
figure.
These sorts of diagrams are constructed when each
total is built up from two or more component
figures.
They can be of two kinds:
I) Actual Component Bar Diagrams: When the overall
height of the bars and the individual component
lengths represent actual figures.
Example of actual component bar diagram: The
above data can also be presented as below.

43
44
II) Percentage Component Bar Diagram

• Where the individual component lengths represent the percentage each component forms of the overall total.

• Note that a series of such bars will all be the same total height, i.e., 100 percent.
o Example of percentage component bar diagram

45
46
2. Pie chart
• Shows the relative frequency for each category by
dividing a circle into sectors, the angles of which are
proportional to the relative frequency.
• Used for a single categorical variable
• Use percentage distributions

47
Steps to construct a pie-chart
• Construct a frequency table

• Change the frequency into percentage (P)

• Change the percentages into degrees, where:

degrees = (P / 100) × 360°

• Draw a circle and divide it accordingly

48
Example: Distribution of deaths for females, in England
and Wales, 1989.

Cause of death No. of deaths

Circulatory system 100 000
Neoplasm 70 000
Respiratory system 30 000
Injury and poisoning 6 000
Digestive system 10 000
Others 20 000
Total 236 000
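
The conversion from frequencies to percentages and then to sector angles can be sketched as follows. The dictionary `deaths` copies the table above; the helper name `sector` is hypothetical.

```python
# Number of deaths by cause, females, England and Wales, 1989 (from the table).
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}
total = sum(deaths.values())


def sector(count, total):
    """Return (percentage, angle in degrees) for one category."""
    percentage = 100 * count / total
    return percentage, percentage / 100 * 360


sectors = {cause: sector(count, total) for cause, count in deaths.items()}
for cause, (percentage, angle) in sectors.items():
    print(f"{cause:<22}{percentage:6.1f}%{angle:8.1f} degrees")
```

The angles must add up to 360 degrees, just as the percentages add up to 100.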

49
Distribution of cause of death for females, in England and Wales, 1989

[Figure: pie chart. Circulatory system 42%, neoplasms 30%, respiratory system 13%, others 8%, digestive system 4%, injury and poisoning 3%.]

50
3. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned into
graphs.

• To construct a histogram, we draw the interval

boundaries on a horizontal line and the frequencies
on a vertical line.

• Non-overlapping intervals that cover all of the data

values must be used.

51
• Bars are drawn over the intervals in such a
way that the areas of the bars are all
proportional in the same way to their interval
frequencies.

• The area of each bar is proportional to the

frequency of observations in the interval

52
Example: Distribution of the age of women at the time of marriage

Age group    15-19  20-24  25-29  30-34  35-39  40-44  45-49
Number       11     36     28     13     7      3      2

[Figure: histogram titled "Age of women at the time of marriage". Vertical axis: number of women; horizontal axis: age group, drawn on the true limits 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5.]

53
Histogram for the ages of 2087 mothers with <5 children,
Adami Tulu, 2003

[Figure: histogram of mothers' age (variable N1AGEMOTH). Horizontal axis: age, 15.0 to 55.0; vertical axis: frequency, 0 to 700. Mean = 27.6, Std. Dev = 6.13, N = 2087.]

54
Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective groups
are lost and difficult to reconstruct

• The other graphic display (stem-and-leaf plot) overcomes these problems

55
4. Stem-and-Leaf Plot
• A quick way to organize data to give visual impression
similar to a histogram while retaining much more detail
on the data.

• Similar to histogram and serves the same purpose and

reveals the presence or absence of symmetry
• Are most effective with relatively small data sets

• Are not suitable for reports and other communications,

but
• Help researchers to understand the nature of their data
56
Steps to construct Stem-and-Leaf Plots
1. Separate each data point into a stem and leaf
components
• Stem = consists of one or more of the initial digits
of the measurement
• Leaf = consists of the rightmost digit
The stem of the number 483, for example, is 48 and the
leaf is 3.
2. Write the smallest stem in the data set in the
upper left-hand corner of the plot

57
Steps …

3. Write the second stem (first stem +1) below the first
stem
4. Continue with the remaining stems until you reach
the largest stem in the data set
5. Draw a vertical bar to the right of the column of
stems
6. For each number in the data set, find the appropriate
stem and write the leaf to the right of the vertical
bar

58
Example: 3031, 3101, 3265, 3260, 3245, 3200, 3248,
3323, 3314, 3484, 3541, 3649 (BWT in g)

Stem Leaf Number

30 31 1
31 01 1
32 65 60 45 00 48 5
33 23 14 2
34 84 1
35 41 1
36 49 1
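
The steps above can be sketched in code. Note that this example keeps two-digit leaves, as the birth-weight table does, so the sketch splits each value at the hundreds place (`leaf_unit=100`); that parameter and the function name `stem_and_leaf` are assumptions made for illustration.

```python
from collections import defaultdict

# Birth weights in grams, copied from the example.
birth_weights = [3031, 3101, 3265, 3260, 3245, 3200, 3248,
                 3323, 3314, 3484, 3541, 3649]


def stem_and_leaf(values, leaf_unit=100):
    """Return {stem: [leaves]} with every stem from smallest to largest."""
    groups = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, leaf_unit)
        groups[stem].append(leaf)
    stems = range(min(groups), max(groups) + 1)
    return {stem: groups.get(stem, []) for stem in stems}


plot = stem_and_leaf(birth_weights)
for stem, leaves in plot.items():
    leaf_text = " ".join(f"{leaf:02d}" for leaf in leaves)
    print(f"{stem} | {leaf_text}  ({len(leaves)})")
```

Stems with no observations are still listed, so gaps in the data remain visible in the plot.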
59
Percentiles (Quartiles)
• Suppose that 50% of a cohort survived at least 4
years.
• This also means that 50% survived at most 4 years.
• We say 4 years is the median.
• The median is also called the 50th percentile
• We write: P50 = 4 years.

60
• Similarly we could speak of other percentiles:
– P0: The minimum
– P25: 25% of the sample values are less than or equal
to this value. 1st Quartile
. P25 means 25th percentile

– P50: 50% of the sample are less than or equal to this

value. 2nd Quartile

– P75: 75% of the sample values are less than or equal

to this value. 3rd Quartile
– P100: The maximum
61
It is possible to estimate the values of percentiles from a
cumulative frequency polygon.

62
5. Scatter plot
• Most studies in medicine involve measuring more than
one characteristic, and graphs displaying the relationship
between two characteristics are common in literature.

• When both the variables are qualitative then we can use

a multiple bar graph.

• When one of the characteristics is qualitative and the

other is quantitative, the data can be displayed in box
and whisker plots.

63
• For two quantitative variables we use
bivariate plots (also called scatter plots or
scatter diagrams).

• In the study on percentage saturation of

bile, information was collected on the age
of each patient to see whether a
relationship existed between the two
measures.

64
• A scatter diagram is constructed by drawing X-and Y-axes.
• Each point or dot represents a pair of values measured for a single study subject.

Age and percentage saturation of bile for women patients in hospital Z, 1998

[Figure: scatter plot. Horizontal axis: age (0 to 80); vertical axis: saturation of bile (0 to 160).]

65
• The graph suggests the possibility of a positive
relationship between age and percentage
saturation of bile in women.

66
6. Line graph
• Useful for assessing the trend of a particular situation over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along the
horizontal axis, and
• Values of the quantity being studied are marked on the vertical
axis.
• Values for each category are connected by a continuous line.
• Sometimes two or more graphs are drawn on the same graph
taking the same scale so that the plotted graphs are
comparable.

67
No. of microscopically confirmed malaria cases by species and
month at Zeway malaria control unit, 2003

[Figure: line graph. Horizontal axis: months (Jan to Dec); vertical axis: number of confirmed malaria cases (0 to 2100); separate lines for all positives, P. falciparum and P. vivax.]

68
A line graph can also be used to depict the relationship
between two continuous variables, like a scatter
diagram.

• The following graph shows the level of zidovudine (AZT) in the blood of AIDS patients at several times after administration of the drug, for patients with normal fat absorption and with fat malabsorption.

69
Response to administration of zidovudine in two groups of AIDS
patients in hospital X, 1999

[Figure: line graph. Horizontal axis: time since administration (min), 10 to 360; vertical axis: blood zidovudine concentration (0 to 8); separate lines for fat malabsorption and normal fat absorption.]

70
Exercise
• Evaluate the following graphs whether they
are good or bad and discuss the points which
make them good or bad

71
MMRatio per 100,000 live births by age of woman;
Giza, Egypt 1984

[Figure: graph. Horizontal axis: age (15-19 to 45-49); vertical axis: MMR per 100,000 LB (0 to 1200); legend: MMR per 100,000 LB.]

72
1. The title of the graph tells the reader the
content of the graph. For example:
• the statistic presented (MMRatio);
• the second dimension of the graph (age of
woman on the x axis);
• the metric (per 100,000 live births);
• the source of the data (Giza, Egypt);
• The date (1984);

73
2. The Y axis is labeled (MMR per 100,000
LB);
3. The X axis is labeled (age of woman);
4. The legend is given (_______= MMR);
5. The source of the information is provided
(Kane et al)

74
Maternal Mortality:
Countries X, Y and Z since 1850

[Figure: graph with an unlabeled vertical axis (0 to 900) and the categories Sweden, UK and USA.]

75
• The Y axis is not labeled;
• The title does not give you the statistic presented in
the graph (Maternal Mortality is not a statistic). This
is particularly problematic when the Y axis is also not
labeled;
• Neither the title nor the Y axis identify the metric (per
100,000 live births).
• The X axis is not labeled – but this is not so serious
when the categories are so obvious and when the
second dimension (year) has been identified in the
graph title.

76
Remember:

A graph is a tool.

It is not an artwork to hang above your sofa!

It is more important that it is
easy to correctly interpret
than it is that it is pretty!

77
Numerical Summary Measures

Single numbers which quantify the characteristics of a distribution of values

• Measures of central tendency (location)

• Measures of dispersion

78
• A frequency distribution is a general picture of
the distribution of a variable

• But it cannot by itself indicate the average value or the spread of the values

79
Measures of Central Tendency (MCT)
• On the scale of values of a variable there is a certain
stage at which the largest number of items tend to
cluster.

• Since this stage is usually in the centre of distribution,

the tendency of the statistical data to get concentrated
at a certain value is called “central tendency”

• The various methods of determining the point about

which the observations tend to concentrate are called
MCT.
80
• The objective of calculating MCT is to determine a single
figure which may be used to represent the whole data
set.

• In that sense it is an even more compact description of

the statistical data than the frequency distribution.

• Since a MCT represents the entire data, it facilitates

comparison within one group or between groups of
data.

81
[Figure: "Position". Frequency chart over the class intervals 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.]

82
Characteristics of a good MCT
A MCT is good or satisfactory if it possesses the
following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of values
as possible
4. It should have a definite value
5. It should not be subjected to complicated and tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling

83
• The most common measures of central
tendency include:
– Arithmetic Mean

– Median

– Mode

– Others

84
1. Arithmetic Mean
A. Ungrouped Data
• The arithmetic mean is the "average" of the data set
and by far the most widely used measure of central
location. It is usually denoted by $\bar{x}$.
• It is the sum of all the observations divided by the total
number of observations.

85
The Summation Notation

86
87
The heart rates for n=10 patients were as follows (beats
per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these
patients?
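
The question can be answered directly from the definition (sum of the observations divided by their number). A minimal sketch:

```python
# Heart rates (beats per minute) of the 10 patients in the example.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

mean_heart_rate = sum(heart_rates) / len(heart_rates)
print(f"Sum = {sum(heart_rates)}, n = {len(heart_rates)}, "
      f"mean = {mean_heart_rate:.1f} beats per minute")
```

The single low value of 40 pulls the mean down, which illustrates how sensitive the mean is to extreme observations.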

88
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

where,
k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
89
Example. Compute the mean age of 169 subjects
from the grouped data.

Mean = 5810.5/169 = 34.38 years

Class interval Mid-point (mi) Frequency (fi) mifi
10-19 14.5 4 58.0
20-29 24.5 66 1617.0
30-39 34.5 47 1621.5
40-49 44.5 36 1602.0
50-59 54.5 12 654.0
60-69 64.5 4 258.0
Total __ 169 5810.5
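
The grouped mean can be verified by weighting each mid-point by its class frequency. A minimal sketch using the values from the table; the function name `grouped_mean` is hypothetical.

```python
# Mid-points and frequencies of the six age classes in the table.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of m_i * f_i divided by sum of f_i."""
    weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_sum / sum(frequencies)


mean_age = grouped_mean(midpoints, frequencies)
print(f"Mean age = {mean_age:.2f} years")
```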

90
The mean can be thought of as a “balancing
point”, “center of gravity”

91
When the data are skewed, the mean is “dragged” in
the direction of the skewness

• It is possible in extreme cases for all but one of the sample points
to be on one side of the arithmetic mean & in this case, the mean is
a poor measure of central location or does not reflect the center of
the sample.

92
Properties of the Arithmetic Mean.
• For a given set of data there is one and only one
arithmetic mean (uniqueness).

• Easy to calculate and understand (simple).

• Influenced by each and every value in a data set

• Greatly affected by the extreme values.

• In case of grouped data if any class interval is open,

arithmetic mean can not be calculated.
93
2. Median
a) Ungrouped data
• The median is the value which divides the data set into
two equal parts.

• If the number of values is odd, the median will be the

middle value when all values are arranged in order of
magnitude.

• When the number of observations is even, there is no

single middle value but two middle observations.
• In this case the median is the mean of these two middle
observations, when all observations have been arranged
in the order of their magnitude.
94
95
96
97
• The median is a better description (than the mean) of
the majority when the distribution is skewed
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
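
The two figures can be reproduced with Python's standard `statistics` module, shown here only as a quick check of the example:

```python
import statistics

data = [14, 89, 93, 95, 96]

mean_value = statistics.mean(data)      # pulled down by the outlying 14
median_value = statistics.median(data)  # middle value of the sorted data
print(f"Mean = {mean_value}, median = {median_value}")
```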

98
b) Grouped data
• In calculating the median from grouped data, we assume
that the values within a class-interval are evenly
distributed through the interval.

• The first step is to locate the class interval in which the

median is located, using the following procedure.

• Find n/2 and locate the first class interval whose cumulative frequency is greater than or equal to n/2.

• Then, use the following formula.

99
$$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W$$

where,
Lm = lower true class boundary of the interval containing the median
Fc = cumulative frequency of the interval just before the median class interval (i.e., of all classes below it)
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations
100
Example. Compute the median age of 169
subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval Mid-point (mi) Frequency (fi) Cum. freq

10-19 14.5 4 4
20-29 24.5 66 70
30-39 34.5 47 117
40-49 44.5 36 153
50-59 54.5 12 165
60-69 64.5 4 169
Total 169

101
• n/2 = 84.5 falls in the 3rd class interval (30-39)
• Lower true limit = 29.5, upper true limit = 39.5
• Frequency of the class = 47
• (n/2 – Fc) = 84.5 – 70 = 14.5

• Median = 29.5 + (14.5/47)10 = 32.59 ≈ 33
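
The same procedure can be sketched as a small function. It assumes integer class limits with equal widths, so the lower true boundary is the lower limit minus 0.5; the name `grouped_median` is hypothetical.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of grouped data, assuming values are spread evenly in a class."""
    half = sum(frequencies) / 2
    cumulative = 0  # cumulative frequency of the classes below the current one
    for lower, freq in zip(lower_limits, frequencies):
        if cumulative + freq >= half:
            true_lower = lower - 0.5
            return true_lower + (half - cumulative) / freq * width
        cumulative += freq
    return None  # only reached if there are no observations


lower_limits = [10, 20, 30, 40, 50, 60]
frequencies = [4, 66, 47, 36, 12, 4]
median_age = grouped_median(lower_limits, frequencies, width=10)
print(f"Median age = {median_age:.2f} years")
```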

102
Properties of the median
• There is only one median for a given set of data
(uniqueness)

• The median is easy to calculate

• Median is a positional average and hence it is insensitive
to very large or very small values

• Median can be calculated even in the case of open end

intervals

• It is determined mainly by the middle points and less

sensitive to the remaining data points (weakness).

103
Quartiles
• Just as the median is the value above and below which lie half of the data, one can define measures above or below which lie other fractional parts of the data.

• The median divides the data into two equal parts

• If the data are divided into four equal parts, we speak of

quartiles.

104
a) The first quartile (Q1): 25% of all the ranked
observations are less than Q1.

b) The second quartile (Q2): 50% of all the ranked

observations are less than Q2. The second
quartile is the median.

c) The third quartile (Q3): 75% of all the ranked

observations are less than Q3.
105
Percentiles
• Simply divide the data into 100 pieces.

• Percentiles are less sensitive to outliers and not

greatly affected by the sample size (n).

106
3. Mode
• The mode is the most frequently occurring value among
all the observations in a set of data.

• It is not influenced by extreme values.

• It is possible to have more than one mode or no mode.

• It is not a good summary of the majority of the data.

107
[Figure: frequency chart (vertical axis N, 0 to 20) with the label "Mode" marked three times. Credit: T. Ancelle, D. Coulombie]
108
a) Ungrouped data
• It is a value which occurs most frequently in a
set of values.
• If all the values are different there is no
mode, on the other hand, a set of values may
have more than one mode.

109
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
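
The three examples can be checked with a small helper. It follows the lecture's convention that a data set in which every value occurs once has no mode; the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or [] if every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))        # unimodal
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))  # bimodal
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))  # no mode
```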

110
b) Grouped data
• To find the mode of grouped data, we
usually refer to the modal class, where the
modal class is the class interval with the
highest frequency.
• If a single value for the mode of grouped
data must be specified, it is taken as the
mid-point of the modal class interval.

111
$$\hat{x} = L + w \left( \frac{f_2}{f_0 + f_2} \right)$$

where
L – lower boundary of the modal class
f0 – the frequency of the class next below the modal class
in value
f2 – the frequency of the class next above the modal class
in value
w – length of the interval of the modal class

112
113
Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with
open end classes
• Often its value is not unique
• The main drawback of mode is that often it
does not exist

114
4. Geometric mean (GM)
• Mainly used in many types of laboratory data,
specifically data in the form of concentrations of one
substance in another
• Example: the minimum inhibitory concentration of
penicillin in urine for N. gonorrhoeae in 74 patients

(µg/ml)    Frequency    (µg/ml)    Frequency

0.03125    21           0.250      19
0.0625      6           0.50       17
0.1250      8           1.0         3

115
If x_1, x_2, ..., x_n are n positive observed values, then

$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$

and

$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}.$$

The geometric mean is generally used with data measured on a logarithmic scale, such
as titers of anti-neutrophil immunoglobulin G.

116
Example:
logGM = [21log(0.03125) + 6log(0.0625) +
8log(0.125) + 19log(0.25) + 17log(0.5)
+ 3log(1.0)]/74 = -0.846
The GM = the antilogarithm of -0.846 = 0.143
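
The calculation can be reproduced from the frequency table of concentrations. A minimal sketch, assuming base-10 logarithms as in the worked example; the dictionary name `mic_frequencies` is hypothetical.

```python
import math

# Minimum inhibitory concentration (µg/ml) -> number of patients.
mic_frequencies = {0.03125: 21, 0.0625: 6, 0.125: 8, 0.25: 19, 0.5: 17, 1.0: 3}

n = sum(mic_frequencies.values())
log_gm = sum(f * math.log10(x) for x, f in mic_frequencies.items()) / n
gm = 10 ** log_gm  # antilogarithm of the mean log value

print(f"n = {n}, log GM = {log_gm:.3f}, GM = {gm:.3f} µg/ml")
```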

117
5. Harmonic mean (HM)
• Just as the geometric mean is based on an
arithmetic mean of logarithms, so is the
harmonic mean based on arithmetic mean
of the reciprocals.
• Pertains to rates and time
• We define it as the reciprocal of the
arithmetic mean of the reciprocal of the
given numbers.

118
If the given numbers are x_1, x_2, ..., x_n, then

$$HM = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}}$$
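
A short sketch of the definition follows. The data are hypothetical and not from the lecture: a vehicle covers two equal distances at 40 and 60 km/h, a typical rate problem in which the harmonic mean gives the average speed. The result is compared with `statistics.harmonic_mean` from the Python standard library.

```python
import statistics


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return 1 / (sum(1 / x for x in values) / n)


speeds = [40, 60]  # hypothetical speeds in km/h over two equal distances
hm = harmonic_mean(speeds)
print(f"Harmonic mean = {hm:.1f} km/h")
print(f"Library value = {statistics.harmonic_mean(speeds):.1f} km/h")
```

The arithmetic mean of the two speeds would be 50 km/h, which overstates the average speed because more time is spent at the slower rate.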

119
6. Weighted mean (WM)
• In a weighted mean, separate outcomes have
separate influences.

• The influence attached to an outcome is the

weight.

• Familiar is the calculation of a course grade as

a weighted average of scores on separate
outcomes.

120
Example:
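
As an illustration of the course-grade case, here is a minimal sketch with hypothetical scores and weights (they are assumptions for the example, not data from the lecture). Each score is multiplied by its weight, and the sum is divided by the total weight.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical course: assignment 20%, mid-term 30%, final exam 50%.
scores = [80, 90, 70]
weights = [0.2, 0.3, 0.5]

course_grade = weighted_mean(scores, weights)
print(f"Weighted mean = {course_grade:.1f}")
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean.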

121
Which measure of central tendency is best with a given
set of data?

• Two factors are important in making this

decisions:
– The scale of measurement (type of data)
– The shape of the distribution of the
observations

122
• The mean can be used for discrete and
continuous data
• The median is appropriate for discrete and
continuous data as well, but can also be used
for ordinal data
• The mode can be used for all types of data,
but may be especially useful for nominal and
ordinal measurements
• For discrete or continuous data, the “modal
class” can be used

123
• The geometric mean is used primarily for
observations measured on a logarithmic
scale.
• Harmonic mean is a suitable MCT when the
data pertains to rates and time.
• Weighted mean is commonly used in the
calculation of mean for different outcomes.

124
(a) Symmetric and unimodal distribution —
Mean, median, and mode should all be
approximately the same

Mean, Median & Mode

125
(b) Bimodal — Mean and median should be
about the same, but may take a value that is
unlikely to occur; two modes might be best

126
(c) Skewed to the right (positively skewed) —
Mean is sensitive to extreme values, so median
might be more appropriate
[Figure: positively skewed curve with the mode, median, and mean marked]

127
(d) Skewed to the left (negatively skewed) —
Same as (c)
[Figure: negatively skewed curve with the mode, median, and mean marked]
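The effect of skewness on the three measures can be seen with two small made-up data sets (not from the lecture): one with a single large value in the right tail, and its mirror image with a single small value in the left tail.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: one extreme value in the upper tail.
right_skewed = [2, 3, 3, 4, 5, 6, 40]

# Mirror image of the same data, giving a left-skewed set.
left_skewed = [45 - x for x in right_skewed]

for label, data in (("Right-skewed", right_skewed), ("Left-skewed", left_skewed)):
    print(f"{label}: mode = {mode(data)}, median = {median(data)}, mean = {mean(data)}")
```

In both cases the mean is pulled toward the tail while the median stays near the bulk of the values.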

128
Measures of Dispersion
Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200

Two or more sets may have the same mean and/or median but they
may be quite different.
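A quick check of the two data sets above: both have a mean of 200, but their spread differs. The sketch uses the range and the sample standard deviation (divisor n − 1), both of which are defined later in this lecture.

```python
import statistics

a = [177, 193, 195, 209, 226]
b = [192, 197, 200, 202, 209]

for name, data in (("A", a), ("B", b)):
    data_mean = statistics.mean(data)
    data_range = max(data) - min(data)
    sd = statistics.stdev(data)  # sample SD, divisor n - 1
    print(f"{name}: mean = {data_mean}, range = {data_range}, SD = {sd:.2f}")
```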

129
These two distributions have the same mean,
median, and mode

130
• MCT are not enough to give a clear
understanding about the distribution of the
data.

• We need to know something about the variability or spread of the values: whether they tend to be clustered close together, or spread out over a broad range

131
Measures of Dispersion
• Measures that quantify the variation or dispersion of
a set of data from its central location

• Dispersion refers to the variety exhibited by the values of the data.

• The amount may be small when the values are close together.

• If all the values are the same, there is no dispersion

132
Measures of Dispersion
Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measures of Scatter”

133
• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others

134
1. Range (R)
• The difference between the largest and smallest
observations in a sample.

• Range = Maximum value – Minimum value

• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• A data set with a higher range exhibits more variability
135
Properties of range
• It is the simplest, crude measure and can be easily understood
• It takes into account only two values, which makes it a poor measure of dispersion
• It is very sensitive to extreme observations
• The larger the sample size, the larger the range tends to be
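The sensitivity of the range to extreme observations can be sketched with the data values from the range example; the added value of 120 is a hypothetical outlier introduced only for illustration.

```python
def value_range(data):
    """Range: largest observation minus smallest observation."""
    return max(data) - min(data)

values = [5, 9, 12, 16, 23, 34, 37, 42]   # data from the range example
with_outlier = values + [120]             # hypothetical extreme observation

print("Range without outlier:", value_range(values))
print("Range with outlier:   ", value_range(with_outlier))
```

A single extreme value changes the range substantially, even though the other eight observations are unchanged.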

136
2. Interquartile range (IQR)
• Indicates the spread of the middle 50% of the observations, and is used with the median

IQR = Q3 - Q1

• Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.
IQR = 10.2 kg − 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.

137
The two quartiles (Q3 and Q1) form the basis of the box-and-whisker plot.
[Figure: box-and-whisker plots for Variables A, B, and C on a vertical scale from 0 to 10]

138
Properties of IQR:
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two
specific values
• It is important in selecting cut-off points in the
formulation of clinical standards
• Since it excludes the lowest and highest 25% values,
it is not affected by extreme values
• Less sensitive to the size of the sample

139
3. Quartile deviation (QD)

$$QD = \frac{Q_3 - Q_1}{2}$$

140
4. Coefficient of quartile deviation (CQD)

• $CQD = \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
• CQD is a unitless quantity and is useful for comparing the variability among the middle 50% of observations.
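The three quartile-based measures (IQR, QD, CQD) can be computed together. This sketch reuses the data values from the range example and relies on Python's `statistics.quantiles` with its default "exclusive" method, which places Q1 at the (n + 1)/4-th ordered value. Other quartile conventions exist and may give slightly different results, so treat the choice of method as an assumption.

```python
import statistics

values = [5, 9, 12, 16, 23, 34, 37, 42]  # data from the range example

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1                  # interquartile range
qd = (q3 - q1) / 2             # quartile deviation
cqd = (q3 - q1) / (q3 + q1)    # coefficient of quartile deviation (unitless)

print(f"Q1 = {q1}, Q3 = {q3}")
print(f"IQR = {iqr}, QD = {qd}, CQD = {cqd:.3f}")
```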

141
5. Mean deviation (MD)
• Mean deviation is the average of the absolute
deviations taken from a central value, generally
the mean or median.
• Consider a set of n observations $x_1, x_2, \ldots, x_n$. Then:

$$MD = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - A \right|$$

• ‘A’ is a central value (arithmetic mean or median).
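A minimal sketch of the mean deviation, taken once about the mean and once about the median. It reuses the data values from the range example; the function name is illustrative.

```python
import statistics

def mean_deviation(data, center):
    """Average of the absolute deviations from a chosen central value."""
    return sum(abs(x - center) for x in data) / len(data)

values = [5, 9, 12, 16, 23, 34, 37, 42]  # data from the range example

md_about_mean = mean_deviation(values, statistics.mean(values))
md_about_median = mean_deviation(values, statistics.median(values))

print(f"MD about the mean:   {md_about_mean}")
print(f"MD about the median: {md_about_median}")
```

For this particular data set the two versions happen to coincide; in general they differ.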
142
Properties of mean deviation:
• MD removes one main objection to the earlier measures, in that it involves every value

• It is not affected much by extreme values

• Its main drawback is that the algebraic signs of the deviations are ignored, which is mathematically unsound

143
6. Variance (σ², s²)
• The main objection of mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean.

• The variance is the average of the squares of the deviations taken from the mean.

144
• The deviations are squared because the sum of the deviations of the individual observations of a sample about the sample mean is always 0:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

• The variance can be thought of as an average of squared deviations

145
• Variance is used to measure the dispersion of
values relative to the mean.
• When values are close to their mean (narrow
range) the dispersion is less than when there
is scattering over a wide range.
– Population variance = σ²
– Sample variance = S²

146
a) Ungrouped data
• Let $X_1, X_2, \ldots, X_N$ be the measurements on N population units, then:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

is the population mean.

147
A sample variance is calculated for a sample of individual values ($X_1, X_2, \ldots, X_n$) and uses the sample mean $\bar{X}$ rather than the population mean µ:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

148
Degrees of freedom
• In computing the variance there are (n-1)
degrees of freedom because only (n-1) of the
deviations are independent from each other
• The last one can always be calculated from the
others automatically.
• This is because the sum of the deviations from
their mean (Xi-Mean) must add to zero.
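The two points above (deviations sum to zero, and the divisor n − 1) can be checked numerically. The sketch uses data set A from the dispersion example earlier in this lecture and compares the hand calculation with Python's `statistics.variance` (divisor n − 1) and `statistics.pvariance` (divisor n).

```python
import statistics

x = [177, 193, 195, 209, 226]  # data set A from the dispersion example
n = len(x)
x_bar = sum(x) / n

# Deviations from the mean; these always add up to zero.
deviations = [xi - x_bar for xi in x]
sum_of_squares = sum(d ** 2 for d in deviations)

sample_var = sum_of_squares / (n - 1)   # n - 1 degrees of freedom
divisor_n_var = sum_of_squares / n      # divisor n, as in the population formula

print("Sum of deviations:", sum(deviations))
print("Variance with divisor n - 1:", sample_var, "=", statistics.variance(x))
print("Variance with divisor n:    ", divisor_n_var, "=", statistics.pvariance(x))
```

Because the deviations must sum to zero, once n − 1 of them are known the last one is fixed, which is the reason for the n − 1 divisor.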

149
b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$

where
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
$\bar{x}$ = the sample mean
k = the number of class intervals
150
Properties of Variance:
• The main disadvantage of variance is that its unit is the square of the unit of the original measurement values
• The variance gives more weight to the extreme values as compared to those which are near to the mean value, because the difference is squared in variance.
• The drawbacks of variance are overcome by the standard deviation.

151
7. Standard deviation (σ, s)
• It is the square root of the variance.
• This produces a measure having the same scale as that of the individual values.

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$

152
• Following are the survival times of n=11
patients after heart transplant surgery.

• The survival time for the “ith” patient is represented as Xi for i = 1, …, 11.

• Calculate the sample variance and SD.

153
154
Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Class interval   mi     fi    (mi − Mean)   (mi − Mean)²   (mi − Mean)² fi
10-19            14.5     4     -19.88        395.2144       1580.8576
20-29            24.5    66      -9.88         97.6144       6442.5504
30-39            34.5    47       0.12          0.0144          0.6768
40-49            44.5    36      10.12        102.4144       3686.9184
50-59            54.5    12      20.12        404.8144       4857.7728
60-69            64.5     4      30.12        907.2144       3628.8576
Total                   169                  1907.2864      20197.6336

Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 − 1) = 120.22
SD = √S² = √120.22 = 10.96
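The grouped-data formula can be applied to the class mid-points and frequencies of this example with a few lines of Python. The sketch keeps full precision in the intermediate values, so the last digits may differ slightly from a hand calculation that rounds the mean first.

```python
import math

# Class mid-points and frequencies from the age example above.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
freqs = [4, 66, 47, 36, 12, 4]

n = sum(freqs)                                            # total frequency
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Sum of squared deviations of the mid-points, weighted by frequency.
sum_sq = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, freqs))

variance = sum_sq / (n - 1)   # divisor: sum of frequencies minus 1
sd = math.sqrt(variance)

print(f"n = {n}, mean = {grouped_mean:.2f}")
print(f"S^2 = {variance:.2f}, SD = {sd:.2f}")
```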

155
Properties of SD
• The SD has the advantage of being expressed in the
same units of measurement as the mean

• SD is considered to be the best measure of dispersion and is used widely because of the properties of the theoretical normal curve.

• However, if the units of measurement of the variables in two data sets are not the same, then their variability can’t be compared by comparing the values of SD.

156
SD Vs Standard Error (SE)
• SD describes the variability among individual values
in a given data set
• SE describes the variability among the sample means obtained from one sample to another

• We interpret the SE of the mean to mean that another similarly conducted study may give a mean that lies within the observed mean ± SE.
157
Standard Error
• SD is about the variability of individuals

• SE is used to describe the variability in the means of repeated samples taken from the same population.

• For example, imagine 5,000 samples, each of the same size n=11. This would
produce 5,000 sample means. This new collection has its own pattern of
variability. We describe this new pattern of variability using the SE, not the
SD.
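The thought experiment of 5,000 samples can be simulated. The sketch below assumes a hypothetical normal population (mean 100, SD 20), which is an arbitrary choice and not the heart transplant data. It compares the SD of the 5,000 sample means with the standard result SE = SD/√n.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the simulation is repeatable

POP_MEAN, POP_SD = 100.0, 20.0   # hypothetical normal population
n, n_samples = 11, 5000          # sample size and number of repeated samples

# Draw 5,000 samples of size n and keep the mean of each one.
sample_means = [
    statistics.mean([random.gauss(POP_MEAN, POP_SD) for _ in range(n)])
    for _ in range(n_samples)
]

empirical_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = POP_SD / math.sqrt(n)          # SD / sqrt(n)

print(f"SD of individual values:      {POP_SD:.2f}")
print(f"SD of the 5,000 sample means: {empirical_se:.2f}")
print(f"SD / sqrt(n):                 {theoretical_se:.2f}")
```

The sample means vary far less than the individual values, and their spread is close to SD/√n.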

158
Example: The heart transplant surgery
n=11, SD=168.89, Mean=161 days
• What happens if we repeat the study? What will our next mean
be? Will it be close? How different will it be? Focus here is on the
generalizability of the study findings.
• The behavior of mean from one replication of the study to the
next replication is referred to as the sampling distribution of
mean.
• We can also have a sampling distribution of the median or the SD

• Here SE = SD/√n = 168.89/√11 ≈ 50.9 days.

• We interpret this to mean that a similarly conducted study might produce an average survival time that is near 161 days, ±50.9 days.
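The ±50.9 days figure is consistent with the standard formula for the standard error of the mean, SE = SD/√n, applied to the summary values quoted in this example. A short sketch:

```python
import math

# Summary values quoted for the heart transplant example.
n = 11
sd = 168.89        # days
sample_mean = 161  # days

se = sd / math.sqrt(n)  # standard error of the mean

print(f"SE = {se:.1f} days")
print(f"Mean ± SE: {sample_mean - se:.1f} to {sample_mean + se:.1f} days")
```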

159
8. Coefficient of variation (CV)
• When two data sets have different units of
measurements, or their means differ
sufficiently in size, the CV should be used as
a measure of dispersion.
• It is the best measure for comparing the variability of two series or sets of observations.
• A data set with a smaller coefficient of variation is considered more consistent.
160
• CV is the ratio of the SD to the mean, multiplied by 100.

$$CV = \frac{S}{\bar{x}} \times 100$$

              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0

• “Cholesterol is more variable than systolic blood pressure”
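The comparison of systolic blood pressure and cholesterol can be reproduced with a small helper function. The SD and mean values are those quoted in the example; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (SD / mean) * 100. The units cancel out."""
    return sd / mean * 100

cv_sbp = coefficient_of_variation(15, 130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)  # cholesterol

print(f"CV for SBP:         {cv_sbp:.1f}%")
print(f"CV for cholesterol: {cv_cholesterol:.1f}%")
```

Because the CV is unitless, the two variables can be compared directly even though they are measured in different units.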

161
NOTE:
• The range often appears with the median as a
numerical summary measure
• The IQR is used with the median as well
• The SD is used with the mean
• For nominal and ordinal data, a table or graph
is often more effective than any numerical
summary measure

162
