
Introduction (Data Presentation & Summarization)

The document outlines a Biostatistics course offered by Aksum University, detailing its objectives, teaching methods, evaluation criteria, and the role of biostatistics in public health and medicine. It covers fundamental concepts such as types of statistics, variables, measurement scales, and data collection methods. The course aims to equip students with the skills to analyze and interpret numerical data relevant to health sciences.

Uploaded by

danuberh

Aksum University, CHS & CSH; School of Public Health

Department of Epidemiology and Biostatistics


Course Title: Biostatistics
Course No.: PuHe 601
Credit: 4 hrs
Target group: GMPH
Teaching methods: Lecture/discussion; individual/group assignment; seminars/PBL
Evaluation: Presentation of assignments, computer lab, and written exams

Course Instructor: Getachew M (MPH, Asst. Professor)

Course Objectives
At the end of this course, students will be able to:
Define statistics and biostatistics
Enumerate the importance and limitations of statistics
Define and identify the different types of variables and list why we need to classify variables
Identify the different methods of medical and biological data organization and presentation
Identify different data summarization techniques (measures of central tendency (MCT), dispersion, ...)

Introduction to Biostatistics
The discipline concerned with the treatment of numerical
data derived from groups of individuals (Armitage et
al., 2001).
Statistics: a field of study concerned with the
Collection,
Organization,
Analysis,
Summarization and
Interpretation of numerical data, and with the drawing of inferences about a body of data when only a small part of the data is observed from the population.

Statistical data: numerical descriptions of things. These descriptions may take the form of counts or measurements.

Introduction to Biostatistics……..
Biostatistics: When the data being analyzed are derived from
Public health,
The biological sciences and
Medicine,
we use the term biostatistics to distinguish this particular application of statistical tools and concepts.

Types of Biostatistics
1. Descriptive (exploratory) statistics: the collection, organization, presentation and summarization of data.
E.g. At our health centre, 50 patients were diagnosed with angina last year.
Some statistical summaries which are especially common in descriptive analyses are:
Measures of central tendency
Measures of dispersion
Measures of association
Cross-tabulation / contingency table
Histogram
Quantile, Q-Q plot, scatter plot, box plot

Types of Biostatistics……
2. Inferential statistics:
Consists of generalizing from samples to populations, performing hypothesis tests, determining relationships among variables, and making predictions.
Examples:
Principles of probability,
Estimation,
Confidence intervals,
Comparison of two or more means or proportions,
Hypothesis testing, etc.

Uses of biostatistics
Provide methods of organizing information
Assessment of health status
Health program evaluation
Resource allocation
Formulating and testing hypotheses
Estimating the magnitude of association
Assessing risk factors and cause-and-effect relationships
Drawing inferences (for prediction or projection)
Handling biological variation, among individuals as well as within the same individual over time
» Example: height, weight, blood pressure, eye color ...

Uses of biostatistics……
Collect reliable and unbiased data
Process and evaluate data rigorously
Interpret and draw appropriate conclusions
Essential for understanding, appraisal and critique of scientific literature
Public health and medicine are becoming increasingly quantitative.

Limitations of Biostatistics
It deals only with those subjects of inquiry that are capable of being quantitatively measured and numerically expressed.
It deals with aggregates of facts, and no importance is attached to individual items; it is suited only when group characteristics are to be studied.
Statistical data are only approximations and are not mathematically exact.

What does biostatistics cover?
Biostatistical thinking contributes to every step in a research project:
Research planning
Design
Execution (data collection)
Data processing
Data analysis
Presentation
Interpretation
Publication
The best way to learn about biostatistics is to follow the flow of a research project from inception to the final publication.
Variables
Variable: a characteristic, attribute or quantity that can be measured and that varies among the individuals under study, i.e. that assumes different values for different elements.
Some examples of variables include:
Diastolic blood pressure,
Heart rate,
Height,
Weight and
Stage of bladder cancer (mild, moderate, severe), to list some.
Variables….
Data: the values, observations, measurements, facts or figures of variables that describe an event in a given survey, census, experiment or any other study.
Statistical data: the numerical description of things (counts/measurements).
Statistical methods: methods that are used to collect, organize, analyze and interpret data.
Variable: a characteristic that takes on different values in different persons, places or things.
Data set: a collection of observations on a variable.
Random variable: a variable whose values are determined by chance.

Types of variables
Depending on the characteristic of the measurement, variables can be qualitative or quantitative.
Qualitative (categorical) variables
A variable or characteristic which cannot be measured in quantitative form, but can only be identified by name or category; that is, a variable that can be placed into distinct categories according to some characteristic or attribute.
For instance: place of birth, ethnic group, type of drug, stage of breast cancer (I, II, III or IV), degree of pain (low, moderate, severe or unbearable).
Types of variables……
The categories should be clear cut (not overlapping) and cover all the possibilities.
For example: sex (male or female), vital status (alive or dead), disease stage (depends on disease), ever smoked (yes or no).
Quantitative (numerical) variable
One that can be measured and expressed numerically. Quantitative variables can be of two types.
Discrete data
The values of a discrete variable are usually whole numbers, such as the number of episodes of diarrhoea in the first five years of life.
Observations can only take certain numerical values.
Characterized by gaps or interruptions in the values that it can assume.
Types of variables……
Numerical discrete data occur when the observations are integers that correspond to a count of some sort.
Some common examples are:
The number of bacterial colonies on a plate,
The number of cells within a prescribed area upon microscopic examination,
The number of heart beats within a specified time interval,
A mother's history of number of births (parity) and pregnancies (gravidity).
Types of variables……
Continuous data
A continuous variable is a measurement on a continuous scale.
Not characterized by gaps or interruptions.
Each observation theoretically falls somewhere along a continuum.
One is not restricted, in principle, to particular values such as the integers of the discrete scale.
Most clinical measurements, such as blood pressure, serum cholesterol level, height, weight, age etc., are on a numerical continuous scale.
SUMMARY: Types of variables
Variable
  Qualitative or categorical
    Nominal (not ordered), e.g. ethnic group
    Ordinal (ordered), e.g. response to treatment
  Quantitative measurement
    Discrete (count data), e.g. number of admissions
    Continuous (real-valued), e.g. height
Measurement scales
Scales of measurement
Measurement: the assignment of numbers to objects or events according to a set of rules.
Data come in various sizes and shapes, and it is important to know about these so that the proper analysis can be used on the data.
There are four levels at which we measure: nominal, ordinal, interval and ratio.
Nominal scales of measurement
It may be thought of as the "naming" level.
This level of measurement does not put subjects in any particular order.
There is no logical basis for saying one category is higher or lower than another.
In research activities a Yes/No scale is nominal.

Scales of measurement………..
The simplest data consist of unordered, dichotomous, or "either-or" types of observations, i.e., either the patient lives or the patient dies, either he has some particular attribute or he does not.
The nominal level of measurement classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Examples are: blood group, gender, religious affiliation.
Ordinal scales of measurement
An ordinal scale is next up the list in terms of power of measurement.
The simplest ordinal scale is a ranking.
Scales of measurement………..
At this level we put subjects in order from lowest to highest.
There is no objective distance between any two points on your subjective scale.
Hence, an ordinal scale only lets you interpret gross order and not the relative positional distances.
E.g. if we are told that third-year students have better knowledge than first-year students, we still do not know by how much they are better.
To measure the amount of the difference between subjects we need the next level of measurement.
Scales of measurement………..
Some examples under this scale of measurement include:
Not improved, improved, much improved
Academic status, job satisfaction index, employment status, response to treatment (none, slow, moderate, fast)
E.g. 1. Strongly agree
2. Agree
3. No opinion
4. Disagree
5. Strongly disagree
The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Scales of measurement………..
Interval scales of measurement
It is more powerful than the nominal and ordinal scales, as it not only orders, ranks or rates but also shows exact distances in between.
On interval measurement scales, one unit on the scale represents the same magnitude of the trait or characteristic being measured across the whole range of the scale.
Interval scales do not have a "true" zero point, however, and therefore it is not possible to make statements about how many times higher one score is than another.
The selected zero does not indicate a total absence of the quantity being measured.
Scales of measurement………..
A good example of an interval scale is the Fahrenheit scale for temperature.
Equal differences on this scale represent equal differences in temperature, but the scale is not a ratio scale.
Thus, a temperature of 30 degrees is not twice as warm as one of 15 degrees.
The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
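The point about interval scales can be checked numerically. The sketch below is only an illustration (the helper name `fahrenheit_to_celsius` is introduced here, not taken from the course): it converts the same readings to Celsius and shows that ratios change with the unit while equal differences stay equal.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

# Ratios are not preserved: 30 F is "twice" 15 F only in Fahrenheit units.
ratio_f = 30.0 / 15.0
ratio_c = fahrenheit_to_celsius(30.0) / fahrenheit_to_celsius(15.0)

# Differences are preserved up to a constant factor: equal gaps stay equal.
gap_low = fahrenheit_to_celsius(30.0) - fahrenheit_to_celsius(15.0)
gap_high = fahrenheit_to_celsius(45.0) - fahrenheit_to_celsius(30.0)

print(f"ratio in F: {ratio_f:.2f}, ratio in C: {ratio_c:.2f}")
print(f"two 15-degree F gaps expressed in C: {gap_low:.2f} and {gap_high:.2f}")
```

Because the zero of each temperature scale is arbitrary, the ratio depends on the unit chosen, which is why "twice as warm" has no meaning on an interval scale.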
Scales of measurement………..
Ratio scales of measurement
The highest level of measurement.
This has the properties of an interval scale together with a fixed origin or zero point.
Examples of variables which are ratio scaled include weights, lengths and times.

Scales of measurement………..
Ratio scales permit the researcher to compare both differences in scores and the relative magnitude of scores.
For instance, the difference between 5 and 10 minutes is the same as that between 10 and 15 minutes, and 10 minutes is twice as long as 5 minutes.
The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist between different units of measure.

Individual Assignment 1

Assignment 2

Response and Explanatory variables
A variable can also be either a
Response (dependent, outcome) variable or an
Explanatory (independent, predictor) variable.
Response (dependent, outcome) variables: variables which can be affected by the explanatory variables; they are the outcome of a study. A response variable is one you would be interested in predicting or forecasting.
Explanatory variables: any variables that explain the response variable.

Exercise
In a study to determine whether surgery or chemotherapy results in higher survival rates for a certain type of cancer,
Which variable is the explanatory variable? ____________________
Which one is the response variable? ____________________

Incidence of visceral leishmaniasis and its associated factors among sesame harvesters in the borders of Sudan and Eritrea.
Dependent variable: ____________________
Independent variable: ____________________
Descriptive Statistics

Data collection methods

Before any statistical work can be done, data must be collected.
Data collection is a crucial stage in the planning and implementation of a study.
Data collection techniques allow us to systematically collect data about our objects of study and about the setting in which they occur.

Data collection methods……
The choice of methods of data collection is based on:
The type of information to be collected from the source.
The accuracy of the information they will yield.
Practical considerations, such as the need for personnel, time, equipment and other facilities, in relation to what is available.

Data collection methods………….
The methods of collecting data may be broadly classified as:
Self-administered questionnaires
The use of documentary sources
Observation
Interviews
Tape recording
Filming, photography
Focus group discussion

Data collection methods………….
1) Observation
Observation is a technique that involves systematically selecting, watching and recording behaviors of people or other phenomena and aspects of the setting in which they occur, for the purpose of gaining specific information.
It includes all methods, from simple visual observations to the use of high-level machines and measurements and sophisticated equipment or facilities, such as
Radiographic,
Biochemical,
X-ray machines, CT scan, MRI,
Microscope,
Clinical examinations, and
Microbiological examinations.
Data collection methods………….
2) The use of documentary sources
Clinical records and other personal records, death certificates, published mortality statistics, census publications, etc.
Advantages
Documents can provide ready-made information relatively easily.
The best means of studying past events.
Disadvantages
Problems of reliability and validity (because the information is collected by a number of different persons who may have used different definitions or methods of obtaining data).
Data collection methods………….
3) Interviewing
It involves oral questioning of respondents, either individually or as a group.
Answers can be recorded by writing them down, by tape-recording the responses, or by a combination of the two.
Interviews can be conducted with varying degrees of flexibility (high vs low).
A) High degree of flexibility / unstructured:
Usually used when the researcher has little understanding of the problem.
Frequently applied in exploratory studies.
Data collection methods………….
B) Low degree of flexibility / highly structured interview:
Useful when the researcher is relatively knowledgeable about expected answers or when the number of respondents being interviewed is relatively large.
Questionnaires may be used with a fixed list of questions in a standard sequence, which have mainly fixed or pre-categorized answers.
Ways of interviewing participants: face to face and by telephone.
Data collection methods………….
4) Self-administered questionnaires
Written questions are presented that are to be answered by the respondents in written form.
The respondent reads the questions and fills in the answers by him/herself (sometimes in the presence of an interviewer who "stands by" to give assistance if necessary).
The use of self-administered questionnaires is simpler and cheaper. They can be administered to many persons simultaneously.

Data collection methods………….
5) Focus group discussion
A group of participants discuss the issue.
Mostly used for qualitative studies.
Needs more resources.

Data collection methods………….
Advantages of self-administered questionnaires
Less expensive; permits anonymity and may result in more honest responses; does not require research assistants; eliminates bias due to phrasing questions differently with different respondents.

Disadvantages of self-administered questionnaires
Cannot be used with illiterate respondents; there is often a low rate of response; questions may be misunderstood.

Data collection methods………….
Problems in gathering data
Common problems might include:
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion (mistrust)
Bias (any systematic error)
Cultural norms (e.g. which may prevent men from interviewing women)

Data collection methods………….
Types of questions
Depending on how questions are asked and recorded, we can distinguish two major possibilities:
Open-ended questions, and
Closed questions.
Open-ended questions
Open-ended questions permit free responses that should be recorded in the respondent's own words.
The respondent is not given any possible answers to choose from.
Data collection methods………….
Such questions are useful to obtain information on:
Facts with which the researcher is not very familiar,
Opinions, attitudes, and suggestions of informants, or
Sensitive issues.
For example:
Can you describe exactly what the traditional birth attendant did when your labor started?
What do you think are the reasons for the high drop-out rate of village health committee members?
What would you do if you noticed that your daughter (a schoolgirl) had a relationship with a teacher?

Data collection methods………….
Closed questions
Closed questions offer a list of possible options or answers from which the respondents must choose.
When designing closed questions one should try to offer a list of options that are exhaustive and mutually exclusive.
Closed questions are useful if the range of possible responses is known.

Data collection methods………….
For example:
What is your marital status?
1. Single
2. Married/living together
3. Separated
4. Divorced
5. Widowed
Have you ever gone to the local village health worker for treatment?
1. Yes
2. No

Methods of data organization and presentation

Data organization and presentation techniques

The data collected in a survey are called raw data.
Collected data need to be organized so as to condense the information they contain in a way that shows patterns of variation clearly.

Frequency Distributions
A table that lists each of the possible values that the data can assume, along with the number of times each value occurs.
Tables make it easier to see how the data are distributed.
The frequency is the count of the number of times that a particular value (or combination of values) occurred in a data set.
A frequency distribution can be grouped or ungrouped.
For nominal and ordinal (qualitative) data, frequency distributions are often used as a summary.
For discrete and continuous (quantitative) data, the values are grouped into non-overlapping intervals, usually of equal width.
Frequency distribution for qualitative data
Frequency distribution for quantitative data

Ungrouped Frequency Distribution

It is used to present a categorical variable in a simplified and easily understandable way.
This frequency table can be constructed by listing all possible categories of the variable and then counting the number of observations falling in each category as its frequency.

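The construction just described (list the categories, then count the observations in each) can be sketched in a few lines of Python. The blood-group observations below are made up for illustration and are not the course's data 1.

```python
from collections import Counter

# Hypothetical observations of a nominal variable (not the course data set).
blood_group = ["A", "O", "B", "O", "AB", "A", "O", "O", "B", "A"]

counts = Counter(blood_group)   # category -> frequency
n = len(blood_group)

print(f"{'Category':<10}{'Frequency':>10}{'Percent':>10}")
for category, freq in counts.most_common():
    print(f"{category:<10}{freq:>10}{100 * freq / n:>10.1f}")
print(f"{'Total':<10}{n:>10}{100.0:>10.1f}")
```

`Counter` does the tallying; the percent column is simply each frequency divided by the total number of observations.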
Example: The following ungrouped data (an ordered array of individual observations) are the current ages of 240 women (data 1).

Example: Consider the data collected on age at first marriage of 240 women (data 1). One of the variables in this dataset is the religion followed by the women. For this type of variable, we can use an ungrouped frequency distribution to summarize the data as follows:

How can an ungrouped frequency distribution be changed into a grouped frequency distribution?

Grouped Frequency Distribution
Presenting data as a grouped frequency distribution is not as simple as the ungrouped case: we first need to compute some values, given below.
Number of classes (K): the number of categories the table will have.
The number of classes can be estimated using Sturges' rule, and the class width follows from it:
$$K = 1 + 3.322\,\log_{10} n, \qquad W = \frac{L - S}{K}$$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value

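As a rough sketch of Sturges' rule and the class-width formula, the snippet below uses the values from the course's data 1 example (n = 240, largest value 49, smallest value 15). The function names are introduced here for illustration; rounding K to the nearest integer and rounding W up are assumptions, since the rule itself is only a guide.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes K = 1 + 3.322 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))

def class_width(largest: float, smallest: float, k: int) -> int:
    """Class width W = (L - S) / K, rounded up so the classes cover the range."""
    return math.ceil((largest - smallest) / k)

k = sturges_classes(240)
w = class_width(49, 15, k)
print(f"K = {k}, W = {w}")
```

Rounding the width up rather than down is a deliberate choice here: K classes of width W then always span at least the observed range.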
Grouped Frequency Distrib……
Class limits: the smallest and largest values that can go into any class; each class has a lower and an upper class limit.
Lower class limit: the smallest observation of the class.
Upper class limit: the lower class limit plus the width of the class minus one.
When forming classes, always make sure that each item (measurement or observation) goes into one and only one class, i.e. classes should be mutually exclusive (successive classes have no values in common).
Note that Sturges' rule should not be regarded as final, but should be considered as a guide only.

Grouped Frequency cont’d…
Class boundaries / true limits: limits that are determined mathematically to make the intervals of a continuous variable continuous in both directions, so that no gap exists between classes. They are obtained by subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit.
Each class has two boundaries:
Lower class boundary
Upper class boundary

Grouped Frequency cont’d…
Class mark / mid-point (Xc) of an interval: the value which lies midway between the lower true limit (LTL) and the upper true limit (UTL) of a class.
It is calculated as the average of the lower and upper class limits.

Grouped Frequency cont’d…
Example for data 1
The number of classes (K) can be computed using Sturges' rule as:
$$K = 1 + 3.322\,\log_{10}(240) \approx 8.9 \approx 9$$
Therefore, the width W of each class can be computed as:
$$W = \frac{\text{largest observation} - \text{smallest observation}}{K} = \frac{49 - 15}{9} \approx 3.8 \approx 4$$
Thus the width of each class can be 4, and the lower class limit for the first class will be the minimum observation in the dataset.
Example for data 1
Class limit   Class boundary   Class mark   Frequency   RF (%)   CF
15-18         14.5-18.5        16.5         15           6.25     15
19-22         18.5-22.5        20.5         49          20.42     64
23-26         22.5-26.5        24.5         51          21.25    115
27-30         26.5-30.5        28.5         40          16.67    155
31-34         30.5-34.5        32.5         21           8.75    176
35-38         34.5-38.5        36.5         22           9.17    198
39-42         38.5-42.5        40.5         18           7.50    216
43-46         42.5-46.5        44.5         15           6.25    231
47-50         46.5-50.5        48.5          9           3.75    240
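The derived columns of a grouped frequency table follow mechanically from the class limits and frequencies. The sketch below recomputes them for the classes and frequencies listed above; it assumes whole-number data, so 0.5 is used to form the boundaries, and the variable names are illustrative only.

```python
from itertools import accumulate

# Class limits and frequencies copied from the data 1 table.
limits = [(15, 18), (19, 22), (23, 26), (27, 30), (31, 34),
          (35, 38), (39, 42), (43, 46), (47, 50)]
freqs = [15, 49, 51, 40, 21, 22, 18, 15, 9]

n = sum(freqs)
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]   # true limits
marks = [(lo + hi) / 2 for lo, hi in limits]               # class marks
rel_freqs = [100 * f / n for f in freqs]                   # RF in percent
cum_freqs = list(accumulate(freqs))                        # CF

rows = zip(limits, boundaries, marks, freqs, rel_freqs, cum_freqs)
for (lo, hi), (lb, ub), mark, f, rf, cf in rows:
    print(f"{lo}-{hi}  {lb}-{ub}  {mark:5.1f}  {f:3d}  {rf:6.2f}  {cf:3d}")
```

Recomputing the columns this way is a quick check on a hand-built table: the last cumulative frequency must equal n, and each class mark must lie midway between its class limits.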
Grouped Frequency cont’d…
Note that the value to be added to or subtracted from the class limits to get class boundaries depends on the number of decimal places in the dataset that we want to summarize.
The width of a class is found from the true class limits by subtracting the lower true limit from the upper true limit of any particular class.
For example, the width of the above distribution is (taking the fourth class) w = 30.5 - 26.5 = 4.
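The note on class boundaries can be sketched as follows. The function below is hypothetical (introduced here, not part of the course material) and assumes the adjustment is half of the smallest recorded unit: 0.5 for whole numbers, 0.05 for data recorded to one decimal place, and so on.

```python
def class_boundaries(lower: float, upper: float, decimals: int = 0) -> tuple[float, float]:
    """Return (lower boundary, upper boundary) for data recorded to `decimals` places."""
    half_unit = 0.5 * 10 ** (-decimals)
    # Round to one extra decimal place to avoid floating-point noise.
    return (round(lower - half_unit, decimals + 1),
            round(upper + half_unit, decimals + 1))

# Whole-number data: the fourth class of the data 1 table.
lb, ub = class_boundaries(27, 30)
width = ub - lb
print(lb, ub, width)

# Data recorded to one decimal place (hypothetical class 2.5-2.9).
print(class_boundaries(2.5, 2.9, decimals=1))
```

The width computed from the boundaries (4 here) matches the class width used to build the table, which is the check the text describes.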

Cumulative frequency: the total obtained when the frequencies of a class and of all the classes below it are added.
Cumulative relative frequency: the percentage of the total number of observations that have a value either in that interval or below it.

Statistical Tables
A statistical table is an orderly and systematic presentation of data in rows and columns.
Based on the purpose for which the table is designed and the complexity of the relationship, a table can be either a simple frequency table or a cross-tabulation.
A simple frequency table is used when the individual observations involve only a single variable.
A cross-tabulation is used to obtain the frequency distribution of one variable within the subsets of another variable.

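To make the distinction concrete, here is a minimal sketch that builds both kinds of table from the same records. The (sex, ever smoked) records are invented for illustration; only the standard library is used.

```python
from collections import Counter

# Hypothetical (sex, ever smoked) records, made up for illustration.
records = [("male", "yes"), ("female", "no"), ("male", "no"), ("female", "no"),
           ("male", "yes"), ("female", "yes"), ("male", "no"), ("female", "no")]

# Simple frequency table: a single variable.
sex_counts = Counter(sex for sex, _ in records)

# Cross-tabulation: frequency of one variable within each category of another.
cells = Counter(records)
rows = sorted({sex for sex, _ in records})
cols = sorted({smoked for _, smoked in records})

print(f"{'':<8}" + "".join(f"{c:>6}" for c in cols) + f"{'Total':>7}")
for r in rows:
    row = [cells[(r, c)] for c in cols]
    print(f"{r:<8}" + "".join(f"{v:>6}" for v in row) + f"{sum(row):>7}")
```

Each cell of the cross-tabulation counts the records that share one category of each variable, and the row totals reproduce the simple frequency table for sex.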
Statistical Tables….
Construction of tables
There are no hard and fast rules to follow, but the following general principles should be addressed in constructing tables.
Tables should be as simple as possible.
Tables should be self-explanatory:
The title should be clear and to the point (a good title answers: what? when? where? how classified?) and it should be placed above the table.
Each row and column should be labeled.

Statistical Tables….
Numerical entries of zero should be explicitly written rather than indicated by a dash. Dashes are reserved for missing or unobserved data.
If data are not original, their source should be given in a footnote.
1) One-variable / simple frequency table: the most basic table is a simple frequency distribution with one variable.

ICU type    Frequency (how often)    Relative frequency (proportionately often)
Medical     12                       0.48
Surgical     6                       0.24
Cardiac      5                       0.20
Other        2                       0.08
Total       25                       1.00

Statistical Tables….
• Clinical symptoms among 54 patients with S. Typhimurium infection, Oslo, Norway, May 1998

Two-variable table
Table 1. Cases of Salmonella Typhimurium infection by age group and sex, Herøy, Norway, 1999

Three-variable table

Composite / higher-order table
A large table combining several separate variables/tables. Age, sex and other demographic variables may be combined to form a single table.

Common form: the two-by-two table
It is a special form of table, a favorite among epidemiologists.
It is used to examine whether there is a relationship between two variables.

Graphical Presentation
Graphs are often easier to interpret than tables, perhaps at the expense of detail.
A variety of graphs are used depending on the type of data.
If we want to present categorical (qualitative) or discrete quantitative data using a graph, then a pie chart or bar chart is appropriate.
However, if the variable is continuous quantitative in nature, then we can use a histogram, frequency polygon, cumulative frequency curve, box plot, ...

Graphical Presentation……
Construction of a graph
Every graph should be self-explanatory and as simple as possible.
Titles are usually placed below the graph and should again answer questions like: what? where? when? how classified?
Legends or keys should be used to differentiate variables if more than one is shown.
The units into which the scale is divided should be clearly indicated.
The numerical scale representing frequency must start at zero, or a break in the line should be shown.
Graphical Presentation……
Importance of diagrammatic representation:
They give a quick overall impression of the data.
They have greater memorizing value than mere figures.
They are used to understand patterns and trends.

Specific types of graphs include:
For nominal and ordinal data: bar graph, pie chart
For quantitative data: histogram, stem-and-leaf plot, box plot, scatter plot, line graph
Others
Graphical Presentation……
Bar charts
Categories are listed on the horizontal axis (X-axis).
Frequencies or relative frequencies are represented on the Y-axis (ordinate).
Are used to represent and compare the frequency distribution of discrete variables and attributes or categorical series.
Each category of the variable is represented by a bar.
Variables are categorical, or treated as qualitative.
Bars can be displayed horizontally or vertically.
Types of bar charts
A. Simple bar chart: shows one variable; it can be displayed horizontally or vertically.

Graphical Presentation……
B. Grouped bar chart
Shows data from a table of two or more variables.
Distinct colours or shading are used to differentiate the groups; a legend is necessary.
C. Stacked bar chart
It is used to show the same data as a grouped bar chart using a single bar per category.

Graphical Presentation……
D. 100% component bar chart
It is a variant of the stacked bar chart in which bars are scaled to 100% rather than to their real values.
It is helpful for comparing the contribution of different subgroups within the categories of the main variable.

Graphical Presentation……
Pie chart
It is a circle divided into sectors so that the areas of the sectors are proportional to the frequencies.
It is split into segments to show percentages or the relative contributions of categories of data.
It is a good method of representation if you wish to compare a part of a group with the whole group.
The number of categories should not be too large.
Used for a single categorical variable.
Uses percentage distributions.
Constructed by converting frequencies to percentages and then to degrees.
Example: Distribution of deaths for females, in England and Wales, 1989.
Cause of death          No. of deaths
Circulatory system      100 000
Neoplasm                 70 000
Respiratory system       30 000
Injury and poisoning      6 000
Digestive system         10 000
Others                   20 000
Total                   236 000

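The conversion from frequency to percentage to degrees can be sketched with the death counts from the example above. The dictionary and variable names are illustrative; the sector angle is simply the category's share of 360 degrees.

```python
# Death counts copied from the example table (females, England and Wales, 1989).
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())
percent = {cause: 100 * n / total for cause, n in deaths.items()}
degrees = {cause: 360 * n / total for cause, n in deaths.items()}

for cause in deaths:
    print(f"{cause:<22}{percent[cause]:6.1f}%{degrees[cause]:8.1f} deg")
```

The percentages sum to 100 and the angles to 360, which is a useful check before drawing the chart.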
[Pie chart: Distribution of cause of death for females, in England and Wales, 1989. Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Others 8%, Digestive system 4%, Injury and poisoning 3%.]

Graphical Presentation……
Histogram: the graph of the frequency distribution of a continuous measurement variable.
It is constructed on the basis of the following principles:
The horizontal axis is a continuous scale running from one extreme end of the distribution to the other.
To construct a histogram, we mark the interval boundaries on the horizontal axis and the frequencies on the vertical axis.
The area of each bar is proportional to the frequency of observations in the interval.
Non-overlapping intervals that cover all of the data values must be used.
Graphical Presentation……
The base of each rectangle lies on the horizontal axis, extending from one class boundary of the class to the other.
There will never be any gap between the histogram rectangles.
The bases of all rectangles are determined by the width of the class intervals.
In constructing a histogram:
– Use equal class intervals
– Do not use scale breaks

Example: Distribution of the age of women at the time of marriage
Age group   15-19   20-24   25-29   30-34   35-39   40-44   45-49
Number      11      36      28      13      7       3       2
[Histogram: Age of women at the time of marriage. X-axis: age group (class boundaries 14.5-19.5 to 44.5-49.5); Y-axis: no. of women (0 to 40).]
Graphical Presentation……
Frequency polygon
If we join the midpoints of the tops of the adjacent rectangles of the histogram with line segments, a frequency polygon is obtained.
The scale should be marked in the numerical values of the midpoints of the intervals.
Erect ordinates on the midpoints of the intervals, the length or altitude of an ordinate representing the frequency of the class on whose midpoint it is erected.
Join the tops of the ordinates and extend the connecting lines to the scale of sizes.

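The steps for a frequency polygon can be sketched as a list of (midpoint, frequency) points. The class limits and counts below are those of the age-at-marriage example from the histogram section. Closing the polygon by adding an empty class at each end is one common convention and is an assumption here.

```python
# Class limits and counts from the age-at-marriage example.
limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
counts = [11, 36, 28, 13, 7, 3, 2]

midpoints = [(lo + hi) / 2 for lo, hi in limits]
width = midpoints[1] - midpoints[0]

# Close the polygon on the horizontal axis with an empty class at each end.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + counts + [0]
points = list(zip(xs, ys))

for x, y in points:
    print(f"midpoint {x:5.1f}  frequency {y}")
```

Plotting these points in order and joining them with straight lines gives the polygon; any plotting library could be used for that step.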
Figure: Frequency polygon for the ages of 2087 mothers with <5 children, Adami Tulu, 2003 (horizontal axis: mother's age, variable N1AGEMOTH, 15 to 55; vertical axis: frequency; Mean = 27.6, Std. Dev = 6.13, N = 2087).
Ogive or cumulative frequency curve
To construct an ogive:
Compute the cumulative frequencies of the distribution, then plot them as a graph.
Prepare a graph with the cumulative frequency on the vertical axis and the true upper class limits (class boundaries) of the intervals scaled along the horizontal axis (X-axis).
Cumulative Frequency and Cumulative Relative Frequency of Age of 25 ICU Patients

Age interval   Frequency   Relative        Cumulative   Cumulative relative
                           frequency (%)   frequency    frequency (%)
10-19              3           12               3            12
20-29              1            4               4            16
30-39              3           12               7            28
40-49              0            0               7            28
50-59              6           24              13            52
60-69              1            4              14            56
70-79              9           36              23            92
80-89              2            8              25           100
Total             25          100
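The cumulative columns of the ICU table can be reproduced with a short Python sketch. It also lists the points an ogive would pass through, assuming ages are recorded in whole years so that each upper class boundary is the upper limit plus 0.5; the variable names are illustrative.

```python
from itertools import accumulate

# Age intervals (class limits) and frequencies of the 25 ICU patients.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49),
                (50, 59), (60, 69), (70, 79), (80, 89)]
frequencies = [3, 1, 3, 0, 6, 1, 9, 2]

total = sum(frequencies)
relative = [100 * f / total for f in frequencies]
cumulative = list(accumulate(frequencies))
cumulative_relative = [100 * c / total for c in cumulative]

# Ogive points: (upper class boundary, cumulative frequency).
ogive_points = [(upper + 0.5, c) for (_, upper), c in zip(class_limits, cumulative)]

for (lower, upper), f, r, c, cr in zip(
    class_limits, frequencies, relative, cumulative, cumulative_relative
):
    print(f"{lower}-{upper}: f={f}, rel={r:.0f}%, cum={c}, cum rel={cr:.0f}%")
```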
Figure: Cumulative frequency of 25 ICU patients
Line graph
Useful for assessing the trend of a particular situation over time.
Helps in monitoring the trend of epidemics.
Time, in weeks, months or years, is marked along the horizontal axis, and
values of the quantity being studied are marked on the vertical axis.
Figure: No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003 (line graph; horizontal axis: months, Jan to Dec; vertical axis: no. of confirmed malaria cases; lines: Positive, P. falciparum, P. vivax).
Figure: Maternal mortality ratio (MMR) per 100,000 live births by age of woman; Giza, Egypt, 1984 (line graph; horizontal axis: age group, 15-19 to 45-49; vertical axis: MMR per 100,000 live births).
Data Summarization
Measures of Central Tendency (location)
One type of measure useful for summarizing data defines the center, or middle, of the sample.
These measures of central tendency include:
Mean,
Median and
Mode.
A measure of central tendency (MCT) is considered good or satisfactory if it has the following characteristics:
– It should be based on all the observations
– It should not be affected by extreme values
– It should be as close to the maximum number of values as possible
– It should have a definite value
– It should not require complicated and tedious calculations
– It should be capable of further algebraic treatment
– It should be stable with regard to sampling
Arithmetic Mean / Simple Mean ($\bar{X}$)
Definition: the arithmetic mean is the sum of all observations divided by the number of observations. It is usually denoted by $\bar{X}$.

The mean for ungrouped data
Let X1, X2, ..., XN be the N measurements obtained from N subjects. Then the mean of these ungrouped measurements is defined as:
$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \dots + X_N}{N}$
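Mean for grouped data
For data grouped into k classes, the mean is computed from the class midpoints $m_i$ and the class frequencies $f_i$:
$\bar{X} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i}$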
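A minimal Python sketch of the grouped-data mean, assuming the usual midpoint approximation (the sum of frequency times class midpoint, divided by the total frequency). The function name `grouped_mean` is illustrative, and the data are the age-at-marriage frequencies from the histogram example rather than the table for Example 3.

```python
def grouped_mean(class_limits, frequencies):
    """Approximate the mean of grouped data using class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    total = sum(frequencies)
    return sum(f * m for f, m in zip(frequencies, midpoints)) / total


# Age of women at the time of marriage (class limits and frequencies).
class_limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [11, 36, 28, 13, 7, 3, 2]

mean_age = grouped_mean(class_limits, frequencies)
print(f"Grouped mean age at marriage: {mean_age:.1f} years")
```

The result is only an approximation of the true mean, because every observation in a class is treated as if it sat at the class midpoint.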
Properties of the arithmetic mean
For a given set of data there is one and only one arithmetic mean (uniqueness).
It is easy to calculate and understand (simplicity).
It is influenced by each and every value in the data set.
Individual extreme values (also known as 'outliers') can distort its ability to represent the typical value of a variable; this is the main weakness of the mean.
Example 1
Consider the following data on the birth weight (in kilograms) of 10 newborn children at Aksum Saint Marry public hospital: 2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43. The average birth weight can be computed as:
Solution:
$\bar{X} = \frac{2.51 + 3.01 + 3.25 + 2.02 + 1.98 + 2.33 + 2.33 + 2.98 + 2.88 + 2.43}{10} = \frac{25.72}{10} = 2.572$
Therefore, the mean birth weight of the infants is 2.572 kg.
Example 2
The heart rates of n = 10 patients admitted for further investigation were as follows (beats per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean of the heart rates of these patients?
Solution:
$\bar{X} = \frac{167 + 120 + 150 + 125 + 150 + 140 + 40 + 136 + 120 + 150}{10} = \frac{1298}{10} = 129.8$
The arithmetic mean heart rate of the patients is 129.8 beats/minute.
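The two ungrouped examples can be verified with Python's standard `statistics` module. The last lines are an added illustration, not part of the slides: they drop the extreme heart rate of 40 to show how a single outlier pulls the mean.

```python
from statistics import mean

birth_weights_kg = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]
heart_rates_bpm = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

mean_weight = mean(birth_weights_kg)
mean_rate = mean(heart_rates_bpm)
print(f"Mean birth weight: {mean_weight:.3f} kg")
print(f"Mean heart rate:   {mean_rate:.1f} beats/minute")

# Illustration only: remove the extreme value (40) and recompute.
without_outlier = [x for x in heart_rates_bpm if x != 40]
mean_rate_trimmed = mean(without_outlier)
print(f"Mean heart rate without the value 40: {mean_rate_trimmed:.1f} beats/minute")
```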
Example 3
Compute the mean for the grouped frequency distribution of the current age of women.
Solution:
Therefore, the mean age of the women is 29 years.
The Median
The median is a value such that at least half of the observations are less than or equal to it and at least half of the observations are greater than or equal to it.
The median is the midpoint of the ordered data array.
If the number of values is odd, the median is the middle value when all values are arranged in order of magnitude.
When the number of observations is even, there is no single middle value but two middle observations, and the median is the average of these two.
Example: Consider the data on the weight (in kg) of 10 newborn children at Saint Marry hospital within a month: 2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43.
Find the median of the data.
Solution:
First arrange the data in ascending order:
1.98, 2.02, 2.33, 2.33, 2.43, 2.51, 2.88, 2.98, 3.01, 3.25.
As n = 10 is even, we take the two middle observations (the 5th and 6th), and the median is the average of these two observations:
Median $= \frac{2.43 + 2.51}{2} = 2.47$
Therefore, the median birth weight is 2.47 kg.
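The odd/even rule for the median can be sketched directly in Python and compared against the standard library. The function name `manual_median` is an illustrative assumption.

```python
from statistics import median


def manual_median(values):
    """Median by sorting: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


birth_weights_kg = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]

median_weight = manual_median(birth_weights_kg)
print(f"Median birth weight: {median_weight:.2f} kg")
print(f"statistics.median:   {median(birth_weights_kg):.2f} kg")
```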
Median for grouped data:
The median for grouped data is defined by:
$\tilde{x} = LCB + \left( \frac{\frac{n}{2} - F_C}{f_C} \right) W$
Where:
LCB = Lower Class Boundary of the median class
FC = Cumulative Frequency just before the median class
fC = Frequency of the median class
W = Class Width and
n = number of observations.
Example: Median for grouped data
Consider the example on the age of women presented as a frequency distribution, and compute the median for the grouped data.
To compute the median for grouped data, we first need to find the median class.
In this example, half of the observations is n/2 = 240/2 = 120.
Let us look at the distribution together with its cumulative frequencies:
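As we can see from the distribution, the first class whose cumulative frequency reaches 120 is the class with cumulative frequency 155, since the cumulative frequency just before it is 115, which is below 120. So the median class is the 4th class.
Solution:
LCB = 26.5
FC = 115
fC = 40
W = 4
n = 240
$\tilde{x} = LCB + \left( \frac{\frac{n}{2} - F_C}{f_C} \right) W = 26.5 + \left( \frac{120 - 115}{40} \right) \times 4 = 26.5 + 0.5 = 27.0$
Therefore, the median age of the women is 27 years.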
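The grouped-median calculation can be written as a small Python function. This is a sketch of the interpolation formula with the values listed in the solution; the function and parameter names are illustrative.

```python
def grouped_median(lcb, n, cum_freq_before, class_freq, width):
    """Interpolate the median inside the median class.

    lcb             -- lower class boundary of the median class
    n               -- total number of observations
    cum_freq_before -- cumulative frequency just before the median class
    class_freq      -- frequency of the median class
    width           -- class width
    """
    return lcb + ((n / 2 - cum_freq_before) / class_freq) * width


median_age = grouped_median(lcb=26.5, n=240, cum_freq_before=115, class_freq=40, width=4)
print(f"Median age: {median_age} years")
```

The interpolation assumes that observations are spread evenly within the median class.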
Properties of the median
Extreme values do NOT affect the median, making the median a good alternative to the mean as a measure of central tendency when such values occur.
There is only one median for a given set of data (uniqueness).
The median is easy to calculate.
The median can be calculated even in the case of open-ended intervals.
It is determined mainly by the middle points and is less sensitive to the remaining data points (weakness).
The Mode
The mode is the value that appears most frequently.
It can be obtained by counting the number of times each observation appears in the list.
It is important for summarizing nominal/categorical types of data.
It is not influenced by extreme values.
It is possible to have more than one mode or no mode (drawback).
It is not a good summary of the majority of the data.
NB: The mode for grouped data is the modal class. The modal class is the class with the largest frequency.
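A brief Python sketch of finding the mode by counting, using `statistics.multimode` (Python 3.8+), which returns every most-frequent value and so also handles data with more than one mode. The two-mode list is a made-up illustration; the modal class uses the age-at-marriage frequencies from the histogram example.

```python
from statistics import multimode

birth_weights_kg = [2.51, 3.01, 3.25, 2.02, 1.98, 2.33, 2.33, 2.98, 2.88, 2.43]
heart_rates_bpm = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

weight_modes = multimode(birth_weights_kg)   # 2.33 appears twice
rate_modes = multimode(heart_rates_bpm)      # 150 appears three times
two_modes = multimode([1, 1, 2, 2, 3])       # hypothetical bimodal data

# Modal class of grouped data: the class with the largest frequency.
classes = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
frequencies = [11, 36, 28, 13, 7, 3, 2]
modal_class = classes[frequencies.index(max(frequencies))]

print(weight_modes, rate_modes, two_modes, modal_class)
```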
Group Assignment
Using arbitrary data, explain the statistical terms listed below and present them to your group/classmates as a PowerPoint presentation:
Geometric Mean (GM)
Harmonic Mean (HM)
Weighted Mean (WM)
Quantiles
Percentiles/Quartiles
Range
Interquartile Range (IQR)
Box and Whisker Plot
Outliers
Skewness
If extremely low or extremely high observations are present in a distribution, the mean tends to shift towards those values.
Based on the type of skewness, distributions can be:
Symmetrical distribution: the data values are evenly distributed on both sides of the three measures of central tendency (mean, median and mode).
It is neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.
If the distribution is symmetric and has only one mode, all three measures are the same, an example being the normal distribution:
Mean = Median = Mode
Positively skewed distribution:
Occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.
For positively skewed distributions (where the upper, or right, tail of the distribution is longer than the lower, or left, tail) the measures are ordered as follows:
Mode < median < mean.
Negatively skewed distribution:
Occurs when the majority of scores are at the right end of the curve and a few extremely small scores are scattered at the left end.
For negatively skewed distributions (where the left tail of the distribution is longer than the right tail), the reverse ordering occurs:
Mean < median < mode.
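The ordering of mean, median and mode gives a quick, rough indication of the direction of skewness. A minimal Python sketch, assuming data with a single mode; the function name `skew_direction` and the small positively skewed and symmetric lists are illustrative assumptions. The heart-rate data from Example 2, with its extreme low value of 40, come out negatively skewed.

```python
from statistics import mean, median, mode


def skew_direction(values):
    """Classify skewness from the ordering of mean, median and mode."""
    m, md, mo = mean(values), median(values), mode(values)
    if m < md < mo:
        return "negatively skewed"
    if mo < md < m:
        return "positively skewed"
    if m == md == mo:
        return "symmetric"
    return "ordering inconclusive"


heart_rates_bpm = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]
right_tailed = [1, 1, 1, 2, 2, 3, 10]   # hypothetical data with one large value
balanced = [1, 2, 2, 3]                 # hypothetical symmetric data

print(skew_direction(heart_rates_bpm))
print(skew_direction(right_tailed))
print(skew_direction(balanced))
```

This rule of thumb can fail for small or multimodal data sets, so it complements a histogram rather than replacing one.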
Measurement of Variation
Measures of variation describe the degree of variability of the data points relative to a measure of central tendency (e.g. SD, CV, mean deviation).
Measures of dispersion or variability give us information about the spread of the scores in our distribution.
Without knowing something about how the data are dispersed, measures of central tendency may be misleading.
The most common measures of dispersion include:
Range,
Inter-quartile range,
Variance,
Standard deviation and standard error,
Coefficient of variation
Consider the following three datasets:
Dataset 1: 7, 7, 7, 7, 7, 7; mean = 7, s.d. = 0
Dataset 2: 6, 7, 7, 7, 7, 8; mean = 7, s.d. = 0.63
Dataset 3: 3, 2, 7, 8, 9, 13; mean = 7, s.d. = 4.05
What do you observe from datasets 1, 2 and 3?
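The three datasets share the same mean but differ in spread, which a short Python sketch makes explicit. It uses the sample standard deviation (divisor n - 1) from the standard `statistics` module.

```python
from statistics import mean, stdev

datasets = {
    "Dataset 1": [7, 7, 7, 7, 7, 7],
    "Dataset 2": [6, 7, 7, 7, 7, 8],
    "Dataset 3": [3, 2, 7, 8, 9, 13],
}

summary = {}
for name, values in datasets.items():
    # Same centre for every dataset, but a different spread.
    summary[name] = (mean(values), stdev(values))
    print(f"{name}: mean = {summary[name][0]}, s.d. = {summary[name][1]:.2f}")
```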
Group Discussion
Standard Deviation vs Standard Error?
The Variance
Variance measures how far, on average, scores deviate or differ from the mean.
Variance is the average of the squared distances of each value from the mean.
Mathematically, the population variance is defined as:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
where $\mu$ is the population mean and N is the population size.
The sample variance is defined as:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Following are the survival times of n = 11 patients after heart transplant surgery.
The survival time for the “ith” patient is represented as Xi for i = 1, …, 11.
Calculate the sample variance and SD.
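Variance for grouped frequency distribution
In a grouped frequency distribution, the variance is computed as:
$S^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{X})^2}{n - 1}, \quad n = \sum_{i=1}^{k} f_i$
where $m_i$ is the midpoint and $f_i$ the frequency of the i-th class, and $\bar{X}$ is the mean of the grouped data.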
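A minimal Python sketch of the variance of grouped data, assuming the midpoint approximation with divisor n - 1 (squared deviations of class midpoints from the grouped mean, weighted by class frequency). The function name is illustrative, and the data are the age-at-marriage frequencies from the histogram example, not the leisure-time data of the next example.

```python
from math import sqrt


def grouped_variance(class_limits, frequencies):
    """Sample variance of grouped data, using class midpoints."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    sum_sq = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))
    return sum_sq / (n - 1)


class_limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [11, 36, 28, 13, 7, 3, 2]

variance_age = grouped_variance(class_limits, frequencies)
sd_age = sqrt(variance_age)
print(f"Variance = {variance_age:.2f} years^2, SD = {sd_age:.2f} years")
```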
Example of variance for grouped frequency distribution
Consider the following data on the time spent by college students on leisure activities. Compute the standard deviation.
Standard Error (SE): describes the variability among the sample means obtained from one sample to another, that is, the variability in the means of repeated samples taken from the same population.
For example, imagine 5,000 samples, each of the same size n = 11. This would produce 5,000 sample means. This new collection has its own pattern of variability. We describe this new pattern of variability using the SE, not the SD.
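The thought experiment with 5,000 samples of size n = 11 can be simulated in Python. This is a sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 50 and SD of 10 (values not from the slides), and the comparison uses the standard result that the SE of the mean equals the population SD divided by the square root of n.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is reproducible

POP_MEAN, POP_SD = 50.0, 10.0        # hypothetical population parameters
N_SAMPLES, SAMPLE_SIZE = 5000, 11    # as in the example above

# Draw repeated samples from the same population and keep each sample mean.
sample_means = [
    mean([random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

empirical_se = stdev(sample_means)            # spread of the 5,000 sample means
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)   # SD / sqrt(n)

print(f"SD of individual values:       {POP_SD:.2f}")
print(f"SD of sample means (SE):       {empirical_se:.2f}")
print(f"Theoretical SE = SD / sqrt(n): {theoretical_se:.2f}")
```

The sample means vary much less than the individual values, which is why the SE is smaller than the SD.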
The Standard Deviation
The sample and population standard deviations are denoted (by convention) by S and σ, respectively.
The standard deviation, SD, is just the positive square root of the variance.
This produces a measure having the same scale as that of the individual values.
It expresses exactly the same information as the variance, but re-scaled to be in the same units as the mean.
Mathematically, the population standard deviation is given by:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
The sample standard deviation is defined as:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Example 1: The areas of sprayable surfaces with DDT from a sample of 15 houses were measured as follows (in m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145
Find the variance and standard deviation of the above distribution.
Solution:
The mean of the sample is 125 m².
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(101-125)^2 + (105-125)^2 + \dots + (145-125)^2}{15 - 1} = \frac{2502}{14} = 178.71\ \text{m}^4$
Hence, the standard deviation is $s = \sqrt{178.71} = 13.37\ \text{m}^2$.
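The sprayable-surface example can be checked with the standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) divisor. The manual sum of squares is included to mirror the hand calculation.

```python
from statistics import mean, stdev, variance

areas_m2 = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

mean_area = mean(areas_m2)
sum_of_squares = sum((x - mean_area) ** 2 for x in areas_m2)
sample_variance = variance(areas_m2)   # sum_of_squares / (n - 1)
sample_sd = stdev(areas_m2)            # square root of the sample variance

print(f"Mean = {mean_area}, sum of squared deviations = {sum_of_squares}")
print(f"Variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
```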
Properties of SD
The SD has the advantage of being expressed in the same units of measurement as the mean.
The SD is considered to be the best measure of dispersion and is widely used because of the properties of the theoretical normal curve.
However, if the units of measurement of the variables in two data sets are not the same, their variability can't be compared by comparing the values of the SD.
Coefficient of variation (CV)
The standard deviation is an absolute measure of the deviation of observations around their mean and is expressed in the same units as the data.
Because of this, the standard deviation cannot be used directly to compare the variability of data sets measured in different units or with very different means.
The coefficient of variation is often used for this purpose.
The coefficient of variation (CV) is defined by:
$CV = \frac{S}{\bar{X}} \times 100\%$
The coefficient of variation is most useful for comparing the variability of several different samples, each with a different mean.
The CV is a relative measure, free from units of measurement.
When to use the coefficient of variation
When comparison groups have very different means (the CV is suitable as it expresses the standard deviation relative to its corresponding mean).
When different units of measurement are involved, e.g. group 1 is measured in mm and group 2 in gm (the CV is suitable for comparison as it is unit-free).
In such cases, the standard deviation should not be used for comparison.
It is the best measure to compare the variability of two series (sets) of observations.
Data with a smaller coefficient of variation are considered more consistent.
              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0

“Cholesterol is more variable than systolic blood pressure”
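The CV values in the table follow from dividing each SD by its mean and multiplying by 100. A small Python sketch, with an illustrative function name:

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: the standard deviation relative to the mean."""
    return 100 * sd / mean


cv_sbp = coefficient_of_variation(sd=15, mean=130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)  # cholesterol, mg/dl

print(f"CV of SBP:         {cv_sbp:.1f}%")
print(f"CV of cholesterol: {cv_cholesterol:.1f}%")

# The units cancel, so the two CVs can be compared directly.
more_variable = "Cholesterol" if cv_cholesterol > cv_sbp else "SBP"
print(f"{more_variable} is more variable relative to its mean.")
```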
Thank You !!!
