Lecture 2 Data Information Knowledge-1
Lecture 2:
Data, Information, Knowledge
Outline
• Data, information & knowledge
• Information cycle – from data to action
• Ensuring data quality
• Data processing and compilation
• Data presentation
Data, Information, Knowledge
• Data: Observations (numbers, terms)
• No meaning is attached to it, so it may have multiple meanings
• Example: what does “9” mean?
(Zins, 2007)
Data, Information, Knowledge
• Knowledge: includes facts about real world
entities and the relationship between them;
justifiable beliefs based on data and information.
• It is an understanding gained through experience
• Answers the ‘how’ question
(Zins, 2007)
Example: Data is everywhere!
• 38.5
• 45
• 150
• Female
Example: Information
• 38.5
• 38.5 degrees Celsius: a high body temperature reading
• 45
• 45-year-old female patient
• Female patient weighing 45 kilograms
• 150
• Female patient 150 cm tall
• Female patient with CD4 count: 150 cells/mm³
Example: Knowledge
• A middle-aged woman with fever and advanced Stage 3 infection (AIDS)
• A middle-aged woman with fever.
• A woman with normal Body Mass Index (BMI) and fever.
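The "normal BMI" statement combines two of the data points above (45 kg and 150 cm). A minimal sketch of that step, assuming the standard BMI formula and the commonly used adult "normal" band of 18.5 to 24.9 (the band is not stated on the slides; the function name is illustrative):

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index = weight in kg divided by height in metres, squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


value = bmi(45, 150)
# Assumed adult "normal" band: 18.5 to 24.9 (not given in the lecture)
is_normal = 18.5 <= value <= 24.9
print(round(value, 1), is_normal)
```

Two isolated numbers (45, 150) only become knowledge about the patient once they are given units and combined in this way.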
Quiz: Matching
CATEGORY
• Data
• Information
• Knowledge
RESPONSE
• 17
• Positive Drug Screen
• Positive Pregnancy Test
• 20 weeks gestation
• A high risk, obese pregnant teen in second trimester with substance use issues
• 90
Different types of Data/Variables
ID # | Village | Age (years) | Sex (M/F) | Date of Fever Onset | Acute Jaundice | Yellow Fever Vaccine | IgM+ Lab Test
1 | A | 5 | M | 30 Dec 2016 | Y | N | Y
2 | B | 11 | F | 09 Jan 2017 | Y | N | Y
3 | A | 34 | M | 12 Jan 2017 | Y | N | Y
4 | C | 73 | M | 12 Jan 2017 | Y | N | Y
5 | A | 84 | F | 13 Jan 2017 | Y | N | Y
6 | B | 16 | M | 16 Jan 2017 | U | N | Y
7 | B | 19 | F | 30 Jan 2017 | Y | N | Y
8 | A | 23 | F | 02 Feb 2017 | Y | N | Y
9 | C | 38 | F | 08 Feb 2017 | Y | N | Y
10 | B | 47 | M | 11 Feb 2017 | Y | N | Y
11 | A | 27 | F | 17 Feb 2017 | Y | N | Y
Y=Yes, N=No, U=Unknown
Process
Stage 4: Aggregate the data.
• This step consolidates the data from individual patients into groups or pools of patients.
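As an illustration of aggregation, the sketch below counts the 11 individual records from the line list above by village and by sex. The tuple layout and variable names are assumptions made for this example only.

```python
from collections import Counter

# (ID, village, age in years, sex) copied from the line list above
records = [
    (1, "A", 5, "M"), (2, "B", 11, "F"), (3, "A", 34, "M"), (4, "C", 73, "M"),
    (5, "A", 84, "F"), (6, "B", 16, "M"), (7, "B", 19, "F"), (8, "A", 23, "F"),
    (9, "C", 38, "F"), (10, "B", 47, "M"), (11, "A", 27, "F"),
]

# Aggregate: individual patients -> groups
cases_by_village = Counter(village for _, village, _, _ in records)
cases_by_sex = Counter(sex for _, _, _, sex in records)

print(dict(cases_by_village))
print(dict(cases_by_sex))
```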
[Chart: distribution by age (years)]
Individual Records to Summarize
Dataset: incubation period (in days) of 19 patients with Ebola virus disease (EVD)
Pt. Days
KP 9
JB 8
SW 11
EB 9
NG 10
PK 7
BJ 9
JH 9
RF 6
AH 2
TN 11
RT 8
LW 14
EN 9
CL 8
RD 13
KJ 8
LC 10
TB 7
Mode
Definition
value that occurs most frequently in a dataset
• Simple measure, but relatively unimportant
To identify the mode
1 Create frequency
distribution table
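A minimal sketch of the frequency-table approach, using the 19 Ebola incubation periods listed earlier. `Counter` is one convenient way to build the table; the variable names are illustrative.

```python
from collections import Counter

# Incubation period (days) of the 19 EVD patients, in the order listed
days = [9, 8, 11, 9, 10, 7, 9, 9, 6, 2, 11, 8, 14, 9, 8, 13, 8, 10, 7]

# Step 1: frequency distribution table (value -> number of patients)
freq = Counter(days)
for value in sorted(freq):
    print(value, freq[value])

# The mode is the value with the highest frequency
mode_value, mode_count = freq.most_common(1)[0]
print("Mode =", mode_value)
```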
3 Median is value at 10th position = 9
Median: Example
Ebola Incubation Period (n=20)
Added 20th patient, so now, even number of values (n = 20)
Obs | Pt. | Days | Days (sorted)
1 | KP | 9 | 2
2 | JB | 8 | 6
3 | SW | 11 | 7
4 | EB | 9 | 7
5 | NG | 10 | 8
6 | PK | 7 | 8
7 | BJ | 9 | 8
8 | JH | 9 | 8
9 | RF | 6 | 9
10 | AH | 2 | 9
11 | TN | 11 | 9
12 | RT | 8 | 9
13 | LW | 14 | 9
14 | EN | 9 | 10
15 | CL | 8 | 10
16 | RD | 13 | 11
17 | KJ | 8 | 11
18 | LC | 10 | 13
19 | TB | 7 | 14
20 | YY | 21 | 21
1 Sort
2 Find middle position
(20 + 1) / 2 = 10.5
3 Median is value midway between 10th and 11th position = (9+9)/2 = 9
Median = 9
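The three steps (sort, find the middle position, read off the value) can be sketched as follows. The function name `median_by_position` is hypothetical; it mirrors the slide's procedure rather than any particular library.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 middle-position rule."""
    ordered = sorted(values)              # Step 1: sort
    n = len(ordered)
    middle = (n + 1) / 2                  # Step 2: 1-based middle position
    if n % 2 == 1:
        return ordered[int(middle) - 1]   # odd n: value at the middle position
    lower = ordered[int(middle) - 1]      # even n: e.g. 10th value when middle = 10.5
    upper = ordered[int(middle)]          # ... and 11th value
    return (lower + upper) / 2            # Step 3: midway between the two


days_19 = [9, 8, 11, 9, 10, 7, 9, 9, 6, 2, 11, 8, 14, 9, 8, 13, 8, 10, 7]
days_20 = days_19 + [21]                  # 20th patient (YY) added

print(median_by_position(days_19))
print(median_by_position(days_20))
```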
Median: Properties and Uses
• Good descriptive measure for center of data
• Not affected by an extreme value (“outlier”)
• Measure of choice for asymmetrical (“skewed”)
distribution
Mean: Example
Ebola Incubation Period (n=20)
Divide sum by number of observations (n)
n = 20
Mean is 189 / 20 = 9.45 ≈ 9.5 days
Mean: Properties and Uses
• Best known measure of central location
• Uses all the data
• Affected by extreme values (outliers)
• Best for symmetrically distributed data
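To see the effect of an extreme value, the sketch below compares mean and median for the Ebola incubation data before and after the 20th patient (21 days) is added. It uses Python's standard `statistics` module; variable names are illustrative.

```python
from statistics import mean, median

days_19 = [9, 8, 11, 9, 10, 7, 9, 9, 6, 2, 11, 8, 14, 9, 8, 13, 8, 10, 7]
days_20 = days_19 + [21]  # one extreme value added

for label, values in (("n=19", days_19), ("n=20", days_20)):
    # The mean uses every value, so the outlier pulls it upward;
    # the median depends only on the middle position.
    print(label, "sum =", sum(values),
          "mean =", round(mean(values), 2),
          "median =", median(values))
```

The mean moves from about 8.8 to about 9.5 days, while the median stays at 9.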
Spread
[Chart: number of cases by age (years)]
Range
Definition (Epidemiologic)
Description of smallest to largest value
Measure of spread
1 Sort data or create frequency distribution
2 Find minimum and maximum values
Range: Example
Ebola Incubation Period (n=19)
Obs | Days
1 | 2 (Minimum value = 2)
2 | 6
3 | 7
4 | 7
5 | 8
6 | 8
7 | 8
8 | 8
9 | 9
10 | 9
11 | 9
12 | 9
13 | 9
14 | 10
15 | 10
16 | 11
17 | 11
18 | 13
19 | 14 (Maximum value = 14)
Range = 2 – 14
Summarizing Quantitative Data:
Example: Ebola Incubation Period (n=19)
Ebola incubation period (days)
Pt. | Days
KP | 9
JB | 8
SW | 11
EB | 9
NG | 10
PK | 7
BJ | 9
JH | 9
RF | 6
AH | 2
TN | 11
RT | 8
LW | 14
EN | 9
CL | 8
RD | 13
KJ | 8
LC | 10
TB | 7
Mode = 9
Median = 9
Mean = 8.8
Range = 2 – 14
For quantitative epidemiologic data, recommend summary with median and range.
Summary of incubation period: Median (range) = 9 (2 – 14) days
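The summary figures for the 19 patients can be reproduced with a short sketch such as the one below, which also formats the recommended "median (range)" line. It relies on the standard `statistics` module; the variable names are assumptions.

```python
import statistics

days = [9, 8, 11, 9, 10, 7, 9, 9, 6, 2, 11, 8, 14, 9, 8, 13, 8, 10, 7]

mode = statistics.mode(days)        # most frequent value
median = statistics.median(days)    # middle value of the sorted data
mean = statistics.mean(days)        # sum divided by n
low, high = min(days), max(days)    # range, reported as smallest to largest

print(f"Mode = {mode}, Median = {median}, Mean = {mean:.1f}")
summary = f"Median (range) = {median} ({low} – {high}) days"
print(summary)
```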
Measures of Central Location:
Summary
• Measure of central location — single measure that
represents an entire distribution
• Mean — average value
• Mean uses all data; sensitive to outliers
• Mean preferred for symmetrical data; not common in
epidemiology
• Median — central value
• Safer choice for most epidemiologic data
• Mode — most common value
• Use median or mean with range
Exercise
• Review the data set of confirmed cases of Middle East respiratory syndrome coronavirus (MERS-CoV) infection
ID No. | Age (years) | Sex | City of residence | Date of symptoms onset (dd/mm/yy) | Exposure to camels | Exposure to MERS-CoV cases | Status | Date of outcome (dd/mm/yy) | Days to death | Date of notification to WHO (dd/mm/yy) | Days from onset to notification
1 | 49 | M | Unizah | 24-Oct-17 | Yes | Unknown | Deceased | 6-Nov-17 | 13 | 31-Oct-17 | 7
2 | 60 | M | Riyadh | 25-Oct-17 | Yes | Unknown | Alive | | | 1-Nov-17 | 7
3 | 42 | F | Riyadh | 25-Oct-17 | Unknown | Unknown | Alive | | | 2-Nov-17 | 8
4 | 65 | M | Riyadh | 25-Oct-17 | Unknown | Unknown | Alive | | | 5-Nov-17 | 11
5 | 64 | M | Riyadh | 29-Oct-17 | Unknown | Unknown | Alive | | | 5-Nov-17 | 7
6 | 49 | M | Riyadh | 1-Nov-17 | Unknown | Unknown | Alive | | | 6-Nov-17 | 5
7 | 51 | M | Afif | 9-Nov-17 | Yes | Unknown | Alive | | | 13-Nov-17 | 4
8 | 75 | F | Unizah | 9-Nov-17 | Unknown | Unknown | Deceased | 18-Nov-17 | 9 | 13-Nov-17 | 4
9 | 69 | M | Zulfi | 12-Nov-17 | Unknown | Unknown | Alive | | | 15-Nov-17 | 3
10 | 77 | F | Buridah | 9-Nov-17 | Unknown | Unknown | Deceased | 18-Nov-17 | 9 | 18-Nov-17 | 9
11 | 63 | M | Bisha | 15-Nov-17 | Yes | Unknown | Alive | | | 20-Nov-17 | 5
12 | 64 | F | Alasyah | 21-Nov-17 | Yes | Unknown | Deceased | 24-Nov-17 | 3 | 24-Nov-17 | 3
13 | 15 | M | Riyadh | 23-Nov-17 | Unknown | Unknown | Deceased | 3-Dec-17 | 10 | 28-Nov-17 | 5
14 | 13 | M | Riyadh | Unknown | Unknown | Yes | Alive | | | 28-Nov-17 | NC
15 | 67 | F | Bisha | 18-Nov-17 | Unknown | Unknown | Alive | | | 29-Nov-17 | 11
16 | 71 | M | Buridah | 25-Nov-17 | Unknown | Unknown | Alive | | | 29-Nov-17 | 4
17 | 64 | M | Riyadh | 30-Nov-17 | Unknown | Unknown | Alive | | | 4-Dec-17 | 4
18 | 90 | M | Riyadh | 27-Nov-17 | Unknown | Unknown | Alive | | | 8-Dec-17 | 11
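The derived columns in this data set (days to death, days from onset to notification) are differences between two dates. A minimal sketch of that calculation for three of the cases follows; it assumes that "NC" stands for "not calculable" and that month abbreviations are parsed in an English locale. The helper names are hypothetical.

```python
from datetime import datetime


def parse_date(text):
    """Parse dates written like 24-Oct-17; return None for 'Unknown'."""
    if text == "Unknown":
        return None
    return datetime.strptime(text, "%d-%b-%y").date()


def days_between(start, end):
    """Whole days from start to end, or 'NC' if either date is missing."""
    first, last = parse_date(start), parse_date(end)
    if first is None or last is None:
        return "NC"  # assumed meaning: not calculable
    return (last - first).days


# (ID, date of symptom onset, date of notification to WHO)
rows = [
    (1, "24-Oct-17", "31-Oct-17"),
    (14, "Unknown", "28-Nov-17"),
    (18, "27-Nov-17", "8-Dec-17"),
]
for case_id, onset, notified in rows:
    print(case_id, days_between(onset, notified))

# Case 1 died on 6-Nov-17, so days to death is counted from onset
days_to_death = days_between("24-Oct-17", "6-Nov-17")
print(days_to_death)
```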
Qualitative Variables
Type of Data
• Descriptions
• Non-numeric data
Examples
• Ill? (yes/no)
• Sex
• District
Summarize with
Measures of frequency
• Counts
• Ratios
• Proportions
• Rates
Counts: Global Number of Deaths* by
Selected Causes, 2000 and 2015
2000 2015
All causes 52,135 56,441
Ischemic heart disease 6,883 8,756
Stroke 5,407 6,241
Lower respiratory infections 3,408 3,190
Chronic obstructive pulm. disease 2,953 3,170
Trachea, bronchus, lung cancers 1,255 1,695
Diabetes mellitus 958 1,586
Diarrheal disease 2,177 1,389
Tuberculosis 1,667 1,373
Road injury 1,118 1,342
Cirrhosis of the liver 905 1,162
Kidney disease 709 1,129
HIV/AIDS 1,463 1,060
* x 1,000
Source: WHO. Global Health Observatory. Top 10 causes of death. 2017
Counts: Properties and Uses
Ratio
Definition
Comparison of any two values
Ratio = (x / y) × k
Where: x = numerator
y = denominator
k = constant (1, 100, 1000, etc.)
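As a small worked example of the ratio formula, the sketch below compares global deaths from ischemic heart disease in 2015 with those in 2000, using the counts table shown earlier (values in thousands). The function name is illustrative.

```python
def ratio(x, y, k=1):
    """Ratio = (x / y) * k, where x is the numerator, y the denominator, k a constant."""
    return (x / y) * k


# Ischemic heart disease deaths (x 1,000): 8,756 in 2015 versus 6,883 in 2000
ihd_2015_to_2000 = ratio(8756, 6883)
print(round(ihd_2015_to_2000, 2))  # about 1.27 deaths in 2015 for every death in 2000
```

With k = 1 the result reads as "x per one y"; a larger constant such as 100 or 1,000 only rescales the same comparison.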
Proportion
Definition
Comparison of a part to the whole
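A proportion is a ratio in which the numerator is part of the denominator. The sketch below illustrates this with the 2015 counts from the deaths table shown earlier (values in thousands); the function name is an assumption for this example.

```python
def proportion(part, whole, k=100):
    """Part divided by whole, times a constant (k = 100 gives a percentage)."""
    if part > whole:
        raise ValueError("In a proportion, the part cannot exceed the whole")
    return part / whole * k


# 2015: 8,756 ischemic heart disease deaths out of 56,441 deaths from all causes (x 1,000)
ihd_share_2015 = proportion(8756, 56441)
print(round(ihd_share_2015, 1), "% of all deaths")
```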
Rates
• Incidence
• Prevalence
• Attack rate
• Case-fatality rate
• Mortality rate
• Other rates
Incidence versus Prevalence
Numerator
Incidence — New cases
Incidence rate = 24 cases / (300,000 pop × 1 year) × 100,000 = 8.0 per 100,000 population per year
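The incidence calculation above can be sketched as a small function. The name `incidence_rate` and its parameters are illustrative, not taken from the lecture.

```python
def incidence_rate(new_cases, population, years, k=100_000):
    """New cases divided by (population x time), expressed per k persons per year."""
    return new_cases / (population * years) * k


rate = incidence_rate(new_cases=24, population=300_000, years=1)
print(round(rate, 1), "new cases per 100,000 population per year")
```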
Attack Rates: Practice 2
Acute watery diarrhea cases by age and sex, Village X, January, 20xx
Age (years) | Males: Cases | Males: AR (per 1,000) | Females: Cases | Females: AR (per 1,000) | Total: Cases | Total: AR (per 1,000)
<1 | 9 | 11.3 | 17 | 20.0 | 26 | 15.8
1 – 14 | 152 | 16.5 | 107 | 11.7 | 259 | 14.1
15 – 29 | 44 | 8.0 | 51 | 8.5 | 95 | 8.3
30 – 49 | 17 | 2.7 | 24 | 3.6 | 41 | 3.2
≥ 50 | 8 | 2.7 | 10 | 2.2 | 18 | 2.4
Total | 230 | 9.3 | 209 | 7.7 | 439 | 8.4
Q1. Which age group had the most cases? 1-14 year olds
Q2. Which age group had the greatest risk of illness? < 1 year olds
Q3. Calculate attack rate (risk) for 1-14 year olds, per 1,000 population.
259 / 18,350 x 1,000 = 0.0141 x 1,000 = 14.1 cases / 1,000 population
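The attack rate for 1-14 year olds can be checked with a short sketch. It uses the case count (259) and the population figure (18,350) quoted in Q3; the function name is illustrative.

```python
def attack_rate(cases, population, k=1_000):
    """Cases divided by population at risk, expressed per k persons."""
    return cases / population * k


ar_1_to_14 = attack_rate(cases=259, population=18_350)
print(round(ar_1_to_14, 1), "cases per 1,000 population")
```

Note that the age group with the most cases (1-14 years) is not the group with the highest attack rate; the rate accounts for the size of each group's population.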
Prevalence
Definition — Prevalence of disease
Frequency of existing cases (new cases plus old cases
that are still active) of a disease in a population at a
point or over a period of time
Definition — Prevalence of an attribute
Frequency of persons with a particular attribute in a
population at a point or over a period of time
Formula
Numerator: number of current cases or persons with attribute
Denominator: size of population
Constant: usually 100 (%) or 1,000
Prevalence: Examples
Prevalence = (Number of persons living with HIV in Province X in 2018) / (Province X population on 1 July 2018)
Types
• Death rate – refers to entire population
• Disease-specific (Cause-specific) death rate
• Age-specific death rate
• Maternal mortality rate
• Many others
Death Rate: Practice
Counts: Number of cases
Ratio: Comparison of any two numbers
Proportion: Part of the whole
Rate: Number of cases divided by population
• Incidence rate: New cases, any time interval, need to take time into account
• Attack rate: New cases, short time interval
• Prevalence rate: Current cases in population regardless of time of onset
• Case-fatality rate: Proportion of cases that died
Summary
• For qualitative variables, summarize with ratios,
proportions, and rates
• For quantitative variables, summarize with
mode, median, mean, and range
• For epidemiologic data, use median and range
• Key rates:
• Incidence: rate of new cases in population
• Prevalence: rate of new + old (existing) cases in population
• Attack rate of disease: during outbreak
• Death rate: mortality accounting for population
size
• Case-fatality: deaths among cases
Exercise: Scenario
• Four years ago, 787 women aged 40–65 years who received
primary health care at a particular clinic were enrolled into a
blood pressure (BP) study. None had been previously
diagnosed with high blood pressure. Qualified clinicians
measured the BP of each woman, and hypertension was
defined as any person with one diastolic BP measurement of
>95 mm Hg. Each woman diagnosed with hypertension was
treated with antihypertensive drugs.
• Among the 787 women, 37 were diagnosed with hypertension
on Day 1 of the study. After exactly one year, an additional 43
women were diagnosed with new onset of hypertension. In the
subsequent 3 years, 54 additional women were diagnosed with
hypertension.
• Among the 787 enrollees, six died during the study period,
including five of those with hypertension.
Scenario (continued)
• Question 1: What proportion of women in the cohort were
newly diagnosed with hypertension on Day 1?
• Question 2: What was the prevalence of hypertension among
this cohort of women at the end of the first year of this study?
• Question 3: What was the incidence of hypertension per year
during the study period?
• Question 4: What was the annual death rate among all 787
women during the study period?
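One possible way to set up the four calculations is sketched below. It is a reading of the scenario under stated assumptions, not an official answer key: the study is taken to last 4 years, women diagnosed on Day 1 are excluded from the population at risk for new hypertension, treated women still count as prevalent cases, and deaths and losses are ignored in the denominators.

```python
enrolled = 787
cases_day1 = 37       # diagnosed on Day 1
new_year1 = 43        # new onset by the end of year 1
new_years2_4 = 54     # new onset in the subsequent 3 years
deaths = 6
study_years = 4       # assumption: 1 year + 3 subsequent years

# Q1: proportion diagnosed on Day 1 (%)
q1 = cases_day1 / enrolled * 100

# Q2: prevalence at the end of year 1 (%), counting Day 1 and year-1 cases
q2 = (cases_day1 + new_year1) / enrolled * 100

# Q3: incidence per 100 women per year among those free of hypertension on Day 1
at_risk = enrolled - cases_day1
q3 = (new_year1 + new_years2_4) / at_risk / study_years * 100

# Q4: annual death rate per 1,000 women
q4 = deaths / enrolled / study_years * 1000

print(round(q1, 1), round(q2, 1), round(q3, 1), round(q4, 1))
```

Different denominators (for example, person-years that remove women after diagnosis or death) would give slightly different answers to Questions 3 and 4.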
Present
Stage 6: Report the data.
• Reporting is integral to healthcare quality improvement.
• What do you want to communicate?
• Different information products for different data &
meanings
• Tools and methods for organizing data into
information:
• Graphs: histogram, line diagram, scatter plot, bar chart, pie chart, population pyramid
• Tables: frequency distribution
• Maps: geographical presentation
Presenting data in graphs
[Chart: Monthly Clinical Diagnostics at MZUNI Health Center (Jan-April 2015); series: Malaria, Diarrhea, Pneumonia; x-axis: Jan, Feb, March, April]
GRAPHS
(a visual representation of data)
Advantages:
• Information is instantly conveyed
• Data presented clearly and simply
• Can expose relationships and patterns
• Detect trends over time
• Can be used to emphasise information
[Chart: PHC Headcount by month (Jan-Jun), annotated with the parts of a graph]
X axis – label if appropriate
Key or legend – used if more than one element graphed
Source: Notes:
[Chart: monthly and cumulative immunisation coverage (%), Jan-Dec, with target line]
immunization, antenatal coverage, etc.
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
Monthly Immunisation | 4 | 5.3 | 6.2 | 3.8 | 5.6 | 7.3 | 6.8 | 7 | 5.9 | 6.7 | 7.5 | 5.8
Cumulative Immunisation | 4 | 9.3 | 15.5 | 19.3 | 24.9 | 32.2 | 39 | 46 | 51.9 | 58.6 | 66.1 | 71.9
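The cumulative series is the running total of the monthly values. A minimal sketch of that calculation, using the monthly immunisation figures above (variable names are illustrative):

```python
from itertools import accumulate

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
monthly = [4, 5.3, 6.2, 3.8, 5.6, 7.3, 6.8, 7, 5.9, 6.7, 7.5, 5.8]

# Running total, rounded to one decimal to hide floating-point noise
cumulative = [round(total, 1) for total in accumulate(monthly)]

for month, value, total in zip(months, monthly, cumulative):
    print(month, value, total)
```

Plotting the cumulative line against a target line shows at a glance whether coverage is on track during the year.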
[Chart: bar graph of numbers by month (Jan-Jun); series: PHC Headcount under 5 years, PHC Headcount 5 years and over]
Displays data over time, or can compare 2 or more different facilities / districts / regions / years
* Slide from UiO Course:
INF5761/INF9761
Bar graph, stacked
Clinic Alpha: Attendance 2001
[Chart: stacked bar graph of numbers by month (Jan-Jun)]
Displays the quantities, and also shows the relative proportions of the categories to each other and to the whole
BUT hard to estimate the value of the variables at the top
Common faults with graphs
No title
No labels for the variables
No units of measurement (or incorrect units!)
No scale markings (or just too many!)
Inappropriate scale choice – data points should be evenly represented
Incorrect choice of independent (x-axis) and dependent (y-axis) variables
No legends when needed
Too high ink-to-data ratio (e.g. 3D graphs)
Don’t trust the computer!
BAD GRAPHS!
Data quality bias?
1st Dose VS Population <1yr
Take action: Underweight children
Public campaign:
“You must weigh your child every month to make sure s/he grows properly”