1data Transcript Slides
Introduction to Course
This hour-long course provides public health professionals with
an introduction to data interpretation. The examples and inter-
active exercises in this module offer opportunities to increase
your skills in presenting data to co-workers, community-based
organizations, hospitals, public agencies, boards of health, and
the general public. When available, Idaho-specific examples
have been used. When not available, U.S. national examples are
used.
Objectives
By the end of this module you should be able to:
1) List at least three common data sources used to charac-
terize the health or disease status of a community
2) Define and interpret basic epidemiology measures such
as prevalence, incidence, mortality, and case-fatality
3) Define and interpret basic biostatistical measures such
as the mean, median, mode, confidence interval, and
p-value
4) Read and interpret tables and graphs, and
5) Determine the appropriate format for data presentation
Data Sources
Useful data for public health may come from national, state, and
local sources, and are not limited to what you might typically
identify as “public health data sources.”
Sources include, but are not limited to: surveillance data,
health-related surveys, administrative sources, vital statistics,
outbreak investigations, research, and the US Census.
Data Attributes
To be useful, data should be of good quality. When we use high
quality data, we are more likely to trust conclusions we draw
from those data and act on our results to make improvements in
whatever health condition is being studied.
The quality of data depends on several factors, including
accuracy and completeness. For example, in the case of surveil-
lance for notifiable conditions reporting, we hope that a reported
diagnosis in our data correctly classifies the true diagnosis of an
individual. This is called accuracy. As for completeness, if we get
incomplete data, we might not have a good idea of what is truly
going on specific to the health of our community. Even if our
data are not 100% accurate and complete, the data will usually still be useful if we have an idea of how
accurate and how complete the data are. Having some data is frequently better than having none at all.
When we use data to answer questions about the health of our communities, the data should be
relevant to the populations and health conditions we are interested in, and the data should arrive in
a timely enough fashion so that appropriate or necessary actions can be taken for control of a health-
related event.
Data Limitations
But all data sets have limitations. Though we won’t discuss
them in detail, some limitations of public health data we might
encounter include: inaccurate diagnoses or coding, poorly
conducted data collection, data entry or data analysis, and issues
that result in data not being representative of the population
we’d like to draw conclusions about.
For those of us working in public health practice, it is good
to know that reliable, published public health data sources are
available at the national, state and frequently at the local level.
Understanding the strengths and limitations of the data we work
with is important. If you have questions about the quality of a
data source you’re interested in using, data experts within the state Departments of Health can often
provide additional expertise and analysis.
Exercise on Proportions
Rates allow us to make comparisons between groups of people, such as different age groups, or
locations that have different population sizes, such as states versus cities or urban versus rural areas.
Rates also allow us to make comparisons within the same population over time.
Rates are useful in many ways. With rates, the health department can identify groups in the
community with an elevated risk of disease. With this information, risk factors can be examined and
interventions targeted to high-risk groups.
Using Rates
In using rates to make comparisons, we need to account for the
fact that the number of health events depends in part on the
number of people in the community. For instance, we expect
to find more cases in larger populations. To account for growth
in a community or to compare communities of different sizes,
we usually calculate rates to provide the number of events per
population unit.
For example, in looking at this table of surveillance data,
if you have 1000 people in your community and 20 cases
of a disease, that could be a lot more significant than if you
have 1 million people in your city and 20 cases of a disease.
Furthermore, if all 20 cases in either population occur within only one age group, race, or gender, this
might influence your decision to investigate further.
When we divide the numerator by the denominator in each city, you can see that the rate in city A
is much greater than the rate in city B: 1,000 times greater. Frequently, we will take these rates, represented
as decimals, and multiply them by a power of 10 in order to convert them to whole numbers.
Here, this is done by multiplying both crude rates by 100,000, and the result is in the far right column.
As long as you multiply the rates for cities A and B by the same power of 10, you will still have a
valid comparison. Also, this far right column makes the most sense in terms of interpretation of a rate.
For example, in city B, we can say that the rate of disease in this population was 2/100,000, or for
every 100,000 people in the population, two cases of disease were identified during the time period
of interest.
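The arithmetic for the two cities can be checked with a short sketch. The function name `rate_per` and the variable names are illustrative assumptions; the case counts and population sizes are the ones given in the example above.

```python
def rate_per(events, population, per=100_000):
    """Return the number of events per `per` people in the population."""
    return events / population * per

# Counts from the example: 20 cases in each city.
city_a = rate_per(20, 1_000)        # city of 1,000 people
city_b = rate_per(20, 1_000_000)    # city of 1 million people

print(f"City A: {city_a:,.0f} per 100,000")
print(f"City B: {city_b:,.0f} per 100,000")
print(f"City A rate is {city_a / city_b:,.0f} times the City B rate")
```

Because both rates use the same multiplier, the ratio between them is the same as the ratio between the raw decimals.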
Crude Rate
In the following slides we will introduce common rates that are
used in public health. Each serves a different purpose.
Let’s begin with crude rates: A crude rate is the rate calcu-
lated for the total population. Crude rates are recommended
when a summary measure is needed and it is not necessary or
desirable to take into account any other factors, such as the age
of the population.
A crude rate is calculated by dividing the total number of
events in a specified time period by the total number of individ-
uals in the population who are at risk for these events and
multiplying by a constant, such as 1,000 or 100,000 [in other
Category-Specific Rates
Category-specific rates are rates measured in a specific group,
such as the rate of disease within one gender, ethnicity or age
group. When the rate applies to a specific age group, such as
those between the ages of 15–24, it is called the age-specific
rate. Category-specific rates are used for comparisons when rates
differ widely between groups. For example, if we know that the
highest injury rates from unintentional falls are for children under
14 years and for people over 65 years, we might like to calculate
injury rates in these age groups specifically instead of in the total
population.
Category-specific rates are recommended when specific
causal or protective factors are different for different subgroups. They present the actual magnitude of
an event within a designated group.
Age-Adjusted Rates
Almost all diseases or health outcomes occur at different rates
in different age groups. Many chronic diseases, including
most cancers, occur more often among older people. Other
outcomes, such as many types of injuries, may occur more often
among younger people than in middle aged people. Therefore,
the age distribution of a given population often determines what
the most common health problems in a community will be.
What I mean by this is that if a certain population mostly consists
of older people, the burden of disease from cancer in that
community will likely be greater than it would be in a community
that mostly consists of younger people.
By convention, rates are frequently adjusted to the age distribution of the estimated U.S. popula-
tion of the year 2000, commonly referred to as the standard population.
To summarize, age-adjusted rates are recommended when making comparisons in the rates of
age-related health events between different populations or for comparing trends in a given popula-
tion over time. And age-adjusted rates are essential for events that vary with age (for example, cancer
deaths), or when comparing populations with different age distributions. However, because age-
adjusted rates are often calculated using a standard population, these rates can mask important trends,
so it is also important to look at crude rates, as well as category-specific rates. Finally, age-adjustment
requires training and a knowledge of biostatistics.
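One common way to age-adjust is direct standardization: each age-specific rate is weighted by that age group's share of the standard population, and the weighted rates are summed. The sketch below illustrates the idea only. The community counts and the standard-population weights are hypothetical values chosen for illustration; they are not the year 2000 U.S. standard population figures, and the function names are assumptions.

```python
# Hypothetical community data: age group -> (events, population).
community = {
    "0-24": (5, 40_000),
    "25-64": (60, 50_000),
    "65+": (135, 10_000),
}

# Hypothetical standard-population shares (must sum to 1).
# These are NOT the actual year 2000 U.S. standard weights.
standard_weights = {"0-24": 0.35, "25-64": 0.52, "65+": 0.13}

def crude_rate(groups, per=100_000):
    """Total events divided by total population, times a constant."""
    events = sum(e for e, _ in groups.values())
    people = sum(p for _, p in groups.values())
    return events / people * per

def age_adjusted_rate(groups, weights, per=100_000):
    """Direct standardization: weighted sum of age-specific rates."""
    return sum(weights[g] * (e / p) for g, (e, p) in groups.items()) * per

crude = crude_rate(community)
adjusted = age_adjusted_rate(community, standard_weights)
print(f"Crude rate:        {crude:.1f} per 100,000")
print(f"Age-adjusted rate: {adjusted:.1f} per 100,000")
```

The two numbers differ because the hypothetical community has a different age mix than the hypothetical standard, which is why adjusted rates should be compared only when they use the same standard population.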
• Consistency over time. For example, you probably don’t want to compare the rate of
diabetes in Idaho in 2006 with the rate of diabetes in the U.S. in 1986. The exception is,
of course, when you are purposefully comparing events between two time periods within
the same population.
When comparing age-specific rates, if the age categories are relatively large, such as 15 to 29, the
rates may be distorted, because this includes such a large group of people. Looking at smaller group-
ings (such as 15-19, or 20-24) might give you a better idea of what is going on in your population of
interest. In the case of comparing age-adjusted rates, be sure to compare only rates that have been
adjusted to the same “standard” population.
Prevalence
We are now going to turn our attention to specific types of
measures frequently used in public health to evaluate the
burden of disease in our communities. The first is prevalence.
Prevalence measures the number of cases (including both
new and old cases) of a disease (or health-related condition or
event) at a specific time point or period in time. Note that if
prevalence is measured for a period of time, say three months,
rather than at a point in time, the population denominator
should represent the average population during that period.
Some important things to remember about prevalence are that:
• It is used to present a ‘snapshot’ view of the disease or health condition of interest.
• We frequently obtain prevalence data from surveys or surveillance databases, and that,
• Prevalence is a proportion (or percentage).
When we calculate prevalence, the numerator includes existing cases of a disease at a specified
time and the denominator includes people in the defined population at that time. Because prevalence
describes the burden of illness in a population, public health professionals use prevalence to assess
the effect of a health event or disease on the resource needs of both the public health system and the
health care delivery system.
Calculating Prevalence
Now let’s work through an example of how prevalence is
calculated.
In 2003, Idaho had a household population of 1,366,322. Of
these people, 11 percent were 65 and older.
In 2003, Idaho reported 21,000 individuals 65 and older with
diabetes.
In this example, the prevalence of diabetes among Idahoans
ages 65 and older is calculated by dividing the number of individuals 65 years and older with diabetes
by the household population who were 65 years and older.
Here we see that calculation [21,000 / 150,295 = 0.140]. We multiply by 100 to report the prevalence
as a percentage, and we see that in 2003, the prevalence of diabetes in Idaho residents 65 years of age
and older was 14%.
We can also express the prevalence as 140 cases per 1,000 Idaho residents 65 years of age and
older. We arrive at this value by multiplying 0.14 by 1,000.
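The same prevalence calculation can be written as a short sketch. The variable names are illustrative assumptions; the figures are the ones quoted above, with the 65-and-older population obtained by rounding 11 percent of the household population to a whole number.

```python
household_population = 1_366_322   # Idaho household population, 2003
share_65_plus = 0.11               # 11 percent were 65 and older
cases_65_plus = 21_000             # people 65 and older with diabetes

# Denominator: people 65 and older (rounded to a whole number of people).
population_65_plus = round(household_population * share_65_plus)

prevalence = cases_65_plus / population_65_plus

print(f"Population 65 and older: {population_65_plus:,}")
print(f"Prevalence: {prevalence:.3f}")
print(f"As a percentage: {prevalence * 100:.0f}%")
print(f"Per 1,000 residents: {prevalence * 1000:.0f}")
```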
Exercise on Prevalence
Incidence (Rate)
Another measure of burden of disease is incidence. Incidence
is defined as the number of new cases of a condition during a
defined time interval divided by the number of persons at risk of
developing the condition over that time interval.
Incidence rates provide a direct measure of the rate at which
new illness occurs in the population, and therefore incidence
rates—and the incident cases from which we derive them—can
be used to study the causes of health events. Incidence rates are
commonly expressed as the number of cases of disease or injury
per 100,000 person-years of exposure to the risk.
Now let’s look at an example of incidence.
Calculating Incidence
In 2004, 2784 new cases of chlamydia were reported in Idaho
State.
The at-risk population in Idaho in 2004 was 1,393,262.
Incidence is equal to the new cases divided by the at-risk
population [2784 / 1,393,262 = 0.001998].
In 2004, the incidence of chlamydia in Idaho State was
0.1998%.
We can also express this as 199.8 cases per 100,000
population.
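As a quick check of the numbers above, here is the incidence calculation as a minimal sketch. The variable names are illustrative assumptions; the counts are those reported in the example.

```python
new_cases = 2784              # new chlamydia cases reported in 2004
at_risk_population = 1_393_262

incidence = new_cases / at_risk_population

print(f"Incidence (proportion): {incidence:.6f}")
print(f"Incidence (percent):    {incidence * 100:.4f}%")
print(f"Per 100,000 population: {incidence * 100_000:.1f}")
```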
Exercise on Incidence
Mortality Rates
A mortality rate is a specific type of incidence rate. Mortality rates are used to describe the incidence
of death in a population, rather than the incidence of disease. Mortality rates are frequently referred
to as death rates. Mortality rates are calculated by dividing the number of deaths in the population
during a stated time period by the number of persons at risk of dying during that period.
age group by the total number of women in Idaho in that age group in 2003 and then multiplying
that value by 100,000. The breast cancer mortality rate for women ages 45-64 was 34 per 100,000.
Among women ages 65 and older, the breast cancer mortality rate was 108 per 100,000.
Case Fatality
Another type of measure we sometimes use in public health
is the case-fatality. Case-fatality is calculated by dividing the
number of deaths from a condition during a stated time period
by the number of persons with the condition of interest. We call
this case-fatality because in the denominator we’re referring to
those with the condition as cases.
The case-fatality provides us with a measure of the severity of
the condition of interest. For example, among people older than
age 70 with West Nile Virus meningoencephalitis, the case-fatal-
ity was 21% in the U.S. in the year 2002.
Now let’s work through an example.
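A minimal sketch of the case-fatality calculation follows. The counts used here (12 deaths among 80 cases) are hypothetical and do not come from any data source mentioned in this course; the function name is also an assumption.

```python
def case_fatality(deaths, cases):
    """Deaths from a condition divided by the number of people with it."""
    if cases <= 0:
        raise ValueError("cases must be a positive number")
    if not 0 <= deaths <= cases:
        raise ValueError("deaths must be between 0 and the number of cases")
    return deaths / cases

# Hypothetical example: 80 people with the condition, 12 of whom died.
cf = case_fatality(deaths=12, cases=80)
print(f"Case-fatality: {cf * 100:.0f}%")
```

Note that the denominator is the number of cases, not the whole population, which is what separates case-fatality from a mortality rate.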
Mean
The mean is the average value of a set of data. We calculate
it by adding two or more quantities together and dividing by
the number of quantities. For example, if we wanted to know
the mean of two numbers, 6 and 7, we would add these two
numbers and then divide by 2 (which is the number of quanti-
ties we have).
The mean is a popular statistical measure because: it is
familiar to most people; it provides useful summary informa-
tion about our data, and it is easily used with other statistical
measurements. The major disadvantage to the use of the statisti-
cal mean is that it can be affected by extreme values in the data
set and therefore can be biased.
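Here is a small sketch of the mean using Python's standard `statistics` module. The first list is the 6-and-7 example from above; the second list is a hypothetical set of values chosen only to show how a single extreme value pulls the mean upward.

```python
from statistics import mean

# The two-number example from the text.
simple_mean = mean([6, 7])

# Hypothetical values: four small numbers and one extreme value.
values_with_outlier = [2, 3, 3, 4, 48]
skewed_mean = mean(values_with_outlier)

print(f"Mean of 6 and 7: {simple_mean}")
print(f"Mean with an extreme value: {skewed_mean}")
```

In the second list, four of the five values are 4 or less, yet the mean is 12 because of the single value of 48.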
Median
The median is the midpoint, or “middle value,” in a series of
numbers arranged in order from small to large. Half the data
values are above the median, and half are below. For example,
in 2005 the median age of death in the U.S. was 75, meaning
that half the people who died were older than 75 and half were
younger.
If the list has an odd number of entries, the median is the
middle entry in the list. If the list has an even number of entries,
the median is equal to the sum of the two middle numbers
divided by two.
The median, unlike the mean, is not affected by extreme data
values.
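The odd and even cases can be sketched with the standard `statistics` module. All of the lists below are hypothetical values chosen for illustration.

```python
from statistics import median

odd_list = [1, 3, 5, 7, 9]       # odd number of entries
even_list = [1, 3, 5, 7]         # even number of entries
with_extreme = [1, 3, 5, 7, 900] # same as odd_list, but with an extreme value

median_odd = median(odd_list)          # the middle entry
median_even = median(even_list)        # average of the two middle entries
median_extreme = median(with_extreme)  # unchanged by the extreme value

print(median_odd, median_even, median_extreme)
```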
Mode
The final summary measure we will discuss is the mode.
In a list of numbers, the mode is the number that occurs most
often, assuming at least one number, or data point, occurs more
than once.
For example, in the following set of data [1, 3, 5, 5, 7, 9], 5 is
the mode because it is the most frequently occurring value.
Some data are unimodal, meaning they have only one
mode; some data are bimodal, meaning they have two modes.
In this set of data [1, 3, 3, 5, 5, 7, 9], both 3 and 5 are the
modes.
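The two data sets above can be checked with a short sketch. The helper name `modes` is an assumption; it returns every value tied for most frequent, and returns an empty list when no value occurs more than once, matching the condition stated above.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

unimodal = modes([1, 3, 5, 5, 7, 9])
bimodal = modes([1, 3, 3, 5, 5, 7, 9])

print(unimodal)
print(bimodal)
```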
95% CI Example
Let’s take a look at an example of the 95% CI. In this example,
we can look at the confidence interval and get a feel for how
precise our estimates are of the annual rate of death in Idaho
and in the different districts in Idaho. The annual rate of death
in the state is 7.2 per 100,000 people. The narrow
confidence interval of 7.0 to 7.3 suggests that we can be 95%
confident that the true death rate lies within this range.
We can also use confidence intervals to compare 2 rates
to determine whether they are statistically different from each
other. When comparing two rates, if the confidence intervals
do not overlap, the difference in the rates is considered unlikely
to be the result of chance. (We use the term “statistically significant” to say that something is unlikely
to be the result of chance.) For example, when comparing the confidence intervals around the death
rates for districts 2 and 3, we can see that they do not overlap. This suggests that the difference
between the death rate of 9.0 per 100,000 for district 2 and 7.3 per 100,000 for district 3 is statisti-
cally significant.
It is worth noting that when comparing 2 rates, although non-overlapping CIs indicate a statistically
significant difference between the 2 rates, the opposite is not true. In other words overlapping CIs do
not necessarily suggest that the 2 rates are statistically similar. In order to be sure, you would have to
perform a statistical test to compare the 2 rates.
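The overlap rule can be sketched as a simple check. The state interval (7.0 to 7.3) is the one quoted above; the district intervals are hypothetical bounds invented for illustration, since the district confidence limits are not given in this transcript. The function name is also an assumption.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (low, high) confidence intervals share any values."""
    (low_a, high_a), (low_b, high_b) = ci_a, ci_b
    return low_a <= high_b and low_b <= high_a

state_ci = (7.0, 7.3)        # from the example above
district_2_ci = (8.4, 9.6)   # hypothetical bounds around a rate of 9.0
district_3_ci = (6.9, 7.7)   # hypothetical bounds around a rate of 7.3

d2_vs_d3 = intervals_overlap(district_2_ci, district_3_ci)
state_vs_d3 = intervals_overlap(state_ci, district_3_ci)

print(f"Districts 2 and 3 overlap: {d2_vs_d3}")
print(f"State and district 3 overlap: {state_vs_d3}")
```

A result of False suggests a statistically significant difference. A result of True does not settle the question either way, so a statistical test is still needed.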
P-Value
Now let’s take a look at the other useful statistical tool I
mentioned at the beginning of the section: P-value. The p-
value is frequently used in public health to determine whether
observed differences between groups are ‘real’ differences.
(Another way to say this is that the p-value is a measure of
the statistical significance of a difference between rates or
proportions.)
The p-value is a measure of how likely it is that the differences
between two observed rates or proportions occurred by chance
alone. We’ll look at examples of p-values on the next slide, but
for now let me just say that a very small p-value means that observed differences were very unlikely to
have occurred by chance. More precisely, a p-value of 0.05 indicates that, if there were truly no
difference between the two groups, chance alone would produce a difference at least as large as the
one you observed about 5% of the time. Note that this is not the same as saying there is a 95% chance
that the difference is real; the p-value only describes how surprising the data would be if chance were
the sole explanation.
A p-value of less than .05 suggests that a difference this large would arise by chance alone less than
5% of the time if the two groups did not truly differ.
It is common practice in public health to use a cutoff of p less than .05 to establish that an observed
difference was unlikely to have occurred by chance alone.
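To show where a p-value can come from, here is a minimal sketch of one common test for comparing two proportions, the two-sided z-test with a pooled standard error. This is one possible test and is not necessarily the one used to produce the p-values shown in this course; the function name and all counts are hypothetical.

```python
import math

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: events out of 1,000 people in each group.
p_large_gap = two_proportion_p_value(60, 1000, 30, 1000)
p_small_gap = two_proportion_p_value(50, 1000, 45, 1000)

print(f"60 vs 30 per 1,000: p = {p_large_gap:.4f}")
print(f"50 vs 45 per 1,000: p = {p_small_gap:.4f}")
```

With these hypothetical counts, the first comparison falls below the .05 cutoff and the second does not.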
P-Value Example
Here is an example of how to interpret p-values. In this slide,
we are looking at a table of annual death rates per 100,000
people in Idaho by district. The far right column shows the
p-values for the death rates for each district compared with the
rest of the state.
A p-value that is less than .05 indicates that there is a statisti-
cally significant difference between the annual death rate in a
certain district and the rest of the state. We can see that 5 of the
p-values are less than .05. By looking at the rates themselves,
we can see that in districts 1, 2 and 5, the annual death rates
were statistically significantly higher than the rest of Idaho, and
in districts 4 and 7, the death rates were statistically significantly lower than the rest of Idaho. Because
the p-values were not less than .05 for the comparisons between district 3 and the rest of the state
and district 6 and the rest of the state, we cannot say that there is a statistically significant difference
between these death rates.
Data Presentation
In the previous slides, we discussed ways to use data to measure
the burden of disease in populations using disease frequency
measures such as prevalence and incidence. Then we discussed
ways to summarize data (such as with the mean or median) and
we discussed ways to measure the precision of our estimates and
make comparisons between groups (using confidence intervals
and p-values). But once you have summarized your data, how
do you know how to present those data in a clear and meaning-
ful fashion?
In the following slides we will discuss ways to present data
and the strengths and limitations of each presentation format,
and we will describe how to choose between different presentation options.
Often it is difficult to determine the most appropriate way to visually display data. Although we
may be comfortable with using one of these formats, the choice of graphic depends on what we want
to emphasize, rather than simply trying to fit the data into a familiar framework. The next series of
slides will review some common ways that data can be presented, and the strengths and limitations of
each method.
Examples of data presentation options that we will discuss are:
• Tables
• Line graphs
• Bar graphs
• Pie charts
Tables
A table is a visual display of data arranged into rows and
columns. One benefit of using tables is that they allow us to
demonstrate a number of patterns or differences between
groups, depending on what data are included in the table.
Almost any quantitative information can be organized into a
table.
Tables may take longer to read and understand than some
other visual comparisons, such as graphs. A table should be as
simple as possible. Because large complicated tables can be
overwhelming for the reader, for clarity, sometimes it is better to
create two or three small tables rather than one large table.
Although tables can be useful for presenting time trend data, sometimes other data presentation
options might be preferable.
Table Example
This table illustrates the leading causes of death among Idaho
residents below the age of 1 year in 2004. In the first column,
causes of death are presented. In the second column, the
number of deaths that occurred in each category is shown, and
in the third column, the frequency of those cause-specific deaths
is represented. In the bottom row of the table, total number of
deaths is reported, and you can see that the frequencies of the
cause-specific deaths add to 100%.
This table is simple and clear in its presentation. The fact that
the causes of death are listed in order of how frequently they
occurred makes it easy for the reader to identify the leading
causes of death in this age group and perhaps to begin thinking
about prevention strategies.
Line Graphs
A line graph is a useful data presentation tool for showing a long
series of data (such as disease trends over time). Line graphs are
also useful for comparing several different series of data in the
same graph. Line graphs display data in two dimensions. We call
the dimensions the x-axis and the y-axis.
By convention the dependent or y variable is on the vertical
axis and the x, or independent variable is on the horizontal axis.
When reading a line graph, you’ll notice that rises and falls in the
line show how one variable is affected by another. Let’s look at
an example.
Bar Graphs
Bar graphs are also used to compare data and show relationships
between two or more variables (or groups or items).
Each independent variable is discrete, such as race or gender
(which only has two categories: male and female). If you
wanted to display data comparing, for example, the prevalence
of smoking in people of different ages, using a bar graph, you
would group the age variable into categories (such as ages 15-19
or 20-24) for clarity of presentation.
Bar graphs are a quick and intuitive way to show big differ-
ences in data.
differences in the people who live in district 6 vs. district 4 for example, or whether differences in
prevalence result from random error or bias in our sample of respondents.
Again, in order to determine whether differences in prevalence between two or more districts
are statistically significant, we would have to see a p-value. Recall that the p-value is obtained from
performing a statistical test.
Pie Charts
Pie charts are frequently used to show how part of something
relates to the whole. Pie charts are useful for showing the
component parts of a single group or variable. The basic design
is a circle, the shape of a pie, and the components, or slices of
the pie, are usually percentages of the different categories of the
variable.
Pie charts are a way to effectively present percentages in
which the “slices” of the pie add up to 100%.
to unintentional injuries), and the information you want to convey is how parts relate to the whole
(like how the causes of deaths due to unintentional injuries relate to the total number of injuries), you
should consider using a pie chart to display your data.
Summary
In this course, we covered some basic concepts you will need to
understand and talk about public health data. These concepts
include:
• Measures of disease frequency, such as prevalence,
incidence, mortality, and case-fatality
• Biostatistical tools, such as the mean, median, mode,
confidence interval, and p-value
• Graphical forms of displaying data, such as tables, line
graphs, bar graphs and pie charts
Remember, knowing how to read, understand, and interpret data specific to your community will
help you better understand your community’s health needs.
Resources
Here is a list of useful resources that provide further information
about this topic. You can also print out the list of these resources
by clicking the resources link in the attachments drop-down box
located at the top of the screen.
Online Resources
CDC WONDER, http://wonder.cdc.gov/. Provides a single
point of access to a wide variety of reports and numeric
public health data.
E is for EPI, North Carolina Center for Public Health
Preparedness. http://www.sph.unc.edu/nccphp/training/training_list/t_e_epi.htm.
Principles of Epidemiology, CDC, Second Edition, 1992. http://www.phppo.cdc.gov/phtn/catalog/pdf.
Books
Basic and Clinical Biostatistics, Beth Dawson and Robert G. Trapp. McGraw Hill, 2004.
A Cartoon Guide to Statistics, Larry Gonick and Woollcott Smith, Harper Collins, 1994.
Epidemiology, Leon Gordis, W.B. Saunders Company, 2000.
Epidemiologic Methods, Thomas Koepsell and Noel Weiss, Oxford University Press, 2003.
Epidemiology for Public Health Practice, Robert H. Friis and Thomas A. Sellers, Jones and Bartlett
Publishers, 2004.
Intuitive Biostatistics, Harvey Motulsky, Oxford University Press, 1995.