HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports
Model Formulation
Focus on Media-based Biosurveillance
JAMIA
Abstract
Objective: Unstructured electronic information sources, such as news reports, are proving to be
valuable inputs for public health surveillance. However, staying abreast of current disease outbreaks requires
scouring a continually growing number of disparate news sources and alert services, resulting in information
overload. Our objective is to address this challenge through the HealthMap.org Web application, an automated
system for querying, filtering, integrating and visualizing unstructured reports on disease outbreaks.
Design: This report describes the design principles, software architecture and implementation of HealthMap and
discusses key challenges and future plans.
Measurements: We describe the process by which HealthMap collects and integrates outbreak data from a variety of
sources, including news media (e.g., Google News), expert-curated accounts (e.g., ProMED Mail), and validated official
alerts. Through the use of text processing algorithms, the system classifies alerts by location and disease and then
overlays them on an interactive geographic map. We measure the accuracy of the classification algorithms based on the
level of human curation necessary to correct misclassifications, and examine geographic coverage.
Results: As part of the evaluation of the system, we analyzed 778 reports with HealthMap, representing 87 disease
categories and 89 countries. The automated classifier performed with 84% accuracy, demonstrating significant
usefulness in managing the large volume of information processed by the system. Accuracy for ProMED alerts is 91%
compared to Google News reports at 81%, as ProMED messages follow a more regular structure.
Conclusion: HealthMap is a useful free and open resource employing text-processing algorithms to identify
important disease outbreak information through a user-friendly interface.
J Am Med Inform Assoc. 2008;15:150–157. DOI 10.1197/jamia.M2544.
information, enabling the user to quickly and easily see those elements pertinent to her area of interest.

Background
HealthMap is part of a new generation of health surveillance systems that help supplement existing public health systems by focusing on event-based monitoring of infectious diseases by leveraging Internet news and other electronic media. One of the earliest systems to harness some of these resources is the Global Public Health Intelligence Network (GPHIN).9,10 GPHIN has shown that extensive monitoring and analysis of news media around the world can effectively aid in early detection of emerging disease threats. Most notably, GPHIN was able to identify the 2002–2003 outbreak of Severe Acute Respiratory Syndrome (SARS) well in advance of official reporting.10,11 On an ongoing basis, GPHIN also provides a large fraction of initial outbreak reports directly to the WHO for investigation.1,2 Another successful online disease alerting service is the ProMED Mail email announcement list, with 38,000 subscribers and a panel of expert moderators.12–14 Other systems include MedISys,15 Argus,16 and EpiSPIDER,17 all of which also leverage informal electronic datasets for disease outbreak information.

While projects such as GPHIN and ProMED serve public health authorities, infectious disease Web sites that serve the general public are also gaining in popularity and helping to increase awareness of public health issues, especially for international travelers. One such site, FluWikie.com, which reports on avian influenza and other topics relating to pandemic influenza, is heavily trafficked and was cited along with similar sites by the CDC as “critical to CDC’s ability to prepare for and respond to an influenza pandemic.”18

In addition to existing online public health resources, recent years have seen the rise of “Web 2.0” technologies19 including the proliferation of Really Simple Syndication (RSS)20 and Asynchronous JavaScript and XML (AJAX).21,22 These tools create new opportunities for interactive software such as HealthMap. On the backend, RSS is a first step towards the goal of a “semantic Web,”23 allowing for greater possibilities in extracting structure algorithmically from a variety of disparate data sources. On the frontend, the Google Maps public API allows the Web developer to create mapping applications using a powerful and well-known user interface. Finally, rich JavaScript and asynchronous HTTP requests, the AJAX building blocks, enable us to create responsive, highly customizable Web user interfaces that begin to approach the desktop software experience.

The power of HealthMap as a disease surveillance tool lies in its potential to bring together automated processing of a broad range of Internet data sources and rich, accessible visualization tools for lay and public health users alike. In this report, we describe the software architecture and implementation, as well as challenges and future plans.

Formulation Process
The principal objective of HealthMap is to provide access to the greatest amount of potentially useful health information across the widest range of geography and pathogens, without overwhelming the user with excess information or obscuring important and urgent elements. To accomplish this goal, the system must be able to correctly classify reports, provide flexible and useful visualization output, and be responsive under heavy usage load.

Classification
The system is only useful to the extent that it can correctly identify the primary locations, diseases and other outbreak-related factors of a large percentage of alerts, based on words, phrases and other available contextual information for each report.

In addition to the “correctness” of classification, the system must also take end-user objectives into account. For example, if a single alert contains references to fifty different places, the strictly correct classification would generate markers in all fifty locations. However, this alert, likely a summary of known ongoing activity, would then overload the map view with less important information and provide little benefit to the user. Another condition where optimum classification is difficult is in the case of multiple country involvement in a single outbreak. For instance, Switzerland may send disease specialists to help combat a dengue fever outbreak in Paraguay. In this case the primary locations of the alert are Switzerland and Paraguay, but if the system presents alert classifications in such a way as to imply that an outbreak of dengue fever is occurring in Switzerland, the user will be justifiably confused. The classifier must thus be designed to integrate its output with the user display.

Visualization
With respect to visualization, a key objective of the system is to maximize flexibility in two key areas: in the user interface and in the collection of the underlying data. Specifically, HealthMap is designed to organize data across different dimensions (such as date, location and disease) and allow users to customize the view according to the geographic location, disease, and type of outbreak. However, the system must balance flexibility with simplicity; in certain cases, it should impose assumptions in organizing the data, so as not to overwhelm the user with customization controls. In general, the visualization interface should be intuitive and easy to use for the novice user (who may be a novice with respect to both software interfaces and infectious disease epidemiology) as well as allow the advanced user sophisticated and flexible customization of the display.

Behind the user interface, as the system collects reports, the goal is to allow the underlying data to shape the view as much as possible. Avian influenza, for example, is currently a topic of significant public health concern and extensive media coverage. However, the system should not place a priori emphasis on any given disease; instead it should adapt its mode of display to infectious disease threats as they emerge. The next global threat may come from an unexpected source, or the focus of public health and media attention may shift.

Accordingly, while HealthMap focuses primarily on human disease surveillance, one of our design objectives is comprehensive coverage of disease activity, encompassing animal and plant diseases, as well as some insect pests and other invasive species. This disease coverage is important as many infectious diseases of public health concern are zoonotic, naturally circulating among wildlife reservoir hosts before emerging in the human population.24–26

Along the same lines, the system should, where possible, avoid biases towards specific geographic areas. The next
152 FREIFELD ET AL., Internet-based Surveillance of Infectious Diseases
Classification Engine
The classification engine determines the primary locations
and diseases associated with each alert. It comprises
two modules: the Preparation Module, which takes the raw
input from the source, segments it, and prepares it for input
to the parser; and the Parser Module, which takes text input
and produces disease and location codes as output.
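
As a rough illustration of how two such modules could fit together, the following Python sketch segments a report into fields, looks each field up in a small pattern dictionary, and stops once both a disease and a location are known. It is a minimal sketch under stated assumptions, not HealthMap's actual code: the dictionary entries, the `CONTAINS` table of container relationships, the field names, and all function names are hypothetical, and matches are assumed to accumulate across fields. Only single-word patterns are handled here.

```python
import re

# Hypothetical pattern dictionary: lowercase word -> (kind, category code).
# The real dictionary is not reproduced in this report; entries are illustrative.
PATTERNS = {
    "shigellosis": ("disease", "Shigellosis"),
    "diarrhea": ("disease", "Diarrhea"),
    "wisconsin": ("location", "Wisconsin"),
    "wi": ("location", "Wisconsin"),  # abbreviation stored as a synonym
}

# Hypothetical container relationships: a general category is dropped when a
# more specific category that it contains is matched in the same report.
CONTAINS = {"Diarrhea": {"Shigellosis"}}

# Assumed order in which the fields of a report are examined.
FIELDS = ("headline", "description", "body", "publication")


def prepare(report):
    """Preparation step: split a raw report into ordered, non-empty text segments."""
    return [report[name] for name in FIELDS if report.get(name)]


def parse(text):
    """Parser step: return the disease and location codes matched in one segment."""
    found = {"disease": set(), "location": set()}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in PATTERNS:
            kind, code = PATTERNS[word]
            found[kind].add(code)
    return found


def classify(report):
    """Examine segments in order until both a disease and a location are found."""
    diseases, locations = set(), set()
    for segment in prepare(report):
        found = parse(segment)
        diseases |= found["disease"]
        locations |= found["location"]
        if diseases and locations:
            break  # later segments (e.g. the publication name) are ignored
    # Remove general categories made redundant by a more specific match.
    diseases = {d for d in diseases if not (CONTAINS.get(d, set()) & diseases)}
    return sorted(diseases), sorted(locations)


example = {
    "headline": "Elementary School Deals with Outbreak of Bacteria",
    "body": "There were 14 confirmed cases of shigellosis. The Wisconsin health "
            "department says fever and diarrhea are common symptoms.",
    "publication": "WBAY, WI",
}
print(classify(example))  # (['Shigellosis'], ['Wisconsin'])
```

A production parser would also need multi-word patterns and disambiguation of short abbreviations, which this sketch deliberately leaves out.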
Figure 3. User Interface.

decay more rapidly as we progress into the past.) Locations that have a greater number of feeds and diseases associated with them are also given increased weighting. Our qualitative justification for this boost is that if multiple sources have corroborated an outbreak it deserves more emphasis, and if the same source is reporting the same disease, it deserves less emphasis.

After the computation is complete, the system normalizes the heat scores across the set of markers and assigns each marker an integer value from 0 to 10. Because it computes the Heat scores for the currently requested marker set, the user can, for example, choose a particular disease category and quickly see where the hotspots are for that disease, in addition to the default view indicating general levels of outbreak activity.

User Interface
Figure 3 shows the HealthMap main page, featuring a variety of information boxes and user controls. The “Available feeds” box (Figure 3a) allows the user to select which sources to display on the map by means of the checkboxes along the left-hand side. Below the feeds menu, the “Diseases, last 30 days” box serves both to display the currently active diseases as well as to allow the user to select which diseases to display (Figure 3b). The “i” button brings up a menu with links to further information about the particular disease from the Wikipedia, WHO, CDC, PubMed, and Google Trends Web sites. In the next section, the “Alerts by country” box indicates the number of alerts active in each country for the currently selected parameters (Figure 3c). Clicking on a country name zooms the map view to that country for easy viewing of alerts in that location. The “Latest alerts” box displays the most recent alerts in reverse chronological order (Figure 3d). An icon next to each headline indicates the alert source.

Moving across to the map display window, the date slider at the bottom allows the user to control the date range of displayed alerts (Figure 3e). The end date is fixed as the current date, but the user can set the start date to any point in the previous thirty days. “Full Screen” mode expands the map to cover the full browser window, allowing for richer display and navigation (Figure 3f). It also allows for “situation room” use, allowing the user to display the map on a non-interactive screen and monitor ongoing alert activity.

On the map itself, the color of a marker indicates the Heat Index value for the location, with the deeper red color indicating more intense recent activity as contrasted with the paler yellow color.

Validation
Example Report Illustrating Classifier Operation
To illustrate the functioning of the system, we examine a sample report and how it is processed by the HealthMap classification engine. A local newspaper report concerning an outbreak of shigellosis at a school in Wisconsin enters the system via the Google News aggregator. The system begins by examining the article headline:
Journal of the American Medical Informatics Association Volume 15 Number 2 Mar / Apr 2008 155
Elementary School Deals with Outbreak of Bacteria

As there are no known patterns found for either location or disease, the classifier then progresses to the article “description,” an extract provided by Google News:

Elementary School Deals with Outbreak of Bacteria 58 minutes Smith A bacterial outbreak at a Fond du Lac school is prompting the district to alert parents and do some extra cleaning in hopes of stopping the . . .

While there is an indication of the location provided in this extract, “Fond du Lac” is currently not included in the dictionary, and therefore not recognized. Still lacking both location and disease information, the classifier examines the article body text, as prepared by the parsing engine from the original HTML:

WEB SEARCH BY A bacterial outbreak at a Fond du Lac school is prompting the district to alert parents and do some extra cleaning in hopes of stopping the bacteria from spreading. State health officials say there were 14 confirmed cases of shigellosis, a bacterial infection, in Fond du Lac County in the past three months. Five confirmed cases prompted Roberts Elementary School in Fond du Lac to notify parents. “We want to get the information out to parents: Here it is and here are steps you can take,” Marian Sheridan, the Fond du Lac school health and safety coordinator said. The concern is that this infection is fast-spreading. Although the Wisconsin health department says 300 to 400 cases are reported each year, the uncomfortable abdominal cramps, fever, and diarrhea are symptoms no one wants running rampant through schools. “I think we're getting the message out early enough, and I think that's one of the benefits of working with school districts staff to get the word out so we can contain it before it's widespread,” Joyce Mann of the Department of Health and Family Services said. “Parents are used to the school sending them health notices, and it's never to alarm but it's rather to inform,” Sheridan said. “Normally what we do is go in with a ten-percent bleach solution and everything gets wiped down: telephones, door knobs, desk chairs, desktops, the bathrooms are thoroughly gone through,” building and

As indicated in bold, the classifier now matches three different patterns in the text. The first identifies the disease category as Shigellosis; the second places the report in Wisconsin. The third match corresponds to the Diarrhea disease category, but based on the container relationships described above, the system correctly identifies Diarrhea as redundant with Shigellosis, and eliminates the former. At this point, the classifier has completed its work, and proceeds to the next report. Had it not identified both disease and location from the body text, it would have further examined the name of the publication as provided by Google News:

WBAY, WI

Upon processing of this text, it would also have identified the location based on the abbreviation WI, which is listed in the dictionary as a synonym of Wisconsin. However, in this particular case, the publication information is ignored as the classifier has already achieved matches using other components of the report.

Classifier Performance
Because the classification engine places alerts into many hundreds of different location and disease categories (currently over 700 total), as well as combinations of multiple categories of each type, it is not possible to apply traditional binary classification metrics such as precision and recall to measure its performance. However, because we curate all reports on a daily basis to correct misclassifications, we can examine various aspects of performance based on the changes performed.

At the most basic level, the accuracy of the classifier can be measured by the percentage of reports entering the system that need not have their disease or location classifications corrected in any way. At a more detailed level, we can examine the number of alerts requiring a correction of disease classification as compared with the number requiring a location correction. Table 1 provides a full breakdown of the classifier performance both by source and by disease and location. As shown, the overall accuracy of the system is 84%, thus correctly classifying 655 out of 778 reports over the one-month period from October 10 to November 9. As one might expect, performance on ProMED alerts, at 91%, is substantially better than on Google News reports (81%), as ProMED messages represent data curated specifically for disease outbreak reporting and follow a more regular structure.

There are, however, important limitations to this performance analysis. In particular, in some cases, the correction of the classification serves merely to shift between related categories, such as reclassifying Gastroenteritis as Norovirus, or UK as England. In other cases, the correction is more drastic, such as correcting Influenza to Equine Influenza, or Washington, DC to Washington State. Clearly the change is more significant in the latter cases, but we don’t capture this distinction in the current analysis. As it is difficult to capture rigorously, for the moment we take the most conservative view in computing accuracy. As part of our ongoing research, we are developing more fine-grained metrics.

Discussion
As HealthMap is still in the early stages of development, a number of important enhancements are either currently under development or in the planning stage. The primary design goal of HealthMap is to provide broad coverage of ongoing outbreaks without overwhelming the user. In the pursuit of improved coverage, we are exploring the use of other sources, including additional news aggregators (such as Yahoo news, Factiva, and LexisNexis), blogs, and veterinary news sources such as the World Organization for Animal Health (OIE). In pursuit of improved filtering, we are developing natural language processing techniques for additional automated data categorization, such as clustering similar reports, identifying specific outbreak pertinence, distinguishing discrete outbreaks from endemic activity, and identifying reports indicating the absence of disease or the end of a previously identified outbreak.

Table 1. Location and Disease Classifier Performance over the One Month Period from 10 October 2007 to 9 November 2007

Source            Total  Edited     Location  Disease  Accuracy
All               778    123 (16%)  87 (11%)  47 (6%)  84%
ProMED only       207    19 (9%)    14 (7%)   5 (2%)   91%
Google News only  547    104 (19%)  73 (13%)  42 (8%)  81%
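
The accuracy column of Table 1 follows directly from the curation counts: accuracy is the share of reports that needed no edit to either disease or location. As a quick arithmetic check, the short Python sketch below recomputes it from the Total and Edited columns. The `TABLE_1` dictionary and the `accuracy` helper are names chosen here for illustration and are not part of HealthMap.

```python
# (total reports, reports edited by curators) per source, taken from Table 1.
TABLE_1 = {
    "All": (778, 123),
    "ProMED only": (207, 19),
    "Google News only": (547, 104),
}


def accuracy(total, edited):
    """Fraction of reports whose classification needed no correction."""
    return (total - edited) / total


for source, (total, edited) in TABLE_1.items():
    correct = total - edited
    print(f"{source}: {correct}/{total} = {accuracy(total, edited):.0%}")
# All: 655/778 = 84%
# ProMED only: 188/207 = 91%
# Google News only: 443/547 = 81%
```

Because a report counts as edited if either its disease or its location was changed, this is the conservative measure described in the text; the Location and Disease columns can sum to more than Edited when a single report needed both corrections.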
As part of our own evaluation, as mentioned in the Formulation Process, an important goal of the system is to cover as broad a range of geography and disease as possible, without bias toward particular regions or pathogens. While the internal architecture of the system itself largely meets these goals (particularly as we add more geographical subdivisions around the world), the alert data we process and display leaves much to be improved. Because we currently rely heavily on the US edition of Google News for reports, the system is biased toward the US and Canada as well as other English-speaking countries around the world, as shown in Figure 4. To address this problem, we have developed a Spanish-language version of the system and are currently expanding to other languages and data sources as resources permit. However, given the uneven distribution of media and reporting resources around the world, we will continue to face this issue for the foreseeable future.

In addition to adding new capability, we are also working on improving the accuracy of the existing classifier, both by expanding the pattern dictionary and by improving the preparation module. We will add more locations and diseases, including administrative divisions for countries such as Indonesia, Brazil, Sudan and Mexico as well as major cities worldwide. For diseases, we will be adding more disease categories and refining our disease taxonomy, as well as tagging diseases with category metadata to allow for improved searching. We will also explore more advanced techniques such as fuzzy matching and Bayesian machine learning for improving the resolution and accuracy of our automated classification algorithms, as well as categorizing alerts by relevancy, clustering similar alerts, and extracting other useful attributes.27–29 On the human side, taking inspiration from the highly successful Wikipedia model,30 we plan to work with networks of experts to evaluate community collaboration as a mechanism for alert acquisition and classification.

As we expand functionality, performance will naturally become an increasing concern. We have a few optimizations in progress, such as moving to memory-based caching, more intelligent, “lazy” loading of the pattern dictionary, and better optimized database queries. We are also exploring ways to better employ client-side caching without overloading the browser.

On the frontend, we have plans to improve the user experience with added features and improved customization. Examples include keyword searching, RSS output, saved preferences, endemic background disease rates, notification messaging via email, and temporal visualization. (Notably, the EpiSPIDER system has already taken steps in this area, incorporating a timeline view of ProMED reports.17) We also plan to conduct a usability observation study, to gather feedback from our target demographic on priority features as well as how best to improve the HealthMap user interface.

Along with user-level evaluation, we are also working to develop more rigorous evaluation metrics for the integrated system, including its ability to cover a broad range of geography and pathogens, limit noise, detect outbreaks early, and accurately characterize alerts in each dimension of classification.

HealthMap is part of a new generation of disease surveillance systems that process unstructured and unclassified data sources. Comprehensive evaluation of these types of systems and data sources is also an important area and part of our ongoing and future research, and collaboration with other disease tracking systems such as GPHIN, EpiSPIDER, MedISys, and Argus would enable an in-depth comparison.
With that said, there are a few broad comparisons we can draw between systems. One key area is accessibility: HealthMap is freely available to the public, whereas some systems are currently closed systems, requiring either paid subscription or approved access. Another key area is in the use of automation. While we certainly perform manual curation in maintaining HealthMap, our goal is to maximize automation in order to leverage the human contribution. The value of a full-time staff of language and domain experts to read and analyze reports around the clock should also be addressed as part of a broader research initiative.6,7

Conclusion
The promise of HealthMap lies in its ability to extract useful, customizable messaging and views from a mass of unstructured data. While the site has already generated significant interest as a publicly available surveillance tool, many improvements remain to be made for it to be a truly useful resource for both public health professionals and the general public. In particular, adding more languages and expanding our usage of general data sources such as newspapers and blogs will increase coverage and further demonstrate the value of the visualization and filtering features. Moreover, only as time progresses, as more people use the system, and further significant outbreaks unfold in the global disease ecosystem, will we know the true potential of the software, and how best to improve it.

References
1. Grein TW, Kamara KB, Rodier G, Plant AJ, Bovier P, Ryan MJ, et al. Rumors of disease in the global village: outbreak verification. Emerg Infect Dis. 2000 Mar-Apr;6(2):97–102.
2. Heymann DL, Rodier GR. Hot spots in a wired world: WHO surveillance of emerging and re-emerging infectious diseases. Lancet Infect Dis. 2001 Dec;1(5):345–53.
3. Hiltz SR, Murray T. Structuring computer-mediated communication systems to avoid information overload. Communications of the ACM. 1985;28(7):680–9.
4. Berghel H. Cyberspace 2000: Dealing with information overload. Communications of the ACM. 1997;40(2):19–24.
5. Brownstein JS, Freifeld CC, Reis BY, Mandl KD. HealthMap: Internet-based emerging infectious disease intelligence. In: Institute of Medicine, editor. Infectious Disease Surveillance and Detection: Assessing the Challenges - Finding Solutions. Washington, DC; 2007. 183–204.
6. Holden C. Netwatch: Diseases on the move. Science. 2006 Dec 1;314(5804):1363d.
7. Captain S. Get your daily plague forecast. Wired News. Available at: http://www.wired.com/science/discoveries/news/2006/10/71961. Accessed Apr 4, 2007.
8. Larkin M. Technology and public health: Healthmap tracks global diseases. Lancet Infect Dis. 2007 Feb;7:91.
9. Mykhalovskiy E, Weir L. The Global Public Health Intelligence Network and early warning outbreak detection: a Canadian contribution to global public health. Can J Public Health. 2006 Jan-Feb;97(1):42–4.
10. Mawudeku A, Blench M. Global Public Health Intelligence Network (GPHIN). 7th Conference of the Association for Machine Translation in the Americas 2006. Available at: www.mt-archive.info/MTS-2005-Mawudeku.pdf. Accessed Apr 26, 2007.
11. Eysenbach G. SARS and population health technology. J Med Internet Res. 2003 Apr-Jun;5(2):e14.
12. Morse SS, Rosenberg BH, Woodall J. ProMED global monitoring of emerging diseases: design for a demonstration program. Health Policy. 1996 Dec;38(3):135–53.
13. Madoff LC. ProMED-mail: an early warning system for emerging diseases. Clin Infect Dis. 2004 Jul 15;39(2):227–32.
14. Madoff LC, Woodall JP. The internet and the global monitoring of emerging diseases: lessons from the first 10 years of ProMED-mail. Arch Med Res. 2005 Nov-Dec;36(6):724–30.
15. Health Threats Unit at Directorate General Health and Consumer Affairs of the European Commission. MedISys (Medical Intelligence System). Available at: http://medusa.jrc.it/. Accessed Apr 4, 2007.
16. Wilson J. Argus: Use of Indications and Warnings for Global Tactical Detection and Tracking of Biological Events. Georgetown Hosts 3rd Annual Conference on Infectious Disease; 2007; Washington, DC; 2007.
17. Tolentino H. Scanning the Emerging Infectious Diseases Horizon - Visualizing ProMED Emails Using EpiSPIDER. International Society for Disease Surveillance Annual Conference; 2006; Baltimore, MD; 2006.
18. Bernhardt JM. Centers for Disease Control and Prevention: Director’s Blog. Health Marketing Musings 2006. Available at: http://www.cdc.gov/healthmarketing/blog_101106.htm. Accessed Apr 17, 2007.
19. O’Reilly T. What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software 2007. Available at: http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html. Accessed Apr 17, 2007.
20. Cayzer S. Semantic blogging and decentralized knowledge management. Communications of the ACM. 2004;47(12):47–52.
21. Garrett JJ. Ajax: A New Approach to Web Applications. 2005. Available at: http://www.adaptivepath.com/publications/essays/archives/000385.php. Accessed May 10, 2007.
22. Paulson LD. Building Rich Web Applications with Ajax. Computer. 2005;38(10):14–7.
23. Berners-Lee T, Hendler J, Lassila O. The semantic Web. Scientific American. 2001;284(5):28–37.
24. Gratz NG. Emerging and resurging vector-borne diseases. Annu Rev Entomol. 1999;44:51–75.
25. Dobson A, Foufopoulos J. Emerging infectious pathogens of wildlife. Philos Trans R Soc Lond B Biol Sci. 2001 Jul 29;356(1411):1001–12.
26. Brownstein JS, Holford TR, Fish D. Enhancing National West Nile Virus Surveillance. Emerg Infect Dis. 2004; In press.
27. Zheng W, Milios E, Watters C. Filtering for medical news items using a machine learning approach. Proc AMIA Symp. 2002:949–53.
28. Chen H. Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. J Am Soc Inform Sci 1999;46(3):194–216.
29. Ribeiro-Neto B, Laender AHF, deLima LRS. An Experimental Study in Automatically Categorizing Medical Documents. J Am Soc Inform Sci 2001;52(5):391–401.
30. Giles J. Internet encyclopaedias go head to head. Nature. 2005 Dec 15;438(7070):900–1.