HealthMap Global Infectious Disease Monitoring Through Automated Classification and Visualization of Internet Media Reports

HealthMap is a system that monitors online media sources for reports of infectious disease outbreaks. It collects data from news reports, expert discussions, and official alerts. Using text processing algorithms, HealthMap automatically classifies reports by location and disease. It displays outbreak information on an interactive map. The system processes about 30 disease reports per day. An evaluation of 778 reports found the automated classifier to have 84% accuracy in identifying locations and diseases. HealthMap aims to organize the large volume of online outbreak information and present it in a user-friendly interface to help public health experts monitor global disease threats.


150 FREIFELD ET AL., Internet-based Surveillance of Infectious Diseases

Model Formulation
Focus on Media-based Biosurveillance
JAMIA
HealthMap: Global Infectious Disease Monitoring through
Automated Classification and Visualization of Internet
Media Reports

CLARK C. FREIFELD, KENNETH D. MANDL, BEN Y. REIS, JOHN S. BROWNSTEIN

Abstract
Objective: Unstructured electronic information sources, such as news reports, are proving to be
valuable inputs for public health surveillance. However, staying abreast of current disease outbreaks requires
scouring a continually growing number of disparate news sources and alert services, resulting in information
overload. Our objective is to address this challenge through the HealthMap.org Web application, an automated
system for querying, filtering, integrating and visualizing unstructured reports on disease outbreaks.
Design: This report describes the design principles, software architecture and implementation of HealthMap and
discusses key challenges and future plans.
Measurements: We describe the process by which HealthMap collects and integrates outbreak data from a variety of
sources, including news media (e.g., Google News), expert-curated accounts (e.g., ProMED Mail), and validated official
alerts. Through the use of text processing algorithms, the system classifies alerts by location and disease and then
overlays them on an interactive geographic map. We measure the accuracy of the classification algorithms based on the
level of human curation necessary to correct misclassifications, and examine geographic coverage.
Results: As part of the evaluation of the system, we analyzed 778 reports with HealthMap, representing 87 disease
categories and 89 countries. The automated classifier performed with 84% accuracy, demonstrating significant
usefulness in managing the large volume of information processed by the system. Accuracy for ProMED alerts is 91%
compared to Google News reports at 81%, as ProMED messages follow a more regular structure.
Conclusion: HealthMap is a useful free and open resource employing text-processing algorithms to identify
important disease outbreak information through a user-friendly interface.
J Am Med Inform Assoc. 2008;15:150–157. DOI 10.1197/jamia.M2544.
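
The accuracy figures quoted in the Results follow from the curation counts reported in Table 1: a report counts as correctly classified when neither its location nor its disease classification had to be edited. The following is a minimal Python sketch of that arithmetic, not the authors' code; the function name `classifier_accuracy` and the dictionary `TABLE_1` are hypothetical names introduced here for illustration, and the counts are copied from Table 1.

```python
def classifier_accuracy(total_reports: int, edited_reports: int) -> float:
    """Return the fraction of reports that needed no manual correction.

    Hypothetical helper: accuracy is taken to be
    (total - edited) / total, matching the definition in the text.
    """
    if total_reports <= 0:
        raise ValueError("total_reports must be positive")
    if not 0 <= edited_reports <= total_reports:
        raise ValueError("edited_reports must lie between 0 and total_reports")
    return (total_reports - edited_reports) / total_reports


# (total reports, reports edited by a curator), as listed in Table 1
# for 10 October 2007 to 9 November 2007.
TABLE_1 = {
    "All": (778, 123),
    "ProMED only": (207, 19),
    "Google News only": (547, 104),
}

for source, (total, edited) in TABLE_1.items():
    correct = total - edited
    print(f"{source}: {correct}/{total} = {classifier_accuracy(total, edited):.0%}")
```

Under this definition any edit counts as an error, whether it is a small refinement or a major correction, which is the conservative view the paper takes when computing accuracy.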

Affiliations of the authors: Children’s Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology (CCF, KDM, BYR, JSB), Boston, MA; Division of Emergency Medicine, Children’s Hospital Boston (KDM, BYR, JSB), Boston, MA; Department of Pediatrics, Harvard Medical School (KDM, BYR, JSB), Boston, MA.

This work was supported by R21LM009263-01, 1 R01 LM007677, and N01-LM-3-3515 from the National Library of Medicine, National Institutes of Health, and the Canadian Institutes of Health Research.

Correspondence: Clark C. Freifeld, Children’s Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, 300 Longwood Ave., Boston, MA 02115; e-mail: [email protected].

Received for review: 06/29/07; accepted for publication: 11/29/07

Journal of the American Medical Informatics Association Volume 15 Number 2 Mar / Apr 2008

Introduction
Internet-based resources, such as online newspapers, blogs, and discussion forums, have increased in number, volume, and coverage, and show potential as useful data sources for disease surveillance and early outbreak detection; currently, nearly all major outbreaks investigated by the World Health Organization are first identified through these informal online sources.1,2 However, electronic sources of infectious disease news are not well organized or integrated. Reading and assimilating a broad range and large number of reports as they appear on a daily basis has already become increasingly burdensome.3,4

The HealthMap project has begun to address this challenge through automated querying, filtering, integration, and visualization of Web-based reports on infectious disease outbreaks, to facilitate knowledge management and early detection.5,8 A freely available Web site operating since September 2006, HealthMap.org integrates data from a variety of electronic sources, including news through the Google News aggregator, expert-curated accounts such as ProMED Mail, and validated official alerts such as World Health Organization announcements. Through the use of automated text processing algorithms, the system classifies alerts by location and disease and then overlays them on an interactive geographic map. It currently processes an average of 30 disease alerts per day; with the default 30-day time window, the system typically displays approximately 1,000 alerts at any particular time. The filtering and visualization features of HealthMap thus serve to bring structure to an otherwise overwhelming amount of information, enabling the user to quickly and easily see those elements pertinent to her area of interest.

Background
HealthMap is part of a new generation of health surveillance systems that help supplement existing public health systems by focusing on event-based monitoring of infectious diseases by leveraging Internet news and other electronic media. One of the earliest systems to harness some of these resources is the Global Public Health Intelligence Network (GPHIN).9,10 GPHIN has shown that extensive monitoring and analysis of news media around the world can effectively aid in early detection of emerging disease threats. Most notably, GPHIN was able to identify the 2002–2003 outbreak of Severe Acute Respiratory Syndrome (SARS) well in advance of official reporting.10,11 On an ongoing basis, GPHIN also provides a large fraction of initial outbreak reports directly to the WHO for investigation.1,2 Another successful online disease alerting service is the ProMED Mail email announcement list, with 38,000 subscribers and a panel of expert moderators.12–14 Other systems include MedISys,15 Argus,16 and EpiSPIDER,17 all of which also leverage informal electronic datasets for disease outbreak information.

While projects such as GPHIN and ProMED serve public health authorities, infectious disease Web sites that serve the general public are also gaining in popularity and helping to increase awareness of public health issues, especially for international travelers. One such site, FluWikie.com, which reports on avian influenza and other topics relating to pandemic influenza, is heavily trafficked and was cited along with similar sites by the CDC as “critical to CDC’s ability to prepare for and respond to an influenza pandemic.”18

In addition to existing online public health resources, recent years have seen the rise of “Web 2.0” technologies19 including the proliferation of Really Simple Syndication (RSS)20 and Asynchronous JavaScript and XML (AJAX).21,22 These tools create new opportunities for interactive software such as HealthMap. On the backend, RSS is a first step towards the goal of a “semantic Web,”23 allowing for greater possibilities in extracting structure algorithmically from a variety of disparate data sources. On the frontend, the Google Maps public API allows the Web developer to create mapping applications using a powerful and well-known user interface. Finally, rich JavaScript and asynchronous HTTP requests, the AJAX building blocks, enable us to create responsive, highly customizable Web user interfaces that begin to approach the desktop software experience.

The power of HealthMap as a disease surveillance tool lies in its potential to bring together automated processing of a broad range of Internet data sources and rich, accessible visualization tools for lay and public health users alike. In this report, we describe the software architecture and implementation, as well as challenges and future plans.

Formulation Process
The principal objective of HealthMap is to provide access to the greatest amount of potentially useful health information across the widest range of geography and pathogens, without overwhelming the user with excess information or obscuring important and urgent elements. To accomplish this goal, the system must be able to correctly classify reports, provide flexible and useful visualization output, and be responsive under heavy usage load.

Classification
The system is only useful to the extent that it can correctly identify the primary locations, diseases and other outbreak-related factors of a large percentage of alerts, based on words, phrases and other available contextual information for each report.

In addition to the “correctness” of classification, the system must also take end-user objectives into account. For example, if a single alert contains references to fifty different places, the strictly correct classification would generate markers in all fifty locations. However, this alert, likely a summary of known ongoing activity, would then overload the map view with less important information and provide little benefit to the user. Another condition where optimum classification is difficult is in the case of multiple country involvement in a single outbreak. For instance, Switzerland may send disease specialists to help combat a dengue fever outbreak in Paraguay. In this case the primary locations of the alert are Switzerland and Paraguay, but if the system presents alert classifications in such a way as to imply that an outbreak of dengue fever is occurring in Switzerland, the user will be justifiably confused. The classifier must thus be designed to integrate its output with the user display.

Visualization
With respect to visualization, a key objective of the system is to maximize flexibility in two key areas: in the user interface and in the collection of the underlying data. Specifically, HealthMap is designed to organize data across different dimensions (such as date, location and disease) and allow users to customize the view according to the geographic location, disease, and type of outbreak. However, the system must balance flexibility with simplicity; in certain cases, it should impose assumptions in organizing the data, so as not to overwhelm the user with customization controls. In general, the visualization interface should be intuitive and easy to use for the novice user (who may be a novice with respect to both software interfaces and infectious disease epidemiology) as well as allow the advanced user sophisticated and flexible customization of the display.

Behind the user interface, as the system collects reports, the goal is to allow the underlying data to shape the view as much as possible. Avian influenza, for example, is currently a topic of significant public health concern and extensive media coverage. However, the system should not place a priori emphasis on any given disease; instead it should adapt its mode of display to infectious disease threats as they emerge. The next global threat may come from an unexpected source, or the focus of public health and media attention may shift.

Accordingly, while HealthMap focuses primarily on human disease surveillance, one of our design objectives is comprehensive coverage of disease activity, encompassing animal and plant diseases, as well as some insect pests and other invasive species. This disease coverage is important as many infectious diseases of public health concern are zoonotic, naturally circulating among wildlife reservoir hosts before emerging in the human population.24–26

Along the same lines, the system should, where possible, avoid biases towards specific geographic areas. The next
noteworthy outbreak may as easily come from a major urban center in North America as a rural village in Africa.

Performance and Scalability
As the system scales to include more sources and more dimensions of classification, it must be capable of rapidly processing a large number of reports. And as the user interface is enhanced to provide more sophisticated data visualization and customization, it must be able to accommodate a large number of simultaneous users and still be responsive. This scalability is critical, as the Web site could receive a burst of traffic in the event of a broadly publicized disease outbreak.

Model Description
The HealthMap system consists of five modules: the Data Acquisition Engine, Classification Engine, Database, Web Backend, and Web Frontend. As illustrated in Figure 1, the system gathers alerts, classifies them by location and disease, stores them in a database, and then displays them to the user.

Figure 1. HealthMap System Architecture.

Data Acquisition Engine
As the system loads raw data from the Web, it converts each disease outbreak report into a standard “alert” format, containing four fields: headline, date, description, and info text. The headline is the alert headline, date is the date of issue of the alert, and the description is a brief summary of the alert, generally the first few sentences of the article. The info text is the text that will be fed into the parsing engine for the initial classification pass. In general, this initial text consists of the alert headline, stripped of elements that may trigger a false positive. For example, with Google News, the system removes the name of the originating publication from the headline.

The standardization of the alerts, when not already available from the RSS structure, is accomplished through the use of basic assumptions about the HTML and text formatting of the input for each feed. The drawback to making these assumptions is that the data source may change its format without warning, creating unexpected results in the data acquisition and requiring rapid adaptation of the system, though this has not yet proven to be a problem.

Classification Engine
The classification engine determines the primary locations and diseases associated with each alert. It is comprised of two modules: the Preparation Module, which takes the raw input from the source, segments it and prepares it for input to the parser, and the Parser Module, which takes text input and produces disease and location codes as output.

Preparation Module: Tiered Approach
While many alerts contain references to multiple locations or multiple diseases, the aim of the classifier is to identify the primary locations and diseases for each alert. To this end, the input is processed in stages: if the classifier is unable to identify location and disease from the initial input provided by the feed, namely the modified headline, it can request additional text from the feed. For example, in the case of the Google News aggregator, the system examines the headline, then the description, which generally consists of the first one or two sentences of the article, followed by the article’s body text, and finally, the name of the online news source. Frequently, a publication originating in one area will refer to events occurring in another area, making the publication name and location an unreliable source for the location of the alert. However, articles that don’t refer to a well-known location, such as “Suburban school closed after flu outbreak,” generally refer to a location near the publication headquarters. By processing the input in stages, the classifier avoids the incorrect classification of the first case while capturing the true location in the second case.

The extraction component of the preparation module processes the full HTML body of the article itself. Clearly, the article text contains the best indicators as to the locations and diseases of the event in question. However, blindly feeding the full article into the parser, while increasing sensitivity, would also significantly increase the false positive rate, especially due to JavaScript code, CSS and hyperlinks mixed with the body text, any of which may contain text elements that would trigger an incorrect match. The extractor must also contend with the wide variety of HTML formats of different news sources, including potentially malformed HTML code. By means of a collection of regular expressions and cautious assumptions about the input, the system confronts some of these challenges.

Parser Module
The Parser Module uses a word-level N-gram approach to match input against a dictionary of known patterns (an N-gram, as applied in the HealthMap software, is an N-word text extract, generally 1 to 10 words in length). After the initial data acquisition, the parser receives the input text, strips it of non-alphanumeric characters and splits it into word tokens. It then converts all capital letters to lowercase, except for those tokens that are two characters or fewer in length. The parser then compares the input to its dictionary of place and disease patterns, mapping text patterns to the database IDs of all locations and diseases known to the system. As part of the ongoing development of HealthMap, the dictionary is updated daily to improve the accuracy of the system; at the time of this writing it consists of over 2,300 location and 1,100 disease patterns.

Because the dictionary patterns are stored in memory as a tree, where each node is a hash table that maps single tokens
to either subnodes or IDs (leaves), the system can look up each input token in constant time (see Figure 2). Thus, the classification time is linear on the number of input tokens, i.e., the length of the input. In the case where a word may have multiple spellings, for example the American “diarrhea” and the British “diarrhoea,” we simply stock the dictionary with multiple patterns. With the addition of patterns to the dictionary, memory consumption increases, but lookup time does not increase substantively.

Figure 2. Lookup Tree.

The disadvantage of this approach is that because the input is hashed, each token must match exactly, making it difficult to accommodate fuzzy matching, wildcards, or regular expression approaches. Further, if we change the input processing, for example, to retain more of the input data, such as capitalization and punctuation, we must update the entire pattern dictionary. A further disadvantage of the dictionary approach is that the system can only identify locations and diseases already known and stocked in the database. Moreover, a key step in enhancing the parser resolution consists of augmenting the database by capturing correct locations and disease names, often involving careful manual data entry. As national borders shift and names of places change, albeit infrequently, the system must be manually updated to reflect new geography. For example, we have already been affected by this issue, as we needed to update the parsing system to reflect the designation of Serbia and Montenegro as separate nations on June 5, 2006.

A key advantage to the pattern dictionary approach is that it is relatively easily translated to other languages: we can simply employ a different dictionary within the existing architecture. A language expert is needed to perform the initial translation, refine the pattern library, help with capitalization and punctuation subtleties, and provide other adaptations, but the basic approach can be re-applied without major changes to the system. Further, the language expert need have only very minimal technical knowledge with respect to natural language syntax or software development to contribute to the dictionary. With the help of collaborators at the Naval Medical Research Center Detachment in Peru, we have already successfully adapted the classification engine to accommodate Spanish-language input, albeit with a smaller pattern dictionary.

Container Relationships
A key component of the location classifier is its use of relationships among geographical entities. Our goal is to identify the most specific primary location or locations for a given alert. In many cases, we are presented with input such as “UK (England),” or “Boston, MA.” In these cases, each input contains two distinct patterns that are coded as separate locations in the dictionary. However, Boston is contained by Massachusetts and England is contained by the United Kingdom. In order to correctly process this type of input, after it has identified a list of locations, the classification engine executes a secondary step, eliminating apparently redundant locations based on container relationships. In the given example, the system will initially identify both Boston and Massachusetts as locations for the alert, and then eliminate Massachusetts, as it is considered to be redundant with Boston.

We also apply container relationships to disease matching, as the input can contain analogous cases. For instance, avian influenza is a type of influenza and Norwalk-like viruses cause gastroenteritis; if the system identifies both Norwalk-like virus and gastroenteritis in an alert, it thus eliminates gastroenteritis as a redundant, less specific disease category. One key difference in the case of disease taxonomy is that unlike a location, a disease can “belong” to more than one container disease: E. coli is more specific than food poisoning, while norovirus, cholera and E. coli can each cause diarrhea (or gastroenteritis). If no disease category can be identified from the text, we designate the alert as Not Yet Classified. Such alerts may be non-disease-related news items that have slipped through the filter, but they may be important if they indicate initial investigation of an unknown disease or a rare condition that is not yet represented in the HealthMap database.

Database
Once the alerts are classified by location and disease, the system stores them in a MySQL database. The database is designed according to standard relational database normalization principles. The primary tables store alerts, diseases and locations, while linking tables map alerts to their respective categories as identified by the classification engine. This standardized data model allows the HealthMap software flexibility to perform a variety of queries and display different views of the data. While the database is designed primarily to support features of the Web application, the data as they are stored are readily accessible for retrospective epidemiological studies, public health risk mapping and other research applications.

Output Renderer
The initial Web page is loaded by the user’s browser from a server-side cache which is updated every hour, following the capture and classification of new alert data. If the user adjusts the viewing parameters, he will trigger an AJAX request to the server. The request indicates the current state of the page controls, and from it the server generates a database query. The database then returns the alerts that match these parameters.

From these query results, the system then tallies the number of alerts, diseases, and feeds for each day at each location. To this tally it applies an algorithm, based on an exponentially weighted average, to determine a “heat” rating for each location. In order to give particular emphasis to more recent alerts, through qualitative assessments, we have currently set the decay parameter “alpha” of the exponential weighting to 0.17. (A greater alpha value means the weighting will
decay more rapidly as we progress into the past.) Locations that have a greater number of feeds and diseases associated with them are also given increased weighting. Our qualitative justification for this boost is that if multiple sources have corroborated an outbreak it deserves more emphasis, and if the same source is reporting the same disease, it deserves less emphasis.

After the computation is complete, the system normalizes the heat scores across the set of markers and assigns each marker an integer value from 0 to 10. Because it computes the Heat scores for the currently requested marker set, the user can, for example, choose a particular disease category and quickly see where the hotspots are for that disease, in addition to the default view indicating general levels of outbreak activity.

User Interface
Figure 3 shows the HealthMap main page, featuring a variety of information boxes and user controls. The “Available feeds” box (Figure 3a) allows the user to select which sources to display on the map by means of the checkboxes along the left-hand side. Below the feeds menu, the “Diseases, last 30 days” box serves both to display the currently active diseases as well as to allow the user to select which diseases to display (Figure 3b). The “i” button brings up a menu with links to further information about the particular disease from the Wikipedia, WHO, CDC, PubMed, and Google Trends Web sites. In the next section, the “Alerts by country” box indicates the number of alerts active in each country for the currently selected parameters (Figure 3c). Clicking on a country name zooms the map view to that country for easy viewing of alerts in that location. The “Latest alerts” box displays the most recent alerts in reverse chronological order (Figure 3d). An icon next to each headline indicates the alert source.

Moving across to the map display window, the date slider at the bottom allows the user to control the date range of displayed alerts (Figure 3e). The end date is fixed as the current date, but the user can set the start date to any point in the previous thirty days. “Full Screen” mode expands the map to cover the full browser window, allowing for richer display and navigation (Figure 3f). It also allows for “situation room” use, allowing the user to display the map on a non-interactive screen and monitor ongoing alert activity.

On the map itself, the color of a marker indicates the Heat Index value for the location, with the deeper red color indicating more intense recent activity as contrasted with the paler yellow color.

Figure 3. User Interface.

Validation
Example Report Illustrating Classifier Operation
To illustrate the functioning of the system, we examine a sample report and how it is processed by the HealthMap classification engine. A local newspaper report concerning an outbreak of shigellosis at a school in Wisconsin enters the system via the Google News aggregator. The system begins by examining the article headline:
Journal of the American Medical Informatics Association Volume 15 Number 2 Mar / Apr 2008 155

Elementary School Deals with Outbreak of Bacteria Table 1 y Location and Disease Classifier
As there are no known patterns found for either location or Performance over the One Month Period from 10
disease, the classifier then progresses to the article “descrip- October 2007 to 9 November 2007
tion,” an extract provided by Google News: Source Total Edited Location Disease Accuracy
Elementary School Deals with Outbreak of Bacteria 58 minutes All 778 123 (16%) 87 (11%) 47 (6%) 84%
Smith A bacterial outbreak at a Fond du Lac school is prompting ProMED only 207 19 (9%) 14 (7%) 5 (2%) 91%
the district to alert parents and do some extra cleaning in hopes of Google News only 547 104 (19%) 73 (13%) 42 (8%) 81%
stopping the . . .
While there is an indication of the location provided in this
extract, “Fond du Lac” is currently not included in the binary classification metrics such as precision and recall to
dictionary, and therefore not recognized. Still lacking both measure its performance. However, because we curate all
location and disease information, the classifier examines the reports on a daily basis to correct misclassifications, we can
article body text, as prepared by the parsing engine from the examine various aspects of performance based on the
original HTML: changes performed.
WEB SEARCH BY A bacterial outbreak at a Fond du Lac school is prompting the district to alert parents and do some extra cleaning in hopes of stopping the bacteria from spreading. State health officials say there were 14 confirmed cases of shigellosis, a bacterial infection, in Fond du Lac County in the past three months. Five confirmed cases prompted Roberts Elementary School in Fond du Lac to notify parents. ”We want to get the information out to parents: Here it is and here are steps you can take,” Marian Sheridan, the Fond du Lac school health and safety coordinator said. The concern is that this infection is fast-spreading. Although the Wisconsin health department says 300 to 400 cases are reported each year, the uncomfortable abdominal cramps, fever, and diarrhea are symptoms no one wants running rampant through schools. ”I think we re getting the message out early enough, and I think that s one of the benefits of working with school districts staff to get the word out so we can contain it before it s widespread,” Joyce Mann of the Department of Health and Family Services said. ”Parents are used to the school sending them health notices, and it s never to alarm but it s rather to inform,” Sheridan said. ”Normally what we do is go in with a ten-percent bleach solution and everything gets wiped down—telephones, door knobs, desk chairs, desktops, the bathrooms are thoroughly gone through,” building and

As indicated in bold, the classifier now matches three different patterns in the text. The first identifies the disease category as Shigellosis; the second places the report in Wisconsin. The third match corresponds to the Diarrhea disease category, but based on the container relationships described above, the system correctly identifies Diarrhea as redundant with Shigellosis, and eliminates the former. At this point, the classifier has completed its work, and proceeds to the next report. Had it not identified both disease and location from the body text, it would have further examined the name of the publication as provided by Google News:

WBAY, WI

Upon processing of this text, it would also have identified the location based on the abbreviation WI, which is listed in the dictionary as a synonym of Wisconsin. However, in this particular case, the publication information is ignored as the classifier has already achieved matches using other components of the report.

Classifier Performance
Because the classification engine places alerts into many hundreds of different location and disease categories (currently over 700 total), as well as combinations of multiple categories of each type, it is not possible to apply traditional

At the most basic level, the accuracy of the classifier can be measured by the percentage of reports entering the system that need not have their disease or location classifications corrected in any way. At a more detailed level, we can examine the number of alerts requiring a correction of disease classification as compared with the number requiring a location correction. Table 1 provides a full breakdown of the classifier performance both by source and by disease and location. As shown, the overall accuracy of the system is 84%, thus correctly classifying 655 out of 778 reports over the one-month period from October 10 to November 9. As one might expect, performance on ProMED alerts, at 91%, is substantially better than on Google News reports (81%), as ProMED messages represent data curated specifically for disease outbreak reporting and follow a more regular structure.

There are, however, important limitations to this performance analysis. In particular, in some cases, the correction of the classification serves merely to shift between related categories, such as reclassifying Gastroenteritis as Norovirus, or UK as England. In other cases, the correction is more drastic, such as correcting Influenza to Equine Influenza, or Washington, DC to Washington State. Clearly the change is more significant in the latter cases, but we don’t capture this distinction in the current analysis. As it is difficult to capture rigorously, for the moment we take the most conservative view in computing accuracy. As part of our ongoing research, we are developing more fine-grained metrics.

Discussion
As HealthMap is still in the early stages of development, a number of important enhancements are either currently under development or in the planning stage. The primary design goal of HealthMap is to provide broad coverage of ongoing outbreaks without overwhelming the user. In the pursuit of improved coverage, we are exploring the use of other sources, including additional news aggregators (such as Yahoo News, Factiva, and LexisNexis), blogs, and veterinary news sources such as the World Organization for Animal Health (OIE). In pursuit of improved filtering, we are developing natural language processing techniques for additional automated data categorization, such as clustering similar reports, identifying specific outbreak pertinence, distinguishing discrete outbreaks from endemic activity, and identifying reports indicating the absence of disease or the end of a previously identified outbreak.
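The classification walk-through for the Fond du Lac report (three pattern matches, removal of Diarrhea as redundant with Shigellosis, and the publication name as a fallback) can be illustrated with a minimal sketch. This is not the HealthMap implementation: the dictionary contents, the `CONTAINS` mapping, and the function names `match_categories`, `drop_redundant`, and `classify` are all hypothetical, and whole-word, case-insensitive matching is an assumption made for brevity.

```python
import re

# Hypothetical, tiny pattern dictionary: text pattern -> (kind, category).
PATTERNS = {
    "shigellosis": ("disease", "Shigellosis"),
    "diarrhea": ("disease", "Diarrhea"),
    "wisconsin": ("location", "Wisconsin"),
    "wi": ("location", "Wisconsin"),  # abbreviation listed as a synonym
}

# Hypothetical container relationships: broad category -> specific categories
# that make the broad one redundant when both are matched.
CONTAINS = {"Diarrhea": {"Shigellosis"}}


def match_categories(text):
    """Return the disease and location categories whose patterns occur in text."""
    found = {"disease": set(), "location": set()}
    for pattern, (kind, category) in PATTERNS.items():
        # Assumed matching rule: whole words only, ignoring case.
        if re.search(r"\b" + re.escape(pattern) + r"\b", text, re.IGNORECASE):
            found[kind].add(category)
    return found


def drop_redundant(categories):
    """Drop a broad category when a more specific contained category matched."""
    return {c for c in categories if not (CONTAINS.get(c, set()) & categories)}


def classify(body, publication):
    """Classify a report, consulting the publication name only as a fallback."""
    found = match_categories(body)
    # The publication is examined only if the body lacks a disease or a location.
    if not (found["disease"] and found["location"]):
        extra = match_categories(publication)
        for kind in found:
            found[kind] |= extra[kind]
    return {kind: drop_redundant(cats) for kind, cats in found.items()}


body = ("There were 14 confirmed cases of shigellosis in Fond du Lac County. "
        "The Wisconsin health department says fever and diarrhea are symptoms.")
result = classify(body, "WBAY, WI")
print(result)
```

In this sketch the body text alone yields both a disease and a location, so the publication string is never consulted; a body that mentions only the disease would pick up Wisconsin from the abbreviation WI instead.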
156 FREIFELD ET AL., Internet-based Surveillance of Infectious Diseases

Figure 4. Geographic coverage of the HealthMap system.

As part of our own evaluation, as mentioned in the Formulation Process, an important goal of the system is to cover as broad a range of geography and disease as possible, without bias toward particular regions or pathogens. While the internal architecture of the system itself largely meets these goals (particularly as we add more geographical subdivisions around the world), the alert data we process and display leaves much to be improved. Because we currently rely heavily on the US edition of Google News for reports, the system is biased toward the US and Canada as well as other English-speaking countries around the world, as shown in Figure 4. To address this problem, we have developed a Spanish-language version of the system and are currently expanding to other languages and data sources as resources permit. However, given the uneven distribution of media and reporting resources around the world, we will continue to face this issue for the foreseeable future.

In addition to adding new capability, we are also working on improving the accuracy of the existing classifier, both by expanding the pattern dictionary and by improving the preparation module. We will add more locations and diseases, including administrative divisions for countries such as Indonesia, Brazil, Sudan and Mexico as well as major cities worldwide. For diseases, we will be adding more disease categories and refining our disease taxonomy, as well as tagging diseases with category metadata to allow for improved searching. We will also explore more advanced techniques such as fuzzy matching and Bayesian machine learning for improving the resolution and accuracy of our automated classification algorithms, as well as categorizing alerts by relevancy, clustering similar alerts, and extracting other useful attributes.27–29 On the human side, taking inspiration from the highly successful Wikipedia model,30 we plan to work with networks of experts to evaluate community collaboration as a mechanism for alert acquisition and classification.

As we expand functionality, performance will naturally become an increasing concern. We have a few optimizations in progress, such as moving to memory-based caching, more intelligent, “lazy” loading of the pattern dictionary, and better optimized database queries. We are also exploring ways to better employ client-side caching without overloading the browser.

On the frontend, we have plans to improve the user experience with added features and improved customization. Examples include keyword searching, RSS output, saved preferences, endemic background disease rates, notification messaging via email, and temporal visualization. (Notably, the EpiSPIDER system has already taken steps in this area, incorporating a timeline view of ProMED reports.17) We also plan to conduct a usability observation study, to gather feedback from our target demographic on priority features as well as how best to improve the HealthMap user interface.

Along with user-level evaluation, we are also working to develop more rigorous evaluation metrics for the integrated system, including its ability to cover a broad range of geography and pathogens, limit noise, detect outbreaks early, and accurately characterize alerts in each dimension of classification.

HealthMap is part of a new generation of disease surveillance systems that process unstructured and unclassified data sources. Comprehensive evaluation of these types of systems and data sources is also an important area and part of our ongoing and future research; collaboration with other disease tracking systems such as GPHIN, EpiSPIDER, MedISys, and Argus would enable an in-depth comparison.
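The fuzzy matching mentioned above as a planned enhancement to the pattern dictionary could look roughly like the following sketch. The text names the technique but gives no design, so everything here is an assumption: the `KNOWN_TERMS` entries, the `fuzzy_lookup` helper, and the 0.85 similarity cutoff are hypothetical, and Python's standard `difflib` module stands in for whatever method would actually be adopted.

```python
from difflib import get_close_matches

# Hypothetical dictionary entries: lowercase pattern -> disease category.
KNOWN_TERMS = {
    "shigellosis": "Shigellosis",
    "norovirus": "Norovirus",
    "gastroenteritis": "Gastroenteritis",
}


def fuzzy_lookup(token, cutoff=0.85):
    """Return the category whose pattern is closest to token, or None.

    The cutoff is an assumed similarity threshold between 0 and 1; a lower
    value tolerates more misspellings but produces more false matches.
    """
    matches = get_close_matches(token.lower(), list(KNOWN_TERMS), n=1, cutoff=cutoff)
    return KNOWN_TERMS[matches[0]] if matches else None


print(fuzzy_lookup("shigelosis"))  # misspelling, one letter dropped
print(fuzzy_lookup("school"))      # unrelated word
```

A threshold-based lookup of this kind trades recall against noise, which is the same tension between broad coverage and overwhelming the user that the design goal describes.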
Journal of the American Medical Informatics Association Volume 15 Number 2 Mar / Apr 2008 157

With that said, there are a few broad comparisons we can draw between systems. One key area is accessibility: HealthMap is freely available to the public, whereas some systems are currently closed systems, requiring either paid subscription or approved access. Another key area is in the use of automation. While we certainly perform manual curation in maintaining HealthMap, our goal is to maximize automation in order to leverage the human contribution. The value of a full-time staff of language and domain experts to read and analyze reports around the clock should also be addressed as part of a broader research initiative.6,7

Conclusion
The promise of HealthMap lies in its ability to extract useful, customizable messaging and views from a mass of unstructured data. While the site has already generated significant interest as a publicly available surveillance tool, many improvements remain to be made for it to be a truly useful resource for both public health professionals and the general public. In particular, adding more languages and expanding our usage of general data sources such as newspapers and blogs will increase coverage and further demonstrate the value of the visualization and filtering features. Moreover, only as time progresses, as more people use the system, and further significant outbreaks unfold in the global disease ecosystem, will we know the true potential of the software, and how best to improve it.

References
1. Grein TW, Kamara KB, Rodier G, Plant AJ, Bovier P, Ryan MJ, et al. Rumors of disease in the global village: outbreak verification. Emerg Infect Dis. 2000 Mar-Apr;6(2):97–102.
2. Heymann DL, Rodier GR. Hot spots in a wired world: WHO surveillance of emerging and re-emerging infectious diseases. Lancet Infect Dis. 2001 Dec;1(5):345–53.
3. Hiltz SR, Murray T. Structuring computer-mediated communication systems to avoid information overload. Communications of the ACM. 1985;28(7):680–9.
4. Berghel H. Cyberspace 2000: Dealing with information overload. Communications of the ACM. 1997;40(2):19–24.
5. Brownstein JS, Freifeld CC, Reis BY, Mandl KD. HealthMap: Internet-based emerging infectious disease intelligence. In: Institute of Medicine, editor. Infectious Disease Surveillance and Detection: Assessing the Challenges—Finding Solutions. Washington, DC; 2007. 183–204.
6. Holden C. Netwatch: Diseases on the move. Science. 2006 Dec 1;314(5804):1363d.
7. Captain S. Get your daily plague forecast. Wired News. Available at: http://www.wired.com/science/discoveries/news/2006/10/71961. Accessed Apr 4, 2007.
8. Larkin M. Technology and public health: Healthmap tracks global diseases. Lancet Infect Dis. 2007 Feb;7:91.
9. Mykhalovskiy E, Weir L. The Global Public Health Intelligence Network and early warning outbreak detection: a Canadian contribution to global public health. Can J Public Health. 2006 Jan-Feb;97(1):42–4.
10. Mawudeku A, Blench M. Global Public Health Intelligence Network (GPHIN). 7th Conference of the Association for Machine Translation in the Americas 2006. Available at: www.mt-archive.info/MTS-2005-Mawudeku.pdf. Accessed Apr 26, 2007.
11. Eysenbach G. SARS and population health technology. J Med Internet Res. 2003 Apr-Jun;5(2):e14.
12. Morse SS, Rosenberg BH, Woodall J. ProMED global monitoring of emerging diseases: design for a demonstration program. Health Policy. 1996 Dec;38(3):135–53.
13. Madoff LC. ProMED-mail: an early warning system for emerging diseases. Clin Infect Dis. 2004 Jul 15;39(2):227–32.
14. Madoff LC, Woodall JP. The internet and the global monitoring of emerging diseases: lessons from the first 10 years of ProMED-mail. Arch Med Res. 2005 Nov-Dec;36(6):724–30.
15. Health Threats Unit at Directorate General Health and Consumer Affairs of the European Commission. MedISys (Medical Intelligence System). Available at: http://medusa.jrc.it/. Accessed Apr 4, 2007.
16. Wilson J. Argus: Use of Indications and Warnings for Global Tactical Detection and Tracking of Biological Events. Georgetown Hosts 3rd Annual Conference on Infectious Disease; 2007; Washington, DC; 2007.
17. Tolentino H. Scanning the Emerging Infectious Diseases Horizon—Visualizing ProMED Emails Using EpiSPIDER. International Society for Disease Surveillance Annual Conference; 2006; Baltimore, MD; 2006.
18. Bernhardt JM. Centers for Disease Control and Prevention: Director’s Blog. Health Marketing Musings 2006. Available at: http://www.cdc.gov/healthmarketing/blog_101106.htm. Accessed Apr 17, 2007.
19. O’Reilly T. What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software 2007. Available at: http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html. Accessed Apr 17, 2007.
20. Cayzer S. Semantic blogging and decentralized knowledge management. Communications of the ACM. 2004;47(12):47–52.
21. Garrett JJ. Ajax: A New Approach to Web Applications. 2005. Available at: http://www.adaptivepath.com/publications/essays/archives/000385.php. Accessed May 10, 2007.
22. Paulson LD. Building Rich Web Applications with Ajax. Computer. 2005;38(10):14–7.
23. Berners-Lee T, Hendler J, Lassila O. The semantic Web. Scientific American. 2001;284(5):28–37.
24. Gratz NG. Emerging and resurging vector-borne diseases. Annu Rev Entomol. 1999;44:51–75.
25. Dobson A, Foufopoulos J. Emerging infectious pathogens of wildlife. Philos Trans R Soc Lond B Biol Sci. 2001 Jul 29;356(1411):1001–12.
26. Brownstein JS, Holford TR, Fish D. Enhancing National West Nile Virus Surveillance. Emerg Infect Dis. 2004; In press.
27. Zheng W, Milios E, Watters C. Filtering for medical news items using a machine learning approach. Proc AMIA Symp. 2002:949–53.
28. Chen H. Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. J Am Soc Inform Sci. 1999;46(3):194–216.
29. Ribeiro-Neto B, Laender AHF, deLima LRS. An Experimental Study in Automatically Categorizing Medical Documents. J Am Soc Inform Sci. 2001;52(5):391–401.
30. Giles J. Internet encyclopaedias go head to head. Nature. 2005 Dec 15;438(7070):900–1.
