Subset Scanning For Event and Pattern Detection
those regions which are most likely to represent anomalous spatial clusters. A typical approach
is to search over all regions of a given shape such as circles (Kulldorff 1997), space-time
cylinders (Kulldorff 2001), rectangles (Neill and Moore 2004), or ellipses (Kulldorff et al.
2006). These approaches perform well when the true spatial region (“cluster”) of interest is
well approximated by the set of search regions but suffer from reduced detection power
otherwise. For example, the original spatial scan approach (Kulldorff 1997), searching over
circular regions, loses power for highly elongated clusters, and all of the above approaches
lose power for clusters with highly irregular shapes. Many recent approaches search over larger
sets of irregularly shaped regions, such as subsets of locations connected by spatial adjacency
(Patil and Taillie 2004; Tango and Takahashi 2005; Duczmal and Assuncao 2004). Most of these
approaches perform approximate rather than exhaustive searches over the chosen set of
irregularly shaped regions, using some heuristic optimization approach such as simulated
annealing (Duczmal and Assuncao 2004) or genetic algorithms (Duczmal et al. 2007), and thus do
not guarantee that an optimal or even near-optimal region will be found. Alternative approaches
search exhaustively over a much smaller set of regions based on upper level sets (Patil and
Taillie 2004) or spanning trees (Costa et al. 2012), but again, these approaches may fail to
identify the highest-scoring subset. Finally, the FlexScan approach of Tango and Takahashi
(2005) performs an exhaustive search over connected subsets within the local neighborhood of
each spatial location but is computationally expensive and does not scale to even moderately
large neighborhood sizes. These limitations of previous methods led to the development of the
fast subset scan (Neill 2012), enabling exact and efficient search over subsets of locations.
Recent extensions of fast subset scanning allow incorporation of various constraints including
spatial proximity and connectivity. Moreover, empirical comparisons suggest that fast subset
scanning approaches outperform competing methods with respect to both computational efficiency
(scaling to much larger datasets) and detection power.

Scientific Fundamentals

Additional Motivation for the Subset Scanning Approach

As noted above, subset scanning originated from the spatial statistics literature, building on
Kulldorff’s spatial scan approach (Kulldorff 1997) in order to accurately detect irregularly
shaped spatial clusters or subsets satisfying other relevant constraints (e.g., anomalous
subgraphs of a larger real-world network). It has since been generalized beyond the spatial
domain to identify subsets of similar data records for which some subset of attributes are
anomalous (McFowland III et al. 2013), as well as to many other domains, e.g., detecting
patterns in massive image data (Somanchi and Neill 2013) and detecting events using online
social network data such as Twitter (Chen and Neill 2014).

From a machine learning and data mining perspective, the idea of detecting subsets that are
collectively interesting or anomalous is a natural extension of typical, single record-based
anomaly detection approaches (Das et al. 2008). However, most previous approaches to this
problem are either heuristic search methods (which are not guaranteed to find optimal or
approximately optimal subsets), top-down detection methods (which search for globally
interesting patterns and then drill down to more carefully investigate the most interesting
sub-partitions), or bottom-up detection approaches (which find individually anomalous data
points and then aggregate them into clusters). However, top-down methods often fail to detect
small-scale patterns that may not be evident from the global aggregate statistics, while
bottom-up methods can fail for subtle patterns that are only evident when a sufficiently large
group of data records are considered collectively. For example, a subtle increase in the number
of emergency department visits in several nearby hospitals may be indicative of an emerging
disease outbreak, but this signal might not be visible when observing only a single hospital or
when
exponential family, with mean equal to $b_i^t$. The alternative hypothesis $H_1(S)$ assumes a
constant multiplicative increase $q > 1$ for all expected counts in the region $S$. For example,
for the expectation-based Poisson (EBP) scan statistic (Neill 2009b), we have
$c_i^t \sim \text{Poisson}(b_i^t)$ everywhere under $H_0$. For $H_1(S)$, we have
$c_i^t \sim \text{Poisson}(q b_i^t)$ for all counts inside region $S$ and
$c_i^t \sim \text{Poisson}(b_i^t)$ outside $S$. The maximum likelihood estimate of $q$ is
$\max\left(1, \frac{C(S)}{B(S)}\right)$, where $C(S) = \sum_S c_i^t$ and $B(S) = \sum_S b_i^t$.
Plugging this value of $q$ and the Poisson likelihoods into the equation above results in the
EBP score function:

$$F_{EBP}(S) = C(S) \log \frac{C(S)}{B(S)} + B(S) - C(S),$$

if $C(S) > B(S)$ and $F(S) = 0$ otherwise. The EBP score function is slightly different than
Kulldorff’s original Poisson scan statistic (Kulldorff 1997) and has the advantage of high
detection power for both large and small clusters, while Kulldorff’s statistic loses detection
power for large clusters (Neill 2009a). More generally, for expectation-based scan statistics
in a separable exponential family, including the Poisson, Gaussian, and exponential
distributions, the score function can be written in the form
$F(S) = B(S)\, D\left(\frac{C(S)}{B(S)}, 1\right)$, where $D$ is a Bregman divergence. See Neill
(2012) for more details.

Linear-Time Subset Scanning and the Fast Subset Scan

As noted above, typical spatial and space-time scan approaches suffer from reduced detection
power when the true affected subset of locations does not correspond well to the set of search
regions, e.g., for elongated or irregularly shaped clusters. Detection power can be
substantially improved by optimizing the score function $F(S)$ over all subsets $S$, typically
with additional constraints (such as spatial proximity) to ensure that the discovered clusters
are feasible solutions to the problem under consideration. However, an exhaustive search over
subsets, evaluating the score function $F(S)$ for all $2^N$ subsets
$S \subseteq \{s_1 \ldots s_N\}$, quickly becomes computationally infeasible. This motivated the
development of fast subset scanning approaches designed to find the highest-scoring subset
$S^* = \arg\max_S F(S)$ without an exhaustive search.

The key idea underlying these approaches, as described in Neill (2012), is that many relevant
score functions satisfy a property (linear-time subset scanning or LTSS) that allows efficient
optimization over subsets, by sorting the spatial locations $s_i$ according to some “priority”
function $G(s_i)$ and evaluating only those subsets consisting of the top-$j$ highest priority
locations, for $j = 1 \ldots N$. For functions satisfying the LTSS property,
$\max_S F(S) = \max_j F(\{s_{(1)} \ldots s_{(j)}\})$, where $s_{(j)}$ is the $j$th highest
priority location, and thus, we are guaranteed that the highest scoring of all $2^N$ subsets
will be one of the $N$ subsets that are evaluated. This fast subset scan approach dramatically
reduces computation time while still guaranteeing an exact solution to the unconstrained (all
subsets) search problem. For example, the highest-scoring subset of 97 zip codes in Allegheny
County, Pennsylvania, can be found in approximately 40 ms, while an exhaustive search would
require about $10^{20}$ years (Neill 2012). The fact that an exact, rather than approximate,
solution is found makes fast subset scanning fundamentally different from other approaches
based on submodular function optimization (Leskovec et al. 2007), which produce provably good
approximations but do not necessarily identify the optimal subset.

The linear-time subset scanning property has been shown to hold for many useful score
functions, including both parametric log-likelihood ratio statistics (such as the
expectation-based scan statistics and Kulldorff’s original spatial scan statistic) and
nonparametric scan statistics (McFowland III et al. 2013). Following Neill (2012), we consider
three different conditions, each of which is sufficient for the LTSS property to hold:

• Let $F(S) = F(X(S), Y(S))$ be a convex (or quasi-convex) function of two additive,
some fixed neighborhood size $k$. Fast localized scan performs a separate, unconstrained search
over subsets for each of the $N$ neighborhoods formed in this way, reducing the computational
complexity of searching each neighborhood from $O(2^k)$ to $O(k)$ using the LTSS property. Thus,
the overall complexity is reduced from exponential to $O(Nk + N \log N)$, where the first term
describes the complexity of searching over the $N$ neighborhoods and the second term
corresponds to the initial step of sorting the $N$ locations by priority. If a good choice of
neighborhood size $k$ is not known, an alternative is the fast multiscan (Neill 2012), which
compares the penalized scores $\max_S F(S \mid k) - \lambda k$ for all neighborhood sizes
$k = 1 \ldots N$ and some constant $\lambda > 0$. Given labeled training data, the value of $k$
for fast localized scan or the value of $\lambda$ for fast multiscan can be chosen by
cross-validation.

Neill (2012) examined the detection power and spatial accuracy of the fast localized scan and
fast multiscan approaches as compared to the traditional Kulldorff’s spatial scan (searching
over circular regions), using simulated disease outbreaks injected into real-world hospital
emergency department data. The proximity-constrained subset scans substantially improved the
timeliness and accuracy of detection, detecting 2 days faster with fewer than half as many
missed outbreaks.

For the extension of linear-time subset scanning to graph or network data, we monitor one or
more data streams at each node of the graph and wish to detect the most anomalous subset of
nodes subject to the graph connectivity constraints (i.e., the given subset of nodes must form
a connected subgraph of the original graph). For spatial data, the graph edges could represent
spatial adjacency or travel patterns, but this framework also enables analysis of nonspatial
network data. As noted above, exact optimization of the score function $F(S)$ over connected
subgraphs is difficult: the FlexScan approach (Tango and Takahashi 2005) performs an exhaustive
search and thus does not scale beyond 25 or 30 nodes, while other approaches (Patil and Taillie
2004; Duczmal and Assuncao 2004; Duczmal et al. 2007; Costa et al. 2012) are not guaranteed to
find the highest-scoring subgraph. While the highest-scoring subset found by unconstrained LTSS
may be disconnected, making it challenging to apply LTSS directly for optimization with
connectivity constraints, an alternate formulation of the LTSS property can be used to speed up
the search. As noted above, in the unconstrained case, we can prove that a subset $S$ is
suboptimal if there exist locations $s_{in} \in S$ and $s_{out} \notin S$ where
$G(s_{in}) < G(s_{out})$. When optimizing over connected subgraphs instead of all subsets, a
variant of this property still applies, but subgraph $S$ is only provably suboptimal if the
resulting subgraphs $S \cup \{s_{out}\}$ and $S \setminus \{s_{in}\}$ remain connected.

This property was recently incorporated into a depth-first search procedure, the GraphScan
algorithm (Speakman et al. 2014a). By identifying and pruning paths that are provably
suboptimal, GraphScan can rule out large numbers of subsets without evaluating each one
individually. This approach dramatically reduces the size of the search space and the resulting
computation time. Additional speed improvements are obtained by branch and bounding, using the
unconstrained maximum score (which is efficiently computable using LTSS) as an upper bound on
the maximum score of connected subgraphs and ruling out large numbers of subsets with upper
bounds less than the highest-scoring subgraph found so far. See Speakman et al. (2014a) for
details. The resulting GraphScan algorithm, like FlexScan (Tango and Takahashi 2005), still
requires exponential computation time in the worst case, but it scales to graphs an order of
magnitude larger than FlexScan, with a 450,000x speedup for graphs of size 30 (Speakman et al.
2014a). Moreover, GraphScan still identifies the highest-scoring subgraph: it is an exact,
rather than approximate, algorithm. An alternative approach, Additive GraphScan (Speakman et
al. 2013), can only be used for additive score functions and is not guaranteed to find the
highest-scoring subgraph. However, Additive GraphScan can scale to graphs with tens of
thousands of nodes and identifies near-optimal subsets with high probability in practice
(Speakman et al. 2013).

Another recent extension of the fast subset scanning framework, the Dynamic Subset Scan
(Speakman et al. 2013), focuses on
detecting dynamic patterns, where the affected subset of locations can grow, shrink, or move
over time. Typical space-time scan approaches (Kulldorff 2001; Neill et al. 2005) search over
space-time cylinders, assuming that the affected subset of locations remains constant for the
duration of the event. However, this over-constrained approach leads to reduced detection power
for dynamically evolving events, as well as failing to accurately capture the event dynamics.
An alternative, under-constrained approach of performing independent spatial scans for each
time step results in identified patterns that display unrealistic temporal trends (e.g.,
affecting the east side of the city on day 1, the west side on day 2, and back to the east side
on day 3). Thus, the Dynamic Subset Scan optimizes the score function $F(S)$ over subsets of
locations at each time step while enforcing temporal consistency constraints, considering the
patterns detected at adjacent time steps, and rewarding patterns that are not dramatically
different between time steps $t$ and $t + 1$. This approach allows the spatial extent of an
event to evolve smoothly over time while penalizing unrealistic event dynamics.

Unlike the fast subset scanning approaches described above, which enforce hard constraints and
thus rule out some subsets from consideration, Dynamic Subset Scan enforces soft constraints,
which can be interpreted as applying bonuses or penalties to the score function for including
or excluding certain locations. This is a specific case of the more general, penalized fast
subset scanning (PFSS) framework described by Speakman et al. (2014b). Incorporating penalties
is difficult because a penalized version of the score function may not satisfy the LTSS
property. However, Speakman et al. (2014b) show that any expectation-based scan statistic in
the exponential family can be written as an additive function conditional on the relative risk
parameter $q$. Only a linear number of ranges for $q$ must be considered, and optimization over
subsets for each $q$ range is very efficient. Moreover, this formulation allows bonuses or
penalties for each location to be incorporated into the score function while maintaining
efficient optimization. In the Dynamic Subset Scan, PFSS is used to optimize over all subsets
of locations for a given time step, with location-specific bonuses or penalties based on the
detected subsets for the previous and next time steps, and incorporating a flexible, generative
model of event propagation. This efficient conditional optimization step is iterated until
convergence, thus propagating information both backward and forward in time. Connectivity
constraints can also be incorporated into the Dynamic Subset Scan framework, requiring the use
of Additive GraphScan (rather than simply including all locations that make a positive
contribution to the score) for each step. Speakman et al. (2013) applied the Dynamic Subset
Scan (with connectivity and temporal consistency constraints) to detection, tracking, and
source tracing of spreading contaminants in a water distribution network. Dynamic Subset Scan
demonstrated earlier detection of contamination events and more accurate identification of the
affected subset of nodes through time.

Multivariate Fast Subset Scanning

While the fast subset scan approaches described above focus on the univariate case, monitoring
a single spatiotemporal data stream, these approaches can also be extended to the multivariate
case. The multivariate fast subset scan (Neill et al. 2013) can be used to monitor multiple
streams of space-time data, identifying subsets of streams where the recently observed counts
are significantly higher than expected. Similarly, the Fast Generalized Subset Scan (FGSS) can
be used to discover patterns in general multivariate datasets, identifying subsets of similar
data records with anomalous values for some subset of attributes (McFowland III et al. 2013).
The key idea for both approaches is similar: linear-time subset scanning can be used for
efficient optimization over subsets of locations (or records) for a given subset of streams (or
attributes) but can also be used for efficient optimization over subsets of streams (or
attributes) for a given subset of locations (or records). Thus, we can iterate between these
two efficient conditional optimization steps until a local maximum of the score function is
reached and perform multiple restarts in order to approach the global maximum.

The multivariate fast subset scan builds on the univariate fast subset scanning approach,
jointly optimizing a parametric log-likelihood ratio statistic (such as the expectation-based
Poisson statistic described above) over proximity-constrained subsets of locations and over all
subsets of data streams. The most natural formulation of the multivariate scan statistic in
this setting, subset aggregation, assumes a constant multiplicative increase across all
affected streams and thus adds counts and baselines across the monitored subset of streams. An
alternative formulation by Kulldorff et al. (2007) proposes adding log-likelihood ratios across
streams (assuming that the data streams are conditionally independent). Neill et al. (2013)
demonstrate that Kulldorff’s multivariate scan can also be made efficient using the LTSS
property, by iterating between two steps: optimizing over subsets of records (for given values
of the multiplicative effect of the event on each data stream) and recalculating the maximum
likelihood values of the event’s effects for the given subset of records. Regardless of which
formulation is used, the multivariate fast subset scan is computationally efficient and scales
to large numbers of locations and streams. Moreover, significant gains in detection power and
spatial accuracy were observed when searching over subsets of data streams and when detecting
proximity-constrained subsets of locations rather than searching over circular regions (Neill
et al. 2013).

The FGSS approach (McFowland III et al. 2013) does not assume space-time data but instead
considers an arbitrary set of attributes measured for each of a large set of data records.
Nonparametric scan statistics are used to convert the disparate attributes to the same scale
(empirical p-values between 0 and 1) and to integrate these values in a principled statistical
framework. FGSS consists of four steps: (1) efficiently learning a Bayesian network model that
represents the assumed null distribution of the data; (2) computing the conditional probability
of each attribute value in the dataset given the Bayes Net, conditioned on the other attribute
values for that record; (3) computing an empirical p-value range corresponding to each
attribute value by ranking the conditional probabilities, where under the null hypothesis we
expect empirical p-values to be uniformly distributed on [0,1]; and (4) using a nonparametric
scan statistic to detect subsets of records and attributes with an unexpectedly large number of
low (significant) empirical p-values. The final step is computationally expensive (exponential
in the numbers of records and attributes for a naive search), but LTSS can be used to speed up
this search, converging to a local maximum of the score function and ensuring that each
iteration step is linear (not exponential) in the number of records or attributes. FGSS was
shown to consistently outperform previously proposed methods in terms of detection power and
characterization accuracy across multiple application domains, and scales to much larger
datasets, thus enabling accurate and efficient pattern detection in massive, high-dimensional
data.

Key Applications

One important real-world application of subset scanning is in the area of disease surveillance,
where we attempt to detect emerging outbreaks of disease in their very early stages by
identifying anomalous clusters of disease cases. In the multivariate disease surveillance
setting, we monitor a set of data streams $D_m$ ($m = 1 \ldots M$) on a regular (e.g., daily or
hourly) basis at a set of spatial locations (e.g., zip codes) $s_i$ ($i = 1 \ldots N$). For
each combination of location and data stream, we have a time series of observed counts
$c_{i,m}^t$, where each count $c_{i,m}^t$ could represent the number of observed cases of a
given type (e.g., emergency department visits with respiratory complaints) in a given zip code
on a given day. A typical goal in this setting is to identify spatial regions (subsets of
locations) where some subset of the monitored data streams have recent counts that are higher
than expected. Here, the expected counts $b_{i,m}^t$ are obtained through time series analysis
of historical data and can account for trends such as the day of week, seasonality, holidays,
and known events. Multiple variants of subset
scanning have been evaluated on the disease surveillance task, typically through semisynthetic
testing (in which simulated outbreaks are injected into real-world background data). Searching
over proximity-constrained subsets of locations to identify irregularly shaped spatial clusters
(Neill 2012) enables earlier and more accurate outbreak detection, as measured by the average
number of days to detect and proportion of outbreaks detected for a given false-positive rate,
as well as the spatial overlap between true and identified outbreak regions. Further
improvements in detection power can be obtained by integrating information from multiple health
data streams (Neill et al. 2013), searching over subsets of streams as well as
proximity-constrained subsets of locations. This approach also helps to characterize outbreaks
by identifying the affected subset of streams, and a further extension, the “multidimensional
subset scan” (Neill and Kumar 2013) can also identify differentially affected subpopulations
(e.g., by gender, age, socioeconomic status, or behavioral risk factors).

Incorporating other constraints into the subset scan, such as graph connectivity and temporal
consistency (Speakman et al. 2013, 2014a) or similarity between records in general datasets
(McFowland III et al. 2013), enables a wide variety of other applications to be addressed using
this framework. For example, Speakman et al. (2013) demonstrate improved performance for
detecting, tracking, and source tracing contamination events spreading through a water
distribution system. McFowland III et al. (2013) evaluate their approach on outbreak detection,
customs monitoring of container shipments, and computer network intrusion detection,
demonstrating improvements over the current state of the art in all three application domains.
Finally, recent extensions of subset scanning have been applied to detect patterns in massive,
complex real-world datasets such as images (Somanchi and Neill 2013), text (Nobles et al.
2014), and online social networks such as Twitter (Chen and Neill 2014). Somanchi and Neill
(2013) demonstrate that their approach can be used to accurately detect prostate cancer and
identify other regions of potential interest in digital pathology slides. Nobles et al. (2014)
apply the subset scan approach to identifying emerging “novel” disease outbreaks with
previously unseen or anomalous patterns of symptoms, using free-text emergency department chief
complaint data. Chen and Neill (2014) developed a new approach to event detection in
heterogeneous social media graphs and applied this approach to advance prediction of civil
unrest events (strikes, protests, and riots) and early warning for rare disease outbreaks
(hantavirus), using Twitter data from Latin America. In both application domains, their
approach outperformed five competing, state-of-the-art methods for both event detection and
forecasting, increasing detection power, forecasting accuracy, and forecasting lead time while
reducing time to detection (Chen and Neill 2014).

Future Directions

While subset scanning is a rapidly emerging and highly promising field, a number of challenging
open problems remain to be addressed. One avenue for future research is continuing to extend
the range of score functions for which the linear-time subset scanning property can be proven
to hold, as well as the range of constraints that can be incorporated into the fast subset scan
framework while still allowing computationally efficient and scalable solutions. These
extensions have the potential to expand the use of subset scanning for a variety of real-world
applications requiring analysis of massive, complex datasets. A second important direction is
gaining a better understanding of the statistical properties of subset scanning, for example,
identifying necessary and sufficient conditions for which the highest-scoring subset converges
to the true affected subset or a provably good approximation or quantifying the detection power
of constrained subset scans as a function of how well the chosen constraints correspond to the
true pattern of interest. Finally, in many cases, event detection can be thought of as
integrating information from
many noisy sensors. This sensor fusion problem, assuming a given set of noisy sensors,
complements the sensor placement problem, in which sensors are often assumed to be perfect and
the focus is on optimally placing sensors in space or on a network. Another useful property,
submodularity, can be used to efficiently find near-optimal solutions to a variety of sensor
placement problems (Leskovec et al. 2007), and it is an open problem whether the submodularity
and linear-time subset scanning properties can be effectively combined to solve problems
requiring both placement of, and integration of data from, noisy sensors. One example
application where this might be useful is in the crowdsourced collection of environmental and
ecological data (e.g., observations of plant and animal species or measurement of air, water,
and soil quality) by “citizen scientists.” In this case, data quality varies considerably based
on individuals’ expertise, and the accuracy of the data might be substantially improved by
asking individuals with relevant expertise if they are willing to perform analyses of specific
types or in specific locations.

Cross-References

Hotspot Detection, Prioritization, and Security
Irregular Shaped Spatial Clusters: Detection and Inference
Linear Anomalous Window
Movement Patterns in Spatio-Temporal Data
Public Health and Spatial Modeling

References

Chen F, Neill DB (2014) Non-parametric scan statistics for event detection and forecasting in
heterogeneous social media graphs. In: Proceedings of the 20th ACM SIGKDD conference on
knowledge discovery and data mining, New York, pp 1166–1175
Costa MA, Assuncao RM, Kulldorff M (2012) Constrained spanning tree algorithms for
irregularly-shaped spatial clustering. Comput Stat Data Anal 56(6):1771–1783
Das K, Schneider J, Neill DB (2008) Anomaly pattern detection in categorical datasets. In:
Proceedings of the 14th ACM SIGKDD conference on knowledge discovery and data mining, Las
Vegas, pp 169–176
Duczmal L, Assuncao R (2004) A simulated annealing strategy for the detection of arbitrary
shaped spatial clusters. Comput Stat Data Anal 45:269–286
Duczmal L, Cancado A, Takahashi R, Bessegato L (2007) A genetic algorithm for irregularly
shaped scan statistics. Comput Stat Data Anal 52(1):43–52
Kulldorff M (1997) A spatial scan statistic. Commun Stat Theory Methods 26(6):1481–1496
Kulldorff M (2001) Prospective time-periodic geographical disease surveillance using a scan
statistic. J R Stat Soc A 164:61–72
Kulldorff M, Huang L, Pickle L, Duczmal L (2006) An elliptic spatial scan statistic. Stat Med
25:3929–3943
Kulldorff M, Mostashari F, Duczmal L, Yih WK, Kleinman K, Platt R (2007) Multivariate scan
statistics for disease surveillance. Stat Med 26:1824–1833
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective
outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD conference on knowledge
discovery and data mining, San Jose, pp 420–429
McFowland III E, Speakman S, Neill DB (2013) Fast generalized subset scan for anomalous pattern
detection. J Mach Learn Res 14:1533–1561
Naus JI (1965) The distribution of the size of the maximum cluster of points on the line. J Am
Stat Assoc 60:532–538
Neill DB (2009a) An empirical comparison of spatial scan statistics for outbreak detection. Int
J Health Geogr 8:20
Neill DB (2009b) Expectation-based scan statistics for monitoring spatial time series data. Int
J Forecast 25:498–517
Neill DB (2012) Fast subset scan for spatial pattern detection. J R Stat Soc Ser B Stat
Methodol 74(2):337–360
Neill DB, Kumar T (2013) Fast multidimensional subset scan for outbreak detection and
characterization. Online J Publ Health Inf 5(1):156
Neill DB, Moore AW (2004) Rapid detection of significant spatial clusters. In: Proceedings of
the 10th ACM SIGKDD conference on knowledge discovery and data mining, Seattle, pp 256–265
Neill DB, Moore AW, Sabhnani M, Daniel K (2005) Detection of emerging space-time clusters. In:
Proceedings of the 11th ACM SIGKDD conference on knowledge discovery and data mining, Chicago,
pp 218–227
Neill DB, McFowland III E, Zheng H (2013) Fast subset scan for multivariate event detection.
Stat Med 32:2185–2208
Nobles M, Deyneka L, Ising A, Neill DB (2015) Identifying emerging novel outbreaks in textual
emergency department data. Online J Publ Health Inf 7(1):e45
Patil GP, Taillie C (2004) Upper level set scan statistic for detecting arbitrarily shaped
hotspots. Environ Ecol Stat 11:183–197
Somanchi S, Neill DB (2013) Discovering anomalous patterns in large digital pathology images.
In: Proceedings of the 8th INFORMS workshop on data mining and health informatics, Minneapolis