Subset Scanning for Event and Pattern Detection

Daniel B. Neill
Event and Pattern Detection Laboratory, H.J. Heinz III College, Carnegie Mellon University, Pittsburgh, PA, USA

Synonyms

Fast subset scan; GraphScan; Linear-time subset scanning

Definition

Subset scanning is an accurate and computationally efficient framework for detecting events and other patterns in both spatial and nonspatial datasets, through constrained optimization of a score function (e.g., a likelihood ratio statistic) over subsets of the data. Many score functions of interest satisfy the linear-time subset scanning property (Neill 2012), enabling exact and efficient optimization over subsets. This efficient unconstrained optimization step, the fast subset scan, can be used as a building block for scalable solutions to event and pattern detection problems incorporating a variety of real-world constraints.

Historical Background

The spatial and space-time scan statistics (Kulldorff 1997, 2001), building on earlier work on scan statistics by Naus (1965) and others, are powerful and widely used methods for event detection in spatiotemporal data. These methods evaluate a score function $F(S)$, typically a likelihood ratio statistic, over a large set of spatial or space-time regions $S$, identifying
those regions which are most likely to represent anomalous spatial clusters. A typical approach is to search over all regions of a given shape such as circles (Kulldorff 1997), space-time cylinders (Kulldorff 2001), rectangles (Neill and Moore 2004), or ellipses (Kulldorff et al. 2006). These approaches perform well when the true spatial region (“cluster”) of interest is well approximated by the set of search regions but suffer from reduced detection power otherwise. For example, the original spatial scan approach (Kulldorff 1997), searching over circular regions, loses power for highly elongated clusters, and all of the above approaches lose power for clusters with highly irregular shapes. Many recent approaches search over larger sets of irregularly shaped regions, such as subsets of locations connected by spatial adjacency (Patil and Taillie 2004; Tango and Takahashi 2005; Duczmal and Assuncao 2004). Most of these approaches perform approximate rather than exhaustive searches over the chosen set of irregularly shaped regions, using some heuristic optimization approach such as simulated annealing (Duczmal and Assuncao 2004) or genetic algorithms (Duczmal et al. 2007), and thus do not guarantee that an optimal or even near-optimal region will be found. Alternative approaches search exhaustively over a much smaller set of regions based on upper level sets (Patil and Taillie 2004) or spanning trees (Costa et al. 2012), but again, these approaches may fail to identify the highest-scoring subset. Finally, the FlexScan approach of Tango and Takahashi (2005) performs an exhaustive search over connected subsets within the local neighborhood of each spatial location but is computationally expensive and does not scale to even moderately large neighborhood sizes. These limitations of previous methods led to the development of the fast subset scan (Neill 2012), enabling exact and efficient search over subsets of locations. Recent extensions of fast subset scanning allow incorporation of various constraints including spatial proximity and connectivity. Moreover, empirical comparisons suggest that fast subset scanning approaches outperform competing methods with respect to both computational efficiency (scaling to much larger datasets) and detection power.

Scientific Fundamentals

Additional Motivation for the Subset Scanning Approach

As noted above, subset scanning originated from the spatial statistics literature, building on Kulldorff’s spatial scan approach (Kulldorff 1997) in order to accurately detect irregularly shaped spatial clusters or subsets satisfying other relevant constraints (e.g., anomalous subgraphs of a larger real-world network). It has since been generalized beyond the spatial domain to identify subsets of similar data records for which some subset of attributes is anomalous (McFowland III et al. 2013), as well as to many other domains, e.g., detecting patterns in massive image data (Somanchi and Neill 2013) and detecting events using online social network data such as Twitter (Chen and Neill 2014).

From a machine learning and data mining perspective, the idea of detecting subsets that are collectively interesting or anomalous is a natural extension of typical, single record-based anomaly detection approaches (Das et al. 2008). However, most previous approaches to this problem are either heuristic search methods (which are not guaranteed to find optimal or approximately optimal subsets), top-down detection methods (which search for globally interesting patterns and then drill down to more carefully investigate the most interesting sub-partitions), or bottom-up detection approaches (which find individually anomalous data points and then aggregate them into clusters). Top-down methods often fail to detect small-scale patterns that may not be evident from the global aggregate statistics, while bottom-up methods can fail for subtle patterns that are only evident when a sufficiently large group of data records is considered collectively. For example, a subtle increase in the number of emergency department visits in several nearby hospitals may be indicative of an emerging disease outbreak, but this signal might not be visible when observing only a single hospital or when
aggregating visit counts across all hospitals in a given area.

Of course, subset scanning creates both statistical and computational challenges, the most serious of which is the computational infeasibility of exhaustively searching over the exponentially many subsets. This computational challenge has been addressed by the fast subset scan approach described below, which exploits the “linear-time subset scanning” property of many commonly used score functions to perform exact and efficient search over subsets. A second, statistical challenge is that multiple testing may result in large numbers of false positives, particularly when searching over a huge number of subsets. Randomization testing can be used within the spatial and subset scanning framework to bound the overall number of false positives under the null hypothesis $H_0$ but is computationally expensive and can still result in high false-positive rates when $H_0$ is mis-specified. An alternative approach that mitigates these problems is empirical calibration using historical background data (Neill 2009a; Chen and Neill 2014). In either case, an under-constrained search over an excessive number of subsets can lead to higher threshold scores required for detection at a given false-positive rate, reducing detection power, while an over-constrained search loses detection power whenever the true affected region falls outside the search space. This motivates the recent development of constrained fast subset scan approaches that can incorporate relevant real-world constraints such as spatial proximity and graph connectivity, leading to substantially improved detection power.

Likelihood Ratio Statistics and the Spatial Scan

In a typical formulation of the multivariate event detection problem, we are given a dataset $D$ consisting of multiple data streams $D_m$ ($m = 1 \ldots M$) monitored at a set of spatial locations $s_i$ ($i = 1 \ldots N$). For each combination of stream $D_m$ and location $s_i$, we are given a time series of observed counts $c_{i,m}^t$. In the expectation-based scan statistic framework (Neill et al. 2005; Neill 2009b), we also infer an expected count $b_{i,m}^t$ corresponding to each observed count $c_{i,m}^t$, typically by time series analysis of historical data. Given all of this data, a typical goal is to identify spatial regions where the recent counts for some subset of the monitored data streams are significantly higher than expected. For example, in disease surveillance, we monitor a variety of health-related data streams, such as emergency department visits (for different symptom categories) and over-the-counter medication sales (for different product categories), and search for spatial areas with emerging overdensities of disease.

We focus first on the univariate case, in which we have only a single monitored data stream ($M = 1$). The spatial and space-time scan statistics are commonly used methodological approaches to this problem, where we define a set of search regions and evaluate a score function $F(S)$ for each region $S$. The highest-scoring regions are considered to be the most likely clusters, and statistical significance of each region can be determined by randomization testing; see Kulldorff (1997) for details. In the original spatial scan approach (Kulldorff 1997), the set of search regions is assumed to be the $N^2$ distinct circular regions centered at each of the $N$ locations and consisting of that location and its $k - 1$ nearest neighbors, for $k = 1 \ldots N$. For the space-time scan (Kulldorff 2001), the time duration $W$ of each cluster is also allowed to vary between 1 and some maximum temporal window size $W_{max}$, resulting in a larger set of $N^2 W_{max}$ cylindrical space-time regions. The score function is typically a log-likelihood ratio statistic that incorporates parametric models of how counts are generated both under the null hypothesis $H_0$, assuming no clusters, and the alternative hypothesis $H_1(S)$, assuming a cluster in region $S$. Given these models, the log-likelihood ratio score is defined as:

$$F(S) = \log \frac{\Pr(D \mid H_1(S))}{\Pr(D \mid H_0)}.$$

For the expectation-based scan statistics (Neill et al. 2005; Neill 2009b), the null hypothesis $H_0$ assumes that each count $c_i^t$ is drawn from some parametric distribution in a single-parameter
exponential family, with mean equal to $b_i^t$. The alternative hypothesis $H_1(S)$ assumes a constant multiplicative increase $q > 1$ for all expected counts in the region $S$. For example, for the expectation-based Poisson (EBP) scan statistic (Neill 2009b), we have $c_i^t \sim \mathrm{Poisson}(b_i^t)$ everywhere under $H_0$. For $H_1(S)$, we have $c_i^t \sim \mathrm{Poisson}(q b_i^t)$ for all counts inside region $S$ and $c_i^t \sim \mathrm{Poisson}(b_i^t)$ outside $S$. The maximum likelihood estimate of $q$ is $\max(1, C(S)/B(S))$, where $C(S) = \sum_S c_i^t$ and $B(S) = \sum_S b_i^t$. Plugging this value of $q$ and the Poisson likelihoods into the equation above results in the EBP score function:

$$F_{EBP}(S) = C(S) \log \frac{C(S)}{B(S)} + B(S) - C(S),$$

if $C(S) > B(S)$ and $F(S) = 0$ otherwise. The EBP score function is slightly different from Kulldorff’s original Poisson scan statistic (Kulldorff 1997) and has the advantage of high detection power for both large and small clusters, while Kulldorff’s statistic loses detection power for large clusters (Neill 2009a). More generally, for expectation-based scan statistics in a separable exponential family, including the Poisson, Gaussian, and exponential distributions, the score function can be written in the form $F(S) = B(S)\, D\!\left(\frac{C(S)}{B(S)}, 1\right)$, where $D$ is a Bregman divergence. See Neill (2012) for more details.

Linear-Time Subset Scanning and the Fast Subset Scan

As noted above, typical spatial and space-time scan approaches suffer from reduced detection power when the true affected subset of locations does not correspond well to the set of search regions, e.g., for elongated or irregularly shaped clusters. Detection power can be substantially improved by optimizing the score function $F(S)$ over all subsets $S$, typically with additional constraints (such as spatial proximity) to ensure that the discovered clusters are feasible solutions to the problem under consideration. However, an exhaustive search over subsets, evaluating the score function $F(S)$ for all $2^N$ subsets $S \subseteq \{s_1 \ldots s_N\}$, quickly becomes computationally infeasible. This motivated the development of fast subset scanning approaches designed to find the highest-scoring subset $S^* = \arg\max_S F(S)$ without an exhaustive search.

The key idea underlying these approaches, as described in Neill (2012), is that many relevant score functions satisfy a property (linear-time subset scanning or LTSS) that allows efficient optimization over subsets, by sorting the spatial locations $s_i$ according to some “priority” function $G(s_i)$ and evaluating only those subsets consisting of the top-$j$ highest priority locations, for $j = 1 \ldots N$. For functions satisfying the LTSS property, $\max_S F(S) = \max_j F(\{s_{(1)} \ldots s_{(j)}\})$, where $s_{(j)}$ is the $j$th highest priority location, and thus, we are guaranteed that the highest scoring of all $2^N$ subsets will be one of the $N$ subsets that are evaluated. This fast subset scan approach dramatically reduces computation time while still guaranteeing an exact solution to the unconstrained (all subsets) search problem. For example, the highest-scoring subset of 97 zip codes in Allegheny County, Pennsylvania, can be found in approximately 40 ms, while an exhaustive search would require about $10^{20}$ years (Neill 2012). The fact that an exact, rather than approximate, solution is found makes fast subset scanning fundamentally different from other approaches based on submodular function optimization (Leskovec et al. 2007), which produce provably good approximations but do not necessarily identify the optimal subset.

The linear-time subset scanning property has been shown to hold for many useful score functions, including both parametric log-likelihood ratio statistics (such as the expectation-based scan statistics and Kulldorff’s original spatial scan statistic) and nonparametric scan statistics (McFowland III et al. 2013). Following Neill (2012), we consider three different conditions, each of which is sufficient for the LTSS property to hold:

• Let $F(S) = F(X(S), Y(S))$ be a convex (or quasi-convex) function of two additive,
nonnegative sufficient statistics of subset $S$, $X(S) = \sum_{s_i \in S} x_i$ and $Y(S) = \sum_{s_i \in S} y_i$. If $F(S)$ is monotonically increasing with $X(S)$ or decreasing with $Y(S)$, then $F(S)$ satisfies the LTSS property with priority function $G(s_i) = x_i / y_i$. The key step in this proof is to show that, if there exist two locations $s_{in} \in S$ and $s_{out} \notin S$, where $G(s_{in}) \le G(s_{out})$, then $F(S) \le \max(F(S \cup \{s_{out}\}), F(S \setminus \{s_{in}\}))$, i.e., the score of subset $S$ will be increased (or at least not decreased) by either adding the higher-priority location $s_{out}$ or removing the lower-priority location $s_{in}$. This step follows from the convexity of function $F$. As a corollary, the original formulation of the spatial scan statistic (Kulldorff 1997) satisfies LTSS.
• Let $F(S) = F(X(S), |S|)$ be a function of one additive sufficient statistic of subset $S$ and the cardinality of $S$. If $F(S)$ is monotonically increasing with $X(S)$, then $F(S)$ satisfies the LTSS property with priority function $G(s_i) = x_i$. Moreover, $F(S)$ also satisfies the strong LTSS property, which guarantees that $S = \{s_{(1)} \ldots s_{(j)}\}$ is the highest-scoring subset among those subsets with cardinality $j$. The key step in this proof is to show that, if there exist two locations $s_{in} \in S$ and $s_{out} \notin S$, where $G(s_{in}) \le G(s_{out})$, then $F(S) \le F(S \cup \{s_{out}\} \setminus \{s_{in}\})$, i.e., the score of subset $S$ will be increased (or at least not decreased) by substituting the higher-priority element $s_{out}$ for the lower-priority element $s_{in}$. This step follows from the monotonicity of function $F$. As a corollary, we can show that a large class of nonparametric scan statistics, which compare the actual and expected numbers of p-values in subset $S$ that are significant at level $\alpha$, satisfy LTSS and strong LTSS. See McFowland III et al. (2013) for details. Functions satisfying strong LTSS allow some useful optimization approaches that functions which only satisfy (weak) LTSS do not, such as efficient constrained optimization over subsets with hard constraints on region density. However, most commonly used scan statistics only satisfy the weak but not strong LTSS property. Nevertheless, the weak property is sufficient for efficient optimization over subsets of the data.
• Let $F(S)$ be an expectation-based scan statistic for any distribution in the exponential family. Then $F(S)$ satisfies LTSS with priority function $G(s_i) = q_{max}(x_i, \mu_i)$, where $x_i$ is the observed value at location $s_i$, $\mu_i$ is the expected value at location $s_i$, and $q_{max}(x_i, \mu_i)$ is the value $q > 1$ such that $F(S \mid q) = 0$ for $S = \{s_i\}$. See Speakman et al. (2014b) for details. As a corollary, the commonly used expectation-based Poisson and Gaussian scan statistics satisfy LTSS, as do the expectation-based exponential, binomial, and negative binomial. In earlier work, Neill (2012) proved that all expectation-based scan statistics in the separable exponential family, a subfamily of the exponential family which contains the Poisson, Gaussian, and exponential distributions but not the binomial and negative binomial, satisfy LTSS, with the simpler (and easier to compute) priority function $G(s_i) = x_i / \mu_i$. However, Speakman et al. (2014b) present a counterexample showing that the expectation-based binomial does not satisfy LTSS with this priority function.

Incorporating Constraints into Fast Subset Scanning

Since the LTSS property only guarantees an exact solution to the unconstrained (all subsets) optimization problem, the biggest challenge within the fast subset scanning framework is to incorporate real-world constraints such as spatial proximity, graph connectivity, and temporal consistency to ensure that relevant and useful subsets are detected. A number of recent extensions use the unconstrained fast subset scan as a building block to develop powerful methods for constrained optimization. The original work on fast subset scanning (Neill 2012) demonstrated how spatial proximity constraints can be incorporated, using spatial information to constrain the search by penalizing or excluding unlikely subsets (e.g., spatially dispersed or highly irregular regions). The fast localized scan approach (Neill 2012) constrains the search to subsets of the local neighborhoods formed by considering each spatial location $s_i$ and its $k - 1$ nearest neighbors, for
some fixed neighborhood size $k$. Fast localized scan performs a separate, unconstrained search over subsets for each of the $N$ neighborhoods formed in this way, reducing the computational complexity of searching each neighborhood from $O(2^k)$ to $O(k)$ using the LTSS property. Thus, the overall complexity is reduced from exponential to $O(Nk + N \log N)$, where the first term describes the complexity of searching over the $N$ neighborhoods and the second term corresponds to the initial step of sorting the $N$ locations by priority. If a good choice of neighborhood size $k$ is not known, an alternative is the fast multiscan (Neill 2012), which compares the penalized scores $\max_S F(S \mid k) - \lambda k$ for all neighborhood sizes $k = 1 \ldots N$ and some constant $\lambda > 0$. Given labeled training data, the value of $k$ for fast localized scan or the value of $\lambda$ for fast multiscan can be chosen by cross-validation. Neill (2012) examined the detection power and spatial accuracy of the fast localized scan and fast multiscan approaches as compared to Kulldorff’s traditional spatial scan (searching over circular regions), using simulated disease outbreaks injected into real-world hospital emergency department data. The proximity-constrained subset scans substantially improved the timeliness and accuracy of detection, detecting 2 days faster with fewer than half as many missed outbreaks.

For the extension of linear-time subset scanning to graph or network data, we monitor one or more data streams at each node of the graph and wish to detect the most anomalous subset of nodes subject to the graph connectivity constraints (i.e., the given subset of nodes must form a connected subgraph of the original graph). For spatial data, the graph edges could represent spatial adjacency or travel patterns, but this framework also enables analysis of nonspatial network data. As noted above, exact optimization of the score function $F(S)$ over connected subgraphs is difficult: the FlexScan approach (Tango and Takahashi 2005) performs an exhaustive search and thus does not scale beyond 25 or 30 nodes, while other approaches (Patil and Taillie 2004; Duczmal and Assuncao 2004; Duczmal et al. 2007; Costa et al. 2012) are not guaranteed to find the highest-scoring subgraph. While the highest-scoring subset found by unconstrained LTSS may be disconnected, making it challenging to apply LTSS directly for optimization with connectivity constraints, an alternate formulation of the LTSS property can be used to speed up the search. As noted above, in the unconstrained case, we can prove that a subset $S$ is suboptimal if there exist locations $s_{in} \in S$ and $s_{out} \notin S$ where $G(s_{in}) \le G(s_{out})$. When optimizing over connected subgraphs instead of all subsets, a variant of this property still applies, but subgraph $S$ is only provably suboptimal if the resulting subgraphs $S \cup \{s_{out}\}$ and $S \setminus \{s_{in}\}$ remain connected.

This property was recently incorporated into a depth-first search procedure, the GraphScan algorithm (Speakman et al. 2014a). By identifying and pruning paths that are provably suboptimal, GraphScan can rule out large numbers of subsets without evaluating each one individually. This approach dramatically reduces the size of the search space and the resulting computation time. Additional speed improvements are obtained by branch and bound, using the unconstrained maximum score (which is efficiently computable using LTSS) as an upper bound on the maximum score of connected subgraphs and ruling out large numbers of subsets with upper bounds less than the highest-scoring subgraph found so far. See Speakman et al. (2014a) for details. The resulting GraphScan algorithm, like FlexScan (Tango and Takahashi 2005), still requires exponential computation time in the worst case, but it scales to graphs an order of magnitude larger than FlexScan, with a 450,000x speedup for graphs of size 30 (Speakman et al. 2014a). Moreover, GraphScan still identifies the highest-scoring subgraph: it is an exact, rather than approximate, algorithm. An alternative approach, Additive GraphScan (Speakman et al. 2013), can only be used for additive score functions and is not guaranteed to find the highest-scoring subgraph. However, Additive GraphScan can scale to graphs with tens of thousands of nodes and identifies near-optimal subsets with high probability in practice (Speakman et al. 2013).

Another recent extension of the fast subset scanning framework, the Dynamic Subset Scan (Speakman et al. 2013), focuses on
detecting dynamic patterns, where the affected subset of locations can grow, shrink, or move over time. Typical space-time scan approaches (Kulldorff 2001; Neill et al. 2005) search over space-time cylinders, assuming that the affected subset of locations remains constant for the duration of the event. However, this over-constrained approach leads to reduced detection power for dynamically evolving events, as well as failing to accurately capture the event dynamics. An alternative, under-constrained approach of performing independent spatial scans for each time step results in identified patterns that display unrealistic temporal trends (e.g., affecting the east side of the city on day 1, the west side on day 2, and back to the east side on day 3). Thus, the Dynamic Subset Scan optimizes the score function $F(S)$ over subsets of locations at each time step while enforcing temporal consistency constraints, considering the patterns detected at adjacent time steps, and rewarding patterns that are not dramatically different between time steps $t$ and $t + 1$. This approach allows the spatial extent of an event to evolve smoothly over time while penalizing unrealistic event dynamics.

Unlike the fast subset scanning approaches described above, which enforce hard constraints and thus rule out some subsets from consideration, Dynamic Subset Scan enforces soft constraints, which can be interpreted as applying bonuses or penalties to the score function for including or excluding certain locations. This is a specific case of the more general, penalized fast subset scanning (PFSS) framework described by Speakman et al. (2014b). Incorporating penalties is difficult because a penalized version of the score function may not satisfy the LTSS property. However, Speakman et al. (2014b) show that any expectation-based scan statistic in the exponential family can be written as an additive function conditional on the relative risk parameter $q$. Only a linear number of ranges for $q$ must be considered, and optimization over subsets for each $q$ range is very efficient. Moreover, this formulation allows bonuses or penalties for each location to be incorporated into the score function while maintaining efficient optimization. In the Dynamic Subset Scan, PFSS is used to optimize over all subsets of locations for a given time step, with location-specific bonuses or penalties based on the detected subsets for the previous and next time steps, and incorporating a flexible, generative model of event propagation. This efficient conditional optimization step is iterated until convergence, thus propagating information both backward and forward in time. Connectivity constraints can also be incorporated into the Dynamic Subset Scan framework, requiring the use of Additive GraphScan (rather than simply including all locations that make a positive contribution to the score) for each step. Speakman et al. (2013) applied the Dynamic Subset Scan (with connectivity and temporal consistency constraints) to detection, tracking, and source tracing of spreading contaminants in a water distribution network. Dynamic Subset Scan demonstrated earlier detection of contamination events and more accurate identification of the affected subset of nodes through time.

Multivariate Fast Subset Scanning

While the fast subset scan approaches described above focus on the univariate case, monitoring a single spatiotemporal data stream, these approaches can also be extended to the multivariate case. The multivariate fast subset scan (Neill et al. 2013) can be used to monitor multiple streams of space-time data, identifying subsets of streams where the recently observed counts are significantly higher than expected. Similarly, the Fast Generalized Subset Scan (FGSS) can be used to discover patterns in general multivariate datasets, identifying subsets of similar data records with anomalous values for some subset of attributes (McFowland III et al. 2013). The key idea for both approaches is similar: linear-time subset scanning can be used for efficient optimization over subsets of locations (or records) for a given subset of streams (or attributes) but can also be used for efficient optimization over subsets of streams (or attributes) for a given subset of locations (or records). Thus, we can iterate between these two efficient conditional optimization steps until a local maximum of the score function is reached and perform multiple restarts in order to approach the global maximum.
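As a rough illustration of this alternation, the following Python sketch sums counts and baselines over the currently selected streams (or locations), scores the result with the expectation-based Poisson statistic, and applies the fast subset scan with priority $c/b$ in each conditional step. It is a minimal sketch under stated assumptions, not the implementation of Neill et al. (2013): baselines are assumed strictly positive, proximity constraints are omitted, and the function names (`ebp_score`, `fast_subset_scan`, `multivariate_scan`) and the deterministic choice of starting stream subsets are hypothetical.

```python
import math


def ebp_score(c, b):
    """Expectation-based Poisson score for an aggregate count c and baseline b."""
    return c * math.log(c / b) + b - c if c > b else 0.0


def fast_subset_scan(counts, baselines):
    """Unconstrained fast subset scan over N elements.

    Sorts elements by priority c_i / b_i and scores only the N prefixes of
    that ordering. Returns (best_score, sorted indices of the best subset).
    Assumes every baseline is strictly positive.
    """
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_j = 0.0, 0
    c_sum = b_sum = 0.0
    for j, i in enumerate(order, start=1):
        c_sum += counts[i]
        b_sum += baselines[i]
        score = ebp_score(c_sum, b_sum)
        if score > best_score:
            best_score, best_j = score, j
    return best_score, sorted(order[:best_j])


def multivariate_scan(c, b):
    """Alternate between scanning locations and scanning streams.

    c[i][m] and b[i][m] hold the count and baseline for location i, stream m.
    Returns (score, locations, streams) for the best local maximum found.
    """
    n, m = len(c), len(c[0])
    # Hypothetical restart scheme: all streams first, then each single stream.
    starts = [list(range(m))] + [[k] for k in range(m)]
    best = (0.0, [], [])
    for start in starts:
        streams, score, locs = start, 0.0, []
        while True:
            # Step 1: fix the streams, aggregate per location, scan locations.
            loc_c = [sum(c[i][k] for k in streams) for i in range(n)]
            loc_b = [sum(b[i][k] for k in streams) for i in range(n)]
            _, new_locs = fast_subset_scan(loc_c, loc_b)
            if not new_locs:
                break  # nothing scores above zero for these streams
            # Step 2: fix the locations, aggregate per stream, scan streams.
            str_c = [sum(c[i][k] for i in new_locs) for k in range(m)]
            str_b = [sum(b[i][k] for i in new_locs) for k in range(m)]
            new_score, new_streams = fast_subset_scan(str_c, str_b)
            if new_score > score:
                score, locs, streams = new_score, new_locs, new_streams
            else:
                break  # no improvement: local maximum reached
        if score > best[0]:
            best = (score, locs, streams)
    return best


# Toy data: 4 locations, 3 streams; locations 0-1 are elevated in streams 0-1.
counts = [[20, 20, 10], [20, 20, 10], [10, 10, 10], [10, 10, 10]]
baselines = [[10.0, 10.0, 10.0] for _ in range(4)]
score, locs, streams = multivariate_scan(counts, baselines)
print(locs, streams, round(score, 2))
```

Each conditional step needs only one aggregation pass and one sort rather than an enumeration of all subsets, and the score never decreases from one step to the next, so the loop stops at a local maximum; the restarts are what give a chance of reaching the global one.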

The multivariate fast subset scan builds on the univariate fast subset scanning approach, jointly
optimizing a parametric log-likelihood ratio statistic (such as the expectation-based Poisson
statistic described above) over proximity-constrained subsets of locations and over all subsets
of data streams. The most natural formulation of the multivariate scan statistic in this setting,
subset aggregation, assumes a constant multiplicative increase across all affected streams and
thus adds counts and baselines across the monitored subset of streams. An alternative formulation
by Kulldorff et al. (2007) proposes adding log-likelihood ratios across streams (assuming that
the data streams are conditionally independent). Neill et al. (2013) demonstrate that Kulldorff’s
multivariate scan can also be made efficient using the LTSS property, by iterating between two
steps: optimizing over subsets of records (for given values of the multiplicative effect of the
event on each data stream) and recalculating the maximum likelihood values of the event’s effects
for the given subset of records. Regardless of which formulation is used, the multivariate fast
subset scan is computationally efficient and scales to large numbers of locations and streams.
Moreover, significant gains in detection power and spatial accuracy were observed when searching
over subsets of data streams and when detecting proximity-constrained subsets of locations rather
than searching over circular regions (Neill et al. 2013).

The FGSS approach (McFowland III et al. 2013) does not assume space-time data but instead
considers an arbitrary set of attributes measured for each of a large set of data records.
Nonparametric scan statistics are used to convert the disparate attributes to the same scale
(empirical p-values between 0 and 1) and to integrate these values in a principled statistical
framework. FGSS consists of four steps: (1) efficiently learning a Bayesian network model that
represents the assumed null distribution of the data; (2) computing the conditional probability
of each attribute value in the dataset given the Bayesian network, conditioned on the other
attribute values for that record; (3) computing an empirical p-value range corresponding to each
attribute value by ranking the conditional probabilities, where under the null hypothesis we
expect empirical p-values to be uniformly distributed on [0,1]; and (4) using a nonparametric
scan statistic to detect subsets of records and attributes with an unexpectedly large number of
low (significant) empirical p-values. The final step is computationally expensive (exponential in
the numbers of records and attributes for a naive search), but LTSS can be used to speed up this
search, converging to a local maximum of the score function and ensuring that each iteration step
is linear (not exponential) in the number of records or attributes. FGSS was shown to
consistently outperform previously proposed methods in terms of detection power and
characterization accuracy across multiple application domains, and scales to much larger
datasets, thus enabling accurate and efficient pattern detection in massive, high-dimensional
data.

Key Applications

One important real-world application of subset scanning is in the area of disease surveillance,
where we attempt to detect emerging outbreaks of disease in their very early stages by
identifying anomalous clusters of disease cases. In the multivariate disease surveillance
setting, we monitor a set of data streams $D_m$ ($m = 1 \ldots M$) on a regular (e.g., daily or
hourly) basis at a set of spatial locations (e.g., zip codes) $s_i$ ($i = 1 \ldots N$). For each
combination of location and data stream, we have a time series of observed counts $c_{i,m}^t$,
where each count $c_{i,m}^t$ could represent the number of observed cases of a given type (e.g.,
emergency department visits with respiratory complaints) in a given zip code on a given day. A
typical goal in this setting is to identify spatial regions (subsets of locations) where some
subset of the monitored data streams has recent counts that are higher than expected. Here, the
expected counts $b_{i,m}^t$ are obtained through time series analysis of historical data and can
account for trends such as the day of week, seasonality, holidays, and known events. Multiple
variants of subset
scanning have been evaluated on the disease surveillance task, typically through semisynthetic
testing (in which simulated outbreaks are injected into real-world background data). Searching
over proximity-constrained subsets of locations to identify irregularly shaped spatial clusters
(Neill 2012) enables earlier and more accurate outbreak detection, as measured by the average
number of days to detect and proportion of outbreaks detected for a given false-positive rate, as
well as the spatial overlap between true and identified outbreak regions. Further improvements in
detection power can be obtained by integrating information from multiple health data streams
(Neill et al. 2013), searching over subsets of streams as well as proximity-constrained subsets
of locations. This approach also helps to characterize outbreaks by identifying the affected
subset of streams, and a further extension, the “multidimensional subset scan” (Neill and Kumar
2013) can also identify differentially affected subpopulations (e.g., by gender, age,
socioeconomic status, or behavioral risk factors).

Incorporating other constraints into the subset scan, such as graph connectivity and temporal
consistency (Speakman et al. 2013, 2014a) or similarity between records in general datasets
(McFowland III et al. 2013), enables a wide variety of other applications to be addressed using
this framework. For example, Speakman et al. (2013) demonstrate improved performance for
detecting, tracking, and source tracing contamination events spreading through a water
distribution system. McFowland III et al. (2013) evaluate their approach on outbreak detection,
customs monitoring of container shipments, and computer network intrusion detection,
demonstrating improvements over the current state of the art in all three application domains.
Finally, recent extensions of subset scanning have been applied to detect patterns in massive,
complex real-world datasets such as images (Somanchi and Neill 2013), text (Nobles et al. 2014),
and online social networks such as Twitter (Chen and Neill 2014). Somanchi and Neill (2013)
demonstrate that their approach can be used to accurately detect prostate cancer and identify
other regions of potential interest in digital pathology slides. Nobles et al. (2014) apply the
subset scan approach to identifying emerging “novel” disease outbreaks with previously unseen or
anomalous patterns of symptoms, using free-text emergency department chief complaint data. Chen
and Neill (2014) developed a new approach to event detection in heterogeneous social media graphs
and applied this approach to advance prediction of civil unrest events (strikes, protests, and
riots) and early warning for rare disease outbreaks (hantavirus), using Twitter data from Latin
America. In both application domains, their approach outperformed five competing,
state-of-the-art methods for both event detection and forecasting, increasing detection power,
forecasting accuracy, and forecasting lead time while reducing time to detection (Chen and Neill
2014).

Future Directions

While subset scanning is a rapidly emerging and highly promising field, a number of challenging
open problems remain to be addressed. One avenue for future research is continuing to extend the
range of score functions for which the linear-time subset scanning property can be proven to
hold, as well as the range of constraints that can be incorporated into the fast subset scan
framework while still allowing computationally efficient and scalable solutions. These extensions
have the potential to expand the use of subset scanning for a variety of real-world applications
requiring analysis of massive, complex datasets. A second important direction is gaining a better
understanding of the statistical properties of subset scanning, for example, identifying
necessary and sufficient conditions for which the highest-scoring subset converges to the true
affected subset or a provably good approximation or quantifying the detection power of
constrained subset scans as a function of how well the chosen constraints correspond to the true
pattern of interest. Finally, in many cases, event detection can be thought of as integrating
information from
many noisy sensors. This sensor fusion problem, assuming a given set of noisy sensors,
complements the sensor placement problem, in which sensors are often assumed to be perfect and
the focus is on optimally placing sensors in space or on a network. Another useful property,
submodularity, can be used to efficiently find near-optimal solutions to a variety of sensor
placement problems (Leskovec et al. 2007), and it is an open problem whether the submodularity
and linear-time subset scanning properties can be effectively combined to solve problems
requiring both placement of, and integration of data from, noisy sensors. One example application
where this might be useful is in the crowdsourced collection of environmental and ecological data
(e.g., observations of plant and animal species or measurement of air, water, and soil quality)
by “citizen scientists.” In this case, data quality varies considerably based on individuals’
expertise, and the accuracy of the data might be substantially improved by asking individuals
with relevant expertise if they are willing to perform analyses of specific types or in specific
locations.

Cross-References

▶ Hotspot Detection, Prioritization, and Security
▶ Irregular Shaped Spatial Clusters: Detection and Inference
▶ Linear Anomalous Window
▶ Movement Patterns in Spatio-Temporal Data
▶ Public Health and Spatial Modeling

References

Chen F, Neill DB (2014) Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. In: Proceedings of the 20th ACM SIGKDD conference on knowledge discovery and data mining, New York, pp 1166–1175
Costa MA, Assuncao RM, Kulldorff M (2012) Constrained spanning tree algorithms for irregularly-shaped spatial clustering. Comput Stat Data Anal 56(6):1771–1783
Das K, Schneider J, Neill DB (2008) Anomaly pattern detection in categorical datasets. In: Proceedings of the 14th ACM SIGKDD conference on knowledge discovery and data mining, Las Vegas, pp 169–176
Duczmal L, Assuncao R (2004) A simulated annealing strategy for the detection of arbitrary shaped spatial clusters. Comput Stat Data Anal 45:269–286
Duczmal L, Cancado A, Takahashi R, Bessegato L (2007) A genetic algorithm for irregularly shaped scan statistics. Comput Stat Data Anal 52(1):43–52
Kulldorff M (1997) A spatial scan statistic. Commun Stat Theory Methods 26(6):1481–1496
Kulldorff M (2001) Prospective time-periodic geographical disease surveillance using a scan statistic. J R Stat Soc A 164:61–72
Kulldorff M, Huang L, Pickle L, Duczmal L (2006) An elliptic spatial scan statistic. Stat Med 25:3929–3943
Kulldorff M, Mostashari F, Duczmal L, Yih WK, Kleinman K, Platt R (2007) Multivariate scan statistics for disease surveillance. Stat Med 26:1824–1833
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD conference on knowledge discovery and data mining, San Jose, pp 420–429
McFowland III E, Speakman S, Neill DB (2013) Fast generalized subset scan for anomalous pattern detection. J Mach Learn Res 14:1533–1561
Naus JI (1965) The distribution of the size of the maximum cluster of points on the line. J Am Stat Assoc 60:532–538
Neill DB (2009a) An empirical comparison of spatial scan statistics for outbreak detection. Int J Health Geogr 8:20
Neill DB (2009b) Expectation-based scan statistics for monitoring spatial time series data. Int J Forecast 25:498–517
Neill DB (2012) Fast subset scan for spatial pattern detection. J R Stat Soc Ser B Stat Methodol 74(2):337–360
Neill DB, Kumar T (2013) Fast multidimensional subset scan for outbreak detection and characterization. Online J Publ Health Inf 5(1):156
Neill DB, Moore AW (2004) Rapid detection of significant spatial clusters. In: Proceedings of the 10th ACM SIGKDD conference on knowledge discovery and data mining, Seattle, pp 256–265
Neill DB, Moore AW, Sabhnani M, Daniel K (2005) Detection of emerging space-time clusters. In: Proceedings of the 11th ACM SIGKDD conference on knowledge discovery and data mining, Chicago, pp 218–227
Neill DB, McFowland III E, Zheng H (2013) Fast subset scan for multivariate event detection. Stat Med 32:2185–2208
Nobles M, Deyneka L, Ising A, Neill DB (2015) Identifying emerging novel outbreaks in textual emergency department data. Online J Publ Health Inf 7(1):e45
Patil GP, Taillie C (2004) Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environ Ecol Stat 11:183–197
Somanchi S, Neill DB (2013) Discovering anomalous patterns in large digital pathology images. In: Proceedings of the 8th INFORMS workshop on data mining and health informatics, Minneapolis
Speakman S, Zhang Y, Neill DB (2013) Dynamic pattern detection with temporal consistency and connectivity constraints. In: Proceedings of the 13th IEEE international conference on data mining, Dallas, pp 697–706
Speakman S, McFowland III E, Neill DB (2015) Scalable detection of anomalous patterns with connectivity constraints. J Comput Graph Stat 24(4):1014–1033
Speakman S, Somanchi S, McFowland III E, Neill DB (2016, in press) Penalized fast subset scanning. J Comput Graph Stat
Tango T, Takahashi K (2005) A flexibly shaped spatial scan statistic for detecting clusters. Int J Health Geogr 4:11