An Introduction To Point Pattern Analysis Using Crimestat
Luc Anselin
Spatial Analysis Laboratory
Department of Agricultural and Consumer Economics
University of Illinois, Urbana-Champaign
http://sal.agecon.uiuc.edu/
June 24, 2003
Introduction
This is a brief introduction to the analysis of patterns in points (as events) using Ned
Levine's CrimeStat 2.0 software package. This package is freely available and can be
obtained on the web from http://www.icpsr.umich.edu/NACJD/crimestat.html . The data
used in this tutorial are the Pittsburgh homicide locations (various Pitt* files) and the
Cardiff juvenile offender addresses (juvenile), both obtainable as shape files from the
SAL sample data repository http://sal.agecon.uiuc.edu/stuff/data.html. Some familiarity is
assumed with either ArcView or ArcGIS, optionally with the Spatial Analyst extension,
to implement visualization of various results. CrimeStat does not have its own
visualization capability, but relies on an external GIS through the export of result files.
Getting started with CrimeStat
Start CrimeStat by double clicking its icon. Click on the welcome screen to open the
main interface, shown in Figure 1.
Note that on some systems (like the one used for this tutorial, running Windows XP) the
bottom buttons are not fully legible. They stand for, left to right, Compute, Quit and
Help. Help brings up an extensive help system.
The four tabs at the top of the interface correspond to some logical steps in the way
CrimeStat implements analysis. First, one needs to set up the data and possibly set some
options, then choose the type of analysis (spatial description or spatial modeling). The
analysis is run by clicking on the Compute button in the bottom left.
Data setup in CrimeStat
CrimeStat reads data from various format files, including shape files. You will be using
the juvenile.shp data set from the Bailey-Gatrell text in this example. This is a very
simple data file with only the X, Y coordinates of the offender addresses. You can load
this into ArcView to have a quick sense of the overall pattern, as in Figure 2. [1]
[1] Open ArcView, and, with the Views icon active, click on New. Then add a theme by clicking on the +
(plus) icon and locate the juvenile.shp file. Make sure to click on the check mark to make the theme visible.
OK) and the file name will appear in all the input fields in the user interface, as shown in
Figure 4.
Save the Grid specifications for later use to a file in your working directory. This is a
two-step process. First, you Save the grid specification by clicking on the Save button
in the interface (Figure 6). This brings up a dialog to name the particular grid setup, as in
Figure 7. However, this does not save it to a file. To store this (and other) named grid
specifications in a file, first Load them (Load button in Figure 6) and then Save to File in
the following dialog, as in Figure 8.
Practice
Start a second instance of CrimeStat and use the Pitthom.dbf file as the primary file. Set
up the coordinates and the reference grid (use the Identify button in ArcView to
determine the coordinates of the lower-left and upper-right corners of the bounding
rectangle).
The results will appear in a screen, as in Figure 10. Note the tabs on the top of the screen,
which let you select the output for each set of descriptive statistics (Mcsd, Sde, and
MdnCntr). You can now save these results to a text file, or print them out.
2SDEfile.shp, etc. (see the CrimeStat manual for a full list of options). For example, in
Figure 11, the mean center (green cross), median center (red dot), standard deviational
rectangle (black rectangle), standard distance deviation (blue circle) and standard
deviational ellipse (green ellipse) are illustrated for the juvenile point pattern. The
matching shape file names are given in the legend panel. In this particular application, the
mean and median centers are practically the same and there is only a slight indication of a
directional effect (the circle and ellipse are very close; a strong directional effect would
show up as an ellipse that is very elongated along one axis).
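As a rough illustration of the quantities behind these shapes, the following Python sketch computes a mean center, the standard deviations along each axis (the basis of a standard deviational rectangle) and a standard distance. The function names and the division by n are assumptions made for illustration only; CrimeStat's own formulas, including any degrees-of-freedom correction, are given in its manual and may differ.

```python
import math

def mean_center(points):
    """Return the arithmetic mean of the X and Y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def axis_std_devs(points):
    """Return the standard deviation of the coordinates along X and along Y.

    Assumption: population form (divide by n); other conventions exist.
    """
    cx, cy = mean_center(points)
    n = len(points)
    sx = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n)
    sy = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n)
    return sx, sy

def standard_distance(points):
    """Return the root mean squared distance of the points to the mean center."""
    sx, sy = axis_std_devs(points)
    return math.sqrt(sx ** 2 + sy ** 2)

# Hypothetical coordinates, not taken from the juvenile data set.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(mean_center(pts), axis_std_devs(pts), standard_distance(pts))
```

When the two axis standard deviations are close, as in this toy example, the pattern shows little directional effect, which is the situation described above for the juvenile data.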
Practice
Carry out a centrographic analysis of the point pattern for the Pittsburgh homicides (use
Pitthom.shp as the Primary File). Overlay the results on the map of points in ArcView (or
ArcGIS). For a challenge, compare and visualize the summary statistics between the
homicides in 93 and 94. To accomplish this, you will need to build a query in
ArcView/ArcGIS to select those observations for which the Event_yr is 93 (or 94),
followed by a Theme > Convert to Shape file command to create a separate shape file for
the selected observations. Then specify the new shape file as the Primary File in the
CrimeStat analysis.
To visualize the estimated surface, open up ArcView and add the Kjuvgrid0 theme. You
must use the legend editor to turn this into a meaningful grid map. In ArcView (use
similar commands in other GIS software) select Graduated Color as the legend type and
use Z as the Calculation Field. For now, you can keep the default Natural Breaks
classification (you can experiment with different types of classifications later). The
resulting grid map will be as in Figure 15, with the original point pattern superimposed.
If you are familiar with ESRI's Spatial Analyst extension, you can convert the polygon
grid shape file to Spatial Analyst's grid format and then use the Surface analysis
functionality to superimpose contour lines on the density surface. For example, Figure 16
is the result of such an operation, with the extent of the grid and contour themes set to the
extent of juvenile.shp, and with 0.005 as the contour step. [2] Note how the normal kernel
density tends to smooth the surface and remove a lot of the underlying detail in the
original pattern. Experiment with changing some of the parameters, such as the kernel
bandwidth.
[2] The steps involved in obtaining this result are as follows, illustrated for ArcView (they are slightly
different in ArcGIS). First, make sure the Spatial Analyst extension has been activated. With the grid
polygon theme active, select Theme > Convert to Grid, set the extent to that of juvenile.shp, the number of
rows and columns to 100, and Z as the field for the grid values. Add the new grid theme to a View (make sure
to set the input file type to Grid instead of Feature) and then select Surface > Create Contours. Set the
interval to 0.005 and you should see the same result as in Figure 16. Make sure to use the legend editor to
turn the contour theme into a graduated color with contour as the variable.
The kernel density results are very sensitive to the choice of kernel model and settings.
To illustrate this, consider a Triangular kernel with the same settings as for the Normal
(25 points cutoff for the Adaptive bandwidth). The result is as in Figure 17, which is
much spikier and focuses on several hot spots rather than the central tendency reflected
in Figures 15-16. Experiment with different settings, for example, changing the
bandwidth to 50. The kernel densities differ with respect to how steep the cutoff is and
how smooth the resulting surface will be. Some trial and error will start to show some
persistent patterns in the data, suggesting potential clusters or hot spots.
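The difference between the two kernel shapes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CrimeStat's implementation: it uses a fixed bandwidth h rather than the adaptive bandwidth described above, treats h as the standard deviation of the normal kernel and as the cutoff radius of the triangular kernel, and scales each kernel to integrate to one over the plane. All function names are hypothetical.

```python
import math

def normal_kernel(d, h):
    """Bivariate normal kernel; never reaches zero, so the surface is smooth."""
    return math.exp(-d * d / (2.0 * h * h)) / (2.0 * math.pi * h * h)

def triangular_kernel(d, h):
    """Kernel that declines linearly with distance and is zero beyond h."""
    if d >= h:
        return 0.0
    return 3.0 * (1.0 - d / h) / (math.pi * h * h)

def density_at(location, events, h, kernel):
    """Sum the kernel contributions of all events at one grid location."""
    lx, ly = location
    return sum(kernel(math.hypot(x - lx, y - ly), h) for x, y in events)

# Hypothetical event coordinates, for illustration only.
events = [(1.0, 1.0), (1.2, 0.9), (4.0, 4.0)]
for name, kernel in (("normal", normal_kernel), ("triangular", triangular_kernel)):
    near = density_at((1.0, 1.0), events, 1.0, kernel)
    far = density_at((2.5, 2.5), events, 1.0, kernel)
    print(name, near, far)
```

Because the triangular kernel drops to exactly zero beyond the cutoff, locations between groups of events receive no density at all, which is one way to see why the resulting surface looks spikier than the normal one.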
Practice
Use the Pittsburgh homicide file to construct a normal and triangular (or other type) kernel
density surface. If you are comfortable with the Spatial Analyst, display the surfaces as
isoline maps (contour maps). If you have created separate shape files for 93 and 94, you
can compare the clusters and hot spots suggested by the kernel densities between the
two years. You can also create different shape files using the other variables to
distinguish between point patterns, such as gang-related vs. non gang-related, or using
guns vs. not using guns.
Nearest Neighbor distance statistic
In order to more formally assess the extent to which a point pattern shows clustering or
dispersion, two main classes of techniques can be applied. The first uses the magnitude
and/or distribution of inter-point distances, or the distances between the points and
reference locations as an indicator (distance based tests). The second set of methods uses
the number of points within a given area as the basis for test statistics (quadrat counts).
The simplest of the distance based statistics uses the distribution of the distance to the
nearest neighbor as a measure. If this distance tends to be smaller than what it would be
under complete spatial randomness, this suggests clustering. If, on the other hand, it tends
to be larger, then dispersion is the suggested alternative.
A Nearest Neighbor statistic is implemented in CrimeStat in the Distance Analysis tab of
Spatial Description. Make sure you have the juvenile.shp file as the primary file with the
proper coordinates and projection set. You do not need the Reference File for these
calculations. In the Distance Analysis dialog, check the box next to Nearest Neighbor
Analysis (Nna) and leave the defaults to their settings, as in Figure 18.
Click on the Compute button to carry out the analysis. The result window will contain the
summary statistics. You can save these to a text file, the contents of which are as in
Figure 19. Note that it does not make much sense to save the results to a dbf file, since
only the summary distance statistic will be saved, not the full distribution of nearest
neighbor distances. The results yield a Nearest Neighbor Index of 0.7003, which is
obtained by taking the ratio of the observed mean nearest neighbor distance to the mean
random distance. The value less than 1 suggests clustering. A test statistic can be
constructed by taking the difference between the observed and random mean nearest
neighbor distance and standardizing by the standard error. The resulting Z-value of -7.43
is far beyond the usual critical values in absolute value (for example, 1.96 at the 5
percent level), suggesting significant clustering. However, these tests have to be
interpreted with some caution. Also, there are many nearest neighbor based statistics,
and they don't necessarily lead to the same conclusion.
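The calculation described above can be sketched in Python as follows. This is a minimal illustration, not CrimeStat's code: it assumes the classical formulas for complete spatial randomness (expected mean distance 0.5 * sqrt(A / n) and standard error sqrt((4 - pi) * A / (4 * pi * n^2))), takes the study area A as a user-supplied value, and applies no border correction. The function name is hypothetical.

```python
import math

def nearest_neighbor_index(points, area):
    """Return (index, z) for a point pattern observed in a region of given area."""
    n = len(points)
    # Brute-force nearest neighbor distance for every point.
    nn_distances = []
    for i, (xi, yi) in enumerate(points):
        nn_distances.append(min(math.hypot(xi - xj, yi - yj)
                                for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nn_distances) / n
    # Mean nearest neighbor distance expected under complete spatial randomness.
    expected = 0.5 * math.sqrt(area / n)
    std_error = math.sqrt((4.0 - math.pi) * area / (4.0 * math.pi * n * n))
    return observed / expected, (observed - expected) / std_error

# Hypothetical example: four points on a regular layout in a region of area 4.
index, z = nearest_neighbor_index([(0, 0), (1, 0), (0, 1), (1, 1)], 4.0)
print(index, z)
```

In this toy example the points are evenly spaced, so the index is larger than 1 and the Z-value is positive, pointing toward dispersion rather than clustering. The result depends directly on the area supplied, which is one reason the test should be interpreted with caution.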
You can assess the sensitivity of the results to a number of settings, such as the use of
border corrections. [3]
Practice
Compute the nearest neighbor index to assess the extent of clustering of the Pittsburgh
homicide point pattern. Compare the results for the two periods combined and for each
time period separately. Also compare the findings for different types of crimes (guns or
not, gangs or not). Try different border corrections to assess the sensitivity of the results.
[3] Note that in order to use the Manhattan distance (linear nearest neighbor index) feature, you must specify
the total length of the street network, which is not available for the Juvenile or Pittsburgh data sets.
Ripley's K function
The nearest neighbor distance statistics are described as first order statistics, since they
only consider the distance to the nearest point. Second order distance statistics consider
the complete distribution of all distances in the point pattern. Ripley's K function is an
example of such a second order statistic, and is essentially a test on the cumulative
distribution function of the full set of inter-point distances. This distribution can be
compared to a reference distribution under complete spatial randomness. A higher
proportion of shorter distances than random would suggest clustering, whereas a higher
proportion of longer distances suggests dispersion.
CrimeStat implements Ripley's K function under the Distance Analysis of the Spatial
Description tab. The program does not report the actual K function results, but instead the
L function, which is simply a rescaled K function such that the reference for complete
spatial randomness is linear and horizontal (at zero).
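The relation between the K and L functions can be sketched in Python. This is a minimal illustration under stated assumptions, not CrimeStat's implementation: K is estimated as (A / n^2) times the number of ordered pairs of distinct points within distance d, with no border correction, and L is taken as sqrt(K / pi) - d, a common rescaling whose expected value under complete spatial randomness is zero at every distance. The function names and the study area A are supplied for illustration.

```python
import math

def ripley_k(points, area, d):
    """Naive K estimate: scaled count of ordered pairs no farther apart than d."""
    n = len(points)
    pairs = sum(1
                for i, (xi, yi) in enumerate(points)
                for j, (xj, yj) in enumerate(points)
                if i != j and math.hypot(xi - xj, yi - yj) <= d)
    return area * pairs / (n * n)

def ripley_l(points, area, d):
    """Rescaled K: positive values suggest clustering, negative values dispersion."""
    return math.sqrt(ripley_k(points, area, d) / math.pi) - d

# Hypothetical example: two points one unit apart in a region of area 10.
pts = [(0.0, 0.0), (1.0, 0.0)]
for d in (0.5, 1.0, 1.5):
    print(d, ripley_k(pts, 10.0, d), ripley_l(pts, 10.0, d))
```

Evaluating L over a sequence of distance bins, as CrimeStat does, and plotting it against d gives a curve that can be compared with the horizontal reference line at zero.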
Make sure the juvenile.shp primary file is set, with the proper coordinates and projection.
Check the box next to Ripley's K statistic (and uncheck the box next to Nna), set the
number of simulation runs to 1000 and specify a dBase file for the output, as illustrated in
Figure 20. The cumulative distribution, organized in 100 distance bins, will then be
written to a dBase file. Note that the program will add the prefix Ripley to whatever
filename you specify, so in the example of Figure 20, the dBase file will be called
Ripleyjuvripley.dbf (and not juvripley.dbf as you might expect).
Click on Compute to start the calculation and simulation runs. This may take a while,
depending on how many simulation runs you specified. When the program is finished,
click on the Ripley tab in the results window to see the output, as illustrated in Figure 21.
Practice
As before, use the Pittsburgh homicide data and Ripley's K function to assess overall
clustering of homicides (overall, by year and/or by type). Visualize the computed
distributions in a spreadsheet or graphing package. Experiment with border adjustments
to assess the sensitivity of the results.
Hot Spot detection (STAC)
Quadrat methods assess the presence of clusters by comparing the number of events
(points) within a given region to the number expected under complete spatial
randomness. The STAC (Spatial and Temporal Analysis of Crime) method is a form of
quadrat method. More precisely, it is a combination of a scan statistic (counting the
number of events within a circle) and a hierarchical clustering technique (when a point
falls in more than one of the identified circles, all the points in those circles are
combined into a single cluster). The results are visualized as a standard deviational ellipse
computed for the points identified to be a cluster or hot spot. The significance of the
identified cluster can be assessed by means of a Monte Carlo randomization method.
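The two ingredients described above, a circular scan followed by the merging of circles that share points, can be illustrated with a simplified Python sketch. This is not the STAC algorithm as implemented in CrimeStat, which involves additional steps such as the fitting of ellipses and the Monte Carlo test; it only shows the scan-and-merge idea. The function name, the choice of grid nodes as circle centers and the merge rule are assumptions made for illustration.

```python
import math

def scan_and_merge(points, grid_nodes, radius, min_points):
    """Return clusters as sorted lists of point indices."""
    # Scan step: for each grid node, collect the points inside the circle
    # and keep the circle only if it holds at least min_points points.
    circles = []
    for gx, gy in grid_nodes:
        members = {i for i, (x, y) in enumerate(points)
                   if math.hypot(x - gx, y - gy) <= radius}
        if len(members) >= min_points:
            circles.append(members)
    # Merge step: circles that share at least one point are combined.
    clusters = []
    for members in circles:
        merged = set(members)
        untouched = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster
            else:
                untouched.append(cluster)
        untouched.append(merged)
        clusters = untouched
    return [sorted(cluster) for cluster in clusters]

# Hypothetical coordinates and grid nodes, for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
nodes = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (9.0, 9.0)]
clusters = scan_and_merge(points, nodes, 0.15, 2)
print(clusters)
```

Even in this toy version, the outcome depends on the radius and on the minimum number of points: a larger radius makes more circles overlap and therefore merges more points into fewer, larger clusters.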
STAC is implemented in CrimeStat under the Hot Spot Analysis II tab of the Spatial
Description tab. Make sure the juvenile.shp file is set as the Primary File with the proper
coordinates and projection specified, and set the Reference File as the 100 x 100 grid
with origin at 0, 0, as before. Also set the Data Units to Kilometers. Check the box next
to STAC on the interface, and set the Output Units to Kilometers, as in Figure 23. Click
on the "save ellipses to" button to specify the output file for the standard deviational
ellipses as a shape file and enter the file name in the text box, as in Figure 24. Finally,
you need to set the parameters for the STAC algorithm (make sure you have specified the
Grid option in the Reference File tab or STAC won't work). As in any clustering
operation, the results of STAC are quite sensitive to these parameters. The most
important ones are the search radius (STAC uses a circle with a fixed radius in the scan
operation) and the minimum number of points to consider a cluster. Both of these are
context specific and may require some trial and error. For example, setting the search
radius too large or too small may not yield any clusters. In Figure 25, the settings are 10
for the search radius and 5 for the minimum number of points. Also specify 1000 for the
number of randomizations (this is not required for the STAC algorithm to work). Click
on Compute to start the analysis. This yields 3 clusters, as shown in the results window in
Figure 26.
For the search radius of 10, the results are not that useful. Three clusters are identified,
and their mean center, area, number of points and density are listed in the results page
(Figure 26). When superimposing the ellipse shape file on the point pattern, it is obvious
that the first (largest) cluster is not a useful hot spot in that it contains 128 out of the
168 points in the pattern, as shown in Figure 27. Resetting the search radius to 5 yields 10
clusters, shown in Figure 28. You can further experiment with setting a different search
radius, changing the minimum number of points for a cluster, etc.
Another interesting comparison is to overlay the STAC ellipses on the kernel density
grid, to get further insight into the overall patterns in the points. As shown in Figure 29,
there is some correspondence between some of the clusters and the areas of higher
density, but the match is not complete. In part this is due to the different densities in the clusters (not all
of them are high density since they may have resulted from collapsing several initial
clusters).
Practice
Use the Pittsburgh homicide data (pitthom.shp) to carry out a hot spot analysis using
STAC. Experiment with different search radii. Start with 500, using miles as the distance
unit and 50194,87016 110183,127712 as the bounding box. Increase the radius and
assess the effect. As before, you can also carry out analyses for the individual years
and/or crime types. Compare the STAC ellipses to one of the kernel density estimates and
assess the degree of similarity in the suggestion of clusters and hot spots.