Criminal Incident Data Association Using The OLAP Technology
Abstract. Associating criminal incidents committed by the same person is important in crime analysis. In this paper, we introduce concepts from OLAP (online analytical processing) and data mining to resolve this issue. The criminal incidents are modeled into an OLAP data cube; a measurement function, called the outlier score function, is defined on the cube cells. When the score is significant enough, we say that the incidents contained in the cell are associated with each other. The method can be used with a variety of criminal incident features, including the locations of the crimes for spatial analysis. We applied this association method to the robbery dataset of Richmond, Virginia. Results show that this method can effectively solve the problem of criminal incident association.
Keywords. Criminal incident association, OLAP, outlier
1 Introduction
Over the last two decades, computer technologies have developed at an exceptional rate and have become an important part of our lives. Consequently, information technology now plays an important role in the law enforcement community. Police officers and crime analysts can access much larger amounts of data than ever before. In addition, various statistical methods and data mining approaches have been introduced into the crime analysis field, so crime analysis personnel are capable of performing complicated analyses more efficiently.
People committing multiple crimes, known as serial criminals or career criminals, are a major threat in modern society. Understanding the behavioral patterns of these career criminals and apprehending them is an important task for law enforcement officers. As the first step, identifying criminal incidents committed by the same person and linking them together is of major importance for crime analysts.
According to the rational choice theory [5] in criminology, a criminal evaluates the benefit and the risk of committing an incident and makes a rational choice to maximize the profit. In the routine activity theory [9], a criminal incident is considered the product of an interactive process of three key elements: a ready criminal, a suitable target, and a lack of effective guardians. Brantingham and Brantingham [2] claim that the environment sends out signals, or cues (physical, spatial, cultural, etc.), about its characteristics, and the criminal uses these cues to evaluate the target and make the decision. A criminal incident is usually an outcome
H. Chen et al. (Eds.): ISI 2003, LNCS 2665, pp. 13-26, 2003. © Springer-Verlag Berlin Heidelberg 2003
of a decision process involving a multi-staged search in the awareness space. During the search phase, the criminal associates these cues, clusters of cues, or cue sequences with a good target. These cues form a template of the criminal, and once the template is built, it is self-reinforcing and relatively enduring. Because the searching ability of a human being is limited, a criminal normally does not have many decision templates. Therefore, we can observe criminal incidents with similar temporal, spatial, and modus operandi (MO) features, which possibly come from the same template of the same criminal. It is possible to identify the serial criminal by associating these similar incidents.
Different approaches have been proposed and several software programs have been developed to resolve the crime association problem. They can be classified into two major categories: suspect association and incident association. The Integrated Criminal Apprehension Program (ICAP) developed by Heck [12] enables police officers to match suspects against arrested criminals using MO features; the Armed Robbery Eidetic Suspect Typing (AREST) program [1] employs an expert system approach to perform suspect association and classifies a potential offender into one of three categories: probable, possible, or non-suspect. The Violent Criminal Apprehension Program (ViCAP) [13], developed by the Federal Bureau of Investigation (FBI), is an incident association system in which MO features are primarily considered. In the COPLINK [10] project undertaken by researchers at the University of Arizona, a novel concept space model is built and can be used to associate search terms with suspects in the database. A total similarity method was proposed by Brown and Hagen [3]; it can be applied to both incident association and suspect association. Besides these theoretical methods, crime analysts normally use SQL (Structured Query Language) in practice. They build the SQL string and have the system return all records that match their search criteria.
In this paper, we describe a crime association method that combines OLAP concepts from the data warehousing area with outlier detection ideas from the data mining field. Before presenting our method, let us briefly review some concepts in OLAP and data mining.
dimensional OLAP data structures. Imielinski et al. proposed the cubegrade problem [14]. The cubegrade problem can be treated as a generalized version of the association rule. Imielinski et al. claim that the association rule can be viewed as the change of count aggregates when imposing another constraint, or in OLAP terminology, making a drill-down operation on an existing cube cell. They argue that other aggregates such as sum, average, max, or min can also be incorporated, and that the cubegrade could better support what-if analysis. Similar to the cubegrade problem, constrained gradient analysis was proposed by Dong et al. [7]. It focuses on retrieving pairs of OLAP cube cells that are quite different in aggregates and similar in dimensions (usually one cell is the ascendant, descendant, or sibling of the other cell). More than one aggregate can be considered simultaneously in constrained gradient analysis. The discovery-driven exploration problem was proposed by Sarawagi et al. [18]. It aims at finding exceptions in the cube cells. They build a formula to estimate the anticipated value and the standard deviation ($\sigma$) of a cell. When the difference between the actual value of the cell and the anticipated value is greater than $2.5\sigma$, the cell is selected as an exception.
Similar to the above approaches, our crime association method also focuses on the cells of the OLAP data cube. We define an outlier score function to measure the distinctiveness of a cell. Incidents contained in the same cell are determined to be associated with each other when the score is significant. The definitions of the outlier score function and the association method are given in Section 3.
3 Method
3.1 Rationale
The rationale of this method is as follows. Although theoretically the template (see Section 1) is unique for each serial criminal, the data collected by the police department does not contain every aspect of the template. Some observed parts of the templates are common, so we may see a large overlap in these common templates. The creators (criminals) of those common templates are not separable. Other templates are special. For these special templates, we are more confident in saying that the incidents come from the same criminal. For example, consider the weapon used in a robbery incident. We may observe many incidents with the value gun for weapon used. However, no crime analyst would say that the same person commits all these robberies, because gun is a common template shared by many criminals. If we observe several robberies with a Japanese sword, an uncommon template, we are more confident in asserting that these incidents result from the same criminal. (This Japanese sword claim was first proposed by Brown and Hagen [3].) In this paper, we describe an outlier score function to measure this distinctiveness of the template.
3.2 Definitions
In this section, we give the mathematical definitions used to build the outlier score function. Readers familiar with OLAP concepts will see that our notation derives from terms used in the OLAP field.
$A_1, A_2, \ldots, A_m$ are the $m$ attributes that we consider relevant to our study, and $D_1, D_2, \ldots, D_m$ are their respective domains. Currently, these attributes are confined to be categorical (categorical attributes like MO are important in crime association analysis). Let $z^{(i)}$ be the $i$-th incident, and $z^{(i)}.A_j$ be the value of the $j$-th attribute of incident $i$. $z^{(i)}$ can be represented as $z^{(i)} = (z_1^{(i)}, z_2^{(i)}, \ldots, z_m^{(i)})$, where $z_k^{(i)} = z^{(i)}.A_k \in D_k$, $k \in \{1, \ldots, m\}$. $Z$ is the set of all incidents.
Definition 1. Cell. Cell $c$ is a vector of the values of attributes with dimension $t$, where $t \le m$. A cell can be represented as $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$. In order to standardize the definition of a cell, for each $D_i$ we add a wildcard element *. Now we allow $D'_i = D_i \cup \{*\}$. A cell $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$ can then be written as $c = (c_1, c_2, \ldots, c_m)$, where $c_j \in D'_j$ and $c_j = *$ if and only if $j \notin \{i_1, i_2, \ldots, i_t\}$. $C$ denotes the set of all cells. Since each incident can also be treated as a cell, we define a function $\mathrm{Cell}: Z \to C$, with $\mathrm{Cell}(z) = (z_1, z_2, \ldots, z_m)$ if $z = (z_1, z_2, \ldots, z_m)$.
Definition 2. Contains relation. We say that cell $c = (c_1, c_2, \ldots, c_m)$ contains incident $z$ if and only if $z.A_j = c_j$ or $c_j = *$, for $j = 1, 2, \ldots, m$. For two cells, we say that cell $c = (c_1, c_2, \ldots, c_m)$ contains cell $c' = (c'_1, c'_2, \ldots, c'_m)$ if and only if $c'_k = c_k$ or $c_k = *$, for $k = 1, 2, \ldots, m$.
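The cell and contains definitions above can be illustrated with a minimal Python sketch. It assumes that incidents and cells are stored as equal-length tuples of categorical strings and that the string "*" plays the role of the wildcard; the names `WILDCARD`, `cell_of`, and `contains` are hypothetical and chosen only for this illustration.

```python
from typing import Sequence, Tuple

WILDCARD = "*"  # assumed encoding of the wildcard element added to every domain

Cell = Tuple[str, ...]


def cell_of(incident: Sequence[str]) -> Cell:
    """Treat an incident as a cell with no wildcard (the Cell function)."""
    return tuple(incident)


def contains(cell: Sequence[str], other: Sequence[str]) -> bool:
    """Contains relation: on every attribute, `cell` is '*' or equals `other`.

    `other` may be an incident or another cell written in full m-attribute form.
    """
    if len(cell) != len(other):
        raise ValueError("both arguments must list all m attributes")
    return all(c == WILDCARD or c == o for c, o in zip(cell, other))


# Two attributes: (weapon used, method of escape).
incident = cell_of(["sword", "car"])
sword_cell = ("sword", WILDCARD)      # any escape method
print(contains(sword_cell, incident))          # the sword cell contains the incident
print(contains(("gun", WILDCARD), incident))   # the gun cell does not
```

A cell that fixes fewer attributes contains every cell that agrees with it on those attributes, so the all-wildcard cell contains everything.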
Definition 5. Neighborhood. $P$ is called the neighborhood of cell $c$ on the $k$-th attribute when $P$ is a set of cells that take the same values as cell $c$ in all attributes but $k$, and do not take the wildcard value * on the $k$-th attribute, i.e., $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$, where $c_j^{(i)} = c_j$ for all $j \ne k$ and $c_k^{(i)} \ne *$, for $i = 1, \ldots, |P|$. We write $\mathrm{neighbor}(c, k)$ for the neighborhood of cell $c$ on attribute $k$. (In the OLAP field, the neighborhood is sometimes called siblings.)
Definition 6. Relative frequency. We call $\mathrm{freq}(c, k)$ the relative frequency of cell $c$ with respect to attribute $k$.
Definition 7. Uncertainty function. We use function $U$ to measure the uncertainty of a neighborhood. This uncertainty measure is defined on the relative frequencies. If we use $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ to denote the neighborhood of cell $c$ on attribute $k$, then
$$U(c, k) = U\big(\mathrm{freq}(c^{(1)}, k), \mathrm{freq}(c^{(2)}, k), \ldots, \mathrm{freq}(c^{(|P|)}, k)\big).$$
Obviously, $U$ should be symmetric for all $c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}$. $U$ takes a smaller value if the uncertainty in the neighborhood is low. One candidate uncertainty function is entropy, which comes from information theory:
$$U(c, k) = H(c, k) = -\sum_{c' \in \mathrm{neighbor}(c, k)} \mathrm{freq}(c', k) \log(\mathrm{freq}(c', k)).$$
For $\mathrm{freq} = 0$, we define $0 \cdot \log(0) = 0$, as is common in information theory.
3.3 Outlier Score Function (OSF) and the Crime Association Method
Our goal is to build a function that measures the confidence, or significance level, of associating crimes. This function is built over OLAP cube cells. We start by analyzing the requirements that it needs to satisfy. Consider the following three scenarios:
I. We have 100 robberies. Five take the value Japanese sword for the weapon used attribute, and 95 take gun. Obviously, the 5 Japanese sword incidents are of more interest than the 95 gun incidents.
II. Now we add another attribute: method of escape. Assume this attribute has 20 different values (by car, by foot, etc.), each with 5 incidents. Although both Japanese sword and by car have 5 incidents, they should not be treated equally. Japanese sword highlights itself because all other incidents are guns; in other words, the uncertainty level of the weapon used attribute is smaller.
III. If some incidents take Japanese sword on the weapon used attribute and by car on the method of escape attribute, then the combination of Japanese sword and by car is more significant than either Japanese sword alone or by car alone, because we have more evidence.
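Now we define function $f$ as follows:
$$f(c) = \begin{cases} \max\limits_{k:\ c_k \ne *} \left( f(\mathrm{parent}(c, k)) + \dfrac{-\log(\mathrm{freq}(c, k))}{H(c, k)} \right), & c \ne (*, *, \ldots, *) \\ 0, & c = (*, *, \ldots, *) \end{cases} \qquad (1)$$
where $k$ ranges over all non-* dimensions of $c$. When $H(c, k) = 0$, we say $\dfrac{-\log(\mathrm{freq}(c, k))}{H(c, k)} = 0$.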
It is simple to verify that $f$ satisfies the above three requirements. We call $f$ the outlier score function. (The term outlier is commonly used in the field of statistics. Outliers are observations that are significantly different from other observations and are possibly generated by a unique mechanism [11].) Based on the outlier score function, we give the following rule to associate criminal incidents: given a pair of incidents, if there exists a cell containing both incidents, and the outlier score of the cell is greater than some threshold value, we say that these two incidents are associated with each other. This association method is called the OLAP-outlier-based association method, or the outlier-based method for short.
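The outlier score function and the association rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that the count of a cell is the number of incidents it contains, that parent(c, k) replaces the k-th value of c with the wildcard, and that freq(c, k) is the count of c divided by the count of parent(c, k). The sign convention is also an assumption: the rarity term is taken as the negative log of the frequency so that scores are non-negative. The helper names are hypothetical. Because the same logarithm appears in the numerator and in the entropy, the choice of base does not change the score.

```python
import math
from functools import lru_cache

WILDCARD = "*"


def contains(cell, incident):
    """Contains relation: each attribute of `cell` is '*' or equals the incident's value."""
    return all(c == WILDCARD or c == v for c, v in zip(cell, incident))


def make_outlier_score(incidents):
    """Build f(cell) for a fixed collection of incidents (tuples of categorical values)."""
    data = [tuple(z) for z in incidents]

    def count(cell):            # assumed: number of incidents contained in the cell
        return sum(contains(cell, z) for z in data)

    def parent(cell, k):        # assumed: roll attribute k up to the wildcard
        return cell[:k] + (WILDCARD,) + cell[k + 1:]

    def freq(cell, k):          # assumed: share of the parent's incidents that fall in the cell
        return count(cell) / count(parent(cell, k))

    def entropy(cell, k):
        """H(c, k) over the neighborhood of `cell` on attribute k."""
        par = parent(cell, k)
        observed = {z[k] for z in data if contains(par, z)}
        h = 0.0
        for value in observed:  # unobserved values would contribute 0 * log(0) = 0
            p = freq(par[:k] + (value,) + par[k + 1:], k)
            h -= p * math.log(p)
        return h

    @lru_cache(maxsize=None)
    def f(cell):
        if count(cell) == 0:
            raise ValueError("cell contains no incidents")
        best = 0.0              # score of the all-wildcard cell
        for k, value in enumerate(cell):
            if value == WILDCARD:
                continue
            h = entropy(cell, k)
            term = 0.0 if h == 0 else -math.log(freq(cell, k)) / h
            best = max(best, f(parent(cell, k)) + term)
        return best

    return f


def associated(z1, z2, f, threshold):
    """Association rule: some cell containing both incidents scores above the threshold."""
    # Every term added in f is non-negative, so the score never decreases on drill-down.
    # The cell that keeps every attribute on which the two incidents agree therefore has
    # the highest score among all cells containing both, and checking it is sufficient.
    cell = tuple(a if a == b else WILDCARD for a, b in zip(z1, z2))
    return f(cell) > threshold


# Scenario data: 100 robberies, 5 with a Japanese sword, 20 escape methods with 5 each.
SWORD_IDS = {0, 1, 5, 10, 15}
incidents = [("sword" if i in SWORD_IDS else "gun", f"escape{i // 5}") for i in range(100)]
f = make_outlier_score(incidents)
for cell in [("gun", "*"), ("*", "escape0"), ("sword", "*"), ("sword", "escape0")]:
    print(cell, round(f(cell), 2))
```

On this toy data the scores come out at about 0.26 for gun, 1.0 for a single escape method, about 15.1 for Japanese sword, and about 15.8 for Japanese sword combined with one escape method, which matches the ordering required by scenarios I to III.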
4 Application
We applied this criminal incident association method to a real-world dataset. The dataset contained information on robbery incidents that occurred in Richmond, Virginia in 1998. The dataset consisted of two parts: the incident dataset and the suspect dataset. The incident dataset had 1198 records, and the temporal, spatial, and MO information was stored in the incident database. The name (if known), height, and weight of the suspect were recorded in the suspect database. We applied our method to the incident dataset and used the suspect dataset for verification. Robbery was selected for two reasons: first, compared with violent crimes such as murder or sexual attack, serial robberies were more common; second, compared with breaking and entering crimes, more robbery incidents were solved (criminal arrested) or partially solved (the suspect's name is known). These two points made robbery favorable for evaluation purposes.
4.1 Attribute Selection
We used three types of attributes in our analysis. The first set of attributes consisted of MO features. MO was primarily considered in crime association analysis. Six MO attributes were picked. The second set of attributes was census attributes (the census data was obtained directly from the census CD held in the library of the University of Virginia). Census data represented the spatial characteristics of the location where the criminal incident occurred, and it might help to reveal the spatial aspect of the criminals' templates. For example, some criminals preferred to attack high-income areas. Lastly, we chose some distance attributes. They were distances from the incident location to spatial landmarks such as a major highway or a church. Distance features were also important in analyzing criminals' behaviors. For example, a criminal might prefer to initiate an attack within a certain distance range from a major highway so that the offense could not be observed during the attack, and he or she could leave the crime scene as soon as possible after the attack. There were a total of 5 distances. The names of all attributes and their descriptions are given in Appendix I. They have also been used in a previous study on predicting breaking and entering crimes by Brown et al. [4].
An attribute selection was performed on all numerical attributes (census and distance attributes) before using the association method. The reason was that some attributes were redundant. These redundant attributes were unfavorable to the association algorithm in terms of both accuracy and efficiency. We adopted a feature-selection-by-clustering methodology to pick the attributes. According to this method, we used the correlation coefficient to measure how similar or close two attributes were, and then we clustered the attributes into a number of groups according to this similarity measure. The attributes in the same group were similar to each other, and were quite different from attributes in other groups. For each group, we picked a representative. The final set of all representative attributes was considered to capture the major characteristics of the dataset. A similar methodology was used by Mitra et al. [16].
We picked the k-medoid clustering algorithm. (For more details about the k-medoid algorithm and other clustering algorithms, see [8].) The reason was that the k-medoid method works on a similarity / distance matrix (some other methods only work on coordinate data), and it tends to return spherical clusters. In addition, k-medoid returns a medoid for each cluster, based upon which we could select the representative attributes. After making a few slight adjustments and checking the silhouette plot [15], we finally got three clusters, as given in Fig. 1. The algorithm returned three medoids: HUNT_DST (housing unit density), ENRL3_DST (public school enrollment density), and TRAN_PC (expenses on transportation: per capita). We made some adjustments here. We replaced ENRL3_DST with another attribute POP3_DST (population density: age 12-17). The attackers and victims. For similar reasons, we replaced TRAN_PC with MHINC (median household income).
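The feature-selection-by-clustering idea can be sketched as follows. This is a toy illustration, not the procedure actually run on the census data: it assumes the distance between two attributes is 1 - |r|, where r is the Pearson correlation coefficient, and it uses a simple alternating k-medoids loop with a naive start rather than any particular published k-medoid variant. All function names and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(columns):
    """Pairwise distance 1 - |r| between attributes (an assumed conversion)."""
    n = len(columns)
    return [[0.0 if i == j else 1.0 - abs(pearson(columns[i], columns[j]))
             for j in range(n)] for i in range(n)]


def assign(dist, medoids):
    """Label each attribute with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a distance matrix; returns (medoids, labels)."""
    medoids = list(range(k))  # naive start: the first k attributes
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:   # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


# Toy attributes: three uncorrelated patterns plus a correlated variant of each.
a = [1, 1, 1, 1, -1, -1, -1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, -1, 1, -1, 1, -1, 1, -1]
names = ["a", "a2", "b", "b2", "c", "c2"]
columns = [a, [x + 0.5 * y for x, y in zip(a, b)],
           b, [x + 0.5 * y for x, y in zip(b, c)],
           c, [x + 0.5 * y for x, y in zip(c, a)]]
medoids, labels = k_medoids(correlation_distance(columns), k=3)
representatives = [names[m] for m in medoids]
print(representatives, labels)
```

Each cluster contributes its medoid as the representative attribute. As in the text, an analyst may still swap a medoid for another member of the same cluster on domain grounds.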
There were a total of 9 attributes used in our analysis: 6 MO attributes (categorical) and 3 numerical attributes picked by applying the attribute selection procedure. Since our method was developed for categorical attributes, we converted the numerical attributes to categorical ones by dividing them into 11 equally sized bins. The number of bins was determined by Sturges' number-of-bins rule [19][20].
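As a quick check of the bin count, Sturges' rule suggests about 1 + log2(n) bins for n observations. The sketch below assumes that the result is rounded to the nearest integer and that "equally sized" means equal-width bins; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def sturges_bins(n):
    """Sturges' rule, 1 + log2(n), rounded to the nearest integer (assumed rounding)."""
    return round(1 + math.log2(n))


def equal_width_bin(values, k):
    """Map each numeric value to a bin index 0..k-1 using k equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: everything falls in bin 0
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


k = sturges_bins(1198)                # 1198 incident records
print(k)
print(equal_width_bin([0, 5, 10], 2))
```

With the 1198 incident records, 1 + log2(1198) is roughly 11.2, which is consistent with the 11 bins used here.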
4.2 Evaluation Criteria
We wanted to evaluate whether the associations determined by our method corresponded to the true result. The information in the suspect database was considered as the true result. 170 incidents with the names of the suspects were used for evaluation. We generated all incident pairs. If the two incidents in a pair had suspects with the same name and date of birth, we said that the true result for this incident pair was a true association. There were 33 true associations. We used two measures to evaluate our method. The first measure was called detected true associations. We expected that the association method would be able to detect a large portion of the true associations. The second measure was called average number of relevant records. This measure was built on an analogy with search engines. Consider a search engine such as Google. For each search string we give, it returns a list of documents considered relevant to the search criterion. Similarly, for the crime association problem, if we give an incident, the algorithm will return a list of records that are considered associated with the given incident. A shorter list is always preferred in both cases. The average length of the lists provided the second measure, and we called it the average number of relevant records. The algorithm is more accurate when this measure has a smaller
value. In the information retrieval area [17], two commonly used criteria in evaluating a retrieval system are recall and precision. The former is the ability of a system to present relevant items, and the latter is the ability to present only the relevant items. Our first measure was a recall measure, and our second measure was equivalent to a precision measure. These two measures are not specific to our approach; they can be used to evaluate any association algorithm. Therefore, we can use them to compare the performance of different association methods.
4.3 Result and Comparison
Different threshold values were set to test our method. Obviously, if we set the threshold to 0, we would expect the method to detect all true associations and the average number of relevant records to be 169 (given 170 incidents for evaluation). If we set the threshold to infinity, we would expect the method to return 0 for both detected true associations and average number of relevant records. As the threshold increased, we expected a decrease in both the number of detected true associations and the average number of relevant records. The result is given in Table 1.
Table 1. Result of outlier-based method
Threshold    Avg. number of relevant records
0            169.00
1            121.04
2            62.54
3            28.38
4            13.96
5            7.51
6            4.25
7            2.29
∞            0.00
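The two evaluation measures from Section 4.2 can be computed with a short routine such as the sketch below. It assumes incidents are referred to by hashable identifiers, that the true associations are given as a set of unordered pairs, and that the association method is supplied as a symmetric predicate; the names `evaluate`, `true_pairs`, and `is_associated` are hypothetical.

```python
from itertools import combinations


def evaluate(incident_ids, true_pairs, is_associated):
    """Return (detected true associations, average number of relevant records).

    true_pairs    -- set of frozenset({a, b}) pairs known to share a suspect
    is_associated -- predicate (a, b) -> bool implementing the association method
    """
    detected = 0
    list_length = {i: 0 for i in incident_ids}   # records returned for each incident
    for a, b in combinations(incident_ids, 2):
        if is_associated(a, b):
            list_length[a] += 1
            list_length[b] += 1
            if frozenset((a, b)) in true_pairs:
                detected += 1
    average = sum(list_length.values()) / len(incident_ids)
    return detected, average


# Sanity check with the two extreme cases for 170 evaluation incidents.
ids = list(range(170))
truth = {frozenset((0, 1))}
print(evaluate(ids, truth, lambda a, b: True))    # everything associated
print(evaluate(ids, truth, lambda a, b: False))   # nothing associated
```

Associating every pair gives an average list length of 169 and associating none gives 0, matching the first and last rows of Table 1.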
We compared this outlier-based method with a similarity-based crime association method. The similarity-based method was proposed by Brown and Hagen [3]. Given a pair of incidents, the similarity-based method first calculates a similarity score for each attribute, and then computes a total similarity score as the weighted average of all individual similarity scores. The total similarity score is used to determine whether the incidents are associated. Using the same evaluation criteria, the result of the similarity-based method is given in Table 2.
Table 2. Result of similarity-based method
Avg. number of relevant records: 169.00, 112.98, 80.05, 45.52, 19.38, 3.97, 0.00
If we set the average number of relevant records as the X-axis and the detected true associations as the Y-axis, the comparison can be illustrated as in Fig. 2. In Fig. 2, the outlier-based method lies above the similarity-based method in most cases. That means that, given the same accuracy (detected true associations) level, the outlier-based method returns fewer relevant records. Also, if we keep the same number of relevant records (average length of the returned list) for both methods, the outlier-based method is more accurate. The curve of the similarity-based method sits slightly above that of the outlier-based method when the average number of relevant records is above 100. Since the size of the evaluation incident set is 170, no crime analyst would consider further investigating any set of over 100 incidents. The outlier-based method is therefore generally more effective.
Fig. 2. Detected associations (Y-axis) against average number of relevant records (X-axis) for the similarity-based and outlier-based methods
5 Conclusion
In this paper, an OLAP-outlier-based method is introduced to solve the crime association problem. The criminal incidents are modeled into an OLAP cube and an outlier-score function is defined over the cube cells. The incidents contained in the
cell are determined to be associated with each other when the outlier score is large enough. The method was applied to a robbery dataset and results show that this method can provide significant improvements for crime analysts who need to link incidents in large databases.
References
1. Badiru, A.B., Karasz, J.M. and Holloway, B.T., AREST: Armed Robbery Eidetic Suspect Typing Expert System, Journal of Police Science and Administration, 16, 210-216 (1988)
2. Brantingham, P.J. and Brantingham, P.L., Patterns in Crimes, New York: Macmillan (1984)
3. Brown, D.E. and Hagen, S.C., Data Association Methods with Applications to Law Enforcement, Decision Support Systems, 34, 369-378 (2003)
4. Brown, D.E., Liu, H. and Xue, Y., Mining Preference from Spatial-temporal Data, Proc. of the First SIAM International Conference of Data Mining (2001)
5. Clarke, R.V. and Cornish, D.B., Modeling Offenders' Decisions: A Framework for Research and Policy, Crime and Justice: An Annual Review of Research, Vol. 6, Ed. by Tonry, M. and Morris, N., University of Chicago Press (1985)
6. Chaudhuri, S. and Dayal, U., An Overview of Data Warehousing and OLAP Technology, ACM SIGMOD Record, 26 (1997)
7. Dong, G., Han, J., Lam, J., Pei, J., and Wang, K., Mining Multi-Dimensional Constrained Gradients in Data Cubes, Proc. of the 27th VLDB Conference, Roma, Italy (2001)
8. Everitt, B., Cluster Analysis, John Wiley & Sons, Inc. (1993)
9. Felson, M., Routine Activities and Crime Prevention in the Developing Metropolis, Criminology, 25, 911-931 (1987)
10. Hauck, R., Atabakhsh, H., Onguasith, P., Gupta, H., and Chen, H., Using Coplink to Analyse Criminal-Justice Data, IEEE Computer, 35, 30-37 (2002)
11. Hawkins, D., Identification of Outliers, Chapman and Hall, London (1980)
12. Heck, R.O., Career Criminal Apprehension Program: Annual Report (Sacramento, CA: Office of Criminal Justice Planning) (1991)
13. Icove, D.J., Automated Crime Profiling, Law Enforcement Bulletin, 55, 27-30 (1986)
14. Imielinski, T., Khachiyan, L., and Abdul-ghani, A., Cubegrades: Generalizing association rules, Technical report, Dept. Computer Science, Rutgers Univ., Aug. (2000)
15. Kaufman, L. and Rousseeuw, P., Finding Groups in Data, Wiley (1990)
16. Mitra, P., Murthy, C.A., and Pal, S.K., Unsupervised Feature Selection Using Feature Similarity, IEEE Trans. on Pattern Analysis and Machine Intelligence, 24, 301-312 (2002)
17. Salton, G. and McGill, M., Introduction to Modern Information Retrieval, McGraw-Hill Book Company, New York (1983)
18. Sarawagi, S., Agrawal, R., and Megiddo, N., Discovery-driven exploration of OLAP data cubes, Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain (1998)
19. Scott, D., Multivariate Density Estimation: Theory, Practice and Visualization, New York, NY: Wiley (1992)
20. Sturges, H.A., The Choice of a Class Interval, Journal of the American Statistical Association, 21, 65-66 (1926)
Appendix
I. Attributes used in the analysis
(a) MO attributes
Name          Description
Rsus_Acts     Actions taken by the suspects
R_Threats     Method used by the suspects to threaten the victim
R_Force       Actions that suspects force the victim to do
Rvic_Loc      Location type of the victim when robbery was committed
Method_Esc    Method of escaping the scene
Premise       Premise to commit the crime
(b) Census attributes
Attribute name    Description
General
POP_DST           Population density (density means that the statistic is divided by the area)
HH_DST            Household density
FAM_DST           Family density
MALE_DST          Male population density
FEM_DST           Female population density
Race
RACE1_DST         White population density
RACE2_DST         Black population density
RACE3_DST         American Indian population density
RACE4_DST         Asian population density
RACE5_DST         Other population density
HISP_DST          Hispanic origin population density
Population Age
POP1_DST          Population density (0-5 years)
POP2_DST          Population density (6-11 years)
POP3_DST          Population density (12-17 years)
POP4_DST          Population density (18-24 years)
POP5_DST          Population density (25-34 years)
POP6_DST          Population density (35-44 years)
POP7_DST          Population density (45-54 years)
POP8_DST          Population density (55-64 years)
POP9_DST          Population density (65-74 years)
POP10_DST         Population density (over 75 years)
Householder Age
AGEH1_DST         Density: age of householder under 25 years
AGEH2_DST         Density: age of householder 25-34 years
AGEH3_DST         Density: age of householder 35-44 years
AGEH4_DST         Density: age of householder 45-54 years
AGEH5_DST         Density: age of householder 55-64 years
AGEH6_DST         Density: age of householder over 65 years
Household Size
PPH1_DST          Density: 1 person households
PPH2_DST          Density: 2 person households
PPH3_DST          Density: 3-5 person households
PPH6_DST          Density: 6 or more person households
Housing, misc.
HUNT_DST          Housing units density
OCCHU_DST         Occupied housing units density
VACHU_DST         Vacant housing units density
MORT1_DST         Density: owner occupied housing unit with mortgage
MORT2_DST         Density: owner occupied housing unit without mortgage
COND1_DST         Density: owner occupied condominiums
OWN_DST           Density: housing unit occupied by owner
RENT_DST          Density: housing unit occupied by renter
Housing Structure
HSTR1_DST         Density: occupied structure with 1 unit detached
HSTR2_DST         Density: occupied structure with 1 unit attached
HSTR3_DST         Density: occupied structure with 2 units
HSTR4_DST         Density: occupied structure with 3-9 units
HSTR6_DST         Density: occupied structure with 10+ units
HSTR9_DST         Density: occupied structure trailer
HSTR10_DST        Density: occupied structure other
Income
PCINC_97
MHINC_97          Median household income
AHINC_97
School Enrollment
ENRL1_DST         School enrollment density: public preprimary
ENRL2_DST         School enrollment density: private preprimary
ENRL3_DST         School enrollment density: public school
ENRL4_DST         School enrollment density: private school
ENRL5_DST         School enrollment density: public college
ENRL6_DST         School enrollment density: private college
ENRL7_DST         School enrollment density: not enrolled in school
Work Force
CLS1_DST          Density: private for profit wage and salary worker
CLS2_DST          Density: private for non-profit wage and salary worker
CLS3_DST          Density: local government workers
CLS4_DST          Density: state government workers
CLS5_DST          Density: federal government workers
CLS6_DST          Density: self-employed workers
CLS7_DST          Density: unpaid family workers
Consumer Expenditures
ALC_TOB_PH        Expenses on alcohol and tobacco: per household
APPAREL_PH        Expenses on apparel: per household
EDU_PH            Expenses on education: per household
ET_PH             Expenses on entertainment: per household
FOOD_PH           Expenses on food: per household
MED_PH            Expenses on medicine and health: per household
HOUSING_PH        Expenses on housing: per household
PCARE_PH          Expenses on personal care: per household
REA_PH            Expenses on reading: per household
TRANS_PH          Expenses on transportation: per household
ALC_TOB_PC        Expenses on alcohol and tobacco: per capita
APPAREL_PC        Expenses on apparel: per capita
EDU_PC            Expenses on education: per capita
ET_PC             Expenses on entertainment: per capita
FOOD_PC           Expenses on food: per capita
MED_PC            Expenses on medicine and health: per capita
HOUSING_PC        Expenses on housing: per capita
PCARE_PC          Expenses on personal care: per capita
REA_PC            Expenses on reading: per capita
TRANS_PC          Expenses on transportation: per capita
(c) Distance attributes
Name          Description
D_Church      Distance to the nearest church
D_Hospital    Distance to the nearest hospital
D_Highway     Distance to the nearest highway
D_Park        Distance to the nearest park
D_School      Distance to the nearest school