Crime Prediction Using Data Mining
Abstract. This paper uses data mining and machine learning to predict crime in
YD county. The aim of the study is to show the pattern and rate of crime in YD
county based on the collected data, and to show the relationships that exist
among the various crime types and crime variables. Analyzing this data set can
provide insight into criminal activity within YD county. We introduce the
formulas and methods of Bayesian networks, random trees and neural networks
from machine learning and big data, and use them to analyze the crime rules in
the collected data. According to the statistics released by YD county for
2012-09-01 to 2015-07-21, the ten crime types with the highest number of cases
were: the crime of smuggling, selling, transporting and manufacturing drugs;
theft; intentional injury; illegal business crime; illegal possession of drugs;
rape; fraud; gang fighting; manslaughter; and robbery. Drug crime had the
highest share, at 46.86%. Farmers made up the majority of offenders, accounting
for 97.07%, and people under the age of 35 were the main offender group. Males
accounted for 90.17% of crimes committed, while females accounted for 9.83%.
By ethnic group, the top five were Han, Yi, Wa, Dai and Lang, accounting for
68.43%, 23.43%, 1.88%, 1.67% and 1.25% respectively. Using random forest,
Bayesian network and neural network methods, we obtained decision rules for the
criminal variables. In comparison, Random Trees classified better than Neural
Networks and Bayesian Networks. Across the results of the three algorithms, the
random tree algorithm proved valid and accurate in predicting crime data. The
Bayesian network algorithm performed relatively poorly, probably because of
random factors in the various crimes and related features (the correlation
between the three algorithms is low).
1 Introduction
For most people, machine learning (ML) is still a mysterious field that sounds
complicated and is difficult to explain to someone without a technical background [1].
It is nevertheless very important today and will remain so over the next few years.
County, population distribution by age, number of violent crimes committed, and the
rate of those crimes are also features that have been incorporated into the test data to
conduct analysis.
The rest of the paper is organized as follows: Sect. 2 gives an overview of data
mining and machine learning. Section 3 provides information about the Crime
Classification in YD County. Section 4 presents the results from each of the
algorithms, and Sect. 5 concludes with the findings and a discussion of the results.
Where c is a reflected Brownian motion. The convergence holds in the sense of weak
convergence on the Skorokhod space D(R+; R+).
In their papers, they introduce the exit measure from a domain D, which is in a
sense uniformly spread over the set of exit points of the Brownian snake paths from D.
They then derive the key integral equation (Theorem 2.2) for the Laplace functional of
the exit measure. In the particular case when the underlying spatial motion $\xi$ is
d-dimensional Brownian motion, this quickly leads to the connection between the
Brownian snake and the semilinear PDE (partial differential equation) $\Delta u = u^2$.
Theorem 2.2: Let g be a nonnegative bounded measurable function on E. For every
$x \in E$, set
$$u(x) = \mathbb{N}_x\left(1 - \exp\left(-\langle Z^D, g\rangle\right)\right), \quad x \in D \tag{2.3}$$
where $r_i$ is the number of possible instantiations of $X_i$, and $q_i$ is the number of unique
instantiations of $\pi_i$. $N_{ijk}$ is the number of cases in T. The BN $(K, \Theta)$ represents a
factorization of the joint probability over a discrete sample space,
$$p(X) = p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \pi_i) \tag{2.6}$$
for which all probabilities on the right-hand side are given by the CPTs. Therefore,
when a variable $X_i$ is unknown or hidden, Bayes’ rule of inference can be used to
calculate the posterior probability distribution of $X_i$ given evidence of the set of l
variables that are conditionally dependent on $X_i$,
Bayesian networks are particularly well suited for crime analysis, as they learn from
data and use the experience of criminologists to select nodes and node sequencing. The
confidence level provided for the criminal files informs the detective about the possible
accuracy of each prediction. In addition, BN’s graphical structure represents the most
important relationship between criminal behavior and crime scene behavior, which may
help develop new scientific assumptions about criminal behavior.
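The factorization in (2.6) and the use of Bayes' rule for a hidden variable can be illustrated with a minimal sketch. The two-node network X1 -> X2, the state names, and every probability below are made-up illustrative values, not quantities estimated from the YD county data.

```python
# Toy Bayesian network X1 -> X2. All numbers are illustrative assumptions,
# not values estimated from the YD county data.
p_x1 = {"a": 0.3, "b": 0.7}              # prior p(X1)
p_x2_given_x1 = {                        # conditional probability table p(X2 | X1)
    "a": {"u": 0.9, "v": 0.1},
    "b": {"u": 0.2, "v": 0.8},
}


def joint(x1, x2):
    """Joint probability from the factorization p(X1, X2) = p(X1) * p(X2 | X1)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2]


def posterior_x1(x2):
    """Posterior p(X1 | X2 = x2) by Bayes' rule: joint divided by the evidence."""
    evidence = sum(joint(x1, x2) for x1 in p_x1)
    return {x1: joint(x1, x2) / evidence for x1 in p_x1}


post = posterior_x1("u")
for state, prob in post.items():
    print(f"p(X1={state} | X2=u) = {prob:.4f}")
```

With more nodes, the same pattern applies: each factor in the product is read from one CPT, and the posterior of a hidden node is the joint restricted to the evidence, renormalized.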
Neural Networks. Artificial neural networks (ANNs) have been developed as
generalizations of mathematical models of biological nervous systems. The basic
processing elements of a neural network are called artificial neurons, or simply neurons
or nodes. The neuron impulse is computed as the weighted sum of the input signals,
transformed by the transfer function. The learning ability of an artificial neuron is
realized by adjusting the weights according to the selected learning algorithm [12].
Architectures: An ANN consists of a set of processing elements, also known as
neurons or nodes, which are interconnected. It can be described as a directed graph in
which each node performs a transfer function of the form
Crime Prediction Using Data Mining and Machine Learning 365
$$y_i = f_i\left(\sum_{j=1}^{n} w_{ij} x_j - \theta_i\right) \tag{2.8}$$
where $y_i$ is the output of node i, $x_j$ is the jth input to the node, and $w_{ij}$ is the
connection weight between nodes i and j. $\theta_i$ is the threshold (or bias) of the node.
Usually, $f_i$ is nonlinear, such as a Heaviside, sigmoid, or Gaussian function.
In (2.8), each term in the summation involves only one input $x_j$. High-order ANNs
are those that contain high-order nodes, i.e. nodes in which more than one input is
involved in some of the terms of the summation. For example, a second-order node can
be described as
$$y_i = f_i\left(\sum_{j,k=1}^{n} w_{ijk} x_j x_k - \theta_i\right) \tag{2.9}$$
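The node equations (2.8) and (2.9) can be sketched in a few lines. The function names and the choice of a sigmoid as the default transfer function are assumptions made for illustration; the paper does not state which transfer function was used.

```python
import math


def sigmoid(z):
    """Logistic transfer function, one common nonlinear choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def first_order_node(weights, inputs, theta, f=sigmoid):
    """First-order node: y = f(sum_j w_j * x_j - theta), as in (2.8)."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return f(s - theta)


def second_order_node(weights, inputs, theta, f=sigmoid):
    """Second-order node: y = f(sum_{j,k} w_jk * x_j * x_k - theta), as in (2.9).

    `weights` is an n-by-n nested list indexed as weights[j][k].
    """
    s = sum(
        weights[j][k] * xj * xk
        for j, xj in enumerate(inputs)
        for k, xk in enumerate(inputs)
    )
    return f(s - theta)


print(first_order_node([1.0, -1.0], [2.0, 2.0], theta=0.0))
print(second_order_node([[0.0, 1.0], [0.0, 0.0]], [2.0, 3.0], theta=6.0))
```

The only difference between the two nodes is the summation: the second-order node weights products of input pairs, so it needs n² weights instead of n.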
As noted previously, this project is based on a National Social Science Foundation
project on the causes and countermeasures of ethnic minority crime in YD county.
This section briefly gives the principles and formulas of Bayesian networks,
random trees and neural networks.
ids that contain a list of all categories and the probability that each sample belongs to
each category. Recall that the training dataset labels the crime type of every sample
(10 different types).
Then, instead of predicting a single category for a given sample, the output
is always a probability vector over the categories.
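A probability-vector output of this kind can be sketched as follows. The ten category names are taken from the crime types listed in this paper; the idea of normalizing per-category votes (for example, from the trees of an ensemble) is an assumption made for illustration, not a description of the software used in the study.

```python
# The ten crime types named in the paper, used as output categories.
CRIME_TYPES = [
    "Smuggling, selling, transporting and manufacturing drugs",
    "Theft",
    "Intentional injury",
    "Illegal business crime",
    "Illegal possession of drugs",
    "Rape",
    "Fraud",
    "Gang fighting",
    "Manslaughter",
    "Robbery",
]


def probability_vector(votes):
    """Turn non-negative per-category votes into one probability per category.

    `votes` maps a category name to a count (hypothetical input format).
    Categories without votes get probability 0; if there are no votes at
    all, the function falls back to a uniform distribution.
    """
    total = sum(votes.get(c, 0) for c in CRIME_TYPES)
    if total == 0:
        return {c: 1.0 / len(CRIME_TYPES) for c in CRIME_TYPES}
    return {c: votes.get(c, 0) / total for c in CRIME_TYPES}


vec = probability_vector({"Theft": 3, "Rape": 1})
most_likely = max(vec, key=vec.get)
print(most_likely, vec[most_likely])
```

A single predicted label can still be recovered from the vector by taking the category with the largest probability, as the last two lines show.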
Data
The data in this article cover the reported cases of the following crimes in YD county
between 2012 and 2015: the crime of smuggling, selling, transporting and manufacturing
drugs; theft; intentional injury; illegal business crime; illegal possession of drugs;
rape; fraud; gang fighting; manslaughter; and robbery. A summary of the data is
provided in Table 1.
The data provide insight into criminal activity, and studying them can help reduce
crime (protecting communities) and support decision-making. Part of the analysis
provided can be used to explain the relationship between certain criminal activities.
The data can be analyzed further using other methods such as Random Trees,
Bayesian networks, quasi-neural networks and so on.
Table 1 and Fig. 3 show that the crime of smuggling, selling, transporting and
manufacturing drugs is the most frequent crime type, followed by theft, intentional
injury and so on.
Data Analysis
The dataset has several “features”, each of different relevance. In this section we
analyze the dataset and extract the useful information from it. There are 478 samples
in the crime analysis, collected in YD from 2012-09-01 to 2015-07-21.
[Figure 1: number of crimes for each day of the week, Sunday through Saturday; vertical axis from 0 to 200.]
Another useful analysis is to count the number of crimes that occur on each day
of the week, to see whether the day carries relevant information. Figure 1 shows that
Friday is the day on which offenders act most often, while the counts for the other
days of the week vary.
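Counting cases per weekday can be sketched as below. The list of dates is a hypothetical stand-in for the date field of the case records (the paper does not give the field name), and the three sample dates are chosen only because they fall inside the study period.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def crimes_per_weekday(dates):
    """Return a dict mapping every weekday name to the number of cases on it."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in dates)
    return {day: counts.get(day, 0) for day in WEEKDAYS}


# Hypothetical sample dates inside the study period (not real case records).
sample = [date(2012, 9, 1), date(2015, 7, 17), date(2015, 7, 21)]
counts = crimes_per_weekday(sample)
busiest = max(counts, key=counts.get)
print(counts)
```

Returning a count for every weekday, including zeros, keeps the result directly usable as the data for a bar chart such as Fig. 1.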
We chose the top eight of the crime categories as the basis for this analysis and
discussion.
[Figure 2: numbers of crime (vertical axis from 0 to 250) for the top eight crime types: the crime of selling drugs, theft, intentional injury, illegal business crime, illegal possession of drugs, rape, crime of fraud, gang fighting.]
Figure 2 shows that the main crime types are the crime of smuggling, selling,
transporting and manufacturing drugs (224 cases), theft (86), intentional injury (85)
and illegal business crime (23).
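The percentage shares quoted in this paper follow directly from these counts and the 478 samples. A minimal check, using only numbers stated in the text (the dictionary keys are shortened labels chosen for this sketch):

```python
TOTAL_CASES = 478  # number of samples reported in the paper

# Case counts for the four main crime types, as reported for Fig. 2.
case_counts = {
    "Drugs (smuggling, selling, transporting, manufacturing)": 224,
    "Theft": 86,
    "Intentional injury": 85,
    "Illegal business crime": 23,
}


def share(count, total=TOTAL_CASES):
    """Percentage share of `count` in `total`, rounded to two decimals."""
    return round(100.0 * count / total, 2)


shares = {crime: share(n) for crime, n in case_counts.items()}
for crime, pct in shares.items():
    print(f"{crime}: {pct}%")
```

The first three values reproduce the 46.86%, 17.99% and 17.78% reported for drug crime, theft and intentional injury.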
4 Results
The data set selected in this study is a denormalized data set for community and crime.
It combines social and economic data from 2012-09-01 to 2015-07-21, the law
enforcement data of the YD County People’s Procuratorate, and the crime data from
2012-09-01 to 2015-07-21. It contains 478 criminal cases reported from the entire
YD County and 12 attributes in total, often referred to as features.
368 S. Wu et al.
This section describes the implementation results of the random tree, Bayesian
network and neural network algorithms. Each algorithm is run to predict the following
crime types: smuggling, selling, transporting and manufacturing drugs; illegal
possession of drugs; rape; theft; manslaughter; robbery; intentional injury;
illegal business crime; fraud; and gang fighting.
4.2 Relation Between Sex (or Gender), Ethnicity, Age, Education and Crime
Table 2 and Fig. 4 show the relationship between sex and the number of crimes.
In both, male offenders outnumber female offenders.
[Figure 4: numbers of crime for males and females by year, 2012 to 2015.]
For the 478 samples, the relationship between age and the number of drug crimes is
shown in Fig. 5. People aged 16 to 35, who account for about half of the total sample,
are the main criminal group.
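Grouping offenders by age can be sketched as follows, using the eight age groups on the horizontal axis of Fig. 5 (<=25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, >55). The function names and the sample ages are hypothetical; they are not taken from the case records.

```python
from collections import Counter


def age_group(age):
    """Map an age in years to one of the eight groups used in Fig. 5."""
    if age <= 25:
        return "<=25"
    if age > 55:
        return ">55"
    lower = 26 + 5 * ((age - 26) // 5)  # lower bound of the five-year bin
    return f"{lower}-{lower + 4}"


def crimes_per_age_group(ages):
    """Count how many offenders fall into each age group."""
    return Counter(age_group(a) for a in ages)


# Hypothetical ages, for illustration only.
groups = crimes_per_age_group([19, 26, 30, 33, 55, 60])
print(dict(groups))
```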
The relationship between education and the number of drug crimes is shown in Fig. 6.
People who are illiterate or semi-illiterate, or who have a primary school or junior
high school education, are the main criminal group.
[Figure 5 axes: numbers of crime (0 to 180) against age group: <=25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, >55.]
Fig. 5. The relationship between age and Crime numbers for crime types.
[Figure 6: numbers of crime (vertical axis from 0 to 120) by education level: illiteracy or semi-illiteracy, primary school, junior high school, high school, technical secondary school, specialized subject graduates.]
Fig. 7. Predictive variable importance of crime types for Random Trees, Bayesian Network and
Neural Networks.
Machine learning is a very powerful branch of artificial intelligence. When a model
is built correctly, the accuracy that some algorithms achieve can be surprising.
The present and future of intelligent systems depend on ML and big data analysis.
From the above discussion, we can draw the following conclusions:
In the data we selected, drug crime had the highest share, reaching 46.86%, and
was the main crime type in YD county. It was followed by theft and intentional injury,
at 17.99% and 17.78% respectively.
By ethnic group, the top five were Han, Yi, Wa, Dai and Lang, accounting for
68.43%, 23.43%, 1.88%, 1.67% and 1.25% respectively. The Han ethnic group
therefore accounts for the majority of offenders in the data.
Across the results of the three algorithms, the random tree algorithm proved valid
and accurate in predicting crime data. The performance of the Bayesian network
algorithm is relatively poor, probably because of random factors in the various
crimes and related features (the correlation between the three algorithms is low).
References
1. McClendon, L., Meghanathan, N.: Using machine learning algorithms to analyze crime data.
Mach. Learn. Appl. Int. J. 2(1), 1–2 (2015)
2. Murder. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/murder_homicide.html
3. Forcible Rape. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/forcible_rape.html
4. Robbery. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/robbery.htm
5. Assault. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/aggravated_assault.html
6. Vaquero Barnadas, M.: Machine learning applied to crime prediction. http://upcommons.upc.edu/bitstream/handle/2117/96580/machine%20learning%20applied%20TO%20crime%20precticion.pdf
7. Ronsivalle, G.B.: Neural and bayesian networks to fight crime: the NBNC meta-model of
risk analysis. In: Artificial Neural Networks-Application, pp. 29–42 (2011)
8. Ward, J.T., Ray, J.V., Fox, K.A.: Exploring differences in self-control across sex,
race, age, education, and language: considering a bifactor MIMIC model.
J. Crim. Justice 56, 29–42 (2018)
9. Nirkhi, S.M., Dharaskar, R.V., Thakre, V.M.: Data mining: a prospective approach for
digital forensics. Int. J. Data Min. Knowl. Manag. Process 2(6), 41–48 (2012)
10. Ngai, E.W.T., Xiu, L., Chau, D.C.K.: Application of data mining techniques in customer
relationship management: a literature review and classification. Expert Syst. Appl. 2592–
2602 (2008)
11. McCarthy, J.: Arthur Samuel: pioneer in machine learning. AI Mag. 11(3), 10–11 (1990)
12. Aldous, D.: The continuum random tree I. Ann. Probab. 19(1), 1–28 (1991)
13. Aldous, D.: The continuum random tree III. Ann. Probab. 21(1), 248–289 (1993)
14. Haas, B., Miermont, G.: The genealogy of self-similar fragmentations with negative index as
a continuum random tree. Electron. J. Probab. 9, 57–97 (2004)
15. Le Gall, J.F.: Spatial Branching Processes, Random Snakes and Partial Differential
Equations. Birkhäuser, Boston (1999)
16. Heckerman, D.: A Bayesian approach to learning causal networks. Technical report MSR-
TR-95-04, pp. 1–23 (1995)
17. Jensen, F.V.: Bayesian Networks and Decision Graphs. Springer, New York (2001)