Crime Prediction Using Data Mining and Machine Learning

Shaobing Wu¹, Changmei Wang² (corresponding author), Haoshun Cao¹, and Xueming Jia¹

¹ Institute of Information Security, Yunnan Police College, Kunming 650223, China
² Solar Energy Institute, Yunnan Normal University, Kunming 650092, China
[email protected]

Abstract. This paper uses data mining and machine learning to predict crime in YD
County. The aim of the study is to show the pattern and rate of crime in YD County
based on the collected data, and to show the relationships among the various crime
types and crime variables. Analyzing this data set can provide insight into criminal
activity within YD County. We introduce the formulas and methods of Bayesian networks,
random trees and neural networks from machine learning and big data, and use them to
analyze the crime patterns in the collected data. According to the statistics released
by YD County for 2012-09-01 to 2015-07-21, the ten crime types with the highest number
of cases were: smuggling, selling, transporting and manufacturing drugs; theft;
intentional injury; illegal business crime; illegal possession of drugs; rape; fraud;
gang fighting; manslaughter; and robbery. Drug crime had the highest share, reaching
46.86%. Farmers were the majority of offenders, accounting for 97.07%, and most
offenders were under the age of 35. Males accounted for 90.17% of crimes committed,
while females accounted for 9.83%. For ethnic groups, the top five were Han, Yi, Wa,
Dai and Lang, accounting for 68.43%, 23.43%, 1.88%, 1.67% and 1.25% respectively. By
adopting random forest, Bayesian network, and neural network methods, we obtained the
decision rules for the criminal variables. By comparison, the classification
performance of Random Trees is better than that of Neural Networks and Bayesian
Networks. Across the results of the three algorithms, the random tree algorithm is
observed to be valid and accurate in predicting the crime data. The Bayesian network
algorithm performs relatively poorly, probably because of certain random factors in
the various crimes and related features (the correlation between the three algorithms
is low).

Keywords: Crime prediction · Data mining · Machine learning

1 Introduction

For almost everyone, machine learning (ML) is still a mysterious field that sounds
complicated and is difficult to explain to a person without technical skills [1].
However, it is very important today and will remain so over the next few years.

© Springer Nature Switzerland AG 2020

Q. Liu et al. (Eds.): CENet 2018, AISC 905, pp. 360–375, 2020.
https://doi.org/10.1007/978-3-030-14680-1_40

ML is a multidisciplinary field that draws primarily on programming and mathematics
(mainly probability and density functions). In addition, because it is new and quite
complex, it requires good research skills.
For the crime classification problem, the competition organizer provided a large
training database of crimes in San Francisco. The database is labeled, that is, it
contains the correct category for each entry (e.g. theft, assault, bribery), so this is
a supervised learning problem. With this in mind, the algorithms used to solve this
problem are random trees, neural networks, and Bayesian networks.
The US Federal Bureau of Investigation (FBI) defines violent crime as a crime
involving force or the threat of force. The FBI Uniform Crime Reporting (UCR) program
defines each type of criminal behavior as follows:
(i) Murder: the willful (non-negligent) killing of a person. The UCR does not include
in this classification deaths caused by accident, suicide, or negligence, justifiable
homicides, or attempts to murder and assaults to murder (which are classified as
aggravated assaults) [2].
(ii) Forcible rape: a sexual assault on a woman against her will. Attempts or assaults
to commit rape by force or threat of force are included in this category, while
statutory rape (without the use of force) and other sexual offences are excluded [3].
(iii) Robbery: taking or attempting to take anything of value from the care, custody or
control of one or more persons by force or threat of force or violence and/or by
putting the victim in fear. Robbery thus combines elements of assault against the
person and of theft. Unfortunately, these types of crimes seem to have become
commonplace in society. Law enforcement officials have turned to data mining and
machine learning to support crime prevention and law enforcement.
Miquel Vaquero Barnadas [6] applied machine learning to crime prediction. He used
several algorithms (such as K-nearest neighbours, Parzen windows and neural networks)
to solve a real data classification problem (the San Francisco crime classification).
Gaetano Bruno Ronsivalle [7] presented the NBNC meta-model of risk analysis, which
uses neural and Bayesian networks to fight crime. The specific goal was to give Italian
bank security managers an effective model to "describe" the variables that define the
robbery phenomenon; to "interpret" the methods for calculating the (i) exogenous,
(ii) endogenous and (iii) global risk indices of each branch; and to "predict", through
a simulation module, how the global risk changes with the security systems of different
branches.
Jeffrey T. Ward, James V. Ray and Kathleen A. Fox [8] used a MIMIC model to explore
differences in self-control across sex, race, age, education, and language. They
concluded that, apart from race, testing group differences in self-control with an
observed scale score is largely unbiased, whereas testing group differences in its
elements using observed subscores is frequently biased and generally unsupported.
In this research, we applied the Random Trees, Neural Networks, and Bayesian Networks
algorithms, using the same finite set of features, to the unnormalized communities and
crime data set, in order to compare the crime patterns found in this data set with the
actual crime statistics for YD County. The crime statistics come from the collected
data. Some of the statistical data provided by the YD County People's Procuratorate,
such as the population of YD County, the population distribution by age, the number of
violent crimes committed, and the rate of those crimes, are also features that have
been incorporated into the test data for the analysis.
The rest of the paper is organized as follows: Sect. 2 gives an overview of data
mining and machine learning. Section 3 describes crime classification in YD County.
Section 4 presents the results from each of the algorithms, and Sect. 5 concludes with
the findings and a discussion of the results.

2 Data Mining and Machine Learning Algorithms

2.1 Data Mining

Data mining is part of the interdisciplinary field of knowledge discovery in databases
[9]. Data mining consists of collecting raw data and, through the processes of
inference and analysis, creating information that can be used to make accurate
predictions and applied to real-world situations. It is the application of techniques
that are used to conduct productive analytics. The main tasks that data mining software
packages are designed for are as follows: (i) Association: identifying correlations
among data and establishing relationships between data that exist together in a given
record [9, 10]. (ii) Classification: discovering and sorting data into groups based on
similarities of the data [6]. Classification is one of the most common applications of
data mining. The goal is to build a model that predicts future outcomes by classifying
database records into a number of predefined classes based on certain criteria. Some
common tools used for classification analysis include neural networks, decision trees,
and if-then-else rules [10]. (iii) Clustering: finding and visually presenting groups
of facts that were previously unknown or left unnoticed [6]. Heterogeneous data is
segmented into a number of homogeneous clusters. Common tools used for clustering
include neural networks and survival analysis [10]. (iv) Forecasting: discovering
patterns and data that may lead to reasonable predictions [9].

2.2 Machine Learning

Arthur Samuel is a pioneer in the field of machine learning and artificial intelligence.
He defined machine learning as a field of study that gives computers the ability to
learn without being explicitly programmed [11]. In essence, machine learning is a way
for computer systems to learn through examples. There are many machine learning
algorithms available to users that can be applied to data sets. The more examples an
algorithm is given, the better it models the data set. In the field of data mining,
machine learning algorithms for analysis include: (i) Classification analysis
algorithms: these algorithms use attributes in the data set to predict the value of one
or more variables that take discrete values. (ii) Regression analysis algorithms: these
algorithms use attributes of the data set to predict the value (e.g. profit and loss)
of one or more variables that take continuous values. (iii) Segmentation analysis
algorithms: these algorithms divide data into groups, or clusters, with similar
attributes.

2.3 Algorithms Selected for Analysis

Random Trees: Aldous [12, 13] discussed scaling limits of various classes of discrete
trees conditioned to be large. In the case of a Galton-Watson tree with a critical
offspring distribution of finite variance, conditioned to have a large number of
vertices, he proved that the scaling limit is a continuum random tree called the
Brownian CRT. The main result (Theorem 2.1) states that the rescaled height function
associated with a forest of independent (critical, finite variance) Galton-Watson trees
converges in distribution towards reflected Brownian motion on the positive half-line.
To derive Theorem 2.1, one first states a very simple "moderate deviations" lemma
(Lemma 1.1) for sums of independent random variables.
Lemma 1.1: Let $Y_1, Y_2, \ldots$ be a sequence of i.i.d. real random variables. We assume
that there exists a number $\lambda > 0$ such that $E[\exp(\lambda |Y_1|)] < \infty$, and that
$E[Y_1] = 0$. Then, for every $\alpha > 0$, we can choose $N$ sufficiently large so that for
every $n \ge N$ and $l \in \{1, 2, 3, \ldots, n\}$,

$$P\left[\,|Y_1 + \cdots + Y_l| > n^{\alpha + \frac{1}{2}}\,\right] \le \exp\left(-n^{\alpha/2}\right) \tag{2.1}$$

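From Lemma 1.1, one obtains Theorem 2.1 as follows:

Theorem 2.1: Let $\theta_1, \theta_2, \ldots$ be a sequence of independent µ-Galton-Watson trees,
and let $(H_n;\ n \ge 0)$ be the associated height process. Then

$$\left(\frac{1}{\sqrt{p}}\, H_{[pt]};\ t \ge 0\right) \xrightarrow[p \to \infty]{} \left(\frac{2}{\sigma}\, \gamma_t;\ t \ge 0\right) \tag{2.2}$$

where $\gamma$ is a reflected Brownian motion and $\sigma^2$ is the variance of the offspring
distribution µ. The convergence holds in the sense of weak convergence on the Skorokhod
space $D(\mathbb{R}_+; \mathbb{R}_+)$.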
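As an illustration of Theorem 2.1, the height process of a forest of critical Galton-Watson trees can be simulated directly. The sketch below is an example under stated assumptions, not part of the cited results: it picks the offspring law P(k) = 2^-(k+1), k ≥ 0 (mean 1, variance σ² = 2), and the names `sample_offspring`, `height_process` and `rescale` are hypothetical. It visits the vertices tree by tree in depth-first order and records their heights, which gives H_0, H_1, …; `rescale` then returns the values H_n / √p that appear on the left-hand side of (2.2).

```python
import random


def sample_offspring(rng):
    """Number of children with P(k) = 2**-(k+1), k >= 0 (assumed law: mean 1, variance 2)."""
    k = 0
    while rng.random() < 0.5:
        k += 1
    return k


def height_process(p, seed=0):
    """Return the first p values H_0, ..., H_{p-1} of the height process of a forest."""
    rng = random.Random(seed)
    heights = []
    stack = []  # heights of vertices that still have to be visited
    while len(heights) < p:
        if not stack:
            stack.append(0)  # the current tree is exhausted: start the next root
        h = stack.pop()
        heights.append(h)
        # All children of this vertex sit one level higher; they are visited next.
        stack.extend([h + 1] * sample_offspring(rng))
    return heights


def rescale(heights):
    """Divide every height by sqrt(p), where p is the number of recorded vertices."""
    p = len(heights)
    return [h / p ** 0.5 for h in heights]


if __name__ == "__main__":
    path = rescale(height_process(10_000, seed=1))
    print(f"largest rescaled height: {max(path):.3f}")
```

Plotting such a path against t = n/p gives a rough picture of the limit process; with the offspring law assumed here, the limit in (2.2) would be √2 times a reflected Brownian motion.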
The exit measure from a domain D is then introduced; it is, in a sense, uniformly
spread over the set of exit points of the Brownian snake paths from D. The key integral
equation (Theorem 2.2) for the Laplace functional of the exit measure is then derived.
In the particular case when the underlying spatial motion $\xi$ is d-dimensional Brownian
motion, this quickly leads to the connection between the Brownian snake and the
semilinear partial differential equation (PDE) $\Delta u = u^2$.
Theorem 2.2: Let g be a nonnegative bounded measurable function on E. For every
$x \in E$, set

$$u(x) = N_x\left(1 - \exp\left(-\langle Z^D, g \rangle\right)\right), \quad x \in D \tag{2.3}$$

The function u solves the integral equation [20]

$$u(x) + 2\,\Pi_x\left(\int_0^{\tau} u(\xi_s)^2\, ds\right) = \Pi_x\left(1_{\{\tau < \infty\}}\, g(\xi_\tau)\right) \tag{2.4}$$

where $Z^D$ is the exit measure, $\Pi_x$ is the law of $\xi$ started from x, and $\tau$ is the
first exit time of $\xi$ from D.

Continuum random trees can be used to model the genealogy of self-similar
fragmentations [14]. The Brownian snake has turned out to be a powerful tool in the
study of super-Brownian motion; see the monograph [15] and the references therein.
Since Random Trees was proposed, the algorithm has become a popular and widely used
tool for nonparametric regression applications.
Bayesian Networks: A Bayesian network (BN) approximates the joint probability
distribution for a multivariate system based on expert knowledge and sampled
observations that are assimilated through training [16, 17]. A tractable scoring metric,
known as K2, is obtained from P(F, T) using the assumptions in [6], which include a
fixed ordering of the variables in X:

$$g = \log\left(\prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!\right) \tag{2.5}$$

where $r_i$ is the number of possible instantiations of $X_i$, and $q_i$ is the number of
unique instantiations of $\pi_i$, the parents of $X_i$. $N_{ijk}$ is the number of cases in T
in which $X_i$ takes its kth value while $\pi_i$ takes its jth instantiation, and
$N_{ij} = \sum_k N_{ijk}$. The BN (K, Θ) represents a factorization of the joint probability
over a discrete sample space,

$$p(X) = p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \pi_i) \tag{2.6}$$

for which all probabilities on the right-hand side are given by the conditional
probability tables (CPTs). Therefore, when a variable $X_i$ is unknown or hidden, Bayes'
rule of inference can be used to calculate the posterior probability distribution of
$X_i$ given evidence of the set of variables $\mu_i$ that are conditionally dependent on $X_i$,

$$P(X_i \mid \mu_i) = \frac{P(\mu_i \mid X_i)\, P(X_i)}{P(\mu_i)} \tag{2.7}$$

Bayesian networks are particularly well suited for crime analysis, as they learn from
data and use the experience of criminologists to select nodes and node sequencing. The
confidence level provided for the criminal files informs the detective about the possible
accuracy of each prediction. In addition, BN’s graphical structure represents the most
important relationship between criminal behavior and crime scene behavior, which may
help develop new scientific assumptions about criminal behavior.
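The K2 metric of Eq. (2.5) can be evaluated for one node from a table of counts. The following is a minimal sketch, assuming the standard reading of the formula in which `counts[j][k]` holds N_ijk (the number of training cases where the node takes its kth value and its parents take their jth instantiation); the function names and the example counts are hypothetical. Log-factorials are computed with `math.lgamma` so that large counts do not overflow.

```python
from math import lgamma


def log_factorial(n):
    """log(n!) computed through the log-gamma function."""
    return lgamma(n + 1)


def k2_log_score(counts):
    """K2 log score of one node X_i given a candidate parent set.

    counts[j][k] is N_ijk; every row must have r_i entries, one per value of X_i.
    """
    r = len(counts[0])
    score = 0.0
    for row in counts:
        if len(row) != r:
            raise ValueError("every parent instantiation needs r_i counts")
        n_ij = sum(row)
        # log[(r_i - 1)! / (N_ij + r_i - 1)!]
        score += log_factorial(r - 1) - log_factorial(n_ij + r - 1)
        # log of the product of N_ijk! over k
        score += sum(log_factorial(n) for n in row)
    return score


if __name__ == "__main__":
    # Hypothetical counts for a binary node with a binary parent.
    with_parent = [[40, 2], [3, 35]]
    without_parent = [[43, 37]]
    print(k2_log_score(with_parent), k2_log_score(without_parent))
```

In a structure search, the candidate parent set with the larger score would be preferred for that node; the counts above are made up purely to show the call.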
Neural Networks: Artificial neural networks (ANNs) have been developed as
generalizations of mathematical models of biological nervous systems. The basic
processing elements of a neural network are called artificial neurons, or simply
neurons or nodes. The output of a neuron is computed as the weighted sum of its input
signals, transformed by the transfer function. The learning ability of artificial
neurons is realized by adjusting the weights according to the selected learning
algorithm [12].
Architectures: An ANN consists of a set of interconnected processing elements, also
known as neurons or nodes. It can be described as a directed graph in which each node
performs a transfer function of the form

$$y_i = f_i\left(\sum_{j=1}^{n} w_{ij}\, x_j - \theta_i\right) \tag{2.8}$$

where $y_i$ is the output of node i, $x_j$ is the jth input to the node, and $w_{ij}$ is the
connection weight between nodes i and j. $\theta_i$ is the threshold (or bias) of the node.
Usually, $f_i$ is nonlinear, such as a Heaviside, sigmoid, or Gaussian function.
In (2.8), each term in the summation involves only one input $x_j$. High-order ANNs
are those that contain high-order nodes, i.e. nodes in which more than one input is
involved in some of the terms of the summation. For example, a second-order node can
be described as

$$y_i = f_i\left(\sum_{j,k=1}^{n} w_{ijk}\, x_j x_k - \theta_i\right) \tag{2.9}$$

where all the symbols have definitions similar to those in (2.8).
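The first-order and second-order node equations can be written out in a few lines. This is a minimal sketch, assuming a sigmoid transfer function by default; the function names, weights and inputs are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def sigmoid(z):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def heaviside(z):
    """Step transfer function: 1 when the net input reaches the threshold, else 0."""
    return 1.0 if z >= 0 else 0.0


def first_order_node(x, w, theta, f=sigmoid):
    """y_i = f(sum_j w_ij * x_j - theta_i)."""
    net = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return f(net - theta)


def second_order_node(x, w2, theta, f=sigmoid):
    """y_i = f(sum_{j,k} w_ijk * x_j * x_k - theta_i); w2[j][k] plays the role of w_ijk."""
    n = len(x)
    net = sum(w2[j][k] * x[j] * x[k] for j in range(n) for k in range(n))
    return f(net - theta)


if __name__ == "__main__":
    x = [1.0, 2.0]
    # Net input 0.5*1 - 0.25*2 - 0 = 0, so the sigmoid returns 0.5.
    print(first_order_node(x, w=[0.5, -0.25], theta=0.0))
    # Net input 0.5*1*2 - 1 = 0, so the sigmoid returns 0.5 again.
    print(second_order_node(x, w2=[[0.0, 0.5], [0.0, 0.0]], theta=1.0))
```

Passing `f=heaviside` instead of the default switches the same node to a hard threshold unit.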

3 Crime Classification in YD County

This project is based on a National Social Science Foundation project on the causes
and countermeasures of ethnic minority crime in YD County. The principles and formulas
of the Bayesian network, random tree and neural network methods were given briefly in
Sect. 2.

3.1 Description of the Problem

In this problem, a training data set with nearly 35 months of crime reports from all
across YD County was provided. This data set contains all crimes classified into
categories, which are the different crime types. The main goal of the challenge is to
predict the category of each crime that occurred.
For the evaluation of the algorithms, a second, unlabelled data set (the test set) is
provided. It is used to evaluate the accuracy of the algorithms on new, unclassified
data.
How Is the Problem Going to be Solved?
In this contest, we use several different algorithms to get a good result. Each one is
explained and tested, and finally we see which of them is best for this case.
Cross-validation is used to validate the models, so the database must be divided
into training, validation, and test subsets. This division must be stratified to ensure
that the class proportions of the original data are maintained in each subset (the
share of crimes per category stays the same).
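The stratified division described above can be sketched as follows. This is an illustrative implementation, not the procedure actually used in the study: the function name `stratified_folds`, the round-robin assignment and the toy labels are assumptions. The samples of each crime category are shuffled and dealt out across the k folds in turn, so every fold keeps roughly the original class proportions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # keeps the fold sizes balanced across classes
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_indices(folds, test_fold):
    """Use one fold for testing and the remaining folds for training."""
    test = list(folds[test_fold])
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    return train, test


if __name__ == "__main__":
    labels = ["drugs"] * 6 + ["theft"] * 3  # toy labels, not the real data
    folds = stratified_folds(labels, k=3)
    train, test = train_test_indices(folds, test_fold=0)
    print(sorted(test), sorted(train))
```

Rotating `test_fold` over all k folds gives the usual k-fold cross-validation loop; a further split of the training part could serve as the validation subset.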
All development and testing is done on a server provided by the university
department. In this way, runs can last a whole day without supervision, and they
execute faster.
Results Submission Format and Evaluation
The data submitted for the contest evaluation must be in a specific format that meets
the requirements. To be evaluated properly, the resulting data set must contain the
sample ids together with a list of all categories and the probability that each sample
belongs to each category. Recall that the training data set labels the crime type of
every sample (10 different types).
Then, instead of predicting which category a given sample belongs to, the output
will always be a probability vector.
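A file in this shape, one row per sample id with one probability per category, could be produced as in the sketch below. The exact column names and number format required by the evaluation are not given in the text, so the `Id` header, the four-decimal formatting and the function name `write_submission` are assumptions made for illustration.

```python
import csv
import io


def write_submission(sample_ids, probabilities, categories, out):
    """Write one row per sample: its id followed by one probability per category."""
    writer = csv.writer(out)
    writer.writerow(["Id"] + list(categories))
    for sample_id, probs in zip(sample_ids, probabilities):
        if len(probs) != len(categories):
            raise ValueError("one probability per category is required")
        total = sum(probs)
        if total <= 0:
            raise ValueError("scores must have a positive sum")
        # Normalize so that every row is a probability vector summing to 1.
        writer.writerow([sample_id] + [f"{p / total:.4f}" for p in probs])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_submission(
        sample_ids=[0, 1],
        probabilities=[[3, 1], [0.2, 0.8]],  # hypothetical model scores
        categories=["drugs", "theft"],
        out=buffer,
    )
    print(buffer.getvalue())
```

Writing to an open file instead of the `io.StringIO` buffer works the same way, since `csv.writer` accepts any object with a `write` method.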

3.2 Dataset Analysis

Data
The data in this article consist of the reported cases of smuggling, selling,
transporting and manufacturing drugs, theft, intentional injury, illegal business
crime, illegal possession of drugs, rape, fraud, gang fighting, manslaughter, and
robbery in YD County between 2012 and 2015. The data are summarized in Table 1.

Table 1. Summary statistics of the data set on crime activities.

| Crime types | Numbers of crime |
|---|---|
| The crime of smuggling, selling, transporting and manufacturing drugs | 224 |
| Theft | 86 |
| Intentional injury | 85 |
| Illegal business crime | 23 |
| Illegal possession of drugs | 18 |
| Rape | 18 |
| Crime of fraud | 13 |
| Gang fighting | 11 |

The data provide insight into criminal activity, and studying them can support
decision-making and help reduce crime (protecting communities). Part of the analysis
can be used to explain the relationships between certain criminal activities.
The data can be analyzed further using other methods such as Random Trees, Bayesian
networks, neural networks and so on.
From Table 1 and Fig. 2, the crime of smuggling, selling, transporting and
manufacturing drugs is the most frequent crime type, followed by theft, intentional
injury and so on.
Data Analysis
The provided data set has different "features", each of different relevance. In this
section we analyze the database and extract the useful information from it. There are
478 crime samples, collected in YD County from 2012-09-01 to 2015-07-21.

[Figure: bar chart of the number of crimes on each day of the week, Sunday to Saturday.]

Fig. 1. Number of crimes by day of the week (YD County).

Another interesting analysis is to count the number of crimes that occur on each day
of the week, to find out whether this information is relevant. Figure 1 shows that the
day offenders choose most often is Friday, and that the counts vary across the other
days of the week.
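The day-of-week count behind Fig. 1 amounts to a simple tally over the crime dates. The sketch below shows one way to compute it with the standard library; the function name and the three example dates are hypothetical (they only fall inside the collection period) and are not records from the data set.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday ... 6 for Sunday.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
# Output order used on the horizontal axis of Fig. 1.
FIGURE_ORDER = ["Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday"]


def crimes_by_weekday(dates):
    """Count how many crime dates fall on each day of the week."""
    tally = Counter(WEEKDAY_NAMES[d.weekday()] for d in dates)
    return {day: tally.get(day, 0) for day in FIGURE_ORDER}


if __name__ == "__main__":
    example_dates = [date(2012, 9, 1), date(2015, 7, 18), date(2015, 7, 21)]
    print(crimes_by_weekday(example_dates))
```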
We chose the top eight of the crime categories as the basis for this analysis and
discussion.

[Figure: bar chart of the numbers of crime for eight crime types: the crime of selling drugs, theft, intentional injury, illegal business crime, illegal possession of drugs, rape, crime of fraud, and gang fighting.]

Fig. 2. Numbers of crime types.

Figure 2 shows that the main crime types are the crime of smuggling, selling,
transporting and manufacturing drugs (224 cases), theft (86), intentional injury (85),
and illegal business crime (23).
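The percentage shares quoted for the crime types follow directly from the counts in Table 1. The short sketch below reproduces that arithmetic; the dictionary keys are shortened labels and the function name `crime_shares` is hypothetical, while the counts are copied from Table 1.

```python
# Counts copied from Table 1; the keys are shortened labels.
CRIME_COUNTS = {
    "Drug smuggling, selling, transporting and manufacturing": 224,
    "Theft": 86,
    "Intentional injury": 85,
    "Illegal business crime": 23,
    "Illegal possession of drugs": 18,
    "Rape": 18,
    "Crime of fraud": 13,
    "Gang fighting": 11,
}


def crime_shares(counts):
    """Return each crime type's share of all cases, in percent with two decimals."""
    total = sum(counts.values())
    return {crime: round(100 * n / total, 2) for crime, n in counts.items()}


if __name__ == "__main__":
    print("total cases:", sum(CRIME_COUNTS.values()))
    for crime, share in crime_shares(CRIME_COUNTS).items():
        print(f"{crime}: {share}%")
```

The eight counts add up to the 478 samples of the data set, and the first three shares come out as 46.86%, 17.99% and 17.78%.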

4 Results

The data set selected in this study is an unnormalized community and crime data set.
It includes social and economic data from 2012-09-01 to 2015-07-21, law enforcement
data from the YD County People's Procuratorate, and crime data from 2012-09-01 to
2015-07-21. It comprises 478 criminal cases reported from the whole of YD County and
12 attributes in total, often referred to as features.

This section describes the implementation results of the random tree, Bayesian
network, and neural network algorithms. The algorithms are run to predict the following
crime types for each record: smuggling, selling, transporting and manufacturing drugs;
illegal possession of drugs; rape; theft; manslaughter; robbery; intentional injury;
illegal business crime; fraud; and gang fighting.

Fig. 3. The modeling of machine learning algorithm: (a) Random Trees, (b) Bayesian Networks, (c) Neural Networks.


4.1 Modeling Based on Machine Learning Algorithm

In this section, we build three models based on the Random Trees, Bayesian Networks,
and Neural Networks algorithms. Figure 3 shows the modeling of the machine learning
algorithms.

4.2 Relation Between Sex (or Gender), Ethnicity, Age, Education and Crime
Table 2 and Fig. 4 show the relationship between sex and crime. In every year, the
number of convictions of males is far higher than that of females.

Table 2. Convictions according to gender for YD.

| Year | Males | Females |
|---|---|---|
| 2012 | 6 | 0 |
| 2013 | 169 | 20 |
| 2014 | 221 | 16 |
| 2015 | 89 | 11 |

[Figure: "YD: convictions according to gender". Number of convictions (vertical axis) by year, 2012 to 2015 (horizontal axis), for males and females.]

Fig. 4. Relationship between sex and crime.

For the 478 samples, Fig. 5 shows the relationship between age and the number of
drug crimes. People aged 16 to 35, who account for about half of the total sample, are
the main offender group.
Figure 6 shows the relationship between education and the number of drug crimes. People
who are illiterate or semi-illiterate, or who have a primary school or junior high
school education, are the main offender group.

[Figure: bar chart of the numbers of crime (vertical axis) by age group (horizontal axis): <=25, 26-30, 31-35, 36-40, 41-45, 46-50, 51-55, >55.]

Fig. 5. The relationship between age and Crime numbers for crime types.

[Figure: bar chart of the numbers of crime by education level: illiteracy or semi-illiteracy, primary school, junior high school, high school, technical secondary school, and specialized subject graduates.]

Fig. 6. The relationship between education and crime.

4.3 The Conditional Probability Table for Bayesian Networks

Tables 3, 4, 5, 6, 7 and 8 give the conditional probabilities of crime type, gender,
profession, education, ethnic group and age for the crime of smuggling, selling,
transporting and manufacturing drugs and for gang fighting.

Table 3. Conditional probability for crime types.

| The crime of smuggling, selling, transporting and manufacturing drugs | Gang fighting |
|---|---|
| 0.95 | 0.05 |

Table 4. Conditional probability for gender.

| Crime types | Male | Female |
|---|---|---|
| The crime of smuggling, selling, transporting and manufacturing drugs | 0.86 | 0.14 |
| Gang fighting | 1.00 | 0.00 |

Table 5. Conditional probability for professional.

| Crime types | Gender | Individual worker | Migrant worker | Farmers | Foreigners | Student | Teacher |
|---|---|---|---|---|---|---|---|
| The crime of smuggling, selling, transporting and manufacturing drugs | Male | 0.01 | 0.00 | 0.97 | 0.02 | 0.00 | 0.00 |
| The crime of smuggling, selling, transporting and manufacturing drugs | Female | 0.00 | 0.00 | 0.97 | 0.03 | 0.00 | 0.00 |
| Gang fighting | Male | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |

Table 6. Conditional probability for education.

| Crime types | Gender | High school | Illiteracy or semi-illiteracy | Junior high school | Primary school | Technical secondary school | Specialized subject graduates |
|---|---|---|---|---|---|---|---|
| The crime of smuggling, selling, transporting and manufacturing drugs | Male | 0.05 | 0.22 | 0.23 | 0.46 | 0.03 | 0.01 |
| The crime of smuggling, selling, transporting and manufacturing drugs | Female | 0.06 | 0.53 | 0.09 | 0.28 | 0.03 | 0.00 |
| Gang fighting | Male | 0.00 | 0.00 | 0.82 | 0.18 | 0.00 | 0.00 |

Table 7. Conditional probability for ethnic.

| Crime types | Gender | Bai | Blang | Buyi | Dai | Deang | Han | Hui | Lisu | Man | Miao | Other | Wa | Yi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The crime of smuggling, selling, transporting and manufacturing drugs | Male | 0.00 | 0.01 | 0.01 | 0.02 | 0.00 | 0.55 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | 0.00 | 0.39 |
| The crime of smuggling, selling, transporting and manufacturing drugs | Female | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.50 |
| Gang fighting | Male | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

Table 8. Conditional probability for age.

| Crime types | Gender | <=24.6 | 24.6–35.2 | 35.2–45.8 | 45.8–56.4 | >56.4 |
|---|---|---|---|---|---|---|
| The crime of smuggling, selling, transporting and manufacturing drugs | Male | 0.22 | 0.38 | 0.28 | 0.10 | 0.02 |
| The crime of smuggling, selling, transporting and manufacturing drugs | Female | 0.12 | 0.44 | 0.32 | 0.12 | 0.00 |
| Gang fighting | Male | 0.82 | 0.18 | 0.00 | 0.00 | 0.00 |
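To show how such conditional probability tables are used for inference, the sketch below combines the values of Tables 3, 4 and 8 with the factorization (2.6) and Bayes' rule (2.7). It assumes the network structure suggested by the conditioning columns of the tables (crime type → gender, and crime type with gender → age); that structure, the shortened labels and the function names are assumptions for illustration, not a description of the model that was actually fitted.

```python
# Values copied from Tables 3, 4 and 8. Table 8 has no row for female gang
# fighting, which has probability zero in Table 4 anyway.
PRIOR = {"drugs": 0.95, "gang fighting": 0.05}                      # Table 3
P_GENDER = {                                                        # Table 4
    "drugs": {"Male": 0.86, "Female": 0.14},
    "gang fighting": {"Male": 1.00, "Female": 0.00},
}
AGE_BINS = ["<=24.6", "24.6-35.2", "35.2-45.8", "45.8-56.4", ">56.4"]
P_AGE = {                                                           # Table 8
    ("drugs", "Male"): [0.22, 0.38, 0.28, 0.10, 0.02],
    ("drugs", "Female"): [0.12, 0.44, 0.32, 0.12, 0.00],
    ("gang fighting", "Male"): [0.82, 0.18, 0.00, 0.00, 0.00],
}


def joint(crime, gender, age_bin):
    """P(crime, gender, age) as the product of the three table entries."""
    p_age = P_AGE.get((crime, gender))
    if p_age is None:
        return 0.0
    return PRIOR[crime] * P_GENDER[crime][gender] * p_age[AGE_BINS.index(age_bin)]


def posterior_crime(gender, age_bin):
    """P(crime | gender, age): joint probabilities divided by the evidence."""
    scores = {crime: joint(crime, gender, age_bin) for crime in PRIOR}
    evidence = sum(scores.values())
    if evidence == 0.0:
        raise ValueError("this evidence has zero probability under the tables")
    return {crime: score / evidence for crime, score in scores.items()}


if __name__ == "__main__":
    print(posterior_crime("Male", "<=24.6"))
```

For a male offender in the youngest age bin, the joint probabilities are 0.95 × 0.86 × 0.22 = 0.17974 for drug crime and 0.05 × 1.00 × 0.82 = 0.041 for gang fighting, so the posterior probability of drug crime is about 0.81.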

4.4 Comparison of Results for Random Trees, Neural Networks and Bayesian Networks

Comparison of Predictive Variable Importance
Figure 7 compares the predictive variable importance for crime types obtained with
Random Trees, Bayesian Networks and Neural Networks. It shows that age is an important
variable for Random Trees and Neural Networks, and that education is an important
variable for Bayesian Networks.
Comparison of Model Accuracy
Figure 8 compares the model accuracy of Random Trees classification and Bayesian
Networks classification. The model accuracy of Random Trees (0.974) is much higher than
that of Bayesian Networks (0.537).

Fig. 7. Predictive variable importance of crime types for (a) Random Trees, (b) Bayesian Network and (c) Neural Networks.

| Model information | Random Trees | Bayesian Networks |
|---|---|---|
| The target field | Crime types | Crime types |
| Model building method | Random Trees Classification | Bayesian Networks Classification |
| The number of predictive variables entered | 5 | 5 |
| Model accuracy | 0.974 | 0.537 |
| Misclassification rate | 0.026 | 0.463 |

Fig. 8. The comparison of model accuracy.
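The two figures reported for each model are complementary: the misclassification rate is one minus the accuracy (0.974 + 0.026 = 1 and 0.537 + 0.463 = 1). A minimal sketch of how both could be computed from predicted and true labels is given below; the function names and the four example labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted crime type equals the true one."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def misclassification_rate(y_true, y_pred):
    """Complement of the accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


if __name__ == "__main__":
    # Hypothetical labels: three of the four predictions are correct.
    truth = ["drugs", "theft", "drugs", "rape"]
    guess = ["drugs", "theft", "drugs", "theft"]
    print(accuracy(truth, guess), misclassification_rate(truth, guess))
```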

5 Conclusions and Future Development

In the field of artificial intelligence, machine learning is a very powerful tool. If
the model is built correctly, the accuracy that some algorithms achieve can be
surprising. The present and future of intelligent systems depend on ML and big data
analysis. From the above discussion, we can draw the following conclusions:
In the data we selected, drug crime had the highest share, reaching 46.86%, and was
the main crime type in YD County. It was followed by theft and intentional injury, at
17.99% and 17.78% respectively.
For ethnic groups, the top five were Han, Yi, Wa, Dai and Lang, accounting for
68.43%, 23.43%, 1.88%, 1.67% and 1.25% respectively. The Han ethnic group thus accounts
for the largest share of offenders in the data.
Across the results of the three algorithms, the random tree algorithm is observed to
be valid and accurate in predicting the crime data. The Bayesian network algorithm
performs relatively poorly, probably because of certain random factors in the various
crimes and related features (the correlation between the three algorithms is low).

Acknowledgements. This study is supported by scientific research projects of the National
Social Science Foundation (13CFX038). The authors would like to express their gratitude
to the Office of the National Social Science Foundation.

References
1. McClendon, L., Meghanathan, N.: Using machine learning algorithms to analyze crime data. Mach. Learn. Appl. Int. J. 2(1), 1–2 (2015)
2. Murder. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/murder_homicide.html
3. Forcible Rape. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/forcible_rape.html
4. Robbery. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/robbery.htm
5. Assault. http://www.fbi.gov/ucr/cius2009/offenses/violent_crime/aggravated_assault.html
6. Vaquero Barnadas, M.: Machine learning applied to crime prediction. http://upcommons.upc.edu/bitstream/handle/2117/96580/machine%20learning%20applied%20TO%20crime%20precticion.pdf
7. Ronsivalle, G.B.: Neural and Bayesian networks to fight crime: the NBNC meta-model of risk analysis. In: Artificial Neural Networks-Application, pp. 29–42 (2011)
8. Ward, J.T., Ray, J.V., Fox, K.A.: Exploring differences in self-control across sex, race, age, education, and language: considering a bifactor MIMIC model. J. Crim. Justice 56, 29–42 (2018)
9. Nirkhi, S.M., Dharaskar, R.V., Thakre, V.M.: Data mining: a prospective approach for digital forensics. Int. J. Data Min. Knowl. Manag. Process 2(6), 41–48 (2012)
10. Ngai, E.W.T., Xiu, L., Chau, D.C.K.: Application of data mining techniques in customer relationship management: a literature review and classification. Expert Syst. Appl. 2592–2602 (2008)
11. McCarthy, J.: Arthur Samuel: pioneer in machine learning. AI Mag. 11(3), 10–11 (1990)
12. Aldous, D.: The continuum random tree I. Ann. Probab. 19(1), 1–28 (1991)
13. Aldous, D.: The continuum random tree III. Ann. Probab. 21(1), 248–289 (1993)
14. Haas, B., Miermont, G.: The genealogy of self-similar fragmentations with negative index as a continuum random tree. Electron. J. Probab. 9, 57–97 (2004)
15. Le Gall, J.F.: Spatial Branching Processes, Random Snakes and Partial Differential Equations. Birkhäuser, Boston (1999)
16. Heckerman, D.: A Bayesian approach to learning causal networks. Technical report MSR-TR-95-04, pp. 1–23 (1995)
17. Jensen, F.V.: Bayesian Networks and Decision Graphs. Springer, New York (2001)

© Springer Nature Switzerland AG 2020
