2020 23rd International Conference on Computer and Information Technology (ICCIT)

19-21 December, 2020

Intelligent Crime Investigation Assistance Using Machine Learning Classifiers on Crime and Victim Information
2020 23rd International Conference on Computer and Information Technology (ICCIT) | 978-1-6654-2244-4/20/$31.00 ©2020 IEEE | DOI: 10.1109/ICCIT51783.2020.9392668

Saqueeb Abdullah1, Farah Idid Nibir2, Suraiya Salam3, Akash Dey4, Md Ashraful Alam5 and Md Tanzim Reza6
1,2,3,4,5,6 Department of Computer Science and Engineering, BRAC University, 66 Mohakhali, Dhaka 1212, Bangladesh
Email: 1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected], 5 [email protected], 6 [email protected]

Abstract—In order to establish peace and justice in a society, it is essential to investigate crime incidents properly and correctly. With the expanding use of computerized systems to track crime and violence, computer applications can help law enforcement officers significantly. In most cases, crime incidents are recorded in police databases, and these records can serve various useful purposes. In this experiment, we collected crime data from Bangladesh Police with features such as area of crime, type of crime, number of victims and so on. We then applied machine learning algorithms to the dataset to predict attributes such as criminal age, sex, race and crime method. We used four different algorithms for our research: K-Nearest Neighbor (KNN), Logistic Regression (LR), Random Forest Classifier (RFC) and Decision Tree Classifier (DTC). Using these algorithms with 10-fold cross validation, we achieved different accuracies across the four attribute labels, ranging from an average of approximately 75% to an average of approximately 90%. Despite the clear need for further improvement, the results indicate that a well-performing automated system for suspect attribute prediction is achievable with further work. Finally, we conclude the research by comparing and analyzing all the results.
Index Terms—Crime, investigation, Automated system, Classification, Features, Labels

I. INTRODUCTION
Criminal investigation is a multifaceted problem-solving challenge. During an investigation, an expert official is often required to examine the location of the crime. The official meticulously examines various important aspects of the crime scene, collects data and eventually analyzes the data in order to infer identifying information about the criminal. This complicated process of criminal identification demands strong critical thinking and reasoning skills. Additionally, these procedures usually need to be performed fairly quickly, since criminals try to hide their traces. Therefore, the more time criminals get, the harder it becomes to track them down. To address these complications, crime scene examiners need to gain a great deal of experience and analytical skill so that they can make proper use of insightful information. [1] However, very few acquire such interpretative skills, which results in a low number of proficient criminal investigators. Therefore, a shortage of crime investigators is often evident. This is especially true for a country like Bangladesh, where the amount of crime is regularly growing and is expected to grow in the future along with the continuous growth of the population. [2]
In this age of vast digitalization, various machine-based approaches are being taken to automate problem-solving procedures across different fields. This automation requires some typical steps, such as collecting raw data, denoising it, analyzing it through computing machines and so on. These problem-solving procedures are often referred to as automated data-driven approaches, where the data are analyzed by a machine instead of a human. Criminal investigation methodologies are also mostly data driven, as various data from the crime scene are used to deduce criminal information. Consequently, it is possible to apply a machine-based approach to this kind of investigation.
In this experiment, we applied a few machine learning algorithms to determine criminal attributes from crime information and compared the results of the algorithms. The paper is divided into five main sections. In the next section, the literature related to the proposed methodology is reviewed. The following parts consist of the dataset details, proposed model, result and analysis, and conclusion and discussion.

II. LITERATURE REVIEW
A. Previous Works
Some work has been done on automated crime investigation in the past, and most of it used some form of data mining technology. These data mining works include the use of semi-supervised machine learning algorithms such as K-means clustering to discover essential knowledge from crime records [3], the use of different types of regression algorithms to predict violent crime patterns from data [4], the use of data mining for fraud detection [5], the prediction of event outcomes by analyzing a dataset of criminal activity [6] and so on. Additionally, work has been done on predicting crime based on geographical features [7], urban planning features [8] etc. All these algorithms predict criminal attributes from a set of specific information that is often difficult to collect. From the perspective of Bangladesh, most of the research is based on crime forecasting. This type of criminal attribute prediction research has not been done before, as there is no regulation for collecting and storing crime data. Therefore, there is a dire need for more crime prediction research in Bangladesh.

Authorized licensed use limited to: Carleton University. Downloaded on May 31,2021 at 12:58:04 UTC from IEEE Xplore. Restrictions apply.
B. Algorithms
The four algorithms we used for our research are classification algorithms that try to predict a label from a feature set. LR is a regression technique that maps a weighted sum of the features to a probability using the sigmoid function and thresholds that probability to obtain a binary output. When LR is used to classify between more than two classes, it is called multinomial LR. [9] In our research, we used multinomial LR for every label classification except gender, as gender has two values: male and female. KNN, on the other hand, is an instance-based classifier that assigns a new data point the label held by the majority of its nearest training points. The value of K in KNN determines how many nearest data points take part in that vote. [10] Furthermore, DTC is a classifier that builds a tree-structured set of branching rules based on different attributes for classification. [11] For our experiment, we used the CART decision tree. [12] CART uses a metric called the Gini index to choose its splits. Finally, the RFC algorithm creates a group of small classification trees with different branching attributes and combines their predictions for strong predictive power. [13]
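To make three of the ideas above concrete, the following is a minimal standard-library sketch of the sigmoid function used by LR, the Gini impurity used by CART, and the majority vote used by KNN. The function names and the toy points are illustrative assumptions and are not taken from the experiment.

```python
import math
from collections import Counter


def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of 2-D points.
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = ["Male", "Male", "Male", "Female", "Female", "Female"]

print(sigmoid(0.0))
print(gini_impurity(["Male", "Male", "Female", "Female"]))
print(knn_predict(toy_X, toy_y, (0.5, 0.5), k=3))
```

A node with a single class has impurity 0, and an evenly split two-class node has impurity 0.5, which is why a CART split that separates the classes well is preferred.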
III. PROPOSED MODEL
The proposed model starts with collection of the data, after which several pre-processing steps were performed on the dataset. Then the dataset was divided into a feature set and a label set. Before training, the entire dataset was divided into 80% training data and 20% test data. The training portion of the feature set was used to train the machine learning classifiers, and those classifiers then tried to predict the labels. The feature set was scaled and passed through four different classification algorithms: KNN, LR, RFC and DTC. Finally, all the results were compared and analyzed.
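The pipeline described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and NumPy as the tooling (the paper does not name its libraries) and using synthetic data in place of the police records, which are not public. Fitting the scaler on the training portion only is a design choice of this sketch, not something the text specifies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded dataset (assumption): 1466 rows, five features.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(1466, 5)).astype(float)
y = (X[:, 0] + X[:, 1] > 9).astype(int)  # one toy binary label

# 80% training data and 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features; the scaler is fitted on the training portion only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(random_state=0),
    "DTC": DecisionTreeClassifier(criterion="gini", random_state=0),
}

# Train every classifier on the same split and record its test accuracy.
accuracy = {
    name: model.fit(X_train_s, y_train).score(X_test_s, y_test)
    for name, model in models.items()
}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Because every model sees the same split and the same scaled features, the accuracies in `accuracy` are directly comparable, which is the property the comparison step relies on.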
Fig. 1. Proposed model

IV. DATASET DETAILS AND PROCESSING
A. Dataset details
We collected a completely new dataset for our research. The data were collected directly from Bangladesh Police under the Ministry of Home Affairs of the Government of Bangladesh. This dataset is difficult to obtain since it is classified and full of critical information. Although the number of samples in the dataset was not huge, it was sufficient to serve our purpose.
There were five different types of features in the dataset and four different types of corresponding labels.

TABLE I
DETAILS OF FEATURES AND LABELS IN THE DATASET

Type      Name                    Types of values
Features  Area of crime           Sutrapur, Gulshan, Lalbagh, Adabor, Rampura, Mirpur, Shahbag, Bangsal, Hazaribagh, Motijheel and others
Features  Type of crime           Kidnap, Rape, Aggravated assault, Arson, Drug trafficking, False pretences, Embezzlement, Robbery, Terrorism, Murder and others
Features  Victim Sex              Male and Female
Features  Victim Race             White, Black and Brown
Features  Number of victims       1-7
Labels    Criminal Age            31-40, 41-50, 51-60 and others
Labels    Criminal Sex            Male and Female
Labels    Criminal Race           White, Black and Brown
Labels    Methods of the crimes   Firing, Unknown, Deadly weapon, Explosion, Bombing, Chloroform and others

Fig. 3. Comparison between accuracy of four models (2)

B. Dataset processing
The raw dataset had some defects, which had to be resolved through pre-processing steps. First, rows with at least one empty value had to be handled. As criminal prediction is a critical task, we decided to drop every row that contained one or more null values. Afterwards, as labels such as 'age' had many different numerical values, the variance was reduced by grouping them into specific ranges. Finally, all the data points were encoded from string form into numerical form for classification. The details of all the features and labels of the dataset are given in Table I.
As is visible from Table I, there were five different types of features in the dataset and four different types of labels. There were exactly 1466 data samples after the pre-processing steps. The data samples were divided into approximately 80% training data and 20% testing data for supervised learning. As a result, 1172 data samples went into the training set and the remaining 294 went into the testing set. During learning, we took all five features and one of the four labels at a time for classification. During training, we also applied an exhaustive grid search over the parameters to find the values that provide the most accurate results for each class. Additionally, we performed cross validation during training in order to avoid a biased train-test split. We intentionally used the same train and test data for each classification algorithm during cross validation so that the results can be compared properly and accurately.
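The pre-processing and tuning steps described above can be sketched as follows. This is a hedged illustration only: pandas and scikit-learn are assumed tooling, the column names, age bin edges and parameter grids are hypothetical, and the records are randomly generated because the real data is classified. Only two of the four classifiers are tuned here to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Randomly generated toy records with hypothetical column names.
rng = np.random.default_rng(0)
n = 300
raw = pd.DataFrame({
    "area": rng.choice(["Gulshan", "Mirpur", "Rampura"], n),
    "crime_type": rng.choice(["Robbery", "Arson", "Murder"], n),
    "victim_sex": rng.choice(["Male", "Female"], n),
    "num_victims": rng.integers(1, 8, n),
    "criminal_age": rng.integers(18, 70, n).astype(float),
})
raw.loc[::25, "criminal_age"] = np.nan  # simulate missing values

# 1) Drop every row that contains at least one null value.
clean = raw.dropna().reset_index(drop=True)

# 2) Group the numeric age label into ranges (bin edges are illustrative).
clean["age_range"] = pd.cut(
    clean["criminal_age"],
    bins=[0, 30, 40, 50, 60, 120],
    labels=["<=30", "31-40", "41-50", "51-60", ">60"],
)

# 3) Encode string features as numbers and keep the numeric feature as-is.
cat_cols = ["area", "crime_type", "victim_sex"]
X = np.hstack([
    OrdinalEncoder().fit_transform(clean[cat_cols]),
    clean[["num_victims"]].to_numpy(dtype=float),
])
y = clean["age_range"].astype(str).to_numpy()

# 4) One fixed set of 10 folds shared by all models, so scores are comparable.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# 5) Exhaustive grid search, scored by cross-validated accuracy.
searches = {
    "KNN": GridSearchCV(
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]},
        cv=folds,
    ),
    "DTC": GridSearchCV(
        DecisionTreeClassifier(criterion="gini", random_state=0),
        {"max_depth": [3, 5, None]},
        cv=folds,
    ),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Passing the same `folds` object with a fixed `random_state` to every search is one way to guarantee that each algorithm is evaluated on identical train and test partitions. Since the toy labels are random, the printed accuracies carry no meaning beyond showing that the pipeline runs.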
V. RESULT AND ANALYSIS
After performing 10-fold cross validation on the dataset, we obtained results for all four predicted attributes. The accuracy measurements for methods of crimes are shown in Fig. 2.

Fig. 2. Comparison between accuracy of four models (1)

As we can see from Fig. 2, RFC achieves the best classification accuracy for method prediction, while DTC gives the lowest result. However, the results were close to each other.
In Fig. 3, RFC again achieves the highest accuracy among all four algorithms. This time, KNN achieves the lowest accuracy. For a prediction task of classifying between only three labels, the accuracy of the algorithms is rather low in this case. Finally, we also attempted to classify sex and age range.

Fig. 4. Comparison between accuracy of four models (3)

Fig. 5. Comparison between accuracy of four models (4)

TABLE II
BEST AND WORST ACCURACY

Label               Best model  Best accuracy  Worst model  Worst accuracy
Crime Method        RFC         83.3%          DTC          80.6%
Criminal Race       RFC         76.5%          KNN          63.9%
Criminal Sex        LR          91.2%          KNN          90.1%
Criminal Age Range  RFC         65.6%          KNN          58.2%

The results for sex classification were quite good. However, there were only two possible labels for sex, so high accuracy was to be expected. Meanwhile, all the algorithms achieved rather poor results for age range prediction, with 65.6% being the highest and 58.2% being the lowest.
In Table II, a clear pattern is present. Across the four label classification tasks, RFC provides the most accurate result in three out of four cases. On the other hand, in three out of four cases, KNN provides the least accurate result. This suggests that ensemble classifiers like RFC may provide the most accurate outcome.

VI. CONCLUSION AND FUTURE WORKS
The goal of our research was to establish an extensible body of knowledge for building machine learning based applications that can reliably output criminal attributes when given victim data and crime information as input. While our current version of the research does well at classifying some of the labels, such as gender or crime method, a lot of improvement is still needed, as the model has some obvious weaknesses. First of all, a criminal cannot be completely identified by a single attribute. Therefore, multiple attributes need to be stacked to create an overall criminal profile. However, when multiple attributes with small individual errors are stacked, the total error grows, because the probability that every attribute is predicted correctly shrinks with each added attribute. For example, if the four labels were predicted independently at the best accuracies in Table II, all four would be correct with a probability of only about 0.833 × 0.765 × 0.912 × 0.656 ≈ 0.38. Therefore, each of the labels has to be classified very accurately in order to build a successful model. Secondly, when a victim's body is unrecognizable because of burns or some other cause, the data for the proposed system cannot be collected properly. Unfortunately, there is no viable solution to this issue for a model like this.
As for future work, the first thing we need to do is collect more data to see whether the performance of the classifiers can be improved. In addition, different types of features can be tried, as it may well be the case that the current set of features does not fit the labels well enough. Perhaps there are other important attributes that can provide more information about the criminal. Finally, the whole system can be integrated into a database for ease of access and modularity.

REFERENCES
[1] Rod Gehl and Darryl Plecas. Introduction to Criminal Investigation: Processes, Practices and Thinking. Justice Institute of British Columbia, 2017.
[2] Md Abdul Awal, Jakaria Rabbi, Sk Imran Hossain, and MMA Hashem. Using linear regression to forecast future trends in crime of Bangladesh. In 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), pages 333–338. IEEE, 2016.
[3] Shyam Varan Nath. Crime pattern detection using data mining. In 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, pages 41–44. IEEE, 2006.
[4] Lawrence McClendon and Natarajan Meghanathan. Using machine learning algorithms to analyze crime data. Machine Learning and Applications: An International Journal (MLAIJ), 2(1):1–12, 2015.
[5] Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119, 2010.
[6] Umair Saeed, Muhammad Sarim, Amna Usmani, Aniqa Mukhtar, Abdul Basit Shaikh, and Sheikh Kashif Raffat. Application of machine learning algorithms in crime classification and classification rule mining. Research Journal of Recent Sciences, ISSN 2277-2502, 2015.
[7] Ying-Lung Lin, Meng-Feng Yen, and Liang-Chih Yu. Grid-based crime prediction using geographical features. ISPRS International Journal of Geo-Information, 7(8):298, 2018.
[8] Luiz GA Alves, Haroldo V Ribeiro, and Francisco A Rodrigues. Crime prediction through urban metrics and statistical learning. Physica A: Statistical Mechanics and its Applications, 505:435–443, 2018.
[9] Raymond E Wright. Logistic regression. 1995.
[10] Leif E Peterson. K-nearest neighbor. Scholarpedia, 4(2):1883, 2009.
[11] Xie Niuniu and Liu Yuxun. Review of decision trees. In 2010 3rd International Conference on Computer Science and Information Technology, 2010.
[12] Roger J Lewis. An introduction to classification and regression tree (CART) analysis. In Annual meeting of the society for academic emergency medicine in San Francisco, California, volume 14, 2000.
[13] Carla CM Chen, Holger Schwender, Jonathan Keith, Robin Nunkesser, Kerrie Mengersen, and Paula Macrossan. Methods for identifying SNP interactions: a review on variations of logic regression, random forest and Bayesian logistic regression. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(6):1580–1591, 2011.
