
CRIME RATE PREDICTION AND ANALYSIS USING

K – MEANS CLUSTERING ALGORITHM

A PROJECT REPORT

Submitted by

ANANDAKUMAR. A (920819104005)

in partial fulfillment of the requirements for the award of the degree


of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

NPR COLLEGE OF ENGINEERING AND TECHNOLOGY


NATHAM, DINDIGUL.

ANNA UNIVERSITY::CHENNAI 600 025

MAY 2023
ANNA UNIVERSITY::CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “CRIME RATE PREDICTION AND ANALYSIS USING K-MEANS CLUSTERING ALGORITHM” is the bonafide work of “ANANDAKUMAR. A (920819104005)”, who carried out the project work under my supervision.

SIGNATURE
Dr. K. RAMANAN, M.Tech., Ph.D.,
HEAD OF THE DEPARTMENT
Professor,
Computer Science and Engineering,
NPR College of Engineering and Technology,
Natham, Dindigul – 624001.

SIGNATURE
Dr. K. RAMANAN, M.Tech., Ph.D.,
MENTOR
Assistant Professor,
Computer Science and Engineering,
NPR College of Engineering and Technology,
Natham, Dindigul – 624001.

Submitted for the ANNA UNIVERSITY viva-voce Examination held on


……………….. at NPR College of Engineering and Technology, Natham.
ABSTRACT

In our daily life, the collection and analysis of crime-related data is critical from a security perspective. Crime rate prediction and analysis is a way of identifying and analyzing crime patterns in the data contained in a crime database. Our system predicts the criminal activity that occurs in everyday life in various parts of the world. Using machine learning and data mining algorithms, we can extract predictive information from the dataset, and this process helps solve crimes faster. We use the K-means clustering algorithm and the linear regression algorithm to improve criminological analysis; the system allows crime data to be analyzed on a monthly and weekly basis. Our approach sits between computer science and criminal justice: a data mining procedure that can help solve crimes faster. Instead of focusing on the causes of crime occurrence, such as the criminal background of the offender or political enmity, we focus mainly on the crime factors of each day.

ACKNOWLEDGEMENT

First and foremost, I praise and thank nature from the depth of my heart, which has been an immense source of strength, comfort and inspiration in the completion of this project work.

I would like to express sincere thanks to our beloved Principal Dr. J. SUNDARARAJAN, B.E., M.Tech., Ph.D., for permitting us to do this project.

I extend my gratitude to our Head of the Department of Computer Science and Engineering, Dr. K. RAMANAN, M.Tech., Ph.D., for providing constructive suggestions and his sustained encouragement all through this project.

I express my grateful thanks to our project coordinator and project guide Dr. K. RAMANAN, M.Tech., Ph.D., Assistant Professor, for his valuable technical guidance, patience and motivation to complete this project in a successful manner.

Also, I would like to record my deepest gratitude to my Parents for their constant encouragement and support, which motivated me to complete this project.

TABLE OF CONTENTS

CHAPTER NO.   TITLE

              ABSTRACT
              LIST OF FIGURES
              LIST OF TABLES
              LIST OF ABBREVIATIONS

I             INTRODUCTION
              1.1 Need for Study
              1.2 Objectives of Study
              1.3 Research Objectives
              1.4 Challenges
              1.5 Technique Used
              1.6 Plan of Implementation
              1.7 Problem Statement

II            LITERATURE SURVEY

III           EXISTING SYSTEM
              3.1 Overview
              3.2 Traditional System
              3.3 Disadvantages

IV            SYSTEM STUDY
              4.1 Feasibility Study
                  4.1.1 Economical Feasibility
                  4.1.2 Technical Feasibility
                  4.1.3 Social Feasibility

V             PROPOSED SYSTEM
              5.1 Overview
              5.2 Advantages

VI            SYSTEM SPECIFICATION
              6.1 Technologies Used
              6.2 Applications Required
              6.3 Hardware Requirements

VII           SYSTEM DESIGN
              7.1 System Design
              7.2 Block Diagram

VIII          SYSTEM IMPLEMENTATION
              8.1 Modules
                  8.1.1 Pre-processing Module
                  8.1.2 Classification Module
                  8.1.3 Clustering Module
                  8.1.4 Crime Prediction Module
                  8.1.5 Analysis Module

IX            RESULTS AND DISCUSSION

X             TESTING
              10.1 Introduction to Testing
              10.2 Specific Testing
              10.3 Module Level Testing
              10.4 Unit Testing
              10.5 Integration Testing
              10.6 Validation Testing
              10.7 Recovery Testing
              10.8 Security Testing
              10.9 Performance Testing
              10.10 Black Box Testing
              10.11 Output Testing
              10.12 User Acceptance Testing

XI            CONCLUSION

XII           FUTURE SCOPE

              APPENDIX
              OUTPUT SCREENSHOTS
              SAMPLE CODE
              REFERENCES
LIST OF FIGURES

FIGURE NO.   FIGURE NAME

1.1          Crime Identification
3.1          Crime Prediction
3.2          Crime Mapping
5.1          Flow of Proposed System
5.2          Deep Learning Technique Architecture
6.1          Code Editors
7.1          Block Diagram of Our Proposed Work
8.1          Data Preprocessing Steps
8.2          Data Transformation Techniques
8.3          Use Case Diagram
8.4          Class Diagram
8.5          Crime Dataset
8.6          Read the Dataset
8.7          Clustering Using K-Means Algorithm
8.8          Split the Data into Training & Testing
9.1          Measure of False Positive Data
9.2          Impact of Time Taken for Data Collection
9.3          Scatter the Total Crime
9.4          Group Murder from Total Crime
9.5          Murder vs Assault Murder
9.6          Data Visualization by Crime Category
9.7          Top 5 Crimes
9.8          Number of Monthly & Weekly Arrests
9.9          Elbow Method
A.1          Location Identification
A.2          Crime Area Identification
A.3          Crime Area Location
A.4          Data Visualization – Correlation Matrix

LIST OF TABLES

TABLE NO.    TABLE NAME

1            False positive data aggregate rate with respect to detect count
2            Time taken for data gathering with respect to frame count values
3            Total crime count
4            Murder count from total crime
5            Murder & assault murder count
6            Top 5 crimes values
7            Distortion values
LIST OF ABBREVIATIONS

ACRONYMS     ABBREVIATIONS

ML           Machine Learning
IT           Information Technology
K-MEANS      K-Means Clustering (K = number of clusters)
IDLE         Integrated Development and Learning Environment
COLAB        Colaboratory
VS CODE      Visual Studio Code
RAM          Random Access Memory
GB           Gigabyte
FPDA         False Positive Data Aggregate
UML          Unified Modeling Language
CHAPTER – I
INTRODUCTION

Crime is a significant threat to humankind, and many crimes occur every day. Crime prediction and criminal identification are major problems for the police department. The aim of this project is to predict crime using the features present in the dataset. With the help of machine learning algorithms, we can predict the type of crime that will occur in a particular area.

Figure 1.1 Crime Identification

Crimes happen at regular intervals and are increasing and spreading at a fast and vast rate, from small villages and towns to big cities. Crimes are of different types: robbery, murder, rape, assault, battery, false imprisonment, kidnapping and homicide. Since crime is increasing, there is a need to solve cases much faster, and it is the responsibility of the police department to control and reduce criminal activity.

Crime prediction and criminal identification are major problems for the police department because a tremendous amount of crime data exists, and technology is needed to make case solving faster. Much documentation and many cases show that machine learning and data science can make this work easier and faster. The aim of this project is to make crime predictions using the features present in the dataset, which is extracted from official sites. With the help of machine learning algorithms, using Python as the core language, we can predict the type of crime that will occur in a particular area, along with the crime per capita. The objective is to train a model for prediction: training is done on a training dataset and validated on a test dataset. Multiple Linear Regression (MLR) is used for crime prediction. The dataset is visualized to analyze the crimes that occurred in a particular year, based on population and the number of crimes. This work helps law enforcement agencies predict and detect the crime per capita in an area and thus reduce the crime rate.

Machine learning is a sub-area of artificial intelligence; the term refers to the ability of IT systems to independently find solutions to problems by recognizing patterns in databases. In other words, machine learning enables IT systems to recognize patterns on the basis of existing algorithms and data sets and to develop adequate solution concepts. In machine learning, artificial knowledge is therefore generated on the basis of experience. To enable the software to generate solutions independently, prior human action is necessary: the required algorithms and data must be fed into the systems in advance, and the analysis rules for recognizing patterns in the data stock must be defined. Once these two steps have been completed, the system can perform its tasks by machine learning.

1.1 Need for the study

Gathering criminal intelligence is essential to security today. Crime is increasing dramatically, and it is hard to predict and analyse because it is neither purely random nor purely systematic. Crimes such as robbery, theft, murder and sexual harassment have increased, yet we cannot predict who may become a victim, and predicted and analysed results can never be 100% certain. Crime is a great threat to humanity.

We use clustering to improve efficiency. Clustering refers to dividing a huge data set into n clusters: similar data items are organized into the same group, much like classification. In our project, we use the K-means clustering algorithm to group data based on various factors. K-means is the simplest and most commonly used partitioning algorithm among clustering algorithms.

A data mining algorithm such as linear regression is also used. Linear regression is a mathematical way to find the relationship between a set of dependent and independent variables whose input values are collected from a crime dataset; it is used to predict crime rates. With the help of machine learning algorithms, we can predict and analyse what kind of crime will happen at a particular hotspot. We can add criminal records to the system and view the criminal and crime details for a particular location. A system built around the K-means algorithm can predict crime records, which helps prevent crime in society. Our project can help reduce crime and provide security in crime-prone areas. The main problem is that the population increases day by day, and with it crime in different areas also increases, so the crime rate cannot be accurately predicted by the officials. The officials, as they focus on many issues, may not be able to predict the crimes that will happen in the future.

1.2 Objectives of the study
The aim of the study is to analyze and predict the crime rate and crime types in various locations. Based on this information, officials can take charge and try to reduce the crime rate. It is based on the spatial distribution of data and the anticipation of the crime rate.

The main objective of the project is to predict the crime rate and analyze the crime expected to happen in the future. Based on this information, the officials can take charge and try to reduce the crime rate.

→ The concept of Multiple Linear Regression is used for predicting the relationship between the year (independent variable) and the types and counts of crimes (dependent variable).

→ The system will look at how to convert crime information into a regression problem, so that it will help detectives solve crimes faster.

→ Crime analysis uses the available information to extract crime patterns. Using various multiple linear regression techniques, the frequency of occurring crime can be predicted based on the territorial distribution of existing data and crime recognition.

1.3 Research Objectives


The expectations, objectives and aims to be achieved through this project are:
 The aim is to develop an efficient and easy prototype of a crime rate prediction system that provides a secure prediction process, builds trust in people regarding this technology and improves their experience.
 Current policing strategies work towards finding criminals only after the crime has occurred.
 A crime rate prediction and analysis algorithm that is reliable and accurate will determine the crime spots.

1.4 Challenges
Crime rate prediction and analysis systems have many advantages, such as accessibility, simplicity, security and efficiency, but a number of challenging problems arise when designing, planning and implementing the entire system. Some of the challenges we faced during the project are:

Technical Challenges:
 Data Security

 Lack of data in Dataset


 Unauthorized Access

General Challenges:

1. Undefined goals
2. Scope changes
3. Lack of engagement
4. Resource depreciation
5. Lack of time

1.5 Technique Used


The programming language used in our project is Python. Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation via the off-side rule.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.

Python consistently ranks as one of the most popular programming languages. Python is a multi-paradigm programming language. Object-oriented programming and structured programming are fully supported, and many of their features support functional programming and aspect-oriented programming (including metaprogramming and metaobjects). Many other paradigms are supported via extensions, including design by contract and logic programming.

Python uses dynamic typing and a combination of reference counting and a


cycle-detecting garbage collector for memory management. It uses
dynamic name resolution (late binding), which binds method and variable names
during program execution.

Its design offers some support for functional programming in the Lisp tradition: filter, map and reduce functions; list comprehensions, dictionaries, sets, and generator expressions. The standard library has two modules (itertools and functools) that implement functional tools borrowed from Haskell and Standard ML.
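As a brief illustration (a minimal sketch for this report, not taken from the project code; the sample list is made up), these functional tools can be combined as follows:

    from functools import reduce
    from itertools import accumulate

    nums = [1, 2, 3, 4, 5]

    evens = list(filter(lambda n: n % 2 == 0, nums))   # [2, 4]
    squares = list(map(lambda n: n * n, nums))         # [1, 4, 9, 16, 25]
    total = reduce(lambda a, b: a + b, nums)           # 15
    running = list(accumulate(nums))                   # itertools: [1, 3, 6, 10, 15]
    squares_comp = [n * n for n in nums]               # list comprehension

    print(evens, squares, total, running, squares_comp)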

Its core philosophy is summarized in the document The Zen of


Python (PEP 20), which includes aphorisms such as:

 Beautiful is better than ugly.


 Explicit is better than implicit.
 Simple is better than complex.
 Complex is better than complicated.

Numpy:
NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python. In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray, and it provides a lot of supporting functions that make working with ndarray very easy. Arrays are used very frequently in data science, where speed and resources are very important. NumPy arrays are stored at one continuous place in memory, unlike lists, so processes can access and manipulate them very efficiently; this behavior is called locality of reference in computer science. This is the main reason why NumPy is faster than lists. It is also optimized to work with the latest CPU architectures.
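A minimal sketch of ndarray usage (the values here are made up for illustration):

    import numpy as np

    # Create an ndarray and apply vectorized, element-wise operations
    arrests = np.array([120, 95, 143, 210, 88])
    print(arrests.shape)                  # (5,)
    print(arrests.mean(), arrests.max())  # 131.2 210
    print(arrests * 2)                    # element-wise, no Python loop needed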

Pandas:
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers both to "Panel Data" and to "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas allows us to analyze big data and draw conclusions based on statistical theories. Pandas can clean messy data sets and make them readable and relevant. Pandas is also able to delete rows that are not relevant or contain wrong values, such as empty or NULL values; this is called cleaning the data. Relevant data is very important in data science. Pandas gives you answers about the data such as (a short sketch follows this list):

 Is there a correlation between two or more columns?
 What is the average value?
 What are the max and min values?
Seaborn:
Seaborn is an amazing visualization library for statistical graphics plotting in Python. It provides beautiful default styles and color palettes to make statistical plots more attractive. It is built on top of the matplotlib library and is also closely integrated with the data structures from pandas.

Seaborn aims to make visualization a central part of exploring and understanding data. It provides dataset-oriented APIs, so that we can switch between different visual representations of the same variables for a better understanding of the dataset. Plots are basically used for visualizing the relationships between variables. Those variables can either be completely numerical or categorical, like a group, class or division. Seaborn divides plots into the categories below:

Relational plots: This plot is used to understand the relation between two variables.

Categorical plots: This plot deals with categorical variables and how they can be visualized.

Distribution plots: This plot is used for examining univariate and bivariate distributions.

Regression plots: The regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analyses.

Matrix plots: A matrix plot is an array of scatterplots.

Multi-plot grids: A useful approach is to draw multiple instances of the same plot on different subsets of the dataset.
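As a minimal sketch, a relational plot can be drawn as follows (the "tips" dataset is a seaborn example dataset fetched on first use, not the crime data):

    import seaborn as sns
    import matplotlib.pyplot as plt

    tips = sns.load_dataset("tips")   # seaborn example dataset
    sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
    plt.title("Relational plot: tip vs. total bill")
    plt.show()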

Matplotlib:
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged. SciPy makes use of Matplotlib.
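A small sketch of the object-oriented API (the plotted values are made up for illustration):

    import matplotlib.pyplot as plt

    years = [2018, 2019, 2020, 2021]
    crimes = [480, 455, 510, 495]

    fig, ax = plt.subplots()            # object-oriented interface
    ax.plot(years, crimes, marker="o")  # draw onto the Axes object
    ax.set_xlabel("Year")
    ax.set_ylabel("Reported crimes")
    ax.set_title("Crimes per year")
    plt.show()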

Matplotlib was originally written by John D. Hunter. Since then it has had
an active development community and is distributed under a BSD-style license.
Michael Droettboom was nominated as matplotlib's lead developer shortly
before John Hunter's death in August 2012 and was further joined by Thomas
Caswell. Matplotlib is a NumFOCUS fiscally sponsored project.

Matplotlib 2.0.x supports Python versions 2.7 through 3.10. Python 3


support started with Matplotlib 1.2. Matplotlib 1.4 is the last version to support
Python 2.6. Matplotlib has pledged not to support Python 2 past 2020 by signing
the Python 3 Statement.

1.6 Plan of implementation


Different algorithms will be used to come up with a good result in this contest. Each of them will be explained, tried and tested, and finally we will see which of them works best for this case. Cross-validation will be used to validate the models, so the database has to be split into test, train and validation subsets. This split has to be stratified to ensure that the initial proportion of elements (the same share of crimes per category) is maintained in each subdivision. The resulting training dataset is still too large (approx. 700,000 records), and running the testing programs would take too long. To speed up tests and development, we will reduce the database to approx. 8,000 records using a clustering algorithm, namely K-Means. Then, knowing the number of elements per cluster, we will be able to decide which element has more weight inside the algorithm, so technically there is no data loss. Once the data has been treated, the following algorithms will be tried (in order of complexity):

- K-Nearest Neighbors.

- Neural Networks.

- Confusion matrix (for evaluation).

Each of them will be explained in depth in its own chapter later on. All the development and testing was done on a server lent by a university department; this way, executions could run all day unattended and were a little faster.
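A minimal sketch of the stratified split and the K-Means reduction described above (scikit-learn is assumed to be available; the data here is randomly generated, not the real crime database):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.cluster import KMeans

    # X: feature matrix, y: crime category per record (toy random data)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = rng.integers(0, 5, size=1000)

    # Stratified split keeps the per-category proportions in each subset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Reduce the training set to k representative points (cluster centers),
    # keeping each cluster's size so elements can be weighted later
    k = 50
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    centers = km.cluster_centers_
    weights = np.bincount(km.labels_, minlength=k)   # elements per cluster
    print(centers.shape, int(weights.sum()))         # (50, 2) 800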
1.7 Problem Statement
The main problem is that the population increases day by day, and with it crime also increases in different areas, so the crime rate cannot be accurately predicted by the officials. The officials, as they focus on many issues, may not be able to predict the crimes that will happen in the future. Although police officers try to reduce the crime rate, they may not be able to reduce it in a full-fledged manner, and predicting the future crime rate may be difficult for them. A great deal of work has been done related to crimes: large datasets have been reviewed, and information such as the location and type of crimes has been extracted to help law enforcement. Existing methods have used these databases to identify crime hotspots based on location. There are several map applications that show the exact crime location along with the crime type for any given city. Even though crime locations have been identified, there is no information available that includes the crime occurrence date and time, along with techniques that can accurately predict what crimes will occur in the future.

CHAPTER – II
LITERATURE SURVEY

There are various papers that have contributed to the study of crime data classification and analysis. Based on the study of these papers, this project was proposed.

1. “A Survey of Data Mining Techniques for Analyzing Crime Patterns”, U. Thansatapornwatana, Second Asian Conference on Defence Technology. Year – 2021.

The main goal of this work is the identification and classification of crime data. Here, the system detects linked regions and chains them together by relative position. Modern technology also helps to catch mistakes.

Crime prediction is a systematic approach for finding crime patterns and trends. This paper describes different technologies that can be used for building a crime prediction system. Building a crime prediction system speeds up the process of solving crimes and reduces the rate of crime. There are different techniques, which depend on previously reported and recorded data as well as time and location. A crime prediction system uses recorded data, analyses the data using several analysis techniques, and can later predict the patterns and trends of crime using any of the approaches mentioned below.

Existing : Classification is not possible.

Proposed : Modern technology is used to classify data.

2. “Machine Learning based criminal shortlisting using modus operandi features”, University of Mumbai, Shree L.R. Tiwari College of Engineering and Technology. Year – 2017.

This work uses the k-means clustering algorithm, an unsupervised learning method, to examine crime data. The model was analyzed, preprocessed and implemented for testing on the dataset, and the algorithm was trained. K-means clustering yielded an accuracy of more than 75%. K-Nearest Neighbors is used for classification into single-class and multi-class variables. A neural network is used to ensure the accuracy of the forecast; with the neural network, model accuracy ranges from 60% to 97%.

Because it is uncertain whether an area is really secure, people face many unpleasant circumstances. In this work, different clustering approaches from data mining are used to analyze the crime rate of Bangladesh, and the K-nearest neighbor (KNN) algorithm is used to train the dataset, using both primary and secondary data. By analyzing the data, the prediction rate of different crimes is found for many places, and the algorithm determines the prediction rate of a path; the forecast rate is then used to find a safe route. This work will help individuals become aware of crime areas and discover a secure way to their destination.

Existing : No clustering technique.
Proposed : Clustering technique is used.

Merits: More accuracy and easy to detect the crime rate.

Demerits: Poor accuracy in analysis.

3. “Crime analysis exploitation data processing techniques and algorithms”, S. Sivaranjani, S. Sivakumari, M. Aasha, Dept. of Information Technology, University of Mumbai, L.R. Tiwari College. Year – 2017.

This paper presents a geographic-analysis-based, auto-regressive approach to automatically identify high-risk urban crime areas and represent the crime pattern of each area. The output of the crime prediction algorithm consists of a collection of dense crime areas. It mainly works in a large area with many people and demonstrates the proposed crime prediction strategy. The workflow in this paper collects the raw data that the hotspot uses; after sharing the data, it creates a new hotspot and ultimately predicts crime rates.

Crime analysis and prevention is a systematic methodology for recognizing and analyzing patterns and trends in crime. Our framework can anticipate districts that have a high likelihood of crime events and can visualize crime-prone areas. With the increasing availability of automated systems, crime data analysts can help law enforcement officers accelerate the process of solving crimes. About 10% of the culprits carry out about half of the crimes. Even though we cannot anticipate who might be the victims of crime, we can foresee the places that have a likelihood of its occurrence. The K-means algorithm works by partitioning data into groups based on their means; it has an extension called the expectation-maximization algorithm, in which we partition the data based on their parameters. This easy-to-implement data mining framework works with the geospatial plot of crime and enhances the productivity of criminologists and other law enforcement officers. This framework can likewise be used by Indian crime departments for reducing crime and solving crimes in less time.

Existing : Does not collect the raw data.
Proposed : Collects the raw data that the hotspot uses; after sharing the data, it creates a new hotspot and ultimately predicts crime rates.

Merits: This innovative method helps law enforcers provide security to a community by marking the areas with a higher rate of crimes.
Demerits: False sense of security.

4. “A review of crime rate analysis using data mining techniques and algorithms” / “Detection of Suspicious Criminal Activity Based on Decision Tree”, Mugdha Sharma, Information Technology, University of Mumbai, Shree L.R. Tiwari College of Engineering, Thane, India. Year – 2015.

This study deals with different types of crime scenes. The analysis was conducted from both research and behavioral perspectives. It provided information about unknown criminals and recommendations for investigation. Classification is a data mining method used to assign each object to a class; data mining creates classification models by observing previously classified data. Naive Bayes is a classification algorithm used for prediction.

Crime is one of the biggest and most dominant problems in our society, and a huge number of crimes are committed daily. Here the dataset consists of the date and the crime rate that occurred in the corresponding years. In this project the crime rate is based only on robbery. A linear regression algorithm is used to predict the percentage of the crime rate in the future from previous data: the date is given as input to the algorithm, and the output is the percentage of the crime rate in that particular year.

Existing : Does not use the Naïve Bayes algorithm.
Proposed : Naive Bayes & linear SVM are the classification algorithms used for prediction.
Merits: It gives the police a time frame and points out “hot spots”.

5. “Cluster Analysis for Anomaly Detection in Accounting Data”

The purpose of this study is to examine the possibility of using clustering technology for continuous auditing. Automating fraud filtering can be of great value to preventive continuous audits. In this paper, cluster-based outliers help auditors focus their efforts when evaluating group life insurance claims. Claims with similar characteristics were grouped together, and clusters with small populations were flagged for further investigation. Some dominant characteristics of those clusters are, for example, a large beneficiary payment, a huge interest amount, and having been submitted a long time before getting paid. This study examines the application of cluster analysis in the accounting domain. The results provide a guideline and evidence for the potential application of this technique in the field of auditing.

The primary objective of this project is to distinguish various crimes using clustering techniques based on occurrence and regularity. Data mining is used to analyze, investigate and check patterns in crimes. In this project, a clustering approach is used to analyse the crime data; the stored data is clustered using the K-Means algorithm. After classification and clustering, we can predict a crime based on its historical information. The proposed system can indicate regions that have a high probability of crime and distinguish areas that have a higher crime rate.

Existing : There is no cluster analysis.
Proposed : It also identifies the clustering technique used to classify the crime data.
6. “Analysing Violent Criminal Behaviour by a Simulation Model”

Crime analysis, a part of criminology, is a task that includes exploring and detecting crimes and their relationships with criminals. The high volume of crime datasets and the complexity of the relationships within these kinds of data have made criminology an appropriate field for applying data mining techniques. Identifying crime characteristics is the first step for developing further analysis. The knowledge gained from data mining approaches is a very useful tool that can help and support the identification of violent criminal behaviour. The idea here is to try to capture years of human experience in computer models via data mining and by designing a simulation model.
CHAPTER – III

EXISTING SYSTEM

3.1 Overview

Crime has been increasing day by day, and everyone is trying to figure out how to manage the crime rate. Different crime data mining techniques have been proposed, but each of them involves extraction only. Using hotspots, only crime zones are identified. Clustering techniques are not included in existing systems, and without internet connectivity the system cannot work.

3.2 Traditional System

Crime has increased day by day, and everyone is trying to figure out how to control and reduce it. Various criminal data mining techniques have been proposed in existing work, but each one involves only mining. Using hotspots, only crime areas are identified; criminal records cannot be analyzed in previous work, and grouping techniques are not included. Existing work separates crime models based on crime analysis and the available crime information, predicting crimes from the spatial distribution of the available information. Because the authorities cannot accurately predict the crime rate, crime also increases in various areas. Officers who focus on multiple issues may not predict future crimes. Current methods used these databases to identify crime scenes based on locations. Although the crime scenes have been identified, there is no further information about the crime scene, such as the date and time of occurrence.

3.3 Disadvantages
 Code readability is inadequate.
 Code is not optimized.
 The system cannot be modified.

Figure 3.1 Crime Prediction

Figure 3.2 Crime Mapping

CHAPTER – IV

SYSTEM STUDY

4.1 Feasibility Study


The feasibility of the project is analyzed in this phase and a business
proposal is put forth with a very general plan for the project and some cost
estimates. During system analysis the feasibility study of the proposed system is
to be carried out.

 Economical feasibility
 Technical feasibility
 Social feasibility

4.1.1 Economical Feasibility


This study is carried out to check the economic impact that the system will have on the organization. The implementation of the system is very easy and simple compared to other systems in terms of cost and other factors. Users simply register their details and log in with their respective IDs.
4.1.2 Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client.

4.1.3 Social Feasibility


The aspect of this study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity.

CHAPTER – V

PROPOSED SYSTEM

5.1 Overview
In the proposed work, we analyze crime data with several parameters and factors, including daily, weekly and monthly crime and domestic violence. Using a decision tree algorithm and the K-means clustering algorithm, the crime type is predicted from latitude and longitude. K-means clustering is a cluster analysis method that aims to divide n observations into k clusters, where each observation belongs to the cluster with the nearest mean. The data is split into two sets, a training set and a test set. A regression model then predicts the trend of future crime data from various parameters.
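As a minimal illustration of this K-means step (the coordinates below are toy values, not the project's dataset):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy latitude/longitude pairs standing in for crime locations
    coords = np.array([
        [12.97, 77.59], [12.98, 77.60], [13.00, 77.58],
        [28.61, 77.21], [28.63, 77.22], [28.60, 77.20],
    ])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
    print(km.labels_)            # cluster assignment for each observation
    print(km.cluster_centers_)   # one center (mean) per cluster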

Figure 5.1 Flow of the Proposed system

Deep learning is the composition of multiple layers that include non-linear operations, such as neural nets, propositional formulae, etc. The deep learning technique of artificial intelligence is used here to build the relationships between crimes and to model the world of various crimes. Deep learning uses several algorithms to convert raw information into higher-level representations. With the help of the dataset, a visual representation of hotspots is developed. The system can learn patterns of events, and this knowledge can be used for different crimes at different times and places. The Graph Loader loads the graph data store for analysing events, which in turn predicts the time and place of crime events.
The algorithm has three stages:
1) Pre-processing Stage
2) Processing Stage
3) Post processing Stage

1) Pre-processing Stage
In the pre-processing stage, the data from various sources is collected and transformed into clean information. The transformed data is then stored in the graph data store. The system then automatically parses it to understand the type of information, its relationships, etc., to provide all possible combinations of events.

2) Processing Stage
In the processing stage, the generated event combinations are analysed to produce possible configurations. The system decides on the most suitable combination with the help of previous data, thus producing a set of locations with a set of possible events.

3) Post processing Stage
In the post-processing stage, the set of events is filtered into interesting and important events. This is done using several threshold-based filters at the output stage.

Figure 5.2 Deep Learning Technique Architecture

5.2 Advantages
 Reusability of code.
 Effective accuracy.
 Prediction can be easy.
 Precautionary methods can be taken to prevent crimes.

CHAPTER – VI

SYSTEM SPECIFICATION

6.1 Technologies Used

The technologies may serve as the basis for a contract for the implementation of the system and should therefore be a complete and consistent specification of the whole system. They are used by software engineers as the starting point for the system design. They state what the system does, not how it should be implemented.

 Programming : Python Language


 Interface Used : Python IDLE & Google Colab

 Libraries Used : Pandas, Numpy, Matplotlib

6.2 Applications Required

The application requirements document is the specification of the system. It is a statement of what the system does rather than how it should do it.
 Code editor – Visual Studio Code / PyCharm

Code Editor
A code editor is a text editor program designed specifically for editing the source code of computer programs. It may be a standalone application, or it may be built into an integrated development environment or web browser. Code editors make writing and reading source code easier by differentiating its elements and routines, so programmers can more easily read their code.

Figure 6.1 Code editors
PyCharm :

PyCharm is an integrated development environment (IDE) used for


programming in Python. It provides code analysis, a graphical debugger, an
integrated unit tester, integration with version control systems, and supports web
development with Django. PyCharm is developed by the Czech company
JetBrains. It is cross-platform, working on Microsoft Windows, macOS and Linux.
PyCharm has a Professional Edition, released under a proprietary license and a
Community Edition released under the Apache License. PyCharm Community
Edition is less extensive than the Professional Edition. PyCharm was released to
the market of the Python-focused IDEs to compete with PyDev (for Eclipse) or the
more broadly focused Komodo IDE by ActiveState.

The beta version of the product was released in July 2010, with version 1.0 arriving three months later. Version 2.0 was released on 13 December 2011, version 3.0 on 24 September 2013, and version 4.0 on 19 November 2014. PyCharm became open source on 22 October 2013; the open-source variant is released under the name Community Edition, while the commercial variant, Professional Edition, contains closed-source modules.

Visual Studio Code :


Visual Studio Code, also commonly referred to as VS Code, is a source-code editor made by Microsoft with the Electron Framework, for Windows, Linux and macOS.
intelligent code completion, snippets, code refactoring, and embedded Git. Users
can change the theme, keyboard shortcuts, preferences, and install extensions
that add functionality. In the Stack Overflow 2022 Developer Survey, Visual
Studio Code was ranked the most popular developer environment tool among
71,010 respondents, with 74.48% reporting that they use it. Visual Studio Code
was first announced on April 29, 2015, by Microsoft at the 2015 Build
conference. A preview build was released shortly thereafter.

On November 18, 2015, the source of Visual Studio Code was released
under the MIT License, and made available on GitHub. Extension support was
also announced. On April 14, 2016, Visual Studio Code graduated from the
public preview stage and was released to the Web. Microsoft has released most
of Visual Studio Code's source code on GitHub under the permissive MIT
License, while the releases by Microsoft are proprietary freeware. Visual Studio
Code is a source-code editor that can be used with a variety of programming
languages, including C, C#, C++, Fortran, Go, Java, JavaScript, Node.js, Python and Rust. It is based on the Electron framework, which is used to develop Node.js web applications that run on the Blink layout engine.

Visual Studio Code employs the same editor component (codenamed
"Monaco") used in Azure DevOps (formerly called Visual Studio Online and
Visual Studio Team Services). Out of the box, Visual Studio Code includes
basic support for most common programming languages. This basic support
includes syntax highlighting, bracket matching, code folding, and configurable
snippets. Visual Studio Code also ships with IntelliSense for JavaScript,
TypeScript, JSON, CSS, and HTML, as well as debugging support for Node.js.
Support for additional languages can be provided by freely available extensions
on the VS Code Marketplace.

Sublime Text:
Sublime Text is a shareware text and source code editor available for Windows, macOS, and Linux. It natively supports many programming languages and markup languages. Users can customize it with themes and expand its functionality with plugins, typically community-built and maintained under free-software licenses. To facilitate plugins, Sublime Text features a Python API. The editor uses a minimal interface and contains features for programmers, including configurable syntax highlighting, code folding, search-and-replace supporting regular expressions, a terminal output window, and more. It is proprietary software, but a free evaluation version is available. The following is a list of features of Sublime Text:

1) "Go to Anything", quick navigation to project files, symbols, or lines.

2) "Command palette" uses adaptive matching for quick keyboard invocation


of arbitrary commands.

3) Simultaneous editing: simultaneously make the same interactive changes to


multiple selected areas.

4) Python-based plugin API.

26
5) Project-specific preferences.
6) Extensive customizability via JSON settings files, including project- specific
and platform-specific settings.

7) Cross-platform (Windows, macOS, and Linux) and Supportive Plugins for


cross-platform.

Spyder :

Spyder is an open-source cross-platform integrated development


environment (IDE) for scientific programming in the Python language. Spyder
integrates with a number of prominent packages in the scientific Python stack,
including NumPy, SciPy, Matplotlib, pandas, IPython, SymPy and Cython, as
well as other open-source software. It is released under the MIT license. Initially
created and developed by Pierre Raybaut in 2009, since 2012 Spyder has been
maintained and continuously improved by a team of scientific Python
developers and the community.

Spyder is extensible with first-party and third-party plugins, includes


support for interactive tools for data inspection and embeds Python-specific
code quality assurance and introspection instruments, such as Pyflakes, Pylint
and Rope. It is available cross-platform through Anaconda, on Windows, on macOS through MacPorts, and on major Linux distributions such as Arch Linux, Debian, Fedora, Gentoo Linux, openSUSE and Ubuntu.

Spyder uses Qt for its GUI and is designed to use either of the PyQt or
PySide Python bindings. QtPy, a thin abstraction layer developed by the Spyder
project and later adopted by multiple other packages, provides the flexibility to
use either backend.

Atom :
Atom is a deprecated free and open-source text and source code editor for macOS, Linux, and Windows with support for plug-ins written in JavaScript, and embedded Git control. Developed by GitHub, Atom was released on June 25, 2015. On June 8, 2022, GitHub announced that Atom's end-of-life would occur on December 15 of that year, "in order to prioritize technologies that enable the future of software development", specifically its GitHub Codespaces and Microsoft's Visual Studio Code.

Jupyter :
Jupyter is a project to develop open-source software, open standards, and
services for interactive computing across multiple programming languages. It
was spun off from IPython in 2014 by Fernando Pérez and Brian Granger.
Project Jupyter's name is a reference to the three core programming languages
supported by Jupyter, which are Julia, Python and R. Its name and logo are an
homage to Galileo's discovery of the moons of Jupiter, as documented in
notebooks attributed to Galileo. Project Jupyter has developed and supported the
interactive computing products Jupyter Notebook, JupyterHub, and JupyterLab.
Jupyter is financially sponsored by NumFOCUS. The first version of
Notebooks for IPython was released in 2011 by a team including Fernando
Pérez, Brian Granger, and Min Ragan-Kelley. In 2014, Pérez announced a spin-
off project from IPython called Project Jupyter.

IPython continues to exist as a Python shell and a kernel for Jupyter, while
the notebook and other language-agnostic parts of IPython moved under the
Jupyter name. Jupyter supports execution environments (called "kernels") in
several dozen languages, including Julia, R, Haskell, Ruby, and Python (via the
IPython kernel).

In 2015, about 200,000 Jupyter notebooks were available on GitHub. By 2018, about 2.5 million were available. In January 2021, nearly 10 million were available, including notebooks about the first observation of gravitational waves and about the 2019 discovery of a supermassive black hole.

Major cloud computing providers have adopted the Jupyter Notebook or derivative tools as a frontend interface for cloud users. Examples include Amazon SageMaker Notebooks, Google's Colaboratory, and Microsoft's Azure Notebook.

Notepad++ :
Notepad++ is a highly functional, free, open-source editor for MS Windows that can recognize (i.e., highlight the syntax of) several different programming languages, from Assembly to XML and many others in between, including, of course, Python. Besides syntax highlighting, Notepad++ has some features that are particularly useful to coders. It allows you to create shortcuts to program calls, such as a Run Python menu item that will invoke python.exe to execute your Python code without having to switch over to another window running a Python shell, such as IPython. Another very convenient feature is that it can group sections of code and make them collapsible, so that you can hide blocks of code to make the page/window more readable.

Notepad++ provides indentation guides, particularly useful for Python


which relies not on braces to define functional code blocks, but rather on
indentation levels.

Google Colab :
Colab notebooks allow you to combine executable code and rich text in a
single document, along with images, HTML, LaTeX and more. When you create
your own Colab notebooks, they are stored in your Google Drive account. You
can easily share your Colab notebooks with co-workers or friends, allowing
them to comment on your notebooks or even edit them. With Colab you can
harness the full power of popular Python libraries to analyze and visualize data.
The code cell below uses numpy to generate some random data, and uses
matplotlib to visualize it.
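A minimal sketch of such a cell (illustrative only; it runs in Colab or any Python environment with the two libraries installed):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.standard_normal(500)   # generate some random data

    plt.hist(data, bins=30)           # visualize it as a histogram
    plt.title("Histogram of random data")
    plt.show()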

6.3 Hardware Requirements

 Processor : Intel Core i3 or i5


 RAM : 1 GB

Developed and manufactured by Intel and first introduced in 2010, the Core i3 is a dual-core computer processor, available for use in both desktop and laptop computers. It is one of three types of processors in the "i" series (also called the Intel Core family of processors).

The Core i3 processor is available in multiple speeds, ranging from 1.30 GHz up to 3.50 GHz, and features either 3 MB or 4 MB of cache. It utilizes either the LGA 1150 or LGA 1155 socket on a motherboard. Core i3 processors are most often dual-core, having two cores; however, a select few high-end Core i3 processors are quad-core, featuring four cores. The most common type of RAM used with a Core i3 processor is DDR3 1333 or DDR3 1600.

CHAPTER – VII

SYSTEM DESIGN

7.1 System Design

System design is the process of defining the architecture, components, modules, interfaces and data for a system to satisfy specified requirements. One could see it as the application of systems theory to product development. There is some overlap with the disciplines of systems analysis, systems architecture and systems engineering. If the broader topic of product development "blends the perspective of marketing, design and manufacturing into a single approach to product development," then design is the act of taking the marketing information and creating the design of the product to be manufactured. It is therefore the process of defining and developing systems to satisfy the specified requirements of the user.

7.2 Block Diagram

A block diagram is a diagram of a system in which the principal parts or functions are represented by blocks connected by lines that show the relationships of the blocks. Block diagrams are heavily used in engineering, in software design and in flow diagrams.

Block diagrams are typically used for higher-level, less detailed descriptions that are intended to clarify overall concepts without concern for the details of implementation. Contrast this with the schematic diagrams and layout diagrams used in computer engineering, which show the implementation details of physical construction.

Figure 7.1 Block diagram of our proposed work

The dataset can be collected from the official website for the analysis process. After collecting the dataset, we start the preprocessing step. Preprocessing removes unknown or unwanted values from the loaded dataset: we clean the data and prepare it for our clustering algorithm, converting the raw data into an understandable form. After the data is pre-processed, we split it into training and test sets. A K-means clustering algorithm is then applied to partition the N observations into K clusters, where each observation belongs to the closest cluster. We then use a linear regression algorithm to find the ratio and percentage of crime that occurred. The Naive Bayes algorithm is the third algorithm we use in our project.
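A condensed sketch of this pipeline follows (the file path and column names are assumptions for illustration, not the project's actual dataset):

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    # 1. Load and pre-process (drop duplicates and missing values)
    df = pd.read_csv("crime.csv")     # hypothetical dataset path
    df = df.drop_duplicates().dropna()

    # 2. Cluster the locations into K groups
    km = KMeans(n_clusters=5, n_init=10, random_state=0)
    df["cluster"] = km.fit_predict(df[["latitude", "longitude"]])

    # 3. Regress yearly crime counts to estimate the trend
    yearly = df.groupby("year").size().reset_index(name="count")
    lr = LinearRegression().fit(yearly[["year"]], yearly["count"])
    print(lr.predict([[2024]]))       # projected count for a future year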

CHAPTER – VIII

SYSTEM IMPLEMENTATION

Implementation is the stage of this project when the theoretical design is turned into a working system. It can thus be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.

The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of the changeover methods.

Users first complete the registration process and send their information to the authentication server; after logging in, they can use the system.

8.1 Modules

8.1.1 Pre-processing Module

Preprocessing is a data mining technique: the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step in building a machine learning model, and a mandatory task for data cleaning and model calibration, which also increases the accuracy and efficiency of the machine learning model.

Figure 8.1 Data Preprocessing Steps


Data Cleaning : Data cleaning is a crucial process in data mining and an important part of building a model. Data cleaning can be regarded as a necessary process, but it is often neglected. Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate or incomplete data within a dataset. Steps in data cleaning:
a. Remove duplicate observations
b. Fix structural errors
c. Filter unwanted outliers
d. Handle missing data
a. Remove duplicate observations: Remove unwanted observations from your dataset, including duplicate or irrelevant observations. Duplicate observations most often occur during data collection; when you combine data from multiple locations or scrape data, you can create duplicates. Deduplication is one of the key steps in this process. Irrelevant observations are observations that do not fit into the specific problem you are trying to analyze.
For example, if you want to analyze a dataset covering millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make the analysis more efficient, minimize distraction from the primary target and create a more manageable dataset.
b. Fix structural errors: Structural errors arise when data is measured or transmitted and strange naming conventions, typos, or invalid characters occur. These inconsistencies can cause mislabeled categories or classes.
For example, you may find "N/A" and "Not Applicable" in the same sheet, but they should be analyzed as the same category.

c. Filter unwanted outliers: Often there are individual observations which, at first glance, do not seem to fit within the range of the data being analyzed. If we have a legitimate reason to correct an anomaly, such as incorrect data entry, doing so helps the work with the data. Sometimes, however, the appearance of an outlier will prove a theory you are working on, and the mere existence of an outlier does not mean it is incorrect. This step is needed to determine the validity of that number: if an outlier proves to be irrelevant for the analysis or is a mistake, consider removing it.

d. Handle missing data: We cannot ignore missing data, because many algorithms do not accept missing values. There are a number of ways to handle missing data; none is optimal, but all can be considered (a short sketch follows this list):
 You can drop observations with missing values, but this discards information, so be careful before removing them.
 You can impute missing values based on other observations.
 You might alter how the data is used so as to navigate null values effectively.
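A minimal cleaning sketch with pandas (the small table is made up; the real crime DataFrame would be loaded from the dataset):

    import pandas as pd

    df = pd.DataFrame({"city": ["A", "A", "B", None],
                       "count": [10, 10, None, 7]})

    df = df.drop_duplicates()                        # a. remove duplicate observations
    df["city"] = df["city"].replace({"N/A": None})   # b. fix structural errors
    df = df[df["count"].fillna(0) < 1000]            # c. filter an assumed outlier bound
    df["count"] = df["count"].fillna(df["count"].mean())  # d. impute missing values
    df = df.dropna(subset=["city"])                  # or drop rows still missing keys
    print(df)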

Data Integration: Data integration is the process of merging data from several disparate sources. While performing data integration, you must address data redundancy, inconsistency and duplication. In data mining, data integration is a preprocessing method that merges data from multiple heterogeneous data sources into coherent data, in order to retain and provide a unified view of the data. The data integration step involves combining data from multiple sources (e.g., databases, spreadsheets). The purpose of integration is to create a single, unified view of the data.
It is an integral part of data operations, because data can be obtained from several sources. It is important because it gives a uniform view of scattered data while also maintaining data accuracy.
Data Transformation: Data transformation is a method used to convert raw data into an understandable format. This step involves converting the data into a format that is more suitable for the data mining task. Data transformation is an essential pre-processing technique that must be performed on the data before data mining to provide patterns that are easier to understand. It changes the format, structure or values of the data and converts them into clean, usable data. Techniques in data transformation are:
a. Data Smoothing
b. Attribute Construction
c. Data Generalization
d. Data Aggregation
e. Data Discretization
f. Data Normalization

Figure 8.2 Data Transformation Techniques

a. Data Smoothing: Data smoothing is a process used to remove noise from the dataset using some algorithm. It allows important features present in the dataset to be highlighted and helps in predicting patterns. When collecting data, the data can be manipulated to eliminate or reduce variance or any other form of noise. The concept behind data smoothing is that it can identify simple changes that help predict different trends and patterns, as in the rolling-mean sketch below.
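A small pandas sketch using a rolling mean, one common smoothing technique, on hypothetical monthly crime counts:

import pandas as pd

# Hypothetical noisy monthly crime counts.
counts = pd.Series([30, 45, 28, 60, 33, 52, 31, 58])

# A 3-point centered rolling mean smooths the series and exposes the trend.
smoothed = counts.rolling(window=3, center=True).mean()
print(smoothed)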
b. Attribute Construction: In this method, new attributes are constructed from the existing attributes to ease data mining. The new attributes are created and applied to assist the mining process.
For example, given a dataset of measurements of different plots with the height and width of each plot, a new attribute such as area can be constructed from them.
c. Data Aggregation: Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources that are integrated for a data analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used.
For example, given a dataset of an enterprise's sales report with quarterly sales for each year, we can aggregate the data to get the enterprise's annual sales report, as in the sketch below.
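A small pandas sketch of this quarterly-to-annual aggregation, with made-up figures:

import pandas as pd

# Hypothetical quarterly sales report of an enterprise.
sales = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "sales": [100, 120, 90, 150, 110, 130, 95, 160],
})

# Aggregate the quarterly figures into an annual sales report.
annual = sales.groupby("year")["sales"].sum()
print(annual)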
d. Data Normalization: Normalization refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data (all three are sketched below):
 Min-max normalization
 Z-score normalization
 Decimal scaling
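A minimal NumPy sketch of the three methods on a toy vector:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: rescale to [0.0, 1.0].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j so the largest absolute value is below 1.
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)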
e. Data Discretization: This is the process of converting continuous data into a set of data intervals, where continuous attribute values are substituted by small interval labels. This makes the data easier to study and analyze and improves the efficiency of the task. The method is also called a data reduction mechanism, as it transforms a large dataset into a set of categorical data; discrete values also let decision tree based algorithms produce short, compact and accurate results. A small sketch follows.
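A small pandas sketch, assuming a hypothetical continuous age attribute:

import pandas as pd

# Continuous ages replaced by small interval labels.
ages = pd.Series([12, 25, 37, 48, 63, 71])
bins = pd.cut(ages, bins=[0, 18, 40, 60, 100],
              labels=["minor", "young adult", "middle-aged", "senior"])
print(bins)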
Use Case Diagram

The use case diagram of the proposed system shows the user inputting the dataset, the system pre-processing it, and the Decision Tree and K-Means clustering algorithms generating the trained model to predict the crime type. The actor and use cases are represented; an ellipse shape represents each use case, namely input dataset, pre-process, split features, prediction and output.
Figure 8.3 Use case diagram
Class Diagram

The class diagram explains the properties and functions of each class. The classes are Main, pre-process, split data, train and test. In the diagram, every class is represented with its attributes and operations.
Figure 8.4 Class diagram
8.1.2 Classification Module

Classification is a data mining step used to group crimes based on various parameters. It is used to identify a class or class identifier. An algorithm receives a set of input data and the corresponding output, and derives a model from the trained dataset. When the model later receives unlabeled data, it should find the class that data belongs to. This classification module contains a linear regression algorithm. Linear regression is used in crime forecasting situations to describe a crime scene. It is a mathematical approach for finding the relationship between a dependent variable and a set of independent variables among the input values collected from the crime scene. It can be divided into three stages:
a. Developing the classifier
b. Applying the classifier for classification
c. Data classification process

a. Developing the classifier: This level is the learning or initial stage of the classification process. The classification algorithm builds a classifier in this step. The classifier is formed from a training set consisting of database records and their corresponding class names. Each class that makes up the training set is called a category.

b. Applying the classifier for classification: The developed classifier is used for classification at this level. The test data is used here to evaluate the accuracy of the classification algorithm. A minimal sketch of these two stages is given below.
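A minimal scikit-learn sketch of stages a and b; the feature and target names below are assumptions for illustration, not the project's actual schema:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical crime data with two assumed independent variables.
df = pd.DataFrame({
    "population_density": [500, 1200, 800, 2000, 1500, 300],
    "unemployment_rate": [4.0, 9.5, 6.1, 11.2, 8.3, 3.2],
    "crime_rate": [12, 35, 20, 48, 39, 8],
})
X = df[["population_density", "unemployment_rate"]]
y = df["crime_rate"]

# Stage a: develop the model on a training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Stage b: apply the model to the test data and evaluate it (R^2 score here).
print(model.predict(X_test))
print(model.score(X_test, y_test))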
Figure 8.5 Crime Dataset
Figure 8.6 Read the Dataset
8.1.3 Clustering Module

In this module, we identify a specific location and the crime pattern corresponding to each location, retrieve attributes, and predict the crime pattern of a specific cluster. The K-Means clustering algorithm is an iterative algorithm that attempts to divide the data into K given subgroups. It is an unsupervised iterative clustering technique.
Clustering is one of the data mining techniques used to place data elements into their related groups. It is the process of partitioning the data or objects into classes such that the data in one class are more similar to each other than to those in other clusters; each resulting subclass of data objects is called a cluster. Clustering comes under unsupervised learning. There are a variety of algorithms for the clustering process, which generally share the property of iteratively assigning records to a cluster.
It has six steps to implement (a minimal sketch of the loop follows the steps):
Step 1: Choose the number of clusters K.
Step 2: Randomly select any K data points as cluster centers. Choose the cluster centers so that they are as far apart as possible.
Step 3: Calculate the distance between each data point and the center of each cluster. The distance can be calculated using a distance function such as the Euclidean distance formula.
Step 4: Assign each data point to a cluster: a data point is assigned to the cluster whose center is closest to it.
Step 5: Recalculate the centers of the new clusters. A cluster center is calculated as the average of all data points contained in that cluster.
Step 6: Repeat steps 3 to 5 until the cluster centers no longer change.
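A from-scratch NumPy sketch of these six steps on hypothetical sample points (a toy illustration; it omits the empty-cluster handling a production implementation would need):

import numpy as np

def k_means(data, k, iterations=100, seed=42):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and randomly pick K distinct points as initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # Step 3: Euclidean distance from every point to every center.
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        # Step 4: assign each point to the cluster with the nearest center.
        labels = distances.argmin(axis=1)
        # Step 5: recalculate each center as the mean of its assigned points.
        new_centers = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        # Step 6: repeat steps 3 to 5 until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
labels, centers = k_means(points, k=2)
print(labels)
print(centers)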
Figure 8.7 Clustering using K-Means Algorithm
8.1.4 Crime Prediction Module

This module is used to predict crime in different hotspots. After analyzing the data, law enforcement agencies can make predictions about future crime. These predictions, which can take various forms, help agencies allocate resources more efficiently and avoid wasted effort. Predictions may cover high-crime areas, the types of crime likely to occur, the times of day when crimes are most likely to occur, specific crimes in specific locations, crime hotspots, and crime trends.
This system takes factors/attributes and produces the frequent patterns of a place. These patterns are used for building a decision tree model; corresponding to each place, we build a model by training on these various patterns. Crime patterns cannot be static, since patterns change over time. By training, we mean teaching the system based on particular inputs, so that the system automatically learns the changing patterns in crime by examining the crime data.
Also, crime factors change over time. By sifting through the crime data, we have to identify new factors that lead to crime, though full accuracy cannot be achieved. To get better prediction results, we have to find more crime attributes of places instead of fixing specific attributes. Thus, the proposed system uses certain attributes, but we plan to include more factors to improve accuracy.
A clustering algorithm groups elements into a specified number of clusters. The clusters' centres are calculated, and they act as the representatives of all the samples in their group. Because not all clusters will have the same number of elements, some will be more relevant than others; their relevance is measured by their weight, which is the number of elements they contain. The idea of using training data in machine learning programs is a simple concept, but it is also foundational to the way these technologies work: the training data is an initial set of data used to help a program understand how to apply technologies like neural networks to learn and produce sophisticated results.
8.1.5 Analysis Module

Classification accuracy is given by the relation

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives. To get the value of precision, we divide the total number of correctly classified positives by the total number of predicted positives:

Precision = TP / (TP + FP)

A small sketch computing both metrics is given below.
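A small sketch using scikit-learn's metric helpers, with hypothetical label vectors for illustration:

from sklearn.metrics import accuracy_score, precision_score

# Hypothetical true and predicted class labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)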
Figure 8.8 Split the data into training & testing
CHAPTER – IX

RESULT AND DISCUSSION

False Positive Data:
The graph represents how the false positive data gets aggregated. The False Positive Data Aggregate rate (FPDA) is the ratio of the false data aggregated to the overall data, formulated as

FPDA = (False positive data aggregated / Overall data) × 100
Figure 9.1 Measure of False Positive Data
Table -1: False positive data aggregate rate with respect to detect count

Crime Rate   K-Means (Proposed)   Linear SVM (Existing)
10           21                   24
20           24                   26
30           27                   29
40           31                   35
50           34                   38
60           36                   39
Time Taken:

This graph shows the time taken for data gathering, which is the difference between the end time and the start time of crime data gathering. It is measured in milliseconds and formulated as given below.
DCt = (End TimeDC – Start TimeDC)
[Line graph: time taken for data analysis in milliseconds (0–400) against crime data count (10–50), comparing K-Means (Proposed) and Linear SVM (Existing).]
Figure 9.2 Impact of time taken for data collection
Table -2: Time taken for data gathering with respect to frame count

Crime Data   K-Means (Proposed)   Linear SVM (Existing)
10           125                  174
20           146                  176
30           155                  187
40           164                  190
50           168                  195
Scatter Graph:
The scatter graph represents the total crime appearing in various regions. With its help, the crime appearing in the various hotspots can be determined. The total crime values are taken from the crime dataset.
Figure 9.3 Scatter the total crime
Table -3: Total crime count

nm_pol                Total Crime
CHITRANJAN PARK       512
DABRI                 397
MALVIYA NAGAR         837
CHANDNI MAHAL         588
MODEL TOWN            466
ANANDVIHAR            619
KASHMERE GATE         398
GOVIND PURI           741
BINDAPUR              509
NEW FRIENDS COLONY    694
SARITA VIHAR          245
Bar Graph:
The bar graph represents the murders among the total crime appearing in various regions. With its help, the murder type of crime appearing in the various hotspots can be determined. The murder values are taken from the crime dataset.
Figure 9.4 Group Murder from the total crime
Table -4: Murder count from Total Crime

Murder   Total Crime
2        512
8        397
3        837
1        588
0        466
10       619
0        398
1        741
4        509
6        694
1        245
Line Graph:
The line graph represents the relationship between normal murders and assault murders appearing in various regions. With its help, the murder type of crime appearing in the various hotspots can be determined. The murder and assault murder values are taken from the crime dataset.
Figure 9.5 Murder vs Assault Murder
Table -5: Murder & Assault Murder count

Murder   Assault Murders
2        19
8        26
3        63
1        19
0        9
10       24
0        17
1        5
4        11
6        14
1        8
Data Visualization:
A large amount of information represented in graphic form is easier to understand and analyze. Some companies specify that a data analyst must know how to create slides, diagrams, charts, and templates. In our approach, the data histogram and scatter matrix are shown as the data visualization part.
Figure 9.6 Data Visualization by crime category
Top Crimes:
This graph represents the top five monthly crime types appearing in various hotspots. Through this, the most frequent crimes in various cities can be identified, and precautionary steps to avoid these types of crimes can be taken.
Figure 9.7 Top 5 Crimes
Table -6: Top 5 crimes values

Assault Murders   Theft   Narcotics
19                442     7
26                240     16
63                694     15
19                529     7
9                 393     14
24                457     15
17                340     12
5                 692     5
11                417     8
14                609     10
8                 211     1
Number of arrests:

The following line graph denotes the number of arrests, from which the monthly and weekly arrest counts can be determined.
Figure 9.8 Number of monthly & weekly arrests
Elbow Method:
The elbow method runs K-Means clustering on the dataset for a range of values of k and, for each k, calculates the total within-cluster sum of squares (the distortion). We plot this value against k and look for the elbow point where the rate of decrease shifts; that k is taken as the optimal number of clusters (see the sample code in the appendix).
Figure 9.9 Elbow method
Table – 7: Distortion values

Range   Distortion
1       0.42
2       0.27
3       0.24
4       0.22
5       0.15
6       0.09
CHAPTER – X
TESTING
10.1 Introduction to testing

After finishing the development of any computer-based system, the next complicated and time-consuming process is system testing. Only during testing can the development company know how far the user requirements have been met. Software testing is an important element of software quality assurance and represents the ultimate review of specification, design and coding. The increasing visibility of software as a system element and the costs associated with software failures are motivating forces for well-planned, thorough testing.
Testing procedures for the project are done in the following sequence:
 System testing is done for checking the server names of the machines connected between the customer and the executive.
 The product information provided by the company to the executive is tested against validation with the centralized data store.
 System testing is also done for checking the executive's availability to connect to the server.
 The server name authentication and its availability to the customer are checked.
 Proper communication chat line viability is tested to make the chat system function properly.
 Mail functions are tested against user concurrency and customer mail date validation.
10.2 Specification Testing

We can set out what the program should do and how it should perform under various conditions. This testing is a comparative study of the evolution of system performance against the system requirements.
10.3 Module Level Testing

Here, errors are found in each individual module. This encourages the programmer to find and rectify errors without affecting the other modules.
10.4 Unit Testing

Unit testing focuses verification effort on the smallest unit of software, the module. The local data structure is examined to ensure that the data stored temporarily maintains its integrity during all steps in the algorithm's execution. Boundary conditions are tested to ensure that the module operates properly at the boundaries established to limit or restrict processing.
10.5 Integration Testing

Data can be lost across an interface, and one module can have an inadvertent, adverse effect on another. Integration testing is a systematic technique for constructing a program structure while conducting tests to uncover errors associated with interfacing.
10.6 Validation Testing

It begins after the integration testing is successfully completed. Validation succeeds when the software functions in a manner that can be reasonably accepted by the client. Here, the majority of the validation is done during the data entry operation, where there is the maximum possibility of entering wrong data.
10.7 Recovery Testing

Recovery testing forces the software to fail in a variety of ways and verifies that recovery is properly performed. If recovery is automatic, re-initialization and data recovery are each evaluated for correctness.
10.8 Security Testing

Security testing attempts to verify that the protection mechanisms built into the system will in fact protect it from improper penetration. The tester may attempt to acquire passwords through external clerical means, may attack the system with custom software designed to break down any defenses, and may purposely cause errors.
10.9 Performance Testing

Performance testing is used to test the runtime performance of software within the context of an integrated system. Performance tests are often coupled with stress testing and usually require both hardware and software instrumentation.
10.10 Black Box Testing

Black-box testing focuses on the functional requirements of the software. It enables deriving sets of input conditions that will fully exercise all functional requirements for a program. Black-box testing attempts to find errors in the following categories:
 Incorrect or missing functions
 Interface errors
 Errors in data structures or external database access, and performance errors
10.11 Output Testing

After performing the validation testing, the next step is output testing of the proposed system, since no system can be termed useful until it produces the required output in the specified format. The output format is considered in two ways: the screen format and the printer format.
10.12 User Acceptance Testing

User acceptance testing is a key factor for the success of any system. The system under consideration was tested for user acceptance by constantly keeping in touch with prospective system users at the time of development.
CHAPTER – XI

CONCLUSION

Machine learning technology has made it easier to find correlations and patterns in various crime data. The work of this project mainly focuses on predicting the type of crime. Using the concept of machine learning, we built a model from a training data set that underwent data cleaning and data transformation, with a linear regression algorithm. The model predicts the type of crime, and data visualization helps analyze the material and predict crime. Since we applied the data mining clustering technique to crime analysis, we can also apply other data mining techniques such as classification.
CHAPTER – XII

FUTURE SCOPE

As of now, the project relies on manual input from a human (a police officer) to enter details into the database. If we make this a centralised system, connect it to all the police stations countrywide and make FIR reporting digital, then it would be much easier to predict crimes in a particular location and recognize patterns in them. It would also let citizens track their E-FIR online. We can also reduce corruption, as the government can keep track of the number of cases registered and their solvability rate, which can help them utilise their resources better.
APPENDICES

OUTPUT SCREENSHOTS

Figure A.1 Location identification

Figure A.2 Crime Area identification

Figure A.3 Crime Area Location

Figure A.4 Data Visualization – Correlation Matrix
SAMPLE CODING
Header page:

<!DOCTYPE html>
<html>
<head>
<title>Etihaad</title>
<link rel="stylesheet" type="text/css"
href="https://cdnjs.cloudflare.com/ajax/libs/semantic-ui/2.2.11/semantic.min.css"/>
<style>
.container.main { margin-top: 7.0em; }
h1 { margin-top: 3.0em; }
i.icon { font-size: 2.0em; }
</style>
</head>
<body>
<div class="ui fixed inverted menu">
<div class="ui container">
</div>
</div>
Index Page

<%include partials/header%>
<!DOCTYPE html>
<html>
<head>
<title>Major Project</title>
<link rel="stylesheet"
href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
<link rel="stylesheet" href="style.css">
<script
src="https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js"
type="text/javascript" async></script>
</head>
<body id="landing" class="container">
<div id="landing-header">
<h1 style="text-align: center; font-size: 150px; font-family: georgia;
margin-top: 100px">Major Project</h1>
<h3 style="text-align: center; font-size: 50px; font-family: georgia;
margin-top: 100px; margin-bottom: 20px">Because you must look before you leave!</h3>
<center>
<a href="/login" class="ui secondary button large" style="text-align: center;">Login</a>
<a href="/signup" class="ui secondary button large" style="text-align: center">Sign Up</a>
</center>
</div>
</body>
</html>
K-Means Algorithm

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('crime.csv')
X = dataset.iloc[:, [1, 2, 3, 4, 5, 6, 7, 12]].values

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 21):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 21), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Reducing the features to two dimensions for visualisation
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf')
X = kpca.fit_transform(X)

# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='cyan')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=100, c='magenta')
plt.scatter(X[y_kmeans == 5, 0], X[y_kmeans == 5, 1], s=100, c='black')

# Project the cluster centers into the same two dimensions before plotting
centers_2d = kpca.transform(kmeans.cluster_centers_)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of areas')
plt.xlabel('factor1')
plt.ylabel('factor2')
plt.legend()
plt.show()
Elbow Method for finding the optimal K value

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# data_normalize is the normalized dataset prepared earlier
distortions = []
K = range(1, 7)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(data_normalize)
    distortions.append(sum(np.min(cdist(data_normalize,
        kmeanModel.cluster_centers_, 'euclidean'), axis=1)) /
        data_normalize.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Model Building of K-Means Clustering

kmeans = KMeans(n_clusters=6, random_state=0).fit(data_normalize)
labels = kmeans.labels_
labels
labels.shape
Split the data into training and testing

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data_normalize, labels,
    test_size=0.2, random_state=1)
x_train
x_test
Model prediction of test data

# gnb is assumed to be a Gaussian Naive Bayes classifier trained earlier, e.g.:
# from sklearn.naive_bayes import GaussianNB
# gnb = GaussianNB().fit(x_train, y_train)
y_pred = gnb.predict(x_test)  # Prediction
y_pred
test_score = gnb.score(x_test, y_test)
test_score