Spatio-Temporal Crime Analysis and Forecasting On Twitter Data Using Machine Learning Algorithms

This research explores the use of Twitter data for spatio-temporal crime analysis and forecasting in India, employing machine learning algorithms to classify and visualize crime-related tweets. The study utilizes various models, including LSTM, ARIMA, and SARIMA, to predict crime trends and identify hotspots, enhancing law enforcement's ability to allocate resources effectively. The findings highlight the potential of social media as a valuable tool for real-time crime data analysis despite challenges such as data inconsistency and language barriers.


SN Computer Science (2023) 4:383

https://doi.org/10.1007/s42979-023-01816-y

ORIGINAL RESEARCH

Spatio-temporal Crime Analysis and Forecasting on Twitter Data Using Machine Learning Algorithms
Meghashyam Vivek1 · Boppuru Rudra Prathap1

Received: 19 October 2022 / Accepted: 30 March 2023 / Published online: 6 May 2023
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2023

Abstract
The concept of social media began to gain popularity in the late 1990s and has played a significant role in connecting people across the globe. The constant addition of features to old social media platforms and the creation of new ones have helped amass and retain an extensive user base. Users could now share their views and provide detailed accounts of events from around the world to reach like-minded people. This led to the popularization of blogging and brought into focus the posts of the commoner. These posts began to be verified and included in mainstream news articles, bringing about a revolution in journalism. This research aims to use a social media platform, Twitter, to classify, visualize, and forecast Indian crime tweet data and provide a spatio-temporal view of crime in the country using statistical and machine learning models. The Tweepy Python module's search function and '#crime' query have been used to scrape relevant tweets under geographical constraints, followed by substring-keyword classification using 318 unique crime keywords. The Bokeh and gmaps Python modules create analytical and geospatial visualizations, respectively. Time series forecasting of crime tweet count is performed by comparing the accuracy of Long Short-Term Memory (LSTM), Auto-Regressive Integrated Moving Average (ARIMA), and Seasonal Auto-Regressive Integrated Moving Average (SARIMA) models to determine the best model.

Keywords Crime analysis · Twitter data · Machine learning algorithms · Forecasting · Crime prevention

This article is part of the topical collection “Cyber Security and Privacy in Communication Networks” guest edited by Rajiv Misra, R K Shyamsunder, Alexiei Dingli, Natalie Denk, Omer Rana, Alexander Pfeiffer, Ashok Patel and Nishtha Kesswani.

* Boppuru Rudra Prathap [email protected]
Meghashyam Vivek [email protected]
1 Computer Science and Engineering, CHRIST (Deemed to be University), Bangalore, India

Introduction

Crime analysis and forecasting have become an area of focus in criminology. India is a vast country with large regional diversity. Law enforcement agencies would greatly benefit from being able to identify crime hotspots pre-emptively. This would enable these agencies to organize better and distribute their limited resources to take optimal measures to tackle the crime rate. However, the major issue these agencies face is the large-scale inconsistency and fragmentation of crime data. This is where social media comes in.

Social media platforms can serve as a repository of crime data with great geographical specificity as information may be contributed by the residents of a locality. This repository is large and ever-growing, especially in developing nations like India, where great strides are being made to bring internet connectivity to all. Twitter, in particular, has been chosen for its characteristic features. Tweets are largely text-based, allowing easier analysis of crime data through text processing techniques. The 280-character limit on tweets forces users to use keywords that can be identified for data classification. Furthermore, India makes up the world's third-largest Twitter user base, providing an extensive crime-related data set that is publicly accessible. Several regional and national news agencies now have Twitter feeds that further contribute to this data set.

It must be noted that tweet data has its drawbacks like the use of regional languages, incorrect and inconsistent information, and grammatical errors and poor sentence structure. However, the advances made in natural language processing and translation allow us to overcome these drawbacks. Data from official newsfeeds have an advantage over Twitter data in consistency and structure but fail to provide specificity at
a local level. These feeds may be used as a secondary data source to verify the accuracy of Twitter data.

The data scraped from Twitter is first organized and classified using keywords and removing duplicate tweets. The obtained data is then used as a base for visualization techniques like heatmaps and choropleths and forecasting using ARIMA, SARIMA, and LSTM models. These techniques together provide locations with the highest probable occurrence of crime at any given time across the country.

Related Work

In this section, an extensive review of related literature has been made to explicate the nature, importance, trend, and pattern and determinants of spatio-temporal crime. A wide variety of sources have been explored for this purpose, including published articles, journals, theories of crime, available and accessible books, research reports, government reports, Karnataka state police Reports, and various official websites of authorized agencies related to crime. The causes of crime have been the subject matter of much speculation, discussions, research, and debates. There is a large assortment of theories about crime and criminal behavior. Scores of theories related to crime and criminal behavior state crime as a part of human nature. Crime and criminal behavior mainly stem from the psychological, biological, sociological, and economic aspects of human behavior. Various theories explain people's engagement in crime from mental, physical, developmental, economic, social, cultural, and other causes. Theories exploring the causes of crime identified religion, philosophy, politics, economics, and social forces as the main contributors to the ever-increasing crime rate. Several works have been done on crimes, criminals, and different theories of crimes. A few of them are discussed here, which the researcher believes can lend a hand in shaping and explaining the present study.

Cornow et al. [1] proposed a study on the relationship between alcohol retailers and crime in Buffalo, New York, in terms of area and time. The study investigated if a crime was more likely to occur near licensed liquor businesses. Data on licensed alcohol outlets and violent crimes were examined using global and local bivariate space–time k-function techniques from 2005 to 2011. A global bivariate space–time K-function analysis revealed the spatial and temporal distribution of bars and crime. Personal blunders were both collected and dispersed. According to a local survey, outlets selling alcohol and crimes occur simultaneously and in place. Space–time analysis of bars and crime reveals a link. Much time has passed since Louis Thurstone compared different types of crime almost a century ago as stated by Ohyama et al. [2]. Recent research has used methodologies such as the Cambridge Crime Harm Index (CCHI), which assesses the detrimental effects of crime on society and identifies places with high crime harm by assigning a greater weight to the more serious offenses. Although its applicability in Japan, a country with a low rate of violent crime overall and specifically gun violence, is debatable, this index affects urban policy. Catlett et al. [3] provide a prediction approach based on geographical analysis and auto-regressive models. The method predicts urban crime hotspots. This technique will produce a spatio-temporal crime prediction model. The model will use crime hotspots and predictors. Each predictor estimates general crime in its area. The experiment used NYC and Chicago data. According to this review, the system can provide accurate spatial and temporal crime projections over rolling time periods. Hu et al. [4] map and analyze hotspots using a spatial–temporal approach. The proposed paradigm differs in four ways: STKDE incorporates time into predictive hotspot mapping. The best bandwidths are chosen using probability cross-validation. A statistical significance test eliminates false positives in density estimations and the predictive accuracy index (PAI) curve measures predictive hotspot mapping. Anneleen Rummens et al. [5] proposed a predictive analysis for urban research. Invasion, robbery, and battery statistics are collected in 200 m by 200 m grids. The monthly crime rate for 2014 can be estimated using data from the last three years. Monthly forecasts are broken down (day vs. night). The accuracy of a forecast is based on the direct hit rate, the precision, and the prediction index (ratio of direct hit rate versus proportion of total area predicted as high risk). Predictive analysis of crime data at the grid level is used to make functional forecasts. Monthly forecasts that tell the difference between day and night do better than biweekly forecasts, which suggests that the amount of time a forecast covers affects how well it works. Prathap [6] examined 68 crime-related keywords to assist in identifying the type of crime using geographical and temporal information. Keywords are assessed to provide as much information as possible on criminal activities. It is possible to segment criminal activity using news feeds and the Naive Bayes classifier. Mallet extracts terms from many news streams. K-means is utilized to identify crime hotspots. The KDE [23] strategy is utilized while dealing with crime density, and this method has corrected the algorithm's shortcomings. The study uncovered similarities between the ARIMA and crime-predicting models.

Changes in the criminal justice system have been talked about for a long time. Prioritizing crimes based on random models from physics or statistics is a good idea in theory, but it doesn't work well in practice. Data-driven models, especially neural network models, can depict event dynamics. Huge data sets can be mined for information. Spatial–temporal datasets make it hard to learn about crime in a region because of their complexity, intractable correlations, and redundant data. CSAN was made by Qi Wang et al. [7] by
putting together variational auto-encoders and context-based sequence generative neural networks. CSAN is better than Conv-LSTM at predicting the number of different types of crimes in a region. Twitter enables the monitoring of spatial and temporal crime statistics on social media (Prathap et al. [8]).

Since social media users change rapidly, emotional analysis is an excellent decision-making tool. Twitter is a popular way to obtain news and communicate with others. More than 150 million individuals send 500 million 140-character tweets daily. Twitter is utilized to recommend products depending on the opinions of its users. This paper demonstrates how to examine crime-related tweets from users. The data will demonstrate variations in how people feel about different types of crimes and how they feel about positive or negative crime situations. Crime is the most serious societal problem in emerging countries. Crime has an impact on a country's reputation and way of life. Crime has an economic impact since it necessitates investments in the police force and justice system, increasing the government's financial burden. Law enforcement works to reduce crime. Real-time crime projections can aid in crime reduction. The proposed study by Prathap et al. [9] develops a criminal analytics platform that analyzes newsfeed data to detect crime hotspots. This method assists criminologists in understanding unacknowledged relationships between crime and specific areas. Interactive visualizations aid law enforcement in crime prediction.

Crime analysis using social media data, such as Newsfeeds, Facebook, Twitter, and so on, is a rising area of study for law enforcement agencies worldwide. Data are used to anticipate attacks and arrange reinforcements. Prathap et al. [10] collect and visualize newsfeed data to focus on textual data analytics. By providing crime area coordinates and possible crimes, the research study creates a framework for forecasting 16 types of crime in India and Bangalore. Prathap et al. [11] conducted a research study on criminal activity reports in India and Bangalore. Theft, homicide, alcoholism, assault, etc., are categorized by geographic density and criminal trends, such as time of day, to identify and emphasize national and regional crime. Based on a review of a year's worth of news articles, 68 crime-related terms can be divided into three categories.

Crime clusters are typically reported in locations with a high number of kernels. Time series data can be predicted using the ARIMA model. Using a data mining application, one's different criminal tendencies can be graphically represented. Iqbal H. Sarker describes machine learning methods that can boost an application's intelligence and capabilities [12]. The study investigated the basics of many machine learning algorithms and their relevance to many real-world application areas, including cybersecurity, smart cities, healthcare, e-commerce, and agriculture. The study discusses issues and directions for future research. This project aims to build a technical resource that academic and industrial specialists, practical decision-makers, and others may utilize. Sarker et al. [13] provide a comprehensive overview of "AI-based Modeling" including the principles and capabilities of potential AI techniques that can aid in the development of intelligent and smart systems in real-world application domains, such as business, finance, healthcare, agriculture, smart cities, and more. Our study's research challenges are highlighted. The study contains academics, industry specialists, and decision-makers in real-world scenarios and application areas with a complete introduction to AI-based modeling. Jangada Correia's [14] research recommended various methods for determining public perceptions of cyberterrorism and the need for standardized terminology and framework. The findings of an online poll favor expanding stakeholder diversity to improve terrorism detection and prevention in the UK. Despite general agreement on cyberterrorism, misunderstandings may impair the public's ability to detect and report it. Overall, the literature research and gathering of primary data contribute to developing a cyber terrorism definition consistent with UK legislative definitions and a terrorist activity framework that emphasizes the connections between traditional, cyber-enabled, and cyber-dependent terrorism.

N Kanimozhi et al. [15] presented a study that predicted current crimes using Kaggle open-source crime data. This study assesses the crime that has the greatest impact on a specific place and time period. This study uses machine learning methods such as Naive Bayes to classify criminal patterns more precisely than pre-composed works. Sivanagaleela et al. [16] developed a method based on crime location rather than criminal identification. Initially, the system relied heavily on naive Bayes classification. Data on kidnapping, murder, theft, burglary, cheating, crime against women, and robbery will be clustered using the fuzzy C-Means technique in the current setting. In developing nations like India, crime is rife. Rapid urbanization necessitates constant oversight. Akash Kumar et al. [17] suggested employing KNN to forecast crime rates to avert calamity. It will anticipate a crime's type, date, place, and hour. This data will show local criminal trends, which can help with investigations. It also includes a list of the most serious crimes committed in a specific area. The k-nearest neighbor machine learning method is used in the author's work. Wajiha Safat et al. [18] utilized machine learning algorithms, such as logistic regression, support vector machine (SVM), Naive Bayes, k-nearest neighbors (KNN), decision tree, multilayer perceptron (MLP), random forest, and eXtreme Gradient Boosting (XGBoost), to improve the fit of crime data. Regarding RMSE and MAE, LSTM performed well on both data sets. A data analysis predicts more than 35 categories of crime and an annual decline in crime rates in Chicago
and Los Angeles, with February having the lowest crime rate. Chicago's crime rate will continue to grow slowly until decreasing. According to the ARIMA model, the crime rate and the number of offenses in Los Angeles decreased significantly. The results of crime prediction were also seen in the urban cores of both cities. Overall, these results provide police with a more accurate method than earlier methods for predicting crime, crime hotspots, and future trends. They can be utilized to inform police practice and planning.

Based on the literature, there are very complex spatio-temporal patterns and complicated urban configurations in urban crime. A large number of existing algorithms will not capture all the aspects of the patterns. Therefore, comprehensive techniques are necessary to identify the complex spatio-temporal pattern for analyzing urban crime. Machine learning and deep learning algorithms are well suited to capturing spatio-temporal crime patterns and predicting future outcomes.

Motivation and Objective of the Research

India is a fast-developing country with a large population and a young workforce. Urbanization is one of the many by-products of this high-speed development [19, 21]. The degree of urbanization in India has been in a constant ascent from 31.28% in 2011 to 35.39% in 2021. This growth has led to mass migration of people from rural to urban areas in search of better prospects. Researchers have shown a direct correlation between the concentration of population in urban areas and the increase in crime rate.

To show the consistent increase in crime, we may consider the example of Bangalore city, Karnataka. Bangalore is one of the most prominent metropolitan cities in the country and is considered to be the IT hub of India. As per Karnataka State Police reports [22], cases of thefts recorded were 480 in January 2022 and 725 in August 2022, showing an increase of 51.04% over 7 months (the August count is 151.04% of the January count). Similarly, a 122.51% increase is seen in the number of Special and Local Laws (SLL) crimes.

The crime rate in India has been consistently increasing since 2018, according to reports by the National Crime Records Bureau (NCRB) [20]. 2020 witnessed a massive surge in the crime rate in response to the imposition of COVID-19 restrictions. This surprised law enforcement agencies, leading to poor mobilization of limited resources.

Growth and prosperity of a nation are largely dependent on the psychological state and well-being of its residents. This, along with the literature review on the use of social media to predict crime, has motivated the use of Twitter data to create a tool that provides a spatio-temporal visualization of criminal activity. Based on this main objective, the following sub-objectives were identified:

1. Development of a dashboard to provide a consolidated view of aggregated data.
2. Identification of crime keywords and categorization of crime tweets.
3. Generation of crime heatmaps, choropleths, and scatter plots.
4. Forecasting of crime tweet count using ARIMA, SARIMA, and LSTM models.

Identifying Problem Based on the Literature

The real-time identification of crime hotspots has become a priority for law enforcement agencies to determine and neutralize increasing threats pre-emptively. However, the vast amount of data collected by these agencies is outdated, inconsistent, and fragmented based on several categories like region, nature of crime, etc. Therefore, a single, consistent source of data collected in real time that provides an overall view of crime in the country with equal importance to both localized minor events and distributed macroscopic events is required. The use of data from official news sources is promising but lacks the level of coverage provided by social media posts. Therefore, we turn to Twitter, where the required data set is updated with every tweet in real time and is publicly accessible by all. In addition to hotspot prediction, a simple means of visualization is also required. Gerber et al. [23] and Prathap [24] used Kernel Density Estimation to identify crime hotspots in the USA and India, respectively. The literature review shows that social media and police data can predict criminal activity, but none have utilized Twitter data for the same in the Indian context.

List of Crime Keywords Considered

A total of 318 unique crime keywords have been considered to classify crime into 6 categories as shown in Table 1. This model of categorization is modeled after the system used by the Karnataka State Police but is modified for improved understanding of readers without an in-depth understanding of the Indian Penal Code.

Methodology and Implementation

The paper focuses on building a tool that uses Twitter data to identify crime hotspots in India. This section discusses the methodology used to develop the tool, as mentioned earlier. As discussed, the methodology can be broadly divided into classification, visualization, and forecasting components. These components can be further classified into the following seven subsystems:

Table 1  Crime keyword classification

Crime category | Crime keywords
Drug-related crimes | Drug Trafficking, Drug dealing, Dealing drugs, Drug dealer, Alcohol Drinking, Alcohol dealing, Alcohol, Liquor law violation, Liquor, Drug, Narcotics, Heroin, Cocaine, Ganja, Opium, Cannabis, Weed, Overdose, Capsule, Ketamine, Amphetamine
Violent crimes | Gang rape, Rape, Sexual harassment, Sexual assault, Sex offense, Sex abuse, Sexual abuse, Molest, Dishonor, Assault, Fight, Beat, Kick, Punch, Battery, Lash out, Attack, Belt down, Obliterate, Intent to kill, Attempted murder, Murder, Kill, Homicide, Armed robbery, Robbery, Terrorism, Kidnapping, Abduction, Tied, Incapacitate, Ransom, Shot, Gunshot, Shootout, Stab, Harassment, Abuse, Assault, Outrage, Snatch, Put to death, Lynch, Hit and run, Gambling, Hang, Run over, Dacoity, Sex racket, Victim, Goon, Rowdy, Knife, Domestic violence, Dead, Death, Violence, Sexual misconduct, Body, Firing, Fired, Sexual, Hostage, Sex, Injure, Injury, Finger, Slit, Intimidate, Kidnap, Acid, Explosive
Commercial crimes | Official Document Forgery, Currency Forgery, Official Seal Forgery, Official Stamp Forgery, Forgery, Bribery, Bribe, Counterfeit, Cheat, Conned, Impersonator, Impersonate, Fake, Deceptive, Breach of trust, Breach of contract, Breach, Embezzlement, Misappropriate, Fraud, Corrupt, Leak, Abscond, Misconduct, Conspire, Conspiracy, Tax, Excise, Collude, Collusion, Contract, Extort, Lakh, Crore, Gold, Gullible, Chit fund, Chit, Fund, Money, Association, KYC, Bank, Business, Deal, Raid, Copyright, Import, Export
Property crimes | Arson, Motor vehicle theft, Theft, Burglary, Steal, Riot, Violent protest, Protest, Larceny, Barrage, Barrage fire, Fire, Bombardment, Bomb, Explosion, Explode, Shelling, Looting, Trespass, Incendiarism, Shoplifting, Vandalism, Vandalize, Vandals, Encroach, Property, Stole, Smuggle, Thief, Burn, Land dispute, Land, Dispute, Rob, Enter, House, Robbed, Crook, Loot, Robbers, Shop, Burglar, Grand theft auto
Traffic offenses | Speeding, Signal Jump, Jump Signal, Running a Red Light, Reckless, Collision, Collide, High speed, Speed, Fast, Drunk driving, Drink and drive, Drinking and driving, Driving under influence, DUI, Helmet, Accident
Other offenses | Employing Illegal Worker, Illegal worker, Prostitution, Illegal Gambling, Gambling, Begging, Adultery, Homosexuality, Weapons violation, Violation, Weapon, Porn, Video, Child, Public peace violation, Peace, Stalking, Hurt, Dowry, Modesty, Negligence, Suicide, Criminal damage, Harlotry, Whoredom, Espionage, Spy, Pickpocketing, Pilfering, Poaching, Damage, Illegal, Juvenile, Inappropriate, Marriage, Custody, Affair, Student, Hacking, Convict, Prison, Love, Threat, Blackmail, Witness, Accuse, Conflict, Gang, Arrest, Miscreant, Cop, Police, Incident, Criminal, Cyber Crime, Scream, Commit, Seize, Thugs, Uproar, Trap, Hindu, Muslim, BJP, Congress, Romantic, Marry, Politic, Follow, Hunting, Hunt, Tiger, Fur, Skin, Husband, Wife, Follow, Girlfriend, Cyber, Crime, Suspect, Abase, Warrant, Social media, YouTube, Facebook, Instagram, Twitter, False news, Law, Minor, Girl, Group, Chain snatch, Chain, Tribal, Caste, Adulterate, Hijack, Hate, Schedule, Tribe, Atrocities, Atrocity, Abet, Conceal, Habit, Repeat, Major minerals, Minor minerals, Mineral, Mining, Mine, Vehicle, Car, Bike
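
The substring-keyword classification that uses Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names CRIME_KEYWORDS, find_keyword, and classify_tweet are hypothetical, the dictionary holds only a small subset of the 318 keywords, and the rule that the first keyword found wins is an assumption. The only behavior taken from the paper is that keywords are matched as substrings and that hashtags are searched before the tweet text.

```python
# Illustrative subset of Table 1 (assumption: the real keyword database
# holds all 318 keywords; listing order within a category is preserved).
CRIME_KEYWORDS = {
    "Drug-related crimes": ["drug trafficking", "narcotics", "cocaine", "ganja"],
    "Violent crimes": ["attempted murder", "murder", "kidnapping", "assault"],
    "Commercial crimes": ["forgery", "bribery", "fraud", "embezzlement"],
    "Property crimes": ["motor vehicle theft", "theft", "burglary", "arson"],
    "Traffic offenses": ["drunk driving", "speeding", "collision"],
    "Other offenses": ["stalking", "blackmail", "poaching"],
}


def find_keyword(text):
    """Return (keyword, category) for the first keyword that occurs as a
    substring of the lower-cased text, or None if nothing matches."""
    lowered = text.lower()
    for category, keywords in CRIME_KEYWORDS.items():
        for keyword in keywords:
            if keyword in lowered:
                return keyword, category
    return None


def classify_tweet(hashtags, text):
    """Search the hashtags first and the tweet text second."""
    for field in (" ".join(hashtags), text):
        match = find_keyword(field)
        if match is not None:
            return match
    return None


print(classify_tweet(["#crime", "#Bangalore"],
                     "Police report a burglary in Indiranagar"))
```

Because matching is by substring, short keywords can produce false positives; for example, the Table 1 keyword "Rob" also occurs inside the word "probe". A production version would likely need word-boundary checks or a review step, which this sketch deliberately leaves out.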

1. Data Aggregation
2. Data Cleaning
3. Location Identification
4. Classification
5. Visualization
6. Forecasting
7. Dashboard

Figure 1 shows the proposed framework.

Fig. 1  Proposed framework

Data Aggregation

Data aggregation refers to scraping appropriate tweets from Twitter using the search function of the Tweepy module. The Tweepy module is used to interact with the Twitter API. The hashtag '#crime' is used to query crime-related tweets
and coordinates isolating tweets from and around the Indian subcontinent. Many retrieved tweets are written in various regional languages and translated into English for better analysis. The tweets are then stored in a .csv file.

Data Cleaning

Data cleaning helps improve the integrity of the data set by removing duplicate tweets. Each tweet is assigned a hashcode, and tweets with duplicate hashcodes are removed. Tweet hashtags are extracted for each tweet. Tweets with only '#crime' are said to have 'No Hashtags'. The API provides the timestamp of tweet creation in UTC, which is converted to the corresponding time in IST. A URL pointing to each tweet on the Twitter platform is generated and stored to confirm the authenticity of the data set.

Location Identification

Location identification involves the identification of the geographical location of crime by searching tweet attributes in a prioritized manner. First, the name of a city is searched for in the tweet hashtags, followed by the tweet text and eventually the tweet geolocation. A city once identified is coupled with the corresponding state and stored. The second round of searches is carried out on state names in the same manner. A state once identified is coupled with its capital city and stored. Finally, the word 'India' is searched for in the same fashion. If the word is found, the tweet is mapped to Delhi. All tweets still left without a geographical location are considered to belong to neighboring countries and removed. The latitudes and the longitudes of identified cities are stored.

Classification

Classification refers to the categorization of tweets based on the type of crime. The crime keyword is identified using a substring-keyword search on the tweet hashtags followed by the tweet text. The keyword is then referenced against the crime keyword database to determine the crime type. Data from the above subsystems are stored in the secondary database.

Visualization

Visualization provides a comprehensible diagrammatic representation of the data stored in the secondary database. The applied visualization techniques can be classified into analytical and geospatial. Geospatial visualization includes a heatmap for crime density representation, a choropleth for state-wise crime distribution and a scatter plot to pinpoint crime location. Analytical visualization includes bar graphs, pie charts, clustered bar graphs, and line graphs to represent quantifiable measures of collected data.

Forecasting

Forecasting involves a comparative analysis of ARIMA, SARIMA, and LSTM models to determine the most accurate forecasting model for crime tweet count.

ARIMA

ARMA models were developed for stationary time series. However, a new class of models was introduced by integrating a phase for the removal of non-stationarity in a time series. These models, known as ARIMA models, were developed by Box and Jenkins and describe a stochastic process. ARIMA is described as a three-stage iterative model which includes time series identification, estimation, and verification. The Box–Jenkins approach mainly uses the integration filter, AR filter, and MA filter. The integration filter produces a filtered (differenced) series from observed data. The AR filter generates an intermediate series which is further processed by the MA filter which results in random white noise. The formula for an ARIMA(p,d,q) model is given by:

$$f(n) = \mu + \phi_1 f(n-1) + \cdots + \phi_p f(n-p) + \varepsilon(n) + \theta_1 \varepsilon(n-1) + \cdots + \theta_q \varepsilon(n-q) \tag{1}$$

In Eq. (1), $\mu$ is the mean of the series, $\phi_1, \dots, \phi_p$ are the autoregression coefficients, $\theta_1, \dots, \theta_q$ are the moving average coefficients, $\varepsilon(n)$ is the error (residual) at time $n$, $f(n)$ is the forecasted value at time $n$, $n$ is the current time step, and $n-1, n-2, \dots, n-p$ are previous time steps.

This equation can be broken down into three components:

Autoregression (AR) term: the autoregression term represents the linear dependence between an observation and a number of lagged observations. It is represented by the sum $\phi_1 f(n-1) + \cdots + \phi_p f(n-p)$.

Differences (I) term: the differences term represents the amount of differencing applied to make the time series stationary. It is represented by d in the ARIMA(p,d,q) formula.

Moving average (MA) term: the moving average term represents the error or residual at a certain point modeled as a linear function of the previous errors or residuals. It is represented by the sum $\theta_1 \varepsilon(n-1) + \cdots + \theta_q \varepsilon(n-q)$.

In summary, the ARIMA model predicts the next value in a time series as a weighted sum of past observations and past forecast errors, with the number of lagged terms and the degree of differencing set by p, q, and d.

SARIMA

SARIMA (Seasonal Autoregressive Integrated Moving Average) is a statistical model that is used to analyze and forecast time series data with a seasonal component. It is an extension
of the ARIMA model, which takes into account the presence of seasonality in the data.

The formula for a SARIMA(p, d, q)(P, D, Q)m model is given by:

$$
\begin{aligned}
f(n) = {} & \mu + \theta_1 f(n-1) + \cdots + \theta_p f(n-p) \\
& + \varepsilon(n) + \theta_1 \varepsilon(n-m) + \cdots + \theta_q \varepsilon(n-qm) \\
& + \gamma_1 f(n-1-mD) + \cdots + \gamma_P f(n-Pm-mD)
\end{aligned}
\tag{2}
$$

In Eq. (2), μ is the mean of the series, θ1, …, θp are the autoregression coefficients, θ1, …, θq are the moving average coefficients, γ1, …, γP are the seasonal autoregression coefficients, f(n) is the forecasted value at time n, n is the current time step, and n − 1, n − 2, …, n − p are previous non-seasonal time steps. n − m, n − 2m, …, n − qm are previous seasonal time steps, m is the number of time steps in one seasonal cycle, D is the order of seasonal differencing, and d is the order of non-seasonal differencing.

In summary, the SARIMA model predicts the next value in a time series as a weighted sum of past observations, past forecast errors, and past seasonal errors, with weights determined by the values of p, d, q, P, D, Q, and m. It has a form similar to the ARIMA model, but with added terms of seasonal autoregression (γ1, …, γP) and seasonal differencing (D).

LSTM

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) architecture that is capable of learning long-term dependencies in data. An RNN is a neural network architecture that is designed to process sequential data, such as text, speech, or time series data. The LSTM architecture was designed to overcome the problem of vanishing gradients that occurs in traditional RNNs when the input sequence is very long.

In an LSTM network, each neuron, or "memory cell," has three gates: an input gate, a forget gate, and an output gate. These gates control the flow of information into and out of the cell and allow the LSTM network to learn when to remember or forget information. The gates are controlled by weights that are learned during training, and the values of these weights determine how much information is allowed to flow through the gates at each time step.

Implementation of the Process

The various subsystems mentioned in the methodology section are implemented using Python programs. The Tweepy Python module is used to communicate with the Twitter API.

Data Aggregation

To extract tweets from Twitter, a valid API key with appropriate access level authorization and the API key secret are required. An API key authentication request is sent using the OAuth2AppHandler class in Tweepy. Once Twitter successfully authenticates the API key, the data aggregation process may begin.

Scraping is performed using the search function of the Tweepy Python module. The search query '#crime' along with the geocode '20.5937 (latitude), 78.9629 (longitude), 3000 km (the radius for tweet extraction)' are passed as parameters. This geocode encompasses all of India and small parts of its neighboring countries. The search_tweets query returns pages of data, which are traversed using the Tweepy Cursor to extract tweet attributes. Tweet text is translated to English using the googletrans Python module. Any URLs contained in the tweet text are eliminated before translation using simple substring elimination with 'https' as the key. The aggregated data is written to a pandas data frame and appended to the primary CSV file.

Data Cleaning

Hashcodes are generated using the standard Python hashing library. Duplicate hashcodes are removed using the drop_duplicates() function of the pandas module. Time zone conversion is performed using the dateutil Python package.

Location Identification

Geographic keywords referenced from the locations database are searched for in the tweet attributes using the find() string method. find() returns a value of −1 if the keyword is not found.

Classification

Crime keywords referenced from the keyword database are searched for in the tweet attributes using the find() string method. find() returns a value of −1 if a keyword is not found.

Visualization

Geospatial heatmaps and scatter plots are generated using the gmaps Python package, while the interactive choropleth is generated using the bokeh module and a shape (.SHP) file of India's administrative divisions. Analytical visualizations are generated using the matplotlib Python package.

Forecasting

The SARIMA model has been implemented using the SARIMAX class from the statsmodels.tsa.statespace package in


Python. The autocorrelation (ACF) and partial autocorrelation (PACF) plots have been generated using the plot_acf and plot_pacf functions from the statsmodels.graphics.tsaplots module in Python. The ARIMA model has been implemented using the ARIMA class from statsmodels.tsa.arima.model. ARIMA parameters are found using the auto_arima function from the pmdarima package. To train the LSTM model, the crime counts are scaled to a value between 0 and 1 using the MinMaxScaler class from the sklearn.preprocessing package. The time series is then formatted for the training process using the TimeseriesGenerator class of the keras.preprocessing.sequence module. The LSTM layer, available in keras.layers as LSTM, is added to the model with 100 neurons and ReLU activation. A dense layer is then added, and the model is compiled using the Adam optimizer with Mean Squared Error as the loss. The model is trained for 50 epochs while saving only the best model using the ModelCheckpoint callback.

Dashboard

The dashboard is built using HTML, CSS, and JavaScript to display interactive plots and other analytics in an organized form.

Figure 2 shows a detailed framework of the program implementation.

Results and Discussion

State‑wise Crime Distribution Using Choropleth

Figure 3 displays the absolute crime count determined by crime tweets and does not consider population. Therefore, populous states like Uttar Pradesh tend to have a higher crime count. The five states with the highest crime count, in decreasing order, are as follows: Delhi (2377), Maharashtra (2275), Uttar Pradesh (1670), Tamil Nadu (1252), and Gujarat (1121). A 2022 report shows Delhi has the country's highest crime rate, with a crime index of 59.58. The figures generated from tweet data show a mismatch with NCRB reports because not all crimes are tweeted about. The accuracy of tweet data relative to real-world data is expected to improve over time with the addition of new users. However, the relative state-wise accuracy of the data can be used to estimate the state-wise distribution of Twitter users. For example, in Delhi the crime tweet data is similar to real-world crime data, so Delhi can be assumed to have a higher percentage share of Twitter users. In contrast, states like Arunachal Pradesh, where there is a large disparity between crime tweet data and real-world crime data, can be assumed to have few Twitter users. States along the border of the country, like Ladakh, Sikkim, and Arunachal Pradesh, seem to have fewer Twitter users and crime-related tweets. This may be due to the following factors: geopolitical tension, lack of internet infrastructure, or low population density.

Crime Density Detection Using Heatmaps

Seven heatmaps are generated in total: six display the geospatial density of each crime category, and one displays the total geospatial crime density, including all types of crime. Figure 4 shows the heatmaps generated. The density distribution differs for each crime category, as various factors affect their distribution. Criminal activity appears denser in metropolitan cities. Because the density distribution of violent crimes resembles that of the total, it can be inferred that violent crimes make up the majority of the total crime count. Violent crimes appear more concentrated in North India, with a decreasing density as we move south. The Delhi, Uttar Pradesh, and Bihar belts appear to have the country's highest concentration of criminal activity. Delhi appears to have a high concentration of criminal activity under all categories. Drug-related crimes are sparsely distributed, with Goa, Delhi, and Mumbai being hotspots. Commercial crimes are concentrated around metropolitan and port cities, which are seats of commerce. Traffic offenses are sparsely distributed, with concentrations along the Western and Eastern Ghats. Property crimes are concentrated in cities like Mumbai, Delhi-NCR, Bangalore, Chennai, Hyderabad, Pune, and Kolkata, which are home to some of the most expensive localities to buy real estate in the entire country. Other offenses are sparsely scattered and concentrated around metropolitan cities without any distinct patterns.

Pinpointing Crime Locations Using Scatter Plots

Heatmaps provide a good visualization of crime density, but a scatter plot is required to study each event individually. Figure 5 represents the scatter plot for the crime-related tweets collected thus far.

Parts of Telangana and Odisha appear not to have any markers over them. This can be verified with the corresponding heatmap showing zero crime density. A clustered scatter plot produces a heatmap. The view of the map at the default zoom level is not very comprehensive. Zooming into a particular area of interest and clicking a marker provides more comprehensive results, as shown in Fig. 6.

Similar to heatmaps, scatter plots may also be filtered by crime type. Figure 7 shows a scatter plot of only drug-related crimes.
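The per-category heatmaps and the filtered scatter plot both rely on splitting the geocoded tweets by crime category. A minimal sketch of that grouping step is given below; the record layout, the coordinates, and the function name points_by_category are assumptions made for illustration and are not taken from the original implementation.

```python
from collections import defaultdict

# Hypothetical geocoded tweets: (latitude, longitude, crime category).
tweets = [
    (28.6139, 77.2090, "Violent crimes"),
    (19.0760, 72.8777, "Drug-related crimes"),
    (15.2993, 74.1240, "Drug-related crimes"),
    (28.6139, 77.2090, "Property crimes"),
]


def points_by_category(records):
    """Group coordinates by category and keep one combined layer."""
    layers = defaultdict(list)
    for lat, lon, category in records:
        layers[category].append((lat, lon))
        layers["All crime categories"].append((lat, lon))
    return dict(layers)


layers = points_by_category(tweets)
# Points that would feed a drug-crime-only heatmap or scatter plot.
drug_points = layers["Drug-related crimes"]
```

Each list of points could then be passed to whichever mapping layer is in use; the combined layer corresponds to the all-category heatmap.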


Fig. 2  Detailed framework

Analytics

Percentage Share of Crime Categories

Figure 8 shows each crime category as a percentage of the total crime based on tweet data. As of 2022, India has recorded a crime index of 44.57 against an average of 44.74, according to World Crime Index Reports. Here, violent crimes are categorized as actions that intentionally cause bodily harm. The other offenses category has a higher share as it includes any crime that does not fit under the other five categories. Twitter data


might be an efficient way of tracking violent crimes. The percentage of traffic offenses is much higher according to NCRB reports. This implies that this category of tweets gets less traction on social media and is not covered by major news channels.

Fig. 3  State-wise crime distribution using choropleth

Crime Keyword Analytics

Analysis run on the keywords can help identify the level of awareness the public has toward the criminal activity around them. Figure 9 displays the top ten crime keywords with the highest hit rate.

Six of the top 10 keywords with the highest hits are from the violent crime category. This shows that tweet crime data has a high percentage of violent crimes relative to real-life crime reports. Crime, Police, and Arrest are general keywords used when no other event-specific keyword is found, and such tweets are classified as other offenses. However, the presence of Police as the keyword with the 5th highest number of hits indicates that the public is quite extensively using Twitter to bring the attention of law enforcement agencies to specific issues. Law enforcement agencies could consider tweets as a factor while determining resource deployment. Alternatively, this also indicates the possibility of using crime tweet data to analyze the effectiveness of law enforcement agencies. It can also be said that these words are common knowledge to most people and have specific translations from various regional languages.

Crime Keywords with the Highest Social Impact

It is possible to determine the crime keywords that Twitter users focus on by taking into account the number of retweets associated with each tweet. A retweet is a feature of Twitter that enables one user to repost another user's tweet, thereby sharing it with their followers and increasing the number of user interactions with the original post. Figure 10 shows the average number of retweets for a tweet with a particular keyword.

The keyword Trap tops the list with 56.75 retweets on average. This could be to raise awareness against conmen and fraudulent schemes. Burn may be used to indicate both arson and the use of fire to inflict bodily harm. Both cases are serious offenses, and retweeting helps raise awareness against them. The keyword Girl consistently repeats in most tweets conveying sexual crimes against minors, which deeply impact society. This data could be tracked to identify repeat offenders and help people identify and avoid areas with poor safety for children. Twitter being a platform for the heated exchange of political ideologies leads to the appearance of the keyword Congress. Keywords that appear on this as well as the previous chart indicate crimes that are most talked about on Twitter and also have a deep impact on society. In this case, the keyword is found to be Shot.

Crime in Metropolitan Cities

Rapid urbanization has made cities hotspots of crime. The clustered bar graph in Fig. 11 represents the counts of each of the 6 crime categories in 6 major metropolitan cities. The crime count is the highest in Delhi. This agrees with a 2022 report ranking Delhi as the city with the highest crime rate, with a population of 18,980,000 and a crime index of 59.58. Kolkata has the second lowest overall crime count, with no drug crimes reported on Twitter. This agrees with the 2022 NCRB report stating that Kolkata is the safest Indian city to live in. It must be noted that the count on this chart is also proportional to the Twitter user base in each city.

States with the Highest Crime Rates

The state-wise distribution of crime as seen in Fig. 3 is a representation of the absolute crime count. The crime rate for each state can be calculated by taking into account the population using the formula,


Fig. 4  Crime density detection using heatmaps. a Drug-related crimes, b Violent crimes, c Commercial crimes, d Property crimes, e Other offenses, f Traffic offenses, g All crime categories combined

Crime rate = (Total number of crimes / Total population of the state) × 100,000

Table 2 shows 10 states with the highest crime rates. These rates are updated in real time and change with the addition of new data.
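As a worked illustration of the crime rate formula, the sketch below recomputes the rate per 100,000 population for three of the states listed in Table 2, using the population and crime count values reported there. The function name crime_rate and the dictionary layout are illustrative choices, not part of the original system.

```python
def crime_rate(crime_count, population):
    """Crimes per 100,000 inhabitants."""
    return crime_count / population * 100_000


# (population, crime count) as reported in Table 2.
state_data = {
    "Goa": (1_521_992, 377),
    "Delhi": (19_301_096, 2377),
    "Chandigarh": (1_158_040, 92),
}

rates = {
    state: crime_rate(count, population)
    for state, (population, count) in state_data.items()
}

# States ordered from the highest to the lowest crime rate.
ranked = sorted(rates, key=rates.get, reverse=True)

for state in ranked:
    print(f"{state}: {rates[state]:.2f}")
```

Normalizing by population is what moves Goa above Delhi, even though Delhi has the larger absolute count.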


Fig. 4  (continued)

These crime rates are calculated on the basis of crime tweet data for a period of 2 months ranging from 18/08/2022 to 16/10/2022. Delhi, Goa, Gujarat, Tamil Nadu, Maharashtra, and Chandigarh are part of the top 10 states with the highest crime rates as per NCRB data. This shows a 60% match with crime data obtained from Twitter. The remaining states on this list may have a relatively larger number of Twitter users, leading to a discrepancy with NCRB data.


Fig. 5  Scatter plot with standard zoom level

Fig. 6  Pinpointing crimes through scatter plots

Fig. 7  Scatter plot of drug-related crimes

Fig. 8  Percentage shares of crime categories

Time Series Analysis

The real-time data used in this paper has been collected consistently from the 18th of August, 2022 to the 16th of October, 2022, a period of approximately 2 months. This section contains a basic analysis of the daily and hourly crime count plots. Note that these counts represent the number of crime tweets recorded and not the actual number of crimes, but they are treated as analogous in subsequent parts of this section.

Daily Plot of Crime Tweet Count

Figure 12 represents the crime count as a daily time series. It can be seen that some days have higher crime counts than others. There is a sharp fall in crime counts on weekends, especially Sundays (for example 21/08/2022, 23/08/2022). The number of crimes seems to increase during festivals (for example 18/08/2022: Krishna Janmashtami, 31/08/2022: Ganesh Chaturthi, 08/09/2022: Onam).
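The daily series can be obtained by bucketing tweet timestamps by calendar day. The following is a minimal sketch using only the standard library; the timestamps are hypothetical sample values, and the helper name count_by is an assumption rather than the authors' code. The same helper buckets by hour when given an hourly format string.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet creation times (already converted to local time).
timestamps = [
    "2022-08-18 09:15:00",
    "2022-08-18 14:40:00",
    "2022-08-18 15:05:00",
    "2022-08-21 03:30:00",
    "2022-08-21 16:20:00",
]


def count_by(stamps, bucket_format):
    """Count tweets per time bucket, e.g. per day or per hour."""
    counts = Counter()
    for stamp in stamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[moment.strftime(bucket_format)] += 1
    return counts


daily = count_by(timestamps, "%Y-%m-%d")
hourly = count_by(timestamps, "%Y-%m-%d %H:00")

# Flag weekend days (Saturday = 5, Sunday = 6) for the weekday comparison.
is_weekend = {
    day: datetime.strptime(day, "%Y-%m-%d").weekday() >= 5 for day in daily
}
```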


Fig. 9  Most used crime keywords

Fig. 10  Social impact of crime keywords

Fig. 11  Crime in metropolitan cities

There also appears to be a weak correlation between rainfall and crime count. This can be better analyzed with the collection of more data. In this case, the daily time series contains only 60 entries, which are insufficient to create accurate forecasts using statistical models. Therefore, an hourly time series is developed.

Hourly Plot of Crime Tweet Count

Figure 13 provides an hourly representation of the time series. It is seen that the least number of crimes is recorded during the early morning from 2:00 to 5:00 and the highest number of crimes is recorded during the late afternoon from 14:00 to 17:00. This time series representation is suitable for forecasting. The ARIMA model works best on


stationary data. A data set that does not have a clear trend can be said to be stationary. This can be checked using the Augmented Dickey–Fuller (ADF) test. The test is implemented using the adfuller function from the statsmodels.tsa.stattools package in Python. Figure 14 shows the results obtained from the Augmented Dickey–Fuller test.

The value of significance to us is the P-value. The lower the P-value, the stronger the evidence that the data is stationary. A data set with a P-value under 0.05 can be considered stationary. Therefore, the crime count data set is stationary.

Table 2  10 states with the highest crime rates

State         Population      Crime count   Crime rate (per 100,000)
Goa           15,21,992       377           24.77016962
Delhi         1,93,01,096     2377          12.31536282
Chandigarh    11,58,040       92            7.944457877
Maharashtra   12,49,04,071    2275          1.821397799
Puducherry    16,46,050       28            1.701041888
Uttarakhand   1,17,00,099     194           1.658105628
Gujarat       7,04,00,153     1121          1.592326085
Tamil Nadu    8,36,97,770     1252          1.495858253
Kerala        3,46,98,876     495           1.426559177
Meghalaya     37,72,103       42            1.113437252

Time Series Forecasting

Time series forecasting is the process of analyzing past and present data to quantifiably predict future data. Figure 15 shows the autocorrelation and partial autocorrelation plots used to identify the order of the SARIMA and ARIMA models.
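To make the autocorrelation plot concrete, the sketch below computes the sample autocorrelation of a series at each lag in plain Python. It is an illustration of the quantity being plotted, not the statsmodels routine; the toy series with a repeating four-step pattern stands in for the hourly crime counts and is an assumption made for the example.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    total = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / total
        for lag in range(max_lag + 1)
    ]


# Toy series repeating every 4 steps (six cycles).
series = [1, 2, 6, 3] * 6
acf = autocorrelation(series, max_lag=4)

# The lag with the strongest positive correlation hints at the season length.
strongest_lag = max(range(1, 5), key=lambda lag: acf[lag])
```

A pronounced spike at a lag equal to the cycle length is the kind of pattern that suggests a seasonal term at that period.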

Fig. 12  Daily time series

Fig. 13  Hourly time series


The autocorrelation function gives the correlation between a time series and its lags. The partial autocorrelation function gives the same correlation, but after removing the relations explained by previous lags. The order of the ARIMA model as generated by the auto_arima function is (2, 0, 5), and that of the SARIMA model is (0, 1, 2)x(2, 1, 1, 24) due to the daily seasonality of the data.

Figures 16, 17, and 18 show the validation plots generated for each of the three models. These plots compare forecasted data with real-time data to determine the accuracy of each model. The last 24 h of the training set data is used for validation.

Table 3 shows the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) generated from the validation plots for each of the three forecasting models.

From Fig. 19, the ARIMA model is found to have the least RMSE of the three models. Therefore, the ARIMA model is the best-suited model for time series forecasting of crime tweet count. Figure 20 compares the 24-h forecast plots generated by the three models.

Table 4 shows the hourly forecast of crime provided by the ARIMA model for the period 17/10/2022–18/10/2022.
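The two error measures used to compare the models can be written out directly. The sketch below defines RMSE and MAE and applies them to a short made-up pair of actual and forecast values; the numbers are illustrative only and are unrelated to the results reported for the three models.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical hourly counts and the corresponding forecasts.
actual = [3, 5, 8, 12]
forecast = [2, 5, 10, 9]

model_rmse = rmse(actual, forecast)
model_mae = mae(actual, forecast)
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does, which is why the two measures can differ noticeably for the same model.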

Fig. 14  Augmented Dickey–Fuller test

Fig. 15  ACF and PACF plots of aggregated data

Fig. 16  Validation plot for SARIMA model


Fig. 17  Validation plot for ARIMA model

Fig. 18  Validation plot for LSTM model

Table 3  Model errors

Model     RMSE     MAE
ARIMA     3.283    2.005
SARIMA    5.287    3.772
LSTM      7.781    6.285

Fig. 19  Comparison of forecasting models (bar chart titled "Validation of Forecasting models"; x-axis: forecasting model; series: RMSE and MAE)

Verification with Karnataka State Police (KSP) Data

The keywords were restructured to match the crime categories of the Karnataka State Police. Certain categories that are not of common knowledge or public importance were ignored, leaving behind the following categories: Spl & Local laws, Theft, Murder, Cases of Hurt, 107 Cr.P.C., Cybercrime, NDPS Cases, Cheating, Burglary (Night), Riots, POCSO, Rape, Robbery, Cr. Br. of Trust, Dacoity. The crime data scraped from Twitter along with the corresponding KSP data for the month of September 2022 are listed in Tables 5 and 6.

From Figs. 21 and 22, the number of tweets does not match the number of crimes directly, as not every crime is tweeted about. Therefore, the Twitter data are validated against KSP data by comparing the order of crime categories based on the number of crimes when arranged from the highest to lowest. As seen from Tables 5 and 6, there is a mismatch for the categories of Murder and Rape, giving a match of 86.67%. This shows that Twitter is an ideal source of data for crime forecasting (Table 6).

Suitability of the ARIMA Model

The ARIMA model can handle data sets with trends but cannot handle seasonality in data. The ADF test in Sect. "Hourly plot of crime tweet count" shows a P-value of 0.006, showing that the data is stationary, that is, with low enough trend and seasonality to apply the ARIMA model. However, on


verification with the KPSS test, a P-value of 0.01 was found, indicating a seasonal trend, in contradiction to the ADF test. Additive seasonal decomposition is applied to the data set to obtain its trend, seasonal, and residual components using the seasonal_decompose() function in the statsmodels.tsa.seasonal module in Python. Fig. 23 shows the trend data, Fig. 24 shows the seasonal component, and Fig. 25 shows the residual data of the crime incidents.

Fig. 20  Comparison of SARIMA, ARIMA, and LSTM forecasts

Table 4  ARIMA forecast

Time                             Crime count
2022-10-17 03:00:00–04:00:00     1.014640865
2022-10-17 04:00:00–05:00:00     1.558521878
2022-10-17 05:00:00–06:00:00     2.58573119
2022-10-17 06:00:00–07:00:00     4.281474358
2022-10-17 07:00:00–08:00:00     6.411219367
2022-10-17 08:00:00–09:00:00     8.842567881
2022-10-17 09:00:00–10:00:00     11.40985838
2022-10-17 10:00:00–11:00:00     13.93816941
2022-10-17 11:00:00–12:00:00     16.25523778
2022-10-17 12:00:00–13:00:00     18.2031955
2022-10-17 13:00:00–14:00:00     19.64932577
2022-10-17 14:00:00–15:00:00     20.49510508
2022-10-17 15:00:00–16:00:00     20.68291559
2022-10-17 16:00:00–17:00:00     20.19997023
2022-10-17 17:00:00–18:00:00     19.07918337
2022-10-17 18:00:00–19:00:00     17.39692752
2022-10-17 19:00:00–20:00:00     15.26782924
2022-10-17 20:00:00–21:00:00     12.83695856
2022-10-17 21:00:00–22:00:00     10.26994447
2022-10-17 22:00:00–23:00:00     7.741689586
2022-10-17 23:00:00–00:00:00     5.424453273
2022-10-18 00:00:00–01:00:00     3.47611496
2022-10-18 01:00:00–02:00:00     2.029417403
2022-10-18 02:00:00–03:00:00     1.182922747

Table 5  Crime data as per Twitter

Sl. no.   Crime category        Count
1         Spl and Local laws    82
2         Theft                 60
3         Murder                58
4         Cases of Hurt         57
5         107 Cr. P. C          47
6         Cybercrime            34
7         NDPS Cases            33
8         Cheating              31
9         Burglary–N            28
10        Riots                 26
11        POCSO                 20
12        Rape                  19
13        Robbery               17
14        Cr. Br. Of Trust      14
15        Dacoity               3

Table 6  Crime data as per KSP

Sl. no.   Crime category        Count
1         Spl and Local laws    4210
2         Theft                 1652
3         Cases of Hurt         1203
4         107 Cr. P. C          1185
5         Cybercrime            1073
6         NDPS Cases            609
7         Cheating              542
8         Burglary–N            368
9         Riots                 285
10        POCSO                 224
11        Murder                111
12        Robbery               104
13        Rape                  48
14        Cr. Br. Of Trust      29
15        Dacoity               13
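The text does not spell out how the 86.67% order match between Tables 5 and 6 is computed. One interpretation that reproduces the figure is to count the largest set of categories that appear in the same relative order in both rankings (a longest common subsequence) and divide by the number of categories. The sketch below follows that assumption; the function name and the choice of measure are illustrative, not the authors' stated method.

```python
def longest_common_subsequence(first, second):
    """Length of the longest sequence of items in the same order in both lists."""
    rows, cols = len(first) + 1, len(second) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if first[i - 1] == second[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


# Category order by count, highest first, from Table 5 (Twitter).
twitter_rank = [
    "Spl and Local laws", "Theft", "Murder", "Cases of Hurt", "107 Cr. P. C",
    "Cybercrime", "NDPS Cases", "Cheating", "Burglary-N", "Riots",
    "POCSO", "Rape", "Robbery", "Cr. Br. Of Trust", "Dacoity",
]
# Category order by count, highest first, from Table 6 (KSP).
ksp_rank = [
    "Spl and Local laws", "Theft", "Cases of Hurt", "107 Cr. P. C",
    "Cybercrime", "NDPS Cases", "Cheating", "Burglary-N", "Riots",
    "POCSO", "Murder", "Robbery", "Rape", "Cr. Br. Of Trust", "Dacoity",
]

in_order = longest_common_subsequence(twitter_rank, ksp_rank)
match_percent = in_order / len(ksp_rank) * 100
```

Under this measure, 13 of the 15 categories keep their relative order, which is consistent with Murder and Rape being the two out-of-place categories.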


Fig. 21  Twitter crime data reorganized as per KSP categories

Fig. 22  KSP crime data

Validation plot for ARIMA on residual data. As seen from Fig. 26, the residual data has neither seasonality nor trend and is completely stationary. ARIMA (2, 0, 2) was trained on this data to produce the following validation plot. Figure 26 shows the model validation on the residual values, with an RMSE of 3.135 and an MAE of 1.998, which are approximately equal to the RMSE and MAE of the ARIMA model when applied to the original data set, as seen in Table 3. Therefore, the limited seasonality of the original data set does not greatly impact the accuracy of the ARIMA model. The higher accuracy of the ARIMA model over the SARIMA model is due to the method of order identification. The order of ARIMA is found using the auto_arima function, while that of the SARIMA model is calculated manually. Thus, auto_arima provides a better-fitting model for the training data set.


Fig. 23  Trend data

Fig. 24  Seasonality of data

Fig. 25  Residual data

Conclusion

In this research, 2 months of relevant crime tweets scraped from Twitter were cleaned and classified into six categories based on 318 unique crime keywords. Analytical and geospatial visualization techniques were applied to the 15,601 tweets collected to produce valuable insights. The comparison of ARIMA, SARIMA, and


LSTM models showed that ARIMA is most suitable for time series forecasting of crime tweet count. Adding new users over time and increasing median user experience will further improve the quality of the crime tweet data set. The outcome of this study shows that Twitter is a viable option for analyzing and forecasting crime. Twitter data may also be used in combination with other sources of data as a means of verification. This research focuses only on univariate forecasting techniques and uses only Twitter as a source. There is a vast scope for using advanced forecasting techniques and multiple social media platforms for crime analysis and forecasting in the field of criminology. Thus, law enforcement agencies can use the developed system to optimize resource distribution, thereby improving crime response rates.

Fig. 26  Model validation for residual data

Funding The authors would like to confirm that there was no funding received for the above work.

Data availability The dataset used in this research was collected by the authors and will be provided on request.

Declarations

Conflict of interest I know of no conflicts of interest associated with this publication, and there has been no significant financial support for this work that could have influenced its outcome. As the corresponding author, I confirm that the manuscript has been read and approved for submission by the named author.

References

1. Conrow L, Aldstadt J, Mendoza NS. A spatio-temporal analysis of on-premises alcohol outlets and violent crime events in Buffalo, NY. Appl Geogr. 2015;58:198–205.
2. Ohyama T, et al. Investigating crime harm index in the low and downward crime contexts: a spatio-temporal analysis of the Japanese Crime Harm Index. Cities. 2022;130:103922.
3. Catlett C, et al. Spatio-temporal crime predictions in smart cities: a data-driven approach and experiments. Pervas Mob Comput. 2019;53:62–74.
4. Hu Y, et al. A spatio-temporal kernel density estimation framework for predictive crime hotspot mapping and evaluation. Appl Geogr. 2018;99:89–97.
5. Rummens A, Hardyns W, Pauwels L. The use of predictive analysis in spatiotemporal crime forecasting: building and testing a model in an urban context. Appl Geogr. 2017;86:255–61.
6. Prathap BR. Geospatial crime analysis and forecasting with machine learning techniques. In: Artificial Intelligence and Machine Learning for EDGE Computing. Academic Press, pp 87–102. 2022.
7. Wang Q, et al. CSAN: a neural network benchmark model for crime forecasting in spatio-temporal scale. Knowl Based Syst. 2020;189:105120.
8. Prathap BR, Ramesha K. Twitter sentiment for analyzing different types of crimes. In: 2018 International Conference on Communication, Computing and Internet of Things (IC3IoT). IEEE. 2018.
9. Prathap BR, Ramesha K. Geospatial crime analysis to determine crime density using Kernel density estimation for the Indian context. J Comput Theor Nanosci. 2020;171:74–86.
10. Boppuru PR, Ramesha K. Geo-spatial crime analysis using newsfeed data in Indian context. IJWLTT. 2019;14(4):49–64. https://doi.org/10.4018/IJWLTT.2019100103.
11. Boppuru PR, Ramesha K. Spatio-temporal crime analysis using KDE and ARIMA models in the Indian context. Int J Digit Crime Foren (IJDCF). 2020;12(4):1–19. https://doi.org/10.4018/IJDCF.2020100101.
12. Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2:160. https://doi.org/10.1007/s42979-021-00592-x.
13. Sarker IH. AI-based modeling: techniques, applications and research issues towards automation, intelligent and smart systems. SN Comput Sci. 2022;3:158. https://doi.org/10.1007/s42979-022-01043-x.
14. Jangada Correia V. An explorative study into the importance of defining and classifying cyber terrorism in the UK. SN Comput Sci. 2022;3:84. https://doi.org/10.1007/s42979-021-00962-5.
15. Kanimozhi N, Keerthana NV, Pavithra GS, Ranjitha G, Yuvarani S. CRIME type and occurrence prediction using machine learning algorithm. Int Conf Artif Intell Smart Syst (ICAIS). 2021;2021:266–73. https://doi.org/10.1109/ICAIS50930.2021.9395953.
16. Sivanagaleela B, Rajesh S. Crime analysis and prediction using fuzzy C-means algorithm. In: 2019 3rd International Conference
fuzzy C-means algorithm. In: 2019 3rd International Conference


on Trends in Electronics and Informatics (ICOEI), pp. 595–599. 2019. https://doi.org/10.1109/ICOEI.2019.8862691.
17. Kumar A, Verma A, Shinde G, Sukhdeve Y, Lal N. Crime prediction using k-nearest neighboring algorithm. In: 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), pp. 1–4. 2020. https://doi.org/10.1109/ic-ETITE47903.2020.155
18. Safat W, Asghar S, Gillani SA. Empirical analysis for crime prediction and forecasting using machine learning and deep learning techniques. IEEE Access. 2021;9:70080–94. https://doi.org/10.1109/ACCESS.2021.3078117.
19. Wikipedia. Crime in India. Wikimedia Foundation. Last modified September 14, 2022. https://en.wikipedia.org/wiki/Crime_in_India.
20. O'Neill A. India: Urbanization 2021. Statista, July 29, 2022. https://www.statista.com/statistics/271312/urbanization-in-india/.
21. Malik AA. Urbanization and crime: a relational analysis. J Hum Soc Sci. 2016;21:68–9.
22. Gadagpolice. "Monthly Crime Review." Monthly Crime Review - Karnataka State Police. Accessed October 17, 2022. https://ksp.karnataka.gov.in/new-page/Monthly%20Crime%20Review/en.
23. Gerber MS. Predicting crime using Twitter and kernel density estimation. Decis Sup Syst. 2014;61:115–25.
24. Prathap BR. Geo-spatial crime density attribution using optimized machine learning algorithms. Int j inf tecnol. 2023;15:1167–78. https://doi.org/10.1007/s41870-023-01160-7.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

