Spatio-Temporal Crime Analysis and Forecasting On Twitter Data Using Machine Learning Algorithms
https://doi.org/10.1007/s42979-023-01816-y
ORIGINAL RESEARCH
Received: 19 October 2022 / Accepted: 30 March 2023 / Published online: 6 May 2023
© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2023
Abstract
The concept of social media began to gain popularity in the late 1990s and has played a significant role in connecting people across the globe. The constant addition of features to old social media platforms and the creation of new ones have helped amass and retain an extensive user base. Users could now share their views and provide detailed accounts of events from around the world to reach like-minded people. This led to the popularization of blogging and brought into focus the posts of the commoner. These posts began to be verified and included in mainstream news articles, bringing about a revolution in journalism. This research aims to use a social media platform, Twitter, to classify, visualize, and forecast Indian crime tweet data and provide a spatio-temporal view of crime in the country using statistical and machine learning models. The Tweepy Python module's search function and '#crime' query have been used to scrape relevant tweets under geographical constraints, followed by substring-keyword classification using 318 unique crime keywords. The Bokeh and gmaps Python modules create analytical and geospatial visualizations, respectively. Time series forecasting of crime tweet count is performed by comparing the accuracy of Long Short-Term Memory (LSTM), Auto-Regressive Integrated Moving Average (ARIMA), and Seasonal Auto-Regressive Integrated Moving Average (SARIMA) models to determine the best model.
Keywords Crime analysis · Twitter data · Machine learning algorithms · Forecasting · Crime prevention
SN Computer Science
383 Page 2 of 22 SN Computer Science (2023) 4:383
a local level. These feeds may be used as a secondary data source to verify the accuracy of Twitter data.

The data scraped from Twitter is first organized and classified using keywords and removing duplicate tweets. The obtained data is then used as a base for visualization techniques like heatmaps and choropleths and forecasting using ARIMA, SARIMA, and LSTM models. These techniques together provide locations with the highest probable occurrence of crime at any given time across the country.

Related Work

In this section, an extensive review of related literature has been made to explicate the nature, importance, trend, and pattern and determinants of spatio-temporal crime. A wide variety of sources have been explored for this purpose, including published articles, journals, theories of crime, available and accessible books, research reports, government reports, Karnataka state police Reports, and various official websites of authorized agencies related to crime. The causes of crime have been the subject matter of much speculation, discussions, research, and debates. There is a large assortment of theories about crime and criminal behavior. Scores of theories related to crime and criminal behavior state crime as a part of human nature. Crime and criminal behavior mainly stem from the psychological, biological, sociological, and economic aspects of human behavior. Various theories explain people's engagement in crime from mental, physical, developmental, economic, social, cultural, and other causes. Theories exploring the causes of crime identified religion, philosophy, politics, economics, and social forces as the main contributors to the ever-increasing crime rate. Several works have been done on crimes, criminals, and different theories of crimes. A few of them are discussed here, which the researcher believes can lend a hand in shaping and explaining the present study.

Cornow et al. [1] proposed a study on the relationship between alcohol retailers and crime in Buffalo, New York, in terms of area and time. The study investigated if a crime was more likely to occur near licensed liquor businesses. Data on licensed alcohol outlets and violent crimes were examined using global and local bivariate space–time k-function techniques from 2005 to 2011. A global bivariate space–time K-function analysis revealed the spatial and temporal distribution of bars and crime. Personal blunders were both collected and dispersed. According to a local survey, outlets selling alcohol and crimes occur simultaneously and in place. Space–time analysis of bars and crime reveals a link. Much time has passed since Louis Thurstone compared different types of crime almost a century ago as stated by Ohyama et al. [2]. Recent research has used methodologies such as the Cambridge Crime Harm Index (CCHI), which assesses the detrimental effects of crime on society and identifies places with high crime harm by assigning a greater weight to the more serious offenses. Although its applicability in Japan, a country with a low rate of violent crime overall and specifically gun violence, is debatable, this index affects urban policy. Catlett et al. [3] provide a prediction approach based on geographical analysis and auto-regressive models. The method predicts urban crime hotspots. This technique will produce a spatio-temporal crime prediction model. The model will use crime hotspots and predictors. Each predictor estimates general crime in its area. The experiment used NYC and Chicago data. According to this review, the system can provide accurate spatial and temporal crime projections over rolling time periods. Hu et al. [4] map and analyze hotspots using a spatial–temporal approach. The proposed paradigm differs in four ways: STKDE incorporates time into predictive hotspot mapping. The best bandwidths are chosen using probability cross-validation. A statistical significance test eliminates false positives in density estimations and the predictive accuracy index (PAI) curve measures predictive hotspot mapping. Anneleen Rummens et al. [5] proposed a predictive analysis for urban research. Invasion, robbery, and battery statistics are collected in 200 m by 200 m grids. The monthly crime rate for 2014 can be estimated using data from the last three years. Monthly forecasts are broken down (day vs. night). The accuracy of a forecast is based on the direct hit rate, the precision, and the prediction index (ratio of direct hit rate versus proportion of total area predicted as high risk). Predictive analysis of crime data at the grid level is used to make functional forecasts. Monthly forecasts that tell the difference between day and night do better than biweekly forecasts, which suggests that the amount of time a forecast covers affects how well it works. Prathap [6] examined 68 crime-related keywords to assist in identifying the type of crime using geographical and temporal information. Keywords are assessed to provide as much information as possible on criminal activities. It is possible to segment criminal activity using news feeds and the Naive Bayes classifier. Mallet extracts terms from many news streams. K-means is utilized to identify crime hotspots. The KDE [23] strategy is utilized while dealing with crime density, and this method has corrected the algorithm's shortcomings. The study uncovered similarities between the ARIMA and crime-predicting models.

Changes in the criminal justice system have been talked about for a long time. Prioritizing crimes based on random models from physics or statistics is a good idea in theory, but it doesn't work well in practice. Data-driven models, especially neural network models, can depict event dynamics. Huge data sets can be mined for information. Spatial–temporal datasets make it hard to learn about crime in a region because of their complexity, intractable correlations, and redundant data. CSAN was made by Qi Wang et al. [7] by
putting together variational auto-encoders and context-based sequence generative neural networks. CSAN is better than Conv-LSTM at predicting the number of different types of crimes in a region. Twitter enables the monitoring of spatial and temporal crime statistics on social media. Prathap et al. [8].

Since social media users change rapidly, emotional analysis is an excellent decision-making tool. Twitter is a popular way to obtain news and communicate with others. More than 150 million individuals send 500 million 140-character tweets daily. Twitter is utilized to recommend products depending on the opinions of its users. This paper demonstrates how to examine crime-related tweets from users. The data will demonstrate variations in how people feel about different types of crimes and how they feel about positive or negative crime situations. Crime is the most serious societal problem in emerging countries. Crime has an impact on a country's reputation and way of life. Crime has an economic impact since it necessitates investments in the police force and justice system, increasing the government's financial burden. Law enforcement works to reduce crime. Real-time crime projections can aid in crime reduction. The proposed study by Prathap et al. [9] develops a criminal analytics platform that analyzes newsfeed data to detect crime hotspots. This method assists criminologists in understanding unacknowledged relationships between crime and specific areas. Interactive visualizations aid law enforcement in crime prediction.

Crime analysis using social media data, such as Newsfeeds, Facebook, Twitter, and so on, is a rising area of study for law enforcement agencies worldwide. Data are used to anticipate attacks and arrange reinforcements. Prathap et al. [10] collect and visualize newsfeed data to focus on textual data analytics. By providing crime area coordinates and possible crimes, the research study creates a framework for forecasting 16 types of crime in India and Bangalore. Prathap et al. [11] conducted a research study on criminal activity reports in India and Bangalore. Theft, homicide, alcoholism, assault, etc., are categorized by geographic density and criminal trends, such as time of day, to identify and emphasize national and regional crime. Based on a review of a year's worth of news articles, 68 crime-related terms can be divided into three categories.

Crime clusters are typically reported in locations with a high number of kernels. Time series data can be predicted using the ARIMA model. Using a data mining application, one's different criminal tendencies can be graphically represented. Iqbal H. Sarker describes machine learning methods that can boost an application's intelligence and capabilities [12]. The study investigated the basics of many machine learning algorithms and their relevance to many real-world application areas, including cybersecurity, smart cities, healthcare, e-commerce, and agriculture. The study discusses issues and directions for future research. This project aims to build a technical resource that academic and industrial specialists, practical decision-makers, and others may utilize. Sarker et al. [13] provide a comprehensive overview of "AI-based Modeling" including the principles and capabilities of potential AI techniques that can aid in the development of intelligent and smart systems in real-world application domains, such as business, finance, healthcare, agriculture, smart cities, and more. Our study's research challenges are highlighted. The study contains academics, industry specialists, and decision-makers in real-world scenarios and application areas with a complete introduction to AI-based modeling. Jangada Correia's [14] research recommended various methods for determining public perceptions of cyberterrorism and the need for standardized terminology and framework. The findings of an online poll favor expanding stakeholder diversity to improve terrorism detection and prevention in the UK. Despite general agreement on cyberterrorism, misunderstandings may impair the public's ability to detect and report it. Overall, the literature research and gathering of primary data contribute to developing a cyber terrorism definition consistent with UK legislative definitions and a terrorist activity framework that emphasizes the connections between traditional, cyber-enabled, and cyber-dependent terrorism.

N Kanimozhi et al. [15] presented a study that predicted current crimes using Kaggle open-source crime data. This study assesses the crime that has the greatest impact on a specific place and time period. This study uses machine learning methods such as Naive Bayes to classify criminal patterns more precisely than pre-composed works. Sivanagaleela et al. [16] developed a method based on crime location rather than criminal identification. Initially, the system relied heavily on naive Bayes classification. Data on kidnapping, murder, theft, burglary, cheating, crime against women, and robbery will be clustered using the fuzzy C-Means technique in the current setting. In developing nations like India, crime is rife. Rapid urbanization necessitates constant oversight. Akash Kumar et al. [17] suggested employing KNN to forecast crime rates to avert calamity. It will anticipate a crime's type, date, place, and hour. This data will show local criminal trends, which can help with investigations. It also includes a list of the most serious crimes committed in a specific area. The k-nearest neighbor, machine learning method, is used in the author's work. Wajiha Safat et al. [18] utilized machine learning algorithms, such as logistic regression, support vector machine (SVM), Naive Bayes, k-nearest neighbors (KNN), decision tree, multilayer perceptron (MLP), random forest, and eXtreme Gradient Boosting (XGBoost), to improve the fit of crime data. Regarding RMSE and MAE, LSTM performed well on both data sets. A data analysis predicts more than 35 categories of crime and an annual decline in crime rates in Chicago
and Los Angeles, with February having the lowest crime rate. Chicago's crime rate will continue to grow slowly until decreasing. According to the ARIMA model, the crime rate and the number of offenses in Los Angeles decreased significantly. The results of crime prediction were also seen in the urban cores of both cities. Overall, these results provide police with a more accurate method than earlier methods for predicting crime, crime hotspots, and future trends. They can be utilized to inform police practice and planning.

Based on the literature, there are very complex spatio-temporal patterns and complicated urban configurations in urban crime. A large number of existing algorithms will not capture all the aspects of the patterns. Therefore, comprehensive techniques are necessary to identify the complex spatio-temporal pattern for analyzing urban crime. Machine learning and deep learning algorithms are well suited to capturing spatio-temporal crime patterns and predicting future outcomes.

Motivation and Objective of the Research

India is a fast-developing country with a large population and a young workforce. Urbanization is one of the many by-products of this high-speed development [19, 21]. The degree of urbanization in India has been in a constant ascent from 31.28% in 2011 to 35.39% in 2021. This growth has led to mass migration of people from rural to urban areas in search of better prospects. Researchers have shown a direct correlation between the concentration of population in urban areas and the increase in crime rate.

To show the consistent increase in crime, we may consider the example of Bangalore city, Karnataka. Bangalore is one of the most prominent metropolitan cities in the country and is considered to be the IT hub of India. As per Karnataka State Police reports [22], cases of thefts recorded were 480 in January 2022 and 725 in August 2022, a rise to 151.04% of the January figure (an increase of 51.04%) over 7 months. Similarly, a 122.51% increase is seen in the number of Special and Local Laws (SLL) crimes.

The crime rate in India has been consistently increasing since 2018, according to reports by the National Crime Records Bureau (NCRB) [20]. 2020 witnessed a massive surge in the crime rate in response to the imposition of COVID-19 restrictions. This surprised law enforcement agencies, leading to poor mobilization of limited resources.

Growth and prosperity of a nation are largely dependent on the psychological state and well-being of its residents. This, along with the literature review on the use of social media to predict crime, has motivated the use of Twitter data to create a tool that provides a spatio-temporal visualization of criminal activity. Based on this main objective, the following sub-objectives were identified:

1. Development of a dashboard to provide a consolidated view of aggregated data.
2. Identification of crime keywords and categorization of crime tweets.
3. Generation of crime heatmaps, choropleths, and scatter plots.
4. Forecasting of crime tweet count using ARIMA, SARIMA, and LSTM models.

Identifying Problem Based on the Literature

The real-time identification of crime hotspots has become a priority for law enforcement agencies to determine and neutralize increasing threats pre-emptively. However, the vast amount of data collected by these agencies is outdated, inconsistent, and fragmented based on several categories like region, nature of crime, etc. Therefore, a single, consistent source of data collected in real time that provides an overall view of crime in the country with equal importance to both localized minor events and distributed macroscopic events is required. The use of data from official news sources is promising but lacks the level of coverage provided by social media posts. Therefore, we turn to Twitter, where the required data set is updated with every tweet in real time and is publicly accessible by all. In addition to hotspot prediction, a simple means of visualization is also required. Gerber et al. [23] and Prathap [24] applied Kernel Density Estimation to identify crime hotspots in the USA and India, respectively. The literature review shows that social media and police data can predict criminal activity, but none have utilized Twitter data for the same in the Indian context.

List of Crime Keywords Considered

A total of 318 unique crime keywords have been considered to classify crime into 6 categories as shown in Table 1. This model of categorization is modeled after the system used by the Karnataka State Police but is modified for improved understanding of readers without an in-depth understanding of the Indian Penal Code.

Methodology and Implementation

The paper focuses on building a tool that uses Twitter data to identify crime hotspots in India. This section discusses the methodology used to develop the tool, as mentioned earlier. As discussed, the methodology can be broadly divided into classification, visualization, and forecasting components. These components can be further classified into the following seven subsystems:
Drug-related crimes: Drug Trafficking, Drug dealing, Dealing drugs, Drug dealer, Alcohol Drinking, Alcohol dealing, Alcohol, Liquor law violation, Liquor, Drug, Narcotics, Heroin, Cocaine, Ganja, Opium, Cannabis, Weed, Overdose, Capsule, Ketamine, Amphetamine

Violent crimes: Gang rape, Rape, Sexual harassment, Sexual assault, Sex offense, Sex abuse, Sexual abuse, Molest, Dishonor, Assault, Fight, Beat, Kick, Punch, Battery, Lash out, Attack, Belt down, Obliterate, Intent to kill, Attempted murder, Murder, Kill, Homicide, Armed robbery, Robbery, Terrorism, Kidnapping, Abduction, Tied, Incapacitate, Ransom, Shot, Gunshot, Shootout, Stab, Harassment, Abuse, Assault, Outrage, Snatch, Put to death, Lynch, Hit and run, Gambling, Hang, Run over, Dacoity, Sex racket, Victim, Goon, Rowdy, Knife, Domestic violence, Dead, Death, Violence, Sexual misconduct, Body, Firing, Fired, Sexual, Hostage, Sex, Injure, Injury, Finger, Slit, Intimidate, Kidnap, Acid, Explosive

Commercial crimes: Official Document Forgery, Currency Forgery, Official Seal Forgery, Official Stamp Forgery, Forgery, Bribery, Bribe, Counterfeit, Cheat, Conned, Impersonator, Impersonate, Fake, Deceptive, Breach of trust, Breach of contract, Breach, Embezzlement, Misappropriate, Fraud, Corrupt, Leak, Abscond, Misconduct, Conspire, Conspiracy, Tax, Excise, Collude, Collusion, Contract, Extort, Lakh, Crore, Gold, Gullible, Chit fund, Chit, Fund, Money, Association, KYC, Bank, Business, Deal, Raid, Copyright, Import, Export

Property crimes: Arson, Motor vehicle theft, Theft, Burglary, Steal, Riot, Violent protest, Protest, Larceny, Barrage, Barrage fire, Fire, Bombardment, Bomb, Explosion, Explode, Shelling, Looting, Trespass, Incendiarism, Shoplifting, Vandalism, Vandalize, Vandals, Encroach, Property, Stole, Smuggle, Thief, Burn, Land dispute, Land, Dispute, Rob, Enter, House, Robbed, Crook, Loot, Robbers, Shop, Burglar, Grand theft auto

Traffic offenses: Speeding, Signal Jump, Jump Signal, Running a Red Light, Reckless, Collision, Collide, High speed, Speed, Fast, Drunk driving, Drink and drive, Drinking and driving, Driving under influence, DUI, Helmet, Accident

Other offenses: Employing Illegal Worker, Illegal worker, Prostitution, Illegal Gambling, Gambling, Begging, Adultery, Homosexuality, Weapons violation, Violation, Weapon, Porn, Video, Child, Public peace violation, Peace, Stalking, Hurt, Dowry, Modesty, Negligence, Suicide, Criminal damage, Harlotry, Whoredom, Espionage, Spy, Pickpocketing, Pilfering, Poaching, Damage, Illegal, Juvenile, Inappropriate, Marriage, Custody, Affair, Student, Hacking, Convict, Prison, Love, Threat, Blackmail, Witness, Accuse, Conflict, Gang, Arrest, Miscreant, Cop, Police, Incident, Criminal, Cyber Crime, Scream, Commit, Seize, Thugs, Uproar, Trap, Hindu, Muslim, BJP, Congress, Romantic, Marry, Politic, Follow, Hunting, Hunt, Tiger, Fur, Skin, Husband, Wife, Follow, Girlfriend, Cyber, Crime, Suspect, Abase, Warrant, Social media, YouTube, Facebook, Instagram, Twitter, False news, Law, Minor, Girl, Group, Chain snatch, Chain, Tribal, Caste, Adulterate, Hijack, Hate, Schedule, Tribe, Atrocities, Atrocity, Abet, Conceal, Habit, Repeat, Major minerals, Minor minerals, Mineral, Mining, Mine, Vehicle, Car, Bike
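
The keyword lists in Table 1 drive the substring-keyword classification: hashtags are searched first, then the tweet text, and the matched keyword determines the crime category. A minimal sketch of that lookup follows. It uses only a small excerpt of the table, the names `CRIME_KEYWORDS` and `classify_tweet` are hypothetical, and trying longer keywords before shorter ones is an assumption rather than a detail given in the paper.

```python
# Hypothetical excerpt of the keyword table; the full list has 318 keywords.
CRIME_KEYWORDS = {
    "Drug-related crimes": ["Drug Trafficking", "Narcotics", "Ganja"],
    "Violent crimes": ["Attempted murder", "Murder", "Kidnapping"],
    "Commercial crimes": ["Forgery", "Bribery", "Fraud"],
    "Property crimes": ["Motor vehicle theft", "Theft", "Burglary"],
    "Traffic offenses": ["Drunk driving", "Speeding", "Accident"],
    "Other offenses": ["Blackmail", "Stalking", "Poaching"],
}

# Flatten to (keyword, category) pairs, longest keyword first, so that
# "motor vehicle theft" is tried before the shorter "theft" (an assumption).
_PAIRS = sorted(
    ((kw.lower(), cat) for cat, kws in CRIME_KEYWORDS.items() for kw in kws),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def classify_tweet(hashtags, text):
    """Return (keyword, category) for the first match, or (None, None).

    Hashtags are searched before the tweet text, mirroring the order
    described for the classification subsystem.
    """
    for field in (" ".join(hashtags), text):
        haystack = field.lower()
        for keyword, category in _PAIRS:
            # str.find returns -1 when the substring is absent.
            if haystack.find(keyword) != -1:
                return keyword, category
    return None, None
```

For instance, `classify_tweet(["#crime"], "Motor vehicle theft reported in Pune")` yields the property-crime category. Plain substring matching also fires inside longer words, so short keywords can produce false positives.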
and coordinates isolating tweets from and around the Indian subcontinent. Many retrieved tweets are written in various regional languages and translated into English for better analysis. The tweets are then stored in a .csv file.

Data Cleaning

Data cleaning helps improve the integrity of the data set by removing duplicate tweets. Each tweet is assigned a hashcode, and duplicate hashcodes are removed. Tweet hashtags are extracted for each tweet. Tweets with only '#crime' are said to have 'No Hashtags'. The API provides the timestamp of tweet creation in UTC, which is converted to the corresponding time in IST. A URL pointing to each tweet on the Twitter platform is generated and stored to confirm the authenticity of the data set.

Location Identification

Location identification involves the identification of the geographical location of crime by searching tweet attributes in a prioritized manner. First, the name of a city is searched for in the tweet hashtags, followed by the tweet text and eventually the tweet geolocation. A city once identified is coupled with the corresponding state and stored. The second round of searches is carried out on state names in the same manner. A state once identified is coupled with its capital city and stored. Finally, the word 'India' is searched for in the same fashion. Identification of the word maps the city to Delhi. All tweets still left without a geographical location are considered to belong to neighboring countries and removed. The latitudes and the longitudes of identified cities are stored.

Classification

Classification refers to the categorization of tweets based on the type of crime. The crime keyword is identified using a substring-keyword search on the tweet hashtags followed by the tweet text. The keyword is then referenced against the crime keyword database to determine the crime type. Data from the above subsystems are stored in the secondary database.

Visualization

Visualization provides a comprehensible diagrammatic representation of the data stored in the secondary database. The applied visualization techniques can be classified into analytical and geospatial. Geospatial visualization includes a heatmap for crime density representation, a choropleth for state-wise crime distribution and a scatter plot to pinpoint crime location. Analytical visualization includes bar graphs, pie charts, clustered bar graphs, and line graphs to represent quantifiable measures of collected data.

Forecasting

Forecasting involves a comparative analysis of ARIMA, SARIMA, and LSTM models to determine the most accurate forecasting model for crime tweet count.

ARIMA

ARMA models were developed for stationary time series. However, a new class of models was introduced by integrating a phase for the removal of non-stationarity in a time series. These models, known as ARIMA models, were developed by Box and Jenkins and describe a stochastic process. ARIMA is described as a three-stage iterative model which includes time series identification, estimation, and verification. The Box–Jenkins approach mainly uses the integration filter, AR filter, and MA filter. The integration filter produces a filtered (differenced) series from observed data. The AR filter generates an intermediate series which is further processed by the MA filter, which results in random white noise.

The formula for an ARIMA(p, d, q) model is given by:

$$f(n) = \mu + \phi_1 f(n-1) + \cdots + \phi_p f(n-p) + \varepsilon(n) + \theta_1 \varepsilon(n-1) + \cdots + \theta_q \varepsilon(n-q) \tag{1}$$

In Eq. (1), $\mu$ is the mean of the series, $\phi_1, \ldots, \phi_p$ are the autoregression coefficients, $\theta_1, \ldots, \theta_q$ are the moving average coefficients, $f(n)$ is the forecasted value at time $n$, $\varepsilon(n)$ is the error (residual) at time $n$, $n$ is the current time step, and $n-1, n-2, \ldots, n-p$ are previous time steps.

This equation can be broken down into three components:

Autoregression (AR) term: the autoregression term represents the linear dependence between an observation and a number of lagged observations. It is represented by the sum $\phi_1 f(n-1) + \cdots + \phi_p f(n-p)$.

Differences (I) term: the differences term represents the amount of differencing applied to make the time series stationary. It is represented by d in the ARIMA(p, d, q) formula.

Moving average (MA) term: the moving average term represents the error or residual at a certain point modeled as a linear function of the previous errors or residuals. It is represented by the sum $\theta_1 \varepsilon(n-1) + \cdots + \theta_q \varepsilon(n-q)$.

In summary, the ARIMA model predicts the next value in a time series as a weighted sum of past observations and past forecast errors, with weights determined by the values of p, d, and q.

SARIMA

SARIMA (Seasonal Autoregressive Integrated Moving Average) is a statistical model that is used to analyze and forecast time series data with a seasonal component. It is an extension
of the ARIMA model, which takes into account the presence of seasonality in the data.

The formula for a SARIMA$(p, d, q)(P, D, Q)_m$ model is given by:

$$\begin{aligned} f(n) = {} & \mu + \phi_1 f(n-1) + \cdots + \phi_p f(n-p) \\ & + \varepsilon(n) + \theta_1 \varepsilon(n-m) + \cdots + \theta_q \varepsilon(n-qm) \\ & + \gamma_1 f(n-1-mD) + \cdots + \gamma_P f(n-Pm-mD) \end{aligned} \tag{2}$$

In Eq. (2), $\mu$ is the mean of the series, $\phi_1, \ldots, \phi_p$ are the autoregression coefficients, $\theta_1, \ldots, \theta_q$ are the moving average coefficients, $\gamma_1, \ldots, \gamma_P$ are the seasonal autoregression coefficients, $f(n)$ is the forecasted value at time $n$, $\varepsilon(n)$ is the error (residual) at time $n$, $n$ is the current time step, $n-1, n-2, \ldots, n-p$ are previous non-seasonal time steps, $n-m, n-m-1, \ldots, n-m-q$ are previous seasonal time steps, $m$ is the number of seasonal periods per cycle, $D$ is the order of seasonal differencing, and $d$ is the order of non-seasonal differencing.

In summary, the SARIMA model predicts the next value in a time series as a weighted sum of past observations, past forecast errors, and past seasonal errors, with weights determined by the values of p, d, q, P, D, Q, and m. It has a form similar to the ARIMA model, but with added terms of seasonal autoregression ($\gamma_1, \ldots, \gamma_P$) and seasonal differencing (D).

LSTM

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) architecture that is capable of learning long-term dependencies in data. An RNN is a neural network architecture that is designed to process sequential data, such as text, speech, or time series data. The LSTM architecture was designed to overcome the problem of vanishing gradients that occur in traditional RNNs when the input sequence is very long.

In an LSTM network, each neuron, or “memory cell,” has three gates: an input gate, a forget gate, and an output gate. These gates are used to control the flow of information into and out of the cell, and to allow the LSTM network to learn when to remember or forget information. The gates are controlled by weights that are learned during training, and the values of these weights determine how much information is allowed to flow through the gates at each time step.

Implementation of the Process

The various subsystems mentioned in the methodology section are implemented using Python programs. The Tweepy Python module is used to communicate with the Twitter API.

Data Aggregation

To extract tweets from Twitter, a valid API key with appropriate access level authorization and the API key secret is required. An API key authentication request is sent using the OAuth2AppHandler class in Tweepy. Once Twitter successfully authenticates the API key, the data aggregation process may begin.

Scraping is performed using the search function of the Tweepy Python module. The search query '#crime' along with geocoding '20.5937 (latitude), 78.9629 (longitude), 3000 km (the radius for tweet extraction)' are passed as parameters. This geocode encompasses all of India and small parts of its neighboring countries. The search_tweets query returns pages of data traversed using the tweepy cursor to extract tweet attributes. Tweet text is translated to English using the googletrans Python module. Any URLs contained in the tweet text are eliminated before translation using simple substring elimination with 'https' as a key. The aggregated data is written onto a pandas data frame and appended to the primary CSV file.

Data Cleaning

Hashcodes are generated using the standard Python hashing library. Duplicate hashcodes are removed using the drop_duplicates() method of the pandas module. Time zone conversion is performed using the dateutil Python package.

Location Identification

The presence of geographic keywords referenced from the locations database is found in tweet attributes using the find() function. find() returns a value of −1 if the keyword is not found.

Classification

Crime keywords referenced from the keyword database are searched for in the tweet attributes using the find() function. find() returns a value of −1 if a keyword is not found.

Visualization

Geospatial heatmaps and scatterplots are generated using the gmaps Python package while the interactive choropleth is generated using the bokeh module and a shape (.SHP) file of India's administrative divisions. Analytical visualizations are generated using the matplotlib Python package.
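
The cleaning and location-identification steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' code: SHA-256 over the tweet text stands in for the unspecified hashing routine, a fixed UTC+05:30 offset replaces the dateutil conversion, and the `CITY_TO_STATE` and `STATE_TO_CAPITAL` dictionaries are hypothetical excerpts of the locations database.

```python
import hashlib
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # India Standard Time, UTC+05:30

# Hypothetical excerpts of the locations database.
CITY_TO_STATE = {"bengaluru": "Karnataka", "mumbai": "Maharashtra"}
STATE_TO_CAPITAL = {"karnataka": "Bengaluru", "bihar": "Patna"}


def deduplicate(tweets):
    """Keep the first tweet for each distinct text, compared by hash code."""
    seen, unique = set(), []
    for tweet in tweets:
        code = hashlib.sha256(tweet["text"].encode("utf-8")).hexdigest()
        if code not in seen:
            seen.add(code)
            unique.append(tweet)
    return unique


def to_ist(created_at_utc):
    """Convert a timezone-aware UTC timestamp to IST."""
    return created_at_utc.astimezone(IST)


def locate(hashtags, text, geolocation=""):
    """Return (city, state) using the prioritized search, or None."""
    fields = [" ".join(hashtags).lower(), text.lower(), geolocation.lower()]
    # Round 1: city names; a found city is paired with its state.
    for field in fields:
        for city, state in CITY_TO_STATE.items():
            if field.find(city) != -1:
                return city.title(), state
    # Round 2: state names; a found state is paired with its capital.
    for field in fields:
        for state, capital in STATE_TO_CAPITAL.items():
            if field.find(state) != -1:
                return capital, state.title()
    # Round 3: a bare mention of India maps to Delhi.
    for field in fields:
        if field.find("india") != -1:
            return "Delhi", "Delhi"
    return None  # no location found; such tweets are removed
```

Each search round scans hashtags, then text, then geolocation, so a city named in a hashtag takes precedence over a different city named in the text.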
Forecasting
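
Before the LSTM is trained, the implementation in this subsection scales the crime counts to the range 0 to 1 and reshapes the series into fixed-length input windows. A minimal, dependency-free sketch of those two steps follows. `min_max_scale` and `make_windows` are hypothetical stand-ins for scikit-learn's MinMaxScaler and Keras's TimeseriesGenerator, and both the sample counts and the window length of 3 are assumptions, not values reported in the paper.

```python
def min_max_scale(series):
    """Linearly rescale values to [0, 1]; returns (scaled, lo, hi)."""
    lo, hi = min(series), max(series)
    span = hi - lo
    if span == 0:
        # A constant series carries no scale information; map it to zeros.
        return [0.0 for _ in series], lo, hi
    return [(x - lo) / span for x in series], lo, hi


def inverse_scale(value, lo, hi):
    """Map a scaled prediction back to a tweet count."""
    return value * (hi - lo) + lo


def make_windows(series, length):
    """Pair each run of `length` consecutive values with the value after it."""
    return [
        (series[i:i + length], series[i + length])
        for i in range(len(series) - length)
    ]


daily_counts = [10, 20, 15, 30, 25]  # hypothetical crime tweet counts
scaled, lo, hi = min_max_scale(daily_counts)
windows = make_windows(scaled, length=3)
```

Each window becomes one training sample whose target is the next scaled count. Predictions from the network are mapped back to tweet counts with `inverse_scale` before error metrics are computed.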
Python. Autocorrelation (ACF) and Partial Autocorrelation (PACF) plots have been implemented using the plot_acf and plot_pacf functions from the statsmodels.graphics.tsaplots package in Python. The ARIMA model has been implemented using the ARIMA class from the statsmodels.tsa.arima.model module. ARIMA parameters are found using the auto_arima function from the pmdarima package. To train the LSTM model, the crime counts are scaled to a value between 0 and 1 using the MinMaxScaler class from the sklearn.preprocessing package. The time series is then formatted for the training process using the TimeseriesGenerator class of the keras.preprocessing.sequence module. An LSTM layer (available as LSTM in keras.layers) is added to the model with 100 neurons and ReLU activation. A dense layer is then added, and the model is compiled using the Adam optimizer with Mean Squared Error as the loss. The model is trained for 50 epochs while saving only the best model using the ModelCheckpoint callback.
Dashboard
The dashboard is built using HTML, CSS, and JavaScript to display interactive plots and other analytics in an organized form. Figure 2 shows a detailed framework of the program implementation.
Results and Discussion
State-wise Crime Distribution Using Choropleth
Figure 3 displays the absolute crime count determined by crime tweets and does not consider population. Therefore, populous states like Uttar Pradesh tend to have a higher crime count. The five states with the highest crime count in decreasing order are as follows: Delhi (2377), Maharashtra (2275), Uttar Pradesh (1670), Tamil Nadu (1252), and Gujarat (1121). A 2022 report shows Delhi has the country's highest crime rate, with a crime index of 59.58. The figures generated from tweet data show a mismatch with NCRB reports. This is because not all crimes are tweeted about. The accuracy of tweet data concerning real-world data is expected to improve over time with the addition of new users. However, the relative state-wise accuracy of data can be used to determine the state-wise distribution of Twitter users. For example, because the crime tweet data for Delhi is similar to real-world crime data, Delhi can be assumed to have a higher percentage share of Twitter users. In contrast, states like Arunachal Pradesh, where there is a large disparity between crime tweet data and real-world crime data, can be assumed to have few Twitter users. States along the border of the country, like Ladakh, Sikkim, and Arunachal Pradesh, seem to have fewer Twitter users and crime-related tweets. This may be due to the following factors: geopolitical tension, lack of internet infrastructure, or low population density.
Crime Density Detection Using Heatmaps
Seven heatmaps are generated in total, six of which display the geospatial density of each crime category and one of which displays the total geospatial crime density, including all types of crime. Figure 4 shows the different heatmaps generated. Figure 4 shows that the density distribution for each crime category is different as various factors affect their distribution. Criminal activity appears denser in metropolitan cities. It can be inferred from the above plots that violent crimes make up the majority of the total crime count due to the similarities in density distribution. Violent crimes appear more concentrated in North India, with a decreasing density as we move South. Delhi, Uttar Pradesh, and Bihar belts appear to have the country's highest concentration of criminal activity. Delhi appears to have a high concentration of criminal activity under all categories. Drug-related crimes are sparsely distributed, with Goa, Delhi, and Mumbai being hotspots. Commercial crimes are concentrated around metropolitan and port cities, which are seats of commerce. Traffic offenses are sparsely distributed with concentrations along the Western and Eastern Ghats. Property crimes are concentrated in cities like Mumbai, Delhi-NCR, Bangalore, Chennai, Hyderabad, Pune, and Kolkata, which are home to some of the most expensive localities to buy real estate in the entire country. Other offenses are sparsely scattered and concentrated around metropolitan cities without any distinct patterns.
Pinpointing Crime Locations Using Scatter Plots
Heatmaps provide a great visualization of the crime density, but a scatter plot is required to study each event individually. Figure 5 represents the scatter plot for the crime-related tweets collected thus far.
Parts of Telangana and Odisha appear not to have any pointers over them. This can be verified with the corresponding heatmap showing zero crime density. A clustered scatter plot produces a heatmap. The view of the map at the default zoom level is not very comprehensive. Zooming into a particular area of interest and clicking the marker provides more comprehensive results as shown in Fig. 6.
Similar to heatmaps, scatter plots may also be filtered by crime type. Figure 7 shows a scatter plot of only drug-related crimes.
Fig. 4 Crime density detection using heatmaps. a Drug-related crimes, b Violent crimes, c Commercial crimes, d Property crimes, e Other offenses, f Traffic offenses, g All crime categories combined
\[ \text{Crime rate} = \frac{\text{Total number of crimes}}{\text{Total population of the state}} \times 100{,}000 \]
Table 2 shows 10 states with the highest crime rates. These rates are updated in real time and change with the addition of new data.
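As a worked illustration of the crime rate formula, the sketch below recomputes the rate per 100,000 inhabitants for four of the states listed in Table 2, using the population and crime tweet counts reported there. The function name `crime_rate` and the dictionary layout are assumptions made for this example only.

```python
def crime_rate(crime_count, population, per=100_000):
    """Crime rate per `per` inhabitants: (crimes / population) * per."""
    if population <= 0:
        raise ValueError("population must be positive")
    return crime_count / population * per


# (population, crime tweet count) pairs as reported in Table 2.
STATES = {
    "Goa": (1_521_992, 377),
    "Delhi": (19_301_096, 2377),
    "Chandigarh": (1_158_040, 92),
    "Maharashtra": (124_904_071, 2275),
}

rates = {state: crime_rate(count, pop) for state, (pop, count) in STATES.items()}

# States ordered from the highest to the lowest crime rate.
ranked = sorted(rates, key=rates.get, reverse=True)
```

Normalizing by population is what moves a small state such as Goa ahead of Delhi and Maharashtra, even though its absolute crime tweet count is far lower.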
These crime rates are calculated on the basis of crime tweet data for a period of 2 months ranging from 18/08/2022 to 16/10/2022. Delhi, Goa, Gujarat, Tamil Nadu, Maharashtra, and Chandigarh are part of the top 10 states with the highest crime rates as per NCRB data. This shows a 60% match with crime data obtained from Twitter. The remaining states on this list may have a relatively larger number of Twitter users leading to a discrepancy with NCRB data.
Fig. 5 Scatter plot with standard zoom level
Fig. 7 Scatter plot of Drug-related crimes
The real-time data used in this paper has been collected consistently from the 18th of August, 2022 to the 16th of October, 2022, a period of approximately 2 months. This section contains a basic analysis of the daily and hourly crime count plots. Note that these counts represent the number of crime tweets recorded and not the actual number of crimes but are treated as analogous in subsequent parts of this section.
Daily Plot of Crime Tweet Count
Figure 12 represents the crime count as a daily time series. It can be seen that some days have higher crime counts than others. There is a sharp fall in crime counts on weekends, especially Sundays (for example 21/08/2022, 23/08/2022). The number of crimes seems to increase during festivals (for example 18/08/2022: Krishna Janmashtami, 31/08/2022: Ganesh Chaturthi, 08/09/2022: Onam).
Table 2 10 States with the highest crime rates
State          Population      Crime count   Crime rate
Goa            15,21,992       377           24.77016962
Delhi          1,93,01,096     2377          12.31536282
Chandigarh     11,58,040       92            7.944457877
Maharashtra    12,49,04,071    2275          1.821397799
Puducherry     16,46,050       28            1.701041888
Uttarakhand    1,17,00,099     194           1.658105628
Gujarat        7,04,00,153     1121          1.592326085
Tamil Nadu     8,36,97,770     1252          1.495858253
Kerala         3,46,98,876     495           1.426559177
Meghalaya      37,72,103       42            1.113437252
stationary data. A data set that does not have a clear trend can be said to be stationary. This can be identified using the Augmented Dickey–Fuller Test. The test is implemented using the adfuller function from the statsmodels.tsa.stattools package in Python. Figure 14 shows the results obtained from the Augmented Dickey–Fuller Test.
The value of significance to us is the p-value. The null hypothesis of the test is that the series has a unit root and is therefore non-stationary, so a lower p-value is stronger evidence of stationarity. A series with a p-value under 0.05 can be considered stationary. Therefore, the crime count data set is stationary.
Time Series Forecasting
Time series forecasting is the process of analyzing past and present data to make quantitative predictions of future data. Figure 15 shows the Autocorrelation and Partial Autocorrelation plots used to identify the order of the SARIMA and ARIMA models.
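To make concrete what an autocorrelation plot displays, the following sketch computes the sample autocorrelation of a series at a given lag in plain Python. It is a hand-rolled illustration, not the statsmodels routine used in the paper, and the short periodic series is hypothetical data chosen so that the seasonal lag stands out.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    # Total sum of squared deviations, used to normalise every lag.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Co-movement between the series and a copy of itself shifted by `lag`.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Hypothetical series that repeats every 4 steps.
series = [1, 2, 3, 2] * 6

acf_values = [autocorrelation(series, k) for k in range(9)]
```

For this series the values peak again at lags 4 and 8, which is the kind of repeating spike that suggests a seasonal period when reading an ACF plot.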
The autocorrelation function gives the correlation between a time series and its lags. The partial autocorrelation function gives the same correlation but after removing the relations explained by previous lags. The order of the ARIMA model as generated by the auto_arima function is (2, 0, 5) and that of the SARIMA model is (0, 1, 2)×(2, 1, 1, 24), with a seasonal period of 24 due to the daily seasonality of the data.
Figures 16, 17, and 18 show the validation plots generated for each of the three models. These plots compare forecasted data with real-time data to determine the accuracy of each model. The last 24 h of training set data is used for validation.
Table 3 shows the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) generated from the validation plots for each of the three forecasting models. From Fig. 19, the ARIMA model is found to have the least RMSE of the three models. Therefore, the ARIMA model is the best-suited model for time series forecasting of crime tweet count. Figure 20 compares the 24-h forecast plots generated by the three models.
Table 4 shows the hourly forecast of crime provided by the ARIMA model for the period 17/10/2022–18/10/2022.
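The validation procedure described above (hold out the last 24 h, forecast them, then score the forecast with RMSE and MAE) can be sketched as follows. The hourly counts are hypothetical, and the seasonal-naive forecast that simply repeats the previous day is a stand-in baseline for illustration only; it is not one of the three models compared in the paper.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def split_holdout(series, horizon=24):
    """Split a series into a training part and the last `horizon` points."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]


# Hypothetical hourly crime tweet counts for three days.
hourly_counts = [5 + (hour % 24) // 6 for hour in range(72)]

train, valid = split_holdout(hourly_counts, horizon=24)
forecast = train[-24:]  # seasonal-naive baseline: repeat the previous day
errors = (rmse(valid, forecast), mae(valid, forecast))
```

RMSE penalizes large individual misses more heavily than MAE, which is one reason for reporting both when comparing models.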
Table 3 Model errors
Model     RMSE     MAE
ARIMA     3.283    2.005
SARIMA    5.287    3.772
LSTM      7.781    6.285
[Figure: bar chart "Validation of Forecasting models"; x-axis: Forecasting Model (ARIMA, SARIMA, LSTM); y-axis: Count; series: RMSE, MAE]
that are not of common knowledge or public importance were ignored leaving behind the following categories: Spl & Local laws, Theft, Murder, Cases of Hurt, 107 Cr.P.C., Cybercrime, NDPS Cases, Cheating, Burglary - Night, Riots, POCSO, Rape, Robbery, Cr. Br. of Trust, Dacoity. The crime data scraped from Twitter along with the corresponding KSP data for the month of September 2022 are listed in Tables 5 and 6.
From Figs. 21 and 22, the number of tweets does not match the number of crimes directly as not every crime is tweeted about. Therefore, the Twitter data are validated against KSP data by comparing the order of crime categories based on the number of crimes when arranged from the highest to lowest. As seen from Tables 5 and 6, there is a mismatch for the categories of Murder and Rape, giving a match of 86.67% (13 of the 15 categories). This shows that Twitter is an ideal source of data for crime forecasting (Table 6).
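The rank-order comparison used for validation can be illustrated with a short sketch. The exact matching procedure is not spelled out in the text, so this assumes a position-by-position comparison of the two rankings; the category counts below are hypothetical and are not the values from Tables 5 and 6.

```python
def rank_categories(counts):
    """Return category names ordered from the highest to the lowest count."""
    return sorted(counts, key=counts.get, reverse=True)


def rank_match_percentage(counts_a, counts_b):
    """Percentage of rank positions holding the same category in both sources."""
    rank_a = rank_categories(counts_a)
    rank_b = rank_categories(counts_b)
    if set(rank_a) != set(rank_b):
        raise ValueError("both sources must cover the same categories")
    matches = sum(a == b for a, b in zip(rank_a, rank_b))
    return 100 * matches / len(rank_a)


# Hypothetical counts: Murder and Rape swap places between the two sources.
tweet_counts = {"Theft": 50, "Murder": 30, "Rape": 20, "Cheating": 10, "Riots": 5}
police_counts = {"Theft": 500, "Murder": 150, "Rape": 200, "Cheating": 90, "Riots": 40}

match = rank_match_percentage(tweet_counts, police_counts)
```

With 15 categories and two mismatched positions, the same calculation gives 13 / 15, or roughly 86.67%, which agrees with the figure reported above.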
Fig. 20 Comparison of SARIMA, ARIMA, and LSTM forecasts
Validation plot for ARIMA on residual data. As seen from Fig. 26, the residual data has neither seasonality nor trend and is completely stationary. ARIMA (2, 0, 2) was trained on this data to produce the following validation plot.
Figure 26 shows the model validation on the residual values, with an RMSE of 3.135 and an MAE of 1.998, which are approximately equal to the RMSE and MAE of the ARIMA model when applied to the original data set as seen in Table 3. Therefore, the limited seasonality of the original data set does not greatly impact the accuracy of the ARIMA model. The higher accuracy of the ARIMA model over the SARIMA model is due to the method of order identification. The order of ARIMA is found using the auto_arima function while that of the SARIMA model is calculated manually. Thus, auto_arima provides a better-fitting model for the training data set.
LSTM models showed that ARIMA is most suitable for time series forecasting of crime tweet count. Adding new users over time and increasing median user experience will further improve the quality of the crime tweet data set. The outcome of this study shows that Twitter is a viable option for analyzing and forecasting crime. Twitter data may also be used in combination with other sources of data as a means of verification. This research focuses only on univariate forecasting techniques and uses only Twitter as a source. There is a vast scope for using advanced forecasting techniques and multiple social media platforms for crime analysis and forecasting in the field of criminology. Thus, law enforcement agencies can use the developed system to optimize resource distribution, thereby improving crime response rates.
Funding The authors would like to confirm that there was no funding received for the above work.
Data availability The dataset used in this research was collected by the author and will be provided on request.
Declarations
Conflict of interest I know of no conflicts of interest associated with this publication, and there has been no significant financial support for this work that could have influenced its outcome. As the corresponding author, I confirm that the manuscript has been read and approved for submission by the named author.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.