A R T I C L E  I N F O

Editor: Mustafa Matalgah

Keywords: Surveillance system; Detecting crime scene; Deep learning; ST; DRNN

A B S T R A C T

Surveillance system research is now experiencing great expansion. Surveillance cameras installed in public locations such as offices, hospitals, schools, roads, and other locations can be utilised to capture important activities and movements for event prediction, online monitoring, goal-driven analysis, and intrusion detection. This research proposes a novel technique for detecting crime scene violence in real time from video surveillance footage using deep learning architectures. The aim is to collect real-time crime scene video from the surveillance system, extract features using a spatio-temporal (ST) technique, and classify them with a Deep Reinforcement Neural Network (DRNN). The input video is processed and converted into video frames, from which the features are extracted and classified. The purpose is to detect signals of hostility and violence in real time, allowing abnormalities to be distinguished from typical patterns. To validate the system's performance, it is trained and tested on the large-scale UCF Crime anomaly dataset. The experimental results reveal that the suggested technique performs well on real-time datasets, with accuracy of 98%, precision of 96%, recall of 80%, and F-1 score of 78%.
1. Introduction
Applications in various areas, including crime prevention, automatic smart visual monitoring and road safety, demand considerable attention to anomaly detection in video surveillance. In recent decades, an enormous number of surveillance cameras have been installed in both private and public locations for effective real-time monitoring to prevent malfunctions and protect public safety [1]. Most cameras, however, offer just passive logging services and are not capable of active monitoring. The volume of this footage grows every minute, making it difficult for human specialists to comprehend and analyse it. Similarly, monitoring analysts have to wait hours for abnormal occurrences to be captured or seen before immediate reports can be made [2]. Because there are few anomalous events in the real
world, video anomaly detection is studied as a one-class problem, in which the model is trained on typical footage and a video is tagged as anomalous when odd patterns appear. All typical real-world monitoring events cannot be accumulated in one dataset. Different typical actions may thus deviate from the regular training events and may ultimately produce false alarms [3]. In contemporary human action recognition research, notably in video surveillance, violence detection has been a hot area. The classification of human activity in real time, nearly instantaneously after the action has occurred, is one of the challenges of human action recognition in general. This difficulty escalates when dealing with surveillance video for a number of reasons: the quality of surveillance footage is diminished, adequate lighting is not always guaranteed, and there is generally no contextual information that can be used to ease the detection of actions and the classification of violent versus non-violent behaviour. Furthermore, for violent scene detection to be helpful in real-world surveillance applications, the identification of violence must be fast in order to allow for prompt intervention and resolution. In addition to poor video quality, violence can occur in any setting at any time of day; therefore, a solution has to be robust enough to detect violence regardless of the conditions. Settings where violence detection can be applied to video surveillance include the interior and exterior of buildings, in traffic, or on police body cameras [4].
A major purpose of video surveillance is the detection of unusual situations such as traffic accidents, robberies, or illicit activity. Human operators and manual examination, prone to distraction and fatigue, are still required by most existing monitoring systems. As a result, effective computer vision techniques for automatically detecting video anomalies and violence are becoming increasingly relevant. Building algorithms that detect specific anomalous occurrences, such as violence detectors, fight action detectors, and traffic accident detectors, is a small step toward resolving anomaly detection. In recent years, video action recognition has received a lot of attention after achieving very promising results by leveraging the robustness of CNNs [5]. In most businesses and sectors, installing CCTVs for ongoing surveillance of people and their interactions is a widespread practice. Every day, nearly every person in a developed country with a population of millions is captured by a camera. Constant monitoring of these surveillance recordings by police to determine whether or not the occurrences are suspicious is practically impossible, as it necessitates a large workforce and their undivided attention. As a result, there is a growing demand for high-precision automation of this process. It is also vital to show which frame, and which parts of it, include unexpected activity, as this aids in determining whether the unusual activity is abnormal or suspicious. This will aid the concerned authorities in finding the underlying cause of anomalies while also saving the time and effort that would otherwise be spent manually searching the records. ARS is a real-time monitoring system that recognises and records evidence of offensive or disruptive behaviour in real time. Using a variety of deep learning models, this study seeks to detect and characterise high movement levels in a frame. Videos are divided into portions in this project, and a detection alert is raised in the event of a threat, displaying the suspicious behaviour at a specific point in time. The videos in this project are divided into two classes: threat and safe. Burglary, Abuse, Explosion, Fighting, Shooting, Shoplifting, Arson, Road Traffic Accidents, Robbery, Assault, Stealing and Vandalism are among the 12 uncommon actions we recognise. Detecting these irregularities would make people feel safer [6].
The contribution of this paper is as follows:
• To collect a real-time crime scene video dataset and process it into video frames for detecting abnormal activities.
• The converted video frames are processed with spatio-temporal analysis, which uses forward, backward, and bidirectional predictions to extract the features of video-based motion. The prediction errors are combined into a single image that depicts the motion of the sequence (a sketch of this step is given after this list).
• The extracted features are then classified using a Deep Reinforcement Neural Network (DRNN).
• The experimental results report accuracy, precision, recall and F-1 score for various real-time video surveillance datasets.
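As a rough illustration of the frame-conversion and prediction-error fusion step, the following Python sketch (OpenCV/NumPy) reads a clip, forms forward and backward frame-prediction errors, and averages them into a single motion image. The file name, frame size, and the simple difference-based "prediction" are illustrative assumptions, not the exact pipeline of this work.

```python
import cv2
import numpy as np

def motion_image(video_path, size=(224, 224)):
    """Fuse forward/backward prediction errors of a clip into one motion image."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, size).astype(np.float32))
    cap.release()

    fused = np.zeros((size[1], size[0]), dtype=np.float32)
    for t in range(1, len(frames) - 1):
        fwd_err = np.abs(frames[t] - frames[t - 1])   # forward prediction error
        bwd_err = np.abs(frames[t] - frames[t + 1])   # backward prediction error
        fused += 0.5 * (fwd_err + bwd_err)            # bidirectional combination
    if len(frames) > 2:
        fused /= (len(frames) - 2)                    # average over the sequence
    return fused

# motion = motion_image("crime_clip.mp4")  # hypothetical file name
```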
The paper is organized as follows: related works are described in Section 2, the proposed methodology is detailed in Section 3, Section 4 presents the experimental analysis, and Section 5 concludes the paper.
2. Related works
In the sphere of public safety, the automated video surveillance system has become a hot focus of research. There has been a lot of research done on object movement identification and tracking. Artificial intelligence has also aided in the reduction of labour and the enhancement of surveillance efficiency. Several attempts have been made to partially or completely automate this labour with applications such as human activity recognition, event detection and behaviour analysis. The work in [7] utilised a Harris detector to extract important points and SIFT as a descriptor, then a BoVW to extract mid-level features, solved using the same method as visual categorization. The works [8,9] employed the Space-Time Interest Point (STIP) to distinguish facial emotions, human activities, and mouse activity with 83%, 80%, and 72% accuracy. To categorise video sequences, [10] combines the Difference of Gaussians [11] with PCA-SIFT (Principal Component Analysis SIFT) [12] and BoVW, concluding that the vocabulary size employed in BoVW is highly influenced by the complexity of the scenes classified. The majority of studies employ BoVW; however, [13] reported a comparison of BoVW with other descriptors. [14] compares the performance of descriptors such as HOF and HOG with variations in optical flow, using Lucas-Kanade, Horn-Schunck, and Farnebäck as optical flow methods. [15,16], one of the first attempts to use audio to detect violence, defined violence as events including gunfire, explosions, fighting, and yelling, whereas non-violent content was represented by audio segments with music and talking. Descriptors included energy entropy, short-time energy, ZCR, spectrum flux and roll-off, with a polynomial SVM classifier reaching an accuracy of 85.5%. Using Bag of Audio Words (BoAW), [17] utilized MFCC (Mel-Frequency Cepstral Coefficients) as an audio descriptor and dynamic Bayesian networks to obtain mid-level features. The main contribution of that research is the removal of video segmentation noise using BoAW. Based on Laptev's study [18], they provided BoVW utilising STIP (Space-Time Interest Point) as a descriptor and compared STIP-based BoVW with SIFT-based BoVW performance.
In this case, STIP outperformed the competition. Ullah et al. [19] proposed HueSTIP (Hue Space-Time Interest Points), a variant of STIP that counts pixel colours and recognises general activities for detecting conflicts. HueSTIP outperforms STIP, albeit at a larger computational cost. [20] used MoSIFT to distinguish fights and compared MoSIFT and STIP as descriptors with BoVW and SVM classifiers. In the experiments, two datasets were used: movies and hockey games. STIP beat MoSIFT in the hockey dataset, with 91.7% accuracy versus 90.9% for MoSIFT, but MoSIFT surpassed STIP in the movie dataset, with 89.5% accuracy compared to 44.5% for STIP [21]. The authors of [22] and [23], in contrast to [24], use localised detection rather than full-frame video processing. The study in [25] proposes, in particular, utilising the intrinsic location of anomalies and examining whether the use of spatiotemporal data might aid in the detection of abnormalities. They combine the model with a tube extraction module to narrow the scope of the investigation to a specified set of spatiotemporal coordinates. The authors favour human, in-hand annotation over computer vision-driven localization, which is one downside of this technique, as it is a time-consuming and tedious effort. In contrast, [26] extracts numerous activation boxes from a motion activation map that assesses the intensity of activity at each location to automatically determine all potential attention zones where fighting actions may occur. The authors cluster all localised proposals around the retrieved attention zones based on the geographical link between each pair of human proposals and activation boxes. It is worth noting that the study [27] primarily focuses on pinpointing the location of a fight in a public space; consequently, this approach is not suitable for a unified ADS. In fact, in untrimmed public video footage, occlusions, motion blur, lighting changes and other environmental alterations [28] are still difficult to deal with. As a result, we present an automatic yet effective attention region localization strategy based on background subtraction in this study. To begin, a robust background subtraction method is used to find attention/moving zones. Attention regions are then supplied to a 3D CNN action recognition system [29].
3. Proposed design for real-time violence detection in crime scene intelligent video surveillance systems
This section discusses the proposed model for implementing real-time violence detection in crime scenes for intelligent video surveillance systems. Feature extraction and classification are carried out with deep learning architectures: the video frame features are first extracted and then classified to detect abnormality in the crime scene surveillance footage. The overall proposed architecture is represented in Fig. 1; a minimal sketch of the classification stage is given below.
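Since the models in this work are built with the Keras API on TensorFlow (see Section 4), the following is a minimal, hypothetical sketch of a frame-level classification stage (normal vs. violent). The layer sizes and the simple CNN topology are illustrative assumptions, not the exact DRNN configuration of the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_frame_classifier(input_shape=(224, 224, 3), num_classes=2):
    """Toy CNN classifier for individual video frames (illustrative only)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```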
Let $y_t$ represent a video image frame at time t, and hence a spatial entity, and let y be a 3-D volume consisting of spatio-temporal image frames. Each pixel in $y_t$ corresponds to a site s, which is indicated by $y_{st}$. Let $Y_t$ denote a random field and $y_t$ its realisation at time t. Thus, $y_{st}$ denotes a spatio-temporal coordinate of the grid (s, t). Let x stand for the segmentation of the video sequence y and $x_t$ for the segmented form of $y_t$. In the same way, $x_{st}$ indicates the label of a site in a frame. Assume that $X_t$ denotes the MRF from which $x_t$ is derived. The label field can be estimated for any feature extraction issue by maximising the posterior probability distributions in Eqs. (1) and (2):

$$\hat{x}_t = \arg\max_{x_t} P(X_t = x_t \mid Y_t = y_t) \tag{1}$$
where $\hat{x}_t$ indicates the calculated labels. The prior probability $P(Y_t = y_t)$ is constant, and hence the estimate reduces to Eq. (3):

$$\hat{x}_t = \arg\max_{x_t} P(Y_t = y_t \mid X_t = x_t, \theta)\, P(X_t = x_t, \theta) \tag{3}$$

where $\theta$ is the parameter vector for the clique potential function of $x_t$. Here, $\hat{x}_t$ is the MAP estimate. Eq. (3) consists of two parts: the prior probability $P(X_t = x_t, \theta)$ and the likelihood function $P(Y_t = y_t \mid X_t = x_t, \theta)$.

The prior probability $P(X_t = x_t, \theta)$ can be expressed as

$$P(X_t = x_t, \theta) = \frac{1}{z}\, e^{-\frac{U(x_t)}{T}} = \frac{1}{z}\, e^{-\frac{1}{T}\sum_{c\in C} V_c(x_t)} \tag{4}$$
where z is the partition function, given as $z = \sum_{x_t} e^{-\frac{U(x_t)}{T}}$, and $U(x_t)$ is the energy function. $V_c(x_t)$ denotes the clique potential function in the spatial realm; Eqs. (5) and (6) express it in terms of the MRF model bonding parameter $\alpha$:

$$V_c(x_t) = \begin{cases} +\alpha & \text{if } x_{st} = x_{qt} \\ -\alpha & \text{if } x_{st} \ne x_{qt} \end{cases} \tag{5}$$

$$V_{sc}(x_{st}, x_{qt}) = \begin{cases} +\alpha & \text{if } x_{st} \ne x_{qt} \text{ and } (s,t),(q,t)\in S \\ -\alpha & \text{if } x_{st} = x_{qt} \text{ and } (s,t),(q,t)\in S \end{cases} \tag{6}$$
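The clique potentials of Eqs. (5) and (6) and the Gibbs prior of Eq. (4) can be illustrated with a small NumPy sketch; the 4-neighbourhood, the binary label field, and the sign convention below follow Eq. (5) as printed and are assumptions for illustration.

```python
import numpy as np

def spatial_energy(labels, alpha=1.0):
    """Sum of pairwise clique potentials over horizontal/vertical neighbours."""
    same_h = labels[:, :-1] == labels[:, 1:]     # horizontally adjacent sites
    same_v = labels[:-1, :] == labels[1:, :]     # vertically adjacent sites
    energy = alpha * (same_h.sum() + same_v.sum())          # +alpha for equal labels
    energy -= alpha * ((~same_h).sum() + (~same_v).sum())   # -alpha for unequal labels
    return float(energy)

def gibbs_prior_unnormalised(labels, alpha=1.0, T=1.0):
    """exp(-U(x)/T), i.e. Eq. (4) up to the partition function z."""
    return np.exp(-spatial_energy(labels, alpha) / T)

# x = np.random.randint(0, 2, size=(8, 8))
# print(gibbs_prior_unnormalised(x))
```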
It is well known that if $X_t$ is an MRF, it satisfies the Markovianity property [20] in the spatial direction, as shown by Eq. (7):

$$P\big(X_{st} = x_{st} \mid X_{qt} = x_{qt},\ \forall q \in S,\ s \ne q\big) = P\big(X_{st} = x_{st} \mid X_{qt} = x_{qt},\ (q,t)\in \eta_{st}\big) \tag{7}$$

$$\begin{aligned}
\rho_i^{S,g}(t+1) &= \rho_i^{S,g}(t)\big(1 - \Pi_i^{g}(t)\big),\\
\rho_i^{E,g}(t+1) &= \rho_i^{S,g}(t)\,\Pi_i^{g}(t) + (1-\eta^{g})\,\rho_i^{E,g}(t),\\
\rho_i^{A,g}(t+1) &= \eta^{g}\,\rho_i^{E,g}(t) + (1-\alpha^{g})\,\rho_i^{A,g}(t),\\
\rho_i^{I,g}(t+1) &= \alpha^{g}\,\rho_i^{A,g}(t) + (1-\mu^{g})\,\rho_i^{I,g}(t),
\end{aligned}$$

$$V_{tec}(x_{st}, x_{er}) = \begin{cases} +\gamma & \text{if } x_{st} \ne x_{er},\ (s,t),(e,r)\in S,\ t \ne r,\ r \in \{(t-1),(t-2)\} \\ -\gamma & \text{if } x_{st} = x_{er},\ (s,t),(e,r)\in S,\ t \ne r,\ r \in \{(t-1),(t-2)\} \end{cases} \tag{8}$$
$$m_k = \frac{\sum_{i=k}^{k+l-1}\psi_i}{l}, \quad k = 1, \ldots, q - l + 1 \tag{9}$$

The one-point correlation function $k_t^{(1)}(x)$ calculates the predicted number of agents within a region A by Eq. (10):

$$E(|\gamma_t \cap A|) = \int_A k_t^{(1)}(x)\,dx \tag{10}$$

$$E(|\gamma_t \cap A_1|\,|\gamma_t \cap A_2|) = \int_{A_1}\!\int_{A_2} k_t^{(2)}(x_1, x_2)\,dx_2\,dx_1 + \int_{A_1\cap A_2} k_t^{(1)}(x)\,dx$$

The predicted product of the number of agents in area $A_1$ at time t and the number of agents in area $A_2$ at time $t+\Delta t$ is related to the spatio-temporal correlation function by Eq. (11):

$$E(|\gamma_t \cap A_1|\,|\gamma_{t+\Delta t}\cap A_2|) = \int_{A_1}\!\int_{A_2} k_{t,\Delta t}(x_1, x_2)\,dx_2\,dx_1 + \int_{A_1\cap A_2} k_{t,\Delta t}(x)\,dx \tag{11}$$
$$\frac{d}{dt}k_t(\eta) = (L^{\Delta} k_t)(\eta)$$

$$\frac{d}{dt}k_t^{(1)}(x) = -m\,k_t^{(1)}(x) - \int_{R_i} a^{-}(x-y)\,k_t^{(2)}(x,y)\,dy + \int_{R_i} a^{+}(x-y)\,k_t^{(1)}(y)\,dy \tag{12}$$

$$k_{t,\Delta t}^{OO}(x,y) = k_{\Delta t}^{OO}(x,y) + k_{\Delta t}^{O+}(x,y) + k_{\Delta t}^{-O}(x,y) + k_{\Delta t}^{-+}(x,y), \qquad k_{t,\Delta t}^{O}(x) = k_{\Delta t}^{O}(x) \tag{13}$$

$$n_i^{\text{eff}} = \sum_{g=1}^{N_G} (n_i^{g})^{\text{eff}}, \qquad (n_i^{g})^{\text{eff}} = \sum_{j}\big[(1-p^{g})\delta_{ij} + p^{g} R_{ji}^{g}\big]\, n_j^{g}$$

$$f(x) = 1 + \big(1 - e^{-\xi x}\big)$$

$$n_{j\to i}^{m,h}(t) = n_j^{h}\,\rho_j^{m,h}(t)\big[(1-p^{h})\delta_{ij} + p^{h} R_{ji}^{h}\big], \quad m \in \{A, I\} \tag{16}$$
The prior probability $P(X_t = x_t, \theta)$ follows the Gibbs distribution and is of the form of Eq. (17). The likelihood function $P(Y_t = y_t \mid X_t = x_t)$ can be expressed as Eq. (19):

$$P(Y_t = y_t \mid X_t = x_t) = P(y_t = x_t + n \mid X_t = x_t, \theta) = P(N = y_t - x_t \mid X_t = x_t, \theta) \tag{19}$$

where n is a realization of the Gaussian degradation process $N(\mu, \sigma)$. Thus, $P(Y_t = y_t \mid X_t = x_t)$ can be expressed as Eq. (20):

$$P(N = y_t - x_t \mid X_t, \theta) = \frac{1}{\sqrt{(2\pi)^{f}\det[k]}}\; e^{-\frac{1}{2}(y_t - x_t)^{T} k^{-1}(y_t - x_t)} \tag{20}$$

where k denotes the variance-covariance matrix, $\det[k]$ the determinant of matrix k, and f the number of features; the variance-covariance matrix k is given by Eq. (21):

$$k = \big[k_{ij}\big]_{f\times f} \tag{21}$$
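A minimal sketch of evaluating the Gaussian likelihood of Eq. (20) for an f-dimensional feature difference; the covariance matrix and feature vectors in the usage comment are illustrative assumptions.

```python
import numpy as np

def gaussian_likelihood(y_t, x_t, k):
    """Evaluate the multivariate Gaussian of Eq. (20) for diff = y_t - x_t."""
    diff = np.asarray(y_t, dtype=float) - np.asarray(x_t, dtype=float)
    f = diff.shape[0]
    k = np.asarray(k, dtype=float)
    norm = np.sqrt(((2 * np.pi) ** f) * np.linalg.det(k))   # normalising constant
    expo = -0.5 * diff @ np.linalg.inv(k) @ diff            # quadratic form
    return np.exp(expo) / norm

# Example with 3 features and an identity covariance matrix:
# print(gaussian_likelihood([1.0, 0.5, 0.2], [0.9, 0.4, 0.1], np.eye(3)))
```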
In the space-homogeneous case, $u_{t,\Delta}(x, y) = u_{t,\Delta}(x - y)$. The primary model's spatio-temporal cumulant can be expressed in terms of the auxiliary model's spatial cumulants as Eq. (23):

$$u_{t,\Delta t}(x) = u_{\Delta t}^{\infty}(x) + u_{\Delta t}^{O+}(x) + u_{\Delta t}^{-O}(x) + u_{\Delta t}^{-+}(x) \tag{23}$$

By applying this to the auxiliary model, we may obtain the perturbation expansion for the spatial correlation functions:

$$g_{t,\Delta t}(x) = g_{\Delta t}^{\infty}(x) + g_{\Delta t}^{O+}(x) + g_{\Delta t}^{-o}(x) + g_{\Delta t}^{-+}(x), \qquad \frac{d}{d\Delta t}\, h_{t,\Delta t} = H_{h,\Delta t}\big(q_{t+t}, h_{t,st}\big)$$

$$V_0(y) = \frac{1}{v_1}\big[r\Gamma A + v_1 Y\big] + \frac{v_2}{\beta\alpha\varphi_1}\big[\alpha\varphi_1 Q + v_3 U\big] + \frac{v_2 v_3 v_4}{a\beta\alpha\varphi_1}\, W \tag{25}$$

It is obvious that $V_0(0) = 0$ and $V_0(y) > 0$ for all $y > 0$. Moreover, from Eq. (26),

$$\dot{V}_0(y) = \frac{1}{v_1}\big[r\Gamma \dot{A} + v_1 \dot{Y}\big] + \frac{v_2}{\beta\alpha_1}\big[\alpha\varphi_1 \dot{Q} + v_3 \dot{U}\big] + \frac{v_2 v_1 v_4}{\alpha\beta\alpha_1}\,\dot{W}$$

$$= \frac{1}{v_1}\big[r\Gamma B(W)W - v_1 v_2 Y - r\Gamma\mu_2 A^2\big] + \frac{v_2}{\beta\alpha\varphi_1}\big[\alpha\varphi_1\beta Y + b_1\alpha\varphi_1 W - v_3 v_4 U\big] + \frac{v_2 v_1 v_4}{a\beta a\varphi_1}\big[aU - v_5 W\big]$$

$$= \left[\frac{\Gamma_1}{v_1}B(W) + \frac{v_1 b_1 v_2}{\beta} - \frac{v_2 v_2 v_5}{\alpha\beta\alpha_1}\right] W - \frac{r\mu_2}{v_1} A^2$$
$$Y^{*} = \frac{r\Gamma}{v_2}A^{*}, \quad M^{*} = \frac{(1-r)\Gamma}{\mu_M}A^{*}, \quad Q^{*} = \frac{v_1 v_4 v_5 R_0^{ode}}{a N_{egg}\,\alpha\varphi_1}A^{*}, \quad U^{*} = \frac{v_1 v_5 R_0^{ode}}{a N_{egg}}A^{*}, \quad W^{*} = \frac{v_1 R_0^{ode}}{N_{egg}}A^{*} \tag{26}$$

$$q^{*} = \frac{A^{+} - m}{A^{-}}, \qquad g^{*}(\xi) = q^{*}\,\frac{a^{+}(\xi) - q^{*} a^{-}(\xi)}{A^{+} - a^{+}(\xi) + q^{*} a^{-}(\xi)} \tag{27}$$
Using SVD, eigen-images and associated eigen-time courses can split the video sequence frames into two factors by Eq. (28):

$$X = U D V^{T} = \big(U D^{1/2}\big)\big(V D^{1/2}\big)^{T} = \tilde{U}\tilde{V}^{T} \tag{28}$$
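The SVD factorisation of Eq. (28) can be sketched in NumPy by stacking the frames as columns of X and absorbing $D^{1/2}$ into each factor; the frame shapes below are assumptions for illustration.

```python
import numpy as np

def eigen_images(frames):
    """frames: array of shape (num_frames, height, width)."""
    n, h, w = frames.shape
    X = frames.reshape(n, h * w).T               # pixels x frames data matrix
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U_tilde = U * np.sqrt(d)                     # eigen-images, one per column
    V_tilde = Vt.T * np.sqrt(d)                  # eigen-time courses
    return U_tilde, V_tilde

# X is recovered as U_tilde @ V_tilde.T, so each frame is a weighted
# combination of the eigen-images with weights given by its time course.
```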
By applying this to our discrete-time Markov chain, we can convert it to the continuous-time differential Eq. (31):

$$P_i^{g} = \sum_{h=1}^{N_G}\sum_{j=1}^{N} \frac{n_j^{h}\big[(1-p^{h})\delta_{ij} + p^{h} R_{ji}^{h}\big]}{(n_i^{h})^{\text{eff}}}\; z^{g} k^{g} f_i\, C^{gh}\big(b^{A}\rho_j^{A,h} + b^{I}\rho_j^{I,h}\big) + O(\epsilon^{2}) \tag{31}$$
$$b^{m} = \ln\big[(1-\beta^{m})^{-1}\big], \quad m \in \{A, I\}, \qquad f_i = f\!\left(\frac{n_i^{\text{eff}}}{s_i}\right) \tag{32}$$

We then insert the above expression into $\Pi_i^{g}$, leading to Eq. (33):

$$\Pi_i^{g} = \sum_{h=1}^{N_G}\sum_{j=1}^{N}\Big[(M_1)_{ij}^{gh} + (M_2)_{ij}^{gh} + (M_3)_{ij}^{gh} + (M_4)_{ij}^{gh}\Big]\big(b^{A}\rho_j^{A,h} + b^{I}\rho_j^{I,h}\big) + O(\epsilon^{2}),$$

$$(M_1)_{ij}^{gh} = \delta_{ij}\,(1-p^{g})\, z^{g} k^{g} f_i\, C^{gh}\,\frac{(1-p^{h})\,n_j^{h}}{(n_i^{h})^{\text{eff}}}$$

$$(M_2)_{ij}^{gh} = (1-p^{g})\, z^{g} k^{g} f_i\, C^{gh}\,\frac{R_{ji}^{h}\, p^{h}\, n_j^{h}}{(n_i^{h})^{\text{eff}}}$$

$$(M_3)_{ij}^{gh} = p^{g} R_{ij}^{g}\, z^{g} k^{g} f_j\, C^{gh}\,\frac{(1-p^{h})\,n_j^{h}}{(n_j^{h})^{\text{eff}}}$$

$$(M_4)_{ij}^{gh} = \sum_{k=1}^{N} p^{g} R_{ik}^{g}\, z^{g} k^{g} f_k\, C^{gh}\,\frac{R_{jk}^{h}\, p^{h}\, n_j^{h}}{(n_k^{h})^{\text{eff}}} \tag{33}$$
Using the above definitions, the associated differential equations assume the form of Eq. (34):

$$\dot{\rho}_i^{E,g} = -\eta^{g}\rho_i^{E,g} + \sum_{h=1}^{N_G}\sum_{j=1}^{N} M_{ij}^{gh}\big(b^{A}\rho_j^{A,h} + b^{I}\rho_j^{I,h}\big)$$

$$\dot{\rho}_i^{A,g} = \eta^{g}\rho_i^{E,g} - \alpha^{g}\rho_i^{A,g}$$

$$\dot{\rho}_i^{I,g} = \alpha^{g}\rho_i^{A,g} - \mu^{g}\rho_i^{I,g} \tag{34}$$

where the tensor M is given by $M = \sum_{\ell=1}^{4} M_{\ell}$. Defining the vector $(\rho^{g})^{T} = (\rho^{E,g}, \rho^{A,g}, \rho^{I,g})$, the above system of differential equations can be rewritten as Eq. (35):

$$\dot{\rho}^{g} = \sum_{h=1}^{N_g}\big(F^{gh} - V^{gh}\big)\,\rho^{h} \tag{35}$$
Pixels corresponding to the $FM_t$ component of the original frame $y_t$ create the VOP, while regions composing the foreground part of the temporal segmentation are recognised as moving object regions.

Time steps are denoted by t = 1, 2, 3, etc. $S_t \in \mathcal{S}$ denotes the Markov state at time t, where $\mathcal{S}$ is the state space (a countable set). The action at time t is denoted by $A_t \in \mathcal{A}$, where $\mathcal{A}$ refers to the action space (a countable set). $R_t \in \mathcal{D}$ is the reward at time t, where $\mathcal{D}$ is a countable subset of $\mathbb{R}$ (representing the numerical feedback served by the environment, along with the state, at each time step t), by Eq. (37):

$$p(s', r \mid s, a) = P\big[(S_{t+1} = s', R_{t+1} = r) \mid S_t = s, A_t = a\big] \tag{37}$$
$\gamma \in [0, 1]$ is known as the discount factor, utilized to discount rewards when accumulating them, as in Eq. (38). In RL, we usually deal with a time-homogeneous Markov chain, where the transition probability is independent of t, by Eqs. (40)–(42):
$$P\big[R_{t+1} = r' \mid R_t = r\big] = P\big[R_t = r' \mid R_{t-1} = r\big] \tag{40}$$

$$P = p_1 p_2 \ldots p_n, \text{ where } p_i \in \{H, P\},\ \forall\, 1 \le i \le n$$

$$\mathcal{B} = \{P = p_1 p_2 \ldots p_n \mid p_i \in \{H, P\},\ \forall\, 1 \le i \le n,\ n \in \mathbb{N}\}$$

$$\mathcal{G} = \{G = (x_i, y_i) \mid x_i, y_i \in \Re,\ 1 \le i \le n\}$$

$$C : \mathcal{B} \to \mathcal{G}$$
The Bellman equation for Q-learning is defined with Q(s, a) denoting the value of taking action a in state s, r(s, a) denoting the reward obtained in state s after performing action a, and s′ denoting the environment state attained by the agent after performing action a in state s: $Q(s, a) = r(s, a) + \gamma\cdot\max_{a'} Q(s', a')$. The update rule is given by Eq. (45):

$$Q(s, a) = (1 - \alpha)\cdot Q(s, a) + \alpha\cdot\big(r(s, a) + \gamma\cdot\max_{a'} Q(s', a')\big) \tag{45}$$
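A minimal tabular Q-learning sketch following the update rule of Eq. (45); the environment interface (reset/step returning next state, reward and a done flag) and the epsilon-greedy exploration are placeholder assumptions, not the DRNN used in this work.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an assumed env.reset()/env.step(a) interface."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Eq. (45): Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
            s = s_next
    return Q
```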
The state space $\mathcal{S}$ will contain $\frac{4^{n}-1}{3}$ states, i.e. $\mathcal{S} = \{s_1, s_2, \ldots, s_{\frac{4^{n}-1}{3}}\}$. A state $s_{i_k} \in \mathcal{S}$ ($i_k \in [1, \frac{4^{n}-1}{3}]$) reached by the agent at a given moment, after it has visited states $s_1, s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}$, is a terminal state if the length of the current sequence is n − 1, i.e. k = n − 2, by Eq. (46):

$$\delta(s_j, a_k) = s_{4j-3+k}, \quad \forall k \in [1, 4],\ \forall j,\ 1 \le j \le \frac{4^{n-1}-1}{3} \tag{46}$$
The reward earned immediately after completing action a from state s, plus the value of following the optimum policy thereafter, equals the value of Q*, as given by Eq. (47):

$$Q^{*}(s, a) = r(s, a) + \gamma\cdot\max_{a'} Q^{*}(\delta(s, a), a') \tag{47}$$

Let $Q_n(s, a)$ be the agent's evaluation of $Q^{*}(s, a)$ at the n-th training episode. We demonstrate that $\lim_{n\to\infty} Q_n(s, a) = Q^{*}(s, a)$, $\forall s \in \mathcal{S},\ a \in \delta(s, a)$, given by Eqs. (48) and (49):

$$0 \le r(s, a) \le \frac{(n-1)\cdot(n-2)}{2}, \quad \forall s \in \mathcal{S},\ a \in \delta(s, a) \tag{48}$$

$$0 \ge E \ge \sum_{i=1}^{n-2}\sum_{j=i+2}^{n}(-1) = -\frac{(n-1)\cdot(n-2)}{2} \tag{49}$$
During the training process, the estimates Q(s, a) for each state-action pair ($\forall s \in \mathcal{S},\ a \in \delta(s, a)$) increase, i.e., from Eq. (50):

$$Q_{n+1}(s, a) \ge Q_n(s, a), \quad \forall n \in \mathbb{N}^{*} \tag{50}$$

To prove that the inequalities of Eqs. (51) and (52) hold for n + 1 also, i.e.

$$Q_{n+1}(s, a) \ge Q_n(s, a) \tag{51}$$

$$Q_{n+1}(s, a) - Q_n(s, a) = \gamma\cdot\big(\max_{a'} Q_n(s', a') - \max_{a'} Q_{n-1}(s', a')\big)$$

$$Q_{n+1}(s, a) - Q_n(s, a) \ge \gamma\cdot\big(\max_{a'} Q_{n-1}(s', a') - \ldots$$
Because all rewards are positive, $Q^{*}(s, a) \ge 0$. Since $Q_0(s, a) = 0$ and $Q^{*}(s, a) \ge 0$, we obtain that $Q_0(s, a) \le Q^{*}(s, a)$, as given by Eq. (53):

$$Q_n(s, a) \le Q^{*}(s, a), \qquad Q_{n+1}(s, a) - Q^{*}(s, a) \ge \gamma\cdot\big(\max_{a'} Q^{*}(s', a') - \max_{a'} Q^{*}(s', a')\big) = 0 \tag{53}$$

The expected return starting from state s, taking action a, and then following policy $\pi$ is the action-value function $q_\pi(s, a)$, given by Eq. (54):

$$q_\pi(s, a) = E_\pi\big[G_t \mid S_t = s, A_t = a\big] = E_\pi\big[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big] \tag{54}$$
To simplify notation, we define $R_s^{a} = E[R_{t+1} \mid S_t = s, A_t = a]$. The relationship between $v_\pi(s)$ and $q_\pi(s, a)$ is shown by Eqs. (55) and (56):

$$v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a\mid s)\, q_\pi(s, a) \tag{55}$$

$$q_\pi(s, a) = R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\, v_\pi(s') \tag{56}$$

Expressing $q_\pi(s, a)$ in terms of $v_\pi(s)$ in the expression of $v_\pi(s)$, the Bellman equation for $v_\pi$ is given by Eq. (57):

$$v_\pi(s) = \sum_{a\in\mathcal{A}}\pi(a\mid s)\left(R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\, v_\pi(s')\right) \tag{57}$$

The Bellman equation relates the state-value function of one state to that of other states. The Bellman equation for $q_\pi(s, a)$ is shown in Eq. (58):

$$q_\pi(s, a) = R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\sum_{a'\in\mathcal{A}}\pi(a'\mid s')\, q_\pi(s', a') \tag{58}$$
By maximising $q^{*}(s, a)$ over all actions, we can discover an optimal policy immediately using the theorem in Eq. (59):

$$v^{*}(s) = \max_\pi v_\pi(s), \qquad \pi' \ge \pi \iff v_{\pi'}(s) \ge v_\pi(s),\ \forall s$$

$$\pi^{*}(a\mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a\in\mathcal{A}} q^{*}(s, a) \\ 0 & \text{otherwise} \end{cases} \tag{59}$$
The remaining issue is determining how to obtain the optimal value function. The Bellman optimality equation is used to solve this question. The optimal state-value and action-value functions are linked as shown by Eq. (60):

$$v^{*}(s) = \max_{a} q^{*}(s, a), \qquad q^{*}(s, a) = R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\, v^{*}(s') \tag{60}$$

A Bellman optimality equation for $v^{*}$ and $q^{*}$ is generated by expressing $q^{*}(s, a)$ in terms of $v^{*}(s)$ in the expression of $v^{*}(s)$, by Eqs. (61) and (62):

$$v^{*}(s) = \max_{a}\left(R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\, v^{*}(s')\right) \tag{61}$$

$$q^{*}(s, a) = R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\,\max_{a'} q^{*}(s', a')$$

$$v_{k+1}(s) = \sum_{a\in\mathcal{A}}\pi(a\mid s)\left(R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\, v_k(s')\right) \tag{62}$$
Using synchronous updates, for each state s, let Eqs. (63) and (64):
$$\pi'(s) := \arg\max_{a\in\mathcal{A}} q_\pi(s, a) = \arg\max_{a\in\mathcal{A}}\left(R_s^{a} + \gamma\sum_{s'\in\mathcal{S}} P_{ss'}^{a}\, v_\pi(s')\right) \tag{63}$$

$$q_\pi(s, \pi'(s)) = \max_{a\in\mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s), \qquad v_{\pi'}(s) \ge v_\pi(s) \tag{64}$$

The iteration also improves the value function, $v_{\pi'}(s) \ge v_\pi(s)$, because, by Eqs. (65) and (66),

$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) = E_{\pi'}\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\big]\\
&\le E_{\pi'}\big[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\big]\\
&\le E_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^{2} q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s\big]
\end{aligned} \tag{65}$$

$$q_\pi(s, \pi'(s)) = \max_{a\in\mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s) \tag{66}$$

The ∞-norm, i.e. the largest difference between state values, is utilized to evaluate the distance between two state-value functions u and v, given by Eq. (67):

$$\| u - v\|_\infty = \max_{s\in\mathcal{S}}\big|u(s) - v(s)\big| \tag{67}$$
where $v_\pi$ is a column vector with one entry per state, from Eq. (69):

$$\begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} = \begin{bmatrix} R_1^{\pi} \\ \vdots \\ R_n^{\pi} \end{bmatrix} + \gamma \begin{bmatrix} P_{11}^{\pi} & \cdots & P_{1n}^{\pi} \\ \vdots & \ddots & \vdots \\ P_{n1}^{\pi} & \cdots & P_{nn}^{\pi} \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} \tag{69}$$
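Because Eq. (69) is linear in $v_\pi$, policy evaluation can be sketched as a direct linear solve, $(I - \gamma P^{\pi})v_\pi = R^{\pi}$; the reward vector and transition matrix in the usage comment are made-up examples.

```python
import numpy as np

def evaluate_policy(R_pi, P_pi, gamma=0.9):
    """Solve (I - gamma * P_pi) v_pi = R_pi for the state values of Eq. (69)."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# R_pi = np.array([1.0, 0.0, -1.0])
# P_pi = np.array([[0.5, 0.5, 0.0],
#                  [0.1, 0.8, 0.1],
#                  [0.0, 0.3, 0.7]])
# print(evaluate_policy(R_pi, P_pi))
```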
Define the Bellman equation backup operator by Eqs. (70) and (71):

$$T^{\pi}(v) = R^{\pi} + \gamma P^{\pi} v$$

$$\| T^{\pi}(u) - T^{\pi}(v)\|_\infty = \|(R^{\pi} + \gamma P^{\pi} u) - (R^{\pi} + \gamma P^{\pi} v)\|_\infty = \|\gamma P^{\pi}(u - v)\|_\infty \le \gamma\,\| u - v\|_\infty \tag{70}$$

$$v^{*} = \max_{a\in\mathcal{A}}\big(R^{a} + \gamma P^{a} v^{*}\big), \qquad T^{*}(v) = \max_{a\in\mathcal{A}}\big(R^{a} + \gamma P^{a} v\big) \tag{71}$$
Value iteration will converge because this operator is likewise a γ-contraction map.
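A value-iteration sketch applying the Bellman optimality backup $T^{*}$ of Eq. (71) until the ∞-norm change falls below a tolerance; the per-action reward and transition arrays are illustrative assumptions.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-6):
    """R: (n_actions, n_states) rewards, P: (n_actions, n_states, n_states) transitions."""
    n_states = R.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v          # one Bellman backup per action, shape (n_actions, n_states)
        v_new = q.max(axis=0)          # T*(v): maximise over actions
        if np.max(np.abs(v_new - v)) < tol:   # stop when the infinity-norm change is small
            return v_new
        v = v_new
```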
4. Performance analysis
We go over the experimental methodology and outcomes in detail in this section. All of the model training and implementation tests were carried out on a setup comprising an Intel Core-i7 8700K processor running at 3.70 GHz and an Nvidia GeForce GTX 1080 Ti (11 GB memory) GPU. The necessary Python code is built, and the neural network models are constructed using the Keras API with Tensorflow-GPU as the backend.
Table 1
Processing of real-time violence detection in crime scenes for various datasets. For each dataset (Crowd Violence Dataset, UCSD dataset, Violence flow dataset), the table shows the real-time crime scene video, the processed video frames, the spatio-temporal feature extraction of the video frames, and the classified video frames using DRNN (the cell contents are image frames in the original).
Table 2
Comparative analysis of real-time violence detection in crime scenes for various datasets, reporting Accuracy, Precision, Recall and F1-score for each technique.
UCSD dataset: Abnormal events arise from the circulation of non-pedestrian entities on sidewalks or from abnormal pedestrian motion patterns. Bikers, skaters, small carts, and individuals walking on a sidewalk or in the grass that surrounds it are all common occurrences, and there were a few cases of people in wheelchairs as well. The anomalies were not staged for the purposes of constructing the dataset; they occurred naturally. The footage was separated into two groups, each representing a different setting, and the video for each scene was broken into multiple portions of about 200 frames apiece.
Violence flow dataset: A database containing real-world video footage of mob violence, as well as common benchmark standards
for determining violent/non-violent classification and detecting violence outbreaks. There are 246 videos in the data collection.
Table 1 above shows real-time violence detection in crime scenes for the various datasets: the processed video frames, the frames after the proposed feature extraction, and the classified video frames.
4.1.2.1. Accuracy. It is defined as the ratio of correctly predicted values to the total number of predictions, as in Eq. (72):

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{72}$$

4.1.2.2. Recall. It is defined as the ratio of correctly predicted positive values to all actual positive values, as in Eq. (73):

$$Recall = \frac{TP}{TP + FN} \tag{73}$$

4.1.2.3. Precision. It provides the ratio of true positive values to the total predicted positive values, as stated in Eq. (74):

$$Precision = \frac{TP}{TP + FP} \tag{74}$$

4.1.2.4. F1-Score. It provides the harmonic mean of precision and recall, as stated in Eq. (75):

$$F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{75}$$
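The metrics of Eqs. (72)–(75) can be computed directly from confusion-matrix counts, as in the short sketch below; the counts in the usage comment are placeholders, not the reported results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# print(classification_metrics(tp=80, tn=90, fp=5, fn=10))
```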
Table 2 above shows the comparative analysis for the proposed real-time crime scene datasets. The analysis has been carried out for the various datasets in terms of accuracy, precision, recall and F-1 score, comparing the proposed approach with 3D CRNN and MIL-based techniques.
Fig. 2. Comparative analysis for Crowd Violence Dataset in terms of (a) accuracy, (b) precision, (c) recall and (d) F-1 score.
Fig. 3. Comparative analysis for UCSD dataset in terms of (a) accuracy, (b) precision, (c) recall and (d) F-1 score.
Figs. 2, 3 and 4 above show the comparative analysis for the crowd violence dataset, UCSD dataset and violence flow dataset in terms of (a) accuracy, (b) precision, (c) recall and (d) F-1 score. From this comparative analysis, the proposed ST_DRNN technique obtains accuracy of 98%, precision of 96%, recall of 80% and F-1 score of 78% for the crowd violence dataset; for the UCSD dataset, accuracy is 97%, precision is 95%, recall is 79% and F-1 score is 76%; the violence flow dataset obtained accuracy of 95%, precision of 96%, recall of 80% and F-1 score of 77%. Hence, the proposed technique obtained optimal results in detecting violence.
Fig. 4. Comparative analysis for Violence flow dataset in terms of (a) accuracy, (b) precision, (c) recall and (d) F-1 score.
5. Conclusion
In this research, the proposed framework provides a novel technique for crime scene violence detection. The real-time crime scene dataset has been collected and converted into video frames, from which abnormal activities are detected. The features of video-based motion are recovered by backward, forward and bidirectional predictions over the converted video frames, based on spatio-temporal analysis. The prediction errors are thresholded and compiled into a single image that depicts the sequence's motion. The collected features are then categorised with the help of a Deep Reinforcement Neural Network (DRNN). For various real-time video surveillance datasets, the experimental results report accuracy, precision, recall, and F-1 score. A limitation of the proposed video surveillance approach is its reliance on the intrinsic location of anomalies and on spatiotemporal data for detecting abnormalities: the model is combined with a tube extraction module that narrows the scope of the investigation to a specified set of spatiotemporal coordinates.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Funding
No Funding.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Code availability
Data availability
References
[1] Mabrouk AB, Zagrouba E. Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 2018;91:480–91.
[2] Kardas K, Cicekli NK. SVAS: surveillance video analysis system. Expert Syst Appl 2017;89:343–61.
[3] Wang Y, Shuai Y, Zhu Y, Zhang J, An P. Jointly learning perceptually heterogeneous features for blind 3D video quality assessment. Neurocomputing 2019;332:
298–304 (ISSN 0925-2312).
[4] Tzelepis C, Galanopoulos D, Mezaris V, Patras I. Learning to detect video events from zero or very few video examples. Image Vis Comput 2016;53:35–44 (ISSN
0262-8856).
[5] Fakhar B, Kanan HR, Behrad A. Learning an event-oriented and discriminative dictionary based on an adaptive label-consistent K-SVD method for event
detection in soccer videos. J Vis Commun Image Represent 2018;55:489–503 (ISSN 1047-3203).
[6] Luo X, Li H, Cao D, Yu Y, Yang X, Huang T. Towards efficient and objective work sampling: recognizing workers' activities in site surveillance videos with two-stream convolutional networks. Autom Constr 2018;94:360–70 (ISSN 0926-5805).
[7] Shao L, Cai Z, Liu L, Lu K. Performance evaluation of deep feature learning for RGB-D image/video classification. Inf Sci 2017;385:266–83 (ISSN 0020-0255).
[8] Wang D, Tang J, Zhu W, Li H, Xin J, He D. Dairy goat detection based on Faster R-CNN from surveillance video. Comput Electron Agric 2018;154:443–9 (ISSN
0168-1699).
[9] Ahmed SA, Dogra DP, Kar S, Roy PP. Surveillance scene representation and trajectory abnormality detection using aggregation of multiple concepts. Expert Syst
Appl 2018;101:43–55 (ISSN 0957-4174).
[10] Arunnehru J, Chamundeeswari G, Prasanna Bharathi S. Human action recognition using 3D convolutional neural networks with 3D motion cuboids in
surveillance videos. Procedia Comput Sci 2018;133:471–7.
[11] Karri JB. Classification of crime scene images using the computer vision and deep learning techniques. Int J Mod Trends Sci Technol 2022;8(02):01–5.
[12] Ovaskainen O, Somervuo P, Finkelshtein D. A general mathematical method for predicting spatio-temporal correlations emerging from agent-based models. J R
Soc Interface 2020;17(171):20200655.
[13] Zhang XP, Chen Z. An automated video object extraction system based on spatiotemporal independent component analysis and multiscale segmentation.
EURASIP J Adv Signal Process 2006;2006:1–22.
[14] Arenas A, Cota W, Gómez-Gardenes J, Gómez S, Granell C, Matamalas JT, et al. A mathematical model for the spatiotemporal epidemic spreading of COVID19.
MedRxiv 2020.
[15] Mann Manyombe ML, Mbang J, Tsanou B, Bowong S, Lubuma J. Mathematical analysis of a spatio-temporal model for the population ecology of anopheles
mosquito. Math Methods Appl Sci 2020;43(6):3524–55.
[16] Mudgal M, Punj D, Pillai A. Suspicious action detection in intelligent surveillance system using action attribute modelling. J Web Eng 2021:129–46.
[17] Hidayat F. Intelligent video analytic for suspicious object detection: a systematic review. In: Proceedings of the international conference on ICT for smart society (ICISS). IEEE; 2020. p. 1–8.
[18] Vosta S, Yow KC. A cnn-rnn combined structure for real-world violence detection in surveillance cameras. Appl Sci 2022;12(3):1021.
[19] Ullah W, Ullah A, Haq IU, Muhammad K, Sajjad M, Baik SW. CNN features with bi-directional LSTM for real-time anomaly detection in surveillance networks.
Multimed Tools Appl 2021;80(11):16979–95.
[20] Mathur R, Chintala T, Rajeswari D. Detecting criminal activities and promoting safety using deep learning. In: Proceedings of the international conference on advances in computing, communication and applied informatics (ACCAI). IEEE; 2022. p. 1–8.
[21] Saad K, El-Ghandour M, Raafat A, Ahmed R, Amer E. A Markov model-based approach for predicting violence scenes from movies. In: Proceedings of the 2nd international mobile, intelligent, and ubiquitous computing conference (MIUCC). IEEE; 2022. p. 21–6.
[22] Feng JC, Hong FT, Zheng WS. Mist: Multiple instance self-training framework for video anomaly detection. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. IEEE; 2021. p. 14009–18.
[23] Wu P, Liu J, Shi Y, Sun Y, Shao F, Wu Z, et al. Not only Look, but also Listen: learning multimodal violence detection under weak supervision. In: Proceedings of
the European conference on computer vision. Springer; 2020. p. 322–39.
[24] Zhong JX, Li N, Kong W, Liu S, Li TH, Li G. Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection. In: Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition. IEEE; 2019. p. 1237–46.
[25] Tian Y., Pang G., Chen Y., Singh R., Verjans J.W., Carneiro G., Weakly-supervised video anomaly detection with robust temporal feature magnitude learning.
arXiv 2021, arXiv:2101.10030.
[26] Dubey S., Boragule A., Jeon, M., 3D ResNet with ranking loss function for abnormal activity detection in videos. Proceedings of the 2019 international
conference on control, automation and information sciences (ICCAIS), Chengdu, China, 24–27 2019; 1–6.
[27] Ji H., Zeng X., Li H., Ding W., Nie X., Zhang Y., Xiao Z., Human abnormal behavior detection method based on T-TINY-YOLO. Proceedings of the 5th
international conference on multimedia and image processing, Nanjing, China, 10–12 2020. 1–5.
[28] Hu X, Dai J, Huang Y, Yang H, Zhang L, Chen W, et al. A weakly supervised framework for abnormal behavior detection and localization in crowded scenes.
Neurocomputing 2020;383:270–81.
[29] Mojarad R, Attal F, Chibani A, Amirat Y. A hybrid context-aware framework to detect abnormal human daily living behavior 2020;19–24:1–8.
Kishan Bhushan Sahay holds an M.Tech in Power Systems. His areas of interest are power system restructuring, optimization techniques in power systems, and the application of artificial intelligence in power systems and other fields.
Dr. Bhuvaneswari Balachander received her Ph.D. in Medical Image Processing in the year 2020. Since 2012, she has been working as an Associate Professor at Saveetha School of Engineering, Chennai. She has around 11 years of teaching experience. Her research areas include Medical Image Processing, Image Fusion, Microwave Antenna Design and Resonator Design. She has received a number of awards, including the Best Women Faculty award, Best Researcher award and Best Paper award.
Dr. B. Jagadeesh is working as an Associate Professor in the Department of Electronics and Communication Engineering at Gayatri Vidya Parishad College of Engineering (Autonomous). He received his B.E. degree in Electronics and Communication Engineering with distinction from G.I.T.A.M., his M.E. degree from Andhra University, Visakhapatnam, and his Ph.D. from JNTUA, Anantapuramu. He has been in the teaching profession for more than 21 years.
Dr. G. Anand Kumar is working as an Assistant Professor in the Department of Electronics and Communication Engineering at Gayatri Vidya Parishad College of Engineering (Autonomous). He received his B.Tech degree in Electronics and Communication Engineering with distinction, his M.Tech degree in Digital Electronics and Communication Systems with distinction from J.N.T.U.K. Kakinada, and his Ph.D. from Andhra University. He has been in the teaching profession for more than 17 years.
Dr. Ravi Kumar received his Ph.D. from the Department of Electronics and Communication Engineering, Jaypee University of Engineering & Technology, Guna, in 2013, with a specialization in MIMO Communication Systems and Smart Antennas. He has been serving in the field of teaching at the Jaypee Institute of Engineering since August 2005. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE).
Dr. L. Rama Parvathy is working as a Professor at Saveetha School of Engineering, Chennai. She has 22 years of academic and teaching experience, including 8 years of research. Her research interests are Cloud Computing, Evolutionary Computing, Multi-Objective Optimization, and Data Analytics. She has published in many journals indexed in Scopus and Web of Science.