Data in Brief 33 (2020) 106587
Data Article
A dataset for automatic violence detection in
videos
Miriana Bianculli a, Nicola Falcionelli b, Paolo Sernani b,∗,
Selene Tomassini b, Paolo Contardo b,c, Mara Lombardi a,
Aldo Franco Dragoni b
a Università degli Studi di Roma La Sapienza, Piazzale Aldo Moro 5, Roma 00185, Italy
b Dipartimento di Ingegneria dell’Informazione, Università Politecnica delle Marche, Via Brecce Bianche 12, Ancona 60131, Italy
c Gabinetto Interregionale di Polizia Scientifica per le Marche e l’Abruzzo, Via Gervasoni 19, Ancona 60129, Italy
Article history:
Received 7 October 2020
Accepted 23 November 2020
Available online 26 November 2020

Keywords: Violence detection, Crime detection, Computer vision, Deep learning

Abstract

The automatic detection of violence and crimes in videos is gaining attention, specifically as a tool to unburden security officers and authorities from the need to watch hours of footage to identify events lasting a few seconds. So far, most of the available datasets were composed of few clips, in low resolution, often built on too specific cases (e.g. hockey fights). While high-resolution datasets are emerging, there is still the need for datasets to test the robustness of violence detection techniques to false positives, due to behaviours which might resemble violent actions. To this end, we propose a dataset composed of 350 clips (MP4 video files, 1920 × 1080 pixels, 30 fps), labelled as non-violent (120 clips) when representing non-violent behaviours, and violent (230 clips) when representing violent behaviours. In particular, the non-violent clips include behaviours (hugs, claps, exulting, etc.) that can cause false positives in the violence detection task, due to fast movements and their similarity with violent behaviours. The clips were performed by non-professional actors, varying from 2 to 4 per clip.
∗ Corresponding author.
E-mail addresses: [email protected] (M. Bianculli), [email protected] (N. Falcionelli), [email protected] (P. Sernani), [email protected] (S. Tomassini), [email protected] (P. Contardo), [email protected] (M. Lombardi), [email protected] (A.F. Dragoni).
https://doi.org/10.1016/j.dib.2020.106587
2352-3409/© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/)
Specifications Table
Subject Computer Vision and Pattern Recognition
Specific subject area Violence Detection in Videos
Type of data Video (mp4), Text files (csv)
How data were acquired The clips were recorded with two cameras placed in two different
spots, building a dataset with videos from two different points of
view. The cameras are:
• The front camera of the Asus Zenfone Selfie ZD551KL (13 MP,
Auto Focus, f/2.2).
• The TOPOP Action Cam OD009B (12 MP, fisheye lens 170°).
Data format Raw
Parameters for data collection The clips in the dataset are in MP4 format, H.264 codec, with a
resolution of 1920 × 1080 pixels, and 30 fps. The average length of
the clips is 5.63 s (minimum 2 s, maximum 14 s). 230 clips out of
the 350 included in the dataset are labelled as “violent”, whilst the
remaining 120 clips are labelled as “non-violent”.
Description of data collection The dataset includes 350 clips, split into two main directories,
“violent” and “non-violent”. Such directories are split into
subdirectories, “cam1” and “cam2”:
• “violent/cam1” includes 115 clips representing violent
behaviours;
• “violent/cam2” includes 115 clips with the same violent
behaviours in “violent/cam1”, but recorded with a different
camera and from a different point of view;
• “non-violent/cam1” includes 60 clips representing non-violent
behaviours;
• “non-violent/cam2” includes 60 clips with the same
non-violent behaviours in “non-violent/cam1” but recorded
with a different camera and from a different point of view.
The clips were performed by a group of non-professional actors
(varying from 2 to 4 per clip) simulating violent actions and
non-violent actions.
Data source location Dipartimento di Ingegneria dell’Informazione, Università Politecnica
delle Marche, Ancona, Italy.
Data accessibility Public repository: GitHub (https://github.com)
Repository name: A Dataset for Automatic Violence Detection in
Videos
Direct URL to data: https://github.com/airtlab/A-Dataset-for-Automatic-Violence-Detection-in-Videos
Value of the Data
• As the interest towards automatic detection of violence and crimes in video is increasing, the
clips in the presented dataset are intended to train and benchmark techniques for automatic
violence detection in videos.
• In the short and mid-term, researchers can use the Full HD clips as an additional open
dataset to train and test their algorithms. In the long-term, law enforcement authorities and
the entire community might benefit from fine-tuned algorithms, capable of reducing the decision time in violence and crime detection.
• A specific goal of the dataset is to verify the robustness of violence detection techniques to false positives. Thus, experiments involving the assessment of the classification accuracy of algorithms can consider this specific feature in the evaluation phase.
1. Data Description
The pervasiveness of video surveillance cameras and the need to watch footage and make decisions in a very short time [1] boosted the interest of researchers in techniques for the automatic detection of violence and crimes in videos. In fact, both techniques based on handcrafted features [2,3] and deep learning [4,5] demonstrated their accuracy for automatic violence detection on open datasets such as the Hockey Fight Dataset [6], the Movie Fight Dataset [6], and the Crowd Violence Dataset [7]. However, such datasets include few low-resolution videos, sometimes in too specific environments (e.g. hockey arenas). These issues have been addressed by the RWF-2000 [8], a dataset including 2000 clips from real video surveillance cameras. Nevertheless, in terms of accuracy, especially for the prevention of false positives, there is still the need to understand the effectiveness of violence detection techniques on clips showing rapid movements (hugs, claps, high-fives, etc.) which are not violent. To this end, we present a dataset for violence detection specifically designed to include, as non-violent clips, scenes which can cause false positives.
The dataset is composed of 350 clips, which are MP4 video files (H.264 codec) with an average length of 5.63 s (the shortest video lasts 2 s and the longest 14 s). For all the clips, the resolution is 1920 × 1080 pixels and the frame rate is 30 fps. The dataset is organized into directories as shown in Fig. 1.
The dataset is split into two main directories, “non-violent” and “violent”, labelling the included clips as showing non-violent behaviours and violent behaviours respectively. The directories are split into two subdirectories, “cam1” and “cam2”:
• “non-violent/cam1” includes 60 clips representing non-violent behaviours;
• “non-violent/cam2” includes 60 clips with the same non-violent behaviours in “non-violent/cam1” but recorded with a different camera and from a different point of view;
• “violent/cam1” includes 115 clips representing violent behaviours;
• “violent/cam2” includes 115 clips with the same violent behaviours in “violent/cam1” but
recorded with a different camera and from a different point of view.
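Given this layout, a clip's class label is fully determined by its top-level directory name. As a minimal sketch (assuming the repository has been cloned locally; the function names are our own, not part of the dataset), the clips can be enumerated and labelled in Python:

```python
from pathlib import Path

def label_from_path(clip_path):
    """Derive the class label ("violent" or "non-violent") from a clip's
    path inside the dataset, i.e. the name of the top-level directory."""
    for part in Path(clip_path).parts:
        if part in ("violent", "non-violent"):
            return part
    raise ValueError(f"no class directory in path: {clip_path}")

def list_clips(dataset_root):
    """Return (path, label) pairs for every MP4 clip under the root."""
    root = Path(dataset_root)
    return [(p, label_from_path(p.relative_to(root)))
            for p in sorted(root.glob("*/cam*/*.mp4"))]

# e.g. label_from_path("non-violent/cam2/7.mp4") -> "non-violent"
```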
The clips were performed by a group of non-professional actors, varying from 2 to 4 per clip.
For the violent clips (Fig. 2), the actors were asked to simulate actions frequent in brawls, such
as kicks, punches, slapping, clubbing (beating with a cane), stabbing, and gun shots. For the non-
violent clips (Fig. 3), the actors were asked to simulate actions which can cause false positives in violence detection techniques, due to the speed of movements or the similarity with violent
actions. Specifically, the non-violent clips include actions such as hugging, giving high fives and
clapping, exulting, and gesticulating.
Fig. 1. The structure of the data repository with the 350 clips of the dataset, split into non-violent (120 clips) and violent (230 clips).
Fig. 2. Example of a frame from a violent clip (camera 1).
Fig. 3. Example of a frame from a non-violent clip (camera 2).
An additional labelling is provided in three csv files available in the main data repository
directory:
- “action-class-occurrences.csv” lists all the actions recorded in the clips, with the number of times each action occurs in the dataset and a label indicating whether the action is violent (y) or not (n). All the actions recorded in the clips are listed in Table 1;
- “non-violent-action-class.csv” lists the actions included in each non-violent clip;
- “violent-action-class.csv” lists the actions included in each violent clip.
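As a sketch of how this extra labelling can be consumed, the snippet below parses an inline sample shaped like “action-class-occurrences.csv” with Python's standard csv module. The header names ("action", "occurrences", "violent") are an assumption for illustration; the actual header in the repository may differ:

```python
import csv
import io

# Inline sample mimicking "action-class-occurrences.csv";
# the column names are assumed, not taken from the repository.
SAMPLE = """action,occurrences,violent
fight,46,y
greet,33,n
hug,16,n
"""

def load_action_classes(text):
    """Map each action name to (occurrence count, is_violent flag)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["action"]: (int(row["occurrences"]), row["violent"] == "y")
            for row in reader}

actions = load_action_classes(SAMPLE)
# actions["fight"] -> (46, True); actions["greet"] -> (33, False)
```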
Table 1
The list of the recorded actions, with the number of occurrences in the dataset.
Violent actions                  Non-violent actions
Action      # of occurrences     Action         # of occurrences
fight       46                   greet          33
club        36                   hug            16
punch       23                   handgesture    15
push        22                   jump           10
kick        21                   highfive       6
slap        18                   handshake      3
stab        15                   walk           1
gunshot     14
choke       13
2. Experimental Design, Materials and Methods
As highlighted in a previous study [5], violence detection techniques can fail due to actions and behaviours which are wrongly interpreted as violent, because of fast movements and similarity with violent behaviours. To this end, the non-violent clips were recorded to specifically challenge detection techniques with potential false positives, even with datasets unbalanced towards the violent clips, as the one proposed in this paper. For the clips representing violent behaviours, in addition to kicks, punches, and slapping, a plastic toy gun, a plastic toy knife, and a wooden cane wrapped in bubble wrap sheets were used to simulate actions involving weapons, such as gun shots, stabbing, and beating.
The clips were recorded with two cameras placed in two different spots, building a dataset
with videos from two different points of view. The cameras are:
• The front camera of the Asus Zenfone Selfie ZD551KL (13 MP, Auto Focus, f/2.2).
• The TOPOP Action Cam OD009B (12 MP, fisheye lens 170°).
All the clips were recorded in the same room, with natural lighting conditions. The Asus
Zenfone was placed in the top left corner in front of the door, while the Action Cam was placed
in the top right corner on the door side. All the performed actions and behaviours were recorded
with both cameras. Therefore, all the clips with the same label and name, but in different final
directories (for example “non-violent/cam1/1.mp4” and “non-violent/cam2/1.mp4”) represent the
same action, recorded from two different perspectives (the “cam1” directory identifies the Asus
Zenfone, while the “cam2” directory identifies the Action Cam).
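Since matching file names across “cam1” and “cam2” denote the same action, the two views can be paired programmatically. A minimal sketch (the function is our own illustration; it assumes the shared-name convention described above and operates on plain name lists, e.g. obtained with os.listdir):

```python
def pair_view_names(cam1_names, cam2_names):
    """Pair clip file names present in both camera views.

    Clips sharing a name under "cam1" (Asus Zenfone) and "cam2"
    (Action Cam) show the same action from two perspectives.
    """
    shared = sorted(set(cam1_names) & set(cam2_names))
    return [("cam1/" + name, "cam2/" + name) for name in shared]

pairs = pair_view_names(["1.mp4", "2.mp4"], ["2.mp4", "1.mp4"])
# -> [("cam1/1.mp4", "cam2/1.mp4"), ("cam1/2.mp4", "cam2/2.mp4")]
```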
In addition to the main classification of the clips into violent and non-violent, we manually annotated the actions performed in each clip. This annotation can be used for further classification experiments with violence detection techniques, to train and test algorithms capable of performing action recognition.
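As a sketch of how the per-clip action annotation could feed such an action recognition pipeline, the snippet below one-hot encodes a clip's actions over the vocabulary of Table 1. The per-clip action list used in the example is illustrative, not taken from the csv files:

```python
# Action vocabulary from Table 1 (9 violent + 7 non-violent actions)
ACTIONS = ["fight", "club", "punch", "push", "kick", "slap", "stab",
           "gunshot", "choke", "greet", "hug", "handgesture", "jump",
           "highfive", "handshake", "walk"]

def encode_actions(clip_actions):
    """Binary multi-label vector: 1 if the action occurs in the clip."""
    present = set(clip_actions)
    return [1 if action in present else 0 for action in ACTIONS]

# Illustrative annotation for a hypothetical violent clip
vector = encode_actions(["punch", "kick"])
```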
Ethics Statement
All the actors involved in the clip recording read and signed an informed consent form, conserved
at the Artificial Intelligence and Real-Time System Laboratory at the “Dipartimento di Ingegneria
dell’Informazione” of “Università Politecnica delle Marche”.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal rela-
tionships which have or could be perceived to have influenced the work reported in this article.
CRediT authorship contribution statement
Miriana Bianculli: Conceptualization, Methodology, Investigation. Nicola Falcionelli: Conceptualization, Methodology, Data curation. Paolo Sernani: Writing - original draft, Writing - review
& editing. Selene Tomassini: Writing - review & editing. Paolo Contardo: Writing - review &
editing. Mara Lombardi: Supervision. Aldo Franco Dragoni: Supervision.
Acknowledgement
The presented research has been part of the Memorandum of Understanding between the
Università Politecnica delle Marche, Centro “CARMELO” and the Ministero dell’Interno, Dipartimento di Pubblica Sicurezza, Direzione Centrale Anticrimine della Polizia di Stato.
References
[1] A. Castillo, S. Tabik, F. Pérez, R. Olmos, F. Herrera, Brightness guided preprocessing for automatic cold steel weapon
detection in surveillance videos with deep learning, Neurocomputing 330 (2019) 151–161, doi:10.1016/j.neucom.2018.
10.076.
[2] A. Ben Mabrouk, E. Zagrouba, Spatio-temporal feature using optical flow based distribution for violence detection,
Pattern Recognit. Lett. 92 (2017) 62–67, doi:10.1016/j.patrec.2017.04.015.
[3] Y. Gao, H. Liu, X. Sun, C. Wang, Y. Liu, Violence detection using Oriented Violent Flows, Image Vis. Comput. 48-49
(2016) 37–41, doi:10.1016/j.imavis.2016.01.006.
[4] J.S.R. Dinesh, E. Fenil, M. Gunasekaran, G.N. Vivekananda, T. Thanjaivadivel, S. Jeeva, A. Ahilan, Real time violence
detection framework for football stadium comprising of big data analysis and deep learning through bidirectional
LSTM, Comput. Netw. 151 (2019) 191–200, doi:10.1016/j.comnet.2019.01.028.
[5] S. Accattoli, P. Sernani, N. Falcionelli, D.N. Mekuria, A.F. Dragoni, Violence detection in videos by combining 3D con-
volutional neural networks and support vector machines, Appl. Artif. Intell. 34 (4) (2020) 329–344, doi:10.1080/
08839514.2020.1723876.
[6] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, R. Sukthankar, Violence detection in video using computer
vision techniques, in: P. Real, D. Diaz-Pernil, H. Molina-Abril, A. Berciano, W. Kropatsch (Eds.), Computer Analysis of
Images and Patterns, Springer, Berlin Heidelberg, 2011, pp. 332–339, doi:10.1007/978-3-642-23678-5_39.
[7] T. Hassner, Y. Itcher, O. Kliper-Gross, Violent flows: real-time detection of violent crowd behaviour, in: 2012
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 1–6, doi:10.1109/CVPRW.2012.6239348.
[8] M. Cheng, K. Cai, M. Li, RWF-2000: an open large scale video database for violence detection, arXiv preprint, arXiv:1911.05913, 2019. https://arxiv.org/abs/1911.05913.