Abstract Owing to the massive amount of video data generated by the high proliferation of surveillance cameras, the manpower required to monitor such systems is relatively expensive. Passively monitoring surveillance video, however, undermines the usefulness of surveillance cameras. Therefore, a drive to monitor events as they happen is needed to fully harness the massive data generated by surveillance cameras. This is the main goal of video analytics. In this chapter, we extend the notion of surveillance. Surveillance refers not only to monitoring for security or safety purposes but encapsulates all aspects of monitoring that capture the dynamics of different application domains, including retail, transportation, service industries and healthcare. This chapter presents a detailed survey of video analytics and its applications. We present advances in video analytics research and emerging trends from subdomains such as behavior analysis, moving object classification, video summarization, object detection, object tracking, congestion analysis, abnormality detection and information fusion from multiple cameras. We also summarize recent developments in video analytics and intelligent video systems (IVS). We evaluate state-of-the-art approaches to video analytics, including deep learning approaches, and outline research directions with emphasis on algorithm-based analytics and applications. Hardware-related issues are excluded from this chapter.
The widespread use of video cameras, especially mobile phone cameras and low-cost high-performance IP surveillance cameras, is contributing substantially to the exponential growth of video data, much of which is used for surveillance. In China, over 170 million video surveillance cameras have been installed, and this number is expected to increase to over 600 million by 2020 [1].
Surveillance is usually synonymous with crime and intrusion detection owing to the increasing security issues in our society, but in this chapter we extend the idea of surveillance to resource monitoring. Therefore, we define surveillance as the monitoring of the activities and behavior of people or objects for the purpose of intrusion detection, crime detection, scene monitoring and resource tracking.
In a traditional video surveillance system, an operator is assigned to actively watch videos captured by the cameras in order to track and detect any suspicious person or potentially dangerous abandoned object. However, this is quite unrealistic considering the vast number of cameras and the many hours of recording currently available.
Moreover, Shan et al. [2] have shown that the maximum attention span of any personnel monitoring a video task is 20 min.
Manually monitoring videos by having personnel stare at several video screens for many hours is also relatively expensive and highly prone to errors, causing many events to be missed. Therefore, a drive to automatically analyze these video data is essential. This is the goal of video analytics (VA): to ease the strenuous task of manually monitoring several hours of video, to provide real-time alerts when situations of interest occur, and to facilitate keyword search of enormous archives of video through automation.
Video analytics, or video content analysis, refers to the automatic processing and understanding of video content in order to determine or detect spatio-temporal events and extract information or knowledge about the observed scene. The video content can come from a single camera or from multiple cameras. Video analytics is a constantly evolving field, with novel techniques and algorithms continuously being developed in areas such as video semantic categorization, video retrieval from databases, human action recognition, summarization and anomaly detection. Video analytics algorithms can be implemented either as a software package running in a centralized station where numerous servers are utilized for processing, or as hardware on a dedicated video processing unit.
Sport, retail, automotive, transport, security, entertainment, traffic control (including pedestrian crossings), digital real-time decision-making devices, and healthcare are some of the many domains in which video analytics has been applied.
For example, the 2018 FIFA World Cup introduced a new phase of monitoring football matches using video to enhance the referee's capability. The system is called the Video Assistant Referee (VAR). VAR not only monitors the game but also analyzes players' performance in real time. Another video system used for monitoring in the 2018 FIFA World Cup is goal-line technology (GLT). Goal-line technology is used to detect whether the ball has crossed the goal line, which is often difficult for the linesman or assistant referee to see. These technologies, driven by video surveillance of the football field, provide the referee with sufficient information to make decisions when disputes occur, such as a suspected penalty or a ferocious attack that deserves a yellow or red card.
For resource monitoring, Cheng and Olatunji [3] showed that video can be used to monitor trolleys in an airport operation in real time. Their system provides an up-to-date inventory of the available resource (trolleys) and reduces replenishment time by about 70%, thanks to its real-time alerts when there is a shortage of the resource.
For autonomous cars, obstacle detection and path planning are important for effective navigation. This is achieved by analyzing the data captured by Light Detection and Ranging (LiDAR) sensors or other 3D cameras installed in the car.
Cognitive factors such as attention span and reaction time to an event can be investigated by analyzing the video content of an activity or event. Real-time situational awareness, target recognition, event prediction and prevention, and post-event data analysis are key goals of intelligent video systems (IVS) fueled by video analytics. Video analytics spans broad application areas such as biometrics (face, tattoo, signature and iris recognition), detection and tracking (persons, objects, vehicles, abandoned objects and logos), text recognition from video, search and retrieval of video content, geolocation and mapping, summarization and skimming, behavior analysis and event analysis.
Typically, videos are segmented into frames, i.e., sets of still images. A single video camera can produce about 25–30 fps, or more with 4K and 3D video cameras, which is equivalent to thousands or millions of frames depending on the sampling rate of the video. Similarly, it has been forecast that video traffic will account for about 82% of all internet traffic [4]. Therefore, considering the volume and speed of data that can be generated by a single video source, algorithms to process the video must offer real-time or near real-time solutions.
Most existing IVS are based on a centralized approach, but there has been progressive research on edge-based architectures. The centralized approach involves routing videos to a cloud or central storage where all analytics take place, whereas in an edge-based architecture, video content is analyzed near the source of data generation. Edge-based architectures can provide a better solution by avoiding the bandwidth and other costs associated with transporting data to the centralized station. However, this is beyond the scope of this chapter; readers are referred to [5] for more information. Video analytics algorithms are parameter-oriented, as parameters such as the frame sampling rate, frame resolution and algorithm-specific settings are key factors in determining the accuracy of the output.
The most fundamental steps in automated video surveillance and monitoring (VSAM) are background image estimation and updating the background image to reflect changes in the background environment. However, since the emergence of deep learning and the breakthrough of convolutional neural networks (CNNs) in image, video and audio classification problems, there has been widespread adoption of CNNs, especially for large-scale video classification [6]. Due to the high computational cost, graphics processing units (GPUs) are leveraged to run video-processing algorithms for effective performance gains by parallelizing a number of vision algorithms [7, 8]. However, computational details are excluded as they are not the focus of this chapter.
Video analytics systems based on deep learning models such as CNNs form the basis of state-of-the-art analytics systems applied in smart cities and real-time applications. The deep learning approach requires enormous amounts of data and training time to complete a task such as object segmentation, classification or detection. Despite enormous effort in developing automated VSAM systems, current surveillance systems are not entirely capable of autonomously analyzing complex events from observed scenes. To address this problem, independent work on several areas such as object tracking, behavior understanding, object classification, summarization and motion segmentation is combined to form a composite video analytics framework for video surveillance.
This chapter presents a general overview of video analytics, current state-of-the-art methods and the integration of different video analytics algorithms to form an intelligent video system (IVS). The rest of the chapter is organized as follows: Sect. 2 discusses the theory of video analytics systems, in particular the deep learning approach, which is the current state of the art. Section 3 surveys the algorithms and tasks involved in video analytics for surveillance. In Sect. 4, we discuss the application of video surveillance analytics to daily-life operations. Section 5 presents research directions and concludes the chapter.
Video analytics has been a major area of research over the last few decades, with several techniques developed to overcome challenges such as the accuracy and precision of analyzing video data. This section details the core concepts of some early works and also of recent works spurred by deep learning. The core mathematical foundations discussed in this section have been applied to different application domains such as object tracking, segmentation and motion detection, to name a few.
Background image estimation and updating the background image to reflect changes in the environment are the fundamental steps in automated VSAM. Several approaches to background image estimation in the presence of moving objects exist. They include the pixel-based approach, the block-based approach, neural network approaches, and the Gaussian Mixture Model (GMM). Readers are advised to consult [9] for a comprehensive survey of scene background estimation; a minimal illustration of the pixel-based idea is sketched below.
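To make the pixel-based idea concrete, the sketch below estimates a background image with a simple running average and flags pixels that deviate strongly from it as foreground. It is an illustrative sketch only, not the method of any specific reference cited here; the input file name, learning rate and threshold are placeholder assumptions.

```python
# Minimal pixel-based background estimation via a running average (OpenCV).
# VIDEO_PATH, ALPHA and the threshold value are illustrative assumptions.
import cv2
import numpy as np

VIDEO_PATH = "surveillance.mp4"   # hypothetical input video
ALPHA = 0.01                      # learning rate: how fast the background adapts

cap = cv2.VideoCapture(VIDEO_PATH)
background = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    if background is None:
        background = gray.copy()                  # initialize with the first frame
    else:
        # background <- (1 - ALPHA) * background + ALPHA * gray
        cv2.accumulateWeighted(gray, background, ALPHA)

    # pixels far from the background estimate are treated as foreground
    diff = cv2.absdiff(gray, background)
    _, foreground = cv2.threshold(diff.astype(np.uint8), 25, 255, cv2.THRESH_BINARY)

cap.release()
```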
Fig. 1 Background estimation from a cluttered scene: (a) initial frame from the video scene, (b)–(c) background initialization, (d) estimated background. Images adapted from [13]
in part or fused together for abandoned object detection (AOD) [18–20]. An interesting work by Guler [7] used two foreground images (one long-term and one short-term) to define AOD as a situation in which an object's pixels are in the long-term foreground but not in the short-term foreground. Ramirez-Alonso et al. [21] proposed a temporal weighted learning model for background estimation that can reinitialize and update its parameters adaptively. Details of their proposed method are discussed below. There are four modules in the system: the first module handles scenes with static objects; module 2 deals with normal scenes that can contain dynamic objects; in module 3, a threshold is defined in order to separate the background from foreground objects; and module 4 caters for dynamic scene changes using the Speeded-Up Robust Features (SURF) algorithm to align information. Once the information about the background is aligned, the background model can be updated and the background can be estimated.
Let $W_{BS\_HLR}$ and $W_{BS\_LLR}$ be the adaptive weight arrays updated by high and low learning-rate parameters respectively, $LR$ be the learning rate values, and $F_{BE\_HLR}$ and $F_{BE\_LLR}$ be the foreground models of the estimated background in the RGB color space. Classification of the foreground is performed by Eqs. (1) and (2):

$$F_{BE\_HLR}(\mathbf{x},t) = \begin{cases} 1 & \text{if } \left\lVert W_{BS\_HLR}(\mathbf{x},t)_{RGB} - I(\mathbf{x},t)_{RGB} \right\rVert_2 > \varepsilon_1 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$F_{BE\_LLR}(\mathbf{x},t) = \begin{cases} 1 & \text{if } \left\lVert W_{BS\_LLR}(\mathbf{x},t)_{RGB} - I(\mathbf{x},t)_{RGB} \right\rVert_2 > \varepsilon_2 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $\mathbf{x}$ is the pixel location $[x, y]$ in an $M \times N$ image ($1 < x < M$ and $1 < y < N$) and $t \in \mathbb{N}$ is the time index. $\varepsilon_1$ and $\varepsilon_2$ are the threshold values. If the Euclidean distance exceeds the threshold value, the pixel is classified as foreground; a zero in the foreground model indicates that the pixel is part of the background.
Weights are updated adaptively by Eq. (3), where $LR$ is the matrix holding the learning rate value for each weight, defined by Eq. (4). The exponential factor in Eq. (4) produces fast learning in the initial frames. The value of $t_a$ is therefore initialized to 0 and increases by 1 if changes occur between consecutive input frames, as given in Eq. (5):

$$LR(\mathbf{x},t) = S_0\, e^{-t_a/T_f} + A_0\, A(\mathbf{x},t) + L_0 \tag{4}$$

$$t_a = \begin{cases} t_a + 1 & \text{if } \rho_v < 0.998 \\ t_a & \text{otherwise} \end{cases} \tag{5}$$
where $\rho_v$ is the Pearson correlation coefficient between the input frames at time $t$ and $t-1$, $L_0$ is a classifier constraint whose value is less than 0.1, and $T_f$ is the number of frames. $S_0$ is the bootstrap learning constant and $A_0$ is the scaling factor for $A(\mathbf{x},t)$. $A(\mathbf{x},t)$ is the scene information matrix that defines the learning of each pixel and is used for classifying pixels based on the values of $F_{BE\_HLR}(\mathbf{x},t)$ and $F_{BE\_LLR}(\mathbf{x},t)$.
$A(\mathbf{x},t)$ can be calculated as:

$$A(\mathbf{x},t) = 1 - \frac{1}{2}\left( F_{BE\_HLR}(\mathbf{x},t) + F_{BE\_LLR}(\mathbf{x},t) \right) \tag{6}$$

When $A(\mathbf{x},t) = 0$, the pixel represents a dynamic object or a ghost detection. If $A(\mathbf{x},t) = 0.5$, the pixel is classified as a dynamic object in only one foreground model, or as a ghost only in $F_{BE\_LLR}$. If $A(\mathbf{x},t) = 1$, the pixel is classified as background in both foreground models. A vectorized sketch of this classification follows.
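The per-pixel classification of Eqs. (1), (2) and (6) can be expressed compactly with array operations. The sketch below is a simplified, vectorized illustration rather than the authors' implementation; the weight arrays, thresholds and test data are placeholder assumptions.

```python
# Simplified sketch of the foreground classification and scene-information
# matrix of Eqs. (1)-(6). W_hlr and W_llr stand in for the adaptive weight
# arrays (H x W x 3); eps1 and eps2 are illustrative thresholds.
import numpy as np

def classify_foreground(W_hlr, W_llr, frame, eps1=30.0, eps2=30.0):
    # Euclidean distance per pixel in RGB space, Eqs. (1) and (2)
    d_hlr = np.linalg.norm(W_hlr - frame, axis=2)
    d_llr = np.linalg.norm(W_llr - frame, axis=2)
    F_hlr = (d_hlr > eps1).astype(np.float32)     # 1 = foreground, 0 = background
    F_llr = (d_llr > eps2).astype(np.float32)

    # Scene information matrix, Eq. (6):
    # 1 -> background in both models, 0.5 -> foreground in one, 0 -> in both
    A = 1.0 - 0.5 * (F_hlr + F_llr)
    return F_hlr, F_llr, A

# Example usage with random placeholder data
H, W = 240, 320
frame = np.random.randint(0, 256, (H, W, 3)).astype(np.float32)
W_hlr = np.full((H, W, 3), 128.0, dtype=np.float32)
W_llr = np.full((H, W, 3), 128.0, dtype=np.float32)
F_hlr, F_llr, A = classify_foreground(W_hlr, W_llr, frame)
```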
Fig. 2 Illustration of object detection using DNN proposed by Szegedy et al. [41]
$$X = \{x_1, x_2, x_3, \ldots, x_n\} \tag{7}$$
The number of channels is reduced from three (RGB) to one by converting the frame to gray scale, thus reducing the processing time; gray-scale conversion has no effect on algorithm accuracy. For the object detection phase, the converted gray-scale frame is input into an object detection algorithm and a bounding box is created around the region of interest. A Haar cascade classifier [26] can be used for object detection.
Let $(x, c)$ be a labeled frame, where $x$ is the frame data and $c$ is the ground truth. The corresponding bounding box after an object is detected is given by:

$$R = (x_0, y_0, \ldots, x_n, y_n) \tag{8}$$
To extract the desired object from the video frame, cropping is performed around the detected area, and the cropped area of the frame serves as input to the object classification phase. The extracted objects are rescaled to a size $w \times h$, say $150 \times 150$ pixels, and normalized before being fed into the deep neural network (DNN). Normalization is performed for better accuracy by transforming pixel values from the range 0–255 to 0–1 before feeding them into the CNN, as shown in Fig. 3. A minimal sketch of this detection and preprocessing pipeline is given below.
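The following sketch illustrates the detect-crop-rescale-normalize pipeline described above, using OpenCV's pretrained frontal-face Haar cascade as a stand-in detector. The choice of cascade and the 150 × 150 target size are illustrative assumptions, not the configuration of any cited system.

```python
# Minimal detect-crop-rescale-normalize pipeline with a Haar cascade detector.
import cv2
import numpy as np

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

def preprocess_detections(frame_bgr, target_size=(150, 150)):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)          # 3 channels -> 1
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in boxes:                                   # bounding boxes R
        crop = gray[y:y + h, x:x + w]                            # crop detected area
        crop = cv2.resize(crop, target_size)                     # rescale to w x h
        crop = crop.astype(np.float32) / 255.0                   # normalize to [0, 1]
        crops.append(crop)
    return boxes, crops
```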
A CNN, or a DNN in general, requires a large amount of training data to give good performance and to avoid overfitting. When training data are scarce, data transformations (affine displacement fields) such as contrast variation, skew, translation, rotation and flipping are performed on the input data set to generate additional training data, augmenting the scarce data so as to increase the accuracy of the DNN. For details about how affine displacements can be used on video frames to generate additional data, Yaseen et al. [50] give an intuitive explanation; a simple illustrative augmentation sketch follows.
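As a generic illustration of such augmentation (not the transformation scheme of [50]), a frame can be randomly rotated, translated, flipped and contrast-adjusted with basic affine warps; the ranges below are arbitrary illustrative values.

```python
# Generic data-augmentation sketch: random affine transforms applied to a frame.
import cv2
import numpy as np

def augment(frame, rng=np.random):
    h, w = frame.shape[:2]
    angle = rng.uniform(-15, 15)                        # rotation in degrees
    tx, ty = rng.uniform(-0.1, 0.1, size=2) * (w, h)    # translation in pixels
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[:, 2] += (tx, ty)
    out = cv2.warpAffine(frame, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    if rng.rand() < 0.5:
        out = cv2.flip(out, 1)                          # horizontal flip
    alpha = rng.uniform(0.8, 1.2)                       # contrast variation
    return np.clip(alpha * out.astype(np.float32), 0, 255).astype(np.uint8)
```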
Based on the input training data and/or the transformed data, the CNN is trained to classify and discriminate between the generated classes. A typical architecture consists of multiple alternating convolutional and subsampling layers, denoted mathematically in Eqs. (9) and (10) respectively, where $g(\cdot)$ is the activation function, $W$ and $B$ represent the weights and biases of the network, the sub-sampling layer consists of downsampled inputs, and $*$ represents the convolution operation performed between the inputs and the network weights.
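In a generic formulation (a standard textbook form, not necessarily the exact notation of Eqs. (9) and (10) in the source), the two layer types can be written as

$$y_j = g\Big(\sum_i x_i * W_{ij} + B_j\Big) \qquad \text{(convolutional layer)}$$

$$s_j = \operatorname{down}(y_j, p) \qquad \text{(sub-sampling layer)}$$

where $\operatorname{down}(\cdot, p)$ denotes downsampling (pooling) over non-overlapping $p \times p$ regions and $p$ is an assumed pooling size.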
The Rectified Linear Unit (ReLU), $g(z) = \max(0, z)$, is the most commonly used activation function for non-linearity. It maps inputs to non-negative real numbers, with range $[0, \infty)$, and helps mitigate the vanishing gradient problem. Other activation functions include the sigmoid, Leaky rectified linear unit (Leaky ReLU), Parametric rectified linear unit (PReLU), Randomized leaky rectified linear unit (RReLU), Exponential linear unit (ELU), Scaled exponential linear unit (SELU), S-shaped rectified linear activation unit (SReLU), Inverse square root linear unit (ISRLU) and Adaptive piecewise linear (APL) unit.
For the pooling layer, max pooling is used. The purpose of the pooling layer is dimensionality reduction: it downsamples the feature maps from the convolutional layer and reduces the number of parameters so as to reduce the computational cost. Yaseen et al. [50] used local response normalization as a generalization technique. A typical video-based CNN architecture consists of two convolutional layers followed by two response normalization layers. Three max pooling layers are stacked underneath the response layers, followed by the last convolutional layer. L2 regularization is used to penalize large weights and reduce overfitting; a loose sketch of a network of this flavor is given below.
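The sketch below assembles a small frame-classification network of this general flavor in PyTorch, mixing convolution, local response normalization and max pooling, with L2 regularization applied through the optimizer's weight decay. It is a loose illustration, not the exact architecture of [50]; the channel counts, pooling sizes and class count are assumptions.

```python
# Loose sketch of a small CNN for 150x150 gray-scale frame classification.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 18 * 18, num_classes)  # 150 -> 75 -> 37 -> 18

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

model = FrameCNN()
# weight_decay applies L2 regularization to the parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```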
where $f(x_i^T)$ is the function used to calculate the output value and $e$ is the basis vector.
Zhang et al. [55] created a knowledge base to reduce the semantic gap between complex events by using large numbers of web images to learn noise-resistant classifiers that effectively model event-centric semantic concepts. Group incremental learning of a target classifier was proposed by Wang et al. [56], where each concept group comprises simple action videos and images queried from the Web. Long et al. [57] and Duan et al. [58] proposed a transfer kernel learning method and a multiple source domain adaptation method, respectively. In the multiple source domain adaptation method, relevant image sources are selected for annotating videos.
Besides appearance-based approaches, depth-map-based action recognition methods have been proposed, including the combination of depth motion maps (DMM) and HOG [68], the use of graphical models to encode temporal information [69], histograms of oriented 4D normals (HON4D) [70], the capture of local motion and geometry information [71], and binary range-sample features [72]. However, all of these methods are dataset-dependent and are based on hand-crafted features.
Deep learning methods have also been applied to depth-based action recognition. A variant of DMM based on CNNs was applied in [73, 74]. Wu et al. [75] used a 3D CNN as a feature extractor for depth data. Other techniques, such as structured images [76, 77], have also been proposed for depth-based action recognition.
C. Deep learning-based motion recognition
Deep learning approaches for motion recognition can be grouped into four categories:
Category 1 Video is viewed as a set of still images [29, 30]. In this category, each channel of the images is fed into one channel of a CNN. This approach is suboptimal but performs quite well.
Category 2 Video is represented as a volume, and the 2D filters of the CNN are replaced with 3D filters, thereby introducing a temporal dimension [78, 79]. This approach does not work particularly well, probably due to the lack of annotated training sets.
Category 3 Video is regarded as a sequence of images and fed into a recurrent neural network (RNN) [80–82]. RNNs allow sequential parsing of video frames owing to their sensitivity to both long-term and short-term patterns and their memory-cell-like nature; frame-level information is encoded in the memory. This category performs at a similar level to category 2 (a minimal sketch of this approach is given at the end of this subsection).
Category 4 Video is represented as compact images and fed into a pre-trained CNN architecture [83]. This approach achieved state-of-the-art action recognition performance thanks to the pretraining.
Ochs et al. [84] proposed a method for segmenting moving objects using a semi-dense point tracker based on optical flow, producing trajectories over several frames through long-term analysis of the motion vectors. They claim that intricate details can be extracted by analyzing a video over a long period, segmenting the meaningful or whole part of an object, rather than over a short time window. Their method performs better than two-frame optical flow and color-based segmentation methods.
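As a concrete illustration of Category 3 above (video as a frame sequence fed to an RNN), the sketch below extracts per-frame CNN features and encodes them over time with an LSTM. The ResNet-18 backbone, feature size and class count are illustrative assumptions, and the snippet assumes a recent torchvision version.

```python
# Minimal CNN + LSTM sketch for clip classification (Category 3).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTM(nn.Module):
    def __init__(self, hidden_size=256, num_classes=10):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # expose 512-d pooled features
        self.cnn = backbone
        self.rnn = nn.LSTM(512, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, clip):                       # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))       # (batch*time, 512)
        out, _ = self.rnn(feats.view(b, t, -1))    # encode frame-level info over time
        return self.head(out[:, -1])               # classify from the last time step

logits = CNNLSTM()(torch.randn(2, 8, 3, 112, 112))
```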
Moving object detection from video is important owing to its invaluable applications in domains such as intelligent video surveillance, human behavior recognition, traffic control and action recognition.
Motion detection is similar to object tracking, except that the detected object is not matched across consecutive frames. Moving objects can therefore be detected via foreground images extracted by a GMM background model, with a connected-component algorithm used for noise removal. Upon applying the connected-component algorithm, the detection area is refined and the bounding box information of the moving objects is produced [7]; similarly, deep learning methods can be used. Moving object detection takes a video sequence captured by a fixed or moving camera and outputs, for each frame of the sequence, a binary mask representing the moving objects. Shadows, variations in illumination and cloud movement make moving object detection a difficult task [87]. Methods for moving object classification and detection can be categorized into moving camera with moving object and stationary camera with moving object; a minimal sketch of the GMM-plus-connected-components pipeline is given below.
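The sketch below illustrates the GMM-foreground plus connected-components pipeline just described; the parameters and the area filter are illustrative choices, not the implementation of [7].

```python
# GMM foreground extraction followed by connected-component filtering (OpenCV).
import cv2

cap = cv2.VideoCapture("traffic.mp4")             # hypothetical input video
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = mog2.apply(frame)                         # GMM foreground mask
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN,      # suppress small noise blobs
                          cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
    for i in range(1, n):                          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > 200:                             # keep sufficiently large objects
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

cap.release()
```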
A. Moving object with stationary camera
Analyzing moving objects using fixed cameras, where the background image pixels remain the same throughout a video sequence, has been extensively studied in the literature [88–92]. The generic approach to handling the fixed-camera problem is to model a stable background and apply a background subtraction technique, as described previously in Sect. 2 (theory of video analytics). Shantaiya et al. [93] conducted an extensive review of object detection in video and grouped the methods into feature-based, motion-based, classifier-based and template-based models.
Object tracking in videos has also been categorized into point tracking, kernel tracking and silhouette tracking.
Low-rank and sparse decomposition operates on a matrix built by stacking a set of image frames. If low-rank structure exists, the low-rank representation of this matrix contains the coherency, and the sparse representation contains the outliers, which represent the moving objects in these frames. The decomposition segments moving objects from the fixed background by applying principal component pursuit (PCP) and is a valuable technique in background modelling. The mathematical formulation and optimization of this method can be found in [96]; a simplified low-rank sketch is given below.
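As a rough stand-in for PCP (and not the optimization described in [96]), the low-rank idea can be illustrated with a truncated SVD of the stacked-frame matrix: the low-rank part approximates the static background and the residual highlights moving objects. The rank, frame sizes and random data below are illustrative assumptions.

```python
# Simplified low-rank/sparse illustration using a truncated SVD.
import numpy as np

def lowrank_background(frames, rank=1):
    # frames: (num_frames, H, W) gray-scale stack
    n, h, w = frames.shape
    D = frames.reshape(n, h * w).astype(np.float64).T      # columns are frames
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]             # low-rank approximation
    S = D - L                                               # residual (moving objects)
    background = L.mean(axis=1).reshape(h, w)
    motion = np.abs(S).T.reshape(n, h, w)
    return background, motion

frames = np.random.randint(0, 256, (20, 120, 160)).astype(np.float64)
bg, motion = lowrank_background(frames, rank=1)
```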
Fig. 4 (a) Crowd gathering at a train station during peak hours, (b) crowd gathering for a religious activity. Images retrieved from alamy.com and www.thereformationroom.com, respectively
The significance of crowd analysis has grown with the increase in the world population. For public safety, crowd management is very important in the design of shopping malls, stadiums and subway stations in order to avoid stampedes or other disastrous outcomes. Using cameras for crowd analysis is therefore important to detect or prevent terrorist attacks, bomb explosions, fire outbreaks and other incidents that can endanger public safety.
Crowd behavior analysis spans several areas, including pattern recognition, computer vision, mathematical modelling, artificial intelligence and data mining. The meaning of "crowd" differs depending on the situation; for example, ten people gathering in a subway station can be regarded as a crowd. Crowd analysis involves studying both group and individual behavior to determine abnormality. The definition of abnormal behavior is quite ambiguous, which makes crowd analysis an interesting area of research. An extensive survey of state-of-the-art deep learning methods for crowd analysis can be found in [121].
The following attributes are used in analyzing crowds:
1. Counting and density estimation (congestion analysis)
2. Motion detection
3. Tracking
4. Behavior understanding.
Several factors must be considered when performing crowd analysis, including terrain features, geometrical information and crowd flow.
Motion features provide an invaluable basis for the analysis of crowded scenes. Existing work on motion feature analysis for crowded scenes can be divided into flow-based features, local spatio-temporal features, and trajectories or tracklets [122]. These feature representations can be used for several tasks, such as crowd behavior recognition, abnormality detection in crowds and motion pattern segmentation.
A. Flow-based Features
Tracking a person in a highly crowded environment is extremely difficult. In flow-based feature extraction, however, attention is given only to the occurrence, not to the actor (who is involved in what is happening). For example, looking at a single person's action in isolation may not say much and may seem random, but an overall view of the crowd can be conclusive [123]. Flow-based features are pixel-level features, and several methods have been presented over the years [124–128]. Existing work can be categorized as follows:
1. Optical Flow:
Optical flow involves computing pixel-wise motion between consecutive frames. Optical flow handles multi-camera object motion and has been applied to the detection of crowd motion as well as to crowd segmentation [129–132] (a minimal sketch of dense optical flow is given after this list). Its drawback is that it cannot encapsulate the spatio-temporal properties of the flow and does not capture long-range dependencies.
2. Particle Flow:
Inspired by the Lagrangian framework of fluid dynamics [133], particle flow involves moving a grid of particles with the optical flow, providing trajectories that map a particle's initial position to its future or current position. This method has been applied to crowd segmentation and to the detection of abnormal behavior in crowds [122]. However, it suffers from time lag and cannot handle spatial changes.
3. Streak Flow:
Streaklines were introduced by Mehran et al. [128] for analyzing crowd video by computing the motion field; the proposed method is called streak flow. Streak flow overcomes the challenges of particle flow: although it captures motion information similar to particle flow, it responds faster to changes in the flow and performs well for dynamic motion.
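The dense optical flow referred to under item 1 above can be computed between consecutive frames as in the following sketch; the Farneback parameters are the commonly used defaults and the input file is a placeholder.

```python
# Dense (pixel-wise) optical flow with the Farneback method (OpenCV).
import cv2

cap = cv2.VideoCapture("crowd.mp4")                # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # per-pixel magnitude/orientation can feed crowd motion detection/segmentation
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    prev_gray = gray

cap.release()
```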
B. Local Spatio-Temporal Features:
Less structured (very crowded) scenes have high variability and non-uniform movement. Motion in this type of scene can be generated by any moving object, and optical-flow features alone cannot provide useful information about the motion. Local spatio-temporal features are 2D-patch or 3D-cube representations of the scene; they explore motion patterns and characterize their spatio-temporal distributions over local 2D patches or 3D cubes. Spatio-temporal features are described below:
1. Spatio-temporal Gradients:
Kratz and Nishino [134] used a spatio-temporal motion pattern model to capture steady-state motion behavior, and their results show that abnormal activities can be detected.
2. Motion Histograms:
Motion histograms consider motion information within a local region. Computing motion orientation for a motion histogram takes a considerable amount of time and is highly susceptible to error; thus, it is not directly suitable for crowd analysis. However, several improved methods based on motion histograms have been proposed. Jodoin et al. [131] proposed the orientation distribution function (ODF), a feature that carries no information about the magnitude of the flow but represents the probability density of a particular motion orientation. The multiscale histogram of optical flow (MHOF) was proposed by Cong et al. [135] as a feature descriptor that preserves both spatio-contextual information and motion information.
C. Trajectory/Tracklet
[151], energy model [152], stationary-map [153], social force model (SFM) [154],
Gaussian regression [155], principal component analysis (PCA) model [156], global
motion-map [157], motion influence map [158], and salient motion map [159] are
approaches used in the global pattern-based method.
C. Grid pattern-based method of anomaly detection
In contrast to global pattern-based methods, grid pattern-based methods do not consider frames as a whole but instead split frames into blocks and analyze patterns individually on a block-level basis [160]. Grid pattern-based methods are more efficient owing to the reduction in processing time achieved by evaluating patterns at the block level and ignoring inter-object connections. Spatio-temporal anomaly maps, local-feature probabilistic frameworks, joint sparsity models, mixtures of dynamic textures with GMM, low-rank and sparse decomposition (LSD), cell-based texture analysis, sparse coding (SC) and deep networks have been used in grid pattern-based methods [142].
Xu et al. [161] used sparse reconstruction of dynamic textures over an overcomplete basis set to detect anomalies. Cong et al. [162] proposed the concept of searching for the best match in the training dataset using a motion context descriptor. Thida et al. [163] extracted diverse crowd activities from videos using a spatio-temporal Laplacian eigenmap method. All these methods are based on sparse coding (SC). The notion behind SC is that abnormal events in videos are characterized by sparse linear combinations of normal patterns with large reconstruction errors, while normal events are characterized by small reconstruction errors (a toy sketch of this reconstruction-error idea is given below). Yu et al. [164] classified events as abnormal or normal using a hierarchical sparse coding method. Li et al. [165] and Lu et al. [166] computed sparse representations at each scale by splitting frames into multiple scales. To generate better representations of abnormal events, Lu et al. [167] and Han et al. [168] proposed adaptive dictionary learning and online adaptive dictionary learning, respectively. Sparse coding with a sliding window was adopted by Zhao et al. [169] for the detection of abnormal events in videos.
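The reconstruction-error notion behind SC can be illustrated with a toy sketch: a dictionary is learned from features of normal events, and new samples with large sparse-reconstruction error are flagged as candidate anomalies. The feature dimensions, dictionary size and threshold are placeholder assumptions, not a specific cited method.

```python
# Toy sparse-coding anomaly scoring via reconstruction error (scikit-learn).
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(500, 64))          # features of normal events
test_feats = rng.normal(size=(10, 64))             # features of new events

dico = DictionaryLearning(n_components=32, transform_algorithm="omp",
                          transform_n_nonzero_coefs=5, random_state=0)
dico.fit(normal_feats)

codes = dico.transform(test_feats)                 # sparse codes over the dictionary
recon = codes @ dico.components_
errors = np.linalg.norm(test_feats - recon, axis=1)
is_abnormal = errors > np.percentile(errors, 90)   # illustrative threshold
```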
Xu et al. [142] proposed a method for detecting anomalies in video based on stacked sparse coding (SSC) with intra-frame classification. The video is first divided into blocks. The appearance and motion features of each block are described by a foreground interest point (FIP) descriptor and encoded by SSC. A support vector machine (SVM) is then used to perform the intra-frame classification that determines abnormality in each block.
The definition of the significant or important aspects of a video varies according to several criteria. The goal of video summarization is to generate a condensed representation of a specific video. Determining the informative or important sections of a video requires understanding of the video content; however, video content is diverse, which makes summarization a difficult task. Video summarization can also enhance video retrieval results.
Zhang et al. [170] defined a good summarization technique as one that is diverse, representative of videos in a similar group, and discriminative against videos in dissimilar groups. Domain-specific summarization was the early approach used for determining the important segments of a video. For example, in sport, the specific structure of the game can indicate the important segments according to the rules governing the sport; in movies, metadata such as the movie script and captions can be used to generate video summaries.
Methods for video summarization can be categorized into unsupervised, supervised, query-extractive and discriminative approaches. Supervised and unsupervised approaches have been developed to encapsulate domain knowledge for video summarization. Unsupervised summarization approaches create summaries based on precise selection criteria, whereas supervised approaches train a summarization model using human-created summaries.
Potapov et al. [171] used a classifier's confidence score to define the important segments of a video. Methods in the supervised approach are difficult to generalize to other genres because they are highly dependent on domain knowledge, but they offer better performance.
The unsupervised approach is independent of domain knowledge and thus suitable for generic applications. Yang et al. [172] used an auto-encoder to convert an input video's features into a concise representation and reconstruct the input using the decoder. Zhao et al. [173] proposed a method that reconstructs the rest of the original video from a video summary.
Query-extractive summarization methods [174, 175] are a variant of summarization methods that generate a summary based on keyword input. This model assumes that a video can have multiple summaries. However, it may be unrealistic for real applications because it requires frame-level importance annotation for each keyword.
Panda et al. [176] introduced discriminative information by training a spatio-temporal CNN to classify the category of each video and calculating importance scores via gradient aggregation of the network's output. Kanehira et al. [170] proposed viewpoint-aware video summarization, in which the summary is built based on the aspect of the video that the viewer focuses on. To determine the viewpoint, they leverage other videos in folders on the viewer's laptop or phone and measure semantic similarity and dissimilarity between the videos in the folder and the current video being watched to produce viewpoint-specific summaries. Otani et al. [177] proposed a method for improving video summarization techniques by using deep video features to encode various levels of content semantics, such as actions, scenes and objects. The architecture used is a deep neural network that maps videos and descriptions to a semantic space, and clustering is applied to the segmented video content. Table 2 gives a taxonomy of video summarization methods.
Table 2 (continued)
Reference | Year of publication | Summarization extraction method
Elhamifar and Kaluza [188] | 2017 | Proposed an incremental subset selection framework for generating summarized videos by updating the set of representative features based on the previously selected set of representatives and a new batch of data
Fleischman et al. [189] | 2007 | Proposed a temporal feature induction method that extracts complex temporal information from video for classifying video highlights
Hong et al. [190] | 2009 | Used a multi-video summarization technique to determine key shots as a combination of a ranked list of web videos and a user-defined skimming ratio
Khosla et al. [191] | 2013 | Used web-image-based prior information to generate summarization obtained through crowdsourcing for poor-quality videos
Kim et al. [192] | 2014 | Used a storyline graph for creating structural video summaries that illustrate various events based on diversity ranking between images and video frames
Lu and Grauman [193] | 2013 | Used a text-analysis-based method to determine a random-walk-based metric of influence between sub-shots that captures event connectivity
Mahasseni et al. [194] | 2017 | Trained a system to learn a deep summarizer network based on an autoencoder long short-term memory network (LSTM)
Song et al. [195] | 2015 | Proposed a video summarization framework called TVSum that detects important shots based on the titles of the retrieved images
Query extractive summarization
Sharghi et al. [174] | 2016 | Proposed a method based on the Sequential and Hierarchical Determinantal Point Process (SH-DPP) to select key shots determined by the relevance of the user query relative to the video context
Sharghi et al. [175] | 2017 | Used extracted semantic information for evaluating the performance of a video summarizer
Discriminative method of video summarization
Kanehira et al. [170] | 2018 | Introduced a viewpoint approach to build a summary that depends on what the viewer focuses on, using classification techniques that discriminate semantic similarity between different groups
Occlusion is one of the major problems in video analytics, and handling it effectively can greatly improve analytics accuracy. Occlusion can be either partial or total: objects can be occluded by other objects in the scene, causing some parts of the object to be unseen (partial occlusion) or completely hidden from the observed video frame (total occlusion). For instance, consider a Mini Cooper as the target object on a
Table 3 (continued)
References | Method | Description
Jinjie et al. [206] | Top-push distance learning model (TDL) | A top-push constraint is used to match video features of persons, instead of matching still images of a person, across multiple cameras. This approach provides high-level discriminative features and better matching for person re-identification across multiple non-overlapping camera views
Mclaughlin et al. [49] | Recurrent neural network architecture | A CNN is used as a feature extractor for each frame, combined with a temporal pooling layer, for person re-identification and detection across multiple cameras
Tesfaye et al. [196] | Constrained dominant sets clustering (CDSC) | Proposed a three-layer hierarchical framework using the CDSC technique for tracking objects across multiple non-overlapping cameras and for person re-identification
freeway. The Mini Cooper, owing to its size, could be occluded by larger trucks and vans, which would in turn affect its detection. Occlusion adversely affects object detection by changing the appearance model for a short time, which can disrupt tracking of the object. Figure 5 shows an example of occlusion in real life.
Several methods have been proposed for occlusion handling based on appearance models. Jepson et al. [207] used the Expectation-Maximization (EM) algorithm with an appearance model based on filter responses from a steerable pyramid to deal with the changing appearance of an object. Spatio-temporal context information obtained from gradual analysis of the occlusion situation was proposed by Pan and Hu [208] to distinguish occluded objects effectively. A contour-based approach was proposed in [209]; in this approach, an energy function evaluated on the contour is minimized, which makes tracking of the object easier. Similarly, maintaining appearance models of moving objects over time can effectively manage occlusion [210].
A method for handling occlusion with a moving camera has been proposed by Hou et al. [211]. Their method uses HOG and a multiple-kernel tracker based on mean shift to discriminate between different conditions of moving cameras, thereby handling occlusion effectively.
Fig. 5 President Trump occluding Queen Elizabeth II: (a) full or total occlusion, (b)–(c) partial occlusion. Images obtained from Twitter
15.4 Applications
Currently, video analytics drives many application domains, from self-driving cars and drones to surveillance, with the majority of models taking advantage of the deep neural network paradigm.
Video analytics in surveillance applications can be grouped into three categories:
1. Predictive analytics: This involves making a prediction or forecast about the future. It identifies risks and opportunities and assesses the likelihood of a similar subject in a different category exhibiting a specific attribute. Simply put, it is the analysis of what will happen in the future, for example, predicting the next direction or move of a target based on the behavior of the target in the current frame.
2. Prescriptive analytics: Prescriptive analytics encompasses the actions that should be taken in a given situation. For example, in resource tracking, when the availability of a resource is low at a particular resource station, video analytics can propose replenishing the busy station with resources from a station with low resource usage.
3. Forensics: Forensic applications of video analytics involve using videos to analyze what has happened. In congestion analysis, when there is a situation of interest such as a stampede, videos can be used to trace back its cause.
Information fusion between cameras and efficient analytic algorithms can greatly enhance the power of video analytics in several domains such as healthcare, transportation, security and resource tracking. The goal of video analytics is to convert observed scenes into quantitative and actionable intelligence that can provide real-time situational awareness of an event of interest as well as post-event evaluation. Combining other data, such as data from geographical information systems (GIS), location data and other metadata, will form a composite suite of video analytics systems and result in more efficient algorithms. Similarly, more research should be conducted on visualizing camera position and perspective, including 3D projection, geospatial context and narrative time sequence, as they affect data quality and algorithm accuracy.
Integration of social media data into video analytics systems and augmented reality (AR) views may increase algorithm robustness. Currently, application-specific systems exist, such as burglary detection, shoplifting detection and fall detection systems, to mention a few. Moving towards a generic video analytics framework, instead of the current application-specific systems, would be a good research direction.
In this chapter, we presented a concise survey of recent video analytics methods and techniques applied to surveillance camera data. We extended the definition of surveillance to resource management, which is important for capacity-constrained industries. We discussed the theory behind current state-of-the-art approaches to video analytics. It is envisaged that this chapter will serve as a starting point for new researchers in the area of video analytics, giving them a broad knowledge of the field and helping them streamline their focus on any of the modules that make up an automated video surveillance and monitoring (VSAM) system.
References
1. I.E. Olatunji, C.-H. Cheng, Dynamic threshold for resource tracking in observed scenes. in
IEEE International Conference on Information, Intelligence, Systems and Applications (2018)
2. C. Shan, F. Porikli, T. Xiang, S. Gong (eds.), Video Analytics for Business Intelligence (Springer, 2012)
3. C.-H. Cheng, I. E. Olatunji, Harnessing constrained resources in service industries via video
analytics. Arch. Ind. Eng. J. (2018)
4. Cisco Visual Networking Index: Forecast and Methodology, 2016–2021, White Paper, vol. 1 (2016)
5. M. Ali, A. Anjum, M. U. Yaseen, A.R. Zamani, D. Balouek-Thomert, O. Rana, M. Parashar,
Edge enhanced deep learning system for large-scale video stream analytics, in 2018 IEEE
2nd International Conference on Fog and Edge Computing (ICFEC) (2018), pp. 1–10
6. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, large-scale video
classification with convolutional neural networks, in Proceedings of the 2014 IEEE Confer-
ence on Computer Vision and Pattern Recognition (2014), pp. 1725–1732
7. P. Guler, Real-Time Multi-camera Video Analytics System on GPU (2015)
8. D.-S. Lee, Effective Gaussian mixture learning for video background subtraction. IEEE Trans.
Pattern Anal. Mach. Intell. 27(5), 827–832 (2005)
9. T. Bouwmans, L. Maddalena, A. Petrosino, Scene background initialization. Pattern Recognit. Lett. 96, 3–11 (2017)
10. Z. Zhou, D. Wu, X. Peng, Z. Zhu, C. Wu, J. Wu, Face Tracking Based on Particle Filter with Multi-feature Fusion (2013)
11. I. Ishii, T. Ichida, Q. Gu, T. Takaki, 500-fps face tracking system. J. Real-Time Image Process.
8(4), 379–388 (2013)
12. V. Pham, P. Vo, V.T. Hung, L.H. Bac, GPU implementation of extended gaussian mixture
model for background subtraction, in 2010 IEEE RIVF International Conference on Comput-
ing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)
(2010), pp. 1–4
13. V. Reddy, C. Sanderson, B.C. Lovell, A low-complexity algorithm for static background
estimation from cluttered image sequences in surveillance contexts. J. Image Video Process.
2011, 1:1–1:14 (2011)
14. G. Zhang, J. Jia, W. Xiong, T.-T. Wong, P.-A. Heng, H. Bao, Moving object extraction with a hand-held camera, in IEEE 11th International Conference on Computer Vision (ICCV 2007) (2007), pp. 1–8
15. M. Gelgon, P. Bouthemy, A region-level motion-based graph representation and labeling for
tracking a spatial image partition. Pattern Recognit. 33(4), 725–740 (2000)
16. P. Angelov, P. Sadeghi-Tehran, C. Clarke, AURORA: autonomous real-time on-board video
analytics. Neural Comput. Appl. 28(5), 855–865 (2017)
17. E. Auvinet, E. Grossmann, C. Rougier, M. Dahmane, J. Meunier, Left-luggage detection using
homographies and simple heuristics
18. D. Emeksiz, A. Temizel, A Continuous Object Tracking System with Stationary and Moving
Camera Modes, vol. 854115, no. Oct 2012
19. P. Gil-Jiménez, R. López-Sastre, P. Siegmann, J. Acevedo-Rodríguez, S. Maldonado-Bascón,
automatic control of video surveillance camera sabotage, in Nature Inspired Problem-Solving
Methods in Knowledge Engineering (2007), pp. 222–231
20. A. Saglam, A. Temizel, Real-Time adaptive camera tamper detection for video surveillance, in
2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance
(2009), pp. 430–435
21. G. Ramirez-Alonso, J.A. Ramirez-Quintana, M.I. Chacon-Murguia, Temporal weighted learn-
ing model for background estimation with an automatic re-initialization stage and adaptive
parameters update. Pattern Recognit. Lett. 96, 34–44 (2017)
22. O. Déniz, G. Bueno, J. Salido, F. De la Torre, Face recognition using histograms of oriented
gradients. Pattern Recognit. Lett. 32(12), 1598–1603 (2011)
23. A.E. Abdel-Hakim, A.A. Farag, CSIFT: A SIFT descriptor with color invariant characteristics,
in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’06), vol. 2 (2006), pp. 1978–1983
24. M.U. Yaseen, M.S. Zafar, A. Anjum, R. Hill, High performance video processing in cloud
data centres. IEEE Symp. Serv.-Oriented Syst. Eng. (SOSE) 2016, 152–161 (2016)
25. M.U. Yaseen, A. Anjum, N. Antonopoulos, Spatial frequency based video stream analysis for
object classification and recognition in clouds, in 2016 IEEE/ACM 3rd International Confer-
ence on Big Data Computing Applications and Technologies (BDCAT) (2016), pp. 18–26
26. M.U. Yaseen, A. Anjum, O. Rana, R. Hill, Cloud-based scalable object detection and classi-
fication in video streams. Futur. Gener. Comput. Syst. 80, 286–298 (2018)
27. A.R. Zamani, M. Zou, J. Diaz-Montes, I. Petri, O. Rana, A. Anjum, M. Parashar, Dead-
line constrained video analysis via in-transit computational environments. IEEE Trans. Serv.
Comput. 1 (2018)
28. A. Anjum, T. Abdullah, M. Tariq, Y. Baltaci, N. Antonopoulos, Video stream analysis in
clouds: an object detection and classification framework for high performance video analytics.
IEEE Trans. Cloud Comput. 1 (2018)
29. K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in
videos, in Proceedings of the 27th International Conference on Neural Information Processing
Systems, vol. 1 (2014), pp. 568–576
30. J.Y. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond short snippets: deep networks for video classification (2015)
31. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional
neural networks, in Proceedings of the 25th International Conference on Neural Information
Processing Systems, vol. 1 (2012), pp. 1097–1105
56. H. Wang, H. Song, X. Wu, Y. Jia, Video annotation by incremental learning from grouped
heterogeneous sources, in Asian Conference on Computer Vision (2014), pp. 493–507
57. M. Long, J. Wang, G. Ding, S.J. Pan, P.S. Yu, Adaptation regularization: a general framework
for transfer learning. IEEE Trans. Knowl. Data Eng. 26(5), 1076–1089 (2014)
58. L. Duan, D. Xu, S. Chang, Exploiting web images for event recognition in consumer videos:
A multiple source domain adaptation approach, in 2012 IEEE Conference on Computer Vision
and Pattern Recognition (2012), pp. 1338–1345
59. H. Song, X. Wu, W. Liang, Y. Jia, Recognizing key segments of videos for video annotation
by learning from web image sets. Multimed. Tools Appl. 76(5), 6111–6126 (2017)
60. K. Tang, L. Fei-Fei, D. Koller, Learning latent temporal structure for complex event detection,
in 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), pp. 1250–1257
61. W. Li, Q. Yu, A. Divakaran, N. Vasconcelos, Dynamic pooling for complex event recognition,
in 2013 IEEE International Conference on Computer Vision (2013), pp. 2728–2735
62. P. Over, G.M. Awad, J. Fiscus, M. Michel, A.F. Smeaton, W. Kraaij, TRECVID 2009: goals, tasks, data, evaluation mechanisms and metrics, in TRECVid Workshop 2009 (2010)
63. H.J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, J. Wan, Principal motion components
for one-shot gesture recognition. Pattern Anal. Appl. 20(1), 167–182 (2017)
64. J. Wan, Q. Ruan, W. Li, S. Deng, One-shot learning gesture recognition from RGB-D data
using bag of features. J. Mach. Learn. Res. 14, 2549–2582 (2013)
65. J. Wan, V. Athitsos, P. Jangyodsuk, H.J. Escalante, Q. Ruan, I. Guyon, CSMMI: class-specific
maximization of mutual information for action and gesture recognition. IEEE Trans. Image
Process. 23(7), 3152–3165 (2014)
66. D. Wu, F. Zhu, L. Shao, One shot learning gesture recognition from RGBD images, in 2012
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
(2012), pp. 7–12
67. F. Jiang, S. Zhang, S. Wu, Y. Gao, D. Zhao, Multi-layered gesture recognition with kinect. J.
Mach. Learn. Res. 16, 227–254 (2015)
68. X. Yang, C. Zhang, and Y. Tian, Recognizing actions using depth motion maps-based his-
tograms of oriented gradients, in Proceedings of the 20th ACM International Conference on
Multimedia (2012), pp. 1057–1060
69. W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in 2010 IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition—Workshops (2010),
pp. 9–14
70. O. Oreifej, Z. Liu, HON4D: histogram of oriented 4D normals for activity recognition from
depth sequences, in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern
Recognition (2013), pp. 716–723
71. X. Yang, Y. Tian, Super normal vector for activity recognition using depth sequences, in 2014
IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 804–811
72. C. Lu, J. Jia, C. Tang, Range-Sample depth feature for action recognition, in 2014 IEEE
Conference on Computer Vision and Pattern Recognition (2014), pp. 772–779
73. P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, P. Ogunbona, ConvNets-Based action recognition
from depth maps through virtual cameras and pseudocoloring, in Proceedings of the 23rd
ACM International Conference on Multimedia (2015), pp. 1119–1122
74. P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, P.O. Ogunbona, Action recognition from depth maps
using deep convolutional neural networks. IEEE Trans. Hum.-Mach. Syst. 46(4), 498–509
(2016)
75. D. Wu, L. Pigou, P. Kindermans, N.D. Le, L. Shao, J. Dambre, J. Odobez, Deep dynamic
neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern
Anal. Mach. Intell. 38(8), 1583–1597 (2016)
76. Y. Hou, S. Wang, P. Wang, Z. Gao, W. Li, Spatially and temporally structured global to local
aggregation of dynamic depth information for action recognition. IEEE Access 6, 2206–2219
(2018)
77. P. Wang, S. Wang, Z. Gao, Y. Hou, W. Li, Structured images for RGB-D action recognition,
in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (2017),
pp. 1005–1014
102. F.A. Setyawan, J.K. Tan, H. Kim, S. Ishikawa, Detection of moving objects in a video captured
by a moving camera using error reduction, in SICE Annual Conference, Sapporo, Japan, (Sept.
2014) (2004), pp. 347–352
103. Y. Jin, L. Tao, H. Di, N. I. Rao, G. Xu, Background modeling from a free-moving camera by
Multi-layer homography algorithm, in 2008 15th IEEE International Conference on Image
Processing (2008), pp. 1572–1575
104. P. Lenz, J. Ziegler, A. Geiger, M. Roser, Sparse scene flow segmentation for moving object
detection in urban environments, in Intelligent Vehicles Symposium (IV), 2011 IEEE (2011),
pp. 926–932
105. L. Gong, M. Yu, T. Gordon, Online codebook modeling based background subtraction with a moving camera, in 2017 3rd International Conference on Frontiers of Signal Processing (ICFSP) (2017), pp. 136–140
106. Y. Wu, X. He, T.Q. Nguyen, Moving Object Detection with a Freely Moving Camera via
Background Motion Subtraction. IEEE Trans. Circuits Syst. Video Technol. 27(2), 236–248
(2017)
107. Y. Zhu, A.M. Elgammal, A multilayer-based framework for online background subtraction
with freely moving cameras, in ICCV, 2017, pp. 5142–5151
108. S. Minaeian, J. Liu, Y.-J. Son, Effective and Efficient Detection of Moving Targets from a
UAV’s Camera. IEEE Trans. Intell. Transp. Syst. 19(2), 497–506 (2018)
109. M. Braham, M. Van Droogenbroeck, Deep background subtraction with scene-specific con-
volutional neural networks, in IEEE International Conference on Systems, Signals and Image
Processing (IWSSIP), Bratislava 23–25 May 2016 (2016), pp. 1–4
110. T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in European
Conference on Computer Vision (2010), pp. 282–295
111. X. Yin, B. Wang, W. Li, Y. Liu, M. Zhang, Background subtraction for moving cameras
based on trajectory-controlled segmentation and label inference. KSII Trans. Internet Inf.
Syst. 9(10), 4092–4107 (2015)
112. S. Zhang, J.-B. Huang, J. Lim, Y. Gong, J. Wang, N. Ahuja, M.-H. Yang, Tracking persons-of-interest via unsupervised representation adaptation (2017). arXiv:1710.02139
113. P. Rodríguez, B. Wohlberg, Translational and rotational jitter invariant incremental principal
component pursuit for video background modeling, in 2015 IEEE International Conference
on Image Processing (ICIP) (2015), pp. 537–541
114. S.E. Ebadi, V.G. Ones, E. Izquierdo, Efficient background subtraction with low-rank and
sparse matrix decomposition, in 2015 IEEE International Conference on Image Processing
(ICIP) (2015), pp. 4863–4867
115. T. Bouwmans, A. Sobral, S. Javed, S.K. Jung, E.-H. Zahzah, Decomposition into low-rank
plus additive matrices for background/foreground separation: A review for a comparative
evaluation with a large-scale dataset. Comput. Sci. Rev. 23, 1–71 (2017)
116. I. Elhart, M. Mikusz, C.G. Mora, M. Langheinrich, N. Davies, Audience Monitor—an Open Source Tool for Tracking Audience Mobility in front of Pervasive Displays
117. Intel AIM Suite, Intel Corporation. https://aimsuite.intel.com/
118. Fraunhofer IIS, Fraunhofer AVARD. http://www.iis.fraunhofer.de/en/ff/bsy/tech/bildanalyse/
avard.html
119. G. M. Farinella, G. Farioli, S. Battiato, S. Leonardi, G. Gallo, Face re-identification for digital
signage applications, in Video Analytics for Audience Measurement (2014), pp. 40–52
120. N. Gillian, S. Pfenninger, S. Russell, J.A. Paradiso, Gestures everywhere: a multimodal sensor fusion and analysis framework for pervasive displays, in Proceedings of The International Symposium on Pervasive Displays (2014), pp. 98:98–98:103
121. G. Tripathi, K. Singh, D. Kumar, Convolutional neural networks for crowd behaviour analysis: a survey. Vis. Comput. (2018)
122. T. Li, H. Chang, M. Wang, B. Ni, R. Hong, Crowded scene analysis: a survey. IEEE Trans. Circuits Syst. Video Technol. 25(3), 367–386 (2015)
123. R. Leggett, Real-Time Crowd Simulation: A Review (2004)
124. M. Hu, S. Ali, M. Shah, Detecting global motion patterns in complex videos, in 2008 19th
International Conference on Pattern Recognition (ICPR) (2008), pp. 1–5
125. X. Wang, X. Yang, X. He, Q. Teng, M. Gao, A high accuracy flow segmentation method in
crowded scenes based on streakline. Optik Int. J. Light Electron Opt. 125(3), 924–929 (2014)
126. S. Wu, B.E. Moore, M. Shah, Chaotic invariants of Lagrangian particle trajectories for anomaly
detection in crowded scenes, in 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (2010), pp. 2054–2060
127. R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model,
in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 935–942
128. R. Mehran, B.E. Moore, M. Shah, A streakline representation of flow in crowded scenes, in
Computer Vision—ECCV 2010 (2010), pp. 439–452
129. H. Su, H. Yang, S. Zheng, Y. Fan, S. Wei, The large-scale crowd behavior perception based
on spatio-temporal viscous fluid field. IEEE Trans. Inf. Forensics Secur. 8(10), 1575–1589
(2013)
130. M. Hu, S. Ali, M. Shah, Learning motion patterns in crowded scenes using motion flow field,
in 2008 19th International Conference on Pattern Recognition (2008), pp. 1–5
131. P. Jodoin, Y. Benezeth, Y. Wang, Meta-tracking for video scene understanding, in 2013 10th
IEEE International Conference on Advanced Video and Signal Based Surveillance (2013),
pp. 1–6
132. Y. Benabbas, N. Ihaddadene, C. Djeraba, Motion pattern extraction and event detection for
automatic visual surveillance. J. Image Video Process. 2011, 7 (2011)
133. S.C. Shadden, F. Lekien, J.E. Marsden, Definition and properties of Lagrangian coherent
structures from finite-time Lyapunov exponents in two-dimensional aperiodic flows. Phys. D
Nonlinear Phenom. 212(3–4), 271–304 (2005)
134. L. Kratz, K. Nishino, Tracking pedestrians using local spatio-temporal motion patterns in
extremely crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 987–1002 (2012)
135. Y. Cong, J. Yuan, J. Liu, Abnormal event detection in crowded scenes using sparse represen-
tation. Pattern Recognit. 46(7), 1851–1864 (2013)
136. M. Lewandowski, D. Simonnet, D. Makris, S.A. Velastin, J. Orwell, Tracklet reidentification
in crowded scenes using bag of spatio-temporal histograms of oriented gradients, in Mexican
Conference on Pattern Recognition (2013), pp. 94–103
137. C. Kuo, C. Huang, R. Nevatia, Multi-target tracking by on-line learned discriminative appear-
ance models, in 2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (2010), pp. 685–692
138. S. Bąk, D.-P. Chau, J. Badie, E. Corvee, F. Brémond, M. Thonnat, Multi-target tracking by
discriminative analysis on Riemannian manifold, in 2012 19th IEEE International Conference
on Image Processing (ICIP) (2012), pp. 1605–1608
139. B. Zhou, X. Wang, X. Tang, Understanding collective crowd behaviors: learning a mixture
model of dynamic pedestrian-agents, in 2012 IEEE Conference on Computer Vision and
Pattern Recognition (2012), pp. 2871–2878
140. W. Chongjing, Z. Xu, Z. Yi, L. Yuncai, Analyzing motion patterns in crowded scenes via
automatic tracklets clustering. China Commun. 10(4), 144–154 (2013)
141. B. Zhou, X. Wang, X. Tang, Random field topic model for semantic region analysis in crowded
scenes from tracklets, in CVPR 2011 (2011), pp. 3441–3448
142. K. Xu, X. Jiang, T. Sun, Anomaly Detection Based on Stacked Sparse Coding With Intraframe
Classification Strategy. IEEE Trans. Multimed. 20(5), 1062–1074 (2018)
143. B.T. Morris, M.M. Trivedi, A survey of vision-based trajectory learning and analysis for
surveillance. IEEE Trans. Circuits Syst. Video Technol. 18(8), 1114–1127 (2008)
144. L. Brun, A. Saggese, M. Vento, Dynamic scene understanding for behavior analysis based on
string Kernels. IEEE Trans. Circuits Syst. Video Technol. 24(10), 1669–1681 (2014)
145. C. Piciarelli, C. Micheloni, G.L. Foresti, Trajectory-Based anomalous event detection. IEEE
Trans. Circuits Syst. Video Technol. 18(11), 1544–1554 (2008)
146. D. Tran, J. Yuan, D. Forsyth, Video event detection: from subvolume localization to spa-
tiotemporal path search. IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 404–416 (2014)
147. S. Coşar, G. Donatiello, V. Bogorny, C. Garate, L.O. Alvares, F. Brémond, Toward abnormal
trajectory and event detection in video surveillance. IEEE Trans. Circuits Syst. Video Technol.
27(3), 683–695 (2017)
148. X. Song, X. Shao, Q. Zhang, R. Shibasaki, H. Zhao, J. Cui, H. Zha, A fully online and
unsupervised system for large and high-density area surveillance: tracking, semantic scene
learning and abnormality detection. ACM Trans. Intell. Syst. Technol. 4(2), 35:1–35:21 (2013)
149. A.R. Revathi, D. Kumar, An efficient system for anomaly detection using deep learning
classifier. Signal, Image Video Process. 11(2), 291–299 (2017)
150. O.P. Popoola, K. Wang, Video-Based abnormal human behavior recognition—a review. IEEE
Trans. Syst. Man Cybern. Part C (Applications Rev.) 42(6), 865–878 (2012)
151. Y. Yuan, Y. Feng, X. Lu, Statistical hypothesis detector for abnormal event detection in
crowded scenes. IEEE Trans. Cybern. 47(11), 3597–3608 (2017)
152. G. Xiong, J. Cheng, X. Wu, Y.-L. Chen, Y. Ou, Y. Xu, An energy model approach to people
counting for abnormal crowd behavior detection. Neurocomputing 83, 121–135 (2012)
153. S. Yi, X. Wang, C. Lu, J. Jia, L0 regularized stationary time estimation for crowd group
analysis, in 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014),
pp. 2219–2226
154. Y. Zhang, L. Qin, R. Ji, H. Yao, Q. Huang, Social attribute-aware force model: exploiting rich-
ness of interaction for abnormal crowd detection. IEEE Trans. Circuits Syst. Video Technol.
25(7), 1231–1245 (2015)
155. K. Cheng, Y. Chen, W. Fang, Video anomaly detection and localization using hierarchical fea-
ture representation and Gaussian process regression, in 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2015), pp. 2909–2917
156. Y. Lee, Y. Yeh, Y.F. Wang, Anomaly detection via online oversampling principal component
analysis. IEEE Trans. Knowl. Data Eng. 25(7), 1460–1470 (2013)
157. B. Krausz, C. Bauckhage, Loveparade 2010: automatic video analysis of a crowd disaster.
Comput. Vis. Image Underst. 116(3), 307–319 (2012)
158. D. Lee, H. Suk, S. Park, S. Lee, Motion influence map for unusual human activity detection and
localization in crowded scenes. IEEE Trans. Circuits Syst. Video Technol. 25(10), 1612–1623
(2015)
159. C.C. Loy, T. Xiang, S. Gong, Salient motion detection in crowded scenes, in 2012 5th Inter-
national Symposium on Communications, Control and Signal Processing (2012), pp. 1–4
160. S. Vishwakarma, A. Agrawal, A survey on activity recognition and behavior understanding
in video surveillance. Vis. Comput. 29(10), 983–1009 (2013)
161. J. Xu, S. Denman, S. Sridharan, C. Fookes, R. Rana, Dynamic texture reconstruction from
sparse codes for unusual event detection in crowded scenes, in Proceedings of the 2011 Joint
ACM Workshop on Modeling and Representing Events (2011), pp. 25–30
162. Y. Cong, J. Yuan, Y. Tang, Video anomaly search in crowded scenes via spatio-temporal
motion context. IEEE Trans. Inf. Forensics Secur. 8(10), 1590–1599 (2013)
163. M. Thida, H. Eng, P. Remagnino, Laplacian eigenmap with temporal constraints for local
abnormality detection in crowded scenes. IEEE Trans. Cybern. 43(6), 2147–2156 (2013)
164. K. Yu, Y. Lin, J. Lafferty, Learning image representations from the pixel level via hierarchical
sparse coding, in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2011), pp. 1713–1720
165. W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes.
IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 18–32 (2014)
166. C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 FPS in MATLAB, in 2013 IEEE
International Conference on Computer Vision (2013), pp. 2720–2727
167. C. Lu, J. Shi, J. Jia, Scale adaptive dictionary learning. IEEE Trans. Image Process. 23(2),
837–847 (2014)
168. S. Han, R. Fu, S. Wang, X. Wu, Online adaptive dictionary learning and weighted sparse cod-
ing for abnormality detection, in 2013 IEEE International Conference on Image Processing
(2013), pp. 151–155
169. B. Zhao, L. Fei-Fei, E.P. Xing, Online detection of unusual events in videos via dynamic
sparse coding, in CVPR 2011 (2011), pp. 3313–3320
170. A. Kanehira, L. Van Gool, Y. Ushiku, T. Harada, Viewpoint-aware video summarization, in
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
171. D. Potapov, M. Douze, Z. Harchaoui, C. Schmid, Category-Specific Video Summarization,
in Computer Vision—ECCV 2014 (2014), pp. 540–555
172. H. Yang, B. Wang, S. Lin, D.P. Wipf, M. Guo, B. Guo, Unsupervised extraction of video
highlights via robust recurrent auto-encoders, in 2015 IEEE International Conference on
Computer Vision (2015), pp. 4633–4641
173. B. Zhao, E.P. Xing, Quasi real-time summarization for consumer videos, in 2014 IEEE Con-
ference on Computer Vision and Pattern Recognition (2014), pp. 2513–2520
174. A. Sharghi, B. Gong, M. Shah, Query-focused extractive video summarization, in European
Conference on Computer Vision (2016), pp. 3–19
175. A. Sharghi, J.S. Laurel, B. Gong, Query-focused video summarization: dataset, evaluation,
and a memory network based approach, in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), (2017), pp. 2127–2136
176. R. Panda, A. Das, Z. Wu, J. Ernst, A.K. Roy-Chowdhury, Weakly supervised summarization
of web videos, in 2017 IEEE International Conference on Computer Vision (ICCV) (2017),
pp. 3677–3686
177. M. Otani, Y. Nakashima, E. Rahtu, N. Yokoya, Video Summarization using Deep Semantic
Features, pp. 1–16
178. B. Gong, W.-L. Chao, K. Grauman, F. Sha, Diverse sequential subset selection for supervised
video summarization, in Advances in Neural Information Processing Systems 27, ed. by Z.
Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger (Curran Associates,
Inc., 2014), pp. 2069–2077
179. M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool, Creating Summaries from User
Videos, in Computer Vision—ECCV 2014 (2014), pp. 505–520
180. M. Gygli, H. Grabner, L. Van Gool, Video summarization by learning submodular mixtures
of objectives, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015), pp. 3090–3098
181. A. Kulesza, B. Taskar, Determinantal point processes for machine learning. Found. Trends
Mach. Learn. 5(2–3), 123–286 (2012)
182. Y. J. Lee, J. Ghosh, K. Grauman, Discovering important people and objects for egocentric
video summarization, in 2012 IEEE Conference on Computer Vision and Pattern Recognition
(2012), pp. 1346–1353
183. D. Liu, G. Hua, T. Chen, A hierarchical visual model for video object summarization. IEEE
Trans. Pattern Anal. Mach. Intell. 32(12), 2178–2190 (2010)
184. B.A. Plummer, M. Brown, S. Lazebnik, Enhancing video summarization via vision-language
embedding, in Computer Vision and Pattern Recognition, vol. 2 (2017)
185. M. Sun, A. Farhadi, T. Chen, S. Seitz, Ranking highlights in personal videos by analyzing
edited videos. IEEE Trans. Image Process. 25(11), 5145–5157 (2016)
186. K. Zhang, W.-L. Chao, F. Sha, K. Grauman, Summary transfer: exemplar-based subset selec-
tion for video summarization, in Proceedings of the IEEE conference on computer vision and
pattern recognition (2016), pp. 1059–1067
187. F. Chen, C. De Vleeschouwer, Formulating team-sport video summarization as a resource
allocation problem. IEEE Trans. Circuits Syst. Video Technol. 21(2), 193–205 (2011)
188. E. Elhamifar, M.C.D.P. Kaluza, Online summarization via submodular and convex optimiza-
tion, in CVPR (2017), pp. 1818–1826
189. M. Fleischman, B. Roy, D. Roy, Temporal feature induction for baseball highlight classi-
fication, in Proceedings of the 15th ACM international conference on Multimedia (2007),
pp. 333–336
190. R. Hong, J. Tang, H.-K. Tan, S. Yan, C. Ngo, T.-S. Chua, Event driven summarization for web
videos, in Proceedings of the First SIGMM Workshop on Social Media (2009), pp. 43–48
191. A. Khosla, R. Hamid, C. Lin, N. Sundaresan, Large-Scale video summarization using Web-
Image priors, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013),
pp. 2698–2705
192. G. Kim, L. Sigal, E.P. Xing, Joint summarization of large-scale collections of web images
and videos for storyline reconstruction, in 2014 IEEE Conference on Computer Vision and
Pattern Recognition (2014), pp. 4225–4232
193. Z. Lu, K. Grauman, Story-Driven summarization for egocentric video, in 2013 IEEE Confer-
ence on Computer Vision and Pattern Recognition (2013), pp. 2714–2721
194. B. Mahasseni, M. Lam, S. Todorovic, Unsupervised video summarization with adversarial
lstm networks, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
vol. 1 (2017)
195. Y. Song, J. Vallmitjana, A. Stent, A. Jaimes, TVSum: summarizing web videos using titles,
in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015),
pp. 5179–5187
196. Y.T. Tesfaye, E. Zemene, Multi-target Tracking in Multiple Non-overlapping Cameras using
Constrained Dominant Sets, pp. 1–15
197. Y. Wang, S. Velipasalar, M.C. Gursoy, Distributed wide-area multi-object tracking with non-
overlapping camera views. Multimed. Tools Appl. 73(1), 7–39 (2014)
198. Y.T. Tesfaye, E. Zemene, M. Pelillo, A. Prati, Multi-object tracking using dominant sets. IET
Comput. Vis. 10(4), 289–297 (2016)
199. A. Roshan Zamir, A. Dehghan, M. Shah, GMCP-Tracker: global multi-object track-
ing using generalized minimum clique graphs, in Computer Vision—ECCV 2012 (2012),
pp. 343–356
200. A. Dehghan, S.M. Assari, M. Shah, GMMCP tracker: globally optimal generalized maximum
multi clique problem for multiple object tracking, in 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2015), pp. 4091–4099
201. C.-H. Kuo, C. Huang, R. Nevatia, Inter-Camera association of multi-target tracks by on-line
learned appearance affinity models, in Computer Vision—ECCV 2010 (2010), pp. 383–396
202. D. Cheng, Y. Gong, J. Wang, Q. Hou, N. Zheng, Part-Aware trajectories association across
non-overlapping uncalibrated cameras. Neurocomputing 230, 30–39 (2017)
203. Y. Gao, R. Ji, L. Zhang, A. Hauptmann, Symbiotic tracker ensemble toward a unified tracking
framework. IEEE Trans. Circuits Syst. Video Technol. 24(7), 1122–1131 (2014)
204. S. Zhang, Y. Zhu, A. Roy-Chowdhury, Tracking multiple interacting targets in a camera
network. Comput. Vis. Image Underst. 134, 64–73 (2015)
205. Y. Cai, G. Medioni, Exploring context information for inter-camera multiple target tracking,
in IEEE Winter Conference on Applications of Computer Vision (2014), pp. 761–768
206. J. You, A. Wu, X. Li, W.-S. Zheng, Top-Push video-based person re-identification, in 2016
IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 1345–1353
207. A.D. Jepson, D.J. Fleet, T.F. El-Maraghi, Robust online appearance models for visual tracking.
IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1296–1311 (2003)
208. J. Pan, B. Hu, Robust occlusion handling in object tracking, in 2007 IEEE Conference on
Computer Vision and Pattern Recognition (2007), pp. 1–8
209. A. Yilmaz, X. Li, M. Shah, Contour-based object tracking with occlusion handling in video
acquired using mobile cameras. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1531–1536
(2004)
210. A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, R. Bolle, Appearance models for
occlusion handling. Image Vis. Comput. 24(11), 1233–1243 (2006)
211. L. Hou, W. Wan, K.-H. Lee, J.-N. Hwang, G. Okopal, J. Pitton, Robust human tracking based
on DPM constrained multiple-kernel from a moving camera. J. Signal Process. Syst. 86(1),
27–39 (2017)
212. H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, M.J. Freedman, Live Video
Analytics at Scale with Approximation and Delay-Tolerance, in 14th USENIX Symposium on
Networked Systems Design and Implementation (NSDI) (2017)
213. G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, S.
Sinha, Real-Time video analytics: the killer app for edge computing. Computer (Long. Beach.
Calif) 50(10), 58–67 (2017)
214. F. Loewenherz, V. Bahl, Y. Wang, Video analytics towards vision zero. ITE 87, 25–28 (2017)
215. H. Qiu, X. Liu, S. Rallapalli, A.J. Bency, K. Chan, Kestrel: Video Analytics for Augmented
Multi-camera Vehicle Tracking, in 2018 IEEE/ACM Third International Conference on
Internet-of-Things Design and Implementation (IoTDI) (2018), pp. 48–59
216. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object
detection, in Proceedings of the IEEE conference on computer vision and pattern recognition
(2016), pp. 779–788
217. E.K. Bowman, M. Turek, P. Tunison, S. Thomas, R. Porter, V. Gintautas, P. Shargo, J. Lin,
Q. Li, X. Li, R. Mittu, C.P. Rosé, K. Maki, Advanced text and video analytics for proactive
decision making (2018)
218. K.P. Seng, Video Analytics for Customer Emotion and Satisfaction at Contact Centers. IEEE
Trans. Human-Mach. Syst. 48(3), 266–278 (2018)