
Chapter 15

Video Analytics for Visual Surveillance


and Applications: An Overview
and Survey

Iyiola E. Olatunji and Chun-Hung Cheng

Abstract Owing to the massive amount of video data being generated as a result
of the high proliferation of surveillance cameras, the manpower to monitor such
systems is relatively expensive. Passively monitoring surveillance video, however,
undermines the usefulness of surveillance cameras. Therefore, a drive to monitor events
as they happen is expedient to fully harness the massive data generated by surveil-
lance cameras. This is the main goal of video analytics. In this chapter, we extend
the notion of surveillance. Surveillance refers not only to monitoring for security or
safety purposes but encapsulates all aspects of monitoring to capture the dynamics
of different application domains including retail, transportation, service industries
and healthcare. This chapter presents a detailed survey of video analytics as well
as its applications. We present advances in video analytics research and emerging
trends from subdomains such as behavior analysis, moving object classification,
video summarization, object detection, object tracking, congestion analysis, abnor-
mality detection and information fusion from multiple cameras. We also summa-
rize recent developments in video analytics and intelligent video systems (IVS). We
evaluate state-of-the-art approaches to video analytics, including the deep learning
approach, and outline research directions with emphasis on algorithm-based analyt-
ics and applications. Hardware-related issues are excluded from this chapter.

Keywords Video analytics · Intelligent video system · Video surveillance ·
Computer vision · Video survey and applications

I. E. Olatunji (B) · C.-H. Cheng


Department of Systems Engineering and Engineering Management,
The Chinese University of Hong Kong, Sha Tin, Hong Kong
e-mail: [email protected]
C.-H. Cheng
e-mail: [email protected]

© Springer Nature Switzerland AG 2019
G. A. Tsihrintzis et al. (eds.), Machine Learning Paradigms,
Learning and Analytics in Intelligent Systems 1,
https://doi.org/10.1007/978-3-030-15628-2_15

15.1 Overview of Video Analytics

The widespread use of video cameras, especially mobile phones and low-cost
high-performance IP surveillance cameras, is contributing substantially to the expo-
nential growth of video data primarily used for surveillance. In China, over 170
million video surveillance cameras have been installed, and this number is expected
to increase to over 600 million by 2020 [1].
Surveillance is usually synonymous with crime and intrusion detection due to the
current increasing security issues in our society, but we extend the notion of surveil-
lance to resource monitoring in this chapter.
Therefore, we define surveillance as the monitoring of people's or objects' activities
and behavior for the purpose of intrusion detection, crime detection, scene monitoring
and resource tracking.
In a traditional video surveillance system, an operator is assigned to actively
watch videos captured by the cameras with the aim of tracking and detecting
any suspicious persons or potentially dangerous abandoned objects. However, this is
quite unrealistic considering the vast number of cameras and the many hours of
recording currently available.
Moreover, Caifeng et al. [2] have shown that the maximum attention span of
personnel monitoring a video task is 20 min.
Manually monitoring videos through personnel staring at several video screens for
many hours is also relatively expensive and highly prone to errors, resulting in many
missed events. Therefore, a drive to automatically analyze these video
data is of the essence. This is the goal of video analytics (VA): to ease the strenuous
task of manually monitoring several hours of video, provide real-time alerts when
situations of interest occur, and facilitate keyword search of enormous archives of
video via automation of the task.
Video analytics or video content analysis refers to automatic processing and under-
standing of video content in order to determine or detect spatio-temporal events and
extract information or knowledge about the observed scene. The generated video
content can be from a single camera or multiple cameras. Video analytics is a con-
stantly evolving field with novel techniques and algorithms continuously being devel-
oped in areas such as video semantic categorization, video retrieval from database,
human action recognition, summarization and anomaly detection. Video analytics
algorithms can either be implemented as a software package running in a central-
ized station where numerous servers are utilized for processing or as hardware on a
dedicated video processing unit.
Sport, retail, automotive, transport, security, entertainment, traffic control includ-
ing pedestrian crossings, digital real-time decision-making devices, and healthcare
are some of the several domains in which video analytics has been applied.
For example, the 2018 FIFA World Cup introduced a new phase of monitoring football
matches using video to enhance referees' capability. The system is called Video
Assistant Referee (VAR). VAR not only monitors the game but also analyzes players'
performance in real-time. Another video system used for monitoring in the 2018 FIFA
World Cup is goal line technology (GLT), which is used to
detect whether the football has crossed the goal line, something that is usually difficult
for the linesman or assistant referee to see. These technologies, driven by video
surveillance on a football field, provide the referee with sufficient information to make
decisions when disputes occur, such as a suspected penalty or a ferocious attack that
deserves a yellow or red card.
For resource monitoring, work by Cheng and Olatunji [3] showed that videos
can be used to monitor trolleys in an airport operation in real-time. This system
provides an up-to-date inventory of the available resource (trolleys) and significantly
reduces replenishment time, by about 70%, due to its real-time alert system when
there is a shortage of resources.
For autonomous cars, obstacle detection and path planning are important for effec-
tive navigation. This is achieved by analyzing the data captured by Light Detection
and Ranging (LiDAR) cameras or other 3D cameras installed in the car.
Cognitive factors such as attention span and time to react to an event can be
investigated by analyzing the video content of an activity or event. Real-time situation
awareness, target recognition, event prediction and prevention, and post-event data
analysis are key goals of intelligent video systems (IVS) fueled by video analytics.
Video analytics spans broad application areas such as biometrics (face, tat-
too, signature and iris recognition), detection and tracking (person, object, vehicle,
abandoned object and logos), text recognition from video, search and retrieval of
video content, geolocation and mapping, summarization and skimming, behavior
analysis and event analysis.
Typically, videos are segmented into frames or sets of still images. A single video
camera can produce about 25–30 fps, or more with 4K and 3D video cameras, which
is equivalent to thousands or millions of frames depending on the sampling rate and
length of the video. Similarly, it has been stated that video traffic will account for about 82%
of all internet traffic [4]. Therefore, considering the volume and speed of data
that can be generated by a single video source, algorithms to process the video must
offer real-time or near real-time solutions.
Most existing IVS are based on a centralized approach, but there has been
progressive research in the area of edge-based architectures. The centralized approach
involves routing videos to a base cloud or storage where all analytics take place,
whereas in an edge-based architecture, video content is analyzed near the source of
data generation. Edge-based architectures provide a better solution and avoid the band-
width and other costs associated with transporting data to the centralized station.
However, this is beyond the scope of this chapter; readers are referred to [5] for
more information. Video analytics algorithms are parameter-oriented, as parameters
are key factors in determining the accuracy of the output. Such parameters include
frame sampling, frame resolution and algorithmic parameters.
The most fundamental steps in automated video surveillance and monitoring
(VSAM) are background image estimation and updating the background image to
reflect changes in the background environment. However, since the emergence of
deep learning and the breakthrough of convolutional neural networks (CNN) in image,
video and audio classification problems, there has been widespread adoption of
CNNs, especially in their application to large-scale video classification [6]. Due to the
high computational cost, graphics processing units (GPUs) are leveraged to run video-
processing algorithms for effective performance gains by parallelizing a
number of vision algorithms [7, 8]. However, computational details are excluded as
they are not the focus of this chapter.
Video analytics systems based on deep learning models such as CNNs form the
basis of state-of-the-art analytics systems applied in smart cities and real-time appli-
cations. The deep learning approach requires enormous amounts of data and training
time to complete a task such as object segmentation, classification or detection.
Despite enormous effort in developing automated VSAM systems, current surveil-
lance systems are not entirely capable of autonomously analyzing complex events
from observed scenes. To address this problem, independent work on several areas
such as object tracking, behavior understanding, object classification, summariza-
tion and motion segmentation is combined to form a composite video analytics
framework for video surveillance.
This chapter presents a general overview of video analytics, current state-of-the-art
methods and the integration of different aspects of video analytics algorithms to form an
intelligent video system (IVS). The rest of the chapter is organized as follows: Sect. 2
discusses the theory of video analytics systems, especially the deep learning
approach, which is the current state-of-the-art. Section 3 details a survey of
the algorithms and tasks involved in video analytics for surveillance. In Sect. 4, we
discuss the application of analyzing video surveillance data in daily life operations.
Section 5 presents research directions and concludes the chapter.

15.2 Theory of Video Analytics

Video analytics has been a major area of research over the last few decades, with
several techniques developed to overcome challenges such as the accuracy and
precision of analyzing video data. This section details the core concepts of some early
works and also recent works spurred by deep learning. The core mathematical
foundations discussed in this section have been applied to different application domains
such as object tracking, segmentation and motion detection, to name a few.

15.2.1 Background Initialization or Estimation

Background image estimation and updating the background image to reflect changes
in the environment are the fundamental steps in automated VSAM. Several approaches
to background image estimation of moving objects exist. They include the pixel-based
approach, block-based approach, neural network approach, and Gaussian Mixture
Model (GMM). Readers are advised to read [9] for a comprehensive survey of scene
background estimation.

Fig. 1 Background estimation from a cluttered scene: a initial frame from video scene, b–c back-
ground initialization, d estimated background. Images adapted from [13]

Pixel-based approaches extract in-depth shape information of moving objects by
modelling each pixel separately. However, segmentation in pixel-based approaches is
highly susceptible to changing backgrounds. Block-based approaches are less sensitive
to changing backgrounds because the image is divided into blocks and the features
extracted from these blocks are used for background modeling. However, they cannot
obtain in-depth shape information of moving objects like the pixel-based method. To
overcome the shortcomings of both methods, an artificial neural network algorithm for
background subtraction was proposed by Zhou et al. [10] to automatically extract
background information and detect moving objects based on the extracted background
information. GMM requires a large memory space and its convergence is slow. Ishii
et al. [11] developed an online method for improving GMM where each Gaussian
has a learning parameter that approaches basic recursive learning after several obser-
vations, by combining the expectation maximization (EM) method with recursive fil-
tering of observations. The drawback of the above methods, however, is that their
performance degrades when handling complex situations such as different weather
conditions, camera instability and noise, and non-static backgrounds. Pham et al. [12]
proposed the Improved Adaptive Gaussian Mixture Model (IAGMM) to cater for some
of the above problems. An example of background estimation is shown in Fig. 1.
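For readers who want to experiment, the minimal sketch below uses OpenCV's MOG2 subtractor, an adaptive Gaussian mixture model in the same family as the GMM approaches above, to build a background estimate from a video. The input file name and parameter values are illustrative assumptions, not settings taken from the cited works.

```python
import cv2

# Adaptive Gaussian mixture background model (MOG2); parameter values are illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)

cap = cv2.VideoCapture("sample.mp4")  # placeholder input video
background = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)             # per-pixel foreground mask
    background = subtractor.getBackgroundImage()  # current background estimate
cap.release()

if background is not None:
    cv2.imwrite("estimated_background.png", background)
```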
Background subtraction is the fundamental method for dynamic image analysis.
It models the background for detecting moving objects by identifying statistically sig-
nificant changes from a background model and separating foreground objects from it.
However, when the camera is also in motion, it is computationally complex to
model such backgrounds, as consecutive frames cannot be compared directly. To solve
this problem, Zhang et al. [14] used dense optical flow fields to estimate camera motion
and segment the moving object over multiple frames. However, optical flow methods
are prone to errors at varying object boundaries. Gelgon and Bouthemy [15] combined
color segmentation and motion-based regions to solve this problem, but their method is
computationally expensive and not real-time.
Angelov et al. [16] proposed a real-time and less computationally expensive method
for detecting moving objects with a moving camera. For Abandoned Object Detection
(AOD), background subtraction and ground-truth homography were combined to
detect objects and handle occlusion [17]. Several methods, such as object-tracking-
based methods, stationary object detection algorithms, drop-off event detection algo-
rithms, and the color and shape information of static foreground objects, have been
used either in part or fused together for AOD [18–20]. An interesting work by Guler
[7] used two foreground images (one long-term and one short-term) to define AOD
as a situation where an object's pixels are in the long-term foreground, but not in the
short-term background. Ramirez-Alonso et al. [21] proposed a temporal weighted
learning model for background estimation that can reinitialize and update parameters
adaptively. Details of their proposed method are discussed below. There are four mod-
ules in the system, with the first module handling scenes with static objects. Module
2 deals with normal scenes that can contain dynamic objects. In module 3, a threshold
is defined in order to separate the background from foreground objects, and module 4
caters for dynamic scene changes using the Speeded-Up Robust Features (SURF) algo-
rithm to align information. Once the information about the background is aligned,
the background model can be updated and the background can be estimated.
Let $W_{BS\_HLR}$ and $W_{BS\_LLR}$ be the adaptive weight arrays updated by high and
low learning-rate parameters respectively, $L_R$ be the learning rate values, and $F_{BE\_HLR}$ and $F_{BE\_LLR}$
be the foreground models of the estimated background in the RGB color space.
Classification of the foreground is performed by Eqs. (1) and (2):

$F_{BE\_HLR}(x, t) = \begin{cases} 1 & \text{if } \left\lVert W_{BS\_HLR}(x, t)_{RGB} - I(x, t)_{RGB} \right\rVert_2 > \varepsilon_1 \\ 0 & \text{otherwise} \end{cases}$   (1)

$F_{BE\_LLR}(x, t) = \begin{cases} 1 & \text{if } \left\lVert W_{BS\_LLR}(x, t)_{RGB} - I(x, t)_{RGB} \right\rVert_2 > \varepsilon_2 \\ 0 & \text{otherwise} \end{cases}$   (2)

where $x$ is the pixel location $[x, y]$ of an $M \times N$ image ($1 < x < M$ and $1 <
y < N$) and $t \in \mathbb{N}$ is the time index. $\varepsilon_1$ and $\varepsilon_2$ are the threshold values. If the result
of the Euclidean distance is more than the threshold value, the pixel is classified as
foreground. A zero in the foreground indicates that the pixel is part of the background.
Weights are updated adaptively by Eq. (3):

$W_{BE}(x, t + 1)_{RGB} = W_{BE}(x, t)_{RGB}\,(1 - L_R(x, t)) + L_R(x, t)\, I(x, t)_{RGB}$   (3)

where $L_R$ is the matrix holding the learning rate value for each weight, defined by
Eq. (4). The exponential factor in Eq. (4) generates fast learning in the initial frames.
Therefore, the value of $t_a$ is initialized to 0 and is increased by 1 if changes
occur between consecutive input frames, as given in Eq. (5).

$L_R(x, t) = S_0\, e^{-t_a / T_f} + A_0\, A(x, t) + L_0$   (4)

$t_a = \begin{cases} t_a + 1 & \text{if } \rho_v < 0.998 \\ t_a & \text{otherwise} \end{cases}$   (5)

where $\rho_v$ is the Pearson correlation coefficient between the input frames at time $t$
and $t - 1$. $L_0$ is a classifier constraint whose value is < 0.1 and $T_f$ is the number of
frames. $S_0$ is the bootstrap learning constant and $A_0$ is the scaling factor for $A(x, t)$.
$A(x, t)$ is the scene information matrix that defines the learning of each pixel and is
used for classifying pixels based on the values of $F_{BE\_HLR}(x, t)$ and $F_{BE\_LLR}(x, t)$.
$A(x, t)$ can be calculated as:

$A(x, t) = 1 - \frac{1}{2}\left(F_{BE\_HLR}(x, t) + F_{BE\_LLR}(x, t)\right)$   (6)

When $A(x, t) = 0$, the pixel represents a dynamic object or a ghost detection. If $A(x, t) =
0.5$, the pixel is classified as a dynamic object in only one foreground model, or as a ghost
only in $F_{BE\_LLR}$. If $A(x, t) = 1$, then the pixel is classified as background in both
foreground models.
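The following NumPy sketch mirrors the classification and update rules of Eqs. (1)–(6). The constants (thresholds, $S_0$, $A_0$, $L_0$, $T_f$) are illustrative placeholders rather than the values tuned in [21].

```python
import numpy as np

def classify_foreground(W, I, eps):
    """Eqs. (1)/(2): a pixel is foreground (1) if the Euclidean distance between
    the weight array W and the input frame I (both M x N x 3, RGB) exceeds eps."""
    dist = np.linalg.norm(W.astype(float) - I.astype(float), axis=2)
    return (dist > eps).astype(np.uint8)

def update_weights(W, I, LR):
    """Eq. (3): per-pixel blending of the input frame into the background weights."""
    return W * (1.0 - LR[..., None]) + LR[..., None] * I.astype(float)

def learning_rate(t_a, A, S0=0.5, A0=0.01, L0=0.05, Tf=100):
    """Eq. (4): bootstrap exponential term plus a term driven by scene information A."""
    return S0 * np.exp(-t_a / Tf) + A0 * A + L0

def scene_information(F_hlr, F_llr):
    """Eq. (6): A = 1 for background in both models, 0.5 in one, 0 in none."""
    return 1.0 - 0.5 * (F_hlr.astype(float) + F_llr.astype(float))
```

In a complete implementation, the high- and low-learning-rate weight arrays would each produce a foreground mask via `classify_foreground`, the two masks would feed `scene_information`, and `t_a` would be incremented whenever the Pearson correlation $\rho_v$ between consecutive frames drops below 0.998, as in Eq. (5).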

15.2.2 State-of-the-Art Video Analysis Model

Handcrafted features such as Histogram of Oriented Gradients (HOG) [22], Scale-
Invariant Feature Transform (SIFT) [23], Local Binary Patterns (LBP) [24], Local
Ternary Patterns (LTP) [25], and Haar [26–28] have been used for video analysis
and are usually considered shallow networks. These handcrafted features produce
high-dimensional feature vectors obtained by the aggregation of small local patches
of subsequent video frames for motion and appearance information. However, these
high-dimensional feature vectors do not scale well for large-scale video processing
in uncontrolled environments.
Convolutional Neural Network (CNN) or ConvNet-based video analytics systems
have shown superiority over shallow-network-based video analytics systems in terms
of precision and accuracy. Since the advent of CNNs, several video classification
methods [6, 29–31] based on CNNs have been proposed to learn features from raw
pixels, but they can only classify short video clips and still images. LeNet [32] was the first
successful application of ConvNets for handwritten digit recognition, achieving
an accuracy of 99.2% on the MNIST dataset [33]; it was followed by the deep CNN proposed
by Alex Krizhevsky et al. [31] (AlexNet), which achieved high accuracy on the ImageNet
dataset [34]. Several other methods have emerged since then. Dan Ciresan et al.
[35] used a multi-column deep neural network to classify images on the MNIST [33],
CIFAR [36] and NORB [37] datasets. DeepFace was proposed by Yaniv Taigman
et al. [38] for face verification from facial images on the Labeled Faces in the Wild
(LFW) dataset [39]. However, these methods only perform well on images and not
on videos. To solve this problem, Kang et al. [40] proposed a temporal convolution
network for extracting temporal information, which is fed into a CNN for object detection
in video tubelets (sequences of detection bounding boxes).
For event detection, Kang et al. [42] proposed a discriminative CNN. Karpathy
et al. [6], Ng et al. [30] and Zha et al. [43] used CNN architectures to perform video
classification by retaining the top layers of their neural networks for performance
optimization and reported better accuracy. High accuracy was also reported by Simonyan
et al. [44] using CNNs for action recognition. Other variations of CNN such as
GoogLeNet [45], ResNet [46], ZFNet [47] and VGGNet [48] have emerged following
the breakthrough of AlexNet [31].

Fig. 2 Illustration of object detection using DNN proposed by Szegedy et al. [41]

A mathematical and architectural analysis of video analytics based on the CNN
model follows.
In a video analytics system, objects are detected and extracted from video and
rescaled to a size of, say, 150 × 150 pixels as shown in Fig. 2. Normalization of the
objects is performed for better accuracy by transforming pixel values from 0–255 to
0–1 before feeding them into the deep neural network.
Videos are first decoded into frames, the number of which depends on the length of
the input video, and further analysis is performed on the generated frames. For example,
a 3 min (180 s) video generates 4500 frames at a rate of 25 fps (frames per second).
Let $X$ be the training set and $x_i$ the decoded frames from the video:

$X = \{x_1, x_2, x_3, \ldots, x_n\}$   (7)
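A minimal OpenCV sketch of this decoding step is shown below; the video path and sampling interval are placeholder assumptions.

```python
import cv2

def decode_frames(path, sample_every=1):
    """Decode a video into a list of frames x_1, ..., x_n (Eq. 7),
    optionally keeping only every k-th frame."""
    cap = cv2.VideoCapture(path)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

X = decode_frames("sample.mp4", sample_every=1)  # placeholder video path
print(len(X), "frames decoded")
```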

The number of channels is reduced from three (RGB) to one by converting the
frame to gray scale, which reduces processing time. Gray-scale
conversion has no effect on algorithm accuracy. For the object detection phase, the
converted gray-scale frame is input into an object detection algorithm and a bounding
box is created around the region of interest. A Haar cascade classifier [26] can be used
for object detection.
Let $(x, c)$ be a labeled frame where $x$ is the frame data and $c$ is the ground truth.
The corresponding bounding box after an object is detected is given by:

$R(x_0, y_0, \ldots, x_n, y_n)$   (8)

To extract the desired object from a video frame, cropping is performed around the
detected area and the cropped area of the frame serves as input to the object clas-
sification phase. The extracted objects are rescaled to a size w × h, say 150 × 150
pixels, and normalized before feeding them into the deep neural network (DNN).
Normalization of the objects is performed for better accuracy by transforming pixel
values from 0–255 to 0–1 before feeding them into the CNN as shown in Fig. 3.
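The sketch below strings the detection, cropping, rescaling and normalization steps together with OpenCV. It uses the frontal-face Haar cascade bundled with OpenCV purely as a stand-in detector; the cascade choice, target size and detector parameters are assumptions for illustration.

```python
import cv2
import numpy as np

# Haar cascade shipped with OpenCV, used here as a stand-in object detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_prepare(frame, size=(150, 150)):
    """Gray-scale the frame, detect objects, crop each bounding box,
    rescale to `size` and normalize pixel values to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in boxes:
        patch = cv2.resize(frame[y:y + h, x:x + w], size)
        crops.append(patch.astype(np.float32) / 255.0)  # 0-255 -> 0-1
    return crops

# Example: process the first decoded frame (assumes X from the decoding sketch above).
# crops = detect_and_prepare(X[0])
```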
CNNs, or DNNs in general, require a large amount of training data to give good per-
formance and to avoid overfitting. When training data is scarce, data transformations
(affine displacement fields) such as contrast variation, skew, translation, rotation and
flipping are performed on the input data set to generate additional training data,
so as to increase the accuracy of the DNN. For details about
how affine displacement can be used on video frames to generate additional data,
Yaseen et al. [50] give an intuitive explanation.

Fig. 3 Video-based person re-identification based on recurrent CNN architecture as proposed by [49]
Based on the input training data and/or transformed data, the CNN is trained to
classify and discriminate the generated classes. A typical architecture consists of
multiple alternating convolutional and subsampling layers.
The convolutional and sub-sampling layers are denoted mathematically in Eqs. (9)
and (10) respectively:

$\mathrm{Conv}_{k,l} = g(x_{k,l} * W_{k,l} + B_{k,l})$   (9)

$\mathrm{Sub}_{k,l} = g(x_{k,l} * w_{k,l} + b_{k,l})$   (10)

where $g(\cdot)$ is the activation function, $W$ and $B$ represent the weights and biases of
the system, and the sub-sampling layer consists of downsampled inputs. $*$ represents
the convolution operation performed between the inputs and the network weights.
The Rectified Linear Unit (ReLU) is the most commonly used activation function for non-
linearity. It models positive real numbers and helps in mitigating the vanishing gradi-
ent problem, with range [0, ∞). Other forms of activation functions include
sigmoid, Leaky rectified linear unit (Leaky ReLU), Parametric rectified linear unit
(PReLU), Randomized leaky rectified linear unit (RReLU), Exponential linear unit
(ELU), Scaled exponential linear unit (SELU), S-shaped rectified linear activation
unit (SReLU), Inverse square root linear unit (ISRLU), and Adaptive piecewise linear
(APL).
For the pooling layer, max pooling is used. The purpose of the pooling layer is
dimensionality reduction: downsampling the feature maps from the convolutional layer
and reducing the number of parameters so as to reduce computational cost. Yaseen
et al. [50] used local response normalization as a generalization technique. A typical
video-based CNN architecture consists of two convolutional layers followed by two
response normalization layers. Three max pooling layers are stacked underneath the
response layers, followed by the last convolutional layer. L2 regularization is used
in the architecture in order to avoid overfitting by penalizing network weights. The
output layer of the CNN is the softmax layer, used for optimizing the negative
log likelihood. It is given by:

$l(i, x_i^T) = M(e_i, f(x_i^T))$   (11)

where $f(x_i^T)$ is the function that calculates the output value and $e$ is the basis vector.
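As a hedged illustration of the kind of architecture described above (alternating convolutional, normalization and pooling layers, L2 weight decay, and a softmax output trained with the negative log likelihood), the PyTorch sketch below defines a small classifier for 150 × 150 RGB crops. The channel counts, kernel sizes and number of classes are illustrative assumptions, not the exact configuration used in [50].

```python
import torch
import torch.nn as nn

class SmallVideoObjectCNN(nn.Module):
    """Toy CNN for 150 x 150 RGB crops: conv -> LRN -> pool blocks with a softmax head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(2),                      # 150 -> 75
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(3),                      # 75 -> 25
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(5),                      # 25 -> 5
        )
        self.classifier = nn.Linear(64 * 5 * 5, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = SmallVideoObjectCNN()
criterion = nn.CrossEntropyLoss()   # softmax + negative log likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            weight_decay=1e-4)  # L2 regularization of network weights
```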

15.3 Algorithmic Domains and Tasks Involved in Video Analytics for Surveillance

15.3.1 Video Segmentation

A complex event is defined as a combination of several actions, objects and scenes
[51]. Uncertainty in video content makes complex event detection an extremely
challenging task. Moreover, a semantic concept such as a single event class label can-
not explicitly capture the description of the event [52]. Therefore, it is necessary to
combine multiple semantic concepts to describe an event. For instance, the event of
"playing soccer" can be easily associated with the action concepts of "running", "jump-
ing" and "kicking", the objects "ball" and "players", and the scene concept
"stadium". The single concept of "stadium" does not fully capture the descrip-
tion of the event, as several activities can occur in a stadium, such as running on the
track, the Super Bowl, etc. Therefore, video segmentation is needed to extract informative
segments about the event occurring in the video.
In video segmentation, it is important to consider temporal relations between key
segments in a specific event for effective event detection. Considering all possible
instances of event classes requires several training videos because intra-class varia-
tion exists in event videos. However, manual annotation of several videos is laborious.
Song and Wu [53] proposed an intuitive approach of automatically extracting key
segments for event detection in videos by learning from loosely labelled collections
of web images and videos. Loosely labelled images were collected from Google and
Flickr while videos were collected from YouTube. Their model is an adaptive latent
structural Support Vector Machine (SVM) model where the latent variables are the loca-
tions of key segments in the video. A set of semantic concepts was used for overall
content description while a single semantic concept was used for each local video
segment. A Temporal Relation Model (TRM) was proposed for the temporal relations
between video segments and a Segment-Event Interaction Model (SEIM) was used
for evaluating the correlation between key segments and events. The authors adapted
labelled web images and videos into their model and employed N
adjacent point sample consensus (NAPSAC) [54] for noise elimination in videos and
images.

Zhang et al. [55] created a knowledge base to reduce the semantic gap between com-
plex events by using large numbers of web images for learning noise-resistant classifiers to
effectively model event-centric semantic concepts. Group incremental learning of a
target classifier was proposed by Wang et al. [56], where each concept group com-
prises simple action videos and images queried from the Web. Long et al. [57] and
Duan et al. [58] proposed a transfer kernel learning method and a multiple source domain
adaptation method, respectively. In the multiple source domain adaptation method,
relevant image sources are selected for annotating videos.

15.3.1.1 Extracting Video Segments for Action or Event Detection

Action recognition in videos involves both segmentation and classification. These
problems can be addressed individually and sequentially with the use of a sliding tem-
poral window and aggregation, or both tasks can be performed simultaneously.
Song et al. [58] used a segment-based approach by dividing the video into a number
of segments for feature extraction and classification. The extracted segments are
used for event detection. Image sets were used by Song et al. [59] for detecting
key segments from complex videos. Hidden Markov Models (HMM) [60] and a global
dynamic pooling structure [61] have been proposed for video segmentation. Habibian
et al. [52] constructed a bag of 1346 concept detectors that were trained on the
ImageNet [34] and TRECVID [62] datasets to generate a large vocabulary for event
recognition.
Work on action recognition based on video segmentation can be categorized into
three areas: action segmentation, depth-based action recognition and deep learning based
motion recognition. We briefly discuss these processes relative to video segmentation.
A. Similarity-based model
Dynamic time warping (DTW) is a widely used method for action segmentation
[63–65]. The first step is to obtain difference images. Two consecutive gray-scale
images are first subtracted to obtain a difference image, which is then partitioned into
3 × 3 grid cells. The value of each cell is the average of the pixels within the cell,
and the motion feature flattens the difference image into a vector. The motion feature
is extracted from both the test and training videos and calculated for each frame. This
yields a 9 × (K − 1) matrix of motion features, where K is the total number of frames.
The two matrices represent two temporal sequences, and the DTW distance between
the two sequences is calculated by the Viterbi algorithm to segment the actions.
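A minimal NumPy sketch of the 9-dimensional per-frame motion feature and the DTW distance between two such sequences is given below. It uses the standard dynamic-programming recursion for DTW rather than the Viterbi formulation of the cited works, and assumes gray-scale frames of equal size; everything else is an illustrative assumption.

```python
import numpy as np

def motion_features(frames):
    """Per-frame 9-D motion feature: mean absolute difference of consecutive
    gray-scale frames inside each cell of a 3 x 3 grid."""
    feats = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(float) - prev.astype(float))
        h, w = diff.shape
        cells = [diff[i * h // 3:(i + 1) * h // 3,
                      j * w // 3:(j + 1) * w // 3].mean()
                 for i in range(3) for j in range(3)]
        feats.append(cells)
    return np.array(feats)            # shape (K - 1, 9)

def dtw_distance(a, b):
    """Classic DTW between two motion-feature sequences (rows are frames)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```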
Appearance-based methods of action segmentation from videos assume similarity
between the start and end frames of adjacent actions. Methods for identifying the
start and end frames of actions include the K-nearest neighbor (KNN) algorithm with HOG
[66], and quantity of movement (QOM) [67].
B. Depth-based approach
B. Depth-based approach

Besides appearance-based approaches, depth map based action recognition methods have
been proposed, including a combination of depth motion maps (DMM) and HOG [68],
using a graphical model to encode temporal information [69], histograms of oriented
4D normals (HON4D) [70], capturing local motion and geometry information [71],
and binary range-sample features [72]. However, all the proposed methods are dataset-
dependent and are based on hand-crafted features.
Deep learning methods have also been applied to the depth-based action recognition
approach. A variant of DMM based on CNN was applied by [73, 74]. Wu et al.
[75] used a 3D CNN as a feature extractor for depth data. Other techniques such
as structured images [76, 77] have been proposed for depth-based action
recognition.
C. Deep learning-based motion recognition
Deep learning approaches for motion recognition can be partitioned into one of four
categories:
Category 1 Video is viewed as a set of still images [29, 30]. In this category, each
channel of the images is input into one channel of a CNN. This approach is subop-
timal but performs quite well.
Category 2 Video is represented as a volume and the 2D filters of the CNN are replaced with
3D filters, thereby introducing a temporal dimension [78, 79]. This approach does not
work very well, probably due to the lack of annotated training sets.
Category 3 Video is regarded as a sequence of images and fed into a recurrent neural
network (RNN) [80–82]. RNNs allow for sequential parsing of video frames due
to their sensitivity to both long-term and short-term patterns and their memory cell-like
nature, encoding frame-level information in memory. This category performs at a similar
level to category 2.
Category 4 Video is represented as compact images and fed into a pre-trained CNN
architecture [83]. This approach achieved state-of-the-art action recognition performance
due to the pretraining. Ochs et al. [84] proposed a method of segmenting
moving objects using a semi-dense point tracker based on optical flow to produce
trajectories over several frames by long-term analysis of the motion vectors. They
claim that intricate details can be extracted by analyzing videos over a long period and
segmenting the meaningful or whole part of an object, instead of over a short time. Their
method performs better than two-frame optical flow and color-based segmentation
methods.

15.3.2 Moving Object Classification and Detection

Moving object detection from video is important due to its invaluable applications in
several domains such as intelligent video surveillance, human behavior
recognition, traffic control and action recognition.

15.3.2.1 Object Tracking

Object tracking in video surveillance involves the process of locating moving
object(s) in video. Several applications of object tracking in video surveillance have
been reported in the literature, such as resource tracking, customer queue analysis
and transport [3].
The general first step of object tracking in video is the extraction of foreground infor-
mation to detect the object. A background subtraction algorithm such as IAGMM [7,
12] is applied to model the background of the scene. Shadow removal
methods [85] can subsequently be applied to the foreground frame, since shadows
decrease tracking accuracy. The bounding box of an object is determined by a connected
component algorithm, which scans the object and groups its pixels
into components while calculating the bounding box and area of the object.
To ensure frame-to-frame matching of the detected object, methods such as adaptive
mean shift [86] can be used for comparison. Distance and size are some of the factors
that define object matching between frames. Subsequently, occlusion is detected
and resolved using any of the occlusion handling methods described later in 3.7 (Handling
occlusion).
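A minimal OpenCV sketch of the per-frame part of this pipeline is shown below: a mixture-of-Gaussians background subtractor supplies the foreground mask, shadow pixels are discarded, and connected components give per-object bounding boxes. The threshold and minimum-area values are illustrative assumptions, and frame-to-frame matching and occlusion handling are omitted.

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

def detect_objects(frame, min_area=200):
    """Return bounding boxes (x, y, w, h) of moving objects in one frame."""
    mask = subtractor.apply(frame)
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadow pixels (value 127)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    boxes = []
    for i in range(1, num):                      # label 0 is the background component
        x, y, w, h, area = stats[i]
        if area >= min_area:                     # ignore small noise blobs
            boxes.append((x, y, w, h))
    return boxes
```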

15.3.2.2 Motion Detection

The motion detection algorithm is similar to that of object tracking, with the exception of
consecutive-frame matching of the detected object. Moving objects can therefore
be detected via foreground images extracted by GMM background modelling, with a connected
component algorithm for noise removal. Upon applying the connected component algo-
rithm, the area of detection is refined and produces the bounding box information
of the moving objects [7]. Similarly, deep learning methods can be used. Moving object
detection operates on video sequences captured from a fixed or moving camera, and
the output is a binary mask representing the moving objects in each
frame of a particular sequence. Shadows, variations in illumination and cloud move-
ment make moving object detection a difficult task [87]. Methods for
moving object classification and detection can be categorized into moving camera
with moving object and stationary camera with moving object.
A. Moving object with stationary camera
Analyzing moving objects using fixed cameras, where background image pixels in the
frames remain the same throughout a video sequence, has been extensively studied
in the literature [88–92]. The generic approach to handling the fixed-camera prob-
lem is to model a stable background and apply a background subtraction technique
as described previously in Sect. 2 (theory of video analytics). Shantaiya et al. [93]
conducted an extensive review of object detection in video and grouped the meth-
ods into feature-based, motion-based, classifier-based and template-based models.
Categorization of object tracking in videos into point tracking, kernel tracking and
silhouette tracking, as well as feature-based, region-based and contour-based, was
performed by [93, 94] respectively.
However, moving object detection with a moving camera is more complicated
than with a fixed camera because of camera motion, and background modelling for gen-
erating foreground and background pixels fails [95].
B. Moving object with moving camera
Significant improvements have been made in cameras installed in drones and mobile
phones with powerful imaging capabilities. These cameras are non-stationary, and
methods that can handle moving object detection for moving cameras are required.
Methods for analyzing moving objects with fixed cameras are sufficient for
some surveillance tasks since most cameras do not move. However, recent surveillance
cameras are equipped with pan-tilt-zoom (PTZ) functionality or mounted on drones,
which may cause the cameras to move. Thus, approaches used in the fixed camera-
moving object setting are not directly applicable where the background image pixels
change position throughout the video sequence.
Challenges faced in moving object detection include defining the notion of a mov-
ing object in terms of the spatio-temporal relationship of pixels, variation in illumination
or lighting conditions, occlusion, changes in object appearance and reappearance,
complex backgrounds such as moving clouds, sudden camera motion or other abrupt
motion, and shadows.
Readers are referred to [96] for an extensive survey on moving object detection
with moving cameras, including the challenges of moving object detection in videos.
Solutions for moving object detection with moving cameras can be categorized
into three:
1. Background modelling based methods [97, 98]: This approach aims at creat-
ing a frame-by-frame background for each sequence using a motion compensation
method. Proposed algorithms include the Gaussian-based method [98], mixture of
Gaussians (MoG) [99], adaptive MoG [100], double Gaussian model [101], kernel-
based spatio-temporal model, Harris corner detector for feature point selection
[102], multi-layer homography transform [103], complex homography [104],
codebook modeling [105], thresholding [106], motion and appearance model
[107], background keypoints and segmentation [108], and a CNN-based method
[109] for background modelling. The complexity of background modelling-based
techniques is reasonable and thus well suited for real-time applications.
2. Trajectory classification [110, 111]: Involves computing long trajectories for fea-
ture points and discriminating trajectories that belong to different objects from
those of the background using clustering methods. Proposed algorithms include com-
pensating long-term motion based on optical flow techniques [110, 111], a bag-of-
words classifier, and a pre-trained CNN method for detecting moving object trajec-
tories [112].
3. Extension of background subtraction methods for static cameras [113, 114]:
low-rank and sparse matrix decomposition methods for static cameras [115] are
extended to moving cameras. This approach tries to determine whether there is coherency
between a set of image frames. If it exists, the low-rank representation of the matrix created
by these frames contains the coherency, and the sparse matrix representation con-
tains the outliers, which represent the moving objects in these frames. Low-rank
and sparse decomposition involves segmenting moving objects from the fixed
background by applying principal component pursuit (PCP). It is a valuable
technique in background modelling. The mathematical formulation and optimization of
this method can be found in [96].
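As a rough, hedged illustration of the low-rank plus sparse idea for the static-camera case that these works extend, the sketch below stacks gray-scale frames as columns of a data matrix, takes a truncated SVD (a simplification of the full principal component pursuit optimization) as the low-rank background, and thresholds the residual as the moving-object mask. The rank and threshold values are illustrative assumptions.

```python
import numpy as np

def lowrank_sparse_split(frames, rank=1, thresh=30.0):
    """frames: list of equally sized gray-scale frames (2-D uint8 arrays).
    Returns (background_frames, foreground_masks)."""
    h, w = frames[0].shape
    D = np.stack([f.reshape(-1) for f in frames], axis=1).astype(float)  # pixels x frames
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]      # low-rank background component
    S = D - L                                        # residual ~ sparse moving objects
    backgrounds = [L[:, i].reshape(h, w) for i in range(L.shape[1])]
    masks = [(np.abs(S[:, i]).reshape(h, w) > thresh).astype(np.uint8)
             for i in range(S.shape[1])]
    return backgrounds, masks
```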

15.3.3 Behavior Analysis

Interpreting behaviors from video footage is relatively fascinating, as context needed
for better understanding of the action can be extracted.
Human behavior can be understood through audience behavior, by analyzing user
interaction with digital displays [116]. Typically, a camera is
installed near the digital display or integrated into the display. The Intel Anonymous
Video Analytics (AIM) system [117] and Fraunhofer Avard [118] are commercial
tools that have been deployed for understanding audiences in terms of age, dwell time,
gender, view time and distance from the display through video processing. This can be
combined with sales data to improve advertising campaigns or efforts [119]. Gillian
et al. [120] developed a framework for analyzing user interaction with multiple dis-
plays, i.e. across displays, using depth cameras. In surveillance, audience detection
and tracking has been an interesting topic in crowd detection and estimation as well
as single entity tracking within a crowd. A crowd may refer to a group of people present
at the same place for different (unstructured) or the same (structured) reasons.
Crowd analysis involves studying crowd behavior and detecting abnormal behav-
iors in a static or dynamic (motion) scene. The analysis of crowd behavior becomes
more challenging when the density of the crowd is high, as shown in Fig. 4.

Fig. 4 a Crowd gathering at the train station during peak hours, b crowd gathering for religious
activity. Image retrieved from alamy.com and www.thereformationroom.com respectively

The significance of crowd analysis has grown with the increase in the world popu-
lation. For public safety, crowd management is very important in the construction of
shopping malls, stadiums, and subway stations in order to avoid stampedes or other
disastrous outcomes. Therefore, using cameras for crowd analysis is important to
detect or avoid terrorist attacks, bomb explosions, fire outbreaks and other incidents
that threaten public safety.
Crowd behavior analysis spans several areas, including pattern recogni-
tion, computer vision, mathematical modelling, artificial intelligence and data min-
ing. The meaning of crowd differs based on the situation; for example, 10 persons
gathering in a subway station can be regarded as a crowd. Crowd analysis involves
studying both group and individual behavior to determine abnormality. The def-
inition of abnormal behavior is quite ambiguous, which makes crowd analysis quite
an interesting area of research. An extensive survey of state-of-the-art deep learning
methods for crowd analysis can be found in [121].
The following attributes are used in analyzing crowds:
1. Counting and density estimation (congestion analysis)
2. Motion detection
3. Tracking
4. Behavior understanding.
Several factors must be considered when performing crowd analysis, including
terrain features, geometrical information and crowd flow.

15.3.3.1 Motion Feature Representation in Crowded Scenes

Motion features present an invaluable standpoint for the analysis of crowded scenes.
Existing work on motion feature analysis for crowded scenes can be divided into
flow-based features, local spatio-temporal features, and trajectories or tracklets [122].
These feature representations can be used for several tasks such as crowd behavior
recognition, abnormality detection in crowds and motion pattern segmentation.
A. Flow-based Features
Tracking a person in a highly crowded environment is extremely difficult. However,
in flow-based feature extraction, attention is given only to the occurrence, not the actor
(who is involved in what is happening). For example, looking at a single person's
action does not say much and may seem random, but an overall view of the crowd can be
conclusive [123]. Flow-based features are pixel-level features. Several methods have
been presented over the years [124–128]. Existing work is categorized as follows:
1. Optical Flow:
Optical flow involves computing pixel-wise motion between consecutive frames.
Optical flow handles multi-camera object motion and has been applied to detecting
crowd motion as well as crowd segmentation [129–132]. The drawback of optical
flow is that it cannot encapsulate spatio-temporal properties of the flow and does not
capture long-range dependencies (a minimal sketch of the dense optical flow
computation is given after this list).
2. Particle Flow:
Inspired by the Lagrangian framework of fluid dynamics [133], particle flow involves
moving a grid of particles with the optical flow, providing trajectories that map
a particle's initial position to its future or current position. This method has shown
dynamic application in crowd segmentation and detection of abnormal behavior in
crowds [122]. However, it suffers from time lag and cannot handle spatial changes.
3. Streak Flow:
Streaklines were introduced by Mehran et al. [128] for analyzing crowd video by
computing the motion field; the proposed method is called streak flow. Streak flow
overcomes the challenges of particle flow. Although it captures motion information
similar to particle flow, it responds to changes in the flow faster and performs well in
dynamic motion flow.
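Below is the dense optical flow sketch referred to in item 1, computing pixel-wise motion between two consecutive gray-scale frames with OpenCV's Farnebäck method; the parameter values follow commonly used defaults and are assumptions for illustration. The returned magnitude and orientation maps are the raw material from which flow-based crowd features are typically built.

```python
import cv2

def dense_flow(prev_gray, curr_gray):
    """Pixel-wise motion between consecutive gray-scale frames (Farneback dense flow).
    Returns the per-pixel flow field plus magnitude and orientation maps."""
    # Positional arguments: flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return flow, mag, ang
```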
B. Local Spatio-Temporal Features:
Less structured (very crowded) scenes have high variability and non-uniform move-
ment. Motion in this type of scene can be generated by any moving object, and
optical flow features cannot provide useful information about the motion. Local
spatio-temporal features are 2D patch or 3D cube representations of the scene.
They explore motion patterns and characterize their spatio-temporal distributions
on local 2D patches or 3D cubes. Spatio-temporal features are described below:
1. Spatio-temporal Gradients:
Kratz and Nishino [134] used a spatio-temporal motion pattern model to capture
steady-state motion behavior, and their results show that abnormal activities can
be detected.
2. Motion Histogram:
Motion histograms consider motion information within a local region. Comput-
ing motion orientation from motion histograms takes a considerable amount of time and
is highly susceptible to error; thus, it is not well suited for crowd analysis. How-
ever, several improved methods based on motion histograms have been proposed by
researchers.
Jodoin et al. [131] proposed the orientation distribution function (ODF), a feature
that does not carry any information about the magnitude of the flow but represents the
probability density of a particular motion orientation. The multiscale histogram of optical
flow (MHOF) was proposed by Cong et al. [135] as a feature descriptor that preserves
both spatio-contextual information and motion information.
C. Trajectory/Tracklet

Trajectories or tracklets represent motion by computing individual tracks. Motion
features such as the distance between objects or motion energy can be extracted from
the trajectories of objects and used to analyze crowd activities.
However, object detection and tracking in a highly dense crowd is very difficult.
Thus, the notion of a tracklet emerged due to the inability to obtain complete trajectories
in such settings.
A tracklet is a fragment or part of a trajectory obtained within a short period of time;
when occlusion occurs, the tracklet terminates. Tracklets have been used in the area of
human action recognition [136–138] by connecting them together to form a complete
trajectory. Quite a number of tracklet-based approaches have been proposed for
representing motion in crowded scenes [139–141]. The general idea is to extract
tracklets from dense regions and enforce spatio-temporal correlation between
tracklets to detect patterns of behavior in crowded scenes.
In general, spatio-temporal features have shown promising results in both motion
understanding and crowd anomaly detection.

15.3.4 Anomaly Detection

Anomaly detection is an application of crowd behavior analysis. Detecting anomalies
in videos is nontrivial due to variation in the definition of an anomaly, i.e. an anomaly in
one scene can be considered normal in another. This has attracted many researchers
and several methods have been proposed. The general approach is to learn what is
considered normal in a training video and use it to detect events that drift from
it (abnormalities). Occlusion, the distance between object and camera, and viewpoint
may cause variation and thus contribute to anomalies in video. Existing methods in
anomaly detection research can be categorized into trajectory-based methods, global
pattern-based methods, and grid pattern-based methods [142]. Table 1 provides a
summary of these methods.
A. Trajectory-based method of anomaly detection
Trajectory-based methods segment scenes into different objects while the objects are
tracked throughout the video sequence. The tracked object forms a trajectory which
defines the behavior of the object [143]. String kernels clustering [144], single-class
SVM [145], spatio-temporal path search [146], zone-based analysis [147], semantic
tracking [148] and deep learning-based approaches [149] have been used in evaluating
abnormality in trajectory-based methods.
B. Global pattern-based method of anomaly detection
Global pattern-based methods analyze the video sequence in its entirety by extracting low-
or medium-level features from the video using spatio-temporal gradients or optical flow
methods [150]. The advantage of this approach is that it does not individually track
each object in the video and is thus suitable for crowd analysis. However, locating the
position at which an anomaly occurs is non-trivial.

Table 1 Summary of anomaly detection methods

Reference                  | Methods
Global pattern-based methods
Popoola and Wang [150]     | Optical flow methods
Yuan et al. [151]          | Gaussian mixture model (GMM)
Xiong et al. [152]         | Energy model
Wang et al. [153]          | Stationary-map
Zhang et al. [154]         | Social force model (SFM)
Cheng et al. [155]         | Gaussian regression
Lee et al. [156]           | Principal component analysis (PCA) model
Krausz et al. [157]        | Global motion-map
Lee et al. [158]           | Motion influence map
Chen et al. [159]          | Salient motion map
Trajectory-based methods
Brun et al. [144]          | String kernels clustering
Piciarelli et al. [145]    | Single-class SVM
Tran et al. [146]          | Spatio-temporal path search
Cosar et al. [147]         | Zone-based analysis
Song et al. [148]          | Semantic tracking
Revathi and Kumar [149]    | Deep learning-based method
Grid pattern-based methods
Xu et al. [161]            | Sparse reconstruction of dynamic textures over an overcomplete basis set
Cong et al. [162]          | Motion context descriptor
Thida et al. [163]         | Spatio-temporal Laplacian Eigenmap method
Yu et al. [164]            | Hierarchical sparse coding
Li et al. [165]            | Multiscale splitting of frames
Lu et al. [166]            | Multiscale splitting of frames
Lu et al. [167]            | Adaptive dictionary learning
Han et al. [168]           | Online adaptive dictionary learning
Zhao et al. [169]          | Sparse coding and sliding window
Xu et al. [142]            | Stacked sparse coding (SSC) and SVM

Gaussian mixture model (GMM) [151], energy model [152], stationary-map [153],
social force model (SFM) [154], Gaussian regression [155], principal component
analysis (PCA) model [156], global motion-map [157], motion influence map [158],
and salient motion map [159] are approaches used in the global pattern-based method.
C. Grid pattern-based method of anomaly detection
In contrast with the global pattern-based methods, the grid pattern-based methods do
not consider frames as a whole but rather split frames into blocks and analyze patterns
individually on a block level [160]. Grid pattern-based methods are more
efficient due to the reduction in processing time achieved by evaluating patterns
individually at the block level and ignoring inter-object connections. Spatio-temporal
anomaly maps, local feature probabilistic frameworks, joint sparsity models, mixtures
of dynamic textures with GMM, low-rank and sparse decomposition (LSD), cell-
based texture analysis, sparse coding (SC) and deep networks are used in evaluating
grid pattern-based methods [142].
Xu et al. [161] used sparse reconstruction of dynamic textures over an overcom-
plete basis set to detect anomalies. Cong et al. [162] proposed the concept of searching
for the best match in the training dataset using a motion context descriptor. Thida et al.
[163] extracted diverse crowd activities from videos using a spatio-temporal Laplacian
Eigenmap method. All these methods are based on sparse coding (SC). The
notion behind SC is that abnormal events in videos are characterized by sparse linear
combinations of normal patterns with large reconstruction errors, while normal events
are characterized by small reconstruction errors. Yu et al. [164] classified events as
abnormal or normal using a hierarchical sparse coding method. Li et al. [165] and
Lu et al. [166] computed sparse representations at each scale by splitting frames into
multiple scales. To generate better representations of abnormal events, Lu et al. [167] and
Han et al. [168] proposed adaptive dictionary learning and online adaptive dictionary
learning respectively.
Sparse coding with a sliding window was adopted by Zhao et al. [169] for the detection
of abnormal events in videos.
Xu et al. [142] proposed a method of detecting anomalies in video based on stacked
sparse coding (SSC) with intra-frame classification. The video is first divided into
blocks. The appearance and motion features of each block are described by the fore-
ground interest point (FIP) descriptor and encoded by SSC. A support vector machine
(SVM) is used to evaluate the intra-frame classification to determine abnormality in
each block.
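A hedged sketch of the reconstruction-error idea shared by these sparse coding methods is given below: a dictionary is learned from descriptors of normal blocks, and a test block is flagged as abnormal when its sparse reconstruction error is large. It uses scikit-learn's dictionary learning and orthogonal matching pursuit as stand-ins for the cited formulations, and the dictionary size, sparsity level and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import orthogonal_mp

def learn_normal_dictionary(normal_feats, n_atoms=64):
    """normal_feats: (n_samples, n_features) array of block descriptors from normal video."""
    dl = DictionaryLearning(n_components=n_atoms, transform_algorithm="omp",
                            random_state=0)
    dl.fit(normal_feats)
    return dl.components_              # (n_atoms, n_features)

def is_abnormal(dictionary, feat, n_nonzero=5, thresh=0.5):
    """Flag a block as abnormal if its sparse reconstruction error is large."""
    coef = orthogonal_mp(dictionary.T, feat, n_nonzero_coefs=n_nonzero)
    recon = dictionary.T @ coef
    error = np.linalg.norm(feat - recon) / (np.linalg.norm(feat) + 1e-8)
    return error > thresh, error
```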

15.3.5 Video Summarization

Video summarization is the process of compressing a video by extracting only the
important parts of the video sequence. Video summarization is important due to the current
widespread use of cameras, which leads to massive amounts of data. It involves producing only
the significant or important highlights of a video that convey the overall story. The definition
of the significant or important aspects of a video varies based on several criteria. The goal
of video summarization is to generate a dense representation of a specific video.
Determining informative or important sections of a video requires understanding of
the video content. However, video content is diverse, thereby making summarization
a difficult task. Video summarization can also enhance video retrieval results.
Zhang et al. [170] defined a good summarization technique as one which is diverse,
representative of videos of a similar group and discriminative against videos in dissim-
ilar groups. Domain-specific video summarization methods were the early approach
used for determining the important segments of a video. For example, in sport, the spe-
cific structure of a game can indicate the important segments according to the rules governing
the sport. In movies, metadata such as the movie script and captions can be used in
generating video summaries.
Methods of dealing with video summarization can be categorized into unsuper-
vised, supervised, query-extractive and discriminative approaches.
Supervised and unsupervised approaches have been developed to encapsulate
domain knowledge for video summarization. The unsupervised summarization approach
creates summaries based on precise selection criteria. The supervised approach, on the other
hand, trains a summarization model using human-created summaries.
Potapov et al. [171] used a classifier's confidence score to define the important seg-
ments of a video. Methods in the supervised approach are difficult to generalize to
other genres because they are highly dependent on domain knowledge, but they offer
better performance.
The unsupervised approach is independent of domain knowledge and thus suitable
for generic applications. Yang et al. [172] used an auto-encoder to convert input video
features into a concise representation and reconstruct the input using the decoder. Zhao et al.
[173] proposed a method that reconstructs the rest of the original video based on a
video summary.
Query extractive summarization methods [174, 175] are a variant of summariza-
tion methods that generates summary based on keyword input. This model assumes
that a video can have multiple summaries. However, it may be unrealistic for real
applications due to frame-level importance annotation for each keyword.
Panda et al. [176] introduced discriminative information by training a spatio-temporal CNN to classify the category of each video and calculating importance scores by aggregating the gradients of the network's output. Kanehira et al. [170] proposed viewpoint-aware video summarization, in which the summary is built around the aspect of the video the viewer focuses on. To determine the viewpoint, they leverage other videos stored in folders on the viewer's laptop or phone and compute semantic similarity and dissimilarity between those videos and the video being watched to produce viewpoint-specific summaries. Otani et al. [177] proposed improving video summarization by using deep video features that encode several levels of content semantics, such as actions, scenes and objects; a deep neural network maps videos and descriptions into a semantic space, and clustering is applied to the segmented video content (a minimal sketch of this step is given below). Table 2 gives the taxonomy of video summarization methods.
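The sketch referred to above clusters hypothetical segment-level features with k-means (using scikit-learn) and keeps the segment closest to each centroid as a keyframe; it is only an illustration of the generic clustering step, not the pipeline of [177].

  import numpy as np
  from sklearn.cluster import KMeans

  # Hypothetical segment-level deep features (rows: segments, cols: pooled CNN features).
  rng = np.random.default_rng(2)
  features = rng.normal(size=(120, 256))

  k = 8                                              # desired number of keyframes
  km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

  # For each cluster, keep the segment closest to the centroid as its keyframe.
  keyframes = []
  for c in range(k):
      members = np.where(km.labels_ == c)[0]
      dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
      keyframes.append(int(members[np.argmin(dists)]))
  print("keyframe segment indices:", sorted(keyframes))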

Table 2 Taxonomy of video summarization methods


Reference Year of publication Summarization extraction method
Supervised summarization methods
Gong et al. [178] 2014 Used human-created summaries to train a
system to select informative and diverse
subsets of video using sequential
determinantal point process (seqDPP)
Gygli et al. [179] 2014 Video segmentation using set of
consecutive frames where the beginning
and end are aligned with positions of a
video suitable for a cut (superframe
segmentation)
Gygli et al. [180] 2015 Learnt important overall characteristics of
a summary by jointly optimizing multiple
objectives
Kulesza et al. [181] 2012 Built explanatory summaries by selecting
diverse sentences using determinantal
point processes (DPPs)
Lee et al. [182] 2012 Trained a regressor that predicts important
regions of a video using egocentric
features
Liu et al. [183] 2010 Elimination of irrelevant frames from
video to generate informative summaries
using window-level representation based
on probabilistic graphical model
Plummer et al. [184] 2017 Used Semantically-aware video
summarization technique by selecting a
sequence of segments that best represent
the content of input video
Potapov et al. [171] 2014 Proposed category-based video
summarization method that transforms
temporal segmentation into
semantically-consistent segments and
assigns scores to each segment
Sun et al. [185] 2014 Learning latent model for ranking
domain-specific highlights and comparing
raw video to edited video using latent
linear ranking model
Zhang et al. [186] 2016 Used human-created summaries to
automatically extract keyframe-based
video summarization by transferring
summary structures from annotated
videos to unseen videos
Unsupervised summarization methods
Chen et al. [187] 2011 Used combination of knowledge and
individual narrative preference to generate
summarized video content by segmenting
video contents into local stories
(continued)

Table 2 (continued)
Reference Year of publication Summarization extraction method
Elhamifar and Kaluza [188] 2017 Proposed an incremental subset selection
framework for generating summarized
videos by updating the set of representative
features based on the previously selected
representatives and a new batch of data
Fleischman et al. [189] 2007 Proposed temporal feature induction
method that extracts complex temporal
information from video for classifying
video highlights
Hong et al. [190] 2009 Used multi-video summarization
technique to determine key shots as a
combination of ranked list of web videos
and user-defined skimming ratio
Khosla et al. [191] 2013 Used web-image based prior information
to generate summarization obtained
through crowdsourcing for poor quality
videos
Kim et al. [192] 2014 Used storyline graph for creating
structural video summaries that illustrates
various events based on diversity ranking
between images and video frames
Lu and Grauman [193] 2013 Used text analysis based method to
determine random walk-based metric of
influence between sub shots which
captures event connectivity
Mahasseni et al. [194] 2017 Trained a system to learn a deep
summarizer network based on
autoencoder long short-term memory
network (LSTM)
Song et al. [195] 2015 Proposed a video summarization
framework called TVSum that detects
important shots based on titles of the
retrieved image
Query extractive summarization
Sharghi et al. [174] 2016 Proposed a method based on Sequential
and Hierarchical Determinantal Point
Process (SH-DPP) to select key shot
determined by the relevance of user query
relative to the video context
Sharghi et al. [175] 2017 Used extracted semantic information for
evaluating the performance of a video
summarizer
Discriminative method of video summarization
Kanehira et al. [170] 2018 Introduced a viewpoint approach to build
a summary that depends on what the
viewer focuses on, using classification
techniques that discriminate semantic
similarity between different groups

15.3.6 Information Fusion from Multiple Camera

In this section, we review multi-camera surveillance systems for wide-area video monitoring. Multi-camera settings are typical in most real-world surveillance systems, given the large number of cameras available and the impracticability of a one-camera-to-one-monitor methodology. Performing analytics with multiple cameras is nontrivial, since it requires modelling spatio-temporal relationships among objects and events, and sometimes careful configuration of camera views. Tracking with multiple cameras can be based on either overlapping or non-overlapping camera views. Overlapping fields of view are both economically and computationally expensive and require a well-designed network to correlate the overlapping views, whereas non-overlapping fields of view are more realistic and widely applied in real-world systems. We therefore focus on cameras with non-overlapping fields of view, with the goal of tracking multiple targets across cameras without losing track of any target as it moves between and among the cameras.
Several methods such as Probabilistic Petri Net-based approach, Dominant sets
clustering, Generalized maximum clique problem (GMCP), Generalized maximum
multi clique problem (GMMCP), Multiple Instance Learning (MIL), Markov Ran-
dom Fields (MRF), Multi-tracker ensemble framework, Structural Support Vector
Machine (SSVM), Spatio-temporal context information, Top-push distance learning
model (TDL), Recurrent neural network architecture and Constrained dominant sets
clustering (CDSC) have been proposed to combine views from multiple cameras.
Table 3 gives the details of these works.
Tesfaye et al. [196] proposed a composite three-layer hierarchical framework using the constrained dominant sets clustering (CDSC) technique for tracking objects across multiple non-overlapping cameras. The within-camera tracking problem is solved in the first two layers of the framework, while across-camera tracking is solved in the third layer by concurrently combining tracks of the same person from all cameras. CDSC works by finding constrained dominant sets in a graph, i.e. generating clusters (cliques) that capture a subset of the constraint set. The method can also be used for person re-identification, since the third layer can link broken tracks of the same person produced during within-camera tracking.
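A minimal sketch of the core idea, assuming a hypothetical tracklet-similarity matrix built from appearance and spatio-temporal cues: replicator dynamics extracts a dominant set, i.e. a tightly coupled group of tracklets that can be treated as one identity. The constrained variant and the three-layer framework of [196] are not reproduced here.

  import numpy as np

  # Hypothetical pairwise similarity matrix between tracklets seen by different
  # cameras (symmetric, zero diagonal). In practice it would be built from
  # appearance and spatio-temporal cues.
  rng = np.random.default_rng(3)
  n = 12
  A = rng.random((n, n))
  A = (A + A.T) / 2.0
  np.fill_diagonal(A, 0.0)

  # Replicator dynamics: the support of the converged weight vector is a
  # dominant set, i.e. a tightly coupled cluster of mutually similar tracklets.
  x = np.full(n, 1.0 / n)
  for _ in range(500):
      x = x * (A @ x)
      x /= x.sum()

  cluster = np.where(x > 1e-4)[0]
  print("tracklets grouped as one identity:", cluster)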

15.3.7 Handling Occlusion

Occlusion is one of the major problems in video analytics, and handling it effectively can greatly improve analytics accuracy. Occlusion can be either partial or total: an object may be occluded by other objects in the scene so that some of its parts are unseen (partial occlusion), or it may be completely hidden from the observed video frame (total occlusion). For instance, consider a Mini Cooper as the target object on a

Table 3 Recent works on object tracking using non-overlapping multiple camera


References Method Description
Wang et al. [197] Probabilistic Petri Net-based Tracks target based on
approach appearance features of objects
and spatio-temporal matching of
target across camera views
Tesfaye et al. [198] Dominant sets clustering Pairwise relationship between
detected objects in a temporal
sliding window are considered
and used as input into a fully
connected edge-weighted graph
Zamir et al. [199] Generalized maximum clique Used motion and appearance
problem (GMCP) feature to solve data association
between object in multiple
camera views based on
generalized maximum clique
problem (GMCP)
Dehghan et al. [200] Generalized maximum multi Formulated data association for
clique problem (GMMCP) tracking multiple object as a
generalized maximum multi
clique problem (GMMCP)
Kuo et al. [201] Multiple instance learning (MIL) Used on-line learned
discriminative appearance affinity
model to model association
between tracked objects
Chen et al. [202] Markov random Fields (MRF) Used human part configurations
to determine across-camera
spatio-temporal constraints and
pair-wise group activity
constraints for multi-target
tracking
Gao et al. [203] Multi-tracker ensemble Proposed a multi-tracker based
framework approach that captures
consistency between two
successive frames and pair-wise
correlation among different
trackers
Zhang et al. [204] Structural support vector machine Expressed multi-target tracking
(SSVM) as a network flow problem whose
solution can be obtained by
K-shortest paths algorithm
Cai et al. [205] Spatio-temporal context Spatio-temporal context
information information is used for
inter-camera tracking and for
discriminating appearances of
target
(continued)

Table 3 (continued)
References Method Description
You et al. [206] Top-push distance learning model Top-push constraint is used for
(TDL) matching video features of
persons instead of matching still
images of a person across
multiple camera. This approach
provides high-level
discriminative features and
provides better matching for
person re-identification in
multiple non-overlapping camera
views
McLaughlin et al. [49] Recurrent neural network CNN is used as a feature extractor
architecture for each frame combined with
temporal pooling layer for person
re-identification and detection
across multiple cameras
Tesfaye et al. [196] Constrained dominant sets Proposed a three-layer
clustering (CDSC) hierarchical framework using
CDSC technique for tracking
object across multiple
non-overlapping cameras and
detection of person
re-identification

freeway: because of its size, the Mini Cooper could be occluded by larger trucks and vans, which in turn affects its detection. Occlusion adversely affects object detection by changing the appearance model for a short time, which can disrupt tracking of the object. Figure 5 shows a real-life example of occlusion.
Several methods have been proposed for handling occlusion based on appearance models. Jepson et al. [207] used an Expectation-Maximization (EM) algorithm with an appearance model based on filter responses from a steerable pyramid to deal with the changing appearance of an object. Pan and Hu [208] proposed using spatio-temporal context information, obtained from a gradual analysis of the occlusion situation, to distinguish occluded objects effectively. A contour-based approach was proposed in [209], in which an energy function evaluated along the contour is minimized, making the object easier to track. More generally, maintaining appearance models of moving objects over time can effectively manage occlusion [210].
A method for handling occlusion with moving cameras has been proposed by Hou et al. [211]; it uses HOG features and a multiple-kernel tracker based on mean shift to discriminate among different moving-camera conditions and thereby handle occlusion effectively.
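A common heuristic, sketched below under simplifying assumptions rather than as a reproduction of any of the methods above: while the appearance match score of the tracked target is low (presumed occlusion), the tracker coasts on its last motion estimate and freezes the appearance model so that the occluder does not corrupt it.

  import numpy as np

  class OcclusionAwareTracker:
      """Toy single-object tracker: constant-velocity prediction plus a template
      (appearance) model that is frozen while the match score is low, i.e. while
      the target is presumed occluded."""

      def __init__(self, init_pos, init_template, match_threshold=0.5):
          self.pos = np.asarray(init_pos, dtype=float)
          self.vel = np.zeros_like(self.pos)
          self.template = np.asarray(init_template, dtype=float)
          self.match_threshold = match_threshold

      def match_score(self, patch):
          # Normalised correlation between the stored template and a candidate patch.
          a = self.template - self.template.mean()
          b = patch - patch.mean()
          return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

      def update(self, detected_pos, patch):
          patch = np.asarray(patch, dtype=float)
          score = self.match_score(patch)
          if score >= self.match_threshold:
              # Visible: update the motion state and slowly adapt the appearance model.
              new_pos = np.asarray(detected_pos, dtype=float)
              self.vel = new_pos - self.pos
              self.pos = new_pos
              self.template = 0.9 * self.template + 0.1 * patch
          else:
              # Presumed occluded: coast on the last velocity, keep the template frozen.
              self.pos = self.pos + self.vel
          return self.pos, score

Freezing the template during low-confidence frames prevents the occluding object from being learned as the target; once the match score recovers, normal updating resumes.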

Fig. 5 President Trump occluding Queen Elizabeth II. a Full or total occlusion; b–c partial occlusion. Images obtained from Twitter

15.4 Applications

Currently, video analytics drives many application domains, from self-driving cars and drones to surveillance, with the majority of models taking advantage of the deep neural network paradigm.
Video analytics in surveillance applications can be grouped into three categories:
1. Predictive analytics: This involves making a prediction or forecast about the future. It identifies risks and opportunities and assesses the likelihood that a similar subject in a different category will exhibit a specific attribute; simply put, it is the analysis of what will happen next. For example, predicting the next direction or move of a target based on its behavior in the current frame.
2. Prescriptive analytics: Prescriptive analytics encompasses the actions that should be taken in a given situation. For example, in resource tracking, when availability is low at a particular station, video analytics can suggest replenishing the busy station with resources from a station with low usage.
3. Forensics: Forensic applications of video analytics use videos to analyze what has already happened. In congestion analysis, for instance, when a situation of interest such as a stampede occurs, the footage can be used to trace back its cause.

General algorithms include facial recognition, vehicle tracking, motion detection, person (object) tracking and re-identification, loitering detection, vehicle license plate reading, slip and fall detection, abandoned object detection and crowd analysis. A full-fledged analytics system called VideoStorm [212] has been developed that tracks objects, classifies them and also allows querying of events from recorded video data. Real-life examples of predictive, prescriptive and forensic applications follow.
A. Application in traffic and transport
According to Ananthanarayanan et al. [213], traffic-related accidents are one of the leading causes of death in the world. Proactive measures for identifying risk and taking steps to prevent injuries on the road are required to reduce the number of traffic-related deaths. Most of the time there are early warning signs, such as near-collision events at specific locations, that can provide insight into why and when crashes occur, and monitoring traffic with video and analyzing it can also capture undocumented or unreported crashes. Traffic monitoring systems can use vision-based algorithms to detect and count the number of cars on a highway, and insight into steering and braking behavior can equally be provided. Analyzing the video data generated by the large number of traffic cameras installed in big cities is therefore invaluable. VisionZero [214] is one successful real-world application of video analytics for eliminating traffic-related deaths; it detects close calls between road users (pedestrians, cars and bikers), that is, situations in which something bad is about to happen, so that they can be avoided. For example, in an area where jaywalking is predominant, VisionZero can help authorities decide to install a crosswalk or adjust the traffic-light controllers.
Vehicle tracking is an important aspect of applying video analytics in transport. License plate tracking, collision-cause analysis, speeding detection and vehicle size estimation can all be obtained by analyzing video data, which provides in-depth information about each vehicle. A fully functional vehicle tracking system called Kestrel [215] has been developed; it is a video analytics system that uses information from heterogeneous networks of non-overlapping cameras to detect vehicle paths. The detected paths form a large corpus that users can query by location and event of interest, either in near real time or after the fact. The architecture of Kestrel is based on YOLO [216], a deep CNN for object detection that draws bounding boxes around detected objects in a frame.
B. Application in self-driving cars and augmented reality (AR)
For self-driving cars, video analytics plays a major role and the accuracy of the algorithms must be close to optimal. Several onboard cameras are installed to support driving decisions, and fusing the information from these cameras is critical, with zero tolerance for anomalies. Several algorithms have been developed for self-driving cars, including traffic-light detection, human detection, pedestrian-crossing detection and automatic braking; these are integrated to enhance sensing and detection capabilities. Dedicated short range communication (DSRC) can also be used to communicate with other approaching self-driving cars.
In AR, additional information is projected into the user's view, either as holograms or by recording the surrounding view and rendering objects directly on top of it when a headset is worn. AR systems use a combination of algorithms such as object detection, face recognition and object counting to analyze the observed scene.
C. Application in security
Body-worn cameras of law enforcement officers and government agencies have increased the volume of surveillance data, so applying video analytics to this data is of utmost importance. This has led to applications in crime detection and suspicious vehicle license plate detection. Products such as the Ring video doorbell (https://ring.com/) and the Nest video doorbell (https://nest.com/) have brought cameras into home applications in a new light. They can identify people and motion and detect events such as stray animals, burglary and package delivery, and they allow homeowners to communicate with visitors or guests from their mobile phones. Through video analytics, law enforcement and counter-terrorism agents can track public threats in real time, such as multiple coordinated attacks on public transport. Drones can also be used for home and public surveillance.
D. Application to warfighters and unmanned aerial vehicles (UAV)
Warfighters operate in an uncertain world and are compelled to make quick decisions. Video analytics from surveillance cameras can provide them with decision-relevant information: moving objects can be detected and tracked in real time, and the tracked objects are classified and anomalies detected [217]. Targeted suspects can be monitored via UAVs, also called drones. Most drones are equipped with small, low-cost, high-resolution cameras, and video footage taken from drones has contributed to the growth of data sources in the last few years [16]. Camera motion estimation and compensation can be performed on videos taken from a camera mounted on a UAV, without prior knowledge of the objects being detected, thereby reducing the amount of information transmitted to the ground in real time.
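A minimal sketch of such camera-motion compensation, assuming two consecutive grayscale frames with enough texture for feature tracking: sparse optical flow estimates a global affine motion, the previous frame is warped to cancel it, and the residual difference highlights independently moving objects.

  import cv2

  def motion_compensated_diff(prev_gray, curr_gray):
      """Cancel global (camera) motion between two grayscale frames and return the
      residual difference image, in which independently moving objects stand out."""
      pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                         qualityLevel=0.01, minDistance=7)
      pts_curr, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                        pts_prev, None)
      good = status.reshape(-1) == 1
      # Fit a similarity/affine transform to the tracked points (global camera motion).
      M, _inliers = cv2.estimateAffinePartial2D(pts_prev[good], pts_curr[good])
      h, w = curr_gray.shape
      warped_prev = cv2.warpAffine(prev_gray, M, (w, h))   # align previous frame
      return cv2.absdiff(curr_gray, warped_prev)           # threshold this for detection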
E. Application in customer service and queue analytics
Queuing is part of our daily activities, whether at the restaurant, bus station, train station, supermarket or coffee shop. Queue analysis provides the number of people waiting in line and trends over a given period. Queue length and body or face counts, although challenging, can be estimated from camera views, giving detailed insight into daily operations. Analyzing queues with existing surveillance cameras can improve human-resource scheduling decisions for efficient service, and can also benefit customers by displaying how long they will wait before being served.
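A very rough sketch of camera-based headcounting, assuming a single frame covering the queue area and using OpenCV's bundled Haar face detector; production systems would use stronger person detectors and tracking, but the idea is the same.

  import cv2

  # Rough headcount from one frame covering the queue area ("queue.jpg" is a
  # placeholder path). Uses the Haar cascade bundled with opencv-python.
  cascade = cv2.CascadeClassifier(cv2.data.haarcascades +
                                  "haarcascade_frontalface_default.xml")
  frame = cv2.imread("queue.jpg")
  if frame is None:
      raise SystemExit("could not read the input image")
  gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
  faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                   minSize=(30, 30))
  print("people detected in the queue region:", len(faces))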
Seng [218] used video analytics to quantify customers' emotion and satisfaction at a contact center. Emotion analysis can identify a customer's perception of a product in terms of its presentation and the interaction with the company's representative, providing competitive advantage and improving the customer experience. Recognized emotions can be translated into customer-satisfaction scores recovered from the continuous video data. More generally, customer behavior can be monitored and understood, which is a key factor in an organization's success.
F. Application in resource tracking
Effective utilization of resources is of great importance in business operations, since every organization is resource-constrained. Capacity management is complex and difficult, but a link has been established between capacity management and resource management: managing resources well can enhance capacity management, especially in a capacity-constrained environment [3]. Olatunji and Cheng [1, 3] applied video analytics to existing surveillance cameras to track trolleys in an airport operation in real time. Their approach combines a CNN with a motif detection algorithm to send timely alerts when replenishment of a resource is needed. The analytics running on CCTV footage can also detect abandoned bags, and each station's inventory of the resource can be tracked.
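A toy illustration of the alerting step only, under the assumption of a per-frame count stream produced by a detector; the fixed threshold and window below are placeholders and not the dynamic-threshold rule of [1].

  from collections import deque

  def replenishment_alerts(counts, window=30, threshold=5):
      """Yield frame indices at which the windowed average count at a station
      drops below a fixed threshold (a toy stand-in for dynamic-threshold alerting)."""
      recent = deque(maxlen=window)
      for i, c in enumerate(counts):
          recent.append(c)
          if len(recent) == window and sum(recent) / window < threshold:
              yield i

  # Example: per-frame trolley counts produced by a detector at one station.
  counts = [12, 11, 10, 9, 8, 8, 7, 6, 5, 4, 4, 3, 3, 2, 2, 2, 2, 2, 2, 2]
  print(list(replenishment_alerts(counts, window=10, threshold=5)))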
G. Application in healthcare and physical wellbeing
Healthcare is a crucial sector that requires special attention when applying video analytics. Analyzing video data from surveillance cameras can be very effective in understanding the behavior of a tracked person or object. For example, monitoring elderly homes can be quite challenging, but video analytics on the surveillance footage can be of great help when it runs algorithms such as fall detection, detects potential threats and triggers emergency notifications to the appropriate authorities, such as an emergency or ambulance unit. It also allows remote monitoring via smartphone or tablet, which provides more flexibility and alleviates the burden of being physically present at the elderly home. In gyms and fitness centers, footage from installed surveillance cameras can be analyzed to determine the most frequently used equipment and how much time a particular trainer spends on each piece of equipment. This information can then inform procurement of frequently used equipment or restructuring of the fitness center. The condition of gym members using the equipment can also be monitored remotely; for example, if a member falls off a treadmill, an alarm can be triggered so that a staff member attends to the person.

15.5 Conclusion and Research Direction

Information fusion between cameras and efficient analytic algorithms can greatly enhance the power of video analytics in domains such as healthcare, transportation, security and resource tracking. The goal of video analytics is to convert observed scenes into quantitative and actionable intelligence that provides real-time situational awareness of an event of interest as well as post-event evaluation. Combining other data, such as geographical information system (GIS) data, location data and other metadata, would form a composite video analytics suite and lead to more efficient algorithms. Similarly, more research should be conducted on visualizing camera position and perspective, including 3D projection, geospatial context and narrative time sequence, as these affect data quality and algorithm accuracy. Integrating social media data and augmented reality (AR) views into video analytics systems may increase algorithm robustness. Currently, application-specific systems exist, such as burglary detection, shoplifting detection and fall detection systems, to mention a few. Moving towards a generic video analytics framework instead of the current application-specific systems would be a good research direction.
In this chapter, we presented a concise survey of recent video analytics methods and techniques applied to surveillance camera data. We extended the definition of surveillance to resource management, which is important for capacity-constrained industries, and discussed the theory behind current state-of-the-art approaches to video analytics. It is envisaged that this chapter will serve as a starting point for new researchers in video analytics, giving them a broad knowledge of the field and helping them focus on any of the modules that make up an automated video surveillance and monitoring (VSAM) system.

References

1. I.E. Olatunji, C.-H. Cheng, Dynamic threshold for resource tracking in observed scenes. in
IEEE International Conference on Information, Intelligence, Systems and Applications (2018)
2. S. Caifeng, P. Fatih, X. Tao, G. Shaogang, Video Analytics for Business Intelligence (2012)
3. C.-H. Cheng, I. E. Olatunji, Harnessing constrained resources in service industries via video
analytics. Arch. Ind. Eng. J. (2018)
4. Cisco Visual Networking Index: Forecast and Methodology, 2016–2021, White Paper, vol. 1 (2016)
5. M. Ali, A. Anjum, M. U. Yaseen, A.R. Zamani, D. Balouek-Thomert, O. Rana, M. Parashar,
Edge enhanced deep learning system for large-scale video stream analytics, in 2018 IEEE
2nd International Conference on Fog and Edge Computing (ICFEC) (2018), pp. 1–10
6. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, large-scale video
classification with convolutional neural networks, in Proceedings of the 2014 IEEE Confer-
ence on Computer Vision and Pattern Recognition (2014), pp. 1725–1732
7. P. Guler, Real-Time Multi-camera Video Analytics System on GPU, no. (Mar 2013, 2015)
8. D.-S. Lee, Effective Gaussian mixture learning for video background subtraction. IEEE Trans.
Pattern Anal. Mach. Intell. 27(5), 827–832 (2005)
9. T. Bouwmans, L. Maddalena, A. Petrosino, Scene background initialization. Pattern Recogn.
Lett. 96, no. C, pp. 3–11, (2017)
10. Z. Zhou, D. Wu, X. Peng, Z. Zhu, C. Wu, J. Wu, Face, Tracking Based on Particle Filter with
Multi-feature Fusion (2013)
11. I. Ishii, T. Ichida, Q. Gu, T. Takaki, 500-fps face tracking system. J. Real-Time Image Process.
8(4), 379–388 (2013)

12. V. Pham, P. Vo, V.T. Hung, L.H. Bac, GPU implementation of extended gaussian mixture
model for background subtraction, in 2010 IEEE RIVF International Conference on Comput-
ing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)
(2010), pp. 1–4
13. V. Reddy, C. Sanderson, B.C. Lovell, A low-complexity algorithm for static background
estimation from cluttered image sequences in surveillance contexts. J. Image Video Process.
2011, 1:1–1:14 (2011)
14. G. Zhang, J. Jia, W. Xiong, T.-T. Wong, P.-A. Heng, H. Bao, Moving object extraction with
a hand-held camera, ICCV 2007. in IEEE 11th International Conference on Computer Visio
(2007), pp. 1–8
15. M. Gelgon, P. Bouthemy, A region-level motion-based graph representation and labeling for
tracking a spatial image partition. Pattern Recognit. 33(4), 725–740 (2000)
16. P. Angelov, P. Sadeghi-Tehran, C. Clarke, AURORA: autonomous real-time on-board video
analytics. Neural Comput. Appl. 28(5), 855–865 (2017)
17. E. Auvinet, E. Grossmann, C. Rougier, M. Dahmane, J. Meunier, Left-luggage detection using
homographies and simple heuristics
18. D. Emeksiz, A. Temizel, A Continuous Object Tracking System with Stationary and Moving
Camera Modes, vol. 854115, no. Oct 2012
19. P. Gil-Jiménez, R. López-Sastre, P. Siegmann, J. Acevedo-Rodríguez, S. Maldonado-Bascón,
automatic control of video surveillance camera sabotage, in Nature Inspired Problem-Solving
Methods in Knowledge Engineering (2007), pp. 222–231
20. A. Saglam, A. Temizel, Real-Time adaptive camera tamper detection for video surveillance, in
2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance
(2009), pp. 430–435
21. G. Ramirez-Alonso, J.A. Ramirez-Quintana, M.I. Chacon-Murguia, Temporal weighted learn-
ing model for background estimation with an automatic re-initialization stage and adaptive
parameters update. Pattern Recognit. Lett. 96, 34–44 (2017)
22. O. Déniz, G. Bueno, J. Salido, F. De la Torre, Face recognition using histograms of oriented
gradients. Pattern Recognit. Lett. 32(12), 1598–1603 (2011)
23. A.E. Abdel-Hakim, A.A. Farag, CSIFT: A SIFT descriptor with color invariant characteristics,
in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR’06), vol. 2 (2006), pp. 1978–1983
24. M.U. Yaseen, M.S. Zafar, A. Anjum, R. Hill, High performance video processing in cloud
data centres. IEEE Symp. Serv.-Oriented Syst. Eng. (SOSE) 2016, 152–161 (2016)
25. M.U. Yaseen, A. Anjum, N. Antonopoulos, Spatial frequency based video stream analysis for
object classification and recognition in clouds, in 2016 IEEE/ACM 3rd International Confer-
ence on Big Data Computing Applications and Technologies (BDCAT) (2016), pp. 18–26
26. M.U. Yaseen, A. Anjum, O. Rana, R. Hill, Cloud-based scalable object detection and classi-
fication in video streams. Futur. Gener. Comput. Syst. 80, 286–298 (2018)
27. A.R. Zamani, M. Zou, J. Diaz-Montes, I. Petri, O. Rana, A. Anjum, M. Parashar, Dead-
line constrained video analysis via in-transit computational environments. IEEE Trans. Serv.
Comput. 1 (2018)
28. A. Anjum, T. Abdullah, M. Tariq, Y. Baltaci, N. Antonopoulos, Video stream analysis in
clouds: an object detection and classification framework for high performance video analytics.
IEEE Trans. Cloud Comput. 1 (2018)
29. K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in
videos, in Proceedings of the 27th International Conference on Neural Information Processing
Systems, vol. 1 (2014), pp. 568–576
30. J.Y. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond Short Snippets: Deep Networks for Video Classification (2014), p. 4842
31. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional
neural networks, in Proceedings of the 25th International Conference on Neural Information
Processing Systems, vol. 1 (2012), pp. 1097–1105

32. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
33. Y. LeCun, C. Cortes, C.J.C. Burges, MNIST Handwritten Digit Database (2010)
34. J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical
image database, in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009,
pp. 248–255
35. D. Ciregan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image clas-
sification, in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2012), pp. 3642–3649
36. A. Krizhevsky, V. Nair, G. Hinton, The CIFAR-10 dataset (2014)
37. F.J. Huang, Y. LeCun, Large-scale learning with SVM and convolutional for generic object
categorization, in 2006 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), vol. 1, (2006) pp. 284–291
38. Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level
Performance in Face Verification, in 2014 IEEE Conference on Computer Vision and Pattern
Recognition (2014), pp. 1701–1708
39. G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments (2007)
40. K. Kang, X. Wang, Fully convolutional neural networks for crowd segmentation. CoRR (2014).
abs/1411.4464
41. C. Szegedy, A. Toshev, D. Erhan, Deep neural networks for object detection, in NIPS (2013)
42. K. Kang, W. Ouyang, H. Li, X. Wang, Object detection from video tubelets with convolutional
neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2016), pp. 817–825
43. S. Zha, F. Luisier, W. Andrews, N. Srivastava, R. Salakhutdinov, Exploiting image-trained
CNN architectures for unconstrained video classification, in BMVC (2015)
44. T. Pfister, K. Simonyan, J. Charles, A. Zisserman, Deep convolutional neural networks for
efficient pose estimation in gesture video, Asian Conf. Comput. Vis. 538–552 (2014)
45. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
A. Rabinovich, Going deeper with convolutions, in Proceedings of the IEEE conference on
computer vision and pattern recognition (2015), pp. 1–9
46. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
47. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in European
Conference on Computer Vision (2014), pp. 818–833
48. K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition (2014). abs/1409.1556
49. N. McLaughlin, J.M.D. Rincon, P. Miller, Recurrent convolutional network for video-based
person re-identification, in 2016 IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR) (2016), pp. 1325–1334
50. M.U. Yaseen, A. Anjum, N. Antonopoulos, Modeling and analysis of a deep learning pipeline
for cloud based video analytics, in Proceedings of the Fourth IEEE/ACM International Con-
ference on Big Data Computing, Applications and Technologies (BDCAT 2017)
51. S. Chen, N. Ram, DISCOVER: discovering important segments for classification of video
events and recounting. IEEE Conf. Comput. Vis. Pattern Recognit. (2014)
52. A. Habibian, C.G.M. Snoek, Recommendations for recognizing video events by concept
vocabularies. Comput. Vis. Image Underst. 124, 110–122 (2014)
53. H. Song, X. Wu, Extracting Key Segments of Videos for Event Detection by Learning From
Web Sources, vol. 20, no. 5 (2018), pp. 1088–1100
54. H. Wang, X. Wu, Y. Jia, Video annotation via image groups from the web. IEEE Trans.
Multimed. 16(5), 1282–1291 (2014)
55. X. Zhang, Y. Yang, Y. Zhang, H. Luan, J. Li, H. Zhang, T. Chua, Enhancing video event
recognition using automatically constructed semantic-visual knowledge base. IEEE Trans.
Multimed. 17(9), 1562–1575 (2015)

56. H. Wang, H. Song, X. Wu, Y. Jia, Video annotation by incremental learning from grouped
heterogeneous sources, in Asian Conference on Computer Vision (2014), pp. 493–507
57. M. Long, J. Wang, G. Ding, S.J. Pan, P.S. Yu, Adaptation regularization: a general framework
for transfer learning. IEEE Trans. Knowl. Data Eng. 26(5), 1076–1089 (2014)
58. L. Duan, D. Xu, S. Chang, Exploiting web images for event recognition in consumer videos:
A multiple source domain adaptation approach, in 2012 IEEE Conference on Computer Vision
and Pattern Recognition (2012), pp. 1338–1345
59. H. Song, X. Wu, W. Liang, Y. Jia, Recognizing key segments of videos for video annotation
by learning from web image sets. Multimed. Tools Appl. 76(5), 6111–6126 (2017)
60. K. Tang, L. Fei-Fei, D. Koller, Learning latent temporal structure for complex event detection,
in 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), pp. 1250–1257
61. W. Li, Q. Yu, A. Divakaran, N. Vasconcelos, Dynamic pooling for complex event recognition,
in 2013 IEEE International Conference on Computer Vision (2013), pp. 2728–2735
62. P. Over, G.M. Awad, J. Fiscus, M. Michel, A.F. Smeaton, W. Kraaij, TRECVID 2009: goals, tasks, data, evaluation mechanisms and metrics, in TRECVid Workshop (2010)
63. H.J. Escalante, I. Guyon, V. Athitsos, P. Jangyodsuk, J. Wan, Principal motion components
for one-shot gesture recognition. Pattern Anal. Appl. 20(1), 167–182 (2017)
64. J. Wan, Q. Ruan, W. Li, S. Deng, One-shot learning gesture recognition from RGB-D data
using bag of features. J. Mach. Learn. Res. 14, 2549–2582 (2013)
65. J. Wan, V. Athitsos, P. Jangyodsuk, H.J. Escalante, Q. Ruan, I. Guyon, CSMMI: class-specific
maximization of mutual information for action and gesture recognition. IEEE Trans. Image
Process. 23(7), 3152–3165 (2014)
66. D. Wu, F. Zhu, L. Shao, One shot learning gesture recognition from RGBD images, in 2012
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
(2012), pp. 7–12
67. F. Jiang, S. Zhang, S. Wu, Y. Gao, D. Zhao, Multi-layered gesture recognition with kinect. J.
Mach. Learn. Res. 16, 227–254 (2015)
68. X. Yang, C. Zhang, and Y. Tian, Recognizing actions using depth motion maps-based his-
tograms of oriented gradients, in Proceedings of the 20th ACM International Conference on
Multimedia (2012), pp. 1057–1060
69. W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in 2010 IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition—Workshops (2010),
pp. 9–14
70. O. Oreifej, Z. Liu, HON4D: histogram of oriented 4D normals for activity recognition from
depth sequences, in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern
Recognition (2013), pp. 716–723
71. X. Yang, Y. Tian, Super normal vector for activity recognition using depth sequences, in 2014
IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 804–811
72. C. Lu, J. Jia, C. Tang, Range-Sample depth feature for action recognition, in 2014 IEEE
Conference on Computer Vision and Pattern Recognition (2014), pp. 772–779
73. P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, P. Ogunbona, ConvNets-Based action recognition
from depth maps through virtual cameras and pseudocoloring, in Proceedings of the 23rd
ACM International Conference on Multimedia (2015), pp. 1119–1122
74. P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, P.O. Ogunbona, Action recognition from depth maps
using deep convolutional neural networks. IEEE Trans. Hum.-Mach. Syst. 46(4), 498–509
(2016)
75. D. Wu, L. Pigou, P. Kindermans, N.D. Le, L. Shao, J. Dambre, J. Odobez, Deep dynamic
neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern
Anal. Mach. Intell. 38(8), 1583–1597 (2016)
76. Y. Hou, S. Wang, P. Wang, Z. Gao, W. Li, Spatially and temporally structured global to local
aggregation of dynamic depth information for action recognition. IEEE Access 6, 2206–2219
(2018)
77. P. Wang, S. Wang, Z. Gao, Y. Hou, W. Li, Structured images for RGB-D action recognition,
in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (2017),
pp. 1005–1014

78. D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4489–4497
79. S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
80. V. Veeriah, N. Zhuang, G.-J. Qi, Differential recurrent neural networks for action recog-
nition, in Proceedings of the IEEE international conference on computer vision (2015),
pp. 4041–4049
81. Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action
recognition, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015), pp. 1110–1118
82. J. Liu, A. Shahroudy, D. Xu, G. Wang, Spatio-temporal lstm with trust gates for 3d human
action recognition, in European Conference on Computer Vision (2016), pp. 816–833
83. P. Wang, W. Li, Z. Gao, C. Tang, P.O. Ogunbona, Depth Pooling Based Large-Scale 3-D Action Recognition With Convolutional Neural Networks, vol. 20, no. 5 (2018), pp. 1051–1061
84. P. Ochs, J. Malik, T. Brox, Segmentation of Moving Objects by Long Term Video Analysis,
vol. 36, no. 6 (2014), pp. 1187–1200
85. R. Cucchiara, C. Grana, M. Piccardi, A. Prati, Detecting objects, shadows and ghosts in video
streams by exploiting color and motion information, in Proceedings of 11th International
Conference on Image Analysis and Processing, 2001( 2001), pp. 360–365
86. C. Beyan, A. Temizel, Adaptive mean-shift for automated multi object tracking. IET Comput.
Vis. 6(1), 1–12 (2012)
87. B. Risse, M. Mangan, B. Webb, L.D. Pero, Visual tracking of small animals in cluttered
natural environments using a freely moving camera, in 2017 IEEE International Conference
on Computer Vision Workshops (ICCVW) (2017), pp. 2840–2849
88. A. Sobral, A. Vacavant, A comprehensive review of background subtraction algorithms eval-
uated with synthetic and real videos. Comput. Vis. Image Underst. 122, 4–21 (2014)
89. T. Bouwmans, Recent advanced statistical background modeling for foreground detection—a
systematic survey. Recent Patents Comput. Sci. 4(3), 147–176 (2011)
90. V. Sharma, N. Nain, T. Badal, A survey on moving object detection methods in video surveil-
lance. Int. Bull. Math. Res. 2(1), 2019–2218 (2015)
91. A. Yilmaz, O. Javed, M. Shah, Object tracking. ACM Comput. Surv. 38(4) (2006)
92. T. Bouwmans, Traditional and recent approaches in background modeling for foreground
detection: an overview. Comput. Sci. Rev. 11–12, 31–66 (2014)
93. S. Shantaiya, K. Verma, K. Mehta, A survey on approaches of object detection. Int. J. Comput.
Appl. 65(18), 14–20 (2013)
94. B. Deori, D.M. Thounaojam, A survey on moving object tracking in video. Int. J. Inf. Theory
3(3), 31–46 (2014)
95. L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. Reid, S. Roth, Tracking the trackers: an
analysis of the state of the art in multiple object tracking (2017). arXiv1704.02781
96. M. Yazdi, T. Bouwmans, New trends on moving object detection in video images captured
by a moving camera: a survey. Comput. Sci. Rev. 28, 157–177 (2018)
97. P. Delagnes, J. Benois, D. Barba, Active contours approach to object tracking in image
sequences with complex background. Pattern Recognit. Lett. 16(2), 171–178 (1995)
98. C.R. Wren, A. Azarbayejani, T. Darrell, A.P. Pentland, P finder: real-time tracking of the
human body. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 780–785 (1997)
99. Hayman and Eklundh, Statistical background subtraction for a mobile observer, in Proceed-
ings Ninth IEEE International Conference on Computer Vision, vol. 1, (2003), pp. 67–74
100. Z. Zivkovic, F. Van Der Heijden, Efficient adaptive density estimation per image pixel for the
task of background subtraction. Pattern Recognit. Lett. 27(7), 773–780 (2006)
101. K.M. Yi, K. Yun, S.W. Kim, H.J. Chang, H. Jeong, J.Y. Choi, Detection of moving objects with
non-stationary cameras in 5.8 ms: bringing motion detection to your mobile device, in 2013
IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 27–34

102. F.A. Setyawan, J.K. Tan, H. Kim, S. Ishikawa, Detection of moving objects in a video captured
by a moving camera using error reduction, in SICE Annual Conference, Sapporo, Japan, (Sept.
2014) (2004), pp. 347–352
103. Y. Jin, L. Tao, H. Di, N. I. Rao, G. Xu, Background modeling from a free-moving camera by
Multi-layer homography algorithm, in 2008 15th IEEE International Conference on Image
Processing (2008), pp. 1572–1575
104. P. Lenz, J. Ziegler, A. Geiger, M. Roser, Sparse scene flow segmentation for moving object
detection in urban environments, in Intelligent Vehicles Symposium (IV), 2011 IEEE (2011),
pp. 926–932
105. L. Gong, M. Yu, T. Gordon, Online codebook modeling based background subtraction with
a moving camera,” in 2017 3rd International Conference on Frontiers of Signal Processing
(ICFSP), 2017, pp. 136–140
106. Y. Wu, X. He, T.Q. Nguyen, Moving Object Detection with a Freely Moving Camera via
Background Motion Subtraction. IEEE Trans. Circuits Syst. Video Technol. 27(2), 236–248
(2017)
107. Y. Zhu, A.M. Elgammal, A multilayer-based framework for online background subtraction
with freely moving cameras, in ICCV, 2017, pp. 5142–5151
108. S. Minaeian, J. Liu, Y.-J. Son, Effective and Efficient Detection of Moving Targets from a
UAV’s Camera. IEEE Trans. Intell. Transp. Syst. 19(2), 497–506 (2018)
109. M. Braham, M. Van Droogenbroeck, Deep background subtraction with scene-specific con-
volutional neural networks, in IEEE International Conference on Systems, Signals and Image
Processing (IWSSIP), Bratislava 23–25 May 2016 (2016), pp. 1–4
110. T. Brox, J. Malik, Object segmentation by long term analysis of point trajectories, in European
Conference on Computer Vision (2010), pp. 282–295
111. X. Yin, B. Wang, W. Li, Y. Liu, M. Zhang, Background subtraction for moving cameras
based on trajectory-controlled segmentation and label inference. KSII Trans. Internet Inf.
Syst. 9(10), 4092–4107 (2015)
112. S. Zhang, J.-B. Huang, J. Lim, Y. Gong, J. Wang, N. Ahuja, M.-H. Yang, Tracking persons-
of-interest via unsupervised representation adaptation (2017). arXiv1710.02139
113. P. Rodríguez, B. Wohlberg, Translational and rotational jitter invariant incremental principal
component pursuit for video background modeling, in 2015 IEEE International Conference
on Image Processing (ICIP) (2015), pp. 537–541
114. S.E. Ebadi, V.G. Ones, E. Izquierdo, Efficient background subtraction with low-rank and
sparse matrix decomposition, in 2015 IEEE International Conference on Image Processing
(ICIP) (2015), pp. 4863–4867
115. T. Bouwmans, A. Sobral, S. Javed, S.K. Jung, E.-H. Zahzah, Decomposition into low-rank
plus additive matrices for background/foreground separation: A review for a comparative
evaluation with a large-scale dataset. Comput. Sci. Rev. 23, 1–71 (2017)
116. I. Elhart, M. Mikusz, C.G. Mora, M. Langheinrich, N. Davies, Audience Monitor—an Open Source Tool for Tracking Audience Mobility in front of Pervasive Displays
117. Intel AIM Suite, Intel Corporation. https://aimsuite.intel.com/
118. Fraunhofer IIS, Fraunhofer AVARD. http://www.iis.fraunhofer.de/en/ff/bsy/tech/bildanalyse/avard.html
119. G. M. Farinella, G. Farioli, S. Battiato, S. Leonardi, G. Gallo, Face re-identification for digital
signage applications, in Video Analytics for Audience Measurement (2014), pp. 40–52
120. N. Gillian, S. Pfenninger, S. Russell, J.A. Paradiso, Gestures Everywhere: A Multimodal Sensor Fusion and Analysis Framework for Pervasive Displays, in Proceedings of The International Symposium on Pervasive Displays (2014), pp. 98:98–98:103
121. G. Tripathi, K. Singh, D. Kumar, Convolutional neural networks for crowd behaviour analysis
: a survey. Vis. Comput. (2018)
122. T. Li, H. Chang, M. Wang, B. Ni, R. Hong, Crowded Scene Analysis: A Survey, vol. 25, no. 3 (2015), pp. 367–386
123. R. Leggett, Real-Time Crowd Simulation: A Review (2004)

124. M. Hu, S. Ali, M. Shah, Detecting global motion patterns in complex videos, in 19th Inter-
national Conference on Pattern Recognition, 2008. ICPR 2008 (2008), pp. 1–5
125. X. Wang, X. Yang, X. He, Q. Teng, M. Gao, A high accuracy flow segmentation method in
crowded scenes based on streakline. Opt. J. Light Electron Opt. 125(3), 924–929 (2014)
126. S. Wu, B.E. Moore, M. Shah, Chaotic invariants of Lagrangian particle trajectories for anomaly
detection in crowded scenes, in 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (2010), pp. 2054–2060
127. R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model,
in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 935–942
128. R. Mehran, B.E. Moore, M. Shah, A streakline representation of flow in crowded scenes, in
Computer Vision—ECCV 2010 (2010), pp. 439–452
129. H. Su, H. Yang, S. Zheng, Y. Fan, S. Wei, The large-scale crowd behavior perception based
on spatio-temporal viscous fluid field. IEEE Trans. Inf. Forensics Secur. 8(10), 1575–1589
(2013)
130. M. Hu, S. Ali, M. Shah, Learning motion patterns in crowded scenes using motion flow field,
in 2008 19th International Conference on Pattern Recognition (2008), pp. 1–5
131. P. Jodoin, Y. Benezeth, Y. Wang, Meta-tracking for video scene understanding, in 2013 10th
IEEE International Conference on Advanced Video and Signal Based Surveillance (2013),
pp. 1–6
132. Y. Benabbas, N. Ihaddadene, C. Djeraba, Motion pattern extraction and event detection for
automatic visual surveillance. J. Image Video Process. 2011, 7 (2011)
133. S.C. Shadden, F. Lekien, J.E. Marsden, Definition and properties of Lagrangian coherent
structures from finite-time Lyapunov exponents in two-dimensional aperiodic flows. Phys. D
Nonlinear Phenom. 212(3–4), 271–304 (2005)
134. L. Kratz, K. Nishino, Tracking pedestrians using local spatio-temporal motion patterns in
extremely crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 987–1002 (2012)
135. Y. Cong, J. Yuan, J. Liu, Abnormal event detection in crowded scenes using sparse represen-
tation. Pattern Recognit. 46(7), 1851–1864 (2013)
136. M. Lewandowski, D. Simonnet, D. Makris, S.A. Velastin, J. Orwell, Tracklet reidentification
in crowded scenes using bag of spatio-temporal histograms of oriented gradients, in Mexican
Conference on Pattern Recognition (2013), pp. 94–103
137. C. Kuo, C. Huang, R. Nevatia, Multi-target tracking by on-line learned discriminative appear-
ance models, in 2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (2010), pp. 685–692
138. S. Bąk, D.-P. Chau, J. Badie, E. Corvee, F. Brémond, M. Thonnat, Multi-target tracking by discriminative analysis on Riemannian manifold, in 2012 19th IEEE International Conference on Image Processing (ICIP) (2012), pp. 1605–1608
139. B. Zhou, X. Wang, X. Tang, Understanding collective crowd behaviors: learning a mixture
model of dynamic pedestrian-agents, in 2012 IEEE Conference on Computer Vision and
Pattern Recognition (2012), pp. 2871–2878
140. W. Chongjing, Z. Xu, Z. Yi, L. Yuncai, Analyzing motion patterns in crowded scenes via
automatic tracklets clustering. China Commun. 10(4), 144–154 (2013)
141. B. Zhou, X. Wang, X. Tang, Random field topic model for semantic region analysis in crowded
scenes from tracklets. CVPR 2011, 3441–3448 (2011)
142. K. Xu, X. Jiang, T. Sun, Anomaly Detection Based on Stacked Sparse Coding With Intraframe
Classification Strategy, vol. 20, no. 5 (2018), pp. 1062–1074
143. B.T. Morris, M.M. Trivedi, A survey of vision-based trajectory learning and analysis for
surveillance. IEEE Trans. Circuits Syst. Video Technol. 18(8), 1114–1127 (2008)
144. L. Brun, A. Saggese, M. Vento, Dynamic scene understanding for behavior analysis based on
string Kernels. IEEE Trans. Circuits Syst. Video Technol. 24(10), 1669–1681 (2014)
145. C. Piciarelli, C. Micheloni, G.L. Foresti, Trajectory-Based anomalous event detection. IEEE
Trans. Circuits Syst. Video Technol. 18(11), 1544–1554 (2008)
146. D. Tran, J. Yuan, D. Forsyth, Video event detection: from subvolume localization to spa-
tiotemporal path search. IEEE Trans. Pattern Anal. Mach. Intell. 36(2), 404–416 (2014)

147. S. Coşar, G. Donatiello, V. Bogorny, C. Garate, L.O. Alvares, F. Brémond, Toward abnormal
trajectory and event detection in video surveillance. IEEE Trans. Circuits Syst. Video Technol.
27(3), 683–695 (2017)
148. X. Song, X. Shao, Q. Zhang, R. Shibasaki, H. Zhao, J. Cui, H. Zha, A fully online and
unsupervised system for large and high-density area surveillance: tracking, semantic scene
learning and abnormality detection. ACM Trans. Intell. Syst. Technol. 4(2), 35:1–35:21 (2013)
149. A.R. Revathi, D. Kumar, An efficient system for anomaly detection using deep learning
classifier. Signal, Image Video Process. 11(2), 291–299 (2017)
150. O.P. Popoola, K. Wang, Video-Based abnormal human behavior recognition—a review. IEEE
Trans. Syst. Man Cybern. Part C (Applications Rev.) 42(6), 865–878 (2012)
151. Y. Yuan, Y. Feng, X. Lu, Statistical hypothesis detector for abnormal event detection in
crowded scenes. IEEE Trans. Cybern. 47(11), 3597–3608 (2017)
152. G. Xiong, J. Cheng, X. Wu, Y.-L. Chen, Y. Ou, Y. Xu, An energy model approach to people
counting for abnormal crowd behavior detection. Neurocomputing 83, 121–135 (2012)
153. S. Yi, X. Wang, C. Lu, J. Jia, L0 regularized stationary time estimation for crowd group
analysis, in 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014),
pp. 2219–2226
154. Y. Zhang, L. Qin, R. Ji, H. Yao, Q. Huang, Social attribute-aware force model: exploiting rich-
ness of interaction for abnormal crowd detection. IEEE Trans. Circuits Syst. Video Technol.
25(7), 1231–1245 (2015)
155. K. Cheng, Y. Chen, W. Fang, Video anomaly detection and localization using hierarchical fea-
ture representation and Gaussian process regression, in 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2015), pp. 2909–2917
156. Y. Lee, Y. Yeh, Y.F. Wang, Anomaly detection via online oversampling principal component
analysis. IEEE Trans. Knowl. Data Eng. 25(7), 1460–1470 (2013)
157. B. Krausz, C. Bauckhage, Loveparade 2010: automatic video analysis of a crowd disaster.
Comput. Vis. Image Underst. 116(3), 307–319 (2012)
158. D. Lee, H. Suk, S. Park, S. Lee, Motion influence map for unusual human activity detection and
localization in crowded scenes. IEEE Trans. Circuits Syst. Video Technol. 25(10), 1612–1623
(2015)
159. C.C. Loy, T. Xiang, S. Gong, Salient motion detection in crowded scenes, in 2012 5th Inter-
national Symposium on Communications, Control and Signal Processing (2012), pp. 1–4
160. S. Vishwakarma, A. Agrawal, A survey on activity recognition and behavior understanding
in video surveillance. Vis. Comput. 29(10), 983–1009 (2013)
161. J. Xu, S. Denman, S. Sridharan, C. Fookes, R. Rana, Dynamic texture reconstruction from
sparse codes for unusual event detection in crowded scenes, in Proceedings of the 2011 Joint
ACM Workshop on Modeling and Representing Events (2011), pp. 25–30
162. Y. Cong, J. Yuan, Y. Tang, Video anomaly search in crowded scenes via spatio-temporal
motion context. IEEE Trans. Inf. Forensics Secur. 8(10), 1590–1599 (2013)
163. M. Thida, H. Eng, P. Remagnino, Laplacian eigenmap with temporal constraints for local
abnormality detection in crowded scenes. IEEE Trans. Cybern. 43(6), 2147–2156 (2013)
164. K. Yu, Y. Lin, J. Lafferty, Learning image representations from the pixel level via hierarchical
sparse coding, in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2011), pp. 1713–1720
165. W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes.
IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 18–32 (2014)
166. C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 FPS in MATLAB, in 2013 IEEE
International Conference on Computer Vision (2013), pp. 2720–2727
167. C. Lu, J. Shi, J. Jia, Scale adaptive dictionary learning. IEEE Trans. Image Process. 23(2),
837–847 (2014)
168. S. Han, R. Fu, S. Wang, X. Wu, Online adaptive dictionary learning and weighted sparse cod-
ing for abnormality detection, in 2013 IEEE International Conference on Image Processing
(2013), pp. 151–155

169. B. Zhao, L. Fei-Fei, E.P. Xing, Online detection of unusual events in videos via dynamic
sparse coding. CVPR 2011, 3313–3320 (2011)
170. A. Kanehira, L. Van Gool, Y. Ushiku, T. Harada, Viewpoint-aware Video Summarization
171. D. Potapov, M. Douze, Z. Harchaoui, C. Schmid, Category-Specific Video Summarization,
in Computer Vision—ECCV 2014 (2014), pp. 540–555
172. H. Yang, B. Wang, S. Lin, D.P. Wipf, M. Guo, B. Guo, Unsupervised extraction of video
highlights via robust recurrent auto-encoders, in 2015 IEEE International Conference on
Computer Vision (2015), pp. 4633–4641
173. B. Zhao, E.P. Xing, Quasi real-time summarization for consumer videos, in 2014 IEEE Con-
ference on Computer Vision and Pattern Recognition (2014), pp. 2513–2520
174. A. Sharghi, B. Gong, M. Shah, Query-focused extractive video summarization, in European
Conference on Computer Vision (2016), pp. 3–19
175. A. Sharghi, J.S. Laurel, B. Gong, Query-focused video summarization: dataset, evaluation,
and a memory network based approach, in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), (2017), pp. 2127–2136
176. R. Panda, A. Das, Z. Wu, J. Ernst, A.K. Roy-Chowdhury, Weakly supervised summarization
of web videos, in 2017 IEEE International Conference on Computer Vision (ICCV) (2017),
pp. 3677–3686
177. M. Otani, Y. Nakashima, E. Rahtu, N. Yokoya, Video Summarization using Deep Semantic
Features, pp. 1–16
178. B. Gong, W.-L. Chao, K. Grauman, F. Sha, Diverse sequential subset selection for supervised
video summarization, in Advances in Neural Information Processing Systems 27, ed. by Z.
Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger (Curran Associates,
Inc., 2014), pp. 2069–2077
179. M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool, Creating Summaries from User
Videos, in Computer Vision—ECCV 2014 (2014), pp. 505–520
180. M. Gygli, H. Grabner, L. Van Gool, Video summarization by learning submodular mixtures
of objectives, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2015), pp. 3090–3098
181. A. Kulesza, B. Taskar, Determinantal point processes for machine learning. Found. Trends Mach. Learn. 5(2–3), 123–286 (2012)
182. Y. J. Lee, J. Ghosh, K. Grauman, Discovering important people and objects for egocentric
video summarization, in 2012 IEEE Conference on Computer Vision and Pattern Recognition
(2012), pp. 1346–1353
183. D. Liu, G. Hua, T. Chen, A hierarchical visual model for video object summarization. IEEE
Trans. Pattern Anal. Mach. Intell. 32(12), 2178–2190 (2010)
184. B.A. Plummer, M. Brown, S. Lazebnik, Enhancing video summarization via vision-language
embedding, in Computer Vision and Pattern Recognition, vol. 2 (2017)
185. M. Sun, A. Farhadi, T. Chen, S. Seitz, Ranking highlights in personal videos by analyzing
edited videos. IEEE Trans. Image Process. 25(11), 5145–5157 (2016)
186. K. Zhang, W.-L. Chao, F. Sha, K. Grauman, Summary transfer: exemplar-based subset selec-
tion for video summarization, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 1059–1067
187. F. Chen, C. De Vleeschouwer, Formulating team-sport video summarization as a resource
allocation problem. IEEE Trans. Circuits Syst. Video Technol. 21(2), 193–205 (2011)
188. E. Elhamifar, M.C.D.P. Kaluza, Online summarization via submodular and convex optimiza-
tion, in CVPR (2017), pp. 1818–1826
189. M. Fleischman, B. Roy, D. Roy, Temporal feature induction for baseball highlight classi-
fication, in Proceedings of the 15th ACM International Conference on Multimedia (2007),
pp. 333–336
190. R. Hong, J. Tang, H.-K. Tan, S. Yan, C. Ngo, T.-S. Chua, Event driven summarization for web
videos, in Proceedings of the First SIGMM Workshop on Social Media (2009), pp. 43–48
191. A. Khosla, R. Hamid, C. Lin, N. Sundaresan, Large-Scale video summarization using Web-
Image priors, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013),
pp. 2698–2705
192. G. Kim, L. Sigal, E.P. Xing, Joint summarization of large-scale collections of web images
and videos for storyline reconstruction, in 2014 IEEE Conference on Computer Vision and
Pattern Recognition (2014), pp. 4225–4232
193. Z. Lu, K. Grauman, Story-Driven summarization for egocentric video, in 2013 IEEE Confer-
ence on Computer Vision and Pattern Recognition (2013), pp. 2714–2721
194. B. Mahasseni, M. Lam, S. Todorovic, Unsupervised video summarization with adversarial
LSTM networks, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
vol. 1 (2017)
195. Y. Song, J. Vallmitjana, A. Stent, A. Jaimes, TVSum: summarizing web videos using titles,
in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015),
pp. 5179–5187
196. Y.T. Tesfaye, E. Zemene, Multi-target Tracking in Multiple Non-overlapping Cameras using Constrained Dominant Sets, pp. 1–15
197. Y. Wang, S. Velipasalar, M.C. Gursoy, Distributed wide-area multi-object tracking with non-
overlapping camera views. Multimed. Tools Appl. 73(1), 7–39 (2014)
198. Y.T. Tesfaye, E. Zemene, M. Pelillo, A. Prati, Multi-object tracking using dominant sets. IET
Comput. Vis. 10(4), 289–297 (2016)
199. A. Roshan Zamir, A. Dehghan, M. Shah, GMCP-Tracker: global multi-object track-
ing using generalized minimum clique graphs, in Computer Vision—ECCV 2012 (2012),
pp. 343–356
200. A. Dehghan, S.M. Assari, M. Shah, GMMCP tracker: globally optimal generalized maximum
multi clique problem for multiple object tracking, in 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2015), pp. 4091–4099
201. C.-H. Kuo, C. Huang, R. Nevatia, Inter-Camera association of multi-target tracks by on-line
learned appearance affinity models, in Computer Vision—ECCV 2010 (2010), pp. 383–396
202. D. Cheng, Y. Gong, J. Wang, Q. Hou, N. Zheng, Part-Aware trajectories association across
non-overlapping uncalibrated cameras. Neurocomputing 230, 30–39 (2017)
203. Y. Gao, R. Ji, L. Zhang, A. Hauptmann, Symbiotic tracker ensemble toward a unified tracking
framework. IEEE Trans. Circuits Syst. Video Technol. 24(7), 1122–1131 (2014)
204. S. Zhang, Y. Zhu, A. Roy-Chowdhury, Tracking multiple interacting targets in a camera
network. Comput. Vis. Image Underst. 134, 64–73 (2015)
205. Y. Cai, G. Medioni, Exploring context information for inter-camera multiple target tracking,
in IEEE Winter Conference on Applications of Computer Vision (2014), pp. 761–768
206. J. You, A. Wu, X. Li, W.-S. Zheng, Top-Push video-based person re-identification, in 2016
IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 1345–1353
207. A.D. Jepson, D.J. Fleet, T.F. El-Maraghi, Robust online appearance models for visual tracking.
IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1296–1311 (2003)
208. J. Pan, B. Hu, Robust occlusion handling in object tracking, in 2007 IEEE Conference on
Computer Vision and Pattern Recognition (2007), pp. 1–8
209. A. Yilmaz, X. Li, M. Shah, Contour-based object tracking with occlusion handling in video
acquired using mobile cameras. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1531–1536
(2004)
210. A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, R. Bolle, Appearance models for
occlusion handling. Image Vis. Comput. 24(11), 1233–1243 (2006)
211. L. Hou, W. Wan, K.-H. Lee, J.-N. Hwang, G. Okopal, J. Pitton, Robust human tracking based
on DPM constrained multiple-kernel from a moving camera. J. Signal Process. Syst. 86(1),
27–39 (2017)
212. H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, M.J. Freedman, Live video analytics at scale with approximation and delay-tolerance, in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2017)
213. G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, S.
Sinha, Real-Time video analytics: the killer app for edge computing. Computer (Long. Beach.
Calif) 50(10), 58–67 (2017)
214. F. Loewenherz, V. Bahl, Y. Wang, Video analytics towards vision zero. ITE 87, 25–28 (2017)
215. H. Qiu, X. Liu, S. Rallapalli, A.J. Bency, K. Chan, Kestrel: video analytics for augmented multi-camera vehicle tracking, in 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI) (2018), pp. 48–59
216. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object
detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(2016), pp. 779–788
217. E.K. Bowman, M. Turek, P. Tunison, S. Thomas, R. Porter, V. Gintautas, P. Shargo, J. Lin, Q. Li, X. Li, R. Mittu, C.P. Rosé, K. Maki, Advanced text and video analytics for proactive decision making (2018)
218. K.P. Seng, Video analytics for customer emotion and satisfaction at contact centers. IEEE Trans. Hum.-Mach. Syst. 48(3), 266–278 (2018)