Video Analysis Application Centric View
Abstract
This paper presents an end-user application centric view of surveillance video analysis and describes a flexible, extensible and modular approach to video content extraction. Various detection and extraction components, including tracking of moving objects, detection of text and faces, and face-based soft biometric classification of gender, age and ethnicity, are described within the general framework of the real-time and post-event analysis applications Panoptes and VideoRecall. Some end-user applications that are built on this framework are discussed.

1. Introduction

Since their early deployment in the mid-1960s [25], video surveillance systems have held the promise of increased security and safety, and they have facilitated forensic evidence collection. The potential of video comes with its challenges: the large and ever increasing volume of collected data needs to be analyzed automatically to understand the video content, detect objects and activities of interest, and provide relevant information to end users in a timely manner.

Over the last two decades, video surveillance systems, from cameras to compression and storage technologies, have undergone fundamental changes. Analog cameras have been replaced with digital and IP network cameras; tape based storage systems with no efficient way to find content have been replaced with Digital and Network Video Recorders that use video content analysis for efficient storage, indexing and retrieval. With the availability of increasingly capable computing platforms, the video content analysis and extraction algorithms from the large body of research and development in vision-based video surveillance have found their way into several surveillance applications. The surveillance systems currently deployed at public places collect a large amount of video data, some of which is analyzed in real-time. In addition to traditional surveillance video, in recent years a new source of video data has surpassed any kind of volume expectations, as individuals record video during public and private events with their cameras and smart phones and post it on websites such as YouTube for public consumption. These unconstrained video clips (i.e., "other people's video", the term coined by [2]) may provide additional content and information that is not available in typical surveillance camera views; hence they have become part of any surveillance and investigation task. For example, surveillance video cameras around a stadium typically have views of entry/exit halls, snack courts, etc., which may provide partial video coverage of some excited fans being destructive with no means to get a good description of the individuals; but an inspection of the other fan videos posted on public sites soon after the event may make available crucial information that would otherwise be missed. Unconstrained video is thus gaining importance as a surveillance source, and automated video analysis systems have to utilize this emerging data source.

Detection and classification of objects in video, tracking of moving objects over multiple camera fields of view, understanding the scene context and people's activities, and extracting some level of identification information on individuals have all been active areas of research [16]. Despite all the advances in the field and the availability of multiple real-time surveillance applications, the acceptance of these automated technologies by security and safety professionals has been slower than several initial market estimations. This is partly due to the fact that most automated systems are designed to provide a solution to only some of the areas mentioned above, which effectively is only a part of the solution the end-users require. End users are looking for tools that help to answer the questions "What, Where, When, Who, Why and How (w5h)" using video and other available data. The answers to the w5h questions describe an event with all its details.
What happened? Where did it take place? When did it happen? Who was involved? Why did it happen? And how did it happen? In most cases the answers to most of those questions can be found in video data at multiple levels of detail, and the automated video surveillance applications of tomorrow need to be designed to answer as many of these questions as possible by analyzing both real-time and post-event video.

"What" describes the event, such as "entry from the exit door", which will be detected from a surveillance camera video by tracking a person entering an area through a restricted-direction doorway. "Where" is the location or locale where the event took place, such as "Building-A lobby" or "Terminal C, Logan Airport, Boston"; this information is available for surveillance cameras and may appear in the user-created tags of web videos. "When" is the time of the event, which again is available for surveillance cameras; for web videos a date might be available in the video tags, or it may simply be the time the video was posted. "Who" refers to the actors in the event, such as the person or people who entered the wrong way. Depending on the video view and resolution, the analyses may indicate a single person versus a group of people; if a good quality face image is available it can be matched against a watch list, and gender/age/ethnicity may be detectable even from lower quality face images using soft biometric classifiers. "Why" describes the intent of an event, which is not directly observable from video but may be revealed to the end user by the unfolding of subsequent events. "How" describes the way in which the activity happened and may be explained by event detection, such as one person idling while another exits: the activities preceding the wrong-way entry indicate that the person waited around to catch the door open while someone exited.

Our approach to addressing all w5h questions is to exploit video content and metadata at multiple layers with a flexible, expandable and easily re-configurable video content extraction architecture. The layered architecture allows various content extraction tasks to be configured easily to answer as many of these questions as possible.

After this introduction, we describe our general video content extraction framework and its components in Section 2. We introduce our face-based soft video biometry classification in Section 3. In Section 4 we present a few applications based on this framework. In Section 5 we briefly describe the technology evaluations in which parts of this framework were assessed, followed by the Concluding Remarks in Section 6.

2. Flexible video analytics framework

In this section we introduce our real-time and post-event content extraction architecture and explain the underlying algorithms for the major components.

2.1. Modular Video Analysis Architecture

Figure 1 depicts the intuVision video content extraction framework. At the heart of our framework lie multiple detectors that operate on the video to detect and classify moving objects (for fixed camera event detection), or to perform frame based detection of faces, text, people and vehicles (for analysis of moving platform video or post-event analysis). Once the objects are detected and tracked (faces of tracked people can be detected in real-time), the Panoptes video event detection module looks for pre-specified or anomalous events. The Soft Biometry module then classifies faces by gender/ethnicity and age. All extracted content information is fed into a database to support downstream analysis tasks, queries and retrieval. While Panoptes analyzes the video content in real-time, VideoRecall is intended for forensic analyses, identifying items of interest and providing a reporting utility to organize the findings and results. Panoptes and VideoRecall form the backbone of a generic video surveillance application and allow for quick development of more specific custom surveillance applications (as exemplified in Section 4). These components are discussed in further detail in the following sections.

Figure 1: intuVision Video Analysis architecture
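As an illustrative sketch only, the detect-track-classify-store flow above could be wired together as in the following Python fragment; the Module interface, the VideoPipeline class and the database schema are hypothetical stand-ins, not intuVision's actual implementation.

# Illustrative sketch of the modular detect/track/classify/store flow.
# Module, VideoPipeline and the table schema are hypothetical stand-ins.
import sqlite3

class Module:
    """One pluggable analysis stage (detector, tracker, classifier, ...)."""
    def process(self, frame_no, frame, records):
        """Inspect the frame and earlier records; append new record tuples."""
        raise NotImplementedError

class VideoPipeline:
    """Chains stages and persists every extracted item for later queries."""
    def __init__(self, modules, db_path="content.db"):
        self.modules = modules
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS content "
                        "(frame INTEGER, kind TEXT, bbox TEXT, label TEXT)")

    def run(self, frames):
        for frame_no, frame in enumerate(frames):
            records = []                    # (frame, kind, bbox, label) tuples
            for module in self.modules:     # stages run in order per frame
                module.process(frame_no, frame, records)
            self.db.executemany("INSERT INTO content VALUES (?, ?, ?, ?)",
                                records)
        self.db.commit()

Keeping every stage behind the same process() interface is what makes such a pipeline re-configurable: detectors can be added or dropped per task, and both the real-time and the forensic paths can write to the same content table.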
2.2. Real-Time Analysis with Panoptes

Our approach to real-time video analysis is based on exploiting space- and object-based analysis in separate, parallel but highly interactive processing layers that can be included in processing as required. This methodology has its roots in human cognitive vision [36]. Object-based visual attention, visual tracking and indexing have also been among the active fields of study in cognitive vision. Models of human visual attention use a saliency map to guide the attention allocation [1]. Studies of the human visual cognition system point to a highly parallel but closely interacting functional and neural architecture. Humans easily focus their attention on video scenes exhibiting several objects and activities at once and keep track of object states and relationships. This indicates an allocation of attention to different processes that deal with the various dynamics of the scene, with correspondences and communication between these processes [21].

At the first level, the peripheral tracker is responsible for providing and maintaining a quick glance of the moving objects in a scene; it functions as an abstracted positional tracker that quickly establishes the rough spatial area and the trajectories of the moving objects. The peripheral tracker primarily maintains motion-based information for the objects. This coarse tracking information is all that is needed for some higher level processes, such as detecting instances of objects entering a specific zone or a high number of objects in the scene. Other detection tasks, however, require more detailed analysis, such as classification of objects, or knowledge of the scene background to detect and track objects in more challenging environments. The tracking information from the peripheral tracker is made available to other layers to enable maintaining the scene background and context models, the stationary object model, and the detailed object-based analysis for classification of tracked objects. These layers, described below, facilitate comprehensive analysis of the objects and the scene at multiple levels of detail as needed.

Peripheral Tracking Layer performs a coarse, spatially-based detection and tracking of moving objects using a computationally light algorithm based on frame recency, producing rough connected regions that represent the moving objects. The detected moving object regions are tracked from frame to frame by building correspondences between objects based on their motion, position and condensed color information.
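A rough sketch of such a frame-recency tracker is given below; it differences each frame against a running average, extracts connected regions with OpenCV, and links them to existing tracks by nearest centroid. The thresholds and the position-only matching rule are simplifying assumptions (the layer described above also uses motion and condensed color cues), not the actual Panoptes algorithm.

# Coarse motion-region tracking: frame recency + connected components +
# nearest-centroid correspondence. All parameters are illustrative.
import cv2
import numpy as np

def track(frames, diff_thresh=25, max_jump=50.0, alpha=0.05, min_area=100):
    tracks, next_id = {}, 0          # track id -> last known centroid
    avg = None                       # running average = frame recency model
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if avg is None:
            avg = gray.copy()
            continue
        moving = (np.abs(gray - avg) > diff_thresh).astype(np.uint8)
        avg = (1 - alpha) * avg + alpha * gray       # adapt the model
        n, _, stats, centroids = cv2.connectedComponentsWithStats(moving)
        for i in range(1, n):                        # label 0 is background
            if stats[i, cv2.CC_STAT_AREA] < min_area:
                continue                             # drop noise blobs
            c = centroids[i]
            # correspond with the nearest existing track, else start one
            best = min(tracks, default=None,
                       key=lambda t: np.linalg.norm(tracks[t] - c))
            if best is not None and np.linalg.norm(tracks[best] - c) < max_jump:
                tracks[best] = c
            else:
                tracks[next_id] = c
                next_id += 1
        yield dict(tracks)           # rough positions for the other layers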
Scene Description Layer maintains models for general scene features to facilitate detailed analysis in other layers. Currently the Scene Description Layer includes an adaptive background model [26], a dynamic texture background model [20], a background edge map [11] and a scene activity model [28]. The Scene Description Layer, or some of its components, may not be necessary for all detection tasks; rather, it is designed to support tasks such as stationary object or anomalous activity detection.
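For a feel of how one such component plugs in, the fragment below stands up a generic adaptive background model with OpenCV's stock Gaussian-mixture subtractor; this is a convenient substitute for illustration, not the specific models of [26] or [20].

# Generic adaptive background model via OpenCV's Gaussian-mixture
# subtractor, used here as an illustrative stand-in for the layer's models.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500,         # number of frames the model adapts over
    varThreshold=16,     # distance threshold for declaring foreground
    detectShadows=True,  # shadows get the intermediate mask value 127
)

def foreground_mask(frame):
    """Return a clean binary foreground mask; the model updates online."""
    mask = subtractor.apply(frame)
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
    return cv2.medianBlur(mask, 5)                              # despeckle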
Scene Activity Model: Every scene has a normal activity and flow associated with it: vehicles usually follow paths dictated by traffic rules and lights, while people walk along convenient paths, interact with interest points such as a snack stand, or park their cars and walk towards a building. Learning the normal scene activity for each object type facilitates the detection of anomalous actions of people and vehicles. Similar to [28], we model the scene activity as a probability density function (pdf). Kernel Density Estimation computes the model, and a Markov Chain Monte-Carlo framework generates the most likely paths through the scene. The model is trained with a set of normally observed tracks that consist of observed transition vectors defined by the destination location, the transition time, and the bounding box width and height of the object of interest. The pdf of the transition vectors at each pixel location in the background scene is modeled as a mixture of Gaussians, and a Gaussian Mixture Model (GMM) is learned for each pdf at a given location using a modified EM algorithm. This framework circumvents the need to explicitly cluster the user tracks. For a given new track, the transition vectors are calculated and the probability of the track being normal is estimated using the trained GMMs at every pixel location.
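The following sketch captures the train-and-score logic under simplifying assumptions: scikit-learn's standard EM stands in for the modified EM mentioned above, and locations are coarsened to grid cells rather than individual pixels to keep the example small.

# Scene activity model sketch: per-location GMMs over transition vectors
# (destination x, destination y, transition time, box width, box height).
# scikit-learn's standard EM replaces the modified EM of the text, and
# pixel locations are coarsened to grid cells to keep the example small.
import numpy as np
from sklearn.mixture import GaussianMixture

CELL = 16  # grid cell size in pixels (assumed)

def cell_of(x, y):
    return int(x) // CELL, int(y) // CELL

def transition_vectors(track):
    """track: time-ordered (x, y, t, w, h) observations of one object."""
    vecs = {}
    for (x0, y0, t0, _, _), (x1, y1, t1, w, h) in zip(track, track[1:]):
        vecs.setdefault(cell_of(x0, y0), []).append([x1, y1, t1 - t0, w, h])
    return vecs

def train(normal_tracks, components=3):
    samples = {}
    for track in normal_tracks:                 # pool vectors by location
        for cell, vs in transition_vectors(track).items():
            samples.setdefault(cell, []).extend(vs)
    return {cell: GaussianMixture(components).fit(np.array(vs))
            for cell, vs in samples.items() if len(vs) > components}

def normality(models, track):
    """Mean log-likelihood of a new track under the trained local GMMs."""
    scores = [models[cell].score(np.array(vs))
              for cell, vs in transition_vectors(track).items()
              if cell in models]
    return float(np.mean(scores)) if scores else float("-inf")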
Object Layer provides detailed object-based analysis for classification. We use the Support Vector Machine (SVM) method for object classification. SVM is a supervised learning method that simultaneously minimizes the classification error and maximizes the geometric margin. Hence SVMs perform better than many other classifiers and have been used in a wide range of computer vision applications for object classification, object recognition [23,24] and action recognition [29,11]. We use the object's size, oriented edges [4], silhouette and motion based features for SVM object classification. Oriented edges have been shown to be robust to changes in illumination, scale and view angle [6].

In most surveillance applications, robustly classifying objects without a large set of sample data to train the classifier is of particular interest. SVMs are well suited for such applications as they can be trained with few samples and generalize well to novel scenes. Figure 2 illustrates the Panoptes object classification training interface, where a new object type is being trained for a person carrying a long object.
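As an illustration of this few-sample setup, the sketch below trains an SVM on a reduced feature vector (an oriented-edge histogram plus blob size); the exact silhouette and motion features used in the Object Layer are not reproduced here.

# Few-sample SVM object classification on a reduced feature vector:
# an 8-bin oriented-edge histogram plus log blob size. The silhouette
# and motion features of the actual Object Layer are omitted here.
import cv2
import numpy as np
from sklearn.svm import SVC

def features(gray_patch):
    gx = cv2.Sobel(gray_patch, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray_patch, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)
    # orientation histogram weighted by gradient magnitude
    hist, _ = np.histogram(ang, bins=8, range=(0, 2 * np.pi), weights=mag)
    hist = hist / (hist.sum() + 1e-6)
    size = gray_patch.shape[0] * gray_patch.shape[1]
    return np.append(hist, np.log1p(size))

def train_classifier(patches, labels):
    """patches: a handful of user-marked object crops; labels: class names."""
    X = np.array([features(p) for p in patches])
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # tolerates small samples
    clf.fit(X, labels)
    return clf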
In aerial or other moving camera views (such as panning and zooming views of fixed cameras), the scene background is not consistent over multiple frames. To detect or track objects in such scenes, using trained detectors based on appearance features in each single frame works well. For these types of environments we employ the Haar transform based feature detection framework suggested by [19, 33] for the detection of people and vehicles. Haar features are used to train a cascade of classifiers using the Adaboost framework. The trained classifier is then applied to each frame to detect the objects of interest.
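The per-frame usage pattern for such a cascade is shown below with OpenCV's CascadeClassifier; the cascade file name is a placeholder for a model trained offline with AdaBoost, and the detection parameters are illustrative.

# Per-frame detection with a trained Haar cascade, suitable when the
# background is unstable. The XML file name is a placeholder for a
# cascade trained offline with AdaBoost; parameters are illustrative.
import cv2

cascade = cv2.CascadeClassifier("person_cascade.xml")  # hypothetical model

def detect(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)       # normalize illumination
    # returns (x, y, w, h) boxes, one per detected person
    return cascade.detectMultiScale(gray, scaleFactor=1.1,
                                    minNeighbors=4, minSize=(24, 48))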
Real-time Event Detection: Understanding activities and events from video sequences has been studied widely, and approaches ranging from expert-system-like rule-based methods to probabilistic models have been employed [9,13,14,15]. Our event detection algorithms use various methods appropriate for different event detection tasks [11]. We employ a rule-based approach for pre-defined events, trained SVM models for person activities, and the scene activity model for detecting anomalous behaviors.
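To illustrate the rule-based branch only: a pre-defined event such as the wrong-way entry from Section 1 reduces to a simple predicate over a track, as in the hypothetical sketch below (the zone, the direction convention and the track format are assumptions made for illustration).

# Toy rule for one pre-defined event: entry into a zone through the
# restricted direction. Zone, direction convention and track format
# are assumptions made for illustration.
RESTRICTED_ZONE = (100, 100, 300, 400)       # x0, y0, x1, y1 in pixels

def in_zone(p, z=RESTRICTED_ZONE):
    return z[0] <= p[0] <= z[2] and z[1] <= p[1] <= z[3]

def wrong_way_entry(track):
    """track: time-ordered (x, y) centroids; fires on an outside-to-inside
    transition while moving upward (assumed to be the exit-only direction)."""
    for prev, cur in zip(track, track[1:]):
        if not in_zone(prev) and in_zone(cur) and cur[1] < prev[1]:
            return True
    return False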