DOI 10.1007/s00530-011-0245-x
REGULAR PAPER
A general framework for managing and processing live video data with privacy protection
Alexander J. Aved · Kien A. Hua
© Springer-Verlag 2011
Abstract Though a large body of existing work on video surveillance focuses on image and video processing techniques, few address the usability of such systems, and in particular privacy issues. This study fuses concepts from stream processing and content-based image retrieval to construct a privacy-preserving framework for rapid development and deployment of video surveillance applications. Privacy policies, instantiated as privacy filters, may be applied both granularly and hierarchically. Privacy filters are granular as they are applicable to specific objects appearing in the video streams. They are hierarchical because they can be specified for specific entities in the framework (e.g., users, cameras) and are combined such that the disseminated video stream adheres to the most stringent aspect specified in the cascade of all privacy filters relevant to a video stream or query. To support this privacy framework, we extend our Live Video Database Model with an informatics-based approach to object recognition and tracking, and add an intrinsic privacy model that provides a level of privacy protection not previously available for real-time streaming video data. The proposed framework also provides a formal approach to implement and enforce privacy policies that are verifiable, an important step towards privacy certification of video surveillance systems through a standardized privacy specification language.

Keywords Query language · Privacy framework · Video database system · Real-time · Object recognition · Object tracking
A. J. Aved (✉) · K. A. Hua
University of Central Florida, Orlando, FL, USA
e-mail: [email protected]
K. A. Hua
e-mail: [email protected]
1 Introduction

Camera networks have been the subject of intensive research in recent years, and can range from a single camera to a network of tens of thousands of cameras. Usability is an important contributing factor to the effectiveness of such networks. As an example, a camera network in London cost £200 million over a 10-year period; however, police are no more likely to catch offenders in areas with hundreds of cameras than in those with hardly any [29]. This phenomenon is typical for large-scale applications and can be attributed to the fact that the multitude of videos is generally not interesting, and constant manual monitoring of these cameras for occasional critical events can become fatiguing. Due to this limitation in real-time surveillance capability, cities mainly use video networks to archive data for post-crime investigation.

To address operator fatigue and increase the effectiveness of the camera network, automatic video processing techniques have been developed for real-time event detection under various specific scenarios. These systems, such as the live video database discussed in this study, can monitor live video streams in real time for events of interest and can alert human operators upon their detection. However, pervasive monitoring by corporate and governmental entities can lead to privacy concerns. For example, archived video from a police camera network could later be used for purposes other than that for which it was originally collected. Public information laws could make video footage collected for legitimate purposes available to anyone who requests it. In a corporate setting, cameras deployed to record customer behavior might capture employees after their work shift has ended.

Deploying a sizable camera network entails a significant monetary investment. For aesthetic reasons it is also
desirable to minimize the number of cameras deployed (e.g., in a historic district of a town). For reasons such as these, it would be beneficial if multiple entities could share access to the cameras. Police could monitor for suspicious activity and crime scene evidence, utility companies could gauge outages after inclement weather, and sanitation departments could assess the productivity of new employee vehicle operators, to name just a few possible collaborators. Possible benefits entail shared deployment costs, and the possibility of providing service to stakeholders who otherwise could not justify the expense of a single-purpose camera network deployment on their own. However, shared usage and monitoring by indeterminate or changing interests could lead to significant privacy concerns. Thus, to address the usability and privacy concerns of a general-purpose camera network, three factors are important:

1. The software system must support ad hoc monitoring tasks. An event of interest in one domain such as transportation monitoring is generally different from another domain such as crime prevention. Events of interest can also vary significantly between individual users.

2. It is desirable to provide the capability to enable rapid development of customized applications for domain-specific users, such as counting cars along a section of highway, or informing an employee when a nearby conference room becomes available early.

3. People are concerned about privacy and there are increasing objections to pervasive monitoring. For applications that run atop a camera network, there is a need for policies which specify the level of privacy they adhere to, and mechanisms which implement said policies.

To achieve the first two factors, we have designed and implemented a general-purpose Live Video Database Management System (LVDBMS) [25] as a platform to facilitate live video computing. It allows automatic monitoring and management of a large number of live cameras. The cameras in the network are treated as a special class of storage, with the live video streams viewed as database content. The user is able to specify an ad hoc video monitoring task by formulating a query to describe a spatiotemporal event. When the event occurs and is detected by a monitoring query, an action associated with the query is executed. This general-purpose LVDBMS also enables rapid development of live video applications, much like database applications are developed atop standard database management systems today. Another work that allows the user to specify semantically high-level composite events for video surveillance is presented in [31]. However, this technique requires the user to formulate queries procedurally using low-level operators. In contrast, our query language is declarative: the user only defines the event, and the system automatically generates the corresponding query processing procedure.

To address the third factor mentioned previously, we present in this article a privacy framework for the LVDBMS. This framework implements a privacy specification language (PSL) that permits privacy policies to be specified and enforced by removing identifiable information pertaining to objects from the video streams as they are made available externally by the system. Example consumers of streaming video could be a file for storage, an operator's video terminal, or a live video feed captured for use in a traffic report on a television news broadcast. This facility allows the user to specify various privacy views atop the raw video stream to remove the objects from the output stream, with the aim of protecting individual privacy while retaining general trends evidenced in the video stream such that further scene analysis is meaningful. Here an object refers to a person, animal, or vehicle that is not part of the video background. These objects are characterized using a multifaceted object model. As part of the new extension to our prototype, we also introduce in this article an informatics-based cross-camera tracking technique. This scheme permits queries to be defined that span multiple video streams.

The remainder of this article is organized as follows. In Sect. 2, we give an overview of our LVDBMS to make the article self-contained. The proposed privacy framework is introduced in Sect. 3, with experimental results in Sect. 4. Related work is discussed in Sect. 5. Finally, we conclude this article in Sect. 6.
2 LVDBMS environment

In this section, we briefly introduce the LVDBMS, a distributed video stream processing Live Video Database (LVD) environment, and refer the reader to [25] for further details. Large networks of cameras provide a proliferation of multimedia information, and there is a profound need to manage and organize the volumes of video data into proportions relevant for human consumption. In an environment with numerous video cameras, the goal is to provide human operators with a facility to specify relevant scenes of interest, minimizing exposure to uninteresting and irrelevant scenes. However, today there is a technological gap between what state-of-the-art software technologies can provide in terms of identifying rich content and what consumers expect. The LVDBMS allows users to mine numerous video streams for events of interest and perform actions when the specified scenarios are encountered.
Fig. 1 LVDBMS hardware architecture

2.1 LVDBMS architecture

The LVDBMS is a distributed video database management system. Operators interact with the highest layer, and view video streams and query results through computer terminals (Fig. 1). Figure 2 illustrates the four logical tiers implemented in the LVDBMS architecture, which communicate through web services interfaces. Multiple cameras in the camera layer may be associated with a single host in the spatial processing layer. Spatial processing layer hosts perform real-time motion imagery processing techniques for abstract object recognition and partial query evaluation (dependent upon available data at the host). Intermediate query results are streamed to a host in the stream processing layer, which periodically computes the final query result that is made available to clients in the client layer. We discuss these software layers in more detail in the following sections.

Fig. 2 The LVDBMS is logically grouped into three layers plus a client layer

Fig. 3 Software layers of the LVDBMS and major components contained therein

Fig. 4 The LVDBMS client allows users to browse cameras, construct queries, and send system commands

2.1.1 Camera layer

The camera layer (Fig. 3, left) consists of physical devices that capture images. Each camera is paired with a camera adapter which runs on a host computer. We do not assume that the cameras have built-in analytical capability other than capturing images and making them available to a corresponding camera adapter. The camera adapter allows any type of camera to be used with the LVDBMS using relevant drivers. When processing a scene, the camera adapter first performs scene analysis on the raw image data to determine background pixels and foreground objects. The segmented objects then flow into the frame-to-frame tracking module, which tracks objects within consecutive frames in a camera's field of view and assigns them an object number (unique within the camera adapter). For each image a bag of feature vectors is calculated (a bag is similar to a set but allows for duplication). Once each frame from the camera is processed, it is bundled as an image descriptor and sent to the spatial processing layer for query evaluation. The image descriptor may contain the actual image bitmap if specifically requested by the spatial processing layer host, but otherwise it contains only image metadata, such as the identifiers and locations of objects identified within the frame, the corresponding feature vectors, frame sequence number, etc.
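To make the data flow concrete, the sketch below models the image descriptor as a plain data type. The type and field names are illustrative assumptions; the paper describes the descriptor's contents but not the prototype's actual classes.

using System.Collections.Generic;
using System.Drawing;

// Sketch of the per-frame image descriptor produced by a camera adapter.
class DetectedObject
{
    public int ObjectNumber;          // assigned by the frame-to-frame tracker,
                                      // unique within the camera adapter
    public Rectangle Location;        // position of the object in the frame
    public List<float[]> FeatureBag;  // bag of feature vectors (duplicates allowed)
}

class ImageDescriptor
{
    public string CameraAdapterId;
    public long FrameSequenceNumber;
    public List<DetectedObject> Objects;
    public byte[] ImageBitmap;        // included only if the spatial processing
                                      // layer host explicitly requests it
}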
2.1.2 Spatial processing layer

Spatial processing layer hosts evaluate spatial and temporal operators over the streams of image descriptors and provide a result stream to the stream processing layer (Fig. 3, middle). A server hosting the spatial processing layer will service many cameras, but a camera adapter will be associated with only a single instance of the spatial processing layer. Server replication at this layer allows the LVDBMS to scale to an arbitrarily large number of video streams.

2.1.3 Stream processing layer

The stream processing layer (Fig. 3, right) accepts queries submitted by clients and partial query evaluation results from the spatial processing layer. It does not interact directly with cameras or their adapters. We note that we can have replication at the stream processing layer for scalability and fault tolerance. Queries are decomposed into an algebraic tree structure which is partitioned by the host and pushed down to relevant servers in the spatial processing layer. As subqueries are evaluated, results are streamed back to the stream processing layer, where they are combined and the final query result is computed. Subquery results may arrive out of order or be lost in the network, and a camera may unexpectedly go offline; such conditions must be handled gracefully.

2.1.4 Client layer

Users connect to the LVDBMS and submit queries using a graphical user interface, depicted in Fig. 4. The client allows users to browse available video cameras, define and submit queries, and review results.

2.2 LVDBMS data model

A query is a spatiotemporal specification of an event, posed over video streams, and expressed in Live Video SQL
(LVSQL), a structured query language. A query defines which streams will be accessed and what information will be returned. In an LVD, information is contained in live video streams which are input to the LVDBMS in real time. The fundamental construct in an LVD is an object, which is either indicated by the user (a static object) or automatically detected (a dynamic object). We provide a brief description of the LVD data model and refer the reader to [25] for additional details.

A video stream consists of temporally ordered frames where each frame represents a snapshot of what was detected by an image sensor at a particular time. An object, then, is some real-world physical entity whose image was captured and is represented in the frame. (For our definition of object, we do not consider the background to be an object unless it is specifically indicated in the query.) There are two types of operators in LVDs: spatial and temporal. Spatial operators are formulated over objects that are visually captured in video streams (i.e., overlaps, meets, disjoint, exists, etc.). Temporal operators evaluate the temporal relationships between spatial events (i.e., before, during, etc.). When constructing LVSQL statements, there are three types of objects that may be referenced:

1. Static objects are indicated by the user by drawing an outline on a frame captured from the video stream, at query submission time.

2. Dynamic objects are the objects that appear in a video stream and are not part of the background. They may be specified as an asterisk (*) in the query.

3. Cross-camera dynamic objects are dynamic objects detected in a camera and matched with an object that was (or is) viewed in another camera. In the query language, these objects are denoted with a pound sign (#), e.g., Before(Appear(V1.#), Appear(V2.#), 120), which queries for an object that appears in stream V2 within 120 s of appearing in V1.

As events occur in real time, queries must be resolved in real time as well. Queries that require temporarily storing historical video data are always parameterized such that the data that must be retained for query evaluation is confined to a temporal sliding window. The resolution of a query refers to the frequency with which a query is evaluated. For example, a query that is evaluated five times each second is of finer resolution than a query evaluated once each second.

2.3 LVDBMS query language

The LVSQL query language is used to pose continuous queries over events that occur in live video streams in real time. The essential form of an LVSQL query is as follows:

ACTION <Action> ON EVENT <Event>

where the <Event> syntax is expanded upon in Fig. 5. In LVSQL, spatial operators take objects as arguments, and temporal operators take the output of spatial operators (i.e., bit streams) as arguments. Boolean logic combined with the various temporal and spatial operators results in a very expressive language capability. All queries must involve a spatial operator; the simplest query expressible could check for the existence of any object appearing in the field of view of a particular camera. This spatial query could then be enhanced with a temporal component, for example, duration: trigger an alarm if an object appears and persists for longer than 10 min; or two spatial operators could be combined with a temporal operator: alert if an object contacts a particular desk, then walks through a door.
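For illustration, two queries in this form (the action identifiers are hypothetical placeholders; the event expressions reuse the operators introduced above):

ACTION alert_operator ON EVENT Contains(C1.S1, C1.*)
ACTION notify_guard ON EVENT Before(Appear(V1.#), Appear(V2.#), 120)

The first query fires whenever any dynamic object enters the user-drawn static region S1 of camera C1; the second fires when a cross-camera object appears in stream V2 within 120 s of appearing in V1.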
3 Privacy filter framework

In this section, we introduce our privacy framework and explain how it is applied. The implementation details are discussed in Sect. 4. While members of the public generally accept having their image recorded by cameras, it is a violation of their trust to use their data for purposes people may find intrusive, or to have their image used for reasons contrary to the known usage of the cameras. Examples of intrusive uses could be a security guard observing shoppers for personal edification during a rest period, or corporate mining of the behavior of individuals in a store so they may have items marketed to them the next time they enter that store. As cameras become ubiquitous in public locations, camera networks become smarter, and storage capacity increases such that larger volumes of video data may be retained for lengthier periods of time, it is an increasing concern that video data collected for one purpose may be used for other purposes. If an intrusive usage was known to the target individual, they might have chosen not to participate, for example by shopping in a competing mall.

In this article, we introduce privacy filters as the framework that can anonymize the people who are observed by networked video cameras in accordance with a privacy policy. Depending upon the specifics of the privacy policies being enforced, global trends in the videos could still be visible, such as people going into and out of a room, while the appearances of the individuals could be redacted, thus minimizing the potential for misuse of the video and unintended consequences for the people being observed.

We endeavor to protect the identity of innocent individuals. However, some users need the option of investigating and identifying individuals (if they have the authority to do so). In this scenario, the LVDBMS is designed to allow someone with the proper access to investigate the identity of individuals via unperturbed video streams, while applying privacy filters to protect individual privacy by blocking video stream consumers without sufficient access privileges. Thus, privacy enforcement does not affect the intended utility or intelligibility of the video. The challenge is to accomplish this in real time without restricting who consumes the stream, so that actions triggered by events occurring in the stream are timely and relevant.

3.1 Scope and assumptions

This section describes the objectives of the transformation that privacy filters induce upon corresponding video streams. It also defines the scope of implementation assumptions inherent to our LVDBMS prototype.

3.1.1 Scope of privacy applied to video stream output

The proposed privacy framework is based on the concept of privacy filters. A privacy filter implements a privacy policy, which specifies under what circumstances the appearance of objects appearing in video streams passing through the filter may be observed or must be redacted from the stream. The primary goal of privacy filters is to obfuscate the appearance of qualifying objects such that objects will become unidentifiable after passing through the privacy filter. More precisely, a privacy filter defines a set of criteria. These criteria are matched with objects that are identified in video streams. The criteria can be specified to be very precise (e.g., objects that satisfy a particular query condition based upon temporal and spatial location), or general (e.g., applied to all objects observed by a particular camera). Thus, the scope of privacy filters is the salient objects appearing in the video stream; not the environment, such as the scene background observed by a camera, or other conditions that can be observed such as time of day, or conjectures based upon knowledge of the location of the camera from which the video stream originates. Furthermore, privacy filters are applied to video streams as a final step before the streamed data is externalized from the system. In order to maximize query accuracy, queries and internal indices for object tracking are based upon metrics calculated from non-obfuscated data. This raw metadata is never externalized by the LVDBMS or explicitly saved to persistent storage.

3.1.2 Scope of system prototype implementation

The focus of this research is privacy policies, the realization of those privacy policies as privacy filters, and the corresponding transformations privacy filters have upon the video streams passing through them. We do not consider as a part of this work aspects of system security that must be addressed in an actual physical deployment to a public area. For example, we do not consider the physical aspects of the system, such as the physical security of servers hosting LVDBMS software, cameras, and the communications channels between cameras and LVDBMS hosts. However, we note that such things can be accomplished by other means, such as purchasing fixtures to hold the cameras and enabling encrypted communication tunnels via the operating system or through virtual private networks. Furthermore, we do not attempt to detect and thwart privacy attacks against the system, such as a series of specifically crafted queries issued by a user designed to leak unintended information. Although we do implement certain safeguards, such as providing a mechanism to restrict which cameras and video streams a user can observe, we assume a user is who she presents herself to be, and not a malicious user masquerading as a legitimate system user.

3.2 Framework overview

Privacy filters may be applied at different levels in the LVDBMS system hierarchy, and video streams may be affected by multiple privacy filters. This cascade of privacy filters is conceptually similar to how views may restrict columns in a traditional relational database (Fig. 6,
left). When a video stream passes through multiple privacy filters, the effect is that the most stringent privacy level is applied (note that a privacy filter does not necessarily apply to each object appearing in a particular frame of video in a passing stream; Fig. 6, right). Similarly, in a relational database a user may be allowed to access only views, and relational views may be built upon other views, which may themselves reference the physical tables or yet other views. In the LVDBMS hierarchy, privacy filters are associated with cameras, queries, user groups, and view objects:

- Camera: Any camera within the system can have a privacy filter associated with it. Applied at this level, the privacy filter has the broadest impact as it affects all consumers of this camera.
- Query: A privacy filter at this level has a moderate impact, as it is associated with a specific query. It affects only the consumers of the query's output.
- User group: A privacy filter at this level has the narrowest impact. Only the users in this group are affected.
- View: A view, implementing a privacy filter, may be defined over a stream or a previously defined view. Queries and users may access the underlying video stream through the view, with the constraint that the privacy filter will be applied to the view's output.
3.3 Filter output sanitation model

While the previous section provided a conceptual definition of privacy filters, for clarity this section gives a more precise treatment. Let Q be the set of active queries posed over the set of streams 𝒮 in the LVDBMS. A stream S ∈ 𝒮 in the stream processing layer is a first-in-first-out (FIFO) sequence of frames S = {f_i, f_{i−1}, …, f_{i−k+1}}, where k is the lesser of the maximum number of frames required to resolve any active query q ∈ Q and a system-defined maximum. (Note that frame f_i ∈ S for any S represents the most recent image captured by a camera, after a negligible processing and communication delay, and S is maintained in real time as frames are received from spatial processing layer hosts.) Frames in S are retrieved via a frame access function:

Acc(S, k) → frame

which retrieves the frame and corresponding metadata in the kth position in S. When a stream is externalized from the LVDBMS (such as for display on a user's terminal, to be saved in a file, etc.), it is passed through a sanitizer function:

San(S, f) → Acc(S, 1) ⊗ Z(f)

where Z returns a mask that indicates which regions of the frame to obscure in accordance with the privacy filter f, and ⊗ obfuscates the image bitmap contained in the frame with the mask, perturbing the output of San. When selecting from a view with a privacy filter f′, San becomes:

San(S, f) → Acc(S, 1) ⊗ Z(f ∗ f′)

where ∗ combines the filters as described in the previous section. In the literature, some sanitizers choose not to answer queries, or add noise drawn from a statistical distribution. However, Z is deterministic given its parameter f. Furthermore, detecting an attack (that is, a determination of information the system attempts to redact with the mask) is beyond the scope of this study, and all queries are assumed to be legitimate.
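A minimal sketch of this sanitization step follows, assuming hypothetical types and helper names; the paper defines San, Acc, and Z formally but does not publish the corresponding implementation, and the predicate below stands in for evaluating a filter's full 3-tuple policy.

using System;
using System.Collections.Generic;
using System.Drawing;

class Frame
{
    public Bitmap Image;                      // raw image bitmap
    public List<Rectangle> ObjectRegions;     // bounding boxes of detected objects
}

static class Sanitizer
{
    // Z(f): the mask, i.e., the regions the filter requires to be obscured.
    static List<Rectangle> ComputeMask(Frame frame, Predicate<Rectangle> filter)
        => frame.ObjectRegions.FindAll(filter);

    // San(S, f) = Acc(S, 1) ⊗ Z(f): mask the most recent frame on output.
    public static Bitmap Sanitize(List<Frame> stream, Predicate<Rectangle> filter)
    {
        Frame frame = stream[0];              // Acc(S, 1): most recent frame
        using (Graphics g = Graphics.FromImage(frame.Image))
            foreach (Rectangle region in ComputeMask(frame, filter))
                g.FillRectangle(Brushes.Black, region); // obscure qualifying object
        return frame.Image;
    }
}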
Fig. 6 Relational database view (left) compared with illustration of a cascade of privacy filters (right)
Fig. 7 Privacy Specification Language; uppercase represents a keyword and italics a user-supplied parameter
{CREATE | UPDATE | DELETE} FILTER filter_identifier
    [TARGET = {QUERYTARGETS | NONQUERYTARGETS | PREVIOUSLYMASKED}]
    [TEMPORALSCOPE = {QUERYNONACTIVE | QUERYACTIVE | PERMANENT}]
    [OBJECTSCOPE = {STATIC | DYNAMIC | CROSSCAMERADYNAMIC}]
{CREATE | UPDATE | DELETE} VIEW view_identifier OVER stream_identifier [WITH filter_identifier]
{ASSOCIATE | DISASSOCIATE} GROUP group_identifier WITH {FILTER | VIEW} filter_identifier
{CREATE | DELETE} USERGROUP group_identifier
{ASSOCIATE | DISASSOCIATE} USER user_identifier WITH group_identifier
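For instance, the statements below create a filter that permanently masks all dynamic objects a camera observes and expose the camera only through a filtered view; the identifiers are hypothetical, and only the keywords come from the grammar above:

CREATE FILTER always_mask
    TEMPORALSCOPE = PERMANENT
    OBJECTSCOPE = DYNAMIC
CREATE VIEW hallway_public OVER hallway_cam WITH always_mask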
3.4 PSL

The PSL allows a system administrator to implement privacy policies by constructing privacy filters, and to manage system user access. Privacy filters can be associated with groups, and thus with individual users through their user-group membership association. When applied to a group, a privacy filter affects all people in the group. Privacy filters may also be associated with views, in which case they apply to any accessors of the view. (Privacy filters may also be associated with cameras, but this is specified in the configuration file associated with the camera's adapter.) All privacy filters are cumulative; the system does not provide a way to reduce privacy by adding a new privacy filter. When a user creates a query, their privacy filters (via group memberships) are in turn associated with the query. However, a system administrator may create canned queries which users may run unaffected by the privacy filters associated with the executor's account. For example, such a query could save an unperturbed video stream to a persistent storage location the user cannot access, for use if the user believes a crime is occurring in the view of a camera to which they do not have unrestricted access. The syntax for the PSL is provided in Fig. 7.

3.5 Design and implementation of privacy filters

This section presents operational and implementation details of privacy filters in the LVDBMS. When two or more privacy filters apply to a stream, they are combined into an effective privacy filter. An active query is a currently running query in the LVDBMS system. Each query has operators that specify object(s) from cameras (i.e., spatial operators) as input. A relevant object is an object that appears in a video stream referenced in an active query, and can potentially contribute to the query evaluating to true. If the query becomes true, the contributing relevant object is called a target object; otherwise, it is a non-target object.

Fig. 8 QueryTarget versus NonQueryTargets: D121 satisfies the query condition and is a target

Consider the query Contains(C1.S1, C1.*) illustrated in Fig. 8. This query determines if there is any dynamic object detected within the static object S1, represented by the dashed rectangle. Since the dynamic object D121 is contained within S1, D121 is a target object. On the other hand, since another dynamic object D102 is not contained in S1, D102 is a non-target object. A privacy filter can be specified so that only target objects, non-target objects, or all relevant objects should be protected (i.e., have their appearance obscured). If a protected object no longer satisfies a privacy filter specification, this object obtains the status previously masked. In this example, if the privacy filter is to blur all the target objects, then the dynamic object D121 is a previously masked object after it leaves the boundary of S1. We note that displaying a live video stream from a given camera is similar to a query which always evaluates to true, where the output video stream is the same as the input video stream.
3.6 Defining privacy filters

Privacy filters are created in the LVDBMS in one of two ways: they are specified in configuration files at system startup, or they are created with the CREATE FILTER PSL statement. A privacy filter is specified by the attribute 3-tuple {Target, TemporalScope, ObjectScope}, where Target, TemporalScope, and ObjectScope are quantified in Tables 1, 2, and 3. Not all attributes are applicable to every use of privacy filters, and they may be set to None. The Target parameter specifies whether the filter applies to only target objects, non-target objects, all relevant objects defined over the camera (static objects defined by users and dynamic objects detected in the video stream), or None, indicating that this attribute should not be considered when determining if a privacy filter applies to a particular object. ObjectScope refers to the type (scope) of objects the filter applies to: static or dynamic objects, cross-camera dynamic objects, none, or all objects appearing in the stream. TemporalScope indicates when the filter will be applied: always, never (the filter is currently inactive), or when the stream is or is not being accessed by a query.

Table 1 Privacy filter values for parameter type Target

Attribute         Description                                                        Priority
None              No privacy                                                         1
QueryTargets      Targets of active queries are obscured                             2
NonQueryTargets   Objects that are not targets of active queries are masked.
                  An active query may obscure their identity                         2
PreviouslyMasked  Specifies that objects that were previously masked will
                  continue to be masked                                              2
All               All object identities are masked, regardless of query status       3

Table 2 Privacy filter values for parameter type TemporalScope

Attribute        Description                                                         Priority
None             No privacy                                                          1
QueryNonActive   Privacy settings apply only when a query is not active              2
QueryActive      Privacy settings apply only when a privacy-enabled query is
                 active (in the case of privacy applied to a camera, for example)    2
Permanent        Privacy settings apply for the lifetime of the object, camera,
                 or query                                                            3

Table 3 Privacy filter values for parameter type ObjectScope

Attribute            Description                                          Priority
None                 No privacy (no relevant objects qualify)             1
CrossCameraDynamic   Objects that are first detected in another camera    2
Dynamic              Dynamic (automatically detected) objects             2
Static               Static (user defined) objects                        2
All                  All classes of objects qualify                       3
3.7 Privacy filters applied to cameras

Logically, we may associate privacy filters with physical cameras, but to make the LVDBMS software more flexible with respect to the types of cameras that can be used with the system, privacy filters are actually evaluated by the camera adapter. When the camera adapter is initialized, it takes its initial configuration from a configuration file. The initial state of its privacy filter can be specified in this file and will persist for the lifetime of the adapter's state. (When the LVDBMS is in operation, an operator may specify new default privacy filter settings.) A camera's privacy filter is maintained in the camera adapter.

Example 1 If a camera has a privacy filter with an attribute set to None, then by itself, that filter will have no effect. However, when combined with the privacy filter of an active query, it can combine to elevate the privacy state. For example, a camera with a default privacy filter of {All, None, None} will not result in any effective privacy state when a query is evaluated with its images. However, a query with privacy filter {QueryTargets, QueryActive, Dynamic} will evaluate to the effective privacy state {All, QueryActive, Dynamic}. The difference is that all dynamic objects will be obscured, instead of only the ones that are query targets.

The attribute values of a privacy filter apply at the camera level as follows. The Target parameter specifies whether the filter applies to objects that are the targets of active queries (QueryTargets), objects that are not targets of active queries (NonQueryTargets), all objects defined over the camera, i.e., static, dynamic, and cross-camera dynamic (All), or no objects (None). PreviouslyMasked refers to objects that previously qualified to be included in a privacy filter (i.e., they were a non-query target in a camera with a NonQueryTargets attribute set, were a query target with the QueryTargets attribute set, etc.). We note that, from the perspective of a camera, an active query is a query that (1) has been issued to the LVDBMS system; (2) is not expired; (3) has not evaluated to a condition that executes an action that causes the query to terminate; and (4) has an operator that specifies as input object(s) from said camera.

A query target is an object that appears in the field of view of a camera and satisfies two conditions: (1) it is a static object, dynamic object, or cross-camera dynamic
object associated with said camera, and (2) it is referenced (as an operand; directly in the case of a static object or indirectly in the other two cases) by an operand in an active query over the camera in which it appears. In Fig. 8, dynamic object D121 is contained within an active query and is a query target. D102 is not associated with a query and has the status NonQueryTargets. The ObjectScope privacy filter attributes are the different classes of objects identified by the LVDBMS, explained previously. The TemporalScope attribute of a privacy filter applies at the camera level as either None or Permanent; we currently do not support a more granular temporal operator.

3.8 Privacy filters applied to queries

Privacy filters may also be associated with queries. Default privacy filters may be configured at the system level (in the configuration corresponding to the stream processing layer host), or a filter may be specified when the query is instantiated. Once a query is associated with a privacy filter, that privacy filter is retained with the query for the lifetime of that query. A query's privacy filter is kept in the stream processing layer query executive along with other query metadata, such as the number of sub-queries a query has been decomposed into, which stream processing layer hosts the query has been sent to, etc. When applied to a query, the TemporalScope attributes (Table 2) have only two distinct behaviors, None and Permanent (QueryActive is treated as equivalent to Permanent, since both refer to the lifespan of the query).

Example 2 A Department of Transportation (DOT) Traffic Management Center (TMC) makes available live video feeds from cameras mounted along major roadways for broadcast on nightly television news segments. The TMC provides these video feeds to allow the public to observe traffic pattern trends in real time (such as how quickly traffic is moving across a particular bridge), and for news broadcasters to announce traffic incidents causing lane blockages. However, TMC personnel do not want to embarrass individuals involved in specific traffic incidents, or to broadcast identifiers such as license plate numbers. The TMC objective is to provide real-time video with a blur applied to all objects. Object types (e.g., car, truck, and pedestrian) and color can be distinguished by observers, but not individual identifiers such as faces and license plate numbers (Fig. 9). (Note that only TMC personnel can control camera functions such as zoom.) This is accomplished by creating a query with an Appear() operator and a corresponding privacy filter {None, None, All}. A privacy filter must be associated with the query because, by default, the query would run with the privacy filter of the user who created it. In the case of a TMC operator who does not have her view restricted, a query from the TMC operator would not, by default, have any privacy filter applied to it.

3.9 Privacy filters applied to users

Default privacy filters may also be defined that apply to users of the system who connect through the client GUI and can browse video cameras. When a user's GUI connects to the stream processing layer host, the connection is registered with the session manager, which records connectivity information such as client IP and port, starts a heartbeat service, and associates a privacy filter with the registration if necessary. The heartbeat service keeps track of clients who are connected to the system and deallocates resources for clients who disconnect. Clients can run queries which do not have actions specified, but continuously return evaluation results to the GUI for the user to watch. Such queries will be aborted if the corresponding client remains disconnected for a period of time. When applied to a user, the TemporalScope attributes (Table 2) have only two effective behaviors, None and Permanent (where Permanent refers to the time the client is connected to the LVDBMS). The other attributes behave as described previously.

Example 3 A security monitoring application is written for the LVDBMS. It has a predefined set of queries the security guards can select from and view on their GUI. They can also monitor any camera associated with the system, but only see the identities of people who satisfy the query conditions (for example, someone who has been standing in the same place for more than 5 min). If a security guard watches that video feed, the person who has not moved will be unobstructed, but people walking nearby will have their image masked. After 5 min, a query action
Fig. 9 The two dynamic objects depicted here have their details obscured by the privacy filter
is triggered that records the video; this query is not associated with the security guard's privacy filter and records the entire camera view without obstructions.

3.10 Combining privacy filters

Privacy filters associated with cameras, users, and queries must all be combined to determine which objects in which frames of the video will have their identity obscured when the video is viewed by a user or saved to a persistent file. When a user requests to view video from a camera, the user's privacy filter is sent to the corresponding camera adapter. It is combined with the camera's and view's privacy filters (if applicable), and then the user's GUI connects directly to the camera adapter to receive live video. In the case of a query, the camera's privacy filter is sent as metadata, along with the image descriptor, to the corresponding spatial processing layer host. If a query action requires a live video stream (e.g., to save it to disk or direct it to a video monitor), then the query's privacy filter will be pushed down to the spatial processing layer host which is connected to the camera adapter.

As we have discussed, when privacy filters interact at multiple levels, the effective privacy filter must be calculated. Each attribute in the privacy filter 3-tuple has a value which is assigned a priority (or no value is specified, in which case that attribute is not factored into the privacy calculation). When combining attributes, the highest-priority attribute value is taken, where a higher priority corresponds to more object observations being redacted from the output video stream. Attribute priorities are specified in the Priority column in Tables 1, 2, and 3. When combining privacy filters, if they have different attribute values with the same priority for the same attribute, then the effective attribute value chosen is the value at the next higher priority for that attribute. This procedure must be repeated each time a new query becomes associated with a camera object, or a query expires.

Example 4 Given a camera object with privacy filter {NonQueryTargets, QueryActive, CrossCameraDynamic} and a query object associated with the camera with filter {QueryTargets, QueryActive, Dynamic}, the effective privacy filter will be {All, Permanent, All}. That is, the priorities are {2,2,2} and {2,2,2}; where the tuples have different attribute values that are of equal priority, this is reconciled by giving the attribute the value at the next higher priority.
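A compact sketch of this combination rule for a single attribute, using the Target priorities of Table 1, follows; the dictionary and method names are illustrative assumptions, and the same logic would apply to Tables 2 and 3.

using System.Collections.Generic;
using System.Linq;

static class FilterCombining
{
    // Priorities follow Table 1 (Target attribute).
    static readonly Dictionary<string, int> Priority = new Dictionary<string, int>
    {
        { "None", 1 }, { "QueryTargets", 2 }, { "NonQueryTargets", 2 },
        { "PreviouslyMasked", 2 }, { "All", 3 }
    };

    // Effective value for one attribute when two filters are combined.
    public static string Combine(string a, string b)
    {
        if (a == b) return a;                  // identical values: unchanged
        int pa = Priority[a], pb = Priority[b];
        if (pa != pb) return pa > pb ? a : b;  // higher priority wins outright
        // Distinct values of equal priority: promote to the value at the
        // next higher priority level.
        return Priority.Where(p => p.Value > pa)
                       .OrderBy(p => p.Value)
                       .First().Key;
    }
}

Under this sketch, Combine("NonQueryTargets", "QueryTargets") yields All, matching the Target attribute in Example 4.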
3.11 Tracking based upon a multifaceted object model

In order to implement the cross-camera dynamic object operand in LVSQL, we developed a camera-to-camera tracking technique based upon constructing an appearance model of the objects appearing in video streams. Objects are tracked from frame to frame using a traditional tracking technique (e.g., [32]), which we refer to as a frame-to-frame tracker since it tracks objects within a single video stream. When an object appears in a consecutive sequence of frames, the frame-to-frame tracker assigns a unique identifier to the object as a part of the tracking process. A feature vector based upon the object's appearance is also calculated. An object is represented as a bag of multiple instances [6, 7], where each instance is a feature vector based upon the object's visual appearance at a point in the video stream. Therefore, an object can be viewed as a set of points in the multidimensional feature space, referred to as a point set (Fig. 10). Note that the k instances in a bag are derived from k samplings of the object, which may not necessarily be taken from consecutive frames.

A FIFO database holds the multiple-instance bags of objects recently detected by the different cameras in the system. As a new observation becomes available, the bag itself is updated by adding the new instance and removing the oldest instance. As surveillance queries generally concern real-time events that have occurred recently, the FIFO database is typically very small, and in our prototype we implemented it as a distributed in-memory database system (distributed among spatial processing layer hosts).

Cross-camera tracking is performed as follows. When an object is detected by a camera, its multiple-instance bag is extracted from consecutive frames in the video stream and used as an example to retrieve a sufficiently similar bag in the distributed object-tracking database. If there exists another bag sufficiently close, based upon the squared distance metric, then the two bags are considered to correspond to the same object. On the other hand, if the system does not find a sufficiently similar bag, the occurrence of
Fig. 10 Multifaceted object representation model in which an object is represented by its point set (i.e., feature vectors)
this newly detected object is considered as the object's first appearance in the system.

To support the retrieval operations, the distributed in-memory database needs to compute similarity between bags. Given two bags of multiple instances X = {x_1, x_2, …, x_k} and X′ = {x′_1, x′_2, …, x′_k}, where k is the cardinality of the bags and each instance is a feature vector, we compute their similarity as follows:

d_m(X, X′) = min_{(s_i, s′_i)} Σ_{i=1}^{m} ‖x_{s_i} − x′_{s′_i}‖²

where m ≤ k is a tuning factor, the minimum is taken over pairings (s_i, s′_i) of instances from the two bags, and ‖x_{s_i} − x′_{s′_i}‖² is the squared distance between the two vectors. This distance function computes the smallest sum of pairwise distances between the two bags. Although we can set m = k, a smaller m value is more suitable for real-time computation of the distance function. For instance, if m = 1, two objects are considered the same if their appearances look similar according to some single observation. We set m = 5 in our study.

Traditionally, each object is represented as a feature vector, i.e., a single point, instead of a point set, in the multidimensional feature space. This representation is less effective for object recognition. For clarity's sake, let us consider a simple case in which two different persons currently appear in the surveillance system. One person wears a two-color t-shirt, white in the front and blue in the back. The other person wears a totally white t-shirt. If the feature vectors extracted from these two persons are based on their front view, the two would be incorrectly recognized as the same object. In contrast, the proposed multifaceted model also takes into account the back of the t-shirt and will be able to tell them apart. The bag model is more indicative of the objects.

The FIFO database is implemented as a series of queues and hash tables residing in spatial processing layer hosts. Each host maintains indices (hash tables) of the bags of objects observed in video streams from corresponding camera adapters. Indices associate objects with the specific video frames in which they were observed, video frames with video streams, objects with bags, objects with queries over the objects' corresponding video streams, etc. Objects appearing in two separate video streams will initially have two separate bags in the index and two separate identifiers (camera adapter identifier, local object tracking number, and host identifier concatenated into a string). If the two objects are determined to be the same object, their bags are merged and the index updated such that both object identifiers point to the same (merged) bag of observations.
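The following sketch renders this bag distance in code, with a greedy pairing heuristic standing in for the exact minimization; the paper does not specify the search procedure, so the greedy matching and all names here are our illustrative assumptions.

using System;
using System.Collections.Generic;
using System.Linq;

static class BagDistance
{
    static double SquaredDistance(double[] a, double[] b) =>
        a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

    // Approximates d_m(X, X'): the sum of the m smallest pairwise squared
    // distances, greedily pairing each instance at most once.
    public static double Distance(List<double[]> bagX, List<double[]> bagY, int m = 5)
    {
        var pairs = new List<(int i, int j, double d)>();
        for (int i = 0; i < bagX.Count; i++)
            for (int j = 0; j < bagY.Count; j++)
                pairs.Add((i, j, SquaredDistance(bagX[i], bagY[j])));

        var usedX = new HashSet<int>();
        var usedY = new HashSet<int>();
        double total = 0;
        int taken = 0;
        foreach (var p in pairs.OrderBy(p => p.d))
        {
            if (usedX.Contains(p.i) || usedY.Contains(p.j)) continue;
            usedX.Add(p.i);
            usedY.Add(p.j);
            total += p.d;
            if (++taken == m) break;        // m is the tuning factor (5 in the study)
        }
        return total;
    }
}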
Example 5 Cross-camera tracking allows queries to be issued that consider multiple cameras to determine if an event has occurred. For example, consider employees who work in a building that does not permit smoking inside, but has a back door and a bench next to the door for smokers to sit. When someone comes out of the building and sits at the bench, we assume it is an employee. When someone comes down a nearby street and waits by the door, their motives are unknown to us and a security guard should be notified to observe the situation. In the LVDBMS, this can be accomplished by creating a query over both cameras with a cross-camera dynamic operand, to detect when someone appears in the street camera (which does not observe the bench or door) and then appears in a camera observing the back door and bench. Thus, query targets are objects appearing first in the one camera and then in the second. An associated privacy filter would obscure non-query targets (objects appearing in only one camera, or appearing in the back door camera and later in the street-view camera). The privacy filter would be {NonQueryTargets, None, None} and the query is Before(Appear(V1.#), Appear(V2.#)), where V1 shows the street and V2 the back door.
4 Evaluation

This section describes the experimental conditions in which the LVDBMS software was evaluated and the experimental results. The focus of the LVDBMS is on real-time environments, as opposed to a system that operates only with pre-recorded video. Thus, for periodic activities, the time required to evaluate a query must be less than the interval at which the query is evaluated; otherwise the query evaluation queue will grow unboundedly and query results will not be returned in a timely fashion. In addition, because video streams entering the system from cameras are unbounded (the camera can always be turned on and transmitting video), only a small amount of data can be retained within a sliding window before it must be discarded to provide room for new data. Thus, implementing privacy protection in real time is different from, and more challenging than, doing so off-line because (1) the time available to carry out data processing operations is bounded, and (2) due to storage limitations, only a small portion of the video data may be retained at any particular time in its raw (unsummarized) format and must be processed in one pass through the data. In off-line processing, the video data is stored and can be processed with multiple passes over the data, for example, to create an index structure to be used in a later processing stage.

4.1 Experimental setup

To test the effectiveness of the LVDBMS, we utilize three sets of videos, where each video set satisfies a different
objective. We number the data sets from 1 to 3 as follows. (1) We created a series of reference videos by placing cameras in three locations in a campus building: inside two laboratories (rooms) and in a hallway. Each laboratory has slightly different lighting with no external windows, and the hallway has exterior windows along one wall. This provides reference videos with changing lighting conditions, and the subjects at times are obscured by desks, chairs, and tables, creating a challenging environment for tracking objects from one camera to another. This series of videos involved 5 people, with on average 2 or 3 people appearing in the field of view at any particular time. (2) Videos from the CAVIAR project (http://homepages.inf.ed.ac.uk/rbf/CAVIAR/) are utilized. These are low-resolution videos that provide coverage of the same scene from two different views, front and side. This is challenging because the video resolution is small by today's standards, and the objects appearing in the videos have relatively few pixels to contribute toward building the appearance model (bag of feature vectors). (3) We created a series of videos recording traffic on roads (cars, trucks, and a few pedestrians were observed). Automobiles are rigid objects that do not change shape while we observed them driving, and they move in patterns constrained by the road. This scenario provides an excellent testbed for spatial operators, such as Appears, North, West, etc., with approximately perfect tracking accuracy within a video stream.

We evaluated the LVDBMS with pre-recorded videos so that the same conditions could be simulated with different configuration parameters. In our LVDBMS implementation, image frames are presented to the camera adapter in one of two ways: an initial processing thread either reads the image data from a memory buffer that is written to by a device driver for the camera hardware, or the image data are extracted from a video file by a video codec. Once a frame is extracted, it is enqueued in a new frame queue. A second processing thread retrieves frames from the new frame queue and proceeds to identify foreground pixels from background pixels, etc. Once a frame of image data has been enqueued on the new frame queue, it is thereafter indistinguishable whether the frame was extracted from a live camera or from a pre-recorded video file. Therefore, once the frame has been received from its source, the frame's original video source is indistinguishable to the rest of the system processing pipeline and has no effect on query processing or other system behavior.

For these experiments, the frame-to-frame tracker is configured to ignore detected objects less than 200 pixels in area, which we consider noise (this parameter is configurable at the camera adapter level). Software in all tiers takes configuration settings from XML files and, to facilitate scripting, also accepts command-line arguments. LVDBMS core components are implemented in C# and
utilize Language Integrated Query (LINQ) to maintain some internal queues and hash indexes. For the experimental results presented in this article, the LVDBMS server layers ran on a Windows 7 computer with a 3 GHz Pentium IV CPU with Hyper-Threading and 3 GB RAM (Dell Precision 370), and the camera adapters ran on a 2.54 GHz Core 2 Duo Latitude E6500 with 4 GB RAM. For the eight-camera experiment discussed in Sect. 4.4.1, the camera adapters were hosted on a Windows 7 HP Pavilion laptop with a 2.3 GHz quad-core CPU and 4 GB RAM. We use Emgu CV, a .NET wrapper for the Intel OpenCV library, for low-level visual operators, in conjunction with the Intel Integrated Performance Primitives (IPP) library.

4.2 Effectiveness of privacy filters

The purpose of privacy filters is to obscure qualified objects in videos from being identified. In this section, we provide an example of how an object that is obscured by a privacy filter is displayed to users via the LVDBMS client. Figure 11 illustrates two separate situations where privacy filters obscure the identification of detected objects. On the left-hand side, a person is walking toward a door, and on the right-hand side a vehicle is traveling down a street in a traffic-counting application. In these examples, objects are obscured by blurring the pixels contained in the bounding boxes. Applying a blur maintains a visually appealing image in which obscured objects do not significantly stand out, but other options, such as an adaptive blur based upon the size of the bounding box, or simply setting the entire rectangle to a solid color such as black, are possible depending upon how much the appearance of the image should be obscured from the video stream (this choice is not a focus of this work). Additional privacy-preserving measures, such as increasing the size of the box to mask the size and shape of the object being protected, are another option, with the tradeoff of decreasing the utility of the video (as more of the video is obscured from the viewer).

Privacy filter effectiveness is a function of the effectiveness of the object detection logic and, depending upon the query, the tracking logic. In the left image in Fig. 11, the person walking satisfies the query condition Appears(), as do several false positives (FPs) identified by the background segmentation algorithm due to a person walking through a door and the door then closing. In this case, the relevant object's identity is obscured, as well as four FP areas (which can be reduced by adjusting camera parameters), and the privacy condition is satisfied.
Fig. 11 Examples of objects' identities obscured by privacy filters. The left image is from data set (1) and the right from data set (3)
Table 4 Continuous query evaluation results

Query name   Description                                                           Accuracy
Appear       True if an object with area greater than 100 pixels appears in
             the frame, else false                                                 100%
…            True if there exists an object moving with downward motion. The
             Before operator has a window size of 20 frames; if the object
             stops or changes direction for fewer than 20 frames it is still
             considered true                                                       100%
…            A person appears in camera 1 and is then recognized when they
             appear in a second camera                                             …
…            An object appears in camera 1 and then goes through a door
             (outlined by a static object) in the second camera                    …
The accuracy with which privacy filters correctly obscure the appearance of an object in a video stream is related directly to the specification of the privacy filter and, if it is associated with a query, to the query itself. For example, a privacy filter that obscures all objects in a video stream depends upon the object segmentation algorithm to correctly distinguish objects from the video background. Today's background segmentation algorithms are very accurate, and in the cases where they err (such as with complex moving backgrounds), the erroneous regions would be incorrectly considered objects and have their appearances obscured. Likewise, tracking algorithms that follow objects visible in consecutive frames of video from a single camera are also accurate, though less accurate than simple foreground/background extraction. Tracking objects from one camera to another is a more difficult problem still and, as indicated in Table 4, correspondingly less accurate. If two objects appear visually identical in two cameras, it is difficult to determine whether they are in fact the same object without an additional aid such as a security ID card with an RFID tag. In order to maximize privacy in these latter situations, one could construct a privacy filter that obscures all visible objects (rather than filtering based upon query-target or non-query-target status), thus minimizing the effect of query accuracy on privacy protection. If an object has an appearance that is sufficiently similar to the background, it will not be detected by the background segmentation algorithm and will not have a privacy filter applied to it. As soon as it moves, or its appearance changes such that it looks sufficiently distinct from the surrounding background, it is recognized as a salient object by the LVDBMS and any applicable privacy filters are applied to it.

4.3 Privacy filter demonstration

This section provides several scenarios showing the application of privacy filters. The first demonstrates a transportation application in which a Transportation Management Center (TMC) operator's terminal and a live news feed originate from the same traffic camera. The query Appears(v1.*, 250), which monitors for objects in the video stream sized 250 pixels or greater, is active on the camera v1. The live news feed is served through a view with an associated privacy filter specifying that query targets should be obscured, as illustrated in Fig. 12. Figure 13 shows a screenshot of the traffic camera video feed as the TMC operator would observe it. Through the LVDBMS, the operator views images obtained directly from the camera, which is not associated with any privacy filters. The live video provided for television, however, can only obtain video from the view view1. This view has a privacy filter associated with it that applies to all objects that are query targets, that is, objects that might contribute to a query evaluating to true. This privacy filter has an effect only when a query is active (in this example, the query monitors for objects appearing in the video stream that are larger than a specified area in pixels). Figure 14 provides an example of video observed through view1 with the query active.
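The target-selection logic in this demonstration can be summarized with a short sketch; the PrivacyFilter type and the scope names below are hypothetical stand-ins, not LVDBMS API.

```python
# Minimal sketch of deciding which detected objects a filter obscures;
# names and scope strings are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class PrivacyFilter:
    scope: str  # "all", "query_targets", or "non_query_targets"

def objects_to_obscure(objects, query_targets, filt, query_active):
    """Return the subset of detected objects this filter obscures."""
    if filt.scope == "all":
        return set(objects)
    if not query_active:
        # A target-scoped filter has an effect only while a query is active.
        return set()
    if filt.scope == "query_targets":
        return {o for o in objects if o in query_targets}
    return {o for o in objects if o not in query_targets}
```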
Fig. 12 Transportation example showing a video source, v1, providing live video which is consumed by a TMC operator and live television. The table (right) shows privacy filters associated with various objects in this scenario
Fig. 13 Unmodified video originating from camera v1, as viewed by the TMC operator, who does not have a privacy filter
Fig. 14 Live video as viewed through the view view1, with privacy filter and active query
The next example (Fig. 15) demonstrates privacy filters at multiple levels of the LVDBMS hierarchy. A name plaque is mounted in a corridor and must be obscured in video streams sent to all consumers of this video source. To accomplish this, a static object is defined over the plaque and a privacy filter is associated with the camera (Fig. 16). This camera-level privacy filter is propagated to all consumers of this video stream and factors into the effective privacy filter computation for each consumer's video stream. The camera is accessed by two users, User 1 and User 2. User 1 is not explicitly assigned a privacy filter; User 2 has been assigned a privacy filter that is applicable to all objects of type dynamic. (For example, User 2 might only need to recognize general activity and notify a supervisor, User 1, when a closer review is required.) When the video is viewed from User 1's terminal, the camera-level privacy filter is propagated and applied to the static object drawn around the plaque. The result is that the plaque is obscured in the video output on User 1's terminal (Fig. 17). User 2 is explicitly assigned a privacy filter that applies to all dynamic objects in the video stream; the camera-level privacy filter is likewise propagated to User 2 by the LVDBMS automatically. When the frames are rendered into the video stream for User 2, the privacy filters are combined (per the discussion in Sect. 3.10) and the effective privacy filter applies to all objects identified in the video stream, as illustrated in Fig. 18. Both terminal images in Fig. 18 are from the same video stream, but illustrate two different system configuration settings, depending upon how much detail should be revealed about obscured objects. The upper image has a blur operator applied, which blurs identifying features while providing the operator substantial visual information for observing behaviors. The lower terminal image simply fills the bounding box with the average pixel color of the region to obscure.
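One way to read the combination rule illustrated by this example: every filter that applies to a stream contributes the set of objects it obscures, and the effective filter is their union, i.e., the most stringent combination. The sketch below assumes this union semantics; Sect. 3.10 gives the actual combination rules.

```python
# Sketch of combining cascaded privacy filters (assumed union semantics).
def effective_obscured(frame_objects, cascade):
    """cascade: callables mapping detected objects to the set to obscure."""
    obscured = set()
    for filt in cascade:
        obscured |= filt(frame_objects)
    return obscured

def camera_filter(objs):
    # Camera-level filter over the static object drawn around the plaque.
    return {o for o in objs if o == "plaque"}

def user2_filter(objs):
    # User 2's filter over all dynamic objects.
    return {o for o in objs if o != "plaque"}

# User 2's stream: every identified object ends up obscured.
print(effective_obscured({"plaque", "person1"}, [camera_filter, user2_filter]))
```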
Fig. 15 Surveillance scenario illustrating the application of multiple layers of privacy filters to different types of objects
Fig. 17 Video stream, as observed by User 1, with the static object obscured with a solid pattern. The manner in which an object is removed from the video stream (solid box or blur) is configurable

Fig. 18 Video stream, as observed by User 2, with blur (upper) and solid (lower) patterns obscuring objects in the video stream
4.4 Query evaluation accuracy for event monitoring tasks

Two important aims of the LVDBMS are overall usability and the ability to specify privacy policies in terms of privacy filters. To be relevant to surveillance applications, the ability to define queries that detect noteworthy events is important. In addition, privacy, and the ability to maintain some level of privacy for objects (i.e., people, identifiable automobiles, etc.) as surveillance systems become more automated and pervasive, is also important. Thus, in this study, we redesigned the LVSQL query language to be more concise and easier to use. Query accuracy is the accuracy with which the LVDBMS detects user-posed events of interest. In the experiments in this section, we test the ability of the LVDBMS to correctly interpret and evaluate four continuous queries, i.e., whether the conditions in the video are true when the query indicates a true condition. Query results are tabulated by manually monitoring the videos, taken from dataset (1), together with the query results in the LVDBMS GUI, and recording the outcome every 5 s (by incrementing TP, TN, FP, or FN counts). Each query is evaluated over a 2-min period.
Fig. 19 Cost to evaluate five simultaneous queries in terms of CPU time
As expected, the accuracy for queries involving a single video stream is extremely high. The accuracy of queries that correlate objects across multiple camera views is tied to the accuracy of the underlying cross-camera tracking infrastructure, as reflected in the two cross-camera query experiments. The short (2-min) experiments allow only a few instances of each object to be observed and reflected in the index; however, even with only a few bags representing objects in the index, there were no FPs or false negatives (FNs) that could be attributed to mis-associating an object in one video with the wrong object in the other video. To determine query evaluation performance, we constructed four queries, two single-camera queries and two multi-camera queries, and present the results in Table 4.

4.4.1 Query processing performance

The resolution of a continuous query is the frequency at which it is evaluated. In order to be usable in a real-time system, query processing must complete within a bounded amount of time so that query evaluation does not become backlogged (and thus out of sync with the video images a user observes) with respect to frames from streaming video, index updates, etc. We evaluate the performance of the query processing engine by simultaneously evaluating five queries for a period of 120 s over a random selection of ten videos. Figure 19 provides the results of evaluating the five continuous queries over each video, with results combined into a single plot. Table 5 presents summary metrics for the data plotted in Fig. 19, normalized per query (divided by five). The dataset each video came from is indicated after the video's name in the table. Note that the cost to evaluate a query is a function of the input to its respective operators; some operators, such as AND and OR, implement short-circuit evaluation and only evaluate the second argument if the value of the first is insufficient to determine the operator result.
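The short-circuit behavior noted above can be sketched as follows; the operator classes are hypothetical stand-ins for LVSQL operator implementations, not the system's actual code.

```python
# Sketch of short-circuit evaluation in composite query operators.
class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, state):
        # The right operand is evaluated only when the left is True, so
        # the cost of a query depends on the input to its operators.
        return self.left.evaluate(state) and self.right.evaluate(state)

class Or:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, state):
        # Symmetrically, the right operand is skipped when the left is True.
        return self.left.evaluate(state) or self.right.evaluate(state)
```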
Table 5 Average query evaluation cost in terms of CPU time, per query, in milliseconds

Movie                      Min    Max    SD    Avg
SR436_M2U00040 (3)         0.40   5.60  0.73  0.78
OneShopOneWait1front (2)   0.40  30.81  6.26  4.58
ShopAssistant2cor (2)      0.40  21.24  2.72  2.49
TwoEnterShop1cor (2)       1.60  12.00  1.68  2.42
TwoEnterShop1front (2)     3.40  26.60  3.05  5.97
TwoEnterShop3cor (2)       0.40   7.00  0.90  0.87
TwoLeaveShop1cor (2)       1.40  15.60  2.75  3.47
TwoLeaveShop2cor (2)       0.40  14.00  1.86  1.38
Walk2 (2)                  0.40   1.80  0.40  0.72
WalkByShop1cor (2)         0.40   3.00  0.49  0.73
The data reported in Table 5 are based upon five queries and one video at a time. We repeated this experiment with eight simultaneous videos, and the performance results were relatively unchanged from those in Table 5. For the video resolution utilized in this experiment, eight is the maximum number of camera adapter instances that could be run on a 4-core host without the frame processing rate dropping to an unacceptable level (we consider approximately five frames per second manageable; lower processing frame rates could lead to object segmentation and tracking errors, for example). Once the video stream has been processed by the camera adapter, the corresponding spatial processing layer host receives a stream of object size and position updates, along with video frames. The frame-to-frame tracking and background segmentation processes in the camera adapter processing pipeline are the most CPU-intensive stages in the LVDBMS data flow. Compared to the video data received by the camera adapters, the quantity of data that flows to the spatial processing layer, and then to the stream processing layer, is substantially less at each phase.
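To illustrate why data volume shrinks at each layer, the following sketch shows plausible message shapes for the inter-layer streams; these are assumptions for illustration, not the actual LVDBMS wire format.

```python
# Assumed message shapes between LVDBMS layers: compact object updates and
# small boolean sub-query results are far smaller than raw video frames.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectUpdate:
    """Camera adapter -> spatial processing layer, one per tracked object."""
    object_id: int
    bbox: Tuple[int, int, int, int]  # x, y, width, height in pixels

@dataclass
class SubQueryResult:
    """Spatial layer -> stream processing layer, once per query resolution."""
    query_id: int
    value: bool
```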
The spatial processing layer performs index updates and query evaluations, which are not CPU intensive, and in turn sends sub-query evaluations to the stream processing layer at the resolution of each query (e.g., once each second). The data presented in Fig. 19 and Table 5 show results from a mixture of queries, all of which evaluated well within the query resolution of 1 s throughout the evaluation period. Query evaluation entails computing operator values, which requires operand lookups within index structures, and finally updating metadata for objects to indicate query targets. What we want to emphasize with these results is that, over a wide variety of input videos, query execution on average completes well below the 1-s query resolution. Had query execution exceeded 1 s, query results would be out of sync with the video frames presented to the user via the client.

4.5 Multi-camera object tracking for privacy filter correctness

This section provides camera-to-camera tracking results for our tracking technique, which relies on a multifaceted object model built from an object's appearances in the video streams. An essential feature of the privacy framework is the ability to construct a query and use it to either include or exclude dynamic objects from a privacy filter. Thus, object tracking and cross-camera object tracking (when a privacy filter or corresponding query is formulated to make use of such functionality) correlate positively with privacy filter accuracy. Figures 20 and 21 present accuracy results from two sequences of videos, from dataset (1), that involve tracking people across a three-camera setup in a laboratory environment as described in Sect. 4.1. In order to maximize the number of results to present, in this section we query the index for each observation of each object in each frame of video. That is, for each frame, we query the index for the first nearest neighbor of the query point (i.e., the object's feature vector) and return the result if it is sufficiently close; otherwise, nothing is returned. If a result is returned and it is the correct object, the true positive (TP) count is incremented; otherwise, the false positive (FP) count is incremented. Likewise, if no result is returned, the true negative (TN) count is incremented if the object is not in the index; otherwise, we increment the false negative (FN) count. Next, the bag corresponding to the object is updated to include the currently queried instance (based upon the cluster identifier assigned by the frame-to-frame tracker). This process is repeated for each frame in the video. We report accuracy, which jointly reflects precision and recall:

Accuracy = TP / (TP + FP + FN)

(The accuracy equation does not consider TN because if an event does not occur, it will not be detected. Furthermore, if an event does not occur but we claim that it did, that is considered a FP, which is a factor in the equation.)
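The per-frame evaluation loop just described can be sketched as follows; the index interface (nearest, contains, add_observation) and the distance threshold standing in for "sufficiently close" are assumptions for illustration.

```python
# Sketch of the per-frame tracking-evaluation protocol described above.
def evaluate_frame(index, observations, threshold, counts):
    """observations: (tracker_id, feature_vector) pairs for one frame,
    where tracker_id is the cluster identifier assigned by the
    frame-to-frame tracker; counts: dict with keys TP, FP, TN, FN."""
    for tracker_id, vec in observations:
        match, dist = index.nearest(vec)  # 1st-nearest-neighbor query
        if match is not None and dist <= threshold:
            counts["TP" if match == tracker_id else "FP"] += 1
        else:
            # Nothing sufficiently close was returned.
            counts["TN" if not index.contains(tracker_id) else "FN"] += 1
        # Update the object's bag with the newly observed instance.
        index.add_observation(tracker_id, vec)

def accuracy(counts):
    total = counts["TP"] + counts["FP"] + counts["FN"]
    return counts["TP"] / total if total else 0.0
```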
As we see from the accuracy indicated in Figs. 20 and 21, initially the feature space is sparse and the bag representations contain few points (and thus have small corresponding standard deviations along the various dimensions). As more observations are added to the bags in the index, the bag representations become more indicative of what we are likely to observe of a particular object in the future, and the accuracy stabilizes. The object-tracking technique we present is based upon the visual appearance of an object, and its accuracy degrades as more objects appear. Even though a FIFO queue is utilized to limit the duration of time an object is taken into consideration for tracking purposes, when many objects appear in the video streams simultaneously, the likelihood increases that some of the objects will look sufficiently similar that they may be mistaken for one another, resulting in decreased accuracy.

5 Related work

An LVDBMS encapsulates work from a multitude of domains, including continuous query language development and
computer vision techniques such as object detection and tracking. For completeness, we include a review of recent video surveillance-related topics.

5.1 Privacy considerations

As cameras become pervasive, improved video surveillance systems will be required to overcome the limitations imposed by direct and continuous human monitoring. This will result in increasing volumes of video being processed, published, monitored, and stored. References [5, 10] suggest that privacy is a function of what is deemed culturally and socially acceptable by society. Several privacy-aware systems have been developed that can detect movement and mask it for privacy considerations. For example, in [27] pedestrians are obscured with multicolored blobs, where the color signifies a status, such as having crossed a virtual trip wire. Reference [15] develops an MPEG-4 transcoder and decoder to mask objects in a video stream based upon movement. While these systems increase privacy by masking object identities, they are not helpful in fighting crime because the obscurity is irreversible. Furthermore, they do not provide functionality to determine whether an object should indeed be masked in the output video stream.

Large collections of data provide data mining opportunities for discovering global trends, decision making, capacity planning, building machine learning classifiers, etc. Data in its original form, such as hospital patient demographic data, contains information whose disclosure violates individual privacy. Privacy-preserving data publishing (PPDP) proposes algorithms that make data available for mining global trends while preserving individual privacy [4]. These techniques range from monitoring the individual queries issued to perturbing the data in various ways. For example, an attacker might try to identify a patient's record in a public data set. The majority of research on privacy control methods focuses on statistical databases containing tabular data. Security control methods generally entail query restriction, data perturbation, and output perturbation [2]. Query restriction entails monitoring queries, e.g., the number of queries submitted by a particular user, the amount of overlapping data queried by a user, etc. Data perturbation entails modifying data values stored in the database, such as replacing the ages of people with the average age by zip code. Output perturbation involves injecting error into the query result. Thus, there is a tradeoff between accuracy and confidentiality: inducing higher error results in a lower likelihood of identifying particular data values, but also in more skewed aggregate results.
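As a toy illustration of output perturbation (not a technique used in our framework), one can add Laplace noise to an aggregate query result, in the spirit of differential privacy [8]; the scale parameter makes the accuracy/confidentiality tradeoff explicit.

```python
# Illustration of output perturbation: a noisy aggregate query result.
import random

def perturbed_count(true_count, scale=2.0):
    # The difference of two i.i.d. exponential draws is Laplace noise;
    # larger scale means more privacy but more skew in the result.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

print(perturbed_count(128))  # a value near 128, but rarely exact
```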
When privacy filters are applied to video streams, the effect is a type of data perturbation. In an ideal scenario, the modified streams should not reveal anything about the individuals appearing therein [5]. However, [8] has shown that an absolute guarantee of privacy is unachievable in the presence of external auxiliary information. Some recent works, such as [26], investigate identity leakage through implicit inference channels, such as time of day combined with camera location. For example, if a camera shows an office door and one observes a blurred figure entering at 8 a.m. and leaving at 12 p.m., one can assume the obscured person and the person assigned to that office are the same. Thwarting this type of attack on privacy is beyond the scope of the method we propose in this study. Our primary aim is to make objects appearing in a video stream indistinguishable from one another in accordance with the current privacy specification. We note, however, that within our framework, identifiers such as office door numbers and placards can be defined as static objects, and an appropriate privacy specification can be defined to redact them from the output video stream.

In this study, we present a flexible privacy framework whose goal is to protect individual privacy while providing data streams that can be queried for events as accurately as possible. Thus, we choose to perturb the output data in some ways (i.e., obscuring objects with bounding boxes of varying degrees of tightness) but not others (such as skewing the video in the time domain, adding ghost objects to hide when real ones appear, etc.).

5.2 Object detection and tracking

There are many existing video surveillance systems over networked cameras, e.g., [3, 12]. In particular, object recognition and tracking is a core component of these systems, forming a basis for high-level analytic functions for scene understanding and event detection. Since cameras have a limited resolution and field of view, multiple cameras may be required to provide coverage over the area of interest. Typically, the fields of view of adjacent cameras do not overlap due to economic, environmental, or computational constraints. These practical factors make tracking objects moving across multiple cameras a great challenge. Existing multi-camera tracking environments [14, 18, 19, 21, 22, 28, 30, 32] require various types of calibration and/or information about the spatial relationships between the cameras to be configured into the system as known parameters. They assume overlapping fields of view among the cameras, or non-random movement patterns. In the latter scenario, when an object moves from the field of view of one camera into that of the next, the object can be recognized in the second camera by taking into consideration its speed and trajectory when it exits the field of view of the first camera [21, 22]. This strategy is only applicable to non-random movement patterns such as
objects constrained by roads, walls, etc., and cannot be used for general-purpose applications. Consider cameras deployed in the rooms of a building. When a person leaves a room, this person can go to any of the other rooms, including simply returning to the original room. The aforementioned techniques would fail to recognize the object when the person reappears in another camera. One technique, presented in [9], can detect and track multiple objects within a large environment, but it is trained to work only with automobiles. To support a general-purpose video surveillance system, we consider an information fusion approach in this article, in which the cameras share signature information about the objects they detect but do not recognize. In this environment, each moving object is represented by a collection of feature vectors, each capturing the visual characteristics of the object from a certain viewing angle (i.e., a multifaceted object model). When a camera detects an object, it can query for similar objects recently captured by the content-based image retrieval (CBIR) system. If similar objects are found, the object under consideration is recognized and the information can be used for object-tracking purposes. This informatics-based approach provides a flexible framework for a general-purpose object-tracking environment. There has been extensive research activity in CBIR [13, 24], mainly dealing with low-level image features and preliminary semantic features such as keywords. Additional CBIR techniques with enhanced query capabilities on spatial relationships between objects in images have also been proposed [1, 13, 17, 20, 23]. Such techniques are not suitable for the proposed CBIR framework for object-tracking purposes for two important reasons. First, feature vectors used in traditional CBIR are relatively simple. In contrast, different facets of an object can be captured by a camera in a video surveillance application, and we need a more advanced, yet computationally efficient, object representation technique to better facilitate object matching and recognition. Second, instead of dealing with stored data as in traditional CBIR systems, our object recognition framework needs to manage live video data in real time and handle continuous updates to the image database. We address this challenge by implementing the image database as an in-memory FIFO queue.
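A minimal sketch of such a FIFO-bounded, in-memory feature database follows; the linear-scan lookup and the capacity parameter are simplifying assumptions (the LVDBMS uses an index rather than a scan).

```python
# Sketch of an in-memory FIFO feature database: old observations expire,
# so only recently seen objects can be matched.
from collections import deque
import math

class FifoFeatureDB:
    def __init__(self, capacity=10_000):
        self.queue = deque(maxlen=capacity)  # (object_id, feature_vector)

    def insert(self, object_id, vec):
        self.queue.append((object_id, vec))  # the oldest entry drops off

    def nearest(self, vec):
        """Linear scan for the closest stored vector."""
        best_id, best_d = None, math.inf
        for obj_id, stored in self.queue:
            d = math.dist(vec, stored)
            if d < best_d:
                best_id, best_d = obj_id, d
        return best_id, best_d
```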
5.3 Multimedia database systems

In recent years, many video database management system (VDBMS) prototypes have been designed and implemented [1, 13, 16, 17]. In [11], the Advanced Video Information System (AVIS) is based on a frame segment tree and arrays to represent objects and their associations. However, this model lacks support for spatial and spatiotemporal queries. The VDBMS presented in [1] allows the user to search video based on a set of visual features and spatiotemporal relationships. The automatic feature extraction in this system is performed offline, so it is not suitable for the real-time needs of our LVDBMS. Very few works have targeted a general-purpose LVDBMS that provides real-time spatiotemporal query support over a camera network. All the video database systems presented in the literature thus far address only captured videos, in the sense that they require prior processing and indexing of the entire video collection. In the context of live video databases, the video data is generated online, and very little prior processing of the live video data is possible. Furthermore, unlike conventional VDBMSs, an LVDBMS must segment and track salient objects in real time with high efficiency. Our LVDBMS approach manages live video streams coming from hundreds of video cameras, much as a traditional database management system manages captured data sets stored on a large number of disk drives. The LVDBMS provides an application-independent query language to facilitate ad hoc queries, and potentially a programming environment to simplify the development of a variety of applications for distributed camera networks.
6 Conclusions

Camera networks have many important applications in diverse areas including urban surveillance, environment monitoring, healthcare, and battlefield visualization. In this article, we described the LVDBMS as a generic framework to support general-purpose camera networks. We presented the LVSQL query language, the system architecture, the object recognition and cross-camera tracking technique, and privacy filters. LVSQL is a declarative query language that does not require the user to deal with low-level operations. Unlike existing object-tracking techniques that rely on various restrictions and assumptions, our general-purpose in-memory FIFO database approach is based on an information fusion framework. This approach, built on a multifaceted object representation model, allows general spatiotemporal queries on objects in live video streams. The primary contributions of this study are a refined LVSQL query language, the implementation of privacy filters to obscure the identities of objects appearing in videos, and a cross-camera tracking operator that is integrated into the query language. LVSQL facilitates rapid development of live video database applications without requiring low-level image processing routines to be implemented. We evaluated our system with experiments, and the results indicate that the proposed techniques and the LVDBMS are effective.
References
1. Adali, S., Candan, K.S., Chen, S., Erol, K., Subrahmanian, V.S.: Advanced video information systems: data structures and query processing. ACM Multimedia Syst. 4, 172–186 (1996)
2. Adam, N.R., Worthmann, J.C.: Security-control methods for statistical databases: a comparative study. ACM Comput. Surv. 21(4), 515–556 (1989)
3. Ahmedali, T., Clark, J.J.: Collaborative multi-camera surveillance with automated person detection. In: Canadian Conference on Computer and Robot Vision (2006)
4. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42(4), Article 14 (2010)
5. Caloyannides, M.A.: Society cannot function without privacy. IEEE Secur. Priv. 1(3), 84–86 (2003)
6. Chen, X., Zhang, C., Chen, S., Chen, M.: A latent semantic indexing based method for solving multiple instance learning problem in region-based image retrieval. In: Seventh IEEE International Symposium on Multimedia (2005)
7. Cheng, H., Hua, K.A., Yu, N.: An automatic feature generation approach to multiple instance learning and its applications to image databases. Multimedia Tools Appl. (2009)
8. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) Proceedings of the 5th International Conference on Theory and Applications of Models of Computation (TAMC'08), pp. 1–19. Springer-Verlag, Berlin, Heidelberg (2008)
9. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
10. Danielson, P.: Video surveillance for the rest of us: proliferation, privacy, and ethics education. Int. Symp. Technol. Soc., 162–167 (2002)
11. Dixon, M., Jacobs, N., Pless, R.: An efficient system for vehicle tracking in multi-camera networks. In: Proceedings of ICDSC (2009)
12. Donderler, M.E., Ulusoy, O., Gudukbay, U.: A rule-based video database system architecture. Inf. Sci. 143(1–4), 13–45 (2002)
13. Donderler, M.E., Saykol, E., Ulusoy, O., Gudukbay, U.: BilVideo: a video database management system. IEEE Multimedia 10(1), 66–70 (2003)
14. Du, W., Piater, J.: Multi-camera people tracking by collaborative particle filters and principal axis-based integration. In: Asian Conference on Computer Vision, Hyderabad (2007)
15. Dufaux, F., Ebrahimi, T.: Scrambling for privacy protection in video surveillance systems. IEEE Trans. Circuits Syst. Video Technol. 18(8), 1168–1174 (2008)
16. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., et al.: Query by image and video content: the QBIC system. IEEE Comput. 28(9), 23–32 (1995)
17. Guting, R.H., Bohlen, M.H., Erwig, M., Jensen, C.S., Lorentzos, N.A., Schneider, M., et al.: A foundation for representing and querying moving objects. ACM Trans. Database Syst. 25(1), 1–42 (2000)
18. Hampapur, A., Brown, L., Connell, J., Ekin, A., Haas, N., Lu, M., et al.: Smart video surveillance: exploring the concept of multiscale spatiotemporal tracking. IEEE Signal Process. Mag. 22, 38–51 (2005)
19. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory IT-8, 179–187 (1962)
20. Hua, K.A., Yu, N., Liu, D.: Query decomposition: a multiple neighborhood approach to relevance feedback processing in content-based image retrieval. In: Proceedings of the 22nd International Conference on Data Engineering (2006)
21. Javed, O., Rasheed, Z., Shah, M.: Tracking across multiple cameras with disjoint views. In: Ninth IEEE International Conference on Computer Vision (ICCV), Nice (2003)
22. Javed, O., Shafique, K., Rasheed, Z., Shah, M.: Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views. Comput. Vis. Image Underst. 109(2), 146–162 (2008). doi:10.1016/j.cviu.2007.01.003
23. Kuo, T.C.T., Chen, A.L.P.: Content-based query processing for video databases. IEEE Trans. Multimedia 2(1), 1–13 (2000)
24. Li, J.Z., Ozsu, M.T., Szafron, D., Oria, V.: MOQL: a multimedia object query language. In: Proceedings of the 3rd International Workshop on Multimedia Information Systems, pp. 19–28, Como (1997)
25. Peng, R., Aved, A.J., Hua, K.A.: Real-time query processing on live videos in networks of distributed cameras. Int. J. Interdiscip. Telecommun. Netw. 2(1), 27–48 (2010)
26. Saini, M., Atrey, P.K., Mehrotra, S., Emmanuel, S., Kankanhalli, M.: Privacy modeling for video data publication. In: 2010 IEEE International Conference on Multimedia and Expo (ICME), pp. 60–65, 19–23 July 2010
27. Senior, A., Pankanti, S., Hampapur, A., Brown, L., Tian, Y., Ekin, A., Connell, J., Shu, C., Lu, M.: Enabling video privacy through computer vision. IEEE Secur. Priv. 3(3), 50–57 (2005)
28. Song, B., Roy-Chowdhury, A.: Stochastic adaptive tracking in a camera network. In: IEEE International Conference on Computer Vision (2007)
29. The London Evening Standard: Tens of thousands of CCTV cameras, yet 80% of crime unsolved. http://www.thisislondon.co.uk/news/article-23412867-tens-of-thousands-of-cctv-cameras-yet-80-of-crime-unsolved.do (2007)
30. Tieu, K., Dalley, G., Grimson, W.E.L.: Inference of non-overlapping camera network topology by measuring statistical dependence. In: IEEE International Conference on Computer Vision (2005)
31. Velipasalar, S., Brown, L.M., Hampapur, A.: Detection of user-defined, semantically high-level, composite events, and retrieval of event queries. Multimedia Tools Appl. 50(1), 249–278 (2010)
32. Yilmaz, A., Javed, O., Shah, M.: Object tracking: a survey. ACM Comput. Surv. 38(4) (2006)