

Enabling Cross-Camera Collaboration for Video Analytics on Distributed Smart Cameras

Chulhong Min, Juheon Yi, Utku Günay Acer, and Fahim Kawsar

arXiv:2401.14132v2 [cs.CV] 27 Jan 2024

• Chulhong Min is with Nokia Bell Labs (e-mail: [email protected]).
• Juheon Yi is with Nokia Bell Labs (e-mail: [email protected]).
• Utku Günay Acer is with Nokia Bell Labs (e-mail: utku [email protected]).
• Fahim Kawsar is with Nokia Bell Labs (e-mail: [email protected]). Fahim Kawsar is the corresponding author of this paper.

Abstract—Overlapping cameras offer exciting opportunities to view a scene from different angles, allowing for more advanced,
comprehensive and robust analysis. However, existing visual analytics systems for multi-camera streams are mostly limited to (i)
per-camera processing and aggregation and (ii) workload-agnostic centralized processing architectures. In this paper, we present
Argus, a distributed video analytics system with cross-camera collaboration on smart cameras. We identify multi-camera, multi-target
tracking as the primary task of multi-camera video analytics and develop a novel technique that avoids redundant, processing-heavy
identification tasks by leveraging object-wise spatio-temporal association in the overlapping fields of view across multiple cameras. We
further develop a set of techniques to perform these operations across distributed cameras without cloud support at low latency by (i)
dynamically ordering the camera and object inspection sequence and (ii) flexibly distributing the workload across smart cameras,
taking into account network transmission and heterogeneous computational capacities. Evaluation of three real-world overlapping
camera datasets with two Nvidia Jetson devices shows that Argus reduces the number of object identifications and end-to-end latency
by up to 7.13× and 2.19× (4.86× and 1.60× compared to the state-of-the-art), while achieving comparable tracking quality.

Index Terms—Cross-camera collaboration, Smart cameras, Video analytics

1 INTRODUCTION

It is increasingly common for physical locations to be surrounded and monitored by multiple cameras with overlapping fields of view (hereinafter 'overlapping cameras'), e.g., intersections, shopping malls, public transport, construction sites and airports, as shown in Figure 1. Such multiple overlapping cameras offer exciting opportunities to observe a scene from different angles, enabling enriched, comprehensive and robust analysis. For example, our analysis of the CityFlowV2 dataset [4] (5 cameras deployed to monitor vehicles at a road intersection) shows that each individual camera separately detects only 3.7 vehicles on average, while the five cameras together detect a total of 12.0 vehicles. Since a target vehicle can be captured by multiple cameras from different distances and angles, we can also observe objects of interest with a holistic view. Such view diversity can make the analytics more enriched and robust; e.g., a vehicle's license plate may be occluded in one camera's view due to its position or occlusion, but not in the other cameras.

Fig. 1: Places with overlapping cameras: intersection, shopping mall, transport, construction site.

Most visual analytics systems are deployed in cloud environments. On-camera video analytics, on the other hand, offers various attractive benefits such as immediate response, increased reliability and privacy protection. We envision that on-board AI accelerators [5], [6], [7] (e.g., Nvidia Jetson [8], [9], Google Coral TPU [10] and Analog MAX78000 [11]) and embedded AI models [12], [13] will accelerate this trend. However, the current practice of multi-camera stream processing is ill-suited to deployment on cameras without relying on cloud servers, in two ways. (i) Per-camera processing and aggregation. Previous work has mostly focused on processing the video analytics pipeline on each camera individually and aggregating the results at the final stage [14], [15], [16], thereby suffering from significant processing redundancy and latency. (ii) Workload-agnostic centralized processing. Some systems have been proposed to handle enormous numbers of video streams, but they mostly assume that multiple videos are streamed to the cloud and focus on optimization and coordination of the serving engine (e.g., GPU scheduling and batch processing [17], [18]).

In this paper, we present Argus, a distributed video analytics system designed for cross-camera collaboration with overlapping cameras. Here, the term 'cross-camera collaboration' not only encompasses the fusion of multi-view images for video analytics, but also refers to the cooperative utilization of distributed resources to ensure video analytics with high accuracy and low latency on distributed smart cameras, eliminating the need for a cloud server. To this end, we identify that multi-camera, multi-target tracking serves as a fundamental task for multi-camera video analytics.

TABLE 1: Comparison of the cross-camera collaboration approach in Argus with REV [1], Spatula [2], and CrossRoI [3].
• REV [1]: target environment: overlapping cameras; optimization goal: on-server computation costs; collaboration granularity: cells (groups of cameras); association applied dynamically (depending on the target's existence); approach: incrementally search the cells with the lowest identification confidence; video processing: centralized.
• Spatula [2]: target environment: non-overlapping cameras; optimization goal: communication and on-server computation costs; collaboration granularity: cameras; association applied dynamically (depending on the target's existence); approach: identify the subset of cameras that capture the target objects; video processing: centralized.
• CrossRoI [3]: target environment: overlapping cameras; optimization goal: communication and on-server computation costs; collaboration granularity: areas (RoIs); association applied statically (once, when the cameras are deployed); approach: find the smallest RoI that contains the target objects; video processing: centralized.
• Argus (ours): target environment: overlapping cameras; optimization goal: end-to-end latency on cameras; collaboration granularity: objects; association applied dynamically (depending on the target's location); approach: minimise the number of identification operations across cameras; video processing: distributed.

This process involves determining the location and capturing image crops of target objects (presented as query images) on the deployed cameras over time. We find that the computational bottleneck for camera collaboration arises from the frequent execution of identification model inference across different cameras. To address this challenge, we develop a fine-grained, object-wise spatio-temporal association technique. This novel approach strategically avoids redundant identification tasks on both the spatial (across multiple cameras) and temporal (within each camera over time) axes, which not only streamlines the process but also enhances the efficiency of the system.

To enable effective multi-camera, multi-target tracking across overlapping cameras, we develop an object-wise association-aware identification technique. Specifically, Argus continuously tracks records of the association of objects (their bounding boxes) with the same identity across both multiple cameras (§4.1) and time (§4.2). It then identifies an object by matching the location association instead of running the identification model inference and matching the appearance feature. The concept of spatio-temporal association has been proposed in several previous works to reduce repetitive appearances or query-irrelevant areas [2], [3], [19]. However, they apply the association at a coarse-grained level, e.g., groups of cameras [1], cameras [2], [19] or regions of interest (RoIs) [3]. Thus, the expected gain is small for our target environment, which is multi-camera, multi-target tracking on overlapping cameras. For example, the resource saving from camera-wise association and filtering [2], [19] is expected to be marginal for densely deployed overlapping cameras. RoI-wise association and filtering [3] also degrades tracking accuracy, as the target object is not detected on a subset of cameras. Please refer to Table 1 and §7 for more details of these works. In §2.2 and §6.2, we also provide an in-depth analysis and a comparative study with these prior arts, respectively. Furthermore, we carefully incorporate techniques to handle corner cases in the association process (e.g., newly appearing objects, occasional failures of the identification model and their error propagation) and improve the robustness of the spatio-temporal association process (§4.4).

Next, we develop a set of strategies that perform spatio-temporal association over distributed smart cameras at low latency. To maximize the benefit of association-aware identification, cameras need to be processed one by one in a sequential manner so that the number of identification model inferences is minimized; identification model inference needs to be performed only when the identity of the pivot object is not yet known. This sequential processing would increase end-to-end latency, even with fewer identification model inferences. Also, since cameras have different workloads (i.e., numbers of detected objects) and heterogeneous processing capabilities, careless scheduling and distribution might not maximize the overall performance. To this end, we develop a multi-camera dynamic inspector (§5.1) that dynamically orders the camera and bounding box inspection sequence to avoid identification tasks for query-irrelevant objects. We also distribute identification tasks across multiple cameras on the fly, taking into account network transmission and heterogeneous computing capacities, to minimize end-to-end latency (§5.2).

We prototype Argus on two Nvidia Jetson devices (AGX [8] and NX [9]) and evaluate its performance with three real-world overlapping camera datasets (CityFlowV2 [4], CAMPUS [20], and MMPTRACK [21]). The results show that Argus reduces the number of identification model executions and the end-to-end latency by up to 7.13× and 2.19× compared to the conventional per-camera processing pipeline (4.86× and 1.60× compared to the state-of-the-art spatio-temporal association), while achieving comparable tracking quality.

We summarize the contributions of this paper as follows.
• We present Argus, a novel system for robust and low-latency multi-camera video analytics with cross-camera collaboration on distributed smart cameras.
• To enable efficient cross-camera collaboration, we develop a novel object-wise spatio-temporal association technique that exploits the overlap in FoVs of multiple cameras to optimise redundancy in the multi-camera, multi-target tracking pipeline.
• We also develop a scheduling technique that dynamically schedules the inspection sequence and workload distribution across multiple cameras to optimise end-to-end latency.
• Extensive evaluations over three overlapping camera datasets show that Argus significantly reduces the number of identification model executions and end-to-end latency, by up to 7.13× and 2.19× (4.86× and 1.60× compared to the state-of-the-art [2], [3]), while achieving comparable tracking quality to baselines.

2 BACKGROUND AND MOTIVATION

Fig. 2: Typical pipeline for multi-camera, multi-target tracking; an example of vehicle tracking without cross-camera collaboration. Each camera runs the detection and identification independently and aggregates the output at the final stage.

Fig. 3: Camera topology of CityFlowV2 [4].

Fig. 4: The same vehicle captured from multiple views in CityFlowV2 [4].

Fig. 5: Identification saving opportunity for different overlapping ratios in CityFlowV2.

Fig. 6: Number of cameras after filtering by Spatula [2] in CityFlowV2 (5 cameras).

TABLE 2: Identification latency on Jetson devices.
• Vehicle identification (ResNet-101) [25], batch size 1 / 2 / 4: NX 0.119 s / 0.206 s / 0.399 s; AGX 0.065 s / 0.121 s / 0.217 s.
• Person identification (ResNet-50) [26], batch size 1 / 2 / 4: NX 0.043 s / 0.045 s / 0.066 s; AGX 0.018 s / 0.020 s / 0.028 s.
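To make the implication of these latencies concrete, the short Python sketch below turns the Table 2 single-image latency into an upper bound on per-camera frame rate; the four-objects-per-frame figure is an assumed workload taken from the example discussed in §2.1 below, not a measurement of ours.

```python
# Back-of-the-envelope throughput implied by the Table 2 latencies, assuming
# every detected object is identified independently (no association, batch 1).
ID_LATENCY_S = 0.065        # vehicle ID (ResNet-101), batch size 1, Jetson AGX
OBJECTS_PER_FRAME = 4       # assumed average detections per frame (see §2.1)

per_frame = OBJECTS_PER_FRAME * ID_LATENCY_S
print(f"identification time per frame: {per_frame:.2f} s")
print(f"frame-rate upper bound:        {1 / per_frame:.1f} fps")
# -> about 0.26 s per frame, i.e. roughly 4 fps, before object detection
#    or any cross-camera communication is accounted for.
```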

2.1 Multi-Camera, Multi-Target Tracking
In this work, we focus on multi-camera, multi-target tracking using deep learning-based object detection and re-identification models. These models robustly track objects across multiple views even in complex scenarios by leveraging the discriminative power of deep neural networks. They also handle occlusions, changes in appearance, and other challenges that are difficult to address with geometry-based methods. To this end, they often learn from large-scale datasets, enabling them to generalize to a wide range of scenarios and adapt to changes in the environment.
Operational flow. The key to enabling video analytics on overlapping cameras is multi-camera, multi-target tracking: detecting and tracking target objects (given as query images) from video streams captured by multiple cameras. This is typically achieved in three stages, as shown in Figure 2. (i) The object detection stage detects the bounding boxes of objects in one frame on each camera using object detectors (e.g., YOLO [22]) or background subtraction techniques [23], [24]. (ii) The per-camera object identification stage extracts the appearance features of the detected objects by running the object identification (ID) model (e.g., [25]) and determines whether each object matches the query image based on feature similarity (e.g., L2 distance, cosine similarity). (iii) The result aggregation stage aggregates the identification results across multiple cameras and generates tracklets [14] that can be used for further processing by the application logic, e.g., object counting, license plate extraction and face recognition.
Compute bottleneck: per-object identification. The main compute bottleneck is the execution of identification tasks, which need to be performed for all detected objects in every frame across multiple cameras to determine the identity of each object, as shown in Figure 2. Although we envision smart cameras equipped with built-in AI accelerators, they are not yet capable of processing a large number of identification tasks in real time. Table 2 shows the latency of two identification models (ResNet-101-based vehicle identification [25] and ResNet-50-based person identification [26]) with different batch sizes over two Nvidia Jetson devices. It shows that the number of identification model executions that can run on one camera is quite limited. For example, if 4 vehicles are detected in every frame on average, even the powerful Jetson AGX platform can only process about 4 frames per second. The throughput would drop even further if object detection is included (we show the detailed results in §6).

2.2 Exploring Optimisation Opportunities
Redundant identification of the same objects. To explore the opportunities for optimizing the pipeline for multi-camera, multi-target tracking, we investigate the pattern of identification tasks with the CityFlowV2 dataset [4]; five cameras are installed at an intersection as shown in Figure 3. Figure 5 shows the redundancy probability, i.e., the probability of objects appearing simultaneously in multiple cameras, for different overlap ratios. The overlap ratio is defined as the ratio of the time the object appears simultaneously in both cameras to the total time it is detected in either camera; for a target appearing in n cameras, we calculate all pairwise overlap ratios (nC2) and take the average. Each point represents a different query. The results show that, as the overlap ratio increases, the probability of an object's appearance in multiple cameras also becomes higher. This means that a dense array of cameras with overlapping FoVs will have more redundant identification tasks for the same object across multiple cameras.
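For concreteness, the following Python sketch shows one way to compute the per-query overlap ratio described above from per-camera detection logs; the log format and function names are our own illustration (interpreting "detected in any camera" pairwise), not part of the Argus implementation.

```python
from itertools import combinations

def overlap_ratio(frames_a, frames_b):
    """Time a target is seen by both cameras divided by the time it is seen
    by either of them; sets of frame indices stand in for time."""
    union = frames_a | frames_b
    return len(frames_a & frames_b) / len(union) if union else 0.0

def mean_pairwise_overlap(per_camera_frames):
    """Average overlap ratio over all camera pairs (nC2) that saw the target."""
    cams = [c for c, f in per_camera_frames.items() if f]
    pairs = list(combinations(cams, 2))
    if not pairs:
        return 0.0
    return sum(overlap_ratio(per_camera_frames[a], per_camera_frames[b])
               for a, b in pairs) / len(pairs)

# Toy detection log: frame indices in which one query vehicle was detected.
detections = {
    "cam1": set(range(0, 80)),
    "cam2": set(range(40, 120)),
    "cam3": set(range(60, 100)),
}
print(f"mean pairwise overlap ratio: {mean_pairwise_overlap(detections):.2f}")
```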

Spatio-temporal association. To avoid unnecessary and redundant identification tasks, we adopt the spatio-temporal association of objects, which has been proposed in auto-calibration techniques [27], [28] for multi-view tracking systems. Spatio-temporal association refers to the geographical and temporal association of an object across different cameras. More specifically, we associate the identity of an object across multiple cameras by matching the correlated positions of its bounding boxes on the frame, rather than matching appearance features extracted from the identification model as in Figure 2. This intuition arises from the observation that, once installed in a place, cameras' FoVs are fixed over time. We explain spatial association with an example. If the bounding boxes of two objects (at different times) are located at the same position in one camera's FoV, the positions of their bounding boxes in the other cameras will also remain the same (see footnote 1). Figure 7 shows the spatial association obtained from the CityFlowV2 dataset [4]. Each row shows a list of images captured by three cameras (Cameras 1, 2, 3), installed as in Figure 3, at the same time. Each column shows the images taken by the same camera. The red and blue overlay boxes in each row represent the bounding box of a red vehicle and a blue vehicle, respectively. Although the two vehicles crossed the intersection at different times, when they are located at a similar location in Camera 1, we can observe that the corresponding bounding boxes remain in a similar position in the other cameras. Similarly, as shown in Figure 8, we expect the temporal association of an object, which means that an object in a video stream remains in proximity within successive frames.

Fig. 7: Example of spatial association (green lines).

Fig. 8: Example of temporal association (yellow lines).

1. Of course, this argument is not always right in theory. Since a camera projects 3D space onto the 2D plane, the same bounding box in one camera at different times does not guarantee the same position of an object. The simplest case would be when two objects of different sizes are located in the same direction from the camera but the smaller object is located closer to the camera. However, in practice, such cases are very rare because the camera is often installed to look at a 2D plane (e.g., a street or floor) obliquely to cover a wide area, and objects of interest (e.g., vehicles and people) cannot be located at arbitrary 3D positions, as shown in Figure 7. Also, even if such a case happens (e.g., two objects at different positions are captured in the same bounding box in Camera 1), the spatial association of the two objects is not made because the position and size of the bounding boxes in the other cameras (e.g., Cameras 2 and 3) will be different.

2.3 Limitations of Prior Work
Despite such benefits, developing an effective filtering strategy using spatio-temporal association is not straightforward. There are prior works that explored spatio-temporal association for filtering out redundant workloads from multiple video streams. We present their techniques and limitations below; we explain these works in more detail in §7.
Auto-calibration using spatio-temporal association. Auto-calibration, also known as self-calibration, has been proposed since the 1990s as a solution to enable multi-view tracking systems. This technique aims to automatically estimate the camera parameters, such as the intrinsic and extrinsic parameters, for object tracking in multiple camera views, without the need for manual intervention or specialized calibration objects [27]. Auto-calibration methods leverage the spatio-temporal correlation of objects in multiple views as described in §2.2, taking advantage of the geometric constraints imposed by the scene and the motion of objects or the camera itself [29]. By harnessing these constraints, auto-calibration techniques can iteratively refine the camera parameters, leading to improved tracking accuracy and robustness [30]. Several auto-calibration methods have been proposed in the literature, including the self-calibration of space and time technique [28], which exploits the correlation between space and time in the image sequence to estimate the camera parameters. Other approaches [31], [32] utilize epipolar geometry and geometric constraints to estimate the camera parameters. Additionally, the establishment of a common coordinate frame across multiple views has been proposed to improve tracking performance [33]. While auto-calibration methods have shown the feasibility of object tracking from multiple camera views, they still face several limitations, especially when compared to modern approaches that utilize deep learning-based object detection and identification models. Auto-calibration methods are typically based on geometric constraints, which can be sensitive to errors in feature detection and correspondence matching, leading to inaccurate camera parameter estimation. Also, these methods rely on the assumption of a static scene, which may not hold true in dynamic environments where objects and people are constantly moving and changing [28]. Moreover, these techniques struggle in real deployment environments due to computational constraints, as the complexity grows with the number of cameras and tracked objects. In this paper, we propose a novel approach that enables multi-camera, multi-object tracking on cameras using deep learning-based object detection and identification models.
Camera-wise filtering in non-overlapping cameras. Spatula [2], [19] leverages cross-camera correlation to identify the subset of cameras likely to contain the target objects and filter out unnecessary cameras (those that do not contain the target objects). While it shows a significant performance benefit in its target environment (widely deployed non-overlapping cameras), it fails to effectively reduce redundant identification operations in overlapping cameras. To quantify its benefit, we analyzed the CityFlowV2 dataset [4]. Figure 6 shows the average number of cameras, out of five, used by Spatula; the error bars indicate the minimum/maximum number of cameras. The results show that the benefit of Spatula-based camera-wise filtering quickly diminishes when more queries are used, i.e., fewer cameras are filtered out. This is because a higher number of objects are likely to be captured by a higher number of cameras simultaneously.
Camera-wise filtering in overlapping cameras. REV [1] leverages spatial correlation across multiple overlapping cameras to minimize the number of processed cameras when identifying the target object. However, its goal is to confirm the presence of the target object within a given timestamp.

As such, it cannot be applied to Argus, which aims not only to confirm the presence of a target object but also to extract the image crops of the target from all cameras that capture it. Specifically, REV employs an incremental approach, starting its search from the camera that detects the largest number of objects (see footnote 2) and discontinuing the search once the target is identified. Thus, it often misses the image crops from the remaining cameras, which may have captured the target in superior quality.
RoI-wise filtering in overlapping cameras. CrossRoI [3] leverages spatio-temporal correlation to optimize the region of interest (RoI) of multiple video streams from overlapping cameras. When multiple objects are captured by a set of cameras from different views, CrossRoI extracts the smallest possible total RoI across all cameras in which every target object appears at least once, and then reduces processing and transmission costs by filtering out unmasked RoI areas, i.e., (a) redundant appearances and (b) areas that do not contain the target objects; the RoI is defined on a 6-by-4 grid. While it effectively reduces the workload to be processed, it is not suitable for multi-camera, multi-target tracking. Since it aims to minimize the RoI size that covers the overlapping FoVs, the smaller RoI that contains the object is preferred (e.g., 1 grid in Camera #2 instead of 6 grids in Camera #1 in Figure 9). This would lead to considerable degradation of accuracy. Furthermore, since it filters out redundant appearances in the initial stage, analytics applications cannot benefit from a holistic view, as shown in Figure 4.

Fig. 9: Overlapping RoI example between 2 cameras: (a) Camera #1 (RoI: 6 grids), (b) Camera #2 (RoI: 1 grid). CrossRoI [3] favours Camera #2, which has the smaller RoI size.

2. The underlying rationale is that cameras with more bounding boxes are more likely to capture the target object.

3 ARGUS DESIGN

3.1 Design Goals
Low latency and high accuracy. We aim at achieving both low latency and high accuracy in running multi-camera, multi-target tracking across overlapping cameras, which is the key requirement of various video analytics apps.
On-device processing on distributed smart cameras. Streaming videos to a cloud server for processing incurs significant networking and computing costs as well as privacy issues. We aim to run the video analytics pipeline with cross-camera collaboration fully on distributed smart cameras, leveraging on-device resources.
Flexibility of the tracking pipeline. We treat the AI models as a black box, thereby supporting both open-source and proprietary models and allowing analytics app developers to select the models for their purpose flexibly.

3.2 Approach
Multi-camera object-wise spatio-temporal association. Our preliminary study reveals that the computational bottleneck for multi-camera, multi-target tracking in overlapping cameras is the redundant identification of the same object (§2.1). To achieve both resource-efficient and accurate tracking, we devise a method for object-wise association-aware identification. As shown in Figures 7 and 8, Argus associates the spatio-temporal correlation of objects' positions and identifies redundant identification tasks. It reduces on-camera computational costs by filtering out redundant identification tasks for the same object across multiple cameras (spatially) and over time (temporally). It also provides accurate tracking by guaranteeing tracking information on all cameras.
On-camera distributed processing. To enable on-camera processing, we further devise two optimization techniques. First, we optimize the on-camera workload by minimizing the number of model executions (both object detection and identification). By inspecting cameras and objects (bounding boxes) in order of their probability of containing the target object, Argus avoids model executions that are irrelevant to the target objects; note that the tracking operation is finished when all target objects are found. Second, we further optimize end-to-end latency with parallel execution on distributed cameras. More specifically, Argus distributes the identification workload across multiple cameras on the fly and executes it in parallel.

3.3 System Architecture
Figure 10 shows the system architecture of Argus. It takes the query images as input from analytics apps and provides the tracklets (lists of cropped images and bounding boxes of the detected objects) tracked from multiple cameras as output. Given the targets to track, Argus first starts by running the object detector to detect objects for identification on each frame in parallel. Afterwards, the head camera runs the Dynamic Inspector (Section 5.1) to determine the processing order of cameras and bounding boxes. Once the processing order is determined, the Multi-Camera Workload Distributor (Section 5.2) schedules the identification tasks across cameras (head and members), considering the network transmission latency and heterogeneous compute capabilities. Given the identification workloads, each camera runs the object identifier; the Spatio-temporal Associator (Sections 4.1 and 4.2) opportunistically skips the inference by leveraging the spatio-temporal correlations across cameras.

Fig. 10: Argus system architecture.
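The per-frame control flow described in this section can be summarised with the following Python sketch. The class and method names (the Dynamic Inspector, Workload Distributor and Spatio-temporal Associator accessed as dynamic_inspector, workload_distributor and associator) are illustrative placeholders for the components in Figure 10, not the actual Argus API.

```python
from dataclasses import dataclass, field

@dataclass
class Tracklet:
    """Per-query output: bounding boxes (and crops) gathered over time."""
    query_id: str
    boxes: list = field(default_factory=list)   # entries of (camera_id, t, bbox)

def track_one_frame(t, frames, query, head, cameras):
    """One head-camera iteration of the pipeline sketched in Figure 10.

    `frames` maps camera_id -> image at time t; `head` and the values of
    `cameras` are assumed to expose the helper components named below.
    """
    # 1) Every camera runs its object detector on its own frame (in the real
    #    system this happens on the devices in parallel).
    detections = {cid: cam.detect(frames[cid]) for cid, cam in cameras.items()}

    # 2) The head camera orders cameras and bounding boxes for inspection (§5.1).
    order = head.dynamic_inspector.order(detections, query)

    # 3) Identification work is split between head and member cameras (§5.2),
    #    accounting for network delay and per-device compute capability.
    plan = head.workload_distributor.schedule(order, detections)

    # 4) Each camera identifies only the crops assigned to it; the associator
    #    skips crops whose identity can be resolved from the mapping table
    #    (§4.1) or the previous frame's cache (§4.2).
    tracklet = Tracklet(query_id=query.id)
    for cid, assigned in plan.items():
        for bbox, crop in assigned:
            if head.associator.resolve(cid, t, bbox, query):
                tracklet.boxes.append((cid, t, bbox))
            elif cameras[cid].identify(crop, query):
                head.associator.update(cid, t, bbox, query)
                tracklet.boxes.append((cid, t, bbox))
    return tracklet
```

In the real system these steps run across physically separate devices, so the detection in step 1 and the identification in step 4 execute concurrently on different cameras rather than inside a single loop.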

Fig. 11: Overview of our cross-camera collaboration-enabled multi-camera spatio-temporal association. (*): operations performed on a subset of cameras.

4 SPATIO-TEMPORAL ASSOCIATION
Operation overview. The goal of multi-camera spatio-temporal association is to accurately track the query identities from multiple cameras with a minimal number of identification operations, which are the key bottleneck of the multi-camera, multi-object tracking pipeline. Figure 11 shows the operation of multi-camera spatio-temporal association, illustrating how we leverage the association to achieve efficient multi-camera, multi-target tracking. For example, if an object matching the query is found in Camera 1 and the expected position of its bounding box in Camera 2 can be obtained, we determine the identity of an object in Camera 2 if its position matches the expected position. If no bounding box corresponding to the one in Camera 1 is expected to exist in another camera, e.g., in Camera 4, we skip all identification operations in Camera 4, as this means that the object is located outside of Camera 4's FoV.
Formulation. We formalize our problem setting as follows:
• $C$: a set of cameras, where $C^i$ is the $i$-th camera,
• $F_t$: a set of image frames at time $t$, where $F_t^i$ is the image frame from $C^i$ at time $t$,
• $E_Q$: a set of ID feature embeddings of query images, where $E_j$ is the feature embedding of the $j$-th query,
• $bbox^i_{t,j}$: the bounding boxes of the objects detected on $C^i$ at time $t$; $C^i$ has $n^i_t$ objects detected at time $t$ ($j = 1, 2, \ldots, n^i_t$).
Formally, the goal of multi-camera spatio-temporal association is to minimize the total number of identification operations across all cameras,

$\min \sum_i n^i_{IDs}, \qquad (1)$

where $n^i_{IDs}$ is the number of identification operations on $C^i$.
Operational flow. In detail, Argus operates as follows. For simplicity, we explain the procedure for a single query.
1) Camera order determination. We decide the order of cameras to inspect (§5.1) and repeat the steps below for each camera.
2) Object detection (step 1 in Figure 11). For a frame from the $i$-th camera $C^i$, we perform object detection. We define its output as $\{bbox^i_{t,j}, label^i_{t,j}\}$, where $bbox^i_{t,j}$ and $label^i_{t,j}$ are the bounding box and label of the $j$-th object on $C^i$ at time $t$, respectively.
3) ID feature extraction (step 2 in Figure 11). For objects whose detected label matches the query label, we sort the bounding boxes for inspection (§5.1). For each cropped image, we execute the identification model and obtain the ID features $\{E^i_{t,j}\}$, where $E^i_{t,j}$ is the ID appearance feature extracted from the cropped image ($bbox^i_{t,j}$). We further reduce the identification operations within a camera by leveraging the temporal locality of an object (§4.2).
4) Identity matching (step 3 in Figure 11). For each object, we compute its similarity to a query image by comparing the extracted features with $E_Q$ and determine its identity. Steps 3–4 are repeated until the target object is found.
5) Mapping-based identity matching (step 4 in Figure 11). We construct the set of bounding boxes of the target object from the previously inspected cameras, including the current camera $C^i$, i.e., $\{entry\_bbox^k_{t,j} \mid k \in K\}$, where $K$ is the set of cameras inspected so far. Note that it may contain one or more N/A elements, which indicate that the object is outside (or only partially inside) the corresponding camera's FoV at time $t$. We look up the mapping entry that matches $\{entry\_bbox^k_j\}$ for the camera set $K$. If the entry is found, we extract the bounding boxes of the other cameras (those not yet inspected) from the entry, i.e., $\{entry\_bbox^i_j \mid i \notin K\}$.
a) If a bounding box for another camera, e.g., $entry\_bbox^{i'}_{t,j}$, exists in the entry, we perform object detection on the corresponding camera $C^{i'}$ and determine the query identity by spotting the bounding box that matches $entry\_bbox^{i'}_{t,j}$.
b) If the entry has N/A for another camera, e.g., $C^{i''}$, we skip all the operations of $C^{i''}$ at time $t$.
6) If a target object is not found in the frame, we set $entry\_bbox^j_t$ to N/A and perform step 5.
7) Steps 1–6 are repeated until all cameras are inspected.
When multiple queries are given, the outputs from object detection (step 1) and ID feature extraction (step 2) are shared, but identity matching and mapping-based identity matching (steps 3 and 4) are performed separately. We explain the implementation details in §4.3.

4.1 Spatial Association
We first explain how we define a spatial association across multiple cameras. Once an object with the same identity is captured by multiple cameras, we create a mapping entry that contains a timestamp and a list of the corresponding bounding boxes on each camera in $C$. We use bounding boxes as location identifiers for fine-grained matching of the spatial association. Formally, we define a mapping entry as $entry_j = \{entry\_bbox^i_j\}$, where $entry\_bbox^i_j$ is a pair of coordinates referring to the south-western and north-eastern corners of the box in $C^i$ for the $j$-th mapping entry. $entry\_bbox^i_j$ is set to N/A if the object is not found in the corresponding camera $C^i$.

Utilizing the spatial association. In subsequent time in-


tervals, we apply the identification model to the detected
objects in a single camera (please refer to §5.1 for determin-
ing the order of camera inspection). Upon identifying an
object, we search for a mapping entry matching a bounding
box of the identified object in the same camera. If such an
entry is found, we examine the detected bounding boxes on
the remaining cameras whose entry value is not N/A. If a
bounding box in the remaining camera matches the located
entry, we associate (i.e., reuse) the identification result from Fig. 12: Handling newly appearing objects on frame edges.
the first camera, avoiding the need to rerun the identifi-
cation model. Note that searching for a matching entry in
the mapping table involves calculating the bounding box the number of ROIs to be examined by matching the cor-
overlap, which is an extremely lightweight operation (e.g., responding labels with the object class of the query (e.g.,
takes ¡1 ms for 1,000 matches) as detailed in §4.3. vehicles or people). In this paper, we use the YOLOv5n
Management of the spatial association. We use bounding model, the lightest model in the YOLOv5 family [22] as
boxes as location identifiers for fine-grained matching of the it provides reasonable detection accuracy even for small
spatial association. To facilitate quick access, we maintain cropped images in 1080p streams of our datasets. Note
the entries as a hash table. Also, if the number of entries that app developers can flexibly use different RoI detection
exceeds a threshold (e.g., 100), Argus filters out duplicate or methods depending on the processing capacity of smart
closely located entries by running non-maximum suppres- cameras and the service requirements.
sion on the bounding boxes of the entries. Specifically, when
two entries have bounding boxes from the same cameras Identity matching. For identification, it is common to train
with significant overlap, we retain only the entry that has the object type-specific identification models (e.g., vehicle
(i) a higher number of non-N/A values and (ii) a higher and person) and establish correspondences by measuring
average identity matching score; implementation details are the similarity between the feature vectors of the (cropped)
provided in §4.3. These mapping entries can be obtained images (e.g., Euclidean distance or cosine similarity). We
during the offline phase with pre-recorded video clips or use the dataset-specific identification models and similarity
updated during the online phase with runtime results. functions to ensure the accuracy of tracking (details in §6.1).
Bounding box matching. The key to leveraging the spa-
4.2 Temporal Association tio/temporal association is to match the bounding box of
a detected object with the other bounding boxes in the
We leverage temporal association to further reduce the num-
mapping entry and in the previous frame. We use the
ber of identification operations. It is inspired by the observa-
intersection-over-union (IoU) to measure the overlap be-
tion proposed in simple online tracking methods [34], [35],
tween two bounding boxes and detect a match if the IoU
that the location of an object does not change significantly
value exceeds 0.5 (widely used threshold for object track-
within a short period of time. That is, the bounding box of
ing [36]). Note that IoU calculation overhead is negligible
an object in a video stream would remain in proximity to
(e.g., takes <1 ms even for 1,000 matchings on Jetson AGX
the bounding box with the same identity in the previous
board).
frame. For example, even in the vehicle tracking scenario in
CityFlowV2 [4], the distance of a vehicle moving at a speed
of 60 km/h in successive frames of a video stream at 10 Hz 4.4 Improving Robustness
is about 1.7 metres, which is relatively small compared to
the size of the area that a security camera usually covers.
Handling newly appearing objects. One practical issue that
When ID feature extraction is performed, Argus caches
needs to be considered when applying spatial association is
the ID features with their bounding box. Then, when the ID
how to deal with objects that appear in the FoV for the first
feature is needed for a new bounding box in a later frame,
time. Figure 12 shows an example. At time t (first row),
Argus finds the matching bounding box in the cache; we ex-
a target vehicle is found only in Cameras 1 and 3, so the
plain the implementation details of bounding box matching
mapping entry is made as {bbox1t,j , N/A, bbox3t,j ′ }. At time
in §4.3. When the matching bounding box is found, Argus
t + 1 (second row), the target starts to appear in Camera 2.
reuses its ID feature and updates the bounding box in the
However, if the target is found in Camera 1 and its mapping
cache. We set the expiry time to one frame, i.e., the cache
entry matches the one at time t, the target in Camera 2 will
expires in the next frame unless it is updated.
not be inspected. To avoid such a case, we skip mapping-
based identity matching for objects that appear in the frame
4.3 Implementation of Association Technique for the first time (i.e., we perform an identification task for
RoI extraction. There are several options for the RoI ex- a vehicle in the blue box in Camera 2 of the second row
traction stage that can be adopted in Argus (e.g., back- in Figure 12) and match its identity based on identification
ground subtraction [23], [24] or object detection mod- feature matching. Note that we apply for the mapping-
els [22]). Although the background subtraction method is based identity matching for other cameras (e.g., Camera 3).
more lightweight, we use the object detection method be- To effectively identify objects when they first appear,
cause the object detection method can effectively reduce we devise a simple and effective heuristic method. Inspired
8

by the observation that an object appears in the camera’s 5.1 Multi-Camera Dynamic Inspection
frame by moving from out-of-FoV to FoV, we consider the The key to maximizing the benefit from spatial association
bounding boxes that are newly located at the edge of the is to quickly find the objects that match the query, thereby
frame as potential candidates, and perform the ID feature (a) skipping identification tasks on other cameras from
extraction regardless of the matching mapping entry if no spatial association and (b) skipping identification of objects
corresponding identification cache is found. irrelevant to the query even on the same camera. To this
Handling occlusion. Depending on a camera’s FoV, a target end, we develop a method to dynamically arrange the order
object might be obscured by another moving object. For of cameras and bounding boxes to be examined.
instance, in a camera with a FoV perpendicular to the road, Inter-camera dynamic inspection. The inspection order of
a vehicle in the front lane could occlude a vehicle in the the cameras heavily affects the identification efficiency (i.e.,
rear lane. Under such circumstances, the detection model the total number of required identifications). Specifically, we
might fail to identify the target object. To handle errors find that searching the cameras which most likely contain
resulting from sudden, short-term occlusions, we develop the target object first improves the search efficiency. This is
an interpolation technique that leverages the detection re- because we can leverage the bounding box location of the
sults from the preceding frame in the same camera and/or identified target object to aggressively skip identification on
time-synchronised frames from other cameras. Specifically, non-matching bounding boxes in the remaining cameras.
during a sudden, short-term occlusion, the target object For example, consider a case with three cameras (Cameras
might be visible up to a certain point in the frame, then 1, 2, and 3). At a given timestamp, assume that all cam-
abruptly disappear mid-frame. If the object remains visible eras detect the same number of vehicles, e.g., four, and a
in other cameras, we can estimate the existence of the oc- target object is captured by Cameras 1 and 2. If Camera
cluded object by comparing the current mapping entry with 1 is inspected first, we can find the query object within
past mapping entries. For example, if an object suddenly four identification and skip the identification operations for
disappears in Camera 1, Argus searches for a prior mapping Cameras 2 and 3. However, if the inspection starts with
entry containing the object located in the previous frame of Camera 3, we need to perform further inspections with
Camera 1 and extracts the position of the object in other Camera 1 and 2 just in case the target object is located out of
cameras. If corresponding bounding boxes are found in all Camera 3’s FoV. Hence, eight identifications are required.
other cameras, Argus performs object detection and identi- In addition to efficiency, the inspection order of the
fication on the other cameras. Where occlusion persists for cameras also affects the identification accuracy because our
an extended period, we employ periodic cache refreshing approach relies on identification-based target matching from
(details provided subsequently). It is important to note that the first camera. Specifically, inspecting the camera where
such occlusions are rare in practical settings, as objects move the target identification accuracy is expected to be the high-
at varying speeds and cameras are often installed to monitor est leads to the highest association accuracy in the remaining
the target scene from a high vantage point (e.g., mounted on cameras. While the identification accuracy is affected by
a traffic light as shown in Figure 12). multiple attributes of the captured object (e.g., the detected
Periodic cache refreshing. To avoid error propagation in object’s size, pose, blur), we currently use the bounding box
our association-based identification (due to occlusion as size as the primary indicator; we plan to extend the analysis
well as the failure of identification model inference), we to other attributes in our future work. For example, in the
limit the maximum number of consecutive skips and per- case of Figure 7, we consider Camera 2 as the first camera
form identification task regardless of a matching mapping to be inspected as the size of the box of the detected target
entry at the predefined interval (e.g., every 2s). This variable object is the largest, i.e., has the highest probability that it is
controls the trade-off between efficiency and accuracy. correctly identified.
Considering these two factors (efficiency and accuracy),
Time synchronisation. For spatial association, it is impor- at each time t, we calculate each camera’s priority as follows
tant for all video streams to be time synchronised. To this (higher value indicates higher priority)
end, Argus periodically synchronises the camera clock time
using the network time protocol (NTP) periodically and Nt−1 i

aligns frames based on their timestamp, i.e., two frames Ni X


are considered time-synchronised if the difference of their
α × t−1 + (1 − α) × c × size(bboxit−1,j ), (2)
NQ j
timestamps is below the threshold (in the current imple-
i
mentation we set it to 3 ms). Considering that existing CCTV where Nt−1 is the number of target objects found in C i at
networks are often connected with a gigabit wired connec- time t − 1, NQ is the number of queries, size() is a function
tion, NTP is capable of achieving this value. Leveraging that returns the size of the given bounding box, c is a coef-
the synchronized clocks, we match frames across different ficient to normalise the size. α is a variable that determines
cameras with the smallest timestamp difference to handle the weight of resource efficiency and identification accuracy.
cases where the cameras have different frame rates.
Intra-camera bounding box dynamic inspection. After ob-
ject detection in a frame of a single camera, the order of the
boxes to be inspected also affects the overall identification
5 O N -C AMERA D ISTRIBUTED P ROCESSING performance. Specifically, inspecting the detected bounding
boxes close to the expected location of the target object is
more beneficial, as we can skip the identification on the
9

remaining boxes as soon as the target object is identified.


For example, in Camera 1 in Figure 11, it would be beneficial
to start by examining the expected target object (a white
vehicle in the third row) rather than starting from the query-
irrelevant objects. We order the sequence of boxes to be
Fig. 13: Snapshots of CAMPUS-Garden1 [20] dataset.
examined by leveraging temporal association, i.e., sorting

X
min dist(bboxit,j , Bt−1
i
) (3)
max T D(C i , C j , nj ) + BP (C j , nj ) ,

j min
n1 ,n2 ,...,nK j

for each bounding box bboxit,j in C i at time t in ascending


X (4)
where nj = N.
i
order, where Bt−1 is a set of bounding boxes of the target i
objects detected in C i at time t − 1 and min dist(bbox, B) is j
where n is the number of bounding boxes to distribute to
a function that returns the minimum distance from bbox to
Camera j (C j ) to extract the ID features, T D(C i , C j , nj )
any bounding box in B . Note that this only affects the order
is a function that returns the network transmission delay
of the boxes to be examined, but not the tracking result.
to distribute nj cropped images from C i to C j (note that
T D(C i , C i , ni ) is zero as no transmission is required), and
BP (C j , nj ) is the batched processing latency the identifica-
5.2 Multi-Camera Parallel Processing
tion model on C j . T D(C i , C j , nj ) and BP (C j , nj ) vary for
each C j depending on its processing capability and network
The key challenge of running spatio/temporal associa- bandwidth; we predict T D(C i , C j , nj ) and BP (C j , nj ) as
tion on distributed cameras is the long execution time. The follows.
end-to-end execution time may increase if the target objects First, transmission delay for the workload distribution
are not found in the previously inspected cameras due to the T D(C i , C j , nj ) is calculated as
sequential execution of the inspection operations. We apply
the following techniques that exploit the resources of the T D(C i , C j , nj ) = H · W · nj /BWij , (5)
distributed cameras to prevent this. where H, W is the height and width of the image crop
1) Given an image, we perform spatial association- (resized to the input size of the identification model) and
irrelevant tasks on the cameras in parallel, i.e., object BWij is the network bandwidth between C i and Cj (BWii
detection and ID feature extraction of newly appeared is set as ∞). For each network transmission event between
objects at the edges of the frame. C i and Cj , we estimate BWij by the transmitted data
2) If the number of objects in a frame exceeds the pre- size divided by the transmission latency (similar to [38]),
defined batch size (e.g., 4), we distribute the identifica- and update it with Exponential Weighted Moving Aver-
tion tasks to nearby cameras and execute them in parallel. age (EWMA) filtering for future prediction. In our current
Such distribution has a beneficial effect on the end-to-end implementation, the cameras are connected with a Gigabit
execution time because 1) current AI accelerators do not wired connection similar to [39], and the distribution latency
support parallel execution of AI models [5]3 and 2) the is negligible compared to the identification model inference
network latency is relatively much shorter since we need latency (e.g., ≈0.3 ms to transmit a 128×128 cropped image
to send only the cropped image (e.g., 85×141), not the over 1 Gbps connection, whereas single identification takes
full-frame image (e.g., 1080p). >100 ms as shown in Table 2).
Note that batch processing [37] is widely used to re- Next, identification latency with batched processing
duce execution time for multiple inferences on a device. To BP (C i , ni ) is calculated as
maximize the benefits of workload distribution and batch BP (C i , ni ) = ⌈ni /nibatch ⌉ × T (C i , nibatch ), (6)
processing, we profile the execution time with different
batch sizes on each camera and network latency with data where nibatch is the batch size on C i , and T (C i , nibatch ) is the
transmission sizes. Then, we dynamically select the optimal identification model latency on C i with batch size nibatch .
batch size for processing in one camera and the optimal nibatch is determined at the offline stage by running the
number of bounding boxes for distribution to other cameras. identification model on each C i with different batch sizes
Formally, we define this problem as follows. Given the and determining the one that maximizes the throughput. At
inspection order determined in §5.1, suppose that we are runtime, we update T (C i , nibatch ) upon each inference using
currently processing Camera i, where a total of N bounding the EWMA to account for dynamic resource fluctuation
boxes were detected. We distribute the identification tasks (e.g., due to thermal throttling) similar to [40].
across K cameras to minimize the total execution time as:
6 E VALUATION
3. Please note that, while Nvidia offers multi-process service (MPS)
and multi-instance GPU (MIG) software packages to facilitate model 6.1 Experimental Setup
co-running on their GPUS on cloud servers, they are not supported in Datasets. We use three real-world overlapping camera
the Jetson family devices [8], [9] designed for edge AI. The other AI
accelerators such as Google Coral TPU and Intel NCS 2 also do not datasets for the evaluation: CityFlowV2 [4], CAMPUS [20],
support parallel execution on the accelerator chip. and MMMPTRACK [21] to ensure a fair comparison with
10

baseline methods and to enable an in-depth study of the number of matches in frame t and dt.i is the overlap of
impact of various system parameter values. When spatial the bounding box (IoU) of target i with the ground truth.
association is used, we use the pre-generated mapping For each frame, we compute the MOTP for each camera
entries learned with 10% of the data in the dataset. separately and report its average.
• CityFlowV2 [4] consists of video streams from five het- • MOTA measures the overall accuracy P of both the detector
(F Nt +F Pt +M Mt )
erogeneous overlapping cameras at a road intersection. and the tracker. We define it as 1 − t P ,
t Tt
The cameras are located to cover the intersection from where t is the frame index and Tt is the number of
different sides of the road (Figure 3). 4 videos are recorded target objects in frame t. FN, FP, and MM represent false-
at 1080p@10fps and 1 video is recorded at 720p@10fps negative, false-positive and miss-match errors, respec-
with a fisheye lens. Each video stream is ≈3 minutes long tively. Similarly, we calculate the average MOTA across
and the ground truth data contains 95 unique vehicles. multiple cameras and report their average.
• CAMPUS [20] consists of overlapping video streams
recorded in four different scenes. We use the Garden1 Baselines. We evaluate Argus against the following state-of-
scene, which consists of 4× 1080@30fps videos capturing the-art methods. The baselines perform all model operations
a garden and its perimeter (Figure 13). We resized the on the camera where the corresponding image frame is
images to 720p as they show comparable object detection generated.
performance to the original 1080p at a lower cost. Each • Conv-Track is the conventional pipeline of multi-camera,

video is ≈100s and the ground truth contains 16 unique multi-target tracking (e.g., [14]), as shown in Figure 2.
individuals. Since the dataset provides inaccurate ground It identifies the query object on each camera separately
truth labels and bounding boxes, we manually regener- and aggregates the identification results across multiple
ated the ground truth for three targets (id 0, 2, 9). cameras.
• MMPTRACK [21] is composed of overlapping video • Spatula-Track adopts the camera-wise filtering approach

streams recorded from 5 different scenes: cafe shop, in- proposed in Spatula [2] for object tracking. For each times-
dustry safety, office, lobby, and retail. In total, there are 23 tamp, it first filters out the cameras that do not contain the
scene samples (3-8 samples per scene), and each sample target objects and then performs the Conv-Track pipeline
is composed of 4-6 overlapping video streams capturing for the selected cameras. We use ground truth labels for
6-8 people. Each video stream is 360p@15fps and ≈400 correlation learning and camera filtering, assuming the
seconds (in total 133k frames = 8,800 seconds). We use ideal operation of Spatula [2].
this dataset to evaluate the robustness of Argus in §6.7 • CrossRoI-Track adopts the RoI-wise filtering approach
proposed in CrossRoI [3] for tracking. Offline, it learns
Queries. For queries, we randomly chose ten vehicles for the minimum-sized RoI that contains all objects at least
CityFlowV2, three people for CAMPUS, and two people once over deployed cameras. At runtime, it performs the
for MMPTRACK. In the in-depth analysis, we also examine Conv-Track pipeline only for the masked RoI areas. We
performance with different numbers of queries. use the ground truth labels for the optimal training of the
Object detection and identification models. For object RoI mask, assuming the ideal operation of CrossRoI [3].
detection, we use YOLO-v5 [22]. For vehicle identification Hardware. For the hardware of smart cameras, we consid-
in CityFlowV2, we use the ResNet-101-based model [25] ered two platforms, Nvidia Jetson AGX and Jetson NX.
trained on the CityFlowV2-ReID dataset [4]. For person Jetson AGX hosts an 8-core Nvidia Carmel Arm, a 512-
identification in CAMPUS and MMPTRACK, we trained core Nvidia VoltaTM GPU with 64 Tensor Cores and 32
the ResNet-50-based model using the dataset. Note that GB of memory. Jetson NX hosts a 6-core Nvidia Carmel
the performance of the re-id model is not the focus of this Arm, a Volta GPU with 384 NVIDIA CUDA cores and 48
work and different models can be used. All models are Tensors, and 8 GB of memory. We prototyped Argus on
implemented in PyTorch 1.7.1. these platforms and measured performance; we used Jetson
Metrics. To measure system resource costs, we evaluate AGX for the CityFlowV2 dataset and the MMMPTRACK
the end-to-end latency and the number of identification dataset, and Jetson NX for the CAMPUS dataset. For the
model inferences. To measure tracking quality, we use two network configurations, we connected the Jetson devices
metrics that are widely used in multi-object tracking [41]: with a Gigabit wired connection, which is commonly used
Multiple Object Tracking Precision (MOTP) and Multiple for existing CCTV networks. It is important to note that,
Object Tracking Accuracy (MOTA). while we used the offline data traces from three datasets for
• End-to-end latency is the total latency for generating
the repetitive and comprehensive analysis, we implemented
multi-camera, multi-target tracking results. Note that the the end-to-end, distributed architecture of Argus on top of
latency includes all the operations required for the sys- multiple Jetson devices and evaluated the resource metrics
tem, i.e., image acquisition, model inference, uploading by monitoring the resource cost at runtime.
the cropped images to other cameras, and cross-camera
communication time. 6.2 Overall Performance
• Number of IDs is defined as the total number of identi-
fication model inferences required across all cameras for Figures 14 and 15 show overall performance on
each timestamp. CityFlowV2 and CAMPUS respectively. In Figures 14a and
• MOTP quantifies how preciselyP the tracker estimates 15a, the bar chart represents the average end-to-end latency
t,i dt,i
object positions. It is defined as P ct , where ct is the
t
(Detect: object detection latency, ID: identification latency)
11

(a) Resource cost. (b) Tracking quality. (a) Resource cost. (b) Tracking quality.

Fig. 14: Overall performance on CityFlowV2. Fig. 15: Overall performance on CAMPUS.

and the line chart shows the average number of IDs. In CrossRoI-Track, the number of IDs and latency are 24.9 and
Figures 14b and 15b, the bar and line charts represent the 419 ms, respectively.
average MOTA and MOTP respectively.
6.2.2 Tracking Quality
6.2.1 Resource Efficiency We investigate how spatial and temporal association-aware
identification affects tracking quality. Figure 14b and Fig-
Overall, Argus achieves significant resource savings by ure 15b show the MOTP and MOTA on CityFlowV2 and
adopting the spatio-temporal association and workload dis- CAMPUS, respectively. Overall, Argus achieves comparable
tribution, while not compromising the tracking quality. We tracking quality, even with significant resource savings.
first examine the resource saving of Argus. In CityFlowV2 Interestingly, in CityFlowV2, Argus increases both MOTA
where five cameras are involved, Figure 14a shows that and MOTP compared to Conv-Track; MOTA increases from
the average number of IDs decreases from 42.6 (Conv- 0.88 to 0.91 and MOTP increases from 0.60 to 0.67. This
Track ) to 21.1 (Spatula-Track), 30.1 (CrossRoI-Track) and is because several small cropped vehicles are identified by
11.6 (Argus). The end-to-end latency also decreases from associating with their position from other cameras, which
740 ms to 650 ms, 660 ms and 410 ms, respectively; Argus failed to be identified by matching their appearance features
is 1.8×, 1.59× and 1.61× faster than Conv-Track, Spatula- from the identification model in the baselines. In CAMPUS,
Track and CrossRoI-Track, respectively, which are the state- Argus shows almost the same tracking quality as Conv-
of-the-art multi-camera tracking solutions. We find several Track, but MOTA drops slightly from 0.85 (Conv-Track) to
interesting observations. First, the latency does not decrease 0.82. There were some cases where a target person was
proportionally to the number of IDs because all baselines suddenly occluded by another person in some cameras.
need to commonly perform object detection in every frame. However, Argus identifies the occluded person in the next
However, even when object detection is taken into account, frame using the robustness techniques §4.4, thereby being
Argus significantly decreases the end-to-end latency by able to minimize the error.
49% by reducing the number of IDs by 73%, compared to We investigate the benefit of cross-camera collaboration
Conv-Track. Second, both Spatula-Track and CrossRoI-track in more detail. In CAMPUS, Spatula also increases the
significantly reduce the number of IDs by selectively using tracking accuracy (MOTA) compared to Conv-Track by fil-
cameras and RoI areas, respectively. However, the reduction tering out query-irrelevant cameras, thereby avoiding false-
in end-to-end latency is not significant (about 10%). This is positive identifications. However, in CAMPUS, its quality
because the latency is tied to the longest execution time of all is identical to Conv-Track as no cameras are filtered out.
cameras due to the lack of distributed processing capability. Unlike Spatula-Track, CrossRoI degrades the tracking qual-
Figure 15a compares the resource costs for the CAM- ity on both MOTA and MOTP; for example, in CAMPUS,
PUS dataset. The results show a similar pattern to the CrossRoI shows 0.55 of MOTA, while other baselines includ-
CityFlowV2 dataset, but the saving ratio of Argus is much ing Argus show 0.85 of MOTA. This is because, for object
higher. Argus reduces the average number of IDs by 7.13× detection and identification, CrossRoI uses the smallest RoI
(35.8 to 5.0) compared to Conv-Track and Spatula-Track and across all cameras in which the target objects appear at least
4.86× (24.4 to 5.0) compared to CrossRoI-Track. The end- once. Therefore, these tasks sometimes fail due to the small
to-end latency also decreases by 1.72× (from 310 ms to size of the objects in the generated RoI areas.
180 ms) and 1.43× (from 258 ms to 180 ms), respectively.
The larger saving is mainly because the moving speed of
the target objects (here, people in the CAMPUS dataset) 6.3 Performance Breakdown: Benefit of On-Camera
is relatively slow, compared to vehicles in the CityFlowV2 Distributed Processing
dataset. Therefore, there are fewer newly appearing objects We developed two variants of Argus, namely Spatial and
(at the edge) and most of the identification tasks can be done Spa-Temp, in which we apply each enhancement to Conv-
by spatial and temporal association matching. Interestingly, Track in turn. For identification optimization, Spatial uses
Spatula-Track shows the same performance as Conv-Track, spatial association and Spa-Temp uses the spatio/temporal
which is different from the CityFlowV2 case. This is because association. Both of them have the capability of dynamic
all target people are captured by all four cameras all the inspection of cameras and bounding boxes (§5.1), but do
time and thus cameras are not filtered out. CrossRoI-Track not have distributed processing (§5.2).
reduces both latency and the number of IDs compared to Figure 16a shows the resource cost in CityFlowV2. The
Conv-Track, but its efficiency is still lower than Argus; for spatial association reduces the number of IDs from 40.5
12

CAMPUS. For Conv-Track, both the MOTP and MOTA are


not significantly affected by the number of queries. This is
because Conv-Track runs the identification model on all the
detected objects across all cameras regardless of the number
of queries; the identification matching accuracy with the
query images does not vary with the number of queries as
they are randomly selected and averaged. Argus also shows
(a) Resource cost. (b) Tracking quality. comparable accuracy with Conv-Track (with significantly
reduced number of ID operations as shown in Figure 18,
Fig. 16: Performance breakdown on CityFlowV2.
showing that it effectively reduces the identification work-
load without accuracy drop. Even with a large number of
matching attempts with other cameras and queries, Argus
identifies the objects accurately.
Figure 18 shows how the number of IDs and the end-
to-end latency change depending on the number of queries.
The number of IDs and the latency in Conv-Track do not
change because all objects in the frame must be examined
regardless of the number of queries. In contrast to Conv-
(a) MOTP. (b) MOTA. Track, Figure 18a shows that the number of IDs in Argus in-
creases with the number of queries. This is because, at each
Fig. 17: Impact of the number of queries on tracking time, if all target objects are not found on the previously
quality (CityFlowV2). inspected cameras, Argus has to perform the identification
operation for all detected objects (which are not filtered
out of the spatio/temporal association). This probability
increases when the number of queries is large, thereby
increasing the number of IDs. However, it saturates when
the number of queries is around 50 and, more importantly,
it is still much lower than Conv-Track.
Figure 18b also shows an interesting result. While the
number of IDs of Argus increases by 29% from 10.6 to 13.7
when the number of queries is 5 and 95, respectively, the
(a) Number of IDs. (b) End-to-end latency. increase in latency is much lower, i.e., by 8% from 399 ms
Fig. 18: Impact of the number of queries on resource to 433 ms. If we exclude the latency for object detection for
cost (CityFlowV2). the analysis, the execution time for identification increases
by only 10%, from 315 ms to 349 ms. This result shows
the benefit of the Argus’s distribution of the identification
(Conv-Track) to 16.5 (Spatial) and the temporal association operations to other cameras.
further decreases to 11.6 (Spa-Temp). These results show
that our spatial and temporal association techniques make 6.5 Impact of Number of Cameras
a significant contribution to overall resource savings. In-
terestingly, despite the reduction in the number of IDs by We examine the impact of the number of cameras on re-
Spatial, the latency increases from 854 ms (Conv-Track) to source saving. We consider all possible combinations and
992 ms (Spatial) due to the sequential operations on the report their average performance; for example, in the case
cameras. However, we observe that the temporal association of three cameras in the CityFlowV2, we report the average
and workload distribution successfully reduce the latency in result for all 10 (=5 C3 ) combinations. In this subsection, we
turn, to 624 ms and 317 ms, respectively. Figure 16b shows do not report the tracking quality results because it is not
the tracking quality of CityFlowV2. It confirms again that fair to compare tracking quality for different number and
both spatial and temporal associations (and their mapping- topology of cameras.
based identity matching) do not compromise the tracking Figure 19 shows the total number of IDs for Conv-Track
quality. We omit the result of CAMPUS as we also observe and Argus. As expected, the number of IDs required for
a similar trend. both Conv-Track and Argus increases when more cameras
are used. As shown in Figure 19a, in Conv-Track the number
of IDs increases from 8.2 (1 camera) to 42.5 (5 cameras) in
6.4 Impact of Number of Queries CityFlowV2, i.e., by 418%. Similarly, in Argus, it increases
We investigate the impact of the number of queries on from 2.2 to 11.6, i.e., by 423%. Figure 19b also shows that in
system performance. For CityFlowV2, we vary the num- CAMPUS, the total number of IDs increases by 316%, from
ber of queries from 5 to 95 with an interval of 5. For 8.6 (1 camera) to 35.8 (4 cameras) in Conv-Track, while in
each number of queries, we randomly select three sets Argus, it increases from 1.6 to 5.0, i.e., by 213%. However,
of queries (except the entire set) and report their average interestingly, Argus shows a much lower standard deviation
performance. Figures 17a and 17b show MOTP and MOTA across different combinations. This is because the number of
for CityFlowV2, respectively; we observe a similar trend in IDs in Conv-Track is proportional to the number of objects
13

(a) CityFlowV2. (b) CAMPUS.

(a) CityFlowV2. (b) CAMPUS.


Fig. 21: Impact of inspection order.
Fig. 19: Impact of number of cameras on number of IDs.

a predefined static order and in the reverse order of Argus,


respectively; Reverse is used to establish the performance
lower bound and validate our design choice. The Crowded-
first variant, inspired by REV [1], inspects the cameras
in a descending order based on the number of bounding
boxes. The underlying rationale is that cameras with more
bounding boxes are more likely to capture objects of interest.
For the bounding box inspection order, we employed the
same order as used in Static.
(a) CityFlowV2. (b) CAMPUS.
Figures 21a and 21b show the latency and the total
Fig. 20: Impact of the number of cameras on latency; number of IDs in CityFlowV2 and CAMPUS, respectively.
purple line is the execution time of the object detection. We omit the result of MOTP and MOTA as the differences
were marginal. The results validate our design choice. In
both datasets, Argus shows shorter latency and fewer iden-
in a frame and is therefore affected by the camera’s FoV, tification operations. As expected, Static and Crowded-first
i.e., how many objects are captured. In contrast, the number show better performance than Reverse, though their effect
of IDs in Argus is determined by the spatial association of is still lower than Argus. This is primarily due to a lack
the target objects, and is therefore more dependent on the of considerations for the relevance of target objects in a
number of queries. scene. This advantage is more evident in CAMPUS. The
We further investigate the impact of the number of number of IDs of Reverse is 11.6, while Argus’s number
cameras on end-to-end latency. Figure 20a and 20b show is 5.0. Similarly, the latency decreases from 245 ms to 186
the latency in the CityFlowV2 and CAMPUS datasets, re- ms. This is mainly because the target people mostly remain
spectively. While Conv-Track performs the identification op- in one of the cameras during the video stream. Therefore,
erations on each camera individually, the latency increases Argus is capable of reducing the number of IDs by initiating
as the number of cameras increases. This is because Conv- the inspection with potential target objects.
Track’s latency is tied to the maximum latency across all
cameras. Argus also increases latency for both CityFlowV2
and CAMPUS when more cameras are involved, as the 6.7 Robustness on Large Scale Benchmark
waiting time for the entry matching of previously inspected
cameras also increases. Nevertheless, the latency of Argus We perform large-scale evaluation on the MMPTRACK
in both Figures 20a and 20b is still much lower than dataset to validate the robustness of Argus. Figure 22
that of Conv-Track. We observe an interesting case in the compares the resource cost and tracking quality results of
CityFlowV2 dataset. The latency of Argus decreases from Conv-Track and Argus; we omit the results of Spatula-Track
587 ms to 400 ms when the number of cameras increases and CrossRoI-Track as we observe the similar performance
from 4 to 5, even though the number of IDs increases from trend in the previous experiments in Figures 14 and 15.
8.6 to 11.6. We conjecture that more cameras provide more First, Figure 22a shows that Argus achieves 2.19× latency
opportunities for the parallel processing of IDs across dis- gain, which mainly comes from reducing the number of
tributed cameras, and the benefit becomes apparent when IDs from 28.47 (4-8 objects×4-6 cameras) to 5.79. Next,
all five cameras are involved. Figure 22b shows the tracking quality of Conv-Track and
Argus. The base accuracy of Conv-Track varies across scenes
depending on ground truth labeling granularity and detec-
6.6 Impact of Inspection Order tor performance (e.g., retail scenes contain a lot of occlusions
We investigate the impact of the inspection order on system resulting in detection failure, whereas the ground truth label
performance. For the study, we developed three variants of is provided for all objects regardless of occlusion). Argus
Argus, namely Static, Reverse, and Crowded-first. All of them consistently shows marginal accuracy drop compared to
are built upon the original Argus system. The Static and Conv-Track, showing the robustness of our spatio-temporal
Reverse variants inspect the cameras and bounding boxes in association technique.
14

(a) Resource cost. (b) Tracking quality.

Fig. 22: Overall performance on MMPTRACK.

TABLE 4: Component-wise microbenchmark.


Detection (YOLOv5n [22]) Identification (batch size 4)
Device
1920 × 1080 1280 × 720 ResNet 101 [25] ResNet 50 [26]
NX 0.359s 0.073s 0.399s 0.063s
AGX 0.084s 0.038s 0.217s 0.027s
Fig. 23: Snapshots of real-world case study.

TABLE 3: Performance of Argus in the real-world case study. is interesting, especially given that the average number
of vehicles captured per video frame in the parking lot
Resource Efficiency Tracking Quality exceeded the count of vehicles in the CityFlowV2 dataset.
Latency Number of IDs MOTP MOTA We conjecture this is primarily due to the largely stationary
Parking lot 0.21s 3.2 0.71 0.95
nature of vehicles in the parking lot, allowing the benefit of
our spatio-temporal association to be maximized. Similarly,
6.8 Real-world Case Study & System Overhead the tracking quality in the parking lot is higher than that in
the CityFlowV2 dataset. The MOTP values in the parking lot
We conducted a supplementary experiment to investigate
and the CityFlowV2 dataset were 0.71 and 0.63, respectively.
both the performance and the operational characteristics
We attribute this to the relatively shorter distance between
of Argus’s runtime system within a practical deployment
the camera and the vehicles in the parking lot, enabling the
scenario. To achieve this objective, we installed four cam-
capture of vehicles at a larger scale.
eras and four Jetson AGX boards within a parking lot of
We delve deeper into the system overhead of Argus
the institute under the consent, employing them to record
with this deployment setup. Aside from object detection
videos at a resolution of 1080p with a frame rate of 10 frames
and identification, the principal operations of the Argus
per second. The parking lot selected for this study spans
system encompass two elements: (1) mapping-entry match-
an approximate area of 50 metres by 25 metres. To ensure
ing and (2) workload distribution decision-making. How-
a comprehensive coverage, the cameras were positioned
ever, according to our measurements derived from the
at the corners of the parking lot at a height of 3 metres;
real-world case study, the overhead associated with both
the corresponding AGX board is connected to the camera
these operations is negligible, quantified as less than a few
via the Ethernet cable and put on the ground. Figure 23
milliseconds. This minimal overhead can be attributed to
shows the snapshots of four cameras. Each video stream has
our efficient management of mapping entries via a hash
an approximate duration of an hour and the ground truth
table for the first operation. Additionally, for the second
data contains 60 vehicles in total. Since the target objects
operation, the system only needs to consider a relatively
are vehicles, we used the same object detection model and
small number of cases—typically fewer than five identifica-
identification model used in the CityFlowV2 dataset.
tion operations—when making distribution decisions. This
Table 3 shows the Argus’s overall performance in the
streamlined approach contributes to the overall efficiency
real-world case study. It is important to emphasize that we
and effectiveness of the Argus system.
did not conduct a comparative study due to the inability
to guarantee consistent behaviours across repetitive exper-
iments in real-world deployment. Moreover, a thorough 6.9 Micro-benchmark
analysis, compared with baseline methods, has already
been reported in preceding subsection. When contrasting We perform a micro-benchmark to better understand
the results obtained from the parking lot experiment with the resource characteristics of model inference on smart
the CityFlowV2 dataset, it is interesting to note that the cameras. Table 4 shows the latency of vision models we
parking lot exhibited marginally superior performance in used on Jetson NX and AGX; we report the detection latency
the aspects both of resource efficiency and tracking quality; at different image sizes. While the processing capability
we did not compare with the CAMPUS dataset due to the of smart cameras is still limited compared to the cloud
discrepancy in target objects and their respective charac- environment, performance can be optimized by applying
teristics. For instance, the average number of identification the right configurations depending on the requirement, e.g.,
tasks within the parking lot is 3.2, while the CityFlowV2 720p images with people tracking on Jetson NX. We also
dataset showed a higher figure of 10.7. This difference showed that Argus can further optimize the latency (and
15

corresponding throughput) by leveraging the spatial and Significantly, calibration may be impossible if video ana-
temporal association. lytics are detached from camera providers. Current video
analytics are restricted in leveraging the potential of de-
ployed cameras due to hard-coded analytics capabilities
7 R ELATED W ORK from tightly coupled hardware and software, and isolated
camera deployments from various service providers. We
propose a paradigm shift towards software-defined video an-
alytics, where analytic logics are decoupled from deployed
7.1 Cross-Camera Collaboration
cameras. This allows for dynamic composition and execu-
7.1.1 Multi-view Tracking using Camera Geometry tion of analytic services on demand, without altering or
Camera geometry, also referred to as the geometry of accessing the hardware. For instance, individual shops may
multiple views, has been studied for multiple decades to wish to run different analytic services using camera streams
enable accurate tracking of objects from different camera provided by shopping malls. However, camera parameters
views. It deals with the mathematical relationships between may be accessible only to the camera provider (e.g., owner
3D world points and their 2D projections onto the image of the shopping mall) and can change without notice de-
plane [27]. By understanding these relationships, the 3D pending on the provider’s requirements. In contrast, Argus,
structure of a scene, object, or person has been able to be relying solely on camera streams, can still be implemented
recovered from multiple 2D views, which enables the track- and supported on the video analytics’ side.
ing of objects even when they move out of one camera’s FoV
and into another [42]. Camera geometry has been applied in 7.1.2 Systems for Cross-Camera Collaboration
various fields, such as robotics, computer vision, and motion
capture, where the use of multiple synchronized cameras Enriched video analytics. One direction for cross-camera
with overlapping FoVs can improve the tracking accuracy collaboration is to provide enriched and combined video
and robustness of the system [43]. analytics from different angles and areas of multiple cam-
The foundation of multi-view tracking is the estimation eras [45], [46], [47]. Liu et al. developed Caesar [45], a system
of the fundamental matrix, which encodes the geometric that detects cross-camera complex activities, e.g., a person
relationship between the views of two cameras [27]. This walking in one camera and later talking to another person
matrix can be used to compute the epipolar geometry, which in another camera, by designing an abstraction for these
describes the relationship between corresponding points in activities and combining DNN-based activity detection from
the two images and can be utili‘ed to find the corresponding non-overlapping cameras. Li et al. presented a camera collab-
point in the other view when a point is detected in one oration system [46] that performs active object tracking by
view [42]. By using the fundamental matrix, triangulation exploiting the intrinsic relationship between camera posi-
techniques can be employed to estimate the 3D position of tions. Jha et al. developed Visage [47] which enables 3D im-
the tracked object in the scene [27]. Also, bundle adjustment, age analytics from multiple video streams from drones. Our
a non-linear optimisation technique, has been used to refine work can serve as an underlying on-camera framework for
camera parameters and 3D structure of the scene, leading to these works, providing multi-camera, multi-task tracking as
a more accurate estimation of the object’s position [44]. a primitive task on distributed smart cameras.
Despite the advantages of camera geometry in enabling
Resource efficiency. Another direction for cross-camera col-
tracking from multi-camera views, there are several lim-
laboration is to reduce the computational and communi-
itations in its deployment. One major challenge is the
cation costs of multiple video streams by exploiting their
sensitivity to camera calibration errors, which can lead to
spatial and temporal correlation [2], [3], [19]. Jain et al.
inaccurate 3D reconstruction and subsequently impact the
proposed Spatula [2], [19], a cross-camera collaboration
tracking performance [27]. The calibration process requires
system that targets wide-area camera networks with non-
the precise estimation of intrinsic camera parameters, such
overlapping cameras and limits the amount of video data
as focal length and lens distortion, and extrinsic parameters,
and corresponding communication to be analysed by iden-
like camera pose and orientation, which can be difficult to
tifying only those cameras and frames that are likely to
obtain in practical applications [42]. When using cameras
contain the target objects. REV [1] also aims at reduc-
with pan-tilt-zoom (PTZ) capabilities, the camera geometry
ing the number of cameras processed by incrementally
needs to be recalculated each time the camera view changes,
searching the cameras within the overlapping group and
adding to the complexity and computational load of the
opportunistically skipping processing the rest as soon as
tracking process. Similarly, the process of calibration should
the target has been detected. CrossRoI [3] and Polly [48]
be also repeated each time there are changes in the camera
leverages spatial correlation to extract the minimum-sized
set and topology, such as the addition of a new camera,
RoI from overlapping cameras and reduces processing and
failure of an existing camera, or the change of a camera’s
transmission costs by filtering out unmasked RoIs. All such
position in a retail store. While Argus also needs to adapt
works share the same high-level goal as Argus in that they
to these dynamics, it can be done more easily simply by
leverage spatio-temporal correlation from multiple cameras,
adjusting or regenerating the spatio/temporal association.
but Argus differs in several aspects, as shown in Table 1.
Moreover, in multi-view tracking using camera geometry,
occlusions and ambiguities in object appearances can pose Distributed processing. There have been several attempts
significant challenges in identifying corresponding points to distribute video analytics workloads from large-scale
across different views, leading to erroneous tracking [43]. video data to distributed cameras [39], [49]. VideoEdge [49]
16

optimises the trade-off between resources and accuracy by EC2 server4 ; it will be much higher when the network costs
partitioning the analytics pipeline into hierarchical clusters for 5.4 PB of video data are added. The bigger problem is
(cameras, private clusters, public clouds). Distream [39] that, even with excessive cameras, most of the video stream
adaptively balances workloads across smart cameras. Al- is never used. The study [74] further showed that less than
though this work provided a foundation for the devel- 0.005% of the video data is retrieved and used by less than
opment of distributed video analytics systems, it mainly 2% of the cameras.
focused on the video analytics pipeline with one camera To address this problem, on-camera AI processing is
as a main workload. In this work, we identify that multi- becoming increasingly popular [39], [74], [75], enabled by
camera, multi-target tracking is a primary underlying task two recent technology trends. First, low-cost, low-power
for overlapping camera environments, and propose an on- and programmable on-board AI accelerators are becoming
camera distributed processing strategy tailed to it. available [5], such as Nvidia Jetson, Google TPU and Ana-
log MAX78000. Second, lightweight, accurate and robust
embedded-ML models are emerging [12], [13]. Most im-
7.2 Resource-Efficient Video Analytics Systems
portantly, on-device processing is preferred as the privacy-
On-device processing. Many video analytics systems have sensitive raw image data does not need to be transferred to
been proposed to efficiently process a large volume of the cloud.
video data on low-end cameras, e.g., by adopting on-camera
frame filtering [50], [51], [52], pipeline adaptation [53], [54], Incorporation with non-overlapping camera collaboration.
edge-cloud collaborative inference [38], [55], [56], [57], [58] Argus currently targets cross-camera collaboration within a
RoI extraction [59], [60], [61], [62], [63]. On-camera frame closed set of cameras. However, analytics applications might
filtering techniques filter out the computationally intensive want to track objects in a wide area where large camera
execution of vision models in the early stages, e.g., by networks, including non-overlapping cameras, are installed,
dynamically adapting filtering decisions [51] and leveraging e.g., suspect monitoring in a large shopping mall or traffic
cheap CNN classifiers [50]. Yi et al. presented EagleEye [54], surveillance in an urban city. Our ultimate goal is to develop
a pipeline adaptation system for person identification that a system that supports seamless and efficient tracking across
selectively uses different face detection models depending overlapping and non-overlapping cameras by adopting the
on the quality of face images. MARLIN [53] has been solutions for non-overlapping cameras [2], [19].
proposed to selectively perform a deep neural network for Support for diverse coordination topologies. For cross-
energy-efficient object tracking. camera coordination, we assume a star topology where the
Computation offloading. Several attempts have been made most powerful camera becomes the head in a group, schedul-
to dynamically adjust video bitrate to optimise the net- ing multi-camera multi-target tracking operations for all
work bandwidth consumption to enable low-latency of- cameras and the other cameras become group members that
floading [62], [64], [65], optimise the video streaming proto- follow the head’s decision. We believe that our decision
col [63], and design DNN-aware video compression meth- is practical because the coordination overhead is marginal
ods [66], [67]. The other direction for efficient processing is as shown in §6.9, but sophisticated coordination would be
DNN inference scheduling from multiple video streams on necessary if more cameras are involved.
the GPU cluster [17], [18], [68], DNN merging for memory
Further optimization by splitting AI models Argus treats
optimisation [69], privacy-aware video analytics [70], [71],
AI models as a black box, thereby taking their full execution
and resource-efficient continual learning [72], [73].
as a primitive task for distributed processing. Splitting deep
While these works manage to achieve remarkable per-
neural networks into distributed cameras, e.g., Neurosur-
formance improvement, their attempts usually focus on a
geon [57], would allow further optimisation if we can have
single camera (or its server). In contrast to these works, we
access to the weights of the pre-trained models. We leave it
target an environment where multiple cameras are installed
as future work.
in close proximity, and focus on optimising cross-camera
operations by leveraging the spatio/temporal association of Cross-camera communication channel. For the commu-
objects. Argus can further improve system-wide resource nication channel between cameras, we consider a Gigabit
efficiency by applying these techniques. wired connection, which is already commonly used for
existing CCTV networks, e.g., at an intersection [4] and
a campus [20]. Considering that overlapping cameras are
8 D ISCUSSION AND F UTURE W ORKS deployed in proximity to each other, such an assumption
Why on-device processing on distributed smart cameras? would be still valid even in other environments. However,
The cost of video analytics is becoming a huge problem due when the communication channel is constrained, e.g., over
to the enormous amount of video data. The authors [74] cellular networks, the network overhead may dominate
studied the six-month deployment of over 1000 cameras at and the latency improvement achieved by multi-camera
Peking University, China, and reported that the cameras parallel processing could be less than expected. We leave
produced over 3 million hours of videos (5.4 PB). If we the detailed analysis as future work.
assume a simple application using the ResNet-18 model
at 30 frames per second continuously, the estimated ML
4. 100K inferences of ResNet18 costs 0.82 USD and the total number
operating expenses (OpEx) for six months would be 3.83 of inferences is 466,560,000,000 (= 30 fps × 60 sec × 60 min × 24 hrs ×
million USD if ML inference is executed on the Amazon 180 days × 1,000 cameras).
17

9 C ONCLUSION [17] H. Shen, L. Chen, Y. Jin, L. Zhao, B. Kong, M. Philipose, A. Kr-


ishnamurthy, and R. Sundaram, “Nexus: a gpu cluster engine for
We presented Argus, a first-kind-of distributed system for accelerating dnn-based video analysis,” in Proceedings of the 27th
robust and low-latency video analytics with cross-camera ACM Symposium on Operating Systems Principles, 2019, pp. 322–337.
collaboration on multiple cameras. We developed a novel [18] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl,
object-wise spatio-temporal association that optimises the and M. J. Freedman, “Live video analytics at scale with ap-
proximation and delay-tolerance,” in 14th USENIX Symposium on
multi-camera, multi-target tracking by intelligently filtering Networked Systems Design and Implementation (NSDI 17), 2017, pp.
out unnecessary, redundant identification operations. We 377–392.
also developed a distributed scheduling technique that dy- [19] S. Jain, G. Ananthanarayanan, J. Jiang, Y. Shu, and J. Gonzalez,
“Scaling video analytics systems to large camera deployments,” in
namically orders the sequence of camera and bounding box Proceedings of the 20th International Workshop on Mobile Computing
inspection and distributes the identification workload across Systems and Applications, 2019, pp. 9–14.
multiple cameras. Evaluation on three real-world overlap- [20] Y. Xu, X. Liu, Y. Liu, and S.-C. Zhu, “Multi-view people tracking
ping camera datasets shows that Argus reduces the number via hierarchical trajectory composition,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp.
of identification model executions and end-to-end latency 4256–4265.
by up to 7.13× and 2.19× (4.86× and 1.60× compared to [21] X. Han, Q. You, C. Wang, Z. Zhang, P. Chu, H. Hu, J. Wang,
the state-of-the-arts). and Z. Liu, “Mmptrack: Large-scale densely annotated multi-
camera multiple people tracking benchmark,” in Proceedings of
the IEEE/CVF Winter Conference on Applications of Computer Vision,
R EFERENCES 2023, pp. 4860–4869.
[22] “Yolo-v5,” ”https://pytorch.org/hub/ultralytics yolov5/”, 24
[1] T. Xu, K. Shen, Y. Fu, H. Shi, and F. X. Lin, “Rev: A video engine Jan. 2024.
for object re-identification at the city scale,” in 2022 IEEE/ACM 7th [23] N. Friedman and S. Russell, “Image segmentation in video se-
Symposium on Edge Computing (SEC). IEEE, 2022, pp. 189–202. quences: A probabilistic approach,” arXiv preprint arXiv:1302.1539,
[2] S. Jain, X. Zhang, Y. Zhou, G. Ananthanarayanan, J. Jiang, Y. Shu, 2013.
P. Bahl, and J. Gonzalez, “Spatula: Efficient cross-camera video an- [24] Z. Zivkovic and F. Van Der Heijden, “Efficient adaptive density es-
alytics on large camera networks,” in 2020 IEEE/ACM Symposium timation per image pixel for the task of background subtraction,”
on Edge Computing (SEC). IEEE, 2020, pp. 110–124. Pattern recognition letters, vol. 27, no. 7, pp. 773–780, 2006.
[3] H. Guo, S. Yao, Z. Yang, Q. Zhou, and K. Nahrstedt, “Crossroi: [25] H. Luo, W. Chen, X. Xu, J. Gu, Y. Zhang, C. Liu, Y. Jiang, S. He,
Cross-camera region of interest optimization for efficient real time F. Wang, and H. Li, “An empirical study of vehicle re-identification
video analytics at scale,” arXiv preprint arXiv:2105.06524, 2021. on the ai city challenge,” in Proceedings of the IEEE/CVF Conference
[4] M. Naphade, S. Wang, D. C. Anastasiu, Z. Tang, M.-C. Chang, on Computer Vision and Pattern Recognition, 2021, pp. 4095–4102.
X. Yang, Y. Yao, L. Zheng, P. Chakraborty, C. E. Lopez et al., “The
[26] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, and J. Kautz,
5th ai city challenge,” in Proceedings of the IEEE/CVF Conference on
“Joint discriminative and generative learning for person re-
Computer Vision and Pattern Recognition, 2021, pp. 4263–4273.
identification,” in Proceedings of the IEEE/CVF Conference on Com-
[5] M. Antonini, T. H. Vu, C. Min, A. Montanari, A. Mathur, and
puter Vision and Pattern Recognition, 2019, pp. 2138–2147.
F. Kawsar, “Resource characterisation of personal-scale sensing
[27] R. Hartley and A. Zisserman, Multiple view geometry in computer
models on edge accelerators,” in Proceedings of the First Interna-
vision. Cambridge university press, 2003.
tional Workshop on Challenges in Artificial Intelligence and Machine
Learning for Internet of Things, 2019, pp. 49–55. [28] G. P. Stein, “Tracking from multiple view points: Self-calibration
[6] A. Moss, H. Lee, L. Xun, C. Min, F. Kawsar, and A. Montanari, of space and time,” in Proceedings. 1999 IEEE Computer Society
“Ultra-low power dnn accelerators for iot: Resource characteriza- Conference on Computer Vision and Pattern Recognition (Cat. No
tion of the max78000,” in Proceedings of the 20th ACM Conference on PR00149), vol. 1. IEEE, 1999, pp. 521–527.
Embedded Networked Sensor Systems, 2022, pp. 934–940. [29] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis,
[7] T. Gong, S. Y. Jang, U. G. Acer, F. Kawsar, and C. Min, “Collabora- J. Tops, and R. Koch, “Visual modeling with a hand-held camera,”
tive inference via dynamic composition of tiny ai accelerators on International Journal of Computer Vision, vol. 59, pp. 207–232, 2004.
mcus,” arXiv preprint arXiv:2401.08637, 2023. [30] P. Sturm and B. Triggs, “A factorization based algorithm for
[8] “Nvidia jetson agx,” https://www.nvidia.com/en-gb/ multi-image projective structure and motion,” in Computer Vi-
autonomous-machines/embedded-systems/jetson-agx-xavier/, sion—ECCV’96: 4th European Conference on Computer Vision Cam-
accessed: 24 Jan. 2024. bridge, UK, April 15–18, 1996 Proceedings Volume II 4. Springer,
[9] “Nvidia jetson nx,” https://www.nvidia.com/en-gb/ 1996, pp. 709–720.
autonomous-machines/embedded-systems/jetson-xavier-nx/, [31] D. Makris, T. Ellis, and J. Black, “Bridging the gaps between cam-
accessed: 24 Jan. 2024. eras,” in Proceedings of the 2004 IEEE Computer Society Conference on
[10] “Google coral,” https://coral.ai/, accessed: 24 Jan. 2024. Computer Vision and Pattern Recognition, 2004. CVPR 2004., vol. 2.
[11] “Analog max78000,” https://www.analog.com/en/products/ IEEE, 2004, pp. II–II.
max78000.html, accessed: 24 Jan. 2024. [32] J. Black, T. Ellis, and P. Rosin, “Multi view image surveillance
[12] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and and tracking,” in Workshop on Motion and Video Computing, 2002.
E. Choi, “Morphnet: Fast & simple resource-constrained structure Proceedings. IEEE, 2002, pp. 169–174.
learning of deep networks,” in Proceedings of the IEEE conference on [33] L. Lee, R. Romano, and G. Stein, “Monitoring activities from mul-
computer vision and pattern recognition, 2018, pp. 1586–1595. tiple video streams: Establishing a common coordinate frame,”
[13] P. Guo, B. Hu, and W. Hu, “Mistify: Automating dnn model IEEE Transactions on pattern analysis and machine intelligence, vol. 22,
porting for on-device inference at the edge,” in 18th USENIX no. 8, pp. 758–767, 2000.
Symposium on Networked Systems Design and Implementation (NSDI [34] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online
21), 2021, pp. 705–719. and realtime tracking,” in 2016 IEEE international conference on
[14] C. Liu, Y. Zhang, H. Luo, J. Tang, W. Chen, X. Xu, F. Wang, H. Li, image processing (ICIP). IEEE, 2016, pp. 3464–3468.
and Y.-D. Shen, “City-scale multi-camera vehicle tracking guided [35] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime
by crossroad zones,” in Proceedings of the IEEE/CVF Conference on tracking with a deep association metric,” in 2017 IEEE international
Computer Vision and Pattern Recognition, 2021, pp. 4129–4137. conference on image processing (ICIP). IEEE, 2017, pp. 3645–3649.
[15] K. Shim, S. Yoon, K. Ko, and C. Kim, “Multi-target multi-camera [36] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and
vehicle tracking for city-scale traffic management,” in Proceedings A. Zisserman, “The pascal visual object classes (voc) challenge,”
of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- International journal of computer vision, vol. 88, no. 2, pp. 303–338,
tion, 2021, pp. 4193–4200. 2010.
[16] Y.-L. Li, Z.-Y. Chin, M.-C. Chang, and C.-K. Chiang, “Multi-camera [37] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez,
tracking by candidate intersection ratio tracklet matching,” in and I. Stoica, “Clipper: A low-latency online prediction serving
Proceedings of the IEEE/CVF Conference on Computer Vision and system,” in 14th USENIX Symposium on Networked Systems Design
Pattern Recognition, 2021, pp. 4103–4111. and Implementation (NSDI 17), 2017, pp. 613–627.
18

[38] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. cloud and mobile edge,” ACM SIGARCH Computer Architecture
Lane, “Spinn: synergistic progressive inference of neural networks News, vol. 45, no. 1, pp. 615–629, 2017.
over device and cloud,” in Proceedings of the 26th annual interna- [58] J. Yi, S. Kim, J. Kim, and S. Choi, “Supremo: Cloud-assisted low-
tional conference on mobile computing and networking, 2020, pp. 1–15. latency super-resolution in mobile devices,” IEEE Transactions on
[39] X. Zeng, B. Fang, H. Shen, and M. Zhang, “Distream: scaling live Mobile Computing, 2020.
video analytics with workload-adaptive distributed edge intelli- [59] W. Zhang, Z. He, L. Liu, Z. Jia, Y. Liu, M. Gruteser, D. Raychaud-
gence,” in Proceedings of the 18th Conference on Embedded Networked huri, and Y. Zhang, “Elf: accelerate high-resolution mobile deep
Sensor Systems, 2020, pp. 409–421. vision with content-aware parallel offloading,” in Proceedings of
[40] J. S. Jeong, J. Lee, D. Kim, C. Jeon, C. Jeong, Y. Lee, and B.-G. the 27th Annual International Conference on Mobile Computing and
Chun, “Band: coordinated multi-dnn inference on heterogeneous Networking, 2021, pp. 201–214.
mobile processors,” in Proceedings of the 20th Annual International [60] S. Jiang, Z. Lin, Y. Li, Y. Shu, and Y. Liu, “Flexible high-resolution
Conference on Mobile Systems, Applications and Services, 2022, pp. object detection on edge devices with tunable latency,” in Proceed-
235–247. ings of the 27th Annual International Conference on Mobile Computing
[41] K. Bernardin, A. Elbs, and R. Stiefelhagen, “Multiple object track- and Networking, 2021, pp. 559–572.
ing performance metrics and evaluation in a smart room environ- [61] K. Yang, J. Yi, K. Lee, and Y. Lee, “Flexpatch: Fast and ac-
ment,” in Sixth IEEE International Workshop on Visual Surveillance, curate object detection for on-device high-resolution live video
in conjunction with ECCV, vol. 90, no. 91. Citeseer, 2006. analytics,” in IEEE INFOCOM 2022-IEEE Conference on Computer
[42] R. Szeliski, Computer vision: algorithms and applications. Springer Communications. IEEE, 2022, pp. 1898–1907.
Nature, 2022. [62] L. Liu, H. Li, and M. Gruteser, “Edge assisted real-time object
[43] K. Kanatani, Geometric computation for machine vision. Oxford detection for mobile augmented reality,” in The 25th Annual Inter-
University Press, Inc., 1993. national Conference on Mobile Computing and Networking, 2019, pp.
[44] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, 1–16.
“Bundle adjustment—a modern synthesis,” in Vision Algorithms: [63] K. Du, A. Pervaiz, X. Yuan, A. Chowdhery, Q. Zhang, H. Hoff-
Theory and Practice: International Workshop on Vision Algorithms mann, and J. Jiang, “Server-driven video streaming for deep learn-
Corfu, Greece, September 21–22, 1999 Proceedings. Springer, 2000, ing inference,” in Proceedings of the Annual conference of the ACM
pp. 298–372. Special Interest Group on Data Communication on the applications,
[45] X. Liu, P. Ghosh, O. Ulutan, B. Manjunath, K. Chan, and R. Govin- technologies, architectures, and protocols for computer communication,
dan, “Caesar: cross-camera complex activity recognition,” in Pro- 2020, pp. 557–570.
ceedings of the 17th Conference on Embedded Networked Sensor Sys- [64] B. Zhang, X. Jin, S. Ratnasamy, J. Wawrzynek, and E. A. Lee, “Aw-
tems, 2019, pp. 232–244. stream: Adaptive wide-area streaming analytics,” in Proceedings
[46] J. Li, J. Xu, F. Zhong, X. Kong, Y. Qiao, and Y. Wang, “Pose- of the 2018 Conference of the ACM Special Interest Group on Data
assisted multi-camera collaboration for active object tracking,” in Communication, 2018, pp. 236–252.
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, [65] R. Lu, C. Hu, D. Wang, and J. Zhang, “Gemini: a real-time video
no. 01, 2020, pp. 759–766. analytics system with dual computing resource control,” in 2022
[47] S. Jha, Y. Li, S. Noghabi, V. Ranganathan, P. Kumar, A. Nelson, IEEE/ACM 7th Symposium on Edge Computing (SEC). IEEE, 2022,
M. Toelle, S. Sinha, R. Chandra, and A. Badam, “Visage: Enabling pp. 162–174.
timely analytics for drone imagery,” in Proceedings of the 27th An- [66] X. Xie and K.-H. Kim, “Source compression with bounded dnn
nual International Conference on Mobile Computing and Networking, perception loss for iot edge computer vision,” in The 25th Annual
2021, pp. 789–803. International Conference on Mobile Computing and Networking, 2019,
pp. 1–16.
[48] J. Li, L. Liu, H. Xu, S. Wu, and C. J. Xue, “Cross-camera inference
[67] K. Du, Q. Zhang, A. Arapin, H. Wang, Z. Xia, and J. Jiang, “Ac-
on the constrained edge,” in Proc. IEEE INFOCOM, 2023.
cmpeg: Optimizing video encoding for accurate video analytics,”
[49] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu,
Proceedings of Machine Learning and Systems, vol. 4, pp. 450–466,
P. Bahl, and M. Philipose, “Videoedge: Processing camera streams
2022.
using hierarchical clusters,” in 2018 IEEE/ACM Symposium on Edge
[68] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica,
Computing (SEC). IEEE, 2018, pp. 115–131.
“Chameleon: scalable adaptation of video analytics,” in Proceed-
[50] K. Hsieh, G. Ananthanarayanan, P. Bodik, S. Venkataraman, ings of the 2018 Conference of the ACM Special Interest Group on Data
P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu, “Focus: Query- Communication, 2018, pp. 253–266.
ing large video datasets with low latency and low cost,” in 13th [69] A. Padmanabhan, N. Agarwal, A. Iyer, G. Ananthanarayanan,
USENIX Symposium on Operating Systems Design and Implementa- Y. Shu, N. Karianakis, G. H. Xu, and R. Netravali, “Gemel: Model
tion (OSDI 18), 2018, pp. 269–286. merging for memory-efficient, real-time video analytics at the
[51] Y. Li, A. Padmanabhan, P. Zhao, Y. Wang, G. H. Xu, and R. Ne- edge,” in USENIX NSDI, April 2023.
travali, “Reducto: On-camera filtering for resource-efficient real- [70] F. Cangialosi, N. Agarwal, V. Arun, S. Narayana, A. Sarwate, and
time video analytics,” in Proceedings of the Annual conference of R. Netravali, “Privid: Practical,{Privacy-Preserving} video ana-
the ACM Special Interest Group on Data Communication on the lytics queries,” in 19th USENIX Symposium on Networked Systems
applications, technologies, architectures, and protocols for computer Design and Implementation (NSDI 22), 2022, pp. 209–228.
communication, 2020, pp. 359–376. [71] R. Lu, S. Shi, D. Wang, C. Hu, and B. Zhang, “Preva: Protecting
[52] M. Xu, T. Xu, Y. Liu, and F. X. Lin, “Video analytics with zero- inference privacy through policy-based video-frame transforma-
streaming cameras,” in 2021 USENIX Annual Technical Conference tion,” in 2022 IEEE/ACM 7th Symposium on Edge Computing (SEC).
(USENIX ATC 21), 2021, pp. 459–472. IEEE, 2022, pp. 175–188.
[53] K. Apicharttrisorn, X. Ran, J. Chen, S. V. Krishnamurthy, and A. K. [72] R. Bhardwaj, Z. Xia, G. Ananthanarayanan, J. Jiang, Y. Shu,
Roy-Chowdhury, “Frugal following: Power thrifty object detection N. Karianakis, K. Hsieh, P. Bahl, and I. Stoica, “Ekya: Continuous
and tracking for mobile augmented reality,” in Proceedings of the learning of video analytics models on edge compute servers,” in
17th Conference on Embedded Networked Sensor Systems, 2019, pp. 19th USENIX Symposium on Networked Systems Design and Imple-
96–109. mentation (NSDI 22), 2022, pp. 119–135.
[54] J. Yi, S. Choi, and Y. Lee, “Eagleeye: Wearable camera-based [73] K. Mehrdad, G. Ananthanarayanan, K. Hsieh, J. J. , R. N. , Y. Shu,
person identification in crowded urban spaces,” in Proceedings of M. Alizadeh, and V. Bahl, “Recl: Responsive resource-efficient
the 26th Annual International Conference on Mobile Computing and continuous learning for video analytics,” in USENIX NSDI, April
Networking, 2020, pp. 1–14. 2023.
[55] Y. Wang, W. Wang, D. Liu, X. Jin, J. Jiang, and K. Chen, “Enabling [74] M. Xu, T. Xu, Y. Liu, and F. X. Lin, “Video analytics with zero-
edge-cloud video analytics for robotics applications,” IEEE Trans- streaming cameras,” in 2021 USENIX Annual Technical Conference
actions on Cloud Computing, 2022. (USENIX ATC 21), 2021, pp. 459–472.
[56] M. Almeida, S. Laskaridis, S. I. Venieris, I. Leontiadis, and N. D. [75] M. Xu, Y. Liu, and X. Liu, “A case for camera-as-a-service,” IEEE
Lane, “Dyno: Dynamic onloading of deep neural networks from Pervasive Computing, vol. 20, no. 2, pp. 9–17, 2021.
cloud to device,” ACM Transactions on Embedded Computing Sys-
tems, vol. 21, no. 6, pp. 1–24, 2022.
[57] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and
L. Tang, “Neurosurgeon: Collaborative intelligence between the

You might also like