Cross-Camera Tracking
Abstract—Overlapping cameras offer exciting opportunities to view a scene from different angles, allowing for more advanced,
comprehensive and robust analysis. However, existing visual analytics systems for multi-camera streams are mostly limited to (i)
per-camera processing and aggregation and (ii) workload-agnostic centralized processing architectures. In this paper, we present
Argus, a distributed video analytics system with cross-camera collaboration on smart cameras. We identify multi-camera, multi-target
tracking as the primary task of multi-camera video analytics and develop a novel technique that avoids redundant, processing-heavy
identification tasks by leveraging object-wise spatio-temporal association in the overlapping fields of view across multiple cameras. We
further develop a set of techniques to perform these operations across distributed cameras without cloud support at low latency by (i)
dynamically ordering the camera and object inspection sequence and (ii) flexibly distributing the workload across smart cameras,
taking into account network transmission and heterogeneous computational capacities. Evaluation of three real-world overlapping
camera datasets with two Nvidia Jetson devices shows that Argus reduces the number of object identifications and end-to-end latency
by up to 7.13× and 2.19× (4.86× and 1.60× compared to the state-of-the-art), while achieving comparable tracking quality.
1 INTRODUCTION
TABLE 1: Comparison of the cross-camera collaboration approach in Argus with REV [1], Spatula [2], and CrossRoI [3].
Target environment. REV: overlapping cameras; Spatula: non-overlapping cameras; CrossRoI: overlapping cameras; Argus (ours): overlapping cameras.
Optimization goal. REV: on-server computation costs; Spatula: communication and on-server computation costs; CrossRoI: communication and on-server computation costs; Argus: end-to-end latency on cameras.
Collaboration granularity. REV: cells (groups of cameras); Spatula: cameras; CrossRoI: areas (RoIs); Argus: objects.
Applying association. REV: dynamic (depending on the target's existence); Spatula: dynamic (depending on the target's existence); CrossRoI: static (once when the cameras are deployed); Argus: dynamic (depending on the target's location).
Approach. REV: incrementally search the cells that have the lowest identification confidence; Spatula: identify the subset of cameras that capture target objects; CrossRoI: find the smallest RoI that contains the target objects; Argus: minimise the number of identification operations across cameras.
Video processing. REV: centralized; Spatula: centralized; CrossRoI: centralized; Argus: distributed.
process involves determining the location and capturing image crops of target objects (presented as query images) on deployed cameras over time. We find that the computational bottleneck for camera collaboration arises from the frequent execution of identification model inference across different cameras. To address this challenge, we develop a fine-grained, object-wise spatio-temporal association technique. This novel approach strategically avoids redundant identification tasks on both the spatial (across multiple cameras) and temporal (within each camera over time) axes, which not only streamlines the process but also enhances the efficiency of the system.

To enable effective multi-camera, multi-target tracking across overlapping cameras, we develop an object-wise association-aware identification technique. Specifically, Argus continuously tracks records of the association of objects (their bounding boxes) with the same identity across both multiple cameras (§4.1) and time (§4.2). Then, it identifies an object by matching the location association instead of running the identification model inference and matching the appearance feature. The concept of spatio-temporal association has been proposed in several previous works to reduce repeated identification of the same appearance or processing of query-irrelevant areas [2], [3], [19]. However, they apply association at a coarse-grained level, e.g., groups of cameras [1], cameras [2], [19] or regions of interest (RoIs) [3]. Thus, the expected gain is small for our target environment, which is multi-camera, multi-target tracking on overlapping cameras. For example, the resource saving from camera-wise association and filtering [2], [19] is expected to be marginal for densely deployed overlapping cameras. RoI-wise association and filtering [3] also degrades tracking accuracy, as the target object is not detected on a subset of cameras. Please refer to Table 1 and §7 for more details of these works. In §2.2 and §6.2, we also provide an in-depth analysis and a comparative study with these prior works, respectively. Furthermore, we carefully incorporate techniques to handle corner cases in the association process (e.g., newly appearing objects, occasional failure of the identification model and its error propagation) and improve the robustness of the spatio-temporal association process (§4.4).

Next, we develop a set of strategies that perform spatio-temporal association over distributed smart cameras at low latency. To maximize the benefit of association-aware identification, the system needs to process cameras one by one in a sequential manner so that the number of identification model inferences is minimized; identification model inference needs to be performed only when the identity of the pivot object is not yet known. However, sequential processing would increase end-to-end latency, even with the smaller number of identification model inferences. Also, since cameras have different workloads (i.e., numbers of detected objects) and heterogeneous processing capabilities, careless scheduling and distribution might not maximize the overall performance. To this end, we develop a multi-camera dynamic inspector (§5.1) that dynamically orders the camera and bounding box inspection sequence to avoid identification tasks for query-irrelevant objects. We also distribute identification tasks across multiple cameras, taking into account network transmission and heterogeneous computing capacities on the fly, to minimize end-to-end latency (§5.2).

We prototype Argus on two Nvidia Jetson devices (AGX [8] and NX [9]) and evaluate its performance with three real-world overlapping camera datasets (CityFlowV2 [4], CAMPUS [20], and MMPTRACK [21]). The results show that Argus reduces the number of identification model executions and the end-to-end latency by up to 7.13× and 2.19× compared to the conventional per-camera processing pipeline (4.86× and 1.60× compared to the state-of-the-art spatio-temporal association), while achieving comparable tracking quality.

We summarize the contributions of this paper as follows.
• We present Argus, a novel system for robust and low-latency multi-camera video analytics with cross-camera collaboration on distributed smart cameras.
• To enable efficient cross-camera collaboration, we develop a novel object-wise spatio-temporal association technique that exploits the overlap in FoVs of multiple cameras to reduce redundancy in the multi-camera, multi-target tracking pipeline.
• We also develop a scheduling technique that dynamically schedules the inspection sequence and workload distribution across multiple cameras to optimise end-to-end latency.
• Extensive evaluations over three overlapping camera datasets show that Argus significantly reduces the number of identification model executions and end-to-end latency by up to 7.13× and 2.19× (4.86× and 1.60× compared to the state-of-the-art [2], [3]) while achieving comparable tracking quality to the baselines.

2 BACKGROUND AND MOTIVATION
2.1 Multi-Camera, Multi-Target Tracking

In this work, we focus on multi-camera, multi-target tracking using deep learning-based object detection and re-identification models. These models robustly track objects across multiple views even in complex scenarios by leveraging the discriminative power of deep neural networks. They also handle occlusions, changes in appearance, and other challenges that are difficult to address with geometry-based methods. To this end, they often learn from large-scale datasets, enabling them to generalize to a wide range of scenarios and adapt to changes in the environment.

Operational flow. The key to enabling video analytics on overlapping cameras is multi-camera, multi-target tracking: detecting and tracking target objects (given as query images) from video streams captured by multiple cameras. This is typically achieved in three stages, as shown in Figure 2. (i) The object detection stage detects the bounding boxes of objects in one frame on each camera using object detectors (e.g., YOLO [22]) or background subtraction techniques [23], [24]. (ii) The per-camera object identification stage extracts the appearance features of the detected objects by running the object identification (ID) model (e.g., [25]) and determines whether each object matches the query image based on feature similarity (e.g., L2 distance, cosine similarity). (iii) The result aggregation stage aggregates the identification results across multiple cameras and generates tracklets [14] that can be used for further processing by application logic, e.g., object counting, license plate extraction and face recognition.

Compute bottleneck: per-object identification. The main compute bottleneck is the execution of identification tasks, which need to be performed for all detected objects in every frame across multiple cameras to determine the identity of objects, as shown in Figure 2. Although we envision smart cameras equipped with built-in AI accelerators, they are not yet capable of processing a large number of identification tasks in real time. Table 2 shows the latency of two identification models (ResNet-101-based vehicle identification [25] and ResNet-50-based person identification [26]) with different batch sizes over two Nvidia Jetson devices. It shows that the number of identification model executions that can run on one camera in real time is quite limited. For example, if 4 vehicles are detected in every frame on average, even the powerful Jetson AGX platform can only process about 4 frames per second. The throughput would drop even further if object detection is included (we show the detailed results in §6).

2.2 Exploring Optimisation Opportunities

Redundant identification of the same objects. To explore the opportunities for optimizing the pipeline for multi-camera, multi-target tracking, we investigate the pattern of identification tasks with the CityFlowV2 dataset [4]; five cameras are installed at an intersection as shown in Figure 3. Figure 5 shows the redundancy probability, i.e., the probability of objects appearing simultaneously in multiple cameras, for different overlap ratios; the overlap ratio is defined as the ratio between the time an object appears simultaneously in both cameras and the total time it is detected in any camera; for a target appearing in n cameras, we calculate all pairwise overlap ratios (nC2 pairs) and take the average. Each point represents a different query. The results show that, as the overlap ratio increases, the probability of an object's appearance in multiple cameras also becomes higher. This means that a dense array of cameras with overlapping FoVs will incur more redundant identification tasks for the same object across multiple cameras.

Spatio-temporal association. To avoid unnecessary and redundant identification tasks, we adopt spatio-temporal association of objects, which has been proposed in the auto-calibration techniques [27], [28] for multi-view tracking systems.
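To make the overlap ratio concrete, the following minimal Python sketch computes the average pairwise overlap ratio of a target over all nC2 camera pairs, as defined above; the data layout (a set of frame indices per camera) and all names are our own illustration, not part of the Argus implementation.

```python
from itertools import combinations

def overlap_ratio(frames_a, frames_b):
    # Time the target is visible in both cameras divided by the time
    # it is detected in at least one of the two cameras.
    both = len(frames_a & frames_b)
    either = len(frames_a | frames_b)
    return both / either if either else 0.0

def mean_pairwise_overlap(per_camera_frames):
    # Average over all nC2 camera pairs in which the target appears.
    pairs = list(combinations(per_camera_frames.values(), 2))
    if not pairs:
        return 0.0
    return sum(overlap_ratio(a, b) for a, b in pairs) / len(pairs)

# Hypothetical detections of one target on three overlapping cameras.
detections = {
    "cam1": set(range(0, 100)),
    "cam2": set(range(50, 150)),
    "cam3": set(range(80, 120)),
}
print(round(mean_pairwise_overlap(detections), 3))
```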
by the observation that an object appears in the camera's frame by moving from out-of-FoV to FoV, we consider the bounding boxes that are newly located at the edge of the frame as potential candidates, and perform the ID feature extraction regardless of the matching mapping entry if no corresponding identification cache is found.

Handling occlusion. Depending on a camera's FoV, a target object might be obscured by another moving object. For instance, in a camera with a FoV perpendicular to the road, a vehicle in the front lane could occlude a vehicle in the rear lane. Under such circumstances, the detection model might fail to detect the target object. To handle errors resulting from sudden, short-term occlusions, we develop an interpolation technique that leverages the detection results from the preceding frame in the same camera and/or time-synchronised frames from other cameras. Specifically, during a sudden, short-term occlusion, the target object might be visible up to a certain point in the frame and then abruptly disappear mid-frame. If the object remains visible in other cameras, we can estimate the existence of the occluded object by comparing the current mapping entry with past mapping entries. For example, if an object suddenly disappears in Camera 1, Argus searches for a prior mapping entry containing the object located in the previous frame of Camera 1 and extracts the position of the object in other cameras. If corresponding bounding boxes are found in all other cameras, Argus performs object detection and identification on the other cameras. Where occlusion persists for an extended period, we employ periodic cache refreshing (detailed below). It is important to note that such occlusions are rare in practical settings, as objects move at varying speeds and cameras are often installed to monitor the target scene from a high vantage point (e.g., mounted on a traffic light as shown in Figure 12).

Periodic cache refreshing. To avoid error propagation in our association-based identification (due to occlusion as well as the failure of identification model inference), we limit the maximum number of consecutive skips and perform the identification task regardless of a matching mapping entry at a predefined interval (e.g., every 2 s). This interval controls the trade-off between efficiency and accuracy.

Time synchronisation. For spatial association, it is important for all video streams to be time synchronised. To this end, Argus periodically synchronises the camera clocks using the network time protocol (NTP).

5.1 Multi-Camera Dynamic Inspection

The key to maximizing the benefit from spatial association is to quickly find the objects that match the query, thereby (a) skipping identification tasks on other cameras through spatial association and (b) skipping identification of objects irrelevant to the query even on the same camera. To this end, we develop a method to dynamically arrange the order of cameras and bounding boxes to be examined.

Inter-camera dynamic inspection. The inspection order of the cameras heavily affects the identification efficiency (i.e., the total number of required identifications). Specifically, we find that searching first the cameras which most likely contain the target object improves the search efficiency. This is because we can leverage the bounding box location of the identified target object to aggressively skip identification on non-matching bounding boxes in the remaining cameras. For example, consider a case with three cameras (Cameras 1, 2, and 3). At a given timestamp, assume that all cameras detect the same number of vehicles, e.g., four, and a target object is captured by Cameras 1 and 2. If Camera 1 is inspected first, we can find the query object within four identifications and skip the identification operations for Cameras 2 and 3. However, if the inspection starts with Camera 3, we need to perform further inspections with Cameras 1 and 2 in case the target object is located outside Camera 3's FoV. Hence, eight identifications are required.

In addition to efficiency, the inspection order of the cameras also affects the identification accuracy because our approach relies on identification-based target matching from the first camera. Specifically, inspecting the camera where the target identification accuracy is expected to be the highest leads to the highest association accuracy in the remaining cameras. While the identification accuracy is affected by multiple attributes of the captured object (e.g., the detected object's size, pose, blur), we currently use the bounding box size as the primary indicator; we plan to extend the analysis to other attributes in future work. For example, in the case of Figure 7, we consider Camera 2 as the first camera to be inspected because the bounding box of the detected target object is the largest, i.e., it has the highest probability of being correctly identified.

Considering these two factors (efficiency and accuracy), at each time t, we calculate each camera C^i's priority as follows (higher value indicates higher priority):

\sum_{j=1}^{N^{i}_{t-1}} \min \, dist\big(bbox^{i}_{t,j}, B^{i}_{t-1}\big)    (3)

The workload distribution (§5.2) is then formulated as a min-max optimisation over the number of identification tasks n_j assigned to each camera C^j, accounting for the transmission cost TD and the batch processing cost BP:

\min_{n_1, n_2, \ldots, n_K} \max_{j} \big[ TD(C^i, C^j, n_j) + BP(C^j, n_j) \big]
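To illustrate the scheduling logic described above, the sketch below orders cameras by the size of their most likely target bounding box (the size heuristic used in §5.1) and enumerates per-camera identification counts to minimise the worst per-camera cost; the cost functions td and bp are assumed placeholders for profiled transmission and batch-processing models, and all names are our own rather than the actual Argus implementation.

```python
from itertools import product

def order_cameras(candidate_boxes):
    # Inspect first the camera whose candidate target box is largest,
    # i.e., the one most likely to be identified correctly (Section 5.1).
    # candidate_boxes: {camera_id: (width, height) or None}
    def area(cam):
        box = candidate_boxes[cam]
        return box[0] * box[1] if box else 0
    return sorted(candidate_boxes, key=area, reverse=True)

def distribute_ids(num_ids, cameras, td, bp):
    # Choose per-camera identification counts (n_1, ..., n_K) minimising
    # max_j [ td(src, dst_j, n_j) + bp(dst_j, n_j) ], mirroring the
    # min-max objective above. Exhaustive search is feasible because only
    # a handful of identifications are pending per timestamp (Section 6.8).
    src = cameras[0]                      # camera holding the cropped images
    best, best_cost = None, float("inf")
    for split in product(range(num_ids + 1), repeat=len(cameras)):
        if sum(split) != num_ids:
            continue
        cost = max(td(src, dst, n) + bp(dst, n) for dst, n in zip(cameras, split))
        if cost < best_cost:
            best, best_cost = split, cost
    return best, best_cost

# Hypothetical usage with toy cost models (milliseconds).
order = order_cameras({"cam1": (80, 60), "cam2": (120, 90), "cam3": None})
split, cost = distribute_ids(
    num_ids=4, cameras=order,
    td=lambda s, d, n: 0 if s == d else 5 * n,   # crop transmission cost
    bp=lambda d, n: 30 * n,                      # batch identification cost
)
print(order, split, cost)
```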
baseline methods and to enable an in-depth study of the impact of various system parameter values. When spatial association is used, we use the pre-generated mapping entries learned with 10% of the data in the dataset.
• CityFlowV2 [4] consists of video streams from five heterogeneous overlapping cameras at a road intersection. The cameras are located to cover the intersection from different sides of the road (Figure 3). Four videos are recorded at 1080p@10fps and one video is recorded at 720p@10fps with a fisheye lens. Each video stream is ≈3 minutes long and the ground truth data contains 95 unique vehicles.
• CAMPUS [20] consists of overlapping video streams recorded in four different scenes. We use the Garden1 scene, which consists of 4× 1080p@30fps videos capturing a garden and its perimeter (Figure 13). We resized the images to 720p as they show comparable object detection performance to the original 1080p at a lower cost. Each video is ≈100 s long and the ground truth contains 16 unique individuals. Since the dataset provides inaccurate ground truth labels and bounding boxes, we manually regenerated the ground truth for three targets (id 0, 2, 9).
• MMPTRACK [21] is composed of overlapping video streams recorded from 5 different scenes: cafe shop, industry safety, office, lobby, and retail. In total, there are 23 scene samples (3-8 samples per scene), and each sample is composed of 4-6 overlapping video streams capturing 6-8 people. Each video stream is 360p@15fps and ≈400 seconds long (in total 133k frames = 8,800 seconds). We use this dataset to evaluate the robustness of Argus in §6.7.

Queries. For queries, we randomly chose ten vehicles for CityFlowV2, three people for CAMPUS, and two people for MMPTRACK. In the in-depth analysis, we also examine performance with different numbers of queries.

Object detection and identification models. For object detection, we use YOLO-v5 [22]. For vehicle identification in CityFlowV2, we use the ResNet-101-based model [25] trained on the CityFlowV2-ReID dataset [4]. For person identification in CAMPUS and MMPTRACK, we trained the ResNet-50-based model using the dataset. Note that the performance of the re-id model is not the focus of this work and different models can be used. All models are implemented in PyTorch 1.7.1.

Metrics. To measure system resource costs, we evaluate the end-to-end latency and the number of identification model inferences. To measure tracking quality, we use two metrics that are widely used in multi-object tracking [41]: Multiple Object Tracking Precision (MOTP) and Multiple Object Tracking Accuracy (MOTA).
• End-to-end latency is the total latency for generating multi-camera, multi-target tracking results. Note that the latency includes all the operations required for the system, i.e., image acquisition, model inference, uploading the cropped images to other cameras, and cross-camera communication time.
• Number of IDs is defined as the total number of identification model inferences required across all cameras for each timestamp.
• MOTP quantifies how precisely the tracker estimates object positions. It is defined as \sum_{t,i} d_{t,i} / \sum_{t} c_t, where c_t is the number of matches in frame t and d_{t,i} is the overlap of the bounding box (IoU) of target i with the ground truth. For each frame, we compute the MOTP for each camera separately and report its average.
• MOTA measures the overall accuracy of both the detector and the tracker. We define it as 1 - \sum_{t} (FN_t + FP_t + MM_t) / \sum_{t} T_t, where t is the frame index and T_t is the number of target objects in frame t. FN, FP, and MM represent false-negative, false-positive and miss-match errors, respectively. Similarly, we calculate the MOTA for each camera and report the average across cameras.

Baselines. We evaluate Argus against the following state-of-the-art methods. The baselines perform all model operations on the camera where the corresponding image frame is generated.
• Conv-Track is the conventional pipeline of multi-camera, multi-target tracking (e.g., [14]), as shown in Figure 2. It identifies the query object on each camera separately and aggregates the identification results across multiple cameras.
• Spatula-Track adopts the camera-wise filtering approach proposed in Spatula [2] for object tracking. For each timestamp, it first filters out the cameras that do not contain the target objects and then performs the Conv-Track pipeline for the selected cameras. We use ground truth labels for correlation learning and camera filtering, assuming the ideal operation of Spatula [2].
• CrossRoI-Track adopts the RoI-wise filtering approach proposed in CrossRoI [3] for tracking. Offline, it learns the minimum-sized RoI that contains all objects at least once over the deployed cameras. At runtime, it performs the Conv-Track pipeline only for the masked RoI areas. We use the ground truth labels for the optimal training of the RoI mask, assuming the ideal operation of CrossRoI [3].

Hardware. For the hardware of smart cameras, we considered two platforms, Nvidia Jetson AGX and Jetson NX. Jetson AGX hosts an 8-core Nvidia Carmel Arm CPU, a 512-core Nvidia Volta GPU with 64 Tensor Cores, and 32 GB of memory. Jetson NX hosts a 6-core Nvidia Carmel Arm CPU, a Volta GPU with 384 CUDA cores and 48 Tensor Cores, and 8 GB of memory. We prototyped Argus on these platforms and measured performance; we used Jetson AGX for the CityFlowV2 and MMPTRACK datasets, and Jetson NX for the CAMPUS dataset. For the network configuration, we connected the Jetson devices with a Gigabit wired connection, which is commonly used for existing CCTV networks. It is important to note that, while we used the offline data traces from the three datasets for repetitive and comprehensive analysis, we implemented the end-to-end, distributed architecture of Argus on top of multiple Jetson devices and evaluated the resource metrics by monitoring the resource cost at runtime.
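As a concrete reading of the MOTP and MOTA definitions above, the following minimal Python sketch computes both metrics for a single camera from per-frame match and error counts; the input format is a hypothetical illustration, not the evaluation code used in this paper.

```python
def motp(iou_per_match, matches_per_frame):
    # MOTP = (sum of IoU over all matched targets and frames) /
    #        (sum over frames of the number of matches c_t)
    total_matches = sum(matches_per_frame)
    return sum(iou_per_match) / total_matches if total_matches else 0.0

def mota(false_negatives, false_positives, miss_matches, num_targets):
    # MOTA = 1 - (sum_t FN_t + FP_t + MM_t) / (sum_t T_t)
    errors = sum(fn + fp + mm for fn, fp, mm in
                 zip(false_negatives, false_positives, miss_matches))
    total_targets = sum(num_targets)
    return 1.0 - errors / total_targets if total_targets else 0.0

# Hypothetical three-frame example for one camera.
print(motp(iou_per_match=[0.7, 0.6, 0.8, 0.9], matches_per_frame=[2, 1, 1]))  # 0.75
print(mota(false_negatives=[0, 1, 0], false_positives=[0, 0, 1],
           miss_matches=[0, 0, 0], num_targets=[2, 2, 2]))                    # ~0.67
```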
Fig. 14: Overall performance on CityFlowV2. (a) Resource cost. (b) Tracking quality.
Fig. 15: Overall performance on CAMPUS. (a) Resource cost. (b) Tracking quality.

6.2 Overall Performance

Figures 14 and 15 show the overall performance on CityFlowV2 and CAMPUS, respectively. In Figures 14a and 15a, the bar chart represents the average end-to-end latency (Detect: object detection latency, ID: identification latency) and the line chart shows the average number of IDs. In Figures 14b and 15b, the bar and line charts represent the average MOTA and MOTP, respectively.

6.2.1 Resource Efficiency

Overall, Argus achieves significant resource savings by adopting the spatio-temporal association and workload distribution, while not compromising the tracking quality. We first examine the resource saving of Argus. In CityFlowV2, where five cameras are involved, Figure 14a shows that the average number of IDs decreases from 42.6 (Conv-Track) to 21.1 (Spatula-Track), 30.1 (CrossRoI-Track) and 11.6 (Argus). The end-to-end latency also decreases from 740 ms to 650 ms, 660 ms and 410 ms, respectively; Argus is 1.8×, 1.59× and 1.61× faster than Conv-Track, Spatula-Track and CrossRoI-Track, respectively, the latter two being the state-of-the-art multi-camera tracking solutions. We make several interesting observations. First, the latency does not decrease proportionally to the number of IDs because all baselines need to commonly perform object detection in every frame. However, even when object detection is taken into account, Argus significantly decreases the end-to-end latency by 49% by reducing the number of IDs by 73%, compared to Conv-Track. Second, both Spatula-Track and CrossRoI-Track significantly reduce the number of IDs by selectively using cameras and RoI areas, respectively. However, the reduction in end-to-end latency is not significant (about 10%). This is because the latency is tied to the longest execution time across all cameras due to the lack of distributed processing capability.

Figure 15a compares the resource costs for the CAMPUS dataset. The results show a similar pattern to the CityFlowV2 dataset, but the saving ratio of Argus is much higher. Argus reduces the average number of IDs by 7.13× (35.8 to 5.0) compared to Conv-Track and Spatula-Track, and by 4.86× (24.4 to 5.0) compared to CrossRoI-Track. The end-to-end latency also decreases by 1.72× (from 310 ms to 180 ms) and 1.43× (from 258 ms to 180 ms), respectively. The larger saving is mainly because the moving speed of the target objects (here, people in the CAMPUS dataset) is relatively slow compared to vehicles in the CityFlowV2 dataset. Therefore, there are fewer newly appearing objects (at the edge) and most of the identification tasks can be done by spatial and temporal association matching. Interestingly, Spatula-Track shows the same performance as Conv-Track, which is different from the CityFlowV2 case. This is because all target people are captured by all four cameras all the time and thus no cameras are filtered out. CrossRoI-Track reduces both latency and the number of IDs compared to Conv-Track, but its efficiency is still lower than Argus; for CrossRoI-Track, the number of IDs and latency are 24.9 and 419 ms, respectively.

6.2.2 Tracking Quality

We investigate how spatial and temporal association-aware identification affects tracking quality. Figure 14b and Figure 15b show the MOTP and MOTA on CityFlowV2 and CAMPUS, respectively. Overall, Argus achieves comparable tracking quality, even with significant resource savings. Interestingly, in CityFlowV2, Argus increases both MOTA and MOTP compared to Conv-Track; MOTA increases from 0.88 to 0.91 and MOTP increases from 0.60 to 0.67. This is because several small cropped vehicles are identified by associating with their position from other cameras, whereas they failed to be identified by matching their appearance features from the identification model in the baselines. In CAMPUS, Argus shows almost the same tracking quality as Conv-Track, but MOTA drops slightly from 0.85 (Conv-Track) to 0.82. There were some cases where a target person was suddenly occluded by another person in some cameras. However, Argus identifies the occluded person in the next frame using the robustness techniques in §4.4, thereby being able to minimize the error.

We investigate the benefit of cross-camera collaboration in more detail. In CityFlowV2, Spatula-Track also increases the tracking accuracy (MOTA) compared to Conv-Track by filtering out query-irrelevant cameras, thereby avoiding false-positive identifications. However, in CAMPUS, its quality is identical to Conv-Track as no cameras are filtered out. Unlike Spatula-Track, CrossRoI-Track degrades the tracking quality on both MOTA and MOTP; for example, in CAMPUS, CrossRoI-Track shows a MOTA of 0.55, while the other baselines, including Argus, show a MOTA of 0.85. This is because, for object detection and identification, CrossRoI uses the smallest RoI across all cameras in which the target objects appear at least once. Therefore, these tasks sometimes fail due to the small size of the objects in the generated RoI areas.

6.3 Performance Breakdown: Benefit of On-Camera Distributed Processing

We developed two variants of Argus, namely Spatial and Spa-Temp, in which we apply each enhancement to Conv-Track in turn. For identification optimization, Spatial uses spatial association and Spa-Temp uses the spatio-temporal association. Both of them have the capability of dynamic inspection of cameras and bounding boxes (§5.1), but do not have distributed processing (§5.2).

Figure 16a shows the resource cost in CityFlowV2. The spatial association reduces the number of IDs from 40.5
TABLE 3: Performance of Argus in the real-world case study (parking lot). Resource efficiency: latency 0.21 s, number of IDs 3.2. Tracking quality: MOTP 0.71, MOTA 0.95.

6.8 Real-world Case Study & System Overhead

We conducted a supplementary experiment to investigate both the performance and the operational characteristics of Argus's runtime system within a practical deployment scenario. To achieve this objective, we installed four cameras and four Jetson AGX boards, with consent, in a parking lot of the institute, employing them to record videos at a resolution of 1080p with a frame rate of 10 frames per second. The parking lot selected for this study spans an approximate area of 50 metres by 25 metres. To ensure comprehensive coverage, the cameras were positioned at the corners of the parking lot at a height of 3 metres; each AGX board is connected to its camera via an Ethernet cable and placed on the ground. Figure 23 shows snapshots from the four cameras. Each video stream has an approximate duration of an hour and the ground truth data contains 60 vehicles in total. Since the target objects are vehicles, we used the same object detection and identification models as for the CityFlowV2 dataset.

Table 3 shows Argus's overall performance in the real-world case study. It is important to emphasize that we did not conduct a comparative study due to the inability to guarantee consistent behaviours across repetitive experiments in a real-world deployment. Moreover, a thorough analysis against the baseline methods has already been reported in the preceding subsections. When contrasting the results obtained from the parking lot experiment with the CityFlowV2 dataset, it is interesting to note that the parking lot exhibited marginally superior performance in both resource efficiency and tracking quality; we did not compare with the CAMPUS dataset due to the discrepancy in target objects and their respective characteristics. For instance, the average number of identification tasks in the parking lot is 3.2, while the CityFlowV2 dataset showed a higher figure of 10.7. This difference is interesting, especially given that the average number of vehicles captured per video frame in the parking lot exceeded the count of vehicles in the CityFlowV2 dataset. We conjecture this is primarily due to the largely stationary nature of vehicles in the parking lot, allowing the benefit of our spatio-temporal association to be maximized. Similarly, the tracking quality in the parking lot is higher than that in the CityFlowV2 dataset. The MOTP values in the parking lot and the CityFlowV2 dataset were 0.71 and 0.63, respectively. We attribute this to the relatively shorter distance between the camera and the vehicles in the parking lot, enabling the capture of vehicles at a larger scale.

We delve deeper into the system overhead of Argus with this deployment setup. Aside from object detection and identification, the principal operations of the Argus system encompass two elements: (1) mapping-entry matching and (2) workload distribution decision-making. However, according to our measurements derived from the real-world case study, the overhead associated with both of these operations is negligible, quantified as less than a few milliseconds. This minimal overhead can be attributed to our efficient management of mapping entries via a hash table for the first operation. Additionally, for the second operation, the system only needs to consider a relatively small number of cases (typically fewer than five identification operations) when making distribution decisions. This streamlined approach contributes to the overall efficiency and effectiveness of the Argus system.

6.9 Micro-benchmark

We perform a micro-benchmark to better understand the resource characteristics of model inference on smart cameras. Table 4 shows the latency of the vision models we used on Jetson NX and AGX; we report the detection latency at different image sizes. While the processing capability of smart cameras is still limited compared to the cloud environment, performance can be optimized by applying the right configurations depending on the requirement, e.g., 720p images with people tracking on Jetson NX. We also showed that Argus can further optimize the latency (and corresponding throughput) by leveraging the spatial and temporal association.
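To illustrate the mapping-entry management behind these overhead numbers (the hash-table lookup of §6.8 combined with the bounded consecutive skips of §4.4), the following Python sketch keeps cached identities keyed by a discretised bounding-box location; the keying scheme, class name and threshold are our own assumptions rather than the actual Argus data structures.

```python
class AssociationCache:
    """Hash-table-backed mapping entries: identities resolved through
    spatio-temporal association are reused until the bounded number of
    consecutive skips is reached, after which the identification model
    is re-run to avoid error propagation (periodic cache refreshing)."""

    def __init__(self, max_skips=20):              # e.g., roughly 2 s at 10 fps
        self.entries = {}                          # (camera_id, cell) -> [identity, skips]
        self.max_skips = max_skips

    def lookup(self, camera_id, cell):
        # Return a cached identity, or None if the identification model
        # must be executed (no matching entry, or refresh interval reached).
        entry = self.entries.get((camera_id, cell))
        if entry is None:
            return None
        if entry[1] >= self.max_skips:             # periodic cache refreshing
            del self.entries[(camera_id, cell)]
            return None
        entry[1] += 1
        return entry[0]

    def update(self, camera_id, cell, identity):
        # Store the identity produced by an actual identification inference.
        self.entries[(camera_id, cell)] = [identity, 0]

# Hypothetical usage: a hit is served from the cache, a miss triggers inference.
cache = AssociationCache(max_skips=3)
cache.update("cam1", (4, 7), "vehicle_42")
print(cache.lookup("cam1", (4, 7)))   # 'vehicle_42' (skip identification)
print(cache.lookup("cam2", (4, 7)))   # None (run the identification model)
```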
7 RELATED WORK

7.1 Cross-Camera Collaboration

7.1.1 Multi-view Tracking using Camera Geometry

Camera geometry, also referred to as the geometry of multiple views, has been studied for multiple decades to enable accurate tracking of objects from different camera views. It deals with the mathematical relationships between 3D world points and their 2D projections onto the image plane [27]. By understanding these relationships, the 3D structure of a scene, object, or person can be recovered from multiple 2D views, which enables the tracking of objects even when they move out of one camera's FoV and into another [42]. Camera geometry has been applied in various fields, such as robotics, computer vision, and motion capture, where the use of multiple synchronized cameras with overlapping FoVs can improve the tracking accuracy and robustness of the system [43].

The foundation of multi-view tracking is the estimation of the fundamental matrix, which encodes the geometric relationship between the views of two cameras [27]. This matrix can be used to compute the epipolar geometry, which describes the relationship between corresponding points in the two images and can be utilised to find the corresponding point in the other view when a point is detected in one view [42]. By using the fundamental matrix, triangulation techniques can be employed to estimate the 3D position of the tracked object in the scene [27]. Also, bundle adjustment, a non-linear optimisation technique, has been used to refine the camera parameters and the 3D structure of the scene, leading to a more accurate estimation of the object's position [44].

Despite the advantages of camera geometry in enabling tracking from multi-camera views, there are several limitations in its deployment. One major challenge is the sensitivity to camera calibration errors, which can lead to inaccurate 3D reconstruction and subsequently impact the tracking performance [27]. The calibration process requires the precise estimation of intrinsic camera parameters, such as focal length and lens distortion, and extrinsic parameters, like camera pose and orientation, which can be difficult to obtain in practical applications [42]. When using cameras with pan-tilt-zoom (PTZ) capabilities, the camera geometry needs to be recalculated each time the camera view changes, adding to the complexity and computational load of the tracking process. Similarly, the calibration process should also be repeated each time there are changes in the camera set and topology, such as the addition of a new camera, the failure of an existing camera, or the change of a camera's position in a retail store. While Argus also needs to adapt to these dynamics, it can do so more easily, simply by adjusting or regenerating the spatio-temporal association. Moreover, in multi-view tracking using camera geometry, occlusions and ambiguities in object appearances can pose significant challenges in identifying corresponding points across different views, leading to erroneous tracking [43].

Significantly, calibration may be impossible if video analytics are detached from camera providers. Current video analytics are restricted in leveraging the potential of deployed cameras due to hard-coded analytics capabilities from tightly coupled hardware and software, and isolated camera deployments from various service providers. We propose a paradigm shift towards software-defined video analytics, where the analytics logic is decoupled from deployed cameras. This allows for dynamic composition and execution of analytic services on demand, without altering or accessing the hardware. For instance, individual shops may wish to run different analytic services using camera streams provided by shopping malls. However, camera parameters may be accessible only to the camera provider (e.g., the owner of the shopping mall) and can change without notice depending on the provider's requirements. In contrast, Argus, relying solely on camera streams, can still be implemented and supported on the video analytics side.

7.1.2 Systems for Cross-Camera Collaboration

Enriched video analytics. One direction for cross-camera collaboration is to provide enriched and combined video analytics from different angles and areas of multiple cameras [45], [46], [47]. Liu et al. developed Caesar [45], a system that detects cross-camera complex activities, e.g., a person walking in one camera and later talking to another person in another camera, by designing an abstraction for these activities and combining DNN-based activity detection from non-overlapping cameras. Li et al. presented a camera collaboration system [46] that performs active object tracking by exploiting the intrinsic relationship between camera positions. Jha et al. developed Visage [47], which enables 3D image analytics from multiple video streams from drones. Our work can serve as an underlying on-camera framework for these works, providing multi-camera, multi-target tracking as a primitive task on distributed smart cameras.

Resource efficiency. Another direction for cross-camera collaboration is to reduce the computational and communication costs of multiple video streams by exploiting their spatial and temporal correlation [2], [3], [19]. Jain et al. proposed Spatula [2], [19], a cross-camera collaboration system that targets wide-area camera networks with non-overlapping cameras and limits the amount of video data and corresponding communication to be analysed by identifying only those cameras and frames that are likely to contain the target objects. REV [1] also aims at reducing the number of cameras processed by incrementally searching the cameras within the overlapping group and opportunistically skipping processing the rest as soon as the target has been detected. CrossRoI [3] and Polly [48] leverage spatial correlation to extract the minimum-sized RoI from overlapping cameras and reduce processing and transmission costs by filtering out unmasked RoIs. All such works share the same high-level goal as Argus in that they leverage spatio-temporal correlation from multiple cameras, but Argus differs in several aspects, as shown in Table 1.

Distributed processing. There have been several attempts to distribute video analytics workloads from large-scale video data to distributed cameras [39], [49]. VideoEdge [49]
optimises the trade-off between resources and accuracy by partitioning the analytics pipeline into hierarchical clusters (cameras, private clusters, public clouds). Distream [39] adaptively balances workloads across smart cameras. Although this work provided a foundation for the development of distributed video analytics systems, it mainly focused on the video analytics pipeline with one camera as the main workload. In this work, we identify that multi-camera, multi-target tracking is a primary underlying task for overlapping camera environments, and propose an on-camera distributed processing strategy tailored to it.

7.2 Resource-Efficient Video Analytics Systems

On-device processing. Many video analytics systems have been proposed to efficiently process a large volume of video data on low-end cameras, e.g., by adopting on-camera frame filtering [50], [51], [52], pipeline adaptation [53], [54], edge-cloud collaborative inference [38], [55], [56], [57], [58], and RoI extraction [59], [60], [61], [62], [63]. On-camera frame filtering techniques filter out the computationally intensive execution of vision models in the early stages, e.g., by dynamically adapting filtering decisions [51] and leveraging cheap CNN classifiers [50]. Yi et al. presented EagleEye [54], a pipeline adaptation system for person identification that selectively uses different face detection models depending on the quality of face images. MARLIN [53] has been proposed to selectively execute a deep neural network for energy-efficient object tracking.

Computation offloading. Several attempts have been made to dynamically adjust video bitrate to optimise network bandwidth consumption and enable low-latency offloading [62], [64], [65], to optimise the video streaming protocol [63], and to design DNN-aware video compression methods [66], [67]. The other direction for efficient processing is DNN inference scheduling for multiple video streams on a GPU cluster [17], [18], [68], DNN merging for memory optimisation [69], privacy-aware video analytics [70], [71], and resource-efficient continual learning [72], [73].

While these works manage to achieve remarkable performance improvements, they usually focus on a single camera (or its server). In contrast to these works, we target an environment where multiple cameras are installed in close proximity, and focus on optimising cross-camera operations by leveraging the spatio-temporal association of objects. Argus can further improve system-wide resource efficiency by applying these techniques.

8 DISCUSSION AND FUTURE WORKS

Why on-device processing on distributed smart cameras? The cost of video analytics is becoming a huge problem due to the enormous amount of video data. The authors of [74] studied a six-month deployment of over 1,000 cameras at Peking University, China, and reported that the cameras produced over 3 million hours of videos (5.4 PB). If we assume a simple application using the ResNet-18 model at 30 frames per second continuously, the estimated ML operating expenses (OpEx) for six months would be 3.83 million USD if ML inference is executed on the Amazon EC2 server (see the footnote below); it will be much higher when the network costs for 5.4 PB of video data are added. The bigger problem is that, even with excessive cameras, most of the video stream is never used. The study [74] further showed that less than 0.005% of the video data is retrieved and used by less than 2% of the cameras.

To address this problem, on-camera AI processing is becoming increasingly popular [39], [74], [75], enabled by two recent technology trends. First, low-cost, low-power and programmable on-board AI accelerators are becoming available [5], such as Nvidia Jetson, Google TPU and Analog MAX78000. Second, lightweight, accurate and robust embedded-ML models are emerging [12], [13]. Most importantly, on-device processing is preferred as the privacy-sensitive raw image data does not need to be transferred to the cloud.

Incorporation with non-overlapping camera collaboration. Argus currently targets cross-camera collaboration within a closed set of cameras. However, analytics applications might want to track objects in a wide area where large camera networks, including non-overlapping cameras, are installed, e.g., suspect monitoring in a large shopping mall or traffic surveillance in an urban city. Our ultimate goal is to develop a system that supports seamless and efficient tracking across overlapping and non-overlapping cameras by adopting the solutions for non-overlapping cameras [2], [19].

Support for diverse coordination topologies. For cross-camera coordination, we assume a star topology where the most powerful camera becomes the head of a group and schedules multi-camera, multi-target tracking operations for all cameras, while the other cameras become group members that follow the head's decision. We believe that this decision is practical because the coordination overhead is marginal, as shown in §6.9, but more sophisticated coordination would be necessary if more cameras are involved.

Further optimization by splitting AI models. Argus treats AI models as a black box, thereby taking their full execution as a primitive task for distributed processing. Splitting deep neural networks across distributed cameras, e.g., Neurosurgeon [57], would allow further optimisation if we can have access to the weights of the pre-trained models. We leave it as future work.

Cross-camera communication channel. For the communication channel between cameras, we consider a Gigabit wired connection, which is already commonly used for existing CCTV networks, e.g., at an intersection [4] and on a campus [20]. Considering that overlapping cameras are deployed in proximity to each other, such an assumption would still be valid in other environments. However, when the communication channel is constrained, e.g., over cellular networks, the network overhead may dominate and the latency improvement achieved by multi-camera parallel processing could be less than expected. We leave the detailed analysis as future work.

4. 100K inferences of ResNet-18 cost 0.82 USD and the total number of inferences is 466,560,000,000 (= 30 fps × 60 sec × 60 min × 24 hrs × 180 days × 1,000 cameras).
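As a quick check of the footnote's arithmetic, the back-of-the-envelope OpEx estimate can be reproduced in a few lines of Python (the per-100K-inference price is taken directly from the footnote):

```python
# Reproduce the OpEx estimate from the footnote above.
fps, secs, mins, hours, days, cameras = 30, 60, 60, 24, 180, 1_000
total_inferences = fps * secs * mins * hours * days * cameras
cost_per_100k_usd = 0.82
opex_usd = total_inferences / 100_000 * cost_per_100k_usd
print(total_inferences)          # 466560000000
print(round(opex_usd / 1e6, 2))  # ~3.83 (million USD)
```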
[38] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, "Spinn: synergistic progressive inference of neural networks over device and cloud," in Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, 2020, pp. 1–15.
[39] X. Zeng, B. Fang, H. Shen, and M. Zhang, "Distream: scaling live video analytics with workload-adaptive distributed edge intelligence," in Proceedings of the 18th Conference on Embedded Networked Sensor Systems, 2020, pp. 409–421.
[40] J. S. Jeong, J. Lee, D. Kim, C. Jeon, C. Jeong, Y. Lee, and B.-G. Chun, "Band: coordinated multi-DNN inference on heterogeneous mobile processors," in Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services, 2022, pp. 235–247.
[41] K. Bernardin, A. Elbs, and R. Stiefelhagen, "Multiple object tracking performance metrics and evaluation in a smart room environment," in Sixth IEEE International Workshop on Visual Surveillance, in conjunction with ECCV, vol. 90, no. 91. Citeseer, 2006.
[42] R. Szeliski, Computer Vision: Algorithms and Applications. Springer Nature, 2022.
[43] K. Kanatani, Geometric Computation for Machine Vision. Oxford University Press, Inc., 1993.
[44] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment—a modern synthesis," in Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms, Corfu, Greece, September 21–22, 1999, Proceedings. Springer, 2000, pp. 298–372.
[45] X. Liu, P. Ghosh, O. Ulutan, B. Manjunath, K. Chan, and R. Govindan, "Caesar: cross-camera complex activity recognition," in Proceedings of the 17th Conference on Embedded Networked Sensor Systems, 2019, pp. 232–244.
[46] J. Li, J. Xu, F. Zhong, X. Kong, Y. Qiao, and Y. Wang, "Pose-assisted multi-camera collaboration for active object tracking," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 759–766.
[47] S. Jha, Y. Li, S. Noghabi, V. Ranganathan, P. Kumar, A. Nelson, M. Toelle, S. Sinha, R. Chandra, and A. Badam, "Visage: Enabling timely analytics for drone imagery," in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 789–803.
[48] J. Li, L. Liu, H. Xu, S. Wu, and C. J. Xue, "Cross-camera inference on the constrained edge," in Proc. IEEE INFOCOM, 2023.
[49] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose, "Videoedge: Processing camera streams using hierarchical clusters," in 2018 IEEE/ACM Symposium on Edge Computing (SEC). IEEE, 2018, pp. 115–131.
[50] K. Hsieh, G. Ananthanarayanan, P. Bodik, S. Venkataraman, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu, "Focus: Querying large video datasets with low latency and low cost," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 269–286.
[51] Y. Li, A. Padmanabhan, P. Zhao, Y. Wang, G. H. Xu, and R. Netravali, "Reducto: On-camera filtering for resource-efficient real-time video analytics," in Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, 2020, pp. 359–376.
[52] M. Xu, T. Xu, Y. Liu, and F. X. Lin, "Video analytics with zero-streaming cameras," in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 459–472.
[53] K. Apicharttrisorn, X. Ran, J. Chen, S. V. Krishnamurthy, and A. K. Roy-Chowdhury, "Frugal following: Power thrifty object detection and tracking for mobile augmented reality," in Proceedings of the 17th Conference on Embedded Networked Sensor Systems, 2019, pp. 96–109.
[54] J. Yi, S. Choi, and Y. Lee, "Eagleeye: Wearable camera-based person identification in crowded urban spaces," in Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, 2020, pp. 1–14.
[55] Y. Wang, W. Wang, D. Liu, X. Jin, J. Jiang, and K. Chen, "Enabling edge-cloud video analytics for robotics applications," IEEE Transactions on Cloud Computing, 2022.
[56] M. Almeida, S. Laskaridis, S. I. Venieris, I. Leontiadis, and N. D. Lane, "Dyno: Dynamic onloading of deep neural networks from cloud to device," ACM Transactions on Embedded Computing Systems, vol. 21, no. 6, pp. 1–24, 2022.
[57] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
[58] J. Yi, S. Kim, J. Kim, and S. Choi, "Supremo: Cloud-assisted low-latency super-resolution in mobile devices," IEEE Transactions on Mobile Computing, 2020.
[59] W. Zhang, Z. He, L. Liu, Z. Jia, Y. Liu, M. Gruteser, D. Raychaudhuri, and Y. Zhang, "Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading," in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 201–214.
[60] S. Jiang, Z. Lin, Y. Li, Y. Shu, and Y. Liu, "Flexible high-resolution object detection on edge devices with tunable latency," in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, 2021, pp. 559–572.
[61] K. Yang, J. Yi, K. Lee, and Y. Lee, "Flexpatch: Fast and accurate object detection for on-device high-resolution live video analytics," in IEEE INFOCOM 2022 - IEEE Conference on Computer Communications. IEEE, 2022, pp. 1898–1907.
[62] L. Liu, H. Li, and M. Gruteser, "Edge assisted real-time object detection for mobile augmented reality," in The 25th Annual International Conference on Mobile Computing and Networking, 2019, pp. 1–16.
[63] K. Du, A. Pervaiz, X. Yuan, A. Chowdhery, Q. Zhang, H. Hoffmann, and J. Jiang, "Server-driven video streaming for deep learning inference," in Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, 2020, pp. 557–570.
[64] B. Zhang, X. Jin, S. Ratnasamy, J. Wawrzynek, and E. A. Lee, "Awstream: Adaptive wide-area streaming analytics," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018, pp. 236–252.
[65] R. Lu, C. Hu, D. Wang, and J. Zhang, "Gemini: a real-time video analytics system with dual computing resource control," in 2022 IEEE/ACM 7th Symposium on Edge Computing (SEC). IEEE, 2022, pp. 162–174.
[66] X. Xie and K.-H. Kim, "Source compression with bounded DNN perception loss for IoT edge computer vision," in The 25th Annual International Conference on Mobile Computing and Networking, 2019, pp. 1–16.
[67] K. Du, Q. Zhang, A. Arapin, H. Wang, Z. Xia, and J. Jiang, "Accmpeg: Optimizing video encoding for accurate video analytics," Proceedings of Machine Learning and Systems, vol. 4, pp. 450–466, 2022.
[68] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica, "Chameleon: scalable adaptation of video analytics," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018, pp. 253–266.
[69] A. Padmanabhan, N. Agarwal, A. Iyer, G. Ananthanarayanan, Y. Shu, N. Karianakis, G. H. Xu, and R. Netravali, "Gemel: Model merging for memory-efficient, real-time video analytics at the edge," in USENIX NSDI, April 2023.
[70] F. Cangialosi, N. Agarwal, V. Arun, S. Narayana, A. Sarwate, and R. Netravali, "Privid: Practical, privacy-preserving video analytics queries," in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 209–228.
[71] R. Lu, S. Shi, D. Wang, C. Hu, and B. Zhang, "Preva: Protecting inference privacy through policy-based video-frame transformation," in 2022 IEEE/ACM 7th Symposium on Edge Computing (SEC). IEEE, 2022, pp. 175–188.
[72] R. Bhardwaj, Z. Xia, G. Ananthanarayanan, J. Jiang, Y. Shu, N. Karianakis, K. Hsieh, P. Bahl, and I. Stoica, "Ekya: Continuous learning of video analytics models on edge compute servers," in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 119–135.
[73] K. Mehrdad, G. Ananthanarayanan, K. Hsieh, J. Jiang, R. Netravali, Y. Shu, M. Alizadeh, and V. Bahl, "Recl: Responsive resource-efficient continuous learning for video analytics," in USENIX NSDI, April 2023.
[74] M. Xu, T. Xu, Y. Liu, and F. X. Lin, "Video analytics with zero-streaming cameras," in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 459–472.
[75] M. Xu, Y. Liu, and X. Liu, "A case for camera-as-a-service," IEEE Pervasive Computing, vol. 20, no. 2, pp. 9–17, 2021.