1. Introduction
Although a massive shift from on-premise software to cloud computing has taken place, edge computing, as a decentralised complement, has extended cloud processing to time-sensitive applications. The smart city, a typical large-scale system, is gradually adopting IoT and edge computing, and object tracking, one of the most important applications in a smart city, already relies on them for its communication and computation infrastructure. Consider object tracking for autonomous driving as an example: for self-driving cars to become a reality, they need to react to external factors in real time. Moving objects such as cars and pedestrians introduce considerable uncertainty into the self-driving system. If a self-driving car can track the trajectories of moving objects on the road, it can perceive both normal and sudden movements in its surroundings. Once an abrupt lane change is detected, the car must slow down immediately; if a pedestrian steps out in front of the car, it must stop at once. IoT and edge computing benefit this scenario in the following ways:
Low Latency. The computation is performed directly on the IoT edge devices, so there is no network latency for pushing and fetching data, and the system responds immediately once sensor information is acquired.
Lower Network Bandwidth. Processing data at the edge reduces the amount of raw data pushed over the network. For example, Netflix distributes local caches in many cities around the globe to reduce the overall network transfer cost and deliver smoother video streaming.
Privacy. In some cases, data should stay close to the users, with only a minimal amount sent to the cloud, e.g., just enough to enable remote control, which also improves the user experience.
Low Cost. Reducing unnecessary centralised computation and communication resources directly lowers the running cost.
Besides the above example, in a smart parking facility a tracking system can obtain the trajectory of every vehicle and use the information to guide guests or detect abnormal driving behaviour. Unmanned Aerial Vehicles (UAVs) can follow their owners automatically, and cameras can record a person's movement. In the IoT and edge computing context, however, the high computing power and memory demands of traditional computer vision techniques severely hinder large-scale object tracking. In this paper, we focus on lightweight object tracking that requires less computation, less memory and less data transmission, which calls for a carefully designed algorithm and implementation.
There are two major problems with existing tracking algorithms. The first is tracking accuracy and robustness. Visual object tracking is a challenging task, especially under difficult tracking conditions, e.g., occlusion, object deformation and background clutter, and the ability to handle these difficulties directly determines the accuracy and robustness of a tracking algorithm. Recently, the adoption of discriminative learning methods in visual tracking has greatly improved both tracking accuracy and speed. These methods usually train a classifier online on the first frame and then search among candidate patches using the trained classifier in the following frames. In 2010, Bolme et al. [1] exploited the properties of circulant matrices to train a ridge regression model as the classifier, so that the correlation response can be calculated efficiently in the frequency domain; the proposed tracker, as the first discriminative correlation filter (DCF) method, enjoys low computational complexity. However, no feature extraction is performed in [1], so the tracking ability is limited by the discriminative power of the features. Henriques et al. [2,3] then introduced Histogram of Oriented Gradients (HOG) [4] features into DCF methods and improved their accuracy and robustness. More inspiring improvements to DCF trackers came from the works of Danelljan et al. [5,6] and Galoogahi et al. [7,8], which aim at mitigating the boundary effects inherent in the circulant matrix assumption of DCF trackers. In 2017, Galoogahi et al. [8] proposed to solve a zero-aliasing MOSSE filter with the ADMM method [9]. The resulting tracker, called BACF, achieves relatively high performance on benchmarks and runs quite fast. However, BACF still cannot overcome difficult tracking conditions such as occlusion and object deformation. In our approach, we take BACF as the baseline tracker and design concise frameworks to detect the tracking status and thus enhance tracking accuracy efficiently.
Next, we turn to another issue that object tracking systems face in the edge computing context. An obvious way to increase tracking accuracy and robustness is to adopt more discriminative image features such as deep-learning features. Deep convolutional neural networks (CNNs) have proven highly effective on several challenging vision tasks [10,11,12], e.g., object detection and face recognition. Compared with hand-crafted features such as HOG and Color Naming [13], CNN features significantly improve robustness against target deformation, occlusion and out-of-plane rotation. However, extracting CNN features from a deep convolutional network requires expensive computation, and CNN features are typically high-dimensional; these properties ruin the computational efficiency of the DCF method. Methods such as [14] run at only 5 fps on a quad-core CPU at about 4 GHz. In 2016, Bertinetto et al. [15,16] first employed a Siamese network, called Siamese-fc, to perform feature extraction and classification jointly. As an end-to-end CNN architecture, it achieves relatively high performance owing to its one-shot architecture and Graphics Processing Unit (GPU) acceleration. Although Siamese-fc runs at around 86 fps on a GPU, it requires dense parallel computation and thus consumes considerable power. Besides, CNN methods require large memory, usually above 10 Megabytes, to store the network model. Hence, in terms of power consumption and memory usage, CNN-based tracking algorithms are not suitable for IoT edge applications.
1.1. Motivations
Smart city applications tend to adopt IoT and edge computing broadly for distributed data collection, communication and processing. However, front-end sensors and devices are usually far less capable than the computing units in large data centres, so in practice engineers must work within limited embedded computing capacity and limited memory. Visual object tracking, an important element of the smart city, is a promising application area. The crucial point in designing tracking systems for the IoT edge is to keep a balance between tracking accuracy and computational complexity. After carefully reviewing the recent development of object tracking, we note that although DCF-based trackers can be implemented quite efficiently, the tracking accuracy of existing methods is not good enough for widespread use, while CNN-based methods, though more accurate and robust, require high computational complexity and large memory space. Hence, we propose a tracking algorithm with region proposals and tracking status detection schemes. Thanks to the proposed frameworks, our method can overcome difficult tracking conditions such as occlusion, rotation and deformation and thus dramatically improves tracking accuracy, while remaining suitable for IoT devices with limited computing capacity and memory space.
1.2. Contributions
This paper offers the following contributions:
- (1)
We demonstrate a visual object tracking system for smart city applications using IoT and edge computation with limited computing capacity and memory space. It fits most low-power IoT devices, e.g., Xilinx SoC platforms and ARM devices, and hence offers flexibility and energy conservation.
- (2)
We introduce a novel region proposal correlation filter algorithm for lightweight tracking. Owing to the concise region proposal scheme and feature combination strategy, the algorithm runs at around 40 fps on edge devices and requires only 157 KB of on-chip memory. Besides, a Response Confidence Level (RCL) is proposed to detect the tracking status and guide the model update.
- (3)
Our embedded tracking system with the proposed algorithm surpasses existing IoT tracking systems in accuracy and robustness. It can precisely track targets even when difficult tracking conditions, e.g., occlusion, deformation or background clutter, occur.
The remainder of the paper is organized as follows. In Section 2, we briefly summarize existing tracking systems and their drawbacks, discuss the theory of the discriminative correlation filter, an efficient tracking paradigm, and introduce related tracking methods based on region proposals. In Section 3, we first present our object tracking architecture using IoT and edge computing and then illustrate the proposed tracking algorithm in detail: after explaining the zero-aliasing filter that addresses the boundary effect of DCF-based trackers, we demonstrate our approach of extracting region proposals from the correlation response, and then introduce a robust criterion for judging the tracking status based on the response's peak value, PSR and number of proposals. Section 4 covers the experiments and results: the implementation details of our tracking system in the IoT context are given first, and we then evaluate the tracking performance of our algorithm against related trackers on two benchmarks; the results show the computational efficiency of the proposed frameworks on the DCF-based tracker, and that our tracker requires much less memory while achieving comparable tracking accuracy and robustness. The last section summarizes the paper and forecasts future prospects for object tracking systems using IoT and edge computing.
3. Tracking System in IoT and Edge Computing Context
An overview of our system design is shown in Figure 1. The idea is to use front-end cameras that provide an interface to fetch video, and a local server that connects to those cameras, processes their images with machine learning and computer vision algorithms, and then sends the processed data or labels to the cloud, so that we can monitor the scene and treat the cameras as sensors that know the content of the videos. Such a system can track people or vehicles on the street, raise an alert when something alarming happens in a public space, or, with a custom model, monitor whether only authorized personnel move around a site. There are many application cases for the object tracking system.
We use pan-tilt-zoom (PTZ) camera platforms as front-end cameras and a Zynq-7000 development board as the local server (we also test on laptops), which runs a pre-trained support vector machine (SVM) model for classifying and detecting objects in images. Afterwards, the tracking algorithm tracks the detected object and analyzes the trajectory of the target. The trajectory is fed back to the camera platform and used to control the servo system so that the camera follows the tracking target; in this way, the system keeps an eye on the target. The classified data is sent to the cloud securely using Cloud IoT Core, leveraging its gateway feature so that the local server can act on behalf of the camera modules when sending data to the cloud. The data is then processed in an event-based way using Firebase Cloud Functions, which store the data on Firebase and make it available to internet-connected users through a web interface deployed on Firebase Hosting. The local server also provides a local web interface for monitoring the cameras. The project architecture contains five parts, as shown in Figure 2:
- Device listener: searches for cameras using mDNS, maintains a list of devices on the local network and emits events when devices come online or go offline.
- Tracker: locates an object and tracks it using the proposed tracking algorithm.
- Cloud gateway: handles all communication and authentication with Cloud IoT Core and also acts as a Cloud IoT Core Gateway.
- Web server: serves a web UI and provides a real-time engine to sync data with the browser using socket.io.
- EdgeServer: uses all of the above components, fetching images from the active devices, passing those images through the tracker, then sending data to Cloud IoT Core and providing the data to the local web interface.
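A minimal sketch of how these five components could be wired together in the EdgeServer main loop is given below. The class and method names (listener, tracker, gateway, web UI and their calls) are placeholders for illustration, not the actual implementation.

```python
# Sketch of the EdgeServer main loop; component classes are assumed placeholders.
import time

class EdgeServer:
    def __init__(self, listener, tracker, gateway, web_ui):
        self.listener = listener    # discovers cameras via mDNS, tracks online/offline events
        self.tracker = tracker      # SVM detector + region proposal correlation filter
        self.gateway = gateway      # authenticates and forwards data to Cloud IoT Core
        self.web_ui = web_ui        # local web interface (e.g., socket.io server)

    def run(self):
        while True:
            for camera in self.listener.active_devices():
                frame = camera.grab_frame()                      # fetch image from front-end camera
                result = self.tracker.process(frame)             # detect on first frame, then track
                camera.steer(result.trajectory)                  # feed trajectory back to PTZ servos
                self.gateway.publish(camera.id, result.labels)   # send compressed labels only
                self.web_ui.broadcast(camera.id, frame, result)  # local monitoring UI
            time.sleep(0.01)
```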
In order to keep the target in view, the tracking algorithm plays an important role in the overall system. In this paper, we focus on a tracking algorithm suited to the IoT and edge computing context. Our algorithm is based on the BACF tracker, which aims at alleviating the boundary effect of DCF trackers. In Section 3.1, we first explain the principle of BACF. Section 3.2 then describes our region proposal framework based on DCF trackers. In Section 3.3, an innovative criterion to detect the tracking status is proposed based on the HOG and color histogram responses. Finally, the strategy employed to combine multiple cues in our implementation is presented in Section 3.4.
3.1. Baseline Tracker: BACF
In order to transfer the objective function of Equation (1) to the frequency domain, a periodic extension of the image patch is assumed to ensure the applicability of the DFT. Due to this periodic assumption, a shifted image patch introduces repeated content of the central patch rather than true background samples at the image border. This causes aliasing in the frequency domain, while in the spatial domain the background samples are covered, which deprives the filter of discriminative power. In the BACF method, a zero-aliasing correlation filter is employed to handle this problem.
The background-aware correlation filter (BACF) minimizes the following objective function:

$$ E(\mathbf{h}) = \frac{1}{2}\sum_{j=1}^{N}\Big\| \mathbf{y}(j) - \sum_{k=1}^{K}\mathbf{h}_k^{\top}\mathbf{P}\,\mathbf{x}_k[\Delta\tau_j] \Big\|_2^2 + \frac{\lambda}{2}\sum_{k=1}^{K}\|\mathbf{h}_k\|_2^2, \qquad (8) $$

where $\mathbf{P}$ is a $T \times N$ binary matrix which crops the mid $T$ elements of feature channel $\mathbf{x}_k$; the superscript $\top$ indicates the transpose operator; $\mathbf{x}_k \in \mathbb{R}^N$ and $\mathbf{y} \in \mathbb{R}^N$, where $N$ is the length of sample $\mathbf{x}_k$. Different from Equation (1), the filter $\mathbf{h}_k$ is of length $T$, where $T \ll N$.
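To make the role of the cropping operator concrete, the toy example below builds $\mathbf{P}$ for a 1-D signal: $\mathbf{P}\mathbf{x}$ keeps the central $T$ samples, while $\mathbf{P}^{\top}\mathbf{h}$ zero-pads a length-$T$ filter back to length $N$. The sizes are arbitrary and for illustration only.

```python
# Toy illustration of the binary cropping matrix P used in the BACF objective.
import numpy as np

N, T = 10, 4                                   # sample length and filter support, T << N in practice
start = (N - T) // 2
P = np.zeros((T, N))
P[np.arange(T), start + np.arange(T)] = 1.0    # selects the central T entries

x = np.arange(N, dtype=float)
print(P @ x)                                   # cropped mid-T part of the sample
h = np.ones(T)
print(P.T @ h)                                 # zero-padded filter of length N
```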
Similar to the traditional DCF tracker, Equation (8) can be learned in the frequency domain for computational efficiency. However, since the zero-aliasing constraint on $\mathbf{h}$ is imposed in the spatial domain, the regularization is kept in the spatial domain. The frequency-domain expression of Equation (8) is then as follows:

$$ E(\mathbf{h},\hat{\mathbf{g}}) = \frac{1}{2}\big\| \hat{\mathbf{y}} - \hat{\mathbf{X}}\hat{\mathbf{g}} \big\|_2^2 + \frac{\lambda}{2}\|\mathbf{h}\|_2^2 \quad \text{s.t.} \quad \hat{\mathbf{g}} = \sqrt{N}\,(\mathbf{F}\mathbf{P}^{\top} \otimes \mathbf{I}_K)\,\mathbf{h}, \qquad (9) $$

where $\hat{\mathbf{g}}$ is an auxiliary variable and $\mathbf{F}$ is an orthonormal $N \times N$ DFT matrix, such that the DFT of a signal $\mathbf{a}$ can be expressed as $\hat{\mathbf{a}} = \sqrt{N}\,\mathbf{F}\mathbf{a}$.
To solve Equation (9), the Augmented Lagrangian Method (ALM) is employed:

$$ \mathcal{L}(\hat{\mathbf{g}},\mathbf{h},\hat{\boldsymbol{\zeta}}) = \frac{1}{2}\big\| \hat{\mathbf{y}} - \hat{\mathbf{X}}\hat{\mathbf{g}} \big\|_2^2 + \frac{\lambda}{2}\|\mathbf{h}\|_2^2 + \hat{\boldsymbol{\zeta}}^{\top}\big( \hat{\mathbf{g}} - \sqrt{N}(\mathbf{F}\mathbf{P}^{\top} \otimes \mathbf{I}_K)\mathbf{h} \big) + \frac{\mu}{2}\big\| \hat{\mathbf{g}} - \sqrt{N}(\mathbf{F}\mathbf{P}^{\top} \otimes \mathbf{I}_K)\mathbf{h} \big\|_2^2, \qquad (11) $$

where $\hat{\boldsymbol{\zeta}}$ is a $KN \times 1$ Lagrangian vector in the Fourier domain, and the last term is an additional regularization term penalizing the zero-padded filter, with $\mu$ the penalty factor. Equation (11) can be solved iteratively by the Alternating Direction Method of Multipliers (ADMM) [9]. To simplify the exposition, we only give the solutions to the two subproblems $\hat{\mathbf{g}}^{*}$ and $\mathbf{h}^{*}$; for more details, please refer to [7,8].
3.1.1. Subproblem $\hat{\mathbf{g}}^{*}$
The subproblem over $\hat{\mathbf{g}}$ admits a closed-form solution that decomposes over the frequency bins, in which $\hat{\mathbf{x}}$ is easily obtained by applying the DFT after masking the sample with $\mathbf{P}$, and $\hat{\mathbf{s}}_{xy}$ and $\hat{\mathbf{s}}_{xx}$ are the cross-spectrum and auto-spectrum of sample $\mathbf{x}$ with the desired response $\mathbf{y}$, respectively.
3.1.2. Subproblem $\mathbf{h}^{*}$

$$ \mathbf{h}^{*} = \Big(\mu + \frac{\lambda}{N}\Big)^{-1}\big(\mu\,\mathbf{g} + \boldsymbol{\zeta}\big), $$

where $\mathbf{g}$ and $\boldsymbol{\zeta}$ can be efficiently calculated by applying the inverse DFT to $\hat{\mathbf{g}}$ and $\hat{\boldsymbol{\zeta}}$ and then cropping the results with matrix $\mathbf{P}$.
3.1.3. Lagrangian Multiplier Update
$$ \hat{\boldsymbol{\zeta}}^{(i+1)} = \hat{\boldsymbol{\zeta}}^{(i)} + \mu\big( \hat{\mathbf{g}}^{(i+1)} - \hat{\mathbf{h}}^{(i+1)} \big), $$

where $\hat{\mathbf{g}}^{(i+1)}$ and $\hat{\mathbf{h}}^{(i+1)}$ are the current solutions to the above subproblems at iteration $i+1$ of the iterative ADMM. The penalty factor $\mu$ is gradually increased within the ADMM iterations by Equation (15), $\mu^{(i+1)} = \min(\mu_{\max}, \beta\mu^{(i)})$, where $\beta$ is an increment parameter.
The final discriminative filter $\mathbf{h}$ is thus solved iteratively using Equations (12)–(15). Similar to the traditional DCF solution, the predicted target location is then obtained by multiplying the filter $\hat{\mathbf{h}}$ with the feature patch $\hat{\mathbf{z}}$ in the frequency domain.
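As a concrete illustration of this detection step, the sketch below computes the correlation response of multi-channel features in the frequency domain. It assumes the filter has already been zero-padded to the same spatial size as the feature patch and uses one common conjugation convention; array shapes are illustrative only.

```python
# Sketch of the DCF detection step: channel-wise correlation summed over channels,
# evaluated via the 2-D DFT.
import numpy as np

def correlation_response(h, z):
    """h, z: real arrays of shape (K, H, W). Returns the real response map (H, W)."""
    h_hat = np.fft.fft2(h, axes=(-2, -1))               # per-channel DFT of the (zero-padded) filter
    z_hat = np.fft.fft2(z, axes=(-2, -1))               # per-channel DFT of the search-patch features
    resp_hat = np.sum(np.conj(h_hat) * z_hat, axis=0)   # cross-correlation, summed over channels
    return np.real(np.fft.ifft2(resp_hat))              # back to the spatial domain

# Usage: the predicted translation is the peak of the response map, e.g.
# resp = correlation_response(h, z); dy, dx = np.unravel_index(resp.argmax(), resp.shape)
```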
3.2. Region Proposal Method
In this section, we discuss our region proposal approach based on correlation filters. Unlike general region proposal methods, such as Selective Search, EdgeBox and BING, which aim at locating possible regions of objects of all classes, our approach is designed to find candidate positions that contain objects similar to the tracking target. For a single-target tracking task, the initial object in the first frame is known in advance, so the region proposal method can exploit the tracking model as a prior: the proposed method treats the correlation filter as prior knowledge and utilizes the correlation response to find similar objects within the search area. Embedded into the DCF framework, the approach is computationally efficient, and experimental results show that a high recall ratio can be achieved with fewer than 20 proposals generated. A feature combination strategy can then be applied to the proposed candidates to enhance tracking accuracy.
In the implementation of DCF trackers, a discriminative filter $\mathbf{h}$ is updated throughout the tracking process to distinguish the target from its background surroundings. The response map in Equation (7) is calculated as the cross-correlation of the filter $\mathbf{h}$ with the search patch $\mathbf{z}$. Generally, DCF trackers take only the location of the peak response as the predicted target position. However, we notice through experiments that although the regression objective is a Gaussian function, the actual response map usually contains multiple peaks, and the target may lie at a suboptimal peak, especially under object deformation, background clutter or fast motion. There are several possible reasons. First, in order to limit frequency aliasing, a Hanning window is applied to the sample, which suppresses the response near the border. Second, model drift accumulates during tracking and gradually decreases the discriminative power of the correlation filter. Third, hand-crafted features are not robust against all tracking scenarios; for example, the HOG feature is fragile under target rotation and deformation, so distracters may produce responses comparable to the deformed target. Further investigation of the target position under such circumstances shows that, even when it is not at the global maximum, the target almost always lies at a local maximum of the response map. Based on these observations, we present our region proposal approach for DCF trackers.
All local maxima of the response map are first extracted, where a local maximum is defined as a location whose response value is larger than those of its eight connected neighbors. Next, only the local maxima whose response values exceed a fixed fraction of the global peak value are kept as candidates. The selected proposals are calculated as follows:

$$ \{c_i\} = \big\{\, p \in \mathcal{P} \;:\; \mathbb{1}\big[R(p) > \theta \cdot \mathrm{Peak}\big] = 1 \,\big\}, $$

where $c_i$ denotes the center locations of the proposals, $\mathcal{P}$ denotes the set of local maximum locations, and $\mathbb{1}[\cdot]$ is an indicator function taking value 1 at locations whose response is greater than $\theta \cdot \mathrm{Peak}$, with $\mathrm{Peak}$ the global maximum of the response map and $\theta$ the threshold fraction.
After locating the center positions, the proposals are generated by placing bounding-boxes of the same size as the current target patch in the DCF framework at each position. Non-maximum suppression (NMS) is then applied to the proposed bounding-boxes to eliminate heavily overlapping pairs.
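The proposal extraction can be summarized by the following sketch: local maxima of the response map are thresholded against a fraction of the global peak, boxes of the target size are placed at the surviving centers, and heavily overlapping boxes are removed by NMS. The threshold values (ratio, iou_thr) are placeholders, not the values used in the paper.

```python
# Sketch of region proposal extraction from a correlation response map.
import numpy as np

def local_maxima(resp):
    """Locations whose response is larger than all eight connected neighbours."""
    padded = np.pad(resp, 1, mode="constant", constant_values=-np.inf)
    neighbours = np.stack([padded[1 + dy:resp.shape[0] + 1 + dy,
                                  1 + dx:resp.shape[1] + 1 + dx]
                           for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                           if (dy, dx) != (0, 0)])
    return np.argwhere(resp > neighbours.max(axis=0))

def iou(a, b):
    """Boxes as (y, x, h, w)."""
    ay, ax, ah, aw = a; by, bx, bh, bw = b
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter = ih * iw
    return inter / float(ah * aw + bh * bw - inter)

def nms(boxes, scores, thr):
    keep = []
    for i in np.argsort(scores)[::-1]:                 # visit boxes from highest score down
        if all(iou(boxes[i], boxes[j]) < thr for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]

def propose(resp, box_size, ratio=0.5, iou_thr=0.7):
    peak = resp.max()
    centers = [tuple(p) for p in local_maxima(resp) if resp[tuple(p)] > ratio * peak]
    h, w = box_size                                    # same-size boxes centred on surviving maxima
    boxes = [(cy - h // 2, cx - w // 2, h, w) for cy, cx in centers]
    scores = [resp[c] for c in centers]
    return nms(boxes, scores, iou_thr)
```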
Figure 3 shows the object proposals generated by our region proposal method on videos from the OTB2015 and VOT2016 datasets. From the given sequences, two regular patterns can be observed: (1) the number of proposals is closely related to the tracking status, increasing when the target faces difficult scenarios and decreasing otherwise; (2) the ground-truth bounding-box almost always lies among the proposed candidate bounding-boxes unless tracker drift occurs.
3.3. Tracking Status Detection
As mentioned in Section 1, object tracking systems face several difficult statuses, such as occlusion, deformation and fast motion, during the tracking process. The strategies for handling these situations have a major effect on the performance of the tracker, and the first step towards a useful strategy for complex scenarios is to correctly predict the tracking status. Guan et al. [32] proposed an event-triggered decision model that combines spatial and temporal loss with an event decision tree to predict the tracking status. We adopt a more concise implementation built on the region proposals obtained from the correlation filter response. This section elaborates our criterion for judging the tracking status, which is closely related to the region proposal method in Section 3.2. When examining the extracted region proposals, we notice that more candidates are proposed as the target meets difficult situations. In view of this connection between the number of proposals and the tracking status, we suggest a new criterion that combines the peak value, the Peak-to-Sidelobe Ratio (PSR) and the number of proposals to predict the tracking status. Different from the methods in [1,33], the ratio between the current frame's statistics and their average values over past frames is employed to indicate the tracking status. Since the criterion is based on the response map, it is named the Response Confidence Level (RCL). The definition is as follows:
where $K$ indicates the $K$th frame of the sequence; $P_K$ stands for the maximum value of the response map and $\mathrm{PSR}_K = (P_K - m_K)/\sigma_K$, where $m_K$ and $\sigma_K$ represent the mean and standard deviation of the response map, respectively; $\bar{P}_{K-1}$ is the average peak value from the 1st frame to the $(K-1)$th frame, and $\overline{\mathrm{PSR}}_{K-1}$ is defined analogously.
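For illustration, the sketch below tracks the running averages of the peak value and PSR and forms their ratios against the current frame. How Equation (18) combines these ratios (and the number of proposals) into a single score is not reproduced here, so the multiplicative combination below is only an assumed reading.

```python
# Sketch of the statistics behind the Response Confidence Level (RCL).
import numpy as np

def peak_and_psr(resp):
    peak = resp.max()
    psr = (peak - resp.mean()) / (resp.std() + 1e-12)   # Peak-to-Sidelobe Ratio
    return peak, psr

class ResponseConfidence:
    def __init__(self):
        self.peak_hist, self.psr_hist = [], []

    def update(self, resp):
        peak, psr = peak_and_psr(resp)
        if not self.peak_hist:                           # first frame: no history yet
            rcl = 1.0
        else:                                            # ratios of current statistics to running averages
            rcl = (peak / np.mean(self.peak_hist)) * (psr / np.mean(self.psr_hist))
        self.peak_hist.append(peak)
        self.psr_hist.append(psr)
        return rcl
```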
The tracking status is predicted by the RCL, and a new update scheme different from that of the original DCF tracker is then introduced. The details of the filter update scheme are given in Section 4.2.
Figure 4 shows the efficacy of the RCL on an example sequence. From the figure we can see that the RCL drops significantly when difficult tracking conditions occur, e.g., occlusion, rotation, background clutter and fast motion.
3.4. Joint Detection Using Multiple Cues
Although potential bounding-boxes are proposed by the method in Section 3.2, how to determine the most likely one among the proposals remains unresolved. As discussed in Section 2.3, multiple feature cues can considerably improve the accuracy and robustness of a tracker. Since we aim at a tracker for real-time applications, CNN-based features are not considered because of their computational complexity. The HOG feature, a statistic of the edge cue, is already employed in the BACF architecture, so other features such as shape or color are considered. In [34,35,36], the color cue has been proven to be powerful when combined with the HOG feature in DCF trackers. We also find that the color cue is robust against object deformation and rotation yet weak against illumination change and background clutter; these properties are exactly complementary to edge-based features like HOG. Thus, the color histogram feature is employed as the other cue for joint detection in our approach, similar to the choice in [36,37].
To distinguish target pixels $x$ from background surrounding pixels, a naive Bayes classifier is trained on the color histograms of the search patch $I$. Given a target region $O$ (i.e., the initial bounding-box or the area predicted by the DCF tracker) and its surrounding region $S$ in $I$, the object likelihood of pixel $x$ is obtained by Bayes' rule:

$$ P(x \in O \mid b_x) = \frac{P(b_x \mid x \in O)\,P(x \in O)}{P(b_x \mid x \in O)\,P(x \in O) + P(b_x \mid x \in S)\,P(x \in S)}, \qquad (21) $$

where $H_{\Omega}(b)$ denotes the $b$th bin of the histogram $H$ calculated over region $\Omega$, and $b_x$ stands for the bin to which the color component of $x$ belongs. The conditional probabilities can be estimated directly from the color histograms as $P(b_x \mid x \in O) \approx H_O(b_x)/|O|$ and $P(b_x \mid x \in S) \approx H_S(b_x)/|S|$, where $|\cdot|$ counts the number of pixels in a region. Besides, the prior probabilities are approximated as $P(x \in O) \approx |O|/(|O|+|S|)$ and $P(x \in S) \approx |S|/(|O|+|S|)$, respectively. Then, by simplifying Equation (21), we obtain

$$ P(x \in O \mid b_x) \approx \frac{H_O(b_x)}{H_O(b_x) + H_S(b_x)}. \qquad (22) $$
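A small sketch of the per-pixel likelihood of Equation (22) follows, assuming for brevity a single-channel image and unnormalized 32-bin histograms over the object and surrounding regions.

```python
# Sketch of the simplified object likelihood H_O(b_x) / (H_O(b_x) + H_S(b_x)).
import numpy as np

def region_histogram(image, box, n_bins=32):
    """Histogram of the pixels inside box = (y, x, h, w); image is uint8 (H, W)."""
    y, x, h, w = box
    patch = image[y:y + h, x:x + w]
    bins = (patch.astype(np.int32) * n_bins) // 256
    return np.bincount(bins.ravel(), minlength=n_bins)

def colour_likelihood_map(image, obj_hist, surr_hist, n_bins=32):
    """Per-pixel object likelihood from the two (unnormalized) histograms."""
    bins = (image.astype(np.int32) * n_bins) // 256          # bin index b_x of every pixel
    num = obj_hist[bins].astype(np.float64)
    den = num + surr_hist[bins]
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.5)  # 0.5 where both histograms are empty
```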
The object likelihood map is defined as $M(x) = P(x \in O \mid b_x)$, where the likelihood of each pixel in patch $I$ is obtained by Equation (22). Then, $M$ is embedded into the region proposal framework to jointly combine the color cue and the edge cue for detecting the target. The most intuitive approach would be to use the value of $M(x)$ at the center location $x$ of each proposal. However, unlike the response map of the DCF filter, whose regression target is a Gaussian distribution, the maximum response of the likelihood map is usually not located at the center of the target. Thus, we measure the color similarity of a proposal by accumulating the values of $M$ over $B_k$, the bounding-box area of proposal $k$. This accumulated value can be calculated quickly from the integral image of the likelihood map:

$$ R_{hist}(k) = I_M(x_3, y_3) + I_M(x_1, y_1) - I_M(x_2, y_2) - I_M(x_4, y_4), $$

where $I_M$ is the integral image of the likelihood map $M$, $R_{hist}(k)$ is the color histogram response of proposal $k$, and $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, $(x_4, y_4)$ are the top-left, top-right, bottom-right and bottom-left corner coordinates of the bounding-box of proposal $k$.
The combined response of proposal $k$ is then generated by multiplying the correlation (HOG) response of the proposal with its color histogram response, $R(k) = R_{corr}(k) \cdot R_{hist}(k)$.
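The box-sum evaluation and the multiplicative fusion can be sketched as follows; normalizing the histogram scores before fusion is an added assumption for numerical comparability, not something specified above.

```python
# Sketch of proposal scoring with the integral likelihood map and cue fusion.
import numpy as np

def integral_image(m):
    """Integral map with a zero row/column so box sums need no bounds checks."""
    ii = np.zeros((m.shape[0] + 1, m.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(m, axis=0), axis=1)
    return ii

def box_sum(ii, box):
    """Sum of the likelihood map inside box = (y, x, h, w)."""
    y, x, h, w = box
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def combined_scores(corr_scores, likelihood_map, boxes):
    ii = integral_image(likelihood_map)
    hist_scores = np.array([box_sum(ii, b) for b in boxes])
    hist_scores /= hist_scores.max() + 1e-12          # assumed normalization before combining
    return np.asarray(corr_scores) * hist_scores      # multiplicative fusion of the two cues
```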
Figure 5 shows the architecture of the region proposal correlation filter. Potential bounding-boxes are proposed from the response of the correlation filter generated by the baseline tracker BACF. Different feature cues are then extracted only from the candidates, which dramatically enhances the robustness of the proposed tracker while reducing computational redundancy. In this paper, we combine the edge cue of the HOG response with the color cue of the histogram to achieve real-time tracking performance of about 30 fps. The efficiency of the overall framework is shown in Section 4.3. Moreover, this architecture can be flexibly extended to combine other feature cues, including shape, texture or deep CNN features.
When a new frame arrives, one loop of the region proposal correlation filter algorithm is summarized in Algorithm 1.

Algorithm 1 Framework of the proposed region proposal correlation filter method.

Input: z: search image patch; g: zero-aliasing filter; fg: color histogram of the foreground object; bg: color histogram of the background surroundings.
Output: Pos: predicted object position.
1: Extract the HOG feature of the input image patch and transform it to the frequency domain.
2: Calculate the response of the feature patch with the zero-aliasing filter and find the position with the peak response.
3: Generate the proposals from the response map according to Equation (17) and obtain the final k candidates by non-maximum suppression.
4: Compute the Response Confidence Level (RCL) according to Equation (18).
5: if the RCL is below the threshold then
6:  Calculate the object likelihood of each proposal from the color histograms according to Equations (22)–(24).
7:  Combine the HOG response with the color histogram response and choose the proposal with the highest combined response.
8:  Predict the object position by multi-cue detection (confidence level is low).
9:  Keep the zero-aliasing filter unchanged.
10: else
11:  Predict the object position from the original response (confidence level is high).
12:  Update the zero-aliasing filter according to Equations (12)–(15).
13:  Update the color histograms of the foreground and background, respectively.
14: end if
4. Experiments and Results
We now elaborate the implementation details of our tracking system using IoT and edge computing. The overall system is described in Section 3; its core is the EdgeServer, which first fetches images from the active cameras and then passes those images through the tracker. After the tracking process, the EdgeServer sends only compressed information to Cloud IoT Core and provides the data to the local web interface. We discuss the implementation details of the EdgeServer in Section 4.1.
Another critical part of the system is the tracking algorithm, which keeps a balance between tracking accuracy and energy consumption. We evaluate the proposed algorithm on widely used benchmarks and metrics. The benchmark results of our tracking algorithm compared with related methods indicate the effectiveness of our region proposal framework on DCF trackers. At the end of this section, several state-of-the-art trackers are used for comparison to show the tracking accuracy as well as the computational and memory efficiency of our algorithm.
A brief view of our tracking performance: Table 1 shows the computational efficiency of our algorithm. With all trackers tested on the same platform, our tracker achieves the top accuracy while running in real time. Moreover, Table 2 compares our method with state-of-the-art CNN-based trackers; the results show that our algorithm attains comparable tracking accuracy with a model size of only 157 KB. Low computing consumption together with a small memory requirement results in energy conservation for the whole procedure.
4.1. Flexible Implementation for IoT Devices
We implement and test our EdgeServer on the Xilinx Zynq-7000 SoC platform, with a dual-core ARM Cortex-A9 running at 1 GHz and an Artix-7 FPGA fabric with 512 KB of on-chip memory. The FPGA acts as the Devices Listener: it fetches video collected by the front-end cameras and crops only the patches of interest for the ARM processor; it also stores the template model and the current image in on-chip block RAM. Meanwhile, a Linux operating system running on the ARM processor handles the remaining jobs of the EdgeServer, including object tracking, the IoT Core gateway and the web interface. The gateway collects data from the cameras and sends the processed data to the cloud, and a local web UI runs by default on port 3000, where the camera images and tracking results can be viewed. To process the data received through Cloud IoT Core, we use Firebase Cloud Functions. Although we implement our EdgeServer on a Xilinx SoC platform, as shown in Figure 6, the proposed tracking system can in fact be extended to many edge platforms with ARM or x86 cores, e.g., the STM32 series. Since the overall system requires only 157 KB of memory and achieves real-time tracking on a dual-core CPU at 1 GHz, it can be deployed flexibly.
As mentioned above, it is especially important to understand the relationship between algorithms and their respective energy consumption in order to use the IoT device's power resources efficiently. Based on the platform of our EdgeServer, we focus on designing an energy-conserving tracking algorithm for the dual-core processing unit; the implementation details of the algorithm are discussed in the following subsection.
Details for Tracking Algorithm
Following [3], 31-dimensional HOG features are employed in our BACF filter. The search area is set to 5 times the target size in order to handle difficult tracking situations, e.g., fast motion and occlusion, and the regularization factor $\lambda$ in Equation (9) is kept fixed. A scale variation process following [38] is implemented, with the number of scales set to 3 and a fixed scale step. Moreover, the ADMM optimization parameters follow the settings in [8]: the number of iterations is set to 2 to balance performance against computational complexity, and the penalty factor $\mu$ is updated by Equation (15) with increment parameter $\beta$ and upper bound $\mu_{\max}$. Color histograms with 32 bins are computed for the target and background areas; the object histogram is calculated over an inner-padded region of the original target size. Furthermore, a Gaussian weighting is applied to the likelihood map so that the likelihood near the center slightly stands out.
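For reference, the settings stated above can be collected into a simple configuration structure; entries whose exact values are not given here are left as comments rather than guessed.

```python
# Parameter summary for porting the tracker between edge platforms.
TRACKER_PARAMS = {
    "hog_dims": 31,            # 31-dimensional HOG feature channels
    "search_area_scale": 5,    # search window = 5x target size
    "num_scales": 3,           # scale variation following [38]
    "admm_iterations": 2,      # trade-off between accuracy and computation
    "hist_bins": 32,           # color histogram bins for target and background
    # "lambda", "scale_step", "beta", "mu_max", "inner_padding",
    # "rcl_threshold", "lr_filter", "lr_hist": set per the original implementation
}
```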
The discriminative filter for regression as well as the color histograms of the target and its background surroundings are both updated in order to capture target deformation during tracking. However, unlike the traditional strategy of updating the model every frame, we propose a new update scheme based on the proposed RCL criterion: only confident tracking results are considered reliable training samples. As mentioned in Section 3.3, the RCL combines the peak value, the Peak-to-Sidelobe Ratio and the number of proposals. Through experiments, we observe that when the peak-value ratio and the PSR ratio drop to a low level, or when more than 3 proposals are generated, the tracking results are usually unreliable; this typically happens under difficult tracking conditions, e.g., deformation, motion blur or occlusion. Thus, we set a threshold on the RCL: only tracking results with a larger RCL are used to update the BACF filter and the color histograms, each with its own fixed learning rate.
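A sketch of this RCL-gated update is given below: the models are interpolated towards the new estimates only when the frame's RCL exceeds the threshold. The linear-interpolation form is common DCF practice; whether the interpolation is applied to the filter itself or to the feature model it is trained from, as well as the threshold and learning-rate values, are assumptions here.

```python
# Sketch of the selective (RCL-gated) model update.
import numpy as np

def update_models(rcl, rcl_threshold, models, new_estimates, lr_filter, lr_hist):
    """models / new_estimates: dicts holding 'filter', 'fg_hist', 'bg_hist' arrays."""
    if rcl < rcl_threshold:
        return models                                   # unreliable frame: keep models frozen
    models["filter"] = (1 - lr_filter) * models["filter"] + lr_filter * new_estimates["filter"]
    for key in ("fg_hist", "bg_hist"):                  # foreground and background color histograms
        models[key] = (1 - lr_hist) * models[key] + lr_hist * new_estimates[key]
    return models
```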
4.2. Benchmarks and Metrics
Our method is first evaluated on the OTB2015 [39] benchmark, which contains 100 sequences. The performance of all trackers is measured by two indicators, namely overlap precision (OP) and distance precision (DP). The OP metric measures the percentage of frames whose IoU score is greater than a threshold, where IoU is the intersection over union of the predicted and ground-truth bounding boxes; we compare the trackers at a fixed IoU threshold, and the area-under-the-curve (AUC) of the success plot is also employed to rank the trackers. The DP metric measures the percentage of frames in which the center distance between the predicted and ground-truth bounding boxes is less than a threshold; the percentage at a threshold of 20 pixels is reported for all trackers. In the experiments, we apply the One-Pass Evaluation (OPE) methodology following [39].
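For clarity, the two metrics can be computed as follows (boxes given as (x, y, w, h)); this is a straightforward reading of the definitions above, not the benchmark toolkit's code.

```python
# Sketch of the OTB overlap precision (OP) and distance precision (DP) metrics.
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes (x, y, w, h)."""
    ix, iy = max(a[0], b[0]), max(a[1], b[1])
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - ix)
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - iy)
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def overlap_precision(pred, gt, thr):
    """Fraction of frames whose IoU with the ground truth exceeds thr."""
    return np.mean([iou(p, g) > thr for p, g in zip(pred, gt)])

def distance_precision(pred, gt, thr=20.0):
    """Fraction of frames whose center error is below thr pixels."""
    def centre(b):
        return np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    return np.mean([np.linalg.norm(centre(p) - centre(g)) <= thr for p, g in zip(pred, gt)])
```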
VOT2016 [40] is also used to assess the robustness of the trackers. The VOT2016 dataset contains 60 challenging sequences and provides an evaluation toolkit that re-initializes a tracker to the correct position once a failure occurs. The expected average overlap, which combines per-frame accuracy and failures, is used to rank the trackers.
4.3. Accuracy and Efficiency
We evaluate our tracker against the following state-of-the-art trackers: KCF [3], Struck [23], fDSST [41], SRDCF [5], MDNet [42], CCOT [6], Siamese-fc [15], HCF [14], TCNN [43], LCT [44], Staple-ca [45], PTAV [46], TOPGT [31], ECO-HC [47], MCCT-H [48] and SAMF [16], on the two benchmarks OTB2015 and VOT2016. The results on these two challenging benchmarks show the accuracy and energy efficiency of our tracker; on the OTB2015 dataset in particular, our approach surpasses all of the compared state-of-the-art trackers suitable for real-time application with only 157 KB of model parameters.
4.3.1. Comparison to Traditional Tracking Algorithms
We first demonstrate the effectiveness of our region proposal framework on the DCF tracker by comparing the performance of our method with the baseline tracker BACF [8] and other related trackers: SAMF [38], Staple [36], SRDCFad [49], SRDCF [5] and LMCF [33]. The OTB2015 benchmark is used to evaluate the trackers.
Figure 7 shows the DP and OP plots on the OTB2015 benchmark. Compared with the six related trackers, our proposed tracker achieves the top performance on both the DP and OP evaluations, with clear gains in both metrics over the baseline tracker BACF. Meanwhile, owing to the concise architecture of combining edge and color cues on region proposals, our approach runs even faster than the original 35 fps implementation of BACF. The SRDCF tracker, another method that limits the boundary effect of the correlation filter, achieves lower precision and overlap scores while running at less than 10 fps. SAMF and Staple are two related trackers that combine the edge and color cues to jointly track the target. The same feature cues are used in our framework; however, thanks to the region proposal methodology, the combination of edge and color cues becomes more robust and efficient, and our approach obtains higher AUC scores than both SAMF and Staple on the plots of Figure 7.
Table 1 summarizes our approach and the six other related trackers. Distance precision indicates the percentage of frames with a center error below the threshold of 20 pixels; a high DP score means the tracker precisely follows the ground-truth position of the target. Overlap precision measures the percentage of frames with an IoU score greater than the threshold; a higher OP score indicates that the tracker correctly predicts the scale variation of the target. The last column, mean fps, indicates the average running speed over all sequences in the dataset; since all of the listed algorithms run on the same processor, a higher fps demonstrates the computational efficiency of an algorithm. Although our tracker is not the fastest, its DP and OP scores both rank at the top.
4.3.2. Comparison to CNN-Based Tracking Methods
We compare our algorithm with CNN-based trackers on the VOT2016 dataset. When evaluating on VOT2016, all of the tracking parameters are kept exactly the same as in the OTB2015 experiments in order to test the generalization of the proposed tracker. We assess the trackers in terms of accuracy, expected average overlap (EAO) and parameter size. The accuracy measures the average overlap ratio between the predicted bounding box and the ground truth, while the EAO combines the raw per-frame accuracies and failures to measure the expected average overlap of a tracker run without resets on a short-term sequence [50]. The model parameter size measures the memory space required by an algorithm. The higher the accuracy or EAO score, the better the tracker; conversely, a smaller model implies fewer memory accesses and thus better energy conservation. The tracking results are shown in Figure 8.
Different from the OTB2015 dataset, the VOT2016 challenge re-initializes the tracker after each drift and counts it as one failure; VOT2016 therefore mainly focuses on short-term tracking. From Table 2 we notice that all top-performing trackers, such as C-COT, TCNN, MDNet and HCF, employ deep-learning features rather than hand-crafted ones because of their high discriminative power. Our RPCF tracker, in contrast, introduces the RCL to self-detect a possible drift; the tracker then searches for potential objects among the region proposals and stops updating the target models, so that even if a tracking failure occurs, RPCF may correct itself later on. These schemes have less influence under the VOT2016 evaluation protocol. Even so, our tracker ranks third in EAO score, behind only the two deep-feature trackers C-COT and TCNN. However, C-COT and TCNN use VGG-19 as the backbone network to extract deep features, so their model parameter sizes exceed 300 Megabytes. From Table 2 we can see that, compared with state-of-the-art CNN-based trackers, our RPCF tracker achieves comparable performance with a parameter size of only 157 KB. A smaller parameter size means fewer memory accesses when running the algorithm and thus lower power consumption.
4.4. Tracking Performance under Difficult Conditions
This subsection demonstrates the tracking performance of our algorithm under different tracking statuses. Besides the overall performance of the trackers over all sequences, each sequence carries manually labeled attributes indicating the difficult situations that occur during it; performance on the sequences associated with an attribute shows a tracker's capability of handling the corresponding situation. There are a total of 11 attributes in the OTB2015 dataset, namely scale variation, in-plane rotation, out of view, background clutter, illumination variation, motion blur, fast motion, deformation, out-of-plane rotation, occlusion and low resolution.
Figure 9 shows the trackers' performance under different tracking conditions. Owing to the efficiency of the proposed region proposal and multi-cue combination schemes, our tracker clearly exceeds the other trackers under difficult tracking conditions such as out-of-plane rotation, occlusion, deformation and motion blur. Figure 9 shows that, especially under out-of-plane rotation and deformation, the overlap precision gains a clear margin over the other trackers. The reason is that although the HOG feature of the object changes dramatically under out-of-plane rotation and deformation, other features such as the color histogram and texture stay relatively stable; the region proposal scheme of our approach generates several candidates and uses the color histogram to re-locate the target. Together with the proposed RCL criterion for sensing the tracking status, RPCF outperforms the other trackers under difficult tracking conditions.
5. Conclusions
In this paper, an object tracking system using IoT and edge computation is introduced. The proposed system focuses on the edge network design and utilizes Google Cloud as the Cloud IoT core. Specifically, the EdgeServer is implemented on a Xilinx Zynq-7000 platform with a dual-core ARM processor and an Artix-7 FPGA. The edge functions include capturing images from the front-end cameras and tracking objects of interest across image sequences, in which the tracking algorithm plays a crucial role. We therefore design an algorithm that keeps a balance between tracking accuracy and the limited resources of edge devices: it achieves state-of-the-art tracking accuracy while requiring low computing capacity and little memory. In general, the proposed algorithm can process a 1080p image in less than 25 milliseconds on the Zynq-7000 platform and ranks at the top in accuracy among traditional tracking methods. Besides, our method achieves tracking precision comparable to CNN-based trackers whose model sizes exceed 300 Megabytes, while our parameter size is only 157 kilobytes, three orders of magnitude smaller. All in all, the proposed tracking system is accurate, computationally lightweight and memory-efficient, and therefore has a promising prospect in IoT and edge computing applications; the algorithm can be flexibly deployed to many energy-efficient edge devices.
The proposed tracking algorithm serves as the core of our object tracking system. It maintains a careful balance between tracking performance, system latency and energy consumption. Besides, the C++ implementation of the overall algorithm has the advantage of easy and rapid deployment, and the system can be moved from Google Cloud to other IoT platforms such as Microsoft Azure IoT or Amazon AWS Greengrass merely by writing new interface programs. In smart city applications, plenty of front-end cameras have already been deployed for surveillance, human-computer interaction and sports broadcasting; our tracking system can improve these existing systems by providing the tracking trajectories of the characters of interest. In the future, self-driving cars equipped with such an object tracking system may sense the movement of surrounding cars and respond instantly to unusual manoeuvres.
Although the proposed tracking system achieves state-of-the-art results, a limitation remains, namely the tracking robustness for long-term object tracking. A tracker should stay stable when occlusion and fast motion occur, yet update rapidly to adapt to appearance changes when faced with deformation and rotation. A reliable method for predicting the tracking status, together with a corresponding update scheme, still needs further study.