A New Framework of Moving Object Tracking
Abstract—Object Tracking (OT) on a moving camera, so-called Moving Object Tracking (MOT), is extremely important in Computer Vision. While conventional tracking methods based on a fixed camera can only track objects within the camera's range, a moving camera can overcome this limitation by following the objects. Moreover, a single tracker is widely used to track objects, but it is not effective on a moving camera because of challenges such as sudden movements, blurring and pose variation. This paper proposes a method that follows the tracking-by-detection approach: it integrates a single tracker with an object detection method. The proposed tracking system can track objects efficiently and effectively because the object detection method can be used to find the tracked object again when the single tracker loses track. Three main contributions are presented. First, the proposed Unified Visual-based MOT system performs Localization, 3D Environment Reconstruction and Tracking based on a Stereo Camera and an Inertial Measurement Unit (IMU). Second, it takes both camera motion and moving objects into account to improve the precision of localization and tracking. Third, the proposed tracking system is based on the integration of a single tracker (a Deep Particle Filter) and an object detector (YOLOv3). The overall system is tested on the KITTI 2012 dataset and achieves a good accuracy rate in real time.

Keywords—Moving object tracking; object detection; camera localization; 3D environment reconstruction; tracking by detection

I. INTRODUCTION

In Object Tracking, it is necessary to predict the position of the tracked object in the current frame and match it against previous frames to obtain its precise position. Many significant works have dealt with appearance changes over time using cues such as color histograms [1], HoG features [2], SIFT or SURF features [3], or texture features like LBP [4]. Single trackers are commonly built on popular filters such as the Correlation Filter, the Kalman Filter or the Particle Filter. Correlation filters [5][6][7] achieve high speed and accuracy. Kalman and Particle Filters are used because they can predict the position of an object and then match the predicted position with the previous one. The Kalman Filter [8]-[10] cannot deal with non-linearity in the measurements because it linearizes them with an approximation; the Particle Filter [11], [12] is used to overcome this drawback. Recently, deep neural networks have been applied to tracking problems. S. Chen and W. Liang [13] used a CNN to distinguish the background from objects and then track the objects according to their position. CNNs have also been integrated with correlation filters [14] or with particle filters [15], [16]. However, these approaches do not take into account the challenges of a moving camera. J. S. Lim and W. H. Kim [17] and Y. Chen et al. [18] tried to calculate the translation vector between two consecutive frames (or two frames from a stereo camera).

Based on data acquired by an IMU and a stereo camera, this paper proposes a solution that integrates a single tracker (a Deep Particle Filter) with an object detection method (YOLOv3 [19]); the object is tracked by its three-dimensional center. In traditional object tracking from a static camera, the two-dimensional position of the tracked object is enough, but in MOT its three-dimensional position must be considered, and challenges such as the vibration of the camera and the movement of the object must be taken into account. YOLOv3 is a suitable choice because it detects objects very quickly, its results can be used to make the single tracker more robust and, most importantly, it is suitable for real-time applications. In addition, in localization and three-dimensional environment reconstruction, the removal of moving objects is considered to increase the accuracy rate. To do so, the paper does not estimate the full 6 degrees of freedom of the robot pose directly; inspired by [20], it splits the motion into two separate transformations, a rotation and a translation. The rotation is calculated from the IMU and the translation is estimated from the stereo camera. The robot can localize itself based on these two transformations in real environments. The data observed by the stereo camera contains two kinds of objects: moving objects and static objects. If the feature points of moving objects are used to estimate the robot position and the 3D point cloud of the environment, the estimation error increases over time. Therefore, the paper eliminates the feature points of moving objects to increase the accuracy of localization and 3D environment reconstruction. Most published solutions have not yet considered the feature points of moving objects, but in the experimental results of this paper, removing moving features yields better accuracy than keeping them. To remove moving objects, the paper uses the background subtraction method with camera motion compensation proposed in [21], [22], whose advantage is the fast and accurate detection of moving objects.
Meanwhile, the research in [23] assumes that moving objects belong to movable categories that are likely to move now or in the near future, such as people, dogs, cats and cars. For instance, once a person is detected, whether walking or standing, it is treated as a potentially moving object and the features belonging to the image region where the person was detected are removed. The limitation of the method in [23] is that it cannot distinguish between objects that are actually moving and those that are static.

In the MOT problem, the paper uses a stereo camera and an IMU without GPS for the following reasons. The paper would like to test the power of the visual information acquired from the stereo camera in estimating the position of the robot, while the IMU data provides the rotation transformation of the robot motion. A stereo camera integrated with an IMU can work better than GPS in many environments, such as indoors, under radio interference or noisy GPS, and in cases where the only input is the visual information of the tracked object.

In Section II, the paper reviews previous work on visual tracking with both fixed and moving cameras. Section III describes the proposed methods: object localization, 3D environment reconstruction and the tracking algorithm based on a stereo camera and IMU. Section IV shows experimental results of localization and tracking. The paper discusses the pros and cons of the proposed methods in Section V. Conclusions and future work are presented in Section VI.

II. RELATED WORKS

A. Camera Localization

Robot localization is crucial for many high-level tasks such as object tracking, obstacle detection and avoidance, motion planning, autonomous navigation, local path planning and waypoint following. Over the years, many researchers have worked on robot localization and made significant contributions. David Nistér et al. [24] proposed a system for real-time ego-motion estimation with a single or stereo camera. Bernd Kitt et al. [25] proposed another visual odometry algorithm based on a RANSAC outlier rejection technique. Shaojie Shen et al. [20] used feature points from stereo images and IMU information to estimate the robot position. S. Prabu and G. Hu [12] proposed a vision-based localization algorithm which combines partial depth estimation and particle filter techniques. Yanqing Liu et al. [26] presented a robust stereo visual odometry using an improved RANSAC-based method (PASAC) that makes motion estimation much faster and more accurate than standard RANSAC. Yuquan Xu et al. [27] proposed a localization algorithm based on a three-dimensional point cloud map and a stereo camera. S. Hong et al. [28] proposed a real-time autonomous navigation system using only a stereo camera and a low-cost GPS. All the aforementioned works provide the fundamental background for the localization problem in this paper. Here, the paper proposes a novel method to localize a robot using a stereo camera and an IMU sensor; in particular, it takes moving objects into account to increase the accuracy rate.

B. Moving Object Tracking

A moving camera can overcome the disadvantages of a fixed camera. A fixed camera can only track objects within its range; once objects leave its field of view (FOV), it can no longer monitor them. To handle this, the camera should be mounted on a moving platform such as a robot, a drone or an autonomous car.

Y. Chen et al. [18] used features such as SIFT and SURF to match features between two consecutive frames, find the translation vector of the camera and use it to predict the position of objects in the frame. J. S. Lim and W. H. Kim [17] estimated motion by comparing 16x16 patches between two frames: each patch has a vector representing the dominant motion in that area, and after traversing all 16x16 patches of two consecutive frames, the vector with the highest frequency is selected as the camera motion vector. These frameworks partly alleviate the effects of fast motion, rotation and vibration of the camera.

There are also several ways to match objects between two images. Q. Zhao et al. [1] matched objects by comparing color histograms, but this easily fails when other regions have the same colors as the object. C. Ma et al. [14] applied a CNN to extract features and compared objects with a correlation filter. R. J. Mozhdehi and H. Medeiros [15] and T. Zhang et al. [16] inherited this framework and integrated it with a Particle Filter.

The tracking part of this paper inherits the Particle Filter and improves its prediction and measurement steps. First, the paper finds a translation vector using a feature matching algorithm, and then the position of the tracked object is resolved by applying a deep neural network in conjunction with a correlation filter.

Moreover, the paper inherits a deep CNN-based object detection algorithm, YOLOv3 [19], which is very fast and quite accurate. By combining these methods, the tracking part implements an algorithm called Tracking by Detection.

However, to track the object in the context of a moving camera and a moving object, the tracker has to track the object in the 3D environment (using the IMU and stereo cameras) so that the tracking system is realistic.

III. METHOD

The paper proposes a Unified Visual-Based MOT system that performs Camera Localization, 3D Environment Reconstruction and Object Tracking.

A. Camera Localization

Inspired by the method of [20], the significant improvement proposed here is in the feature detection stage, with the removal of moving feature points. In addition, there are some differences between [20] and this paper. Specifically, instead of using a built-in positioning system as in [20] to obtain the camera position as ground truth, the paper uses the GPS ground truth of the KITTI dataset.
To locate the camera, the paper estimates the camera motion at time t, which consists of the translation and rotation of the camera coordinate system between two consecutive frames, based on the stereo camera and the IMU sensor. The IMU data provides the rotation matrix for the rotation transformation. The feature points of the image are used to estimate the translation; these features include both moving and static features, and in this case the moving features are noise. Therefore, the paper removes the moving feature points to reduce the error in estimating the camera position. This is a new point in improving the robot localization process. The camera location estimation steps are shown in Fig. 1.

1) Camera model, feature detection and feature tracking: Both cameras in the system are calibrated using the Camera Calibration Toolbox. The cameras are organized into two systems that play different roles:

• Stereo Camera System (right and left cameras): used to estimate the 3D positions of features in the world coordinate system (WCS), to initialize the local map at the start, and to update the local map when its accumulated error becomes large enough (see Fig. 1 and 7).

• Monocular Camera System (left camera): used to estimate robot locations and to initialize and update local maps.

In the model, at each moment the paper obtains two images from the stereo camera (see Fig. 2). These two images are used for feature detection and for reconstructing the 3D positions of the features in the WCS. However, feature detection and 3D position reconstruction are not performed for every pair of successive images; they are performed in a fixed cycle corresponding to 25 consecutive frames (depending on the device) (see Fig. 2). This means that at the beginning, features are detected from the two stereo images and their 3D positions are reconstructed in the world coordinate system; after a cycle of 25 consecutive frames (including the frames used for feature detection), the calculation is performed again. Within a cycle, feature detection and 3D position estimation are not performed; instead, the features are tracked on the successive image frames until a new cycle begins. The purpose of this scheme is to reduce computational time while retaining the required accuracy.

In the feature detection stage, image features play an important role in locating the robot. SURF features (Speeded Up Robust Features) [29] are extracted from the pairs of images of the left and right cameras. The FLANN matching algorithm (Fast Library for Approximate Nearest Neighbors) [30] is used to match the features of the two images, and Lowe's outlier rejection method [31] is used to remove outliers. This outlier removal significantly improves the accuracy of localization.
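To make this stage concrete, the sketch below shows one common way to implement it with OpenCV (SURF lives in the opencv-contrib xfeatures2d module); the function name, Hessian threshold and ratio value are illustrative assumptions rather than parameters reported in the paper.

```python
import cv2

def match_stereo_features(img_left, img_right, ratio=0.7):
    """Detect SURF keypoints in a stereo pair, match them with FLANN,
    and keep only matches that pass Lowe's ratio test."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # requires opencv-contrib
    kp_l, des_l = surf.detectAndCompute(img_left, None)
    kp_r, des_r = surf.detectAndCompute(img_right, None)

    # FLANN with KD-trees is the usual configuration for float descriptors such as SURF.
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # FLANN_INDEX_KDTREE
                                  dict(checks=50))
    knn = flann.knnMatch(des_l, des_r, k=2)

    # Lowe's ratio test: keep a match only if it is clearly better than the runner-up.
    good = []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    pts_l = [kp_l[m.queryIdx].pt for m in good]
    pts_r = [kp_r[m.trainIdx].pt for m in good]
    return pts_l, pts_r
```

The same routine can be reused for feature tracking between consecutive left-camera frames, which is how the matched pairs feeding the localization and tracking steps are obtained.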
Fig. 6. Illustration of the Positions of the Left Camera at Time t and the Error between g_it and the Observation Vector k_it. The Position with the Smallest Error (the Error is the Total Area of the Red Parallelograms) will be the Robot Position at Time t. Here, the Position r_t has the Smallest Error.
Assume that the camera motion between two consecutive images is small; formula (7) can then be approximated as:

\mathbf{r}_t^* = \arg\min_{\mathbf{r}_t} \sum_{i\in\mathcal{I}} \left\| \frac{\mathbf{r}_t - \mathbf{p}_i}{d_i} \times \mathbf{k}_{it} \right\|^2   (10)

where d_i = \|\mathbf{r}_t - \mathbf{p}_i\| \approx \|\mathbf{r}_{t-1} - \mathbf{p}_i\| are known quantities. By taking the derivative of formula (10) and setting it to zero, a linear system is obtained in which the optimal camera position \mathbf{r}_t is the unknown:

\left( \sum_{i\in\mathcal{I}} \frac{\mathbb{I}_3 - \mathbf{k}_{it}\mathbf{k}_{it}^T}{d_i} \right) \mathbf{r}_t = \sum_{i\in\mathcal{I}} \frac{\mathbb{I}_3 - \mathbf{k}_{it}\mathbf{k}_{it}^T}{d_i}\, \mathbf{p}_i   (11)

where \mathbf{r}_t is the 3D position of the camera at time t in the WCS, \mathbf{k}_{it} is the observation vector of the i-th feature point at time t in the WCS, \mathbf{p}_i is the 3D position of the i-th feature in the WCS, \mathcal{I} is the set of features observed in the image at time t, d_i = \|\mathbf{r}_{t-1} - \mathbf{p}_i\| is a known value, (\cdot)^T denotes the matrix transpose and \mathbb{I}_3 is the 3x3 identity matrix.

Equation (11) consists of three equations in the three unknowns of the camera's 3D position in the WCS, and this does not change regardless of the number of observed features; therefore, the camera position can be estimated efficiently in constant time. The observed features used to calculate the camera position are features that are not in moving regions. If features from moving regions were used, the error of the estimated camera position would increase: the camera position is estimated from the 3D feature points at time t-1 and the corresponding observation vectors at time t, and a feature from a moving region has different 3D positions in the same WCS at times t-1 and t, so its observation vector at time t will not match the true vector of the 3D feature point at time t-1 (i.e. the observation vector \mathbf{k}_{it} will not match the true vector \mathbf{g}_{it}), which increases the error of the camera location estimate. For a static feature, this error is zero or very small.

Equation (11) needs at least two features to compute the camera position \mathbf{r}_t, so an efficient 2-point RANSAC (Random Sample Consensus) can be applied for outlier rejection. This reduces the computational time compared to the traditional 3-point [35] and 5-point [36] algorithms.

As mentioned above, equation (11) is solved within a 2-point RANSAC loop, which includes the following steps. First, determine the number of iterations. Second, at each iteration, draw a random sample of two elements, i.e. two random points from the 3D feature point set. Then, the estimate obtained from this sample is evaluated by an error function. These steps are repeated, and after all iterations RANSAC converges to a good robot position, although it is not guaranteed to be the best one. This RANSAC scheme ensures fast processing, the ability to estimate a good enough model and the elimination of noise in the data set.

B. 3D Environment Reconstruction

In this section, the 3D environment reconstruction task is presented (see Fig. 7). The environmental map is a local map, defined as the set of currently tracked 3D features. The 3D points are computed in two different ways, one from the stereo camera and the other from the monocular camera. These 3D points are transformed from the CCS to the WCS of the robot at the start and are added to the local map. The 3D features added to the local map are static feature points, because they are used to estimate the robot position at different times; moving feature points would introduce errors in the estimated robot position.

Fig. 7. Diagram of Initializing and Updating Local Maps.

At the initial time t = 0, the robot position is initialized. The 3D positions of the feature points are estimated in the world coordinate system from the stereo camera and are used to initialize the local map.

At time t (t ≠ 0), given the robot position, the local map is updated by the following two systems:

1) Stereo camera system: After a given period of time, the system is restarted to update the local map. The 3D points are calculated by the stereo camera.

2) Monocular camera system: During feature tracking, some features are lost, and lost features are removed from the map. New features are added to the local map if the current number of features is smaller than the minimum allowable feature count. The 3D location \mathbf{p}_i of a new feature point is estimated from a set \tau of observations of the i-th feature at different camera positions and is given by:

\mathbf{p}_i^* = \arg\min_{\mathbf{p}_i} \sum_{t\in\tau} \left\| (\mathbf{p}_i - \mathbf{r}_t) \times \mathbf{k}_{it} \right\|^2   (12)

where \mathbf{r}_t is the 3D position of the camera at time t in the WCS, \mathbf{k}_{it} is the observation vector of the i-th feature point at time t, and \mathbf{p}_i is the 3D position of the i-th feature in the WCS.

Equation (12) is solved via the following linear system:

\left( \sum_{t\in\tau} (\mathbb{I}_3 - \mathbf{k}_{it}\mathbf{k}_{it}^T) \right) \mathbf{p}_i = \sum_{t\in\tau} (\mathbb{I}_3 - \mathbf{k}_{it}\mathbf{k}_{it}^T)\, \mathbf{r}_t   (13)

where \mathbf{r}_t is the 3D position of the camera at time t in the WCS, \mathbf{k}_{it} is the observation vector of the i-th feature point at time t in the WCS, \mathbf{p}_i is the 3D position of the i-th feature in the WCS, \tau is the set of times t at which the i-th feature is observed by the left camera, (\cdot)^T denotes the matrix transpose and \mathbb{I}_3 is the 3x3 identity matrix.

Equation (13) is solved by basic matrix algebra; here \tau is defined as the two consecutive times t-1 and t.
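As a concrete illustration of this step, the NumPy sketch below solves the linear system (13) for a feature position from its observation rays; the second function shows that the camera-position system (11) has exactly the same structure, with each term weighted by 1/d_i. The function and variable names (and the unit-vector normalization) are assumptions of the sketch, not code from the paper.

```python
import numpy as np

def triangulate_feature(camera_positions, observation_vectors):
    """Solve (13): (sum_t (I - k k^T)) p = sum_t (I - k k^T) r_t,
    where k is the unit observation vector of the feature at time t
    and r_t is the camera position at that time, both in the WCS."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for r_t, k in zip(camera_positions, observation_vectors):
        k = k / np.linalg.norm(k)           # ensure k is a unit vector
        M = np.eye(3) - np.outer(k, k)      # projector orthogonal to the ray
        A += M
        b += M @ r_t
    return np.linalg.solve(A, b)            # 3x3 system -> feature position p_i

def estimate_camera_position(feature_points, observation_vectors, r_prev):
    """Solve (11): same structure as (13), but each term is weighted by 1/d_i,
    with d_i approximated by ||r_{t-1} - p_i||."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p_i, k in zip(feature_points, observation_vectors):
        k = k / np.linalg.norm(k)
        d_i = np.linalg.norm(r_prev - p_i)
        M = (np.eye(3) - np.outer(k, k)) / d_i
        A += M
        b += M @ p_i
    return np.linalg.solve(A, b)            # camera position r_t
```

A 2-point RANSAC wrapper, as described above, would repeatedly call estimate_camera_position on random pairs of static features and keep the hypothesis with the largest inlier set.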
The 3D positions of the feature points calculated from the monocular camera are used to update the local map in the following two cases:

• The feature point already has a 3D position recovered from the stereo camera system: the 3D position from the monocular camera system is added to the local map.

• The feature point does not yet have a 3D position from the stereo camera system: the feature point is added to the local map if the current number of feature points in the local map is smaller than the minimum allowable count.

Failure Detection and Recovery

Because the 3D positions of the feature points in the local map are computed by two systems (feature tracking by the monocular camera, and feature detection and matching by the stereo camera), errors accumulate. The error of the local map at time t is calculated as:

\gamma = \frac{1}{|\mathcal{K}|} \sum_{k\in\mathcal{K}} \frac{\|\mathbf{p}_k^m - \mathbf{r}_t\|}{\|\mathbf{p}_k^s - \mathbf{r}_t\|}   (14)

where \mathbf{p}_k^s is the location of feature k obtained by stereo correspondence, \mathbf{p}_k^m is the location of feature k obtained by the monocular camera, \mathcal{K} is the set of 3D points used to calculate the error, and \mathbf{r}_t is the camera position at time t.

The system works well when \gamma \cong 1; otherwise it is in error. When the system is in error, all features from the monocular camera system are removed and the local map is restarted from the 3D positions of the feature points from the stereo camera, \mathbf{p}_k^s.

C. Tracking by Detection

The goal of the paper's tracking algorithm is to handle the challenges of a moving camera, namely the vibration of the camera and the motion of the tracked object. The essence of the algorithm is the integration of a single tracker and object detection into a method called tracking by detection. The single tracker gives the state of the tracked object at every time step; moreover, its 3D position, obtained by reconstructing the environment, provides the necessary information for the tracking process. The single tracker is integrated with an object detector, YOLOv3, because YOLOv3 detects objects very quickly and accurately: it helps the single tracker find the tracked object faster, and when the single tracker fails, YOLOv3 is a useful helper for recovering the object being tracked. Combined with 3D environment reconstruction, Tracking by Detection can also estimate the three-dimensional position of the tracked object.

This section describes the workflow of the tracking algorithm, which includes four steps: Initialization, Prediction, Matching and Resampling.

First, an object can be selected in a frame to track; alternatively, the system may be given an image of the object, which is then located before tracking starts.

After the bounding box of the tracked object is obtained at time t, in the prediction step at time (t+1) the particle filter generates particles, and each particle is guided by a motion vector of the tracked object. An object detector comes in handy in this step: it detects objects of the same kind as the tracked object, and some of the detected objects are added to the particle set. Object detection based on deep learning is robust to the vibration of the camera.

In the matching step, YOLOv3 detects a number of objects of the same type as the tracked object, and the detected object with the highest matching rate with the tracked object is selected as the current tracked object. If matching fails, each particle is matched against the previous object by an observation model in which a correlation filter is integrated with a deep neural network. After that, resampling can be performed if necessary.

The tracking pipeline of the paper is shown in Fig. 8.

1) Initialization: In the first frame, an object is chosen and expressed by four parameters, its location and size:

p(t) = [p_x(t), p_y(t), p_w(t), p_h(t)]   (15)

where p_x(t), p_y(t) is the position of the object at time t and p_w(t), p_h(t) are its width and height at time t.

F(t) and F(t-1) are the sets of SURF feature points of the object at times t and t-1 from the left or right camera:

F(t) = (f_t^0, f_t^1, \ldots, f_t^n)   (16)

F(t-1) = (f_{t-1}^0, f_{t-1}^1, \ldots, f_{t-1}^n)   (17)

Using a feature matching algorithm, feature points that are not matched are eliminated, and the remaining K pairs of feature points are treated as a set of K motion vectors:

f_v(t) = (f_t^0, f_t^1, \ldots, f_t^K)   (18)

where each f_t^i, i \in K, is a camera motion vector of one pair of feature points. This set is used to compute the camera's motion vector for the object at time t.

Fig. 8. The Tracking Pipeline.
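To illustrate how the matched pairs are turned into a motion vector, the following sketch (hypothetical helper functions, assuming NumPy arrays of matched keypoint coordinates) averages the per-pair displacements, which is the mean later written as (26), and applies it to the previous box state.

```python
import numpy as np

def motion_vector(matched_prev, matched_curr):
    """matched_prev, matched_curr: (K, 2) arrays of matched keypoint
    coordinates in frames t-1 and t. Returns the mean displacement,
    i.e. the average of the K per-pair motion vectors."""
    diffs = np.asarray(matched_curr, float) - np.asarray(matched_prev, float)
    return diffs.mean(axis=0)

def shift_box(box, fv):
    """Shift the previous box state [px, py, pw, ph] by the motion vector fv."""
    px, py, pw, ph = box
    return [px + fv[0], py + fv[1], pw, ph]
```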
2) Prediction: In this step, the state of the object is predicted in the next frame.

The particle filter generates particles around the previous object, and each particle has a weight that expresses its importance. For example, in Fig. 9 a motorbike is being tracked:

• The yellow box indicates the object being tracked.

• The green box indicates a particle around the object.

Each particle is guided by the camera motion vector, which is obtained by matching SURF features in two consecutive frames from the left or right camera; the stereo camera is used to obtain the two current frames instead of two consecutive frames as in [38]. After matching features, the predicted state of the tracked object is \hat{p}(t):

\hat{p}(t) = p(t-1) + f_v(t) + q(t)   (19)

where p(t-1) is the state of the tracked object in the previous frame (t-1), f_v(t) is the camera motion vector at time t computed by feature matching, and q(t) is added Gaussian noise.

After that, a number of particles are generated around \hat{p}(t); each particle is denoted \hat{p}_i(t), with

\hat{p}x_t^i = \frac{\hat{p}x(t)}{2} + \mathrm{random}(0,1) \cdot \frac{\hat{p}w(t)}{2}   (22)

\hat{p}y_t^i = \frac{\hat{p}y(t)}{2} + \mathrm{random}(0,1) \cdot \frac{\hat{p}h(t)}{2}   (23)

\hat{p}w_t^i = \hat{p}w(t) + \mathrm{random}(0,1) \cdot \mathrm{SCALE\_CONSTANT}   (24)

\hat{p}h_t^i = \hat{p}h(t) + \mathrm{random}(0,1) \cdot \mathrm{SCALE\_CONSTANT}   (25)

f_v(t) = \frac{1}{K} \sum_{i=1}^{K} f_t^i   (26)

YOLOv3 [19] is also used to detect objects in this step; after detection, objects are selected by their IoU with the tracked object, and an object is kept and added to the particle set if this value is higher than a threshold.

3) Matching: Each particle is compared with the previous object by an observation (matching) model; after the comparison, each particle has a weight.

The correlation filter presented in [5][6][7] is used in the observation model. The filter is learned at time (t-1) and then traverses the frame at time t: the filter is convolved with each region of the image from left to right and top to bottom, which yields a value measuring the correlation between the object at time (t-1) and the region at time t. The higher the value, the more likely the region is the state of the tracked object at time t.

The correlation filter is integrated with the particle filter because it then does not need to traverse the whole frame; it only needs to be convolved with each particle, which reduces the computational cost.

To use this correlation filter, it has to be learned in frame (t-1). It is denoted crf; the region of the tracked object at time (t-1) is r, with width M and height N, and r_{m,n} with m, n \in \{0,1,\ldots,M-1\} \times \{0,1,\ldots,N-1\} is the region of the tracked object at time (t-1) translated m pixels to the right and n pixels down. Each r_{m,n} corresponds to a value

g_{m,n} = e^{-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}}

which is a Gaussian value expressing how far a pixel is from the center. After the learned filter crf* is found, the weight (the correlation) of each particle is calculated by convolving crf* with the region of that particle.

In this step, a deep neural network, VGG-19, is applied [14][16]. The network is pre-trained with optimal parameters. Both r_{m,n} and the region of each particle are passed through this network to obtain three feature maps, conv-3, conv-4 and conv-5, and for each convolutional map a corresponding crf* is learned. This yields three weights c_1, c_2, c_3 for a particle, and the final weight of the particle is:

w_t^i = c_1 + c_2 + c_3   (28)

Finally, the weight of each particle is normalized:

w_t^i = \frac{w_t^i}{\sum_{i=0}^{N} w_t^i}   (29)
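A minimal sketch of the prediction and weight-normalization steps is given below, assuming NumPy. predict_state follows (19); generate_particles scatters particles around the predicted state in the spirit of (22)-(25), but the centred jitter and the SCALE_CONSTANT value are assumptions of the sketch, since the paper does not report its exact settings; normalize_weights implements (29).

```python
import numpy as np

SCALE_CONSTANT = 5.0  # illustrative value; the paper does not state its setting

def predict_state(prev_box, fv, noise_sigma=2.0):
    """Eq. (19): shift the previous state [px, py, pw, ph] by the camera
    motion vector fv and add Gaussian noise q(t) to the position."""
    px, py, pw, ph = prev_box
    q = np.random.normal(0.0, noise_sigma, size=2)
    return np.array([px + fv[0] + q[0], py + fv[1] + q[1], pw, ph])

def generate_particles(pred_box, n_particles=100):
    """Scatter particles around the predicted state: positions are jittered
    within the box extent, sizes by SCALE_CONSTANT (cf. (22)-(25))."""
    px, py, pw, ph = pred_box
    u = np.random.rand(n_particles, 4)       # random(0,1) draws
    return np.column_stack([
        px + (u[:, 0] - 0.5) * pw,            # assumption: jitter centred on the box
        py + (u[:, 1] - 0.5) * ph,
        pw + u[:, 2] * SCALE_CONSTANT,
        ph + u[:, 3] * SCALE_CONSTANT,
    ])

def normalize_weights(weights):
    """Eq. (29): scale the particle weights so they sum to one."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()
```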
location \mathbf{r}_t; GPS_t is the camera location from GPS at time t; GPS_{t,x}, GPS_{t,y} and GPS_{t,z} are the x, y and z coordinates of the camera location GPS_t; and N is the number of frames.

From the experimental results (see Table II), the estimated camera position is quite good, and when the removal of moving objects is applied, better results are achieved than in the opposite case.

B. Accuracy of the Tracked Object Center in 3D

The object center is estimated as the average of all 3D points of the object being considered. The following distance is used to evaluate the error (see Table III) between the tracked object center and its ground truth center:

\text{error-center} = \frac{1}{N} \sum_{t=1}^{N} \left[ (\mathbf{c}_{est,x}(t) - \mathbf{c}_{gt,x}(t))^2 + (\mathbf{c}_{est,y}(t) - \mathbf{c}_{gt,y}(t))^2 + (\mathbf{c}_{est,z}(t) - \mathbf{c}_{gt,z}(t))^2 \right]^{1/2}   (33)

where \mathbf{c}_{est}(t) is the center of the estimated 3D bounding box at time t; \mathbf{c}_{est,x}(t), \mathbf{c}_{est,y}(t) and \mathbf{c}_{est,z}(t) are its x, y and z coordinates; \mathbf{c}_{gt}(t) is the center of the 3D bounding box from the ground-truth data at time t; \mathbf{c}_{gt,x}(t), \mathbf{c}_{gt,y}(t) and \mathbf{c}_{gt,z}(t) are its x, y and z coordinates; and N is the number of frames.

In Section 4.B, a different dataset is used to evaluate the accuracy of the tracked object center because it provides ground truth data for the object's center.

From Table III, the center error when using YOLOv3 is lower than in the opposite case.

TABLE II. ERROR OF THE ESTIMATED CAMERA POSITION COMPARED TO GPS, WITHOUT AND WITH REMOVAL OF MOVING OBJECTS

Data | Frames | Distance (m) | Without removal of moving objects (m) | With removal of moving objects (m)
0091 | 150 | 97.5 | 0.8449 | 0.8245
0060 | 70 | 0 | 0.0197 | 0.0193
0095 | 150 | 137.86 | 1.1654 | 1.0322
0113 | 80 | 16.27 | 0.5542 | 0.5440
0106 | 174 | 83.61 | 0.5199 | 0.5105
0005 | 150 | 66.78 | 1.5397 | 0.6656
Average | | | 0.7740 | 0.5993

TABLE III. ERRORS BETWEEN THE TRACKED OBJECT CENTER AND ITS GROUND TRUTH CENTER IN 3D

Data | error-center with YOLOv3 (m) | error-center without YOLOv3 (m)
0005 | 0.5421 | 0.8608
0010 | 0.5470 | 1.7793
0011 | 0.4271 | 2.104
Average | 0.5054 | 1.5868

C. Accuracy of the Tracked Object Position based on the Object Center and IoU in 2D

The IoU metric is used to compare the predicted boxes with their ground truth boxes in 2D.

In Section 4.C, the following datasets are used because they contain objects that can be tracked consistently; the other sequences do not have a consistent object to follow.

In Table IV, the IoU between the predicted boxes and their ground truth boxes is reported. The table shows that the IoU of the proposed tracking method when using YOLOv3 is higher than in the opposite case.

In Table V, the Euclidean distance is used to estimate the errors between the ground truth centers and the predicted centers. The table shows that, when YOLOv3 is combined, the Euclidean distances of the proposed tracking method are smaller than in the opposite case.

D. Speed of the Tracking Algorithm

Speed is measured as the number of frames processed per second (FPS). In Table VI, the paper measured the time taken to process a frame and inverted it to obtain the FPS. The table shows that the FPS of the proposed tracking method when using YOLOv3 is more than three times higher than in the opposite case.

TABLE IV. ACCURACY OF THE ESTIMATED OBJECT POSITION WITH TRACKING BY DETECTION VERSUS TRACKING WITHOUT DETECTION (METRIC: 2D IOU; OBJECT DETECTION METHOD: YOLOV3)

Data | Average 2D IoU with object detection (YOLOv3) | Average 2D IoU without object detection
0000 | 0.66 | 0.29
0004 | 0.41 | 0.11
0005 | 0.74 | 0.6
0010 | 0.82 | 0.64
0011 | 0.78 | 0.48
0020 | 0.65 | 0.6
Average | 0.68 | 0.45
Outliers
0018 | 0.08 | 0.03

TABLE V. ACCURACY OF THE ESTIMATED OBJECT POSITION WITH TRACKING BY DETECTION VERSUS TRACKING WITHOUT DETECTION (METRIC: EUCLIDEAN DISTANCE; OBJECT DETECTION METHOD: YOLOV3)

Data | Average error of centers with YOLOv3 (pixel) | Average error of centers without YOLOv3 (pixel)
0000 | 16.14 | 234.3
0004 | 10.74 | 61.64
0005 | 3.66 | 1.47
0010 | 3.62 | 4.11
0011 | 5.57 | 14.06
0020 | 6.22 | 13.13
Average | 7.66 | 54.79
Outliers
0018 | 25.56 | 315.52
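For reference, the two evaluation metrics used in Tables III-V can be computed as in the sketch below (assuming NumPy and 2D boxes given as [x, y, w, h] with the top-left corner as origin; the paper does not state its exact box convention).

```python
import numpy as np

def iou_2d(box_a, box_b):
    """Intersection over Union of two 2D boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def error_center(est_centers, gt_centers):
    """Eq. (33): mean Euclidean distance between estimated and
    ground-truth 3D centers over N frames."""
    est = np.asarray(est_centers, dtype=float)   # shape (N, 3)
    gt = np.asarray(gt_centers, dtype=float)
    return np.linalg.norm(est - gt, axis=1).mean()
```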
TABLE VI. SPEED OF THE PROPOSED TRACKING ALGORITHM WITH AND WITHOUT YOLOV3

Data | Average FPS with object detection (YOLOv3) (Hz) | Average FPS without object detection (Hz)
0000 | 11 | 4
0004 | 9 | 3
0005 | 12 | 3
0010 | 12 | 3
0011 | 12 | 3
0020 | 11 | 4
Average | 11 | 3
Outliers
0018 | 6 | 3

V. DISCUSSION

The experimental results show that the camera position is estimated quite well because the moving features are removed when estimating it. However, the moving-feature removal algorithm is still limited and should be improved in the future, even though it has already improved the accuracy of the camera position. Although an error remains between the estimated and ground truth camera locations, the method is useful in environments such as indoors or with noisy GPS, and in cases where the only input for the tracked object is an image. The experimental results confirm the important role of visual information in MOT.

The experimental results also show that the processing speed is suitable for real-time applications. The speed of the tracking algorithm increases when the Particle filter is integrated with YOLOv3, because the tracking algorithm can use either method to track the object, and when it uses YOLOv3 it runs very fast. Because of this, the speed increases significantly, reaching more than three times that of the conventional method. In terms of accuracy, the tracking-by-detection method is also more effective than the single-tracker method, based on the 2D IoU metric and the tracked object center.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, a unified system consisting of robot localization, environment reconstruction and object tracking based on a stereo camera and IMU has been proposed.

In localization, the paper's contribution is a solution to estimate the camera position and the tracked object position based on a stereo camera and IMU with the removal of moving features. It has been shown that the accuracy of localization is improved and that the computational time is suitable for real-time applications.

In tracking, the paper's contributions are: (1) particles are guided by a motion vector calculated from pairs of SURF feature points, so the direction the object is heading is captured; (2) an observation (matching) model containing a correlation filter and a deep neural network (VGG-19) can deal with translations of the object; (3) tracking by detection with an object detection algorithm (YOLOv3) supports the single tracker and makes it more accurate, because it can supply more candidates for the particle filter and can also detect objects very quickly and accurately.

Although the framework works well on the KITTI dataset, both the localization and tracking algorithms still need to be improved.

The localization algorithm should be tested on a real robot; it can also pave the way for dynamic obstacle avoidance, and combining lidar with the stereo camera is a promising direction to explore.

In the tracking algorithm, the most time-consuming step is the matching step of the Particle Filter, because the Correlation Filter is trained at each frame, which increases the computational time. In the future, the Correlation Filter should be replaced by a pre-trained model, such as a Siamese network, that can compare the features of the target at consecutive times in real time.

ACKNOWLEDGMENT

This research is funded by Viet Nam National University Ho Chi Minh City (VNUHCM) under grant no. B2018-18-01.

Thanks to Erman Tjiputra, Director of AIOZ Pte Ltd, and CTO Quang D. Tran for their valuable support on the internship cooperation.

REFERENCES

[1] Q. Zhao, Z. Yang and H. Tao, "Differential earth mover's distance with its applications to visual tracking," in Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 274-287, 2010.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, pp. 886-893, 2005.
[3] D. Ta, W. Chen, N. Gelfand and K. Pulli, "Surftrac: Efficient tracking and continuous object recognition using local feature descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2937-2944, 2009.
[4] D. A. Ross, J. Lim, R.-S. Lin and M.-H. Yang, "Incremental learning for robust visual tracking," in International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125-141, 2008.
[5] D. S. Bolme, J. R. Beveridge, B. A. Draper and Y. M. Lui, "Visual Object Tracking using Adaptive Correlation Filters," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2544-2550, 2010.
[6] H. K. Galoogahi, T. Sim and S. Lucey, "Multi-Channel Correlation Filters," in Proceedings of the IEEE International Conference on Computer Vision, pp. 3072-3079, 2013.
[7] J. F. Henriques, R. Caseiro, P. Martins and J. Batista, "High-Speed Tracking with Kernelized Correlation Filters," in Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583-596, 2014.
[8] X. Li, K. Wang, W. Wang and Y. Li, "A multiple object tracking method using Kalman filter," in Proceedings of the 2010 IEEE International Conference on Information and Automation, pp. 1862-1866, 2010.
[9] P. Kalane, "Target Tracking Using Kalman Filter," in International Journal of Science & Technology (IJST), vol. 2, no. 2, Article ID IJST/0412/03, 2012.
[10] H. A. Patel and D. G. Thakore, "Moving Object Tracking Using Kalman Filter," in International Journal of Computer Science and Mobile Computing, vol. 2, no. 4, pp. 326-332, 2013.
[11] K. Nummiaro, E. Koller-Meier and L. V. Gool, "Object Tracking with an Adaptive Color-Based Particle Filter," in DAGM 2002: Pattern Recognition, Lecture Notes in Computer Science (LNCS, vol. 2449), pp. 353-360, 2002.
[12] S. Prabu and G. Hu, "Stereo Vision based Localization of a Robot using Partial Depth Estimation and Particle Filter," in Proceedings of the 19th World Congress, The International Federation of Automatic Control, vol. 47, no. 3, pp. 7272-7277, 2014.
[13] S. Chen and W. Liang, "Visual Tracking by Combining Deep Learned Image Representation with Particle Filter," in ICIC Express Letters, Part B: Applications, vol. 3, no. 1, pp. 1-6, 2012.
[14] C. Ma, J. B. Huang, X. Yang and M. H. Yang, "Robust Visual Tracking via Hierarchical Convolutional Features," in Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence (Early Access), pp. 1-1, 2018.
[15] R. J. Mozhdehi and H. Medeiros, "Deep Convolutional Particle Filter for Visual Tracking," in Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 3650-3654, 2017.
[16] T. Zhang, C. Xu and M. H. Yang, "Multi-task Correlation Particle Filter for Robust Object Tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4819-4827, 2017.
[17] J. S. Lim and W. H. Kim, "Detection and Tracking Multiple Pedestrians from a Moving Camera," in ISVC 2005: Advances in Visual Computing, LNCS 3804, pp. 527-532, 2005.
[18] Y. Chen, R. H. Zhang, L. Shang and E. Hu, "Object detection and tracking with active camera on motion vectors of feature points and particle filter," in The Review of Scientific Instruments, vol. 84, no. 6, 2013.
[19] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," in University of Washington, 2018.
[20] S. Shen, Y. Mulgaonkar, N. Michael and V. Kumar, "Vision-Based State Estimation for Autonomous Rotorcraft MAVs in Complex Environments," in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1758-1764, 2013.
[21] L. Gong, M. Yu and T. Gordon, "Online codebook modeling based background subtraction with a moving camera," in 2017 3rd International Conference on Frontiers of Signal Processing (ICFSP), pp. 136-140, 2017.
[22] S. Minaeian, J. Liu and Y. J. Son, "Effective and Efficient Detection of Moving Targets from a UAV's Camera," in IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 2, pp. 497-506, 2018.
[23] F. Zhong, S. Wang, Z. Zhang, C. Zhou and Y. Wang, "Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial," in IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1001-1010, 2018.
[24] D. Nistér, O. Naroditsky and J. Bergen, "Visual Odometry," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1, pp. I-I, 2004.
[25] B. Kitt, A. Geiger and H. Lategahn, "Visual Odometry based on Stereo Image sequences with RANSAC-based Outlier Rejection Scheme," in 2010 IEEE Intelligent Vehicles Symposium, pp. 486-492, 2010.
[26] Y. Liu, Y. Gu, J. Li and X. Zhang, "Robust Stereo Visual Odometry Using Improved RANSAC-Based Methods for Mobile Robot Localization," in Sensors 2017, vol. 17, no. 10, 2017.
[27] Y. Xu, V. John, S. Mita et al., "3D Point Cloud Map Based Vehicle Localization Using Stereo Camera," in 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 487-492, 2017.
[28] S. Hong, M. Li, M. Liao and P. v. Beek, "Real-time mobile robot navigation based on stereo vision and low-cost GPS," in Intelligent Robotics and Industrial Applications using Computer Vision 2017, pp. 10-15(6), 2017.
[29] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool, "Speeded-Up Robust Features (SURF)," in Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[30] M. Muja and D. G. Lowe, "Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration," in VISAPP International Conference on Computer Vision Theory and Applications, vol. 1, 2009.
[31] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," in International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[32] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI '81), pp. 674-679, 1981.
[33] S. Kim, K. Yun, K. Yi, S. Kim and J. Choi, "Detection of moving objects with a moving camera using non-panoramic background model," in Machine Vision and Applications, vol. 24, no. 5, pp. 1015-1028, 2013.
[34] R. I. Hartley and A. Zisserman, "Multiple View Geometry in Computer Vision," Cambridge University Press, 2004.
[35] R. Haralick, C. Lee, K. Ottenberg and M. Nolle, "Review and Analysis of Solutions of the Three Point Perspective Pose Estimation Problem," in International Journal of Computer Vision, vol. 13, no. 3, pp. 331-356, 1994.
[36] D. Nister, "An efficient solution to the five-point relative pose problem," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-195, 2003.
[37] A. Geiger, P. Lenz and R. Urtasun, "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[38] S. Minaeian, J. Liu and Y. J. Son, "Effective and Efficient Detection of Moving Targets from a UAV's Camera," in IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 2, pp. 497-506, 2018.
[39] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2015.
[40] T. Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollar, "Focal Loss for Dense Object Detection," in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999-3007, 2017.
[41] W. Liu, D. Anguelov, D. Erhan et al., "SSD: Single Shot MultiBox Detector," in Computer Vision – ECCV 2016, pp. 21-37, 2016.
[42] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," 2007.
[43] C. C. Lin, "Detecting and Tracking Moving Objects from a Moving Platform," Georgia Institute of Technology, 2012.