The World of Fast Moving Objects
Denys Rozumnyi 1,3   Jan Kotěra 2   Filip Šroubek 2   Lukáš Novotný 1   Jiří Matas 1
1 CMP, Czech Technical University in Prague   2 UTIA, Czech Academy of Sciences   3 SGN, Tampere University of Technology
...appearance model of the object. Properties like the rotation axis, angular velocity and object appearance require precise modeling of the image formation (acquisition) process. The proposed method thus proceeds by solving a blind space-variant deconvolution problem with occlusion.

Detection, tracking and appearance reconstruction of FMOs allow performing tasks with applications in diverse areas. We show, for instance, the ability to synthesize realistic videos with higher frame rates, i.e. to perform temporal super-resolution. The extracted properties of the FMO, such as trajectory, rotation angle and velocity, have applications, e.g. in sports analytics.

The rest of the paper is organized as follows: related work is discussed in Section 2. Section 3 defines the main concepts arising in the problem. Section 4 explains in detail the proposed method for FMO localization. The estimation of intrinsic and extrinsic properties, formulated as an optimization problem, is presented in Section 5. In Section 6, the FMO annotated dataset of 16 videos is introduced. Finally, the method is evaluated and its different applications are demonstrated in Section 7.

Figure 2: The FMO dataset includes motions that are an order of magnitude faster than in three standard datasets: ALOV, VOT and OTB [22, 15, 27]. Normalized histograms of projected object speeds in pixels (left) and intersection over union (IoU) of bounding boxes between adjacent frames (right).
2. Related Work

Tracking is a key problem in video processing. A range of methods has been proposed, based on diverse principles such as correlation [4, 9, 8], feature point tracking [24], mean-shift [6, 25], and tracking-by-detection [29, 12]. The literature occasionally refers to fast moving objects, but only the case with no significant blur is considered, e.g. [28, 17].

Object blur is a cue for object motion, since the blur size and shape encode information about the motion. However, classical tracking methods suffer from blur, and FMOs consist predominantly of blur. Most motion deblurring methods assume that the degradation can be modeled locally by linear motion. One category of methods works with occlusion and considers the object's blurred transparency map [13]. Blind deconvolution of the transparency map is easier than deconvolution of the full image, since the latent sharp map is a binary image. The same idea applied to rotating objects was proposed in [21]. An interesting variation was proposed in [7], where linear motion blur is estimated locally using a relation similar to optical flow. The main drawback of these methods is that an accurate estimation of the transparency map by alpha matting algorithms [18] is necessary.

Methods exploiting the fact that autocorrelation increases in the direction of blur were proposed to deal with objects moving over static backgrounds [5, 14]. Similarly, in [19, 23] autocorrelation was considered for detecting motion of the whole scene caused by camera motion. However, all these methods require a relatively large neighborhood to estimate the blur parameters, which makes them unsuitable for small moving objects. Dealing simultaneously with object rotation has not been considered in the literature so far.
3. Problem definition

FMOs are objects that move over a distance large compared to their size during the exposure time of a single frame, and possibly also rotate along an arbitrary axis with an unknown angular speed. For simplicity, we assume a single object F moving over a static background B; an extension to multiple objects is relatively straightforward. To get close to the static-background setting, camera motion is assumed to be compensated by video stabilization.

Let a recorded video sequence consist of frames I1(x), ..., In(x), where x ∈ R2 is a pixel coordinate. Frame It is formed as

    It(x) = (1 − [Ht M](x)) B(x) + [Ht F](x) ,   (1)

where M is the indicator function of F. In general, the operator Ht models the blur caused by object motion and rotation, and performs the 3D→2D projection of the object representation F onto the image plane. This operator depends mainly on three parameters, {Pt, at, φt}, which are the FMO trajectory (path), and the axis and angle of rotation, respectively. The function [Ht M](x) corresponds to the object visibility map (the alpha matte, i.e. the relative duration of object presence during the exposure) and appears in (1) to merge the blurred object and the partially visible background.

The object trajectory Pt can be represented in the image plane as a path (set of pixels) along which the object moves during the frame exposure. In the case of no rotation, or when F is homogeneous, i.e. the surface is uniform and rotation is therefore not perceivable, Ht simplifies to a convolution in the image plane, i.e. [Ht F](x) = (1/|Pt|) [Pt ∗ F](x), where |Pt| is the path length; F can then be viewed as a 2D image.
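For intuition, the no-rotation instance of (1) is easy to simulate: blur both the appearance F and the indicator M with the normalized trajectory kernel Pt/|Pt| and composite the result over B. Below is a minimal numpy sketch; the rasterized-segment kernel, the frame/kernel shapes, and the [0, 1] image range are our assumptions, not the paper's implementation:

    import numpy as np
    from scipy.signal import fftconvolve

    def trajectory_kernel(size, p0, p1):
        """Rasterize a linear trajectory Pt (endpoints p0, p1 in kernel
        coordinates) as a normalized blur kernel, i.e. Pt / |Pt|."""
        K = np.zeros((size, size))
        n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
        for t in np.linspace(0.0, 1.0, n):
            x = int(round(p0[0] + t * (p1[0] - p0[0])))
            y = int(round(p0[1] + t * (p1[1] - p0[1])))
            K[y, x] = 1.0
        return K / K.sum()

    def render_frame(B, F, M, K):
        """Formation model (1), no-rotation case:
        It = (1 - K*M) . B + K*F, with '*' a 2D convolution.
        B, F: HxWx3 floats in [0, 1]; M: HxW indicator of F."""
        HM = fftconvolve(M, K, mode="same")            # visibility map [Ht M]
        HF = np.stack([fftconvolve(F[..., c], K, mode="same")
                       for c in range(3)], axis=-1)    # blurred object [Ht F]
        return (1.0 - HM)[..., None] * B + HF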
Finding all the intrinsic and extrinsic properties of arbitrary FMOs means estimating both F and Ht, which is, at this moment, an intractable task. To alleviate the problem, some prior knowledge of F is necessary. In our case, the prior is in the form of the object shape. Since in most sport videos the FMOs are spheres (balls), we continue our theoretical analysis focusing on spherical objects, although, as we further demonstrate, the proposed localization method can also successfully handle objects of significantly different shapes.

We propose methods for two tasks: (i) efficient and reliable FMO localization, i.e. detection and tracking, and (ii) reconstruction of the FMO appearance and of the axis and angle of the object rotation, which requires the precise output of (i). For tracking, we use a simplified version of (1) and approximate the FMO by a homogeneous circle determined by two parameters: color µ and radius r. The tracker output (trajectory Pt and radius r) is then used to initialize the precise estimation of appearance using the full model (1).

          Detector                 Re-detector              Tracker
    IN    It−1, It, It+1           It−1, It, It+1,          It, ε,
                                   µ, r, Pt−1, ε            µ, r, Pt−1
    OUT   µ, r, Pt                 Pt                       Pt
    ASM   high contrast,           high contrast,           linear traj.,
          fast movement,           fast movement,           model
          no contact with          model
          moving objects,
          no occlusion

Table 1: Inputs, outputs and assumptions of each algorithm. The image frame at time t is denoted by It. Symbols µ and r are used for the FMO intrinsics: mean color and radius. The FMO trajectory in It is marked by Pt, the camera exposure fraction by ε.
4. Localization of FMOs

The proposed FMO localization pipeline consists of three algorithms that differ in their prerequisites and speed. First, the pipeline runs the fastest algorithm and terminates if a fast moving object is localized; otherwise, it proceeds to run the remaining two more complex and general algorithms. This strategy produces an efficient localization method that operates successfully in a broader range of conditions than any of the three algorithms alone. We call them detector, re-detector, and tracker; their basic properties are outlined in Tab. 1.

The first algorithm, the detector, discovers previously unseen FMOs and establishes their properties. It requires sufficient contrast between the object and the background, an unoccluded view in three consecutive frames, and no interference with other moving objects. Once detected, the FMO can be tracked by either of the other two algorithms. The second algorithm, the re-detector, is applied in a region predicted from the FMO trajectory in the previous frames. It handles partial occlusions and object-background appearance similarity while being as fast as the detector. Finally, the tracker searches for the object by synthesizing its appearance with the background at the predicted locations.

All three algorithms require a static background or a registration of consecutive frames. To this end, we apply video stabilization, estimating the affine transformation between frames with RANSAC [10] on matches of FREAK descriptors [1] computed at FAST features [20].
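The following is a sketch of this stabilization step using OpenCV, registering the current frame to the previous one (FREAK requires the opencv-contrib package). The FAST threshold and the RANSAC reprojection tolerance are our assumptions, not values from the paper:

    import cv2
    import numpy as np

    def register_to_previous(prev_gray, curr_gray):
        """Estimate the affine warp mapping curr_gray onto prev_gray:
        FAST corners, FREAK descriptors, RANSAC affine fitting."""
        fast = cv2.FastFeatureDetector_create(threshold=20)   # assumed value
        freak = cv2.xfeatures2d.FREAK_create()                # opencv-contrib
        kp1 = fast.detect(prev_gray, None)
        kp2 = fast.detect(curr_gray, None)
        kp1, des1 = freak.compute(prev_gray, kp1)
        kp2, des2 = freak.compute(curr_gray, kp2)

        # FREAK is a binary descriptor, so match with Hamming distance.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des1, des2)

        src = np.float32([kp2[m.trainIdx].pt for m in matches])  # current
        dst = np.float32([kp1[m.queryIdx].pt for m in matches])  # previous
        A, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                          ransacReprojThreshold=3.0)
        return A

    # Usage: warp the current frame into the previous frame's coordinates.
    # stabilized = cv2.warpAffine(curr, A, (curr.shape[1], curr.shape[0]))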
The detector also updates the FMO model properties required by the re-detector and the tracker, namely the FMO color µ and radius r. For increased stability, the new value of either parameter is a weighted average of the detected value and the previous value, using the forgetting function proposed in [26]. For each video sequence we also need to determine the so-called exposure fraction ε, which is the ratio of the exposure period to the time difference between consecutive frames (e.g. a 25 fps video with 1/50 s exposure has ε = 0.5). This can be done from any two subsequent FMO detections; we use the average over multiple observations.

We need three consecutive video frames to localize the FMO in the middle frame, which causes a constant delay of one frame in real-time processing, but this does not present any obstacle in practice.
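Concretely, if a detection in one frame has blur-streak length |Pt| and the gap between the end of that streak and the start of the next frame's streak is g, then under constant object speed ε = |Pt| / (|Pt| + g), consistent with the fact (used in Sec. 4.2) that (1/ε)|Pt| is the full inter-frame path length. A hypothetical helper; the function names and the averaging scheme are ours:

    import numpy as np

    def exposure_fraction(path_len, gap_len):
        """Exposure fraction from one pair of consecutive detections.
        path_len: streak length |Pt| (shutter open);
        gap_len:  distance between this streak's end and the next
                  streak's start (shutter closed).
        Assumes roughly constant object speed over the frame period."""
        return path_len / (path_len + gap_len)

    def average_exposure_fraction(pairs):
        """Average over several (path_len, gap_len) observations, as the
        text suggests using multiple detections."""
        return float(np.mean([exposure_fraction(p, g) for p, g in pairs]))

    # e.g. 25 fps with a 1/50 s shutter: streak and gap have equal length,
    # so exposure_fraction(l, l) == 0.5 for any l > 0.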
4.1. Detector

The detector is the only fully generic algorithm for FMO localization; it requires no input except three consecutive image frames It−1, It, It+1. First we compute the differential images ∆+ = |It − It−1|, ∆0 = |It+1 − It−1| and ∆− = |It − It+1|. These are binarized by thresholding (binarization is denoted by the superscript b), and the resulting images are combined by a boolean operation into a single binary image

    ∆ = ∆b+ ∧ ∆b− ∧ ¬∆b0 .   (2)

This image contains all objects that are present in frame It but in neither It−1 nor It+1, i.e. the objects moving in It.

The second step identifies all objects that can be explained by the FMO motion model. We calculate the trajectory Pt and radius r of each FMO candidate and determine whether it satisfies the motion model. For each connected component C in ∆, we compute the distance transform to get, for each inner pixel x ∈ C, the minimal distance d(x) to a pixel on the component's contour. The maximum of these distances over the component is its radius, r = max d(x), x ∈ C. Next, we determine the trajectory by morphologically thinning the pixels x that satisfy d(x) > ψr, where the threshold ψ is set to 0.7. Finally, we decide whether the object satisfies the FMO motion model by verifying two conditions: (i) the trajectory Pt must be a
Figure 3: (a) FMO detection, (b) re-detection where detection failed because the FMO is not a single connected component, (c) tracking where both algorithms failed due to an imprecise ∆. Top row: cropped It with Pt−1 (blue) and Pt (green) contours. Bottom row: binary differential image ∆.

Figure 4: Detection of FMOs. The three differential images of three consecutive frames are binarized (∆b+, ∆b0, ∆b−), combined by a boolean operation, and the connected components are checked against the FMO model. The two moving objects detected in this frame are the ball and the white stripe on the player's t-shirt; only the ball passed the check and was marked as an FMO.
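Below is a sketch of the detection step of Sec. 4.1 using OpenCV and scikit-image. The binarization threshold and the single-channel input are our assumptions; ψ = 0.7 follows the text, and the final motion-model conditions, which are truncated in this excerpt, are only indicated by a comment:

    import cv2
    import numpy as np
    from skimage.morphology import thin

    PSI = 0.7  # trajectory threshold from Sec. 4.1

    def detect_fmo_candidates(I_prev, I, I_next, bin_thresh=25):
        """FMO candidates in the middle frame It (Sec. 4.1).
        Inputs are consecutive grayscale frames (uint8);
        bin_thresh is an assumed binarization threshold.
        Returns a list of (radius, trajectory_mask) per candidate."""
        d_plus = cv2.absdiff(I, I_prev)
        d_zero = cv2.absdiff(I_next, I_prev)
        d_minus = cv2.absdiff(I, I_next)
        b = lambda d: d > bin_thresh  # the superscript-b binarization
        # Eq. (2): present in It, absent in both It-1 and It+1.
        delta = (b(d_plus) & b(d_minus) & ~b(d_zero)).astype(np.uint8)

        candidates = []
        n, labels = cv2.connectedComponents(delta)
        for k in range(1, n):
            comp = (labels == k).astype(np.uint8)
            # Distance of every inner pixel to the component contour.
            dist = cv2.distanceTransform(comp, cv2.DIST_L2, 5)
            r = dist.max()
            # Trajectory: morphological thinning of pixels d(x) > psi*r.
            traj = thin(dist > PSI * r)
            # The motion-model checks on traj (e.g. that it forms a
            # single curve) follow here; their exact wording is
            # truncated in this excerpt.
            candidates.append((r, traj))
        return candidates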
4.2. Re-detector

The re-detector requires knowledge of the FMO, but allows one FMO occurrence to be composed of several connected components in ∆ (e.g. when the FMO passes in front of background of similar color). Fig. 3 shows an example where the re-detector finds an FMO missed by the detector.

The re-detector operates on a rectangular window of the binary differential image ∆ from (2), restricted to the local neighborhood of the previous FMO localization. Let Pt−1 be the trajectory from the previous frame It−1; the re-detector then works in the square neighborhood with side 4·(1/ε)|Pt−1|, centered on the position of the previous localization. Note that (1/ε)|Pt−1|, where ε is the exposure fraction, is the full trajectory length between It−1 and It. For each connected component in this region, the trajectory Pt and radius r are computed in the same way as in the detection algorithm. The mean color µ is obtained by averaging all pixel values on the trajectory. Connected components in this region are selected as the FMO if their model parameters (µ, r) are close to the previous ones: the Euclidean distance in RGB ‖µ − µ0‖2 and the normalized difference |r − r0|/r0 must be below the prescribed thresholds 0.3 and 0.4, respectively, where (µ0, r0) denotes the previous FMO parameters.
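The acceptance test of the re-detector is a simple gate on the model parameters. A sketch, assuming RGB colors scaled to [0, 1] (for 8-bit colors the color threshold would have to be rescaled):

    import numpy as np

    COLOR_TOL = 0.3   # Euclidean RGB distance threshold (Sec. 4.2)
    RADIUS_TOL = 0.4  # normalized radius-difference threshold (Sec. 4.2)

    def window_side(path_len_prev, eps):
        """Side of the square search window around the last localization:
        4 * (1/eps) * |P_{t-1}|, i.e. four full inter-frame path lengths."""
        return 4.0 * path_len_prev / eps

    def matches_model(mu, r, mu0, r0):
        """Accept a connected component as the FMO if its mean color mu
        and radius r are close to the previous model (mu0, r0)."""
        color_ok = np.linalg.norm(np.asarray(mu) - np.asarray(mu0)) < COLOR_TOL
        radius_ok = abs(r - r0) / r0 < RADIUS_TOL
        return color_ok and radius_ok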
4.3. Tracker

The final attempt to find the FMO, after both the detector and the re-detector have failed, is the tracker, which uses image synthesis. The tracker is based on the formation model (1), simplified by assuming an object F of color µ and radius r moving along a linear trajectory Pt. The indicator function M is then a ball of radius r, and given the trajectory Pt, the alpha value [Ht M](x) from (1) is

    A(x|Pt) = (1/|Pt|) [Pt ∗ M](x) = (1/|Pt|) ∫|z|≤r Pt(x − z) dz .   (3)

For linear trajectories this integral can be solved analytically. Let D(x, Pt) denote the distance from x to the trajectory Pt; then

    A(x|Pt) ≈ (2/|Pt|) √max(r2 − D2(x, Pt), 0) .   (4)
This approximation is inaccurate only in the neighborhood of the starting and ending points of the trajectory, and for FMOs this area is small compared to the central section of the trajectory. Using the above relation, It in (1) can be written in the simpler form

    Ît(x|Pt) = (1 − A(x|Pt)) B(x) + µ A(x|Pt) .   (5)

Figure 5: Tracking steps: (1) detection of orientation, (2) detection of the starting point, (3) detection of the ending point. The previous detection is in blue. The green cross denotes the minimizer, red crosses the initial guess. All sampled points (gray) are shaded according to their cost (6) (the darker, the higher the cost).
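Below is a sketch of the synthesis the tracker scores candidate trajectories with: the closed-form alpha matte (4) and the composite (5), both for a linear segment. Since the cost (6) is not reproduced in this excerpt, the sum of absolute differences used here is our stand-in, not necessarily the paper's choice:

    import numpy as np

    def alpha_map(h, w, p0, p1, r):
        """Closed-form alpha matte (4) of a ball of radius r blurred
        along the segment p0 -> p1 (the linear trajectory Pt)."""
        ys, xs = np.mgrid[0:h, 0:w]
        pts = np.stack([xs, ys], axis=-1).astype(float)
        p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
        seg = p1 - p0
        length = np.linalg.norm(seg)
        # Distance D(x, Pt) from every pixel to the segment.
        t = np.clip(((pts - p0) @ seg) / (length**2 + 1e-12), 0.0, 1.0)
        proj = p0 + t[..., None] * seg
        D = np.linalg.norm(pts - proj, axis=-1)
        return (2.0 / max(length, 1e-12)) * np.sqrt(
            np.maximum(r**2 - D**2, 0.0))

    def synthesize(B, mu, p0, p1, r):
        """Composite (5): blurred ball of color mu over background B."""
        A = alpha_map(B.shape[0], B.shape[1], p0, p1, r)[..., None]
        return (1.0 - A) * B + A * np.asarray(mu, float)

    def cost(I, B, mu, p0, p1, r):
        """Discrepancy between the synthesized and the observed frame;
        an assumed stand-in for the paper's cost (6)."""
        return np.abs(synthesize(B, mu, p0, p1, r) - I).sum()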
...by the FMO detector. The objective is to estimate the appearance F, which is essentially a (modified) blind image deblurring task. One has to first estimate the blur-and-projection operator H, and then solve the non-blind deblurring task for F. As mentioned in Sec. 3, to make the estimation of H tractable, we focus on ball-like objects moving (approximately) parallel to the image plane while undergoing arbitrary 3D rotation. As in the FMO tracking, if the object rotation is negligible or unperceivable, the operator H is fully determined by the object trajectory, and we can proceed directly to the non-blind estimation of F. Let us first focus on the estimation of F and then on the problem of obtaining H.

Let F denote some representation of the object appearance: in the absence of rotation, this can directly be the image of the object projected into the video frame; when 3D rotation is present, we use a spherical parametrization to capture the whole surface. Following the model (1), we solve the problem

    min_F ‖(1 − [HM])B + [HF] − I‖1 + α‖DF‖1 ,   (7)
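In the no-rotation case, H is a convolution with the trajectory kernel (Sec. 3), so (7) reduces to a TV-L1-style non-blind deconvolution on the object support. The toy solver below replaces both L1 norms with a smooth Charbonnier approximation and runs plain gradient descent; the smoothing, step size, iteration count and the choice of D as the forward-difference image gradient are our assumptions, not the paper's optimizer:

    import numpy as np
    from scipy.signal import fftconvolve

    def grad(F):
        """Forward differences; D in (7) assumed to be the image gradient."""
        gx = np.diff(F, axis=1, append=F[:, -1:])
        gy = np.diff(F, axis=0, append=F[-1:, :])
        return gx, gy

    def estimate_F(I, B, K, M, alpha=0.05, step=0.5, iters=300, eps=1e-3):
        """Toy gradient-descent solver for (7), no-rotation case.
        I, B: observed frame and background (grayscale, float in [0, 1]);
        K:    normalized trajectory blur kernel Pt / |Pt|;
        M:    sharp object indicator (disk of radius r), so [HM] = K * M.
        Charbonnier smoothing sqrt(x^2 + eps^2) replaces both L1 norms."""
        HM = fftconvolve(M, K, mode="same")
        target = I - (1.0 - HM) * B        # residual that [HF] must explain
        F = M * 0.5                        # flat initialization on the support
        Kf = K[::-1, ::-1]                 # flipped kernel for the adjoint
        for _ in range(iters):
            r = fftconvolve(F, K, mode="same") - target
            data_g = fftconvolve(r / np.sqrt(r**2 + eps**2), Kf, mode="same")
            gx, gy = grad(F)
            # Gradient of the smoothed TV term: negative divergence of
            # the normalized gradient field.
            tv_g = -(np.diff(gx / np.sqrt(gx**2 + eps**2), axis=1,
                             prepend=np.zeros((F.shape[0], 1)))
                     + np.diff(gy / np.sqrt(gy**2 + eps**2), axis=0,
                               prepend=np.zeros((1, F.shape[1]))))
            F = np.clip(F - step * (data_g + alpha * tv_g), 0.0, 1.0) * (M > 0)
        return F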
         Sequence             #     Precision  Recall  F-score
     1   volleyball           50    100.0       45.5    62.5
     2   volleyball passing   66     21.8       10.4    14.1
     3   darts                75    100.0       26.5    41.7
     4   darts window         50     25.0       50.0    33.3
     5   softball             96     66.7       15.4    25.0
     6   archery             119      0.0        0.0     0.0
     7   tennis serve side    68    100.0       58.8    74.1
     8   tennis serve back   156     28.6        5.9     9.8
     9   tennis court        128      0.0        0.0     0.0
    10   hockey              350    100.0       16.1    27.7
    11   squash              250      0.0        0.0     0.0
    12   frisbee             100    100.0      100.0   100.0
    13   blue ball            53    100.0       52.4    68.8
    14   ping pong tampere   120    100.0       88.7    94.0
    15   ping pong side      445     12.1        7.3     9.1
    16   ping pong top       350     92.6       87.8    90.1
         Average               -     59.2       35.5    40.6

Table 2: Performance of the proposed method on the FMO dataset. We report precision, recall and F-score [%]. The number of frames is indicated by #.

    Sq. name            MEEM [29]  ASMS [25]  SRDCF [8]  DSST [9]    ?  Proposed
    volleyball                 80          0         50         0   10        46
    volleyball passing         12          6         95        88    8        10
    darts                       3          0          6         0    0        27
    darts window                0          0          0         0    0        50
    softball                    0          0          0         0    0        15
    archery                     5          5          5         5    0         0
    tennis serve side           7          0          0         0    6        59
    tennis serve back           5          0          0         0    3         6
    tennis court                0          0          3         3    0         0
    hockey                      0          0          0         0    0        16
    squash                      0          0          0         0    0         0
    frisbee                    65          0          6         6    0       100
    blue ball                  30          0          0         0   25        52
    ping pong tampere           0          0          0         0    0        89
    ping pong side              1          0          0         0    0         7
    ping pong top               0          0          0         0    1        88
    Average                    17          1          1         1    3        36

Comparison with standard trackers on the FMO dataset (per-sequence scores in %); the proposed method is in the last column.
Figure 8: Reconstruction of an FMO blurred by motion and rotation. (a) Input video frame. (b) Top row: actual frames from a high-speed camera (250 fps). Bottom row: frames at the corresponding times, reconstructed from a single frame of a regular 25 fps camera, i.e. 10× temporal super-resolution. The top-left image shows the rotation axis position estimated from the blurred frame (blue cross) and from the high-speed video (red circle).