
YolactEdge: Real-time Instance Segmentation on the Edge

Haotian Liu∗, Rafael A. Rivera Soto∗, Fanyi Xiao, and Yong Jae Lee

arXiv:2012.12259v2 [cs.CV] 1 Apr 2021

Abstract— We propose YolactEdge, the first competitive instance segmentation approach that runs on small edge devices at real-time speeds. Specifically, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7 FPS on an RTX 2080 Ti) with a ResNet-101 backbone on 550x550 resolution images. To achieve this, we make two improvements to the state-of-the-art image-based real-time method YOLACT [1]: (1) applying TensorRT optimization while carefully trading off speed and accuracy, and (2) a novel feature warping module to exploit temporal redundancy in videos. Experiments on the YouTube VIS and MS COCO datasets demonstrate that YolactEdge produces a 3-5x speed up over existing real-time methods while producing competitive mask and box detection accuracy. We also conduct ablation studies to dissect our design choices and modules. Code and models are available at https://github.com/haotian-liu/yolact_edge.

(Fanyi Xiao is with Amazon Web Services, Inc.; the rest are with the University of California, Davis. {lhtliu, riverasoto, fyxiao, yongjaelee}@ucdavis.edu. * Haotian Liu and Rafael A. Rivera Soto are co-first authors.)

I. INTRODUCTION

Instance segmentation is a challenging problem that requires the correct detection and segmentation of each object instance in an image. A fast and accurate instance segmenter would have many useful applications in robotics, autonomous driving, image/video retrieval, healthcare, security, and others. In particular, a real-time instance segmenter that can operate on small edge devices is necessary for many real-world scenarios. For example, in safety critical applications in complex environments, robots, drones, and other autonomous machines may need to perceive objects and humans in real-time on device – without having access to the cloud, and in resource constrained settings where bulky and power hungry GPUs (e.g., Titan Xp) are impractical. However, while there has been great progress in real-time instance segmentation research [1], [2], [3], [4], [5], [6], [7], thus far, there is no method that can run accurately at real-time speeds on small edge devices like the Jetson AGX Xavier.

In this paper, we present YolactEdge, a novel real-time instance segmentation approach that runs accurately on edge devices at real-time speeds. Specifically, with a ResNet-101 backbone, YolactEdge runs at up to 30.8 FPS on a Jetson AGX Xavier (and 172.7 FPS on an RTX 2080 Ti GPU), which is 3-5x faster than existing state-of-the-art real-time methods, while being competitive in accuracy.

In order to perform inference at real-time speeds on edge devices, we build upon the state-of-the-art image-based real-time instance segmentation method, YOLACT [1], and make two fundamental improvements, one at the system-level and the other at the algorithm-level: (1) we apply NVIDIA's TensorRT inference engine [8] to quantize the network parameters to fewer bits while systematically balancing any tradeoff in accuracy, and (2) we leverage temporal redundancy in video (i.e., temporally nearby frames are highly correlated), and learn to transform and propagate features over time so that the deep network's expensive backbone feature computation does not need to be fully computed on every frame.

The proposed shift to video from static image processing makes sense from a practical standpoint, as the real-time aspect matters much more for video applications that require low latency and real-time response than for image applications; e.g., for real-time control in robotics and autonomous driving, or real-time object/activity detection in security and augmented reality, where the system must process a stream of video frames and generate instance segmentation outputs in real-time. Importantly, all existing real-time instance segmentation methods (including YOLACT) are static image-based, which makes YolactEdge the first video-dedicated real-time instance segmentation method.

In sum, our contributions are: (1) we apply TensorRT optimization while carefully trading off speed and accuracy, (2) we propose a novel feature warping module to exploit temporal redundancy in videos, (3) we perform experiments on the benchmark image MS COCO [9] and video YouTube VIS [10] datasets, demonstrating a 3-5x faster speed compared to existing real-time instance segmentation methods while being competitive in accuracy, and (4) we publicly release our code and models to facilitate progress in robotics applications that require on-device real-time instance segmentation.

II. RELATED WORK

Real-time instance segmentation in images. YOLACT [1] is the first real-time instance segmentation method to achieve competitive accuracy on the challenging MS COCO [9] dataset. Recently, CenterMask [2], BlendMask [5], and SOLOv2 [3] have improved accuracy in part by leveraging more accurate object detectors (e.g., FCOS [11]). All existing real-time instance segmentation approaches [1], [2], [5], [6], [3] are image-based and require bulky GPUs like the Titan Xp / RTX 2080 Ti to achieve real-time speeds. In contrast, we propose the first video-based real-time instance segmentation approach that can run on small edge devices like the Jetson AGX Xavier.

Feature propagation in videos has been used to improve speed and accuracy for video classification and video object detection [12], [13], [14]. These methods use off-the-shelf optical flow networks [15] to estimate pixel-level object motion and warp feature maps from frame to frame. However, even the most lightweight flow networks [15], [16] require non-negligible memory and compute, which are obstacles for real-time speeds on edge devices.
In contrast, our model estimates object motion and performs feature warping directly at the feature level (as opposed to the input pixel level), which enables real-time speeds.

Improving model efficiency. Designing lightweight yet performant backbones and feature pyramids has been one of the main thrusts in improving deep network efficiency. MobileNetv2 [17] introduces depth-wise convolutions and inverted residuals to design a lightweight architecture for mobile devices. MobileNetv3 [18], NAS-FPN [19], and EfficientNet [20] use neural architecture search to automatically find efficient architectures. Others utilize knowledge distillation [21], [22], [23], model compression [24], [25], or binary networks [26], [27]. The CVPR Low Power Computer Vision Challenge participants have used TensorRT [8], a deep learning inference optimizer, to quantize and speed up object detectors such as Faster-RCNN on the NVIDIA Jetson TX2 [28]. In contrast to most of these approaches, YolactEdge retains large expressive backbones, and exploits temporal redundancy in video together with a TensorRT optimization for fast and accurate instance segmentation.

III. APPROACH

Our goal is to create an instance segmentation model, YolactEdge, that can achieve real-time (>30 FPS) speeds on edge devices. To this end, we make two improvements to the image-based real-time instance segmentation approach YOLACT [1]: (1) applying TensorRT optimization, and (2) exploiting temporal redundancy in video.

A. TensorRT Optimization

The edge device that we develop our model on is the NVIDIA Jetson AGX Xavier. The Xavier is equipped with an integrated Volta GPU with Tensor Cores, dual deep learning accelerator, 32GB of memory, and reaches up to 32 TeraOPS at a cost of $699. Importantly, the Xavier is the only architecture from the NVIDIA Jetson series that supports both FP16 and INT8 Tensor Cores, which are needed for TensorRT [29] optimization.

TensorRT is NVIDIA's deep learning inference optimizer that provides mixed-precision support, optimal tensor layout, fusing of network layers, and kernel specializations [8]. A major component of accelerating models using TensorRT is the quantization of model weights to INT8 or FP16 precision. Since FP16 has a wider range of precision than INT8, it yields better accuracy at the cost of more computational time. Given that the weights of different deep network components (backbone, prediction module, etc.) have different ranges, this speed-accuracy trade-off varies from component to component. Therefore, we convert each model component to TensorRT independently and explore the optimal mix between INT8 and FP16 weights that maximizes FPS while preserving accuracy.

Table I shows this analysis for YOLACT [1], which is the baseline model that YolactEdge directly builds upon. Briefly, YOLACT can be divided into 4 components: (1) a feature backbone, (2) a feature pyramid network [30] (FPN), (3) a ProtoNet, and (4) a Prediction Head; see Fig. 1 (right) for the network architecture. (More details on YOLACT will be provided in Sec. III-B.) The second row in Table I represents YOLACT, with all components in FP32 (i.e., no TensorRT optimization), and results in only 6.6 FPS on the Jetson AGX Xavier with a ResNet-101 backbone. From there, INT8 or FP16 conversion on different model components leads to various improvements in speed and changes in accuracy. Notably, conversion of the Prediction Head to INT8 (last four rows) always results in a large loss of instance segmentation accuracy. We hypothesize that this is because the final box and mask predictions require more than 2^8 = 256 bins to be encoded without loss in the final representation. Converting every component to INT8 except for the Prediction Head and FPN (row highlighted in gray) achieves the highest FPS with little mAP degradation. Thus, this is the final configuration we go with for our model in our experiments, but different configurations can easily be chosen based on need.

In order to quantize model components to INT8 precision, a calibration step is necessary: TensorRT collects histograms of activations for each layer, generates several quantized distributions with different thresholds, and compares each of them to the reference distribution using KL Divergence [31]. This step ensures that the model loses as little performance as possible when converted to INT8 precision. Table VIa shows the effect of the calibration dataset size. We observe that calibration is necessary for accuracy, and generally a larger calibration set provides a better speed-accuracy trade-off.

Backbone  FPN   ProtoNet  PredHead  TensorRT  mAP   FPS
FP32      FP32  FP32      FP32      N         29.8  6.4
FP16      FP16  FP16      FP16      N         29.7  12.1
FP32      FP32  FP32      FP32      Y         29.6  19.1
FP16      FP16  FP16      FP16      Y         29.7  21.9
INT8      FP16  FP16      FP16      Y         29.9  26.3
INT8      FP16  INT8      FP16      Y         29.9  26.5
INT8      INT8  FP16      FP16      Y         29.7  27.7
INT8      INT8  INT8      FP16      Y         29.8  27.4
INT8      FP16  FP16      INT8      Y         25.4  26.2
INT8      FP16  INT8      INT8      Y         25.4  25.9
INT8      INT8  FP16      INT8      Y         25.2  26.9
INT8      INT8  INT8      INT8      Y         25.2  26.5

TABLE I: Effect of Mixed Precision on YOLACT [1] with a ResNet-101 backbone on the MS COCO val2017 dataset with a Jetson AGX Xavier using 100 calibration images. Mixing precision across the modules results in different instance segmentation mean Average Precision (mAP) and FPS for each instantiation of YOLACT. All results are averaged over 5 runs, with a standard deviation less than 0.6 FPS.
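To make the component-wise precision search summarized in Table I concrete, here is a minimal Python sketch of the search space. It is not the authors' code: the component names and the build/benchmark steps are illustrative assumptions, and an actual conversion would go through a TensorRT wrapper (the paper converts each YOLACT component independently and calibrates INT8 components with 100 images).

```python
# A minimal sketch of the mixed-precision search space explored in Table I.
# The build/benchmark steps are left as comments because they depend on the
# deployment stack (e.g., a TensorRT wrapper); nothing here is the authors' API.
from itertools import product

COMPONENTS = ("backbone", "fpn", "protonet", "pred_head")   # YOLACT's 4 parts
PRECISIONS = ("fp16", "int8")                               # candidates per part

def candidate_configs():
    """Yield every per-component precision assignment (2^4 = 16 of them)."""
    for combo in product(PRECISIONS, repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, combo))

for cfg in candidate_configs():
    # For each cfg one would:
    #   1. build a TensorRT engine per component at cfg[name] precision,
    #      feeding ~100 calibration images for any INT8 component so TensorRT
    #      can pick per-layer thresholds (KL-divergence calibration, Sec. III-A);
    #   2. measure mAP on COCO val2017 and FPS on the Jetson AGX Xavier;
    #   3. keep the fastest configuration whose mAP stays close to the FP32
    #      baseline (the paper ends up with INT8 everywhere except the
    #      Prediction Head and FPN, which stay in FP16).
    print(cfg)
```

The exhaustive search is cheap here because there are only four components and two candidate precisions, so at most 16 engines need to be built and benchmarked.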
[Fig. 1: network diagram. Blocks: Feature Backbone → Feature Pyramid → ProtoNet and Prediction Head → Post-process. On the previous keyframe all blocks are computed; on the current non-key frame some pyramid features are transformed rather than computed. Legend: computed / transformed / not computed.]

Fig. 1: YolactEdge extends YOLACT [1] to video by transforming a subset of the features from keyframes (left) to non-keyframes (right), to reduce expensive backbone computation. Specifically, on non-keyframes, we compute C3 features, which are cheap to compute yet crucial for mask prediction given their high resolution. This largely accelerates our method while retaining accuracy on non-keyframes. We use blue, orange, and grey to indicate computed, transformed, and skipped blocks, respectively.

B. Exploiting Temporal Redundancy in Video

The TensorRT optimization leads to a ∼4x improvement in speed, and when dealing with static images, this is the version of YolactEdge one should use. However, when dealing with video, we can exploit temporal redundancy to make YolactEdge even faster, as we describe next.

Given an input video as a sequence of frames {I_i}, we aim to predict masks for each object instance in each frame {y_i = N(I_i)}, in a fast and accurate manner.

For our video instance segmentation network N, we largely follow the YOLACT [1] design for its simplicity and impressive speed-accuracy tradeoff. Specifically, on each frame, we perform two parallel tasks: (1) generating a set of prototype masks, and (2) predicting per-instance mask coefficients. Then, the final masks are assembled through linearly combining the prototypes with the mask coefficients.

For clarity of presentation, we decompose N into N_feat and N_pred, where N_feat denotes the feature backbone stage and N_pred is the rest (i.e., prediction heads for class, box, and mask coefficients, and ProtoNet for generating prototype masks), which takes the output of N_feat and makes instance segmentation predictions. We selectively divide frames in a video into two groups: keyframes I^k and non-keyframes I^n; the behavior of our model on these two groups of frames only varies in the backbone stage:

y^k = N_pred(N_feat(I^k))    (1)
y^n = N_pred(Ñ_feat(I^n))    (2)

For keyframes I^k, our model computes all backbone and pyramid features (C1–C5 and P3–P7 in Fig. 1), whereas for non-keyframes I^n, we compute only a subset of the features and transform the rest from the temporally closest previous keyframe using the mechanism that we elaborate on next. This way, we strike a balance between producing accurate predictions and maintaining a fast runtime.

Partial Feature Transform. Transforming (i.e., warping) features from neighboring keyframes was shown in [12] to be an effective strategy for reducing backbone computation and yielding fast video bounding box object detectors. Specifically, [12] transforms all the backbone features using an off-the-shelf optical flow network [15]. However, due to inevitable errors in optical flow estimation, we find that it fails to provide sufficiently accurate features required for pixel-level tasks like instance segmentation. In this work, we propose to perform partial feature transforms to improve the quality of the transformed features while still maintaining a fast runtime. Specifically, unlike [12], which transforms all features (P3^k, P4^k, P5^k in our case) from a keyframe I^k to a non-keyframe I^n, our method computes the backbone features for a non-keyframe only up through the high-resolution C3^n level (i.e., skipping C4^n, C5^n and consequently P4^n, P5^n computation), and only transforms the lower-resolution P4^k/P5^k features from the previous keyframe to approximate P4^n/P5^n (denoted as W4^n/W5^n) in the current non-keyframe, as shown in Fig. 1 (right). It computes P6^n/P7^n by downsampling W5^n in the same way as YOLACT. With the computed C3^n features and transformed W4^n features, it then generates P3^n as P3^n = C3^n + up(W4^n), where up(·) denotes upsampling. Finally, we use the P3^n features to generate pixel-accurate prototypes. This way, in contrast to [12], we can preserve high-resolution details for generating the mask prototypes, as the high-resolution C3 features are computed instead of transformed and thus are immune to errors in flow estimation.

Importantly, although we compute the C1–C3 backbone features for every frame (i.e., both key and non-keyframes), we avoid computing the most expensive part of the backbone, as the computational costs in different stages of pyramid-like networks are highly imbalanced. As shown in Table II, more than 66% of the computation cost of ResNet-101 lies in C4, while more than half of the inference time is occupied by backbone computation. By computing only lower layers of the feature pyramid and transforming the rest, we can largely accelerate our method to reach real-time performance.

In summary, our partial feature transform design produces the higher quality feature maps required for instance segmentation, while also enabling real-time speeds.
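As a concrete illustration of Eqs. (1)-(2) and the partial feature transform, the following PyTorch-style sketch shows one way the keyframe/non-keyframe dispatch and the warping step could be wired together. It is a sketch under stated assumptions, not the released YolactEdge implementation: the module names (backbone, backbone_c1_c3, fpn, featflownet, lateral_c3, predict), the fixed keyframe interval of 5 (taken from the implementation details in Sec. IV), and the downsampling used for P6/P7 are all illustrative stand-ins.

```python
# Hedged sketch of Eqs. (1)-(2) and the partial feature transform (Sec. III-B).
# Not the released YolactEdge code: every module/attribute name is assumed.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Inverse-warp feat (1,C,H,W) with a pixel-space flow field (1,2,h,w).

    Each output location x samples feat at x + flow(x) by bilinear
    interpolation, mirroring F^{k->n}(x) = sum_u theta(u, x+dx) F^k(u).
    The flow is resized (and its displacements rescaled) to feat's resolution.
    """
    _, _, h, w = feat.shape
    sx, sy = w / flow.shape[-1], h / flow.shape[-2]
    flow = F.interpolate(flow, size=(h, w), mode="bilinear", align_corners=False)
    flow = torch.stack((flow[:, 0] * sx, flow[:, 1] * sy), dim=1)
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float() + flow[0]     # batch size 1 assumed
    gx = 2.0 * grid[0] / max(w - 1, 1) - 1.0                  # normalize to [-1, 1]
    gy = 2.0 * grid[1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1).unsqueeze(0),
                         mode="bilinear", align_corners=True)

def forward_video(model, frames, keyframe_interval=5):
    """Keyframes run the full network (Eq. 1); non-keyframes compute only
    C1-C3 and reuse warped P4/P5 from the last keyframe (Eq. 2)."""
    cache, outputs = None, []
    for idx, img in enumerate(frames):
        if idx % keyframe_interval == 0:                   # keyframe I^k
            c3, c4, c5 = model.backbone(img)               # full backbone
            p3, p4, p5, p6, p7 = model.fpn(c3, c4, c5)
            cache = {"c3": c3, "p4": p4, "p5": p5}
        else:                                              # non-keyframe I^n
            c3 = model.backbone_c1_c3(img)                 # skip C4/C5
            flow = model.featflownet(cache["c3"], c3)      # motion M(I^k, I^n)
            w4, w5 = warp(cache["p4"], flow), warp(cache["p5"], flow)
            lat3 = model.lateral_c3(c3)                    # 1x1 conv to FPN width (assumed)
            p3 = lat3 + F.interpolate(w4, size=lat3.shape[-2:], mode="bilinear",
                                      align_corners=False)  # P3 = C3 + up(W4)
            p4, p5 = w4, w5
            p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # cheap downsampling stand-in
            p7 = F.max_pool2d(p6, kernel_size=1, stride=2)
        outputs.append(model.predict(p3, p4, p5, p6, p7))   # ProtoNet + heads + assembly
    return outputs
```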
[Fig. 2: flow-estimation diagrams. Both networks consist of convs → prediction → refinement → flow; (a) FlowNetS starts from raw frames with its own backbone, while (b) FeatFlowNet reuses backbone features and uses fewer convs.]

Fig. 2: Flow estimation. Illustration of the difference between FlowNetS [15] (a) and our FeatFlowNet (b).

(a) ResNet-101 Backbone
            C1    C2    C3    C4    C5
# of convs   1     9    12    69     9
TFLOPS      0.1   0.7   1.0   5.2   0.8
%           1.5   8.7  13.2  66.2  10.3

(b) YOLACT
Stage      %     Stage   %
Backbone  54.7   FPN     6.4
ProtoNet   7.8   Pred   10.6
Detect     6.6   Other  13.1

TABLE II: Computational cost breakdown for different stages of (a) ResNet-101 backbone, and (b) YOLACT.
[Fig. 3: qualitative mask comparisons; rows show Mask R-CNN, YOLACT, and Ours.]

Fig. 3: Mask quality. Our masks are as high quality as YOLACT even on non-keyframes, and are typically higher quality than those of Mask R-CNN [32].

Efficient Motion Estimation. In this section, we describe how we efficiently compute flow between a keyframe and non-keyframe. Given a non-keyframe I^n and its preceding keyframe I^k, our model first encodes object motion between them as a 2-D flow field M(I^k, I^n). It then uses the flow field to transform the features F^k = {P4^k, P5^k} from frame I^k to align with frame I^n to produce the warped features F̃^n = {W4^n, W5^n} = T(F^k, M(I^k, I^n)).

In order to perform fast feature transformation, we need to estimate object motion efficiently. Existing frameworks [12], [13] that perform flow-guided feature transform directly adopt off-the-shelf pixel-level optical flow networks for motion estimation. FlowNetS [15] (Fig. 2a), for example, performs flow estimation in three stages: it first takes in raw RGB frames as input and computes a stack of features; it then refines a subset of the features by recursively upsampling and concatenating feature maps to generate coarse-to-fine features that carry both high-level (large motion) and fine local information (small motion); finally, it uses those features to predict the final flow map.

In our case, to save computation costs, instead of taking an off-the-shelf flow network that processes raw RGB frames, we reuse the features computed by our model's backbone network, which already produces a set of semantically rich features. To this end, we propose FeatFlowNet (Fig. 2b), which generally follows the FlowNetS architecture, but in the first stage, instead of computing feature stacks from raw RGB image inputs, we re-use features from the ResNet backbone (C3) and use fewer convolution layers. As we demonstrate in our experiments, our flow estimation network is much faster while being equally effective.
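A hedged sketch of what a FeatFlowNet-style module could look like: it reuses the C3 features of the keyframe and the current frame, reduces their channels (the paper's ablation uses a 1/4 reduction), and predicts a 2-channel flow field with a small stack of convolutions. Layer counts and channel widths are illustrative assumptions, not the released architecture.

```python
# Hedged sketch of FeatFlowNet (Fig. 2b): reuse backbone C3 features from both
# frames, reduce channels, and predict a coarse flow field with a few convs.
# Layer counts and channel sizes are illustrative, not the released model.
import torch
import torch.nn as nn

class FeatFlowNet(nn.Module):
    def __init__(self, c3_channels=512, reduced=128):
        super().__init__()
        # channel reduction before flow estimation (Table VIc reduces channels
        # to 1/4 with only a slight AP drop)
        self.reduce = nn.Conv2d(c3_channels, reduced, kernel_size=1)
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * reduced, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.predict_flow = nn.Conv2d(64, 2, kernel_size=3, padding=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, c3_key, c3_cur):
        # concatenate reduced C3 features of the keyframe and the current frame,
        # then predict a 2-channel flow field M(I^k, I^n) at C3 resolution
        x = torch.cat([self.reduce(c3_key), self.reduce(c3_cur)], dim=1)
        return self.upsample(self.predict_flow(self.encoder(x)))

# usage: flow = FeatFlowNet()(c3_of_keyframe, c3_of_current_frame)
```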
Feature Warping. We use FeatFlowNet to estimate the flow map M(I^k, I^n) between the previous keyframe I^k and the current non-keyframe I^n, and then transform the features from I^k to I^n via inverse warping: by projecting each pixel x in I^n to I^k as x + δx, where δx = M_x(I^k, I^n). The pixel value is then computed via bilinear interpolation, F^{k→n}(x) = Σ_u θ(u, x + δx) F^k(u), where θ is the bilinear interpolation weight at different spatial locations.

Loss Functions. For the instance segmentation task, we use the same losses as YOLACT [1] to train our model: classification loss L_cls, box regression loss L_box, mask loss L_mask, and auxiliary semantic segmentation loss L_aux. For flow estimation network pre-training, like [15], we use the endpoint error (EPE).
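For reference, a compact sketch of how these objectives could be combined in training code; the loss terms are assumed to be computed by YOLACT-style heads, and the weighting coefficients are placeholders rather than values from the paper.

```python
# Hedged sketch of the training objectives above; weights are placeholders.
import torch

def total_loss(l_cls, l_box, l_mask, l_aux, w_box=1.0, w_mask=1.0, w_aux=1.0):
    """Instance segmentation loss: L = L_cls + w_box*L_box + w_mask*L_mask + w_aux*L_aux."""
    return l_cls + w_box * l_box + w_mask * l_mask + w_aux * l_aux

def endpoint_error(flow_pred, flow_gt):
    """EPE used to pre-train the flow network (as in FlowNetS [15]): mean L2
    distance between predicted and ground-truth flow fields of shape (N,2,H,W)."""
    return torch.norm(flow_pred - flow_gt, p=2, dim=1).mean()
```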
Fig. 4: YolactEdge results on YouTube VIS on non-keyframes whose subset of features are warped from a keyframe 4
frames away (farthest in sampling window). Our mask predictions can tightly fit the objects, due to partial feature transform.

IV. RESULTS

In this section, we analyze YolactEdge's instance segmentation accuracy and speed on the Jetson AGX Xavier and RTX 2080 Ti. We compare to state-of-the-art real-time instance segmentation methods, and perform ablation studies to dissect our various design choices and modules.

Implementation details. We train with a batch size of 32 on 4 GPUs using ImageNet pre-trained weights. We leave the pre-trained batchnorm (bn) unfrozen and do not add any extra bn layers. We first pre-train YOLACT with SGD for 500k iterations with a 5 × 10^-4 initial learning rate. Then, we freeze the YOLACT weights and train FeatFlowNet on FlyingChairs [33] with a 2 × 10^-4 initial learning rate. Finally, we fine-tune all weights except the ResNet backbone for 200k iterations with a 2 × 10^-4 initial learning rate. When pre-training YOLACT, we apply all data augmentations used in YOLACT; during fine-tuning, we disable random expand to allow the warping module to model larger motions. For all training stages, we use a cosine learning rate decay schedule, with weight decay 5 × 10^-4 and momentum 0.9. We pick the first of every 5 frames as the keyframes. We use 100 images from the training set to calibrate our INT8 model components (backbone, prototype, FeatFlowNet) for TensorRT, and the remaining components (prediction head, FPN) are converted to FP16. We do not convert the warping module to TensorRT, as the conversion of the sampling function (needed for inverse warp) is not natively supported, and it is also not a bottleneck for our feature propagation to be fast. We limit the output resolution to a maximum of 640x480 while preserving the aspect ratio.

Datasets. YouTube VIS [10] is a video instance segmentation dataset for detection, segmentation, and tracking of object instances in videos. It contains 2883 high-resolution YouTube videos of 40 common objects such as people, animals, and vehicles, at a frame rate of 30 FPS. The train, validation, and test sets contain 2238, 302, and 343 videos, respectively. Every 5th frame of each video is annotated with pixel-level instance segmentation ground-truth masks. Since we only perform instance segmentation (without tracking), we cannot directly use the validation server of YouTube VIS to evaluate our method. Instead, we further divide the training split into two train-val splits with an 85%-15% ratio (1904 and 334 videos). To demonstrate the validity of our own train-val split, we created two more splits, configured so that any two splits have a video overlap of less than 18%. We evaluated Mask R-CNN, YOLACT, and YolactEdge on all three splits; the AP variance is within ±2.0.

We also evaluate our approach on the MS COCO [9] dataset, which is an image instance segmentation benchmark, using the standard metrics. We train on the train2017 set and evaluate on the val2017 and test-dev sets.

Method                Backbone    mask AP  box AP  RTX FPS
Mask R-CNN [32]       R-101-FPN   43.1     47.3    14.1
CenterMask-Lite [2]   V-39-FPN    41.6     45.9    34.4
BlendMask-RT [5]      R-50-FPN    44.0     47.9    49.3
SOLOv2-Light [3]      R-50-FPN    46.3     –       43.9
YOLACT [1]            R-50-FPN    44.7     46.2    59.8
YOLACT [1]            R-101-FPN   47.3     48.9    42.6
Ours:
YolactEdge (w/o TRT)  R-50-FPN    44.2     45.2    67.0
YolactEdge (w/o TRT)  R-101-FPN   46.9     47.8    61.2
YolactEdge            R-50-FPN    44.0     45.1    177.6
YolactEdge            R-101-FPN   46.2     47.1    172.7

TABLE III: Comparison to state-of-the-art real-time methods on YouTube VIS. We use our sub-training and sub-validation splits for YouTube VIS and perform joint training with COCO using a 1:1 data sampling ratio. (Box AP is not evaluated in the authors' code base of SOLOv2.)

Method                  Backbone      mask AP  box AP  AGX FPS  RTX FPS
YOLACT [1]              MobileNet-V2  22.1     23.3    15.0     35.7
YolactEdge (w/o video)  MobileNet-V2  20.8     22.7    35.7     161.4
YOLACT [1]              R-50-FPN      28.2     30.3    9.1      45.0
YolactEdge (w/o video)  R-50-FPN      27.0     30.1    30.7     140.3
YOLACT [1]              R-101-FPN     29.8     32.3    6.6      36.5
YolactEdge (w/o video)  R-101-FPN     29.5     32.1    27.3     124.8

TABLE IV: YolactEdge (w/o video) comparison to YOLACT on the MS COCO [9] test-dev split. AGX: Jetson AGX Xavier; RTX: RTX 2080 Ti.

Method                  Backbone   mask AP  box AP  AGX FPS  RTX FPS
YOLACT [1]              R-50-FPN   44.7     46.2    8.5      59.8
YolactEdge (w/o TRT)    R-50-FPN   44.2     45.2    10.5     67.0
YolactEdge (w/o video)  R-50-FPN   44.5     46.0    32.0     185.7
YolactEdge              R-50-FPN   44.0     45.1    32.4     177.6
YOLACT [1]              R-101-FPN  47.3     48.9    5.9      42.6
YolactEdge (w/o TRT)    R-101-FPN  46.9     47.8    9.5      61.2
YolactEdge (w/o video)  R-101-FPN  46.9     48.4    27.9     158.2
YolactEdge              R-101-FPN  46.2     47.1    30.8     172.7

TABLE V: YolactEdge ablation results on YouTube VIS.

A. Instance Segmentation Results

We first compare YolactEdge to state-of-the-art real-time methods on YouTube VIS using the RTX 2080 Ti GPU in Table III. YOLACT [1] with a R101 backbone produces the highest box detection and instance segmentation accuracy over all competing methods. Our approach, YolactEdge, offers competitive accuracy to YOLACT, while running at a much faster speed (177.6 FPS with a R50 backbone). Even without the TensorRT optimization, it still achieves over 60 FPS for both R50 and R101 backbones, demonstrating the contribution of our partial feature transform design, which allows the model to skip a large amount of redundant computation in video.

In terms of mask quality, because YOLACT/YolactEdge produce a final mask of size 138x138 directly from the feature maps without repooling (which can potentially misalign the features), their masks for large objects are noticeably higher quality than those of Mask R-CNN. For instance, in Fig. 3, both YOLACT and YolactEdge produce masks that follow the boundary of the feet of the lizard and zebra, while those of Mask R-CNN have more artifacts.
(a) INT8 calibration: effect of the number of calibration images.
#Calib. Img.  mAP   FPS
0             24.4  –
5             29.6  27.4
50            29.8  27.4
100           29.7  27.5

(b) Partial feature transform: we warp P4 & P5 as it is both fast and accurate.
Warp layers  mAP   FPS
C4, C5       39.2  59.7
P4, P5       39.2  63.2
C3, C4, C5   37.8  59.1
P3, P4, P5   38.0  64.1

(c) FeatFlowNet: we reduce channels for an accuracy/speed tradeoff.
Channels  mAP   FPS
1x        47.0  48.3
1/2x      46.9  53.6
1/4x      46.9  61.2
1/8x      –     62.2

(d) FeatFlowNet is faster and equally effective compared to FlowNetS.
Method       mAP   FPS
w/o flow     31.8  72.5
FlowNetS     39.2  43.3
FeatFlowNet  39.2  61.2

TABLE VI: Ablations. (a) is on COCO val2017 using YOLACT with a R101 backbone. (b-d) are YolactEdge (w/o TRT) on our YouTube VIS sub-train/sub-val split ((b) & (d) without COCO joint training). We highlight our design choices in gray.

This also explains YOLACT/YolactEdge's stronger quantitative performance over Mask R-CNN on YouTube VIS, which has many large objects. Moreover, our proposed partial feature transform allows the network to take the computed high-resolution C3 features to help generate prototypes. In this way, our method is less prone to artifacts brought by misalignment compared to warping all features (as in [12]) and thus can maintain similar accuracy to YOLACT, which processes all frames independently. See Fig. 4 for more qualitative results.

We next compare YolactEdge to YOLACT on the MS COCO [9] dataset in Table IV. Here YolactEdge is without video optimization since MS COCO is an image dataset. We compare three backbones: MobileNetv2, ResNet-50, and ResNet-101. Every YolactEdge configuration results in a loss of AP when compared to YOLACT due to the quantization of network parameters performed by TensorRT. This quantization, however, comes with an immense gain in FPS on the Jetson AGX and RTX 2080 Ti. For example, using ResNet-101 as a backbone results in a loss of 0.3 mask mAP from the unquantized model but yields a 20.7/88.3 FPS improvement on the AGX/RTX. We note that the MobileNetv2 backbone has the fastest speed (35.7 FPS on AGX) but a very low mAP of 20.8 when compared to the other configurations.

Finally, Table V shows ablations of YolactEdge. Starting from YOLACT, which is equivalent to YolactEdge without TensorRT and video optimization, we see that with a ResNet-101 backbone, both our video and TensorRT optimizations lead to significant improvements in speed with a bit of degradation in mask/box mAP. The speed improvements for instantiations with a ResNet-50 backbone are not as prominent, because the video optimization mainly exploits the redundancy of computation in the backbone stage and its effect diminishes with smaller backbones.

B. Which feature layers should we warp?

As shown in Table VIb, computing C3/P3 features (rows 2-3) yields 1.2-1.4 higher AP than warping C3/P3 features (rows 4-5). We choose to perform the partial feature transform over P instead of C features, as there is no obvious difference in accuracy while it is much faster to warp P features.

C. FeatFlowNet

To encode pixel motion, FeatFlowNet takes as input C3 features from the ResNet backbone. As shown in Table VIc, we choose to reduce the channels to 1/4 before they enter FeatFlowNet, as the AP only drops slightly while being much faster. If we further decrease the channels to 1/8, the FPS does not increase by a large margin, and flow pre-training does not converge well. As shown in Table VId, accurate flow maps are crucial for transforming features across frames. Notably, our FeatFlowNet is equally effective for mask prediction as FlowNetS [15], while being faster as it reuses C3 features for pixel motion estimation (whereas FlowNetS computes flow starting from raw RGB pixels).

D. Temporal Stability

Finally, although YolactEdge does not perform explicit temporal smoothing, it produces temporally stable masks.¹ In particular, we observe less mask jittering than with YOLACT. We believe this is because YOLACT only trains on static images, whereas YolactEdge utilizes temporal information in videos both during training and testing. Specifically, when producing prototypes, our partial feature transform implicitly aggregates information from both the previous keyframe and the current non-keyframe, and thus "averages out" noise to produce stable segmentation masks.

V. DISCUSSION OF LIMITATIONS

Despite YolactEdge's competitiveness, it still falls behind YOLACT in mask mAP. We discuss two potential causes.

a) Motion blur: We believe part of the reason lies in the feature transform procedure – although our partial feature transform corrects certain errors caused by imperfect flow maps (Table VIb), there can still be errors caused by motion blur which lead to mis-localized detections. Specifically, for non-keyframes, P4 and P5 features are derived by transforming features of previous keyframes. It is not guaranteed that the randomly selected keyframes are free from motion blur. A smart way to select keyframes would be interesting future work.

b) Mixed-precision conversion: The accuracy gap can also be attributed to mixed-precision conversion – even with the optimal conversion and calibration configuration (Tables I and VIa), the precision gap between training (FP32) and inference (FP16/INT8) is not fully addressed. An interesting direction is to explore training with mixed precision, with which the model could potentially learn to compensate for the precision loss and adapt better during inference.

Acknowledgements. This work was supported in part by NSF IIS-1751206, IIS-1812850, and an AWS ML research award. We thank Joohyung Kim for helpful discussions.

¹ See supplementary video: https://youtu.be/GBCK9SrcCLM.
REFERENCES

[1] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. In ICCV, 2019.
[2] Youngwan Lee and Jongyoul Park. CenterMask: Real-time anchor-free instance segmentation. arXiv preprint arXiv:1911.06667, 2019.
[3] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic, faster and stronger. arXiv preprint arXiv:2003.10152, 2020.
[4] Rufeng Zhang, Zhi Tian, Chunhua Shen, Mingyu You, and Youliang Yan. Mask encoding for single shot instance segmentation. arXiv preprint arXiv:2003.11712, 2020.
[5] Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, and Youliang Yan. BlendMask: Top-down meets bottom-up for instance segmentation. arXiv preprint arXiv:2001.00309, 2020.
[6] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact++: Better real-time instance segmentation. TPAMI, 2020.
[7] Sida Peng, Wen Jiang, Huaijin Pi, Hujun Bao, and Xiaowei Zhou. Deep snake for real-time instance segmentation. arXiv preprint arXiv:2001.01629, 2020.
[8] NVIDIA TensorRT. https://developer.nvidia.com/tensorrt. Accessed: 2020.
[9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[10] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, 2019.
[11] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
[12] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, 2017.
[13] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In ICCV, 2017.
[14] Xizhou Zhu, Jifeng Dai, Lu Yuan, and Yichen Wei. Towards high performance video object detection. In CVPR, 2018.
[15] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
[16] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
[17] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
[18] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. CoRR, abs/1905.02244, 2019.
[19] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V. Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. CoRR, abs/1904.07392, 2019.
[20] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946, 2019.
[21] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[22] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. CoRR, abs/1802.05668, 2018.
[23] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.
[24] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.
[25] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
[26] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
[27] Adrian Bulat and Georgios Tzimiropoulos. XNOR-Net++: Improved binary neural networks. In BMVC, 2019.
[28] Sergei Alyamkin, Matthew Ardi, Alexander C. Berg, Achille Brighton, Bo Chen, Yiran Chen, Hsin-Pai Cheng, Zichen Fan, Chen Feng, Bo Fu, Kent Gauen, Abhinav Goel, Alexander Goncharenko, Xuyang Guo, Soonhoi Ha, Andrew Howard, Xiao Hu, Yuanjun Huang, Donghyun Kang, Jaeyoun Kim, Jong-gook Ko, Alexander Kondratyev, Junhyeok Lee, Seungjae Lee, Suwoong Lee, Zichao Li, Zhiyu Liang, Juzheng Liu, Xin Liu, Yang Lu, Yung-Hsiang Lu, Deeptanshu Malik, Hong Hanh Nguyen, Eunbyung Park, Denis Repin, Liang Shen, Tao Sheng, Fei Sun, David Svitov, George K. Thiruvathukal, Baiwu Zhang, Jingchi Zhang, Xiaopeng Zhang, and Shaojie Zhuo. Low-power computer vision: Status, challenges, opportunities. CoRR, abs/1904.07714, 2019.
[29] TensorRT hardware support matrix. https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix. Accessed: 2020.
[30] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[31] TensorRT INT8 calibration. https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf. Accessed: 2020.
[32] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[33] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
