Semantically Video Coding: Instill Static-Dynamic Clues Into Structured Bitstream For AI Tasks
Fig. 2: Overview of our idea. The boxes marked in red denote the novel designs compared with existing codecs; specifically, these red solid boxes are the new components that aim to produce a semantically structured bitstream (SSB) for supporting machine analytics.

data decompression, but also can be directly handled by machine learning algorithms with much less decompression complexity, or even no decompression procedure at all. This could significantly reduce the bitstream transmission and decoding cost. Recently, MPEG has also initiated a standardization activity on video coding for machines (VCM)¹, which attempts to identify the opportunities and challenges of developing collaborative compression techniques for humans and machines, while establishing a new coding standard for both machine vision and hybrid machine-human vision scenarios.

¹ https://lists.aau.at/mailman/listinfo/mpeg-vcm

In recent years, with the fast development of deep learning based compression techniques [14]–[16], several studies have contributed new compression schemes that can directly support downstream intelligent tasks without decoding the entire compressed bitstream [17]. Torfason et al. [18] use a neural network to generate a compressed bitstream that serves directly as the input of downstream tasks such as classification and segmentation, which bypasses decoding of the compressed representation into RGB space and thus reduces the computational cost. Similar ideas can be found in the video-based schemes CoViAR [19] and DMC-Net [20], which directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no extra cost, supporting the downstream action recognition task. However, these schemes are still task-specific, i.e., designed for a limited range of applications, and cannot meet more general requirements for flexibility and efficiency, because they do not consider the intrinsic semantics contained in the compressed bitstream and cannot leverage differently structured bitstreams for different tasks.

Sun et al. [1] first introduced the concept of semantically structured coding for image compression (abbreviated as SSIC) and generate a semantically structured bitstream (SSB), where each part of the bitstream represents a specific object and can be directly used for the aforementioned intelligent image tasks (including object detection, pose estimation, etc.). However, this work only considers the image coding framework, and the generated SSB only contains the static object information of the image, which seriously limits its practical application to a larger scope, especially for video-based intelligent applications.

Therefore, in this paper, we extend the idea of semantically structured coding from the video coding perspective and propose a new paradigm of video coding for machines (VCM). Specifically, we introduce an advanced Semantically Structured Video Coding (SSVC) framework to directly support heterogeneous intelligent multimedia applications. As illustrated in Fig. 2, in order to generate a semantics-aware bitstream that can be directly used for supporting downstream intelligent analytics without full decoding, and that can also be reconstructed for human perception, the SSVC codec encodes the input media data (i.e., image or video) into a semantically structured bitstream (SSB). The SSB generally consists of hierarchical information: high-level features (e.g., the category and spatial location information of each object detected in the video) and low-level features (e.g., the content information of each object and of the remaining background in the video).

In detail, for the video key frames, i.e., the intra-coded frames, we leverage a simple and effective object detection technique to instantiate the static information of the SSB. We integrate the recently proposed CenterNet [21] into the encoder of our SSVC framework, which locates objects and obtains their corresponding class ID and spatial location (e.g., bounding box) information in the feature domain. We then re-organize such features to form a part of the SSB, by which specific objects can be reconstructed, and several image-based intelligent analysis tasks such as object classification/detection can achieve results similar to or better than those obtained from fully decompressed images.

Besides the static semantic information derived from the objects of i-frames, motion characteristics are also very important for video compression [22]. Therefore, our SSVC further integrates motion clues, namely the optical flow and content residues of the continuous video frames (i.e., p-frames, inter-coded using reference frames from the past), into the SSB to support a wider range of video tasks. For example, for a video-based multimedia intelligent analysis task such as video action recognition, only the person-related content of the key frame (i.e., i-frame) and the corresponding optical flow of the continuous frames adjacent to the i-frame are required from the SSB, which further saves most of the decompression time and transmission bandwidth.

In short, our SSVC can directly support heterogeneous multimedia analysis tasks based only on partial data decoding, which is achieved by the semantics-structured coding process and bitstream deployment. We do not jointly train the entire compression framework and the subsequent AI application/task models, which differs from previous joint-training based literature [19], [20], [23], [24].

Last but not least, we experimentally show how to leverage the semantically structured bitstream (SSB) to better and adaptively support downstream intelligent tasks in an adjustable manner (shown in Fig. 2 and Fig. 8). Such scalable functionality bridges the gap between high-efficiency video compression and machine vision support. In summary, the contributions of this paper can be summarized as follows:
• We propose an advanced Semantically Structured Video Coding (SSVC) framework to meet the fast-growing requirements of intelligent multimedia analysis. As a new paradigm for intelligent video compression, SSVC can support heterogeneous multimedia analysis tasks based only on partial data decoding, thus greatly reducing the transmission bandwidth and storage resources. This is achieved by the semantics-structured coding process and bitstream deployment.
• In order to efficiently support downstream video tasks based on a partially decoded bitstream, we leverage optical flow and residuals to describe the dynamic temporal motion information of the video and add them into the semantically structured bitstream (SSB), which goes beyond the image-based semantic compression framework [1] and makes our SSVC more general and scalable. We instantiate the SSVC framework with action recognition and video object segmentation as video-based embodiments to reveal the superiority of our coding scheme.
• Experimentally, we provide evidence that our SSVC is more flexible and scalable, and can better and adaptively support heterogeneous downstream intelligent tasks with the structured bitstream.

The remaining part of this paper is organized as follows. We introduce recent progress on video compression in Section II, including traditional hybrid coding pipelines and learning based compression schemes. The details of the proposed Semantically Structured Video Coding (SSVC) framework are introduced in Section III. Comprehensive experiments are conducted and illustrated in Section IV and Section V. We conclude our coding architecture and discuss its future directions in Section VI.

II. RELATED WORK

In the current information age, fast-growing multimedia video takes up a large part of people's daily life, and it is critical to record, store, and view images/videos efficiently. For the past decades, a lot of academic and industrial effort has been devoted to video compression, which aims to achieve a good trade-off in the rate-distortion optimization problem. Below, we first review the advances of traditional video coding frameworks as well as the recently booming deep learning based compression schemes. Then, we introduce several task-driven coding schemes on visual data for machine vision in a general sense, revealing their growing importance.

A. Traditional Image/Video Coding Approach

Since the 1970s, the hybrid video coding architecture [25] has led the mainstream direction and occupied the major industry proportion over the following decades. Based on it, popular video coding standards have kept evolving through the development of the ITU-T and ISO/IEC standards, including H.261 [26], H.263 [27], MPEG-1 [28], MPEG-4 Visual [29], H.262/MPEG-2 Video [30], H.264/MPEG-4 Advanced Video Coding (AVC) [31], and H.265/MPEG-H (Part 2) High Efficiency Video Coding (HEVC) [32].

All these standards follow the block-based video coding strategy. Intra- and inter-prediction techniques are applied to the corresponding contexts, i.e., neighboring blocks and reference frames in the intra and inter modes, to remove the spatial and temporal statistical redundancies of video frames. However, these hand-designed patterns, e.g., block partitioning, allow the prediction to cover only part of the context information, which limits the modeling capacity. Besides, the block-wise prediction, along with transform and lossy quantization, causes blocking effects in the decoded results. As most traditional coding architectures generate the bitstream in units of the entire image or video, they cannot support partial bitstream decoding or partial object reconstruction for intelligent video analysis tasks. Differently from most codecs, MPEG-4 Visual decomposes video into video object planes (VOPs) and encodes them sequentially. Although MPEG-4 Visual tries to achieve an object-oriented bitstream, its implementation must be based on accurate pixel-level segmentation results, which is difficult to achieve at the moment.

B. Learning Based Image/Video Coding Approach

The great success of deep learning techniques has significantly promoted the development of end-to-end learned video coding. Deep learning based coding methods do not rely on block partitioning and support full-resolution coding, which naturally avoids blocking artifacts. Generally, a representative and powerful feature is extracted via a hierarchical network and jointly optimized with the reconstruction task for highly efficient coding. For instance, the early work [16] focuses on motion-predictive coding and proposes the concept of PixelMotionCNN (PMCNN) to model spatiotemporal coherence and effectively perform predictive coding inside the learning network. Similarly, recurrent neural networks [33], [34], VAE generative models [14], [15], and non-local attention [35], [36] are employed to remove unnecessary spatial redundancy from the latent representations and make the features compact, leading to improved coding performance. In another mainstream branch, a lot of effort is devoted to improving the performance of neural network based video coding frameworks by increasing the prediction ability of deep networks for intra- [37], [38] or inter-prediction in video codecs [39]–[41]. Meanwhile, end-to-end learned video compression frameworks, such as DVC [42] and HLVC [43], further push the compression efficiency along this route. All these methods can reduce the overall R-D cost on large-scale video data. Besides, as the entire coding pipeline is optimized in an end-to-end manner, it is also flexible to adapt the rate and distortion to accommodate a variety of end applications, e.g., machine vision analytics tasks.

However, the above learning based compression methods typically fail to handle situations where tremendous volumes of data need to be processed and analyzed quickly, because they need to reconstruct the whole picture. The semantics-unknown data still constitutes a major part of the bitstream, so these
methods cannot fulfill the emerging requirement of real-time video content analytics when dealing with large-scale video data. Nevertheless, these learning based coding frameworks provide opportunities to develop effective VCM architectures that address these challenges.

C. Task-driven Image/Video Coding Approach

Deep learning algorithms have achieved great success in practical computer vision tasks, promoting the development of the media industry in recent years. Correspondingly, more and more captured videos are directly handled/analyzed by machine algorithms instead of being perceived by human eyes. Therefore, recent works tend to optimize their compression pipelines according to the feedback derived from real task-driven applications rather than the quality fidelity aimed at human perception.

Built upon traditional codecs, Pu et al. [44] apply a task-specific metric to JPEG 2000. Liu et al. [45] enhance the compression scheme for intelligent applications by minimizing the distortion of frequency features that are important to the neural network. CDVS [46] and CDVA [46] aim at efficiently supporting the search task through compact descriptors, using both traditional and learning-based methods. Li et al. [47] implement semantic-aware bit allocation for a traditional codec based on reinforcement learning. On the other hand, building on learning-based coding schemes, Chen et al. [48] propose a learning based facial image compression (LFIC) framework with a novel regionally adaptive pooling (RAP) module that can be automatically optimized according to gradient feedback from an integrated hybrid semantic fidelity metric. Alvar et al. [49] study a bit allocation method for feature compression in a multi-task problem. The traditional hybrid video coding frameworks and the aforementioned learning-based methods both encode the video into a binary stream without any semantic structure, which makes such bitstreams unable to directly support intelligent tasks. Zhang et al. [23] propose a hybrid content-plus-feature coding framework that jointly compresses the feature descriptors and the visual content, together with a novel rate-accuracy optimization technique to accurately estimate the retrieval performance degradation in feature coding. Duan et al. [50] explore the new video coding for machines (VCM) area by building a bridge between feature coding for machine vision and video coding for human vision; they propose a task-specific compression pipeline that jointly trains the feature compression and the intelligent tasks. Xin Li et al. [51] implement task-driven semantic coding via semantic bit allocation based on reinforcement learning (RL), designing semantic maps for different tasks to extract the pixel-wise semantic fidelity for videos/images. Ma et al. [24] provide a systematic overview and analysis of the joint feature and texture representation framework, which aims to smartly and coherently represent the visual information with front-end intelligence in the scenario of video big data applications; a future joint coding scheme incorporating deep learning features is envisioned, and future challenges toward seamless and unified joint compression are discussed.

These methods mostly adopt a joint training scheme to optimize not only the compression rate but also the accuracy of the AI applications. Such joint-training based optimization lacks flexibility, because the compression encoder needs to be adjusted for each subsequent supported AI task. However, in actual applications it is unrealistic to tie the encoder and decoder to a particular task: once such a coding framework is well trained on a specific task, it is difficult to adapt it to other vision tasks.

Therefore, in this paper, we present the concept of a semantically structured bitstream (SSB), which contains hierarchical information representing individual objects in the videos and can be directly used for various tasks. Note that the proposed SSVC video coding framework is an extension of our previous image coding pipeline SSIC reported in [1]. SSVC goes beyond SSIC [1] in at least four respects: 1) SSIC only supports image coding and can only be employed for image-based intelligent analytics, whereas our SSVC framework supports image and video coding together and can be directly employed for both image-based and video-based intelligent analysis. 2) The SSB of SSIC only contains static object information of the image, while its counterpart in our SSVC not only encodes the static object information contained in the key frames/images, but also integrates motion clues (i.e., the optical flow between neighboring frames) and content residues into the bitstream; in general, the SSB of our SSVC framework combines static object semantics with dynamic motion clues between adjacent video frames. 3) Beyond SSIC, we replace the original backbone, which is based on a conditional probability model [52], with a stronger VAE-based backbone [53], thus improving the basic compression performance of SSVC. 4) In terms of validation experiments, we add more analysis and experiments on video-based intelligent tasks, revealing the superiority of SSVC compared with SSIC.

III. SEMANTICALLY STRUCTURED VIDEO CODING FRAMEWORK

In this section, we introduce the architecture of our proposed Semantically Structured Video Coding (SSVC) framework. The pipeline is illustrated in Fig. 3. In the following sub-sections, we first give an overview of the proposed SSVC framework and then introduce the details of each component sequentially.

Given a video X composed of multiple frames x_1, x_2, ..., x_N, where N denotes the length of the video clip, the video compression process can be formulated as a rate-distortion (R-D) optimization (RDO) problem [54], [55]. The target of such RDO can be understood from two sides: one is minimizing the bit-rate cost, i.e., the transmission/storage cost, without increasing the fidelity distortion; the other is minimizing the distortion at a fixed bit-rate. The Lagrangian formulation of this minimization problem is given by

min J,  where J = R + λD,   (1)

where the Lagrangian rate-distortion functional J is minimized for a particular value of the Lagrange multiplier λ. More details on Lagrangian optimization are discussed in [56].
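To make the selection in Eq. (1) concrete, the following minimal sketch (an illustration only, not part of the SSVC implementation; the candidate operating points and the value of λ are assumed) scores a set of candidate encodings by their Lagrangian cost J = R + λD and keeps the cheapest one.

```python
# Minimal sketch of Lagrangian rate-distortion selection (Eq. 1).
# Each candidate is a hypothetical (rate_bits, distortion_mse) operating point,
# e.g., the same frame coded with different quantization settings.

def rdo_select(candidates, lam):
    """Return the candidate minimizing J = R + lambda * D, together with its cost."""
    best, best_cost = None, float("inf")
    for rate, distortion in candidates:
        cost = rate + lam * distortion      # Lagrangian cost J
        if cost < best_cost:
            best, best_cost = (rate, distortion), cost
    return best, best_cost

# Three assumed operating points (bits, MSE) and an assumed multiplier.
points = [(12000, 38.0), (18000, 21.5), (26000, 15.2)]
print(rdo_select(points, lam=0.05))
```

A larger λ weights distortion more heavily and therefore favors higher-rate, lower-distortion candidates, which is exactly the trade-off controlled by the Lagrange multiplier above.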
Fig. 3: The overall pipeline of our proposed semantically structured video coding (SSVC) framework, and the illustration of examples of downstream intelligent task analytics.
We go beyond the traditional hybrid video coding framework by building our compression pipeline upon learning based codecs, in which the modules can be jointly optimized to better carry out R-D optimization. We attempt to define the pipeline of video coding for machines (VCM) to bridge the gap between coding semantic features for machine vision tasks and coding pixel features for human vision.

As shown in Fig. 3, the compression process has two encoding modes, intra-mode and inter-mode. Following the traditional hybrid video coding codecs [57] and the existing learning-based methods [42], [43], we first divide the original video sequence into groups of pictures (GoPs). Let x = {x_1, x_2, ..., x_t, x_{t+1}, ..., x_N} denote the frames of one GoP unit, where N is the GoP length. Assuming that x_t has been coded in intra-mode, in the subsequent inter-mode coding process x_{t+1}, x_{t+2}, ..., x_N are encoded frame by frame in sequential order.

Then, a differentiable quantizer is applied to ẑ_t to obtain the quantized features z̃_t and reduce redundant information in the data. After the entropy coding module, z̃_t is encoded into the bitstream that can be transmitted or stored. Notably, the extracted high-level semantics (i.e., location and class information) are also saved into the bitstream as overhead, which can be used to directly support downstream intelligent analysis and also to guide partial/specific bitstream decoding (i.e., partial/specific reconstruction). In summary, the quantized features z̃_t (which can be regarded as low-level content information) and the high-level features together constitute the semantically structured bitstream (SSB).

For the deployment of the semantically structured bitstream (SSB), instead of adapting the bitstream generation to each downstream intelligent task, we pre-define a common/general semantic bitstream layout. As shown in Fig. 3, we divide the bitstream into three groups: 1) a header that contains object spatial location and category information, 2) the i-frame bitstream that contains the information of the different objects, and 3) the p-frame bitstream that includes the motion clues/information of the video.
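As a rough illustration of this three-group layout (the container format and field names below are our own assumptions for exposition, not the actual SSVC bitstream syntax), the structured bitstream can be viewed as a header plus per-object i-frame substreams and per-frame motion substreams:

```python
# Hypothetical, simplified view of a semantically structured bitstream (SSB).
# Field names are illustrative only; the real SSVC bitstream syntax is not specified here.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SSBHeader:
    class_ids: List[int]                     # group 1: one class ID per detected object
    bboxes: List[Tuple[int, int, int, int]]  # group 1: (x1, y1, x2, y2) per object

@dataclass
class SSB:
    header: SSBHeader                        # high-level semantics (overhead)
    iframe_objects: List[bytes]              # group 2: one substream per object (plus background)
    pframe_motion: List[bytes] = field(default_factory=list)  # group 3: flow/residue per p-frame

def select_objects(ssb: SSB, wanted_class: int) -> List[bytes]:
    """Pick only the object substreams of one category, leaving the rest of the SSB untouched."""
    return [s for c, s in zip(ssb.header.class_ids, ssb.iframe_objects) if c == wanted_class]
```

The point of such a layout is that the header alone already answers "what is where", so a client can decide which object or motion substreams to fetch and decode.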
A. Intra-mode Coding

Intra-mode coding is designed for key frames, i.e., the i-frames of traditional codecs, and can be regarded as a kind of image-based semantic feature compression. Given a key frame image, i.e., the t-th frame x_t of a video clip X, it is first fed into two branches in parallel. One branch employs a feature extractor module to obtain a hidden feature z_t, which is semantics-unknown and contains the raw content information. The other branch leverages an object parsing technique, such as CenterNet [21], to extract high-level semantic features from the key frame x_t, namely object spatial location and category information. Such high-level features are not only deployed in the bitstream, but are also used to partition the encoded hidden feature z_t into different groups (i.e., different spatial areas) ẑ_t according to the different categories.

1) Object Parsing: Given the t-th i-frame, denoted as x_t ∈ R^{W×H×3}, of a video clip X, our goal is to extract semantic features from x_t, represented by the bounding box (a_{k1}, b_{k1}, a_{k2}, b_{k2}) and the class ID c_k of each object k. Following the method in [21], x_t is first fed into a deep layer aggregation (DLA) network [58] to predict a center point heatmap Ŷ ∈ [0, 1]^{W/R×H/R×C}, where R is the output stride and C is the number of predefined object categories. In Ŷ, a prediction of 1 corresponds to a predicted object center point, while a prediction of 0 corresponds to
the predicted background. Notably, the DLA network can be replaced with other fully-convolutional encoder-decoder networks, such as the stacked hourglass network [59], [60] or up-convolutional residual networks (ResNet) [61], [62]. Based on the predicted heatmap, a branch network is introduced to regress the sizes of all the objects in the image, Ŝ ∈ R^{W/R×H/R×2}. When the output stride R > 1, an additional branch is needed to predict a local offset Ô ∈ R^{W/R×H/R×2} that compensates the error caused by rounding, following [21].

During the training stage, the ground truth center point p ∈ R² is converted from the bounding box and further mapped to its low-resolution equivalent p̃ = ⌊p/R⌋. The ground truth center point is then splat onto a heatmap Y ∈ [0, 1]^{W/R×H/R×C} using a Gaussian kernel, as done in [60]. The ground truth object size is computed as s_k = (a_{k2} − a_{k1}, b_{k2} − b_{k1}). To optimize the center point heatmap prediction network, we use a penalty-reduced pixel-wise logistic regression with focal loss [63], following [21]:

L_k = −(1/N) Σ_{abc} { (1 − Ŷ_abc)^α log(Ŷ_abc),                 if Y_abc = 1;
                       (1 − Y_abc)^β (Ŷ_abc)^α log(1 − Ŷ_abc),   otherwise },   (2)

where α and β are hyper-parameters and N is the number of center points in the image.

The predictions of size and local offset are each learned by applying an L1 loss:

L_size = (1/N) Σ_{k=1}^{N} |Ŝ_{p̃_k} − s_k|;   (3)

L_off = (1/N) Σ_p |Ô_{p̃} − (p/R − p̃)|.   (4)

Therefore, the total loss function is the weighted sum of all the loss terms with weights {1, λ_size, λ_off}.
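For reference, the penalty-reduced focal loss of Eq. (2) can be written compactly as below (a sketch following [21], [63]; the tensor layout and the common defaults α = 2, β = 4 are assumptions, not values prescribed in this paper):

```python
import torch

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss of Eq. (2).
    pred, gt: heatmaps of shape (B, C, H/R, W/R); gt holds Gaussian-splatted centers."""
    pos = gt.eq(1).float()                    # exact ground-truth center points (Y_abc = 1)
    neg = 1.0 - pos                           # all other locations (penalty-reduced term)
    pred = pred.clamp(eps, 1.0 - eps)
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg
    num_centers = pos.sum().clamp(min=1.0)    # N in Eq. (2)
    return -(pos_loss.sum() + neg_loss.sum()) / num_centers
```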
In the inference stage, the peaks of the predicted heatmap are extracted independently for each category using a max pooling operation. Let P̂_c = {(â_i, b̂_i)}_{i=1}^{n} denote the set of n detected center points of class ID c. Combined with the predicted size Ŝ_{â_i,b̂_i} = (ŵ_i, ĥ_i) and the local offset Ô_{â_i,b̂_i} = (Δâ_i, Δb̂_i), the predicted bounding box can be represented as

(â_i + Δâ_i − ŵ_i/2,  b̂_i + Δb̂_i − ĥ_i/2,  â_i + Δâ_i + ŵ_i/2,  b̂_i + Δb̂_i + ĥ_i/2).   (5)
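The inference step described above (peak extraction by max pooling, then assembling boxes via Eq. (5)) can be sketched as follows; the top-k value and tensor layout are assumptions made for illustration, and the returned coordinates live on the 1/R heatmap grid:

```python
import torch
import torch.nn.functional as F

def decode_bboxes(heatmap, size, offset, k=100):
    """Turn predicted heatmap/size/offset maps into boxes via Eq. (5).
    heatmap: (C, H, W); size, offset: (2, H, W).  Returns (x1, y1, x2, y2, class, score)."""
    # A peak is a location that survives 3x3 max pooling unchanged.
    pooled = F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0]
    peaks = heatmap * (pooled == heatmap).float()
    k = min(k, peaks.numel())
    scores, idx = peaks.flatten().topk(k)
    C, H, W = heatmap.shape
    cls = idx // (H * W)
    ys, xs = (idx % (H * W)) // W, (idx % (H * W)) % W
    boxes = []
    for s, c, y, x in zip(scores, cls, ys, xs):
        dx, dy = offset[0, y, x], offset[1, y, x]    # local offset, cf. Eq. (4)
        w, h = size[0, y, x], size[1, y, x]          # regressed object size, cf. Eq. (3)
        cx, cy = x.float() + dx, y.float() + dy
        # multiply by the output stride R outside this function to get pixel coordinates
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, int(c), float(s)))
    return boxes
```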
2) Image Compression and Bitstream Deployment: The compression network for the i-frame x_t can be divided into two sub-networks, as in [53]. One is a core autoencoder (comprising the Encoder and Decoder modules); the other is a sub-network that contains a context model and a hyper-network (comprising the Hyper Encoder and Hyper Decoder modules), as shown in Fig. 4.

Specifically, the input x_t is first transformed into a latent representation y by the Encoder module. Then y is re-organized and quantized into {ŷ_ob1, ŷ_ob2, ..., ŷ_obK, ŷ_bg} according to the K pairs of spatial location and category information extracted from the object parsing branch, where ŷ_bg represents the latent representation of the background. The Arithmetic Encoder (AE) module then codes the symbols coming from the quantizer into a binary bitstream for each ŷ_obi to generate the semantically structured bitstream (SSB), which will be used for storage and transmission. Notably, the entropy encoding of the background ŷ_bg has been improved in order to minimize the duplicated region when encoding the background, as in [1]: we fill the inside of each object region with the pixels to the left of its border, and in entropy coding the duplicated parts are coded only once. The Arithmetic Decoder (AD) can transform the bitstream back into the latent representation, which can be used for image analysis tasks, and the Decoder can also reconstruct a partial image or the whole image from the SSB [1].

During the training stage, only the compression of the whole image is considered, following [14], [53]. The RDO problem in Equation 1 can then be further formulated as

R + λ·D = E_{x∼p_x}[−log₂ p_ŷ(⌊f(x)⌉)] + λ·E_{x∼p_x}[d(x, g(⌊f(x)⌉))],   (6)

where p_x is the (unknown) distribution of natural images, ⌊·⌉ denotes quantization, f(·) and g(·) denote the encoder and decoder respectively, p_ŷ(·) is a discrete entropy model used to estimate the rate by approximating the real marginal distribution of the latent, d(·) is the metric used to measure the distortion, such as mean squared error (MSE) or MS-SSIM, and λ is the Lagrange multiplier that determines the desired trade-off between rate and distortion.

To estimate the rate for optimization, following [14], [53], each latent element ŷ_i is modeled as a Gaussian convolved with a unit uniform distribution to ensure a good match between the actual discrete entropy and the continuous entropy model used during training. The distribution of the latent is then modeled by predicting its mean and scale parameters conditioned on the quantized hyperprior ẑ and the causal context of each latent element ŷ_{<i} (e.g., the latent elements to its left and above). The entropy model for the hyperprior is a non-parametric, fully factorized density model, as ẑ is shown to comprise only a very small percentage of the total bit-rate.
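A compact way to see this rate estimate is the standard Gaussian-plus-uniform bin probability used in [14], [53]; the sketch below assumes the mean and scale tensors have already been produced by the hyperprior and context model:

```python
import torch

def latent_rate_bits(y_hat, mu, sigma):
    """Estimated bits for quantized latents: -log2 of a Gaussian convolved with U(-0.5, 0.5),
    i.e., the probability mass of each unit-width quantization bin (cf. the rate term of Eq. (6))."""
    sigma = sigma.clamp(min=1e-6)
    gaussian = torch.distributions.Normal(mu, sigma)
    # P(y_hat) = CDF(y_hat + 0.5) - CDF(y_hat - 0.5)
    p = gaussian.cdf(y_hat + 0.5) - gaussian.cdf(y_hat - 0.5)
    return (-torch.log2(p.clamp(min=1e-9))).sum()
```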
In the inference stage, to generate the SSB, given the set of latents {ŷ_ob1, ŷ_ob2, ..., ŷ_obK, ŷ_bg} of a specific input image, the AE codes each of them individually based on its respective hyperprior ẑ_obk (or ẑ_bg) and causal context ŷ_{obk,<i} (or ŷ_{bg,<i}). Notably, in order to reduce the coding redundancy caused by the re-organization of the latent, we introduce two optimization strategies: 1) when objects overlap each other, their union is fed into the AE; 2) when coding ŷ_bg, each spatially discontinuous part is padded with the left boundary of the current discontinuous part, as in [1].

B. Inter-mode Coding

Inter-mode coding is designed for the remaining, continual frames. For the inter-mode coding of our SSVC, we focus on low-latency video streaming, which means all inter-frames are coded as p-frames. Given the previously decoded frame x̂_t (named the reference frame, following traditional codecs), the current frame x_{t+1} sequentially undergoes motion estimation and motion compensation with x̂_t as the reference frame.
Fig. 4: Image encoder pipeline. The Encoder, Hyper Encoder, AE, and quantization operation are needed in the image encoder, while the Context Model, Entropy Parameters, AD, Factorized Entropy Model, Decoder, and Hyper Decoder are needed in the image decoder to recover an image from the bitstream.
As a consequence, we can get the motion clues, i.e., the optical flow and content residue, from the encoded frames.

Fig. 5: The overall flowchart of our p-frame compression procedure in the inter coding mode.

We build our p-frame coding framework on recent learning-based video coding methods [64]. As shown in Fig. 5, the overall coding pipeline contains four basic components: Motion Estimation (ME), Motion Compression (MC), Motion Compensation (MCP), and Residual Compression (RC). We employ the optical flow network PWC-Net [65] as our ME network. The original output of PWC-Net is in a domain downscaled by a factor of 4, so we upsample it to the pixel domain using bilinear interpolation. For the compression of the optical flow (i.e., motion compression), we use the i-frame compression framework and simply change the number of input/output channels. The MCP module first warps the reference frame towards the current frame using the decoded optical flow and then refines the warped frame with a U-Net-like network.

Given the previously decoded frame x̂_{t−1} and the current frame x_t, the ME network generates the optical flow m_t. The MC network, which is similar to our image coding network, first non-linearly maps the optical flow m_t into quantized latent representations and then transforms them back into the reconstruction m̂_t. The latent representations are encoded into the bitstream by entropy coding. After reconstructing m̂_t, the reference frame x̂_{t−1} is first bilinearly warped towards the current frame and then refined with a processing network to obtain the motion compensation frame x̄_t. Finally, we compress the feature residual between x_t and x̄_t to remove the remaining spatial redundancy, using the RC network proposed in [64]. More details can be found in [64].
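Putting the four components together, one p-frame coding step proceeds roughly as below. The module handles (me_net, flow_codec, refine_net, residual_codec) are placeholders standing in for the PWC-Net-based ME network, the i-frame-style motion codec, the U-Net-like refinement, and the RC network of [64]; this is a schematic of the data flow, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Bilinearly warp a reference frame (B, 3, H, W) by an optical flow (B, 2, H, W)."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W) base coordinates
    coords = grid[None] + flow                                     # sampling position per pixel
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0                        # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

def encode_pframe(x_cur, x_ref, me_net, flow_codec, refine_net, residual_codec):
    flow = me_net(x_ref, x_cur)                     # Motion Estimation (ME)
    flow_bits, flow_hat = flow_codec(flow)          # Motion Compression (MC)
    x_warp = warp(x_ref, flow_hat)                  # Motion Compensation (MCP): warp ...
    x_bar = refine_net(x_warp, x_ref, flow_hat)     # ... then refine the warped frame
    res_bits, x_hat = residual_codec(x_cur, x_bar)  # Residual Compression (RC)
    return flow_bits + res_bits, x_hat              # p-frame rate and reconstruction
```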
The whole framework is end-to-end trainable. To better adaptively allocate bits between i-frames and p-frames, we optimize the whole model (including our i-frame coding network) with the rate-distortion loss of a GoP:

R + λD = (1/T) Σ_{t=1}^{T} R_t + λ·(1/T) Σ_{t=1}^{T} D(x_t, x̂_t),   (7)

where R_t denotes the rate, D(x_t, x̂_t) denotes the distortion, and T is the length of the GoP. The rate term for a p-frame consists of the rates of the optical flow and of the residual. Note that the optical flow and the residual are separately encoded into the bitstream by two encoder-decoder networks, and can therefore be independently decoded from the corresponding parts of the bitstream. In other words, the bitstreams of the motion information and of the content information are structured in our coding framework.

1) Training Procedure: It is difficult to train the whole model from scratch using the rate-distortion loss in Eq. (7). Thus, we separately pretrain the i-frame coding models (the intra-mode of SSVC) and the p-frame coding models (the inter-mode of SSVC). For the pretraining of our p-frame codec, we first fix the weights of the pretrained Motion Estimation (ME) network and then pretrain the Motion Compression (MC) network with the R-D loss of the compensation frame x̄_t: R_{t,m} + λ_m D(x_t, x̄_t), where R_{t,m} denotes the rate of the optical flow, D is measured using MSE, and λ_m is empirically set to 512. Later, the weights of the ME network are relaxed and we add the Residual Compression (RC) network for joint training. In the end, we jointly fine-tune both the i-frame and p-frame models with the proposed R-D loss in Eq. (7).

2) Bitstream Deployment: As mentioned before and shown in Fig. 5, the compressed optical flow m̂_t and residual r̂_t are separately encoded into the bitstream by two encoder-decoder networks, and can therefore be independently decoded from the corresponding parts of the SSB, which enables our SSVC to directly support more video tasks. For example, based only on the objects of the key frames (i-frames) and their corresponding motion clues (i.e., optical flows), terminal users can successfully conduct action recognition.
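As a usage example of this deployment (building on the hypothetical SSB container sketched earlier; the 'person' class index and the decoder handles are assumptions), an action-recognition client would touch only the person substreams of each i-frame and the flow substreams of the neighbouring p-frames:

```python
# Hypothetical client-side partial decoding for action recognition.
# decode_object / decode_flow stand in for the SSVC object and motion decoders;
# PERSON_ID is whatever class index the object parser assigns to "person".
PERSON_ID = 0

def fetch_action_inputs(ssb, decode_object, decode_flow):
    persons = [decode_object(s)                              # person regions of the i-frame only
               for c, s in zip(ssb.header.class_ids, ssb.iframe_objects)
               if c == PERSON_ID]
    flows = [decode_flow(m) for m in ssb.pframe_motion]      # optical flow; residues stay coded
    return persons, flows                                    # the rest of the SSB is never decoded
```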
Fig. 7: Quantitative compression performance comparison with the existing learning-based video frameworks and the mainstream
traditional video codecs. The curve of SSVC contains all the overheads including class ID and location.
B. Evaluation Metrics and Experimental Setup

We measure the quality of the reconstructed frames using both PSNR and MS-SSIM [71]. Bits per pixel (bpp) is used to measure the number of coding bits. Following the common evaluation setting in [42], the GoP sizes for the UVG dataset and the HEVC standard Common Test Sequences are set to 12 and 10, respectively. Most previous methods for learned video compression evaluate H.264/H.265 using the FFmpeg implementation, whose performance is much lower than that of the official implementation. In this paper, we evaluate H.265 and H.266 using the implementations of the standard reference software HM 16.21 [57] and VTM 8.0 [72], respectively. We would like to highlight that H.266 (VTM-8.0) is the latest mainstream video coding standard.
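For completeness, the quality and rate measures used here reduce to simple computations (a sketch only; the MS-SSIM implementation is omitted, and the frame arrays and bit counts are assumed inputs):

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR in dB between a reference frame and its reconstruction (uint8 arrays)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def bits_per_pixel(total_bits, width, height, num_frames):
    """bpp: total coded bits divided by the number of pixels in the sequence."""
    return total_bits / float(width * height * num_frames)
```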
D. Compression Performance Comparison and Analysis

1) Quantitative Analysis: We compare our model with many state-of-the-art video compression approaches, including learning-based coding frameworks and traditional video coding methods (e.g., H.264, H.265, and H.266). The compared learning-based video compression approaches include the p-frame based methods of [42], [74]–[76], the B-frame based methods of [43], [77], [78], and the transform-based method of [79]. Among them, [42], [74], [76]–[78] are optimized for MSE while [75], [79] are optimized for MS-SSIM.

The corresponding quantitative comparison results are shown in Fig. 7. We observe that 1) our proposed video coding framework significantly outperforms the existing learning-based video compression methods in both PSNR and MS-SSIM
Fig. 12: Performance comparison of compression efficiency and video object segmentation mIoU with two traditional video codecs, H.265 and H.266. Our SSVC performs worse than the other two when reconstructing all video frames, but much better in the setting of transferring only partial i-frames and all optical flows.

TABLE II: The compression performance (BD-rate, %) of different schemes for the video object segmentation task.

                                          vs. H.265    vs. H.266
SSVC (One i-frame each GoP)                -35.96       -34.37
SSVC (One i-frame each two GoPs)           -56.39       -54.28
SSVC (Only optical flow)                   -69.81       -69.29

much better than the traditional codecs, achieving a superior trade-off between compression efficiency and segmentation accuracy. 3) When only the optical flows are decoded for segmentation, which costs very little bitstream (i.e., the scheme SSVC (Only optical flow)), the bit-rate is very low while the segmentation performance is still satisfying.

Moreover, the coding backbone of SSVC still leaves large room for improvement, since learning-based video coding techniques are developing rapidly. Thus, we believe that the overall performance of SSVC w.r.t. the video object segmentation task could be further improved in at least two respects: 1) using a more advanced video coding backbone; 2) the estimation of optical flow in SSVC is optimized under the R-D objective constraint, which may not be consistent with the "true motion" of video objects [91], leading to inaccurate segmentation results. We leave these challenges for future work.

Table II reports the BD-rate savings for the video object segmentation task. We observe that, compared with the traditional codecs H.265 and H.266, the proposed SSVC variants all consistently achieve an obvious BD-rate saving.

VI. CONCLUSION

As a response to the emerging MPEG standardization efforts on VCM, in this paper we propose a learning-based semantically structured video coding (SSVC) framework, which formulates a new paradigm of video coding for human and machine visions. SSVC encodes video into a semantically structured bitstream (SSB), which includes both the static object semantics and the dynamic object motion clues.

REFERENCES

[1] S. Sun, T. He, and Z. Chen, "Semantic structured image coding framework for multiple intelligent applications," IEEE TCSVT, 2020.
[2] A. Liu, W. Lin, M. Paul, F. Zhang, and C. Deng, "Optimal compression plane for efficient video coding," IEEE TIP, vol. 20, no. 10, pp. 2788–2799, 2011.
[3] R. L. de Queiroz and P. A. Chou, "Motion-compensated compression of dynamic voxelized point clouds," IEEE TIP, vol. 26, no. 8, pp. 3886–3895, 2017.
[4] L.-H. Chen, C. G. Bampis, Z. Li, A. Norkin, and A. C. Bovik, "Proxiqa: A proxy approach to perceptual optimization of learned image compression," IEEE TIP, vol. 30, pp. 360–373, 2020.
[5] M. Li, K. Ma, J. You, D. Zhang, and W. Zuo, "Efficient and effective context-based convolutional entropy modeling for image compression," IEEE TIP, vol. 29, pp. 5900–5911, 2020.
[6] R. Forchheimer, "Differential transform coding: A new hybrid coding scheme," in Proc. Picture Coding Symp. (PCS-81), Montreal, Canada, 1981, pp. 15–16.
[7] V. V. C. V. Standard, "Quantization and entropy coding in the versatile video coding (vvc) standard."
[8] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the h.264/avc video coding standard," IEEE TCSVT, vol. 13, no. 7, pp. 560–576, 2003.
[9] M. Wang, K. N. Ngan, and L. Xu, "Efficient h.264/avc video coding with adaptive transforms," IEEE TMM, vol. 16, no. 4, pp. 933–946, 2014.
[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (hevc) standard," IEEE TCSVT, vol. 22, no. 12, pp. 1649–1668, 2012.
[11] W. Zhu, W. Ding, J. Xu, Y. Shi, and B. Yin, "Screen content coding based on hevc framework," IEEE TMM, vol. 16, no. 5, pp. 1316–1326, 2014.
[12] J. Zhang, S. Kwong, T. Zhao, and Z. Pan, "Ctu-level complexity control for high efficiency video coding," IEEE TMM, vol. 20, no. 1, pp. 29–44, 2017.
[13] S.-H. Tsang, Y.-L. Chan, W. Kuang, and W.-C. Siu, "Reduced-complexity intra block copy (intrabc) mode with early cu splitting and pruning for hevc screen content coding," IEEE TMM, vol. 21, no. 2, pp. 269–283, 2018.
[14] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in ICLR, 2017.
[15] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," ICLR, 2018.
[16] Z. Chen, T. He, X. Jin, and F. Wu, "Learning for video compression," IEEE TCSVT, vol. 30, no. 2, pp. 566–576, 2019.
[17] L.-Y. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, "Video coding for machines: A paradigm of collaborative compression and intelligent analytics," arXiv preprint arXiv:2001.03569, 2020.
[18] R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Towards image understanding from deep compression without decoding," arXiv preprint arXiv:1803.06131, 2018.
[19] C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl, "Compressed video action recognition," in CVPR, 2018, pp. 6026–6035.
[20] Z. Shou, X. Lin, Y. Kalantidis, L. Sevilla-Lara, M. Rohrbach, S.-F. Chang, and Z. Yan, "Dmc-net: Generating discriminative motion cues for fast compressed video action recognition," in CVPR, 2019, pp. 1268–1277.
[21] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[22] J. Mao and L. Yu, "Convolutional neural network based bi-prediction utilizing spatial and temporal information in video coding," IEEE TCSVT, vol. 30, no. 7, pp. 1856–1870, 2019.
[23] X. Zhang, S. Ma, S. Wang, X. Zhang, H. Sun, and W. Gao, "A joint compression scheme of video feature descriptors and visual content," IEEE TIP, vol. 26, no. 2, pp. 633–647, 2016.
[24] S. Ma, X. Zhang, S. Wang, X. Zhang, C. Jia, and S. Wang, "Joint feature and texture coding: Toward smart video representation via front-end intelligence," IEEE TCSVT, vol. 29, no. 10, pp. 3095–3105, 2018.
[25] J. A. Roese and G. S. Robinson, "Combined spatial and temporal coding of digital image sequences," in Efficient Transmission of Pictorial Information, vol. 66. International Society for Optics and Photonics, 1975, pp. 172–181.
[26] C. S. W. P. XV et al., "Video codec for audiovisual services at px64 kbit/s," Draft Revision of Recommendation H, vol. 261, 1989.
[27] I.-T. SG15, "Video coding for low bitrate communication," Draft ITU-T Rec. H.263, 1996.
[28] ISO/IEC 11172-2, "Information technology-coding of moving pictures and associated audio for digital storage media up to about 1.5 mbit/s: Part 2 video," 1993.
[29] I. JTC, "Coding of audio-visual objects-part 2: Visual," ISO/IEC 14496-2.
[30] I. ITU-T and I. JTC, "Generic coding of moving pictures and associated audio information-part 2: video," 1995.
[31] I. Telecom et al., "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264, 2003.
[32] V. Sze, M. Budagavi, and G. J. Sullivan, "High efficiency video coding (hevc)," in Integrated circuit and systems, algorithms and architectures. Springer, 2014, vol. 39, pp. 49–90.
[33] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in CVPR, 2017, pp. 5306–5314.
[34] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv preprint arXiv:1511.06085, 2015.
[35] H. Liu, T. Chen, Q. Shen, and Z. Ma, "Practical stacked non-local attention modules for image compression," in CVPR Workshops, 2019.
[36] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, "Non-local attention optimized deep image compression," arXiv preprint arXiv:1904.09757, 2019.
[37] J. Li, B. Li, J. Xu, and R. Xiong, "Efficient multiple-line-based intra prediction for hevc," IEEE TCSVT, vol. 28, no. 4, pp. 947–957, 2016.
[38] Y. Hu, W. Yang, M. Li, and J. Liu, "Progressive spatial recurrent neural network for intra prediction," IEEE TMM, vol. 21, no. 12, pp. 3024–3037, 2019.
[39] S. Xia, W. Yang, Y. Hu, S. Ma, and J. Liu, "A group variational transformation neural network for fractional interpolation of video coding," in 2018 Data Compression Conference. IEEE, 2018, pp. 127–136.
[40] N. Yan, D. Liu, H. Li, T. Xu, F. Wu, and B. Li, "Convolutional neural network-based invertible half-pixel interpolation filter for video coding," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 201–205.
[41] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced motion-compensated video coding with deep virtual reference frame generation," IEEE TIP, vol. 28, no. 10, pp. 4832–4844, 2019.
[42] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, "Dvc: An end-to-end deep video compression framework," in CVPR, 2019, pp. 11006–11015.
[43] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, "Learning for video compression with hierarchical quality and recurrent enhancement," in CVPR, 2020, pp. 6628–6637.
[44] L. Pu, M. W. Marcellin, A. Bilgin, and A. Ashok, "Image compression based on task-specific information," in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 4817–4821.
[45] Z. Liu, T. Liu, W. Wen, L. Jiang, J. Xu, Y. Wang, and G. Quan, "Deepn-jpeg: A deep neural network favorable jpeg-based image compression framework," in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
[46] D. Pau, G. Cordara, M. Bober, S. Paschalakis, K. Iwamoto, G. Francini, V. Chandrasekhar, and G. Takacs, "White paper on compact descriptors for visual search," International Organization For Standardization ISO/IEC JTC1/SC29/WG11, Tech. Rep., 2013.
[47] X. Li, J. Shi, and Z. Chen, "Task-driven semantic coding via reinforcement learning," IEEE TIP, vol. 30, pp. 6307–6320, 2021.
[48] Z. Chen and T. He, "Learning based facial image compression with semantic fidelity metric," Neurocomputing, vol. 338, pp. 16–25, 2019.
[49] S. R. Alvar and I. V. Bajić, "Pareto-optimal bit allocation for collaborative intelligence," IEEE TIP, vol. 30, pp. 3348–3361, 2021.
[50] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, "Video coding for machines: A paradigm of collaborative compression and intelligent analytics," IEEE TIP, vol. 29, pp. 8680–8695, 2020.
[51] X. Li, J. Shi, and Z. Chen, "Task-driven semantic coding via reinforcement learning," IEEE TIP, vol. 30, pp. 6307–6320, 2021.
[52] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Conditional probability models for deep image compression," in CVPR, 2018, pp. 4394–4402.
[53] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in NeurIPS, 2018, pp. 10771–10780.
[54] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 74–90, 1998.
[55] M. Kalluri, M. Jiang, N. Ling, J. Zheng, and P. Zhang, "Adaptive rd optimal sparse coding with quantization for image compression," IEEE TMM, vol. 21, no. 1, pp. 39–50, 2018.
[56] A. Ortega and K. Ramchandran, "Rate-distortion methods for image and video compression," IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 23–50, 1998.
[57] "HEVC official test model HM," https://hevc.hhi.fraunhofer.de.
[58] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, "Deep layer aggregation," in CVPR, 2018, pp. 2403–2412.
[59] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in ECCV. Springer, 2016, pp. 483–499.
[60] H. Law and J. Deng, "Cornernet: Detecting objects as paired keypoints," in ECCV, 2018, pp. 734–750.
[61] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in ECCV, 2018, pp. 466–481.
[62] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[63] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in ICCV, 2017, pp. 2980–2988.
[64] R. Feng, Y. Wu, Z. Guo, Z. Zhang, and Z. Chen, "Learned video compression with feature-level residuals," in CVPR Workshops, 2020, pp. 120–121.
[65] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume," in CVPR, 2018, pp. 8934–8943.
[66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in CVPR. IEEE, 2009, pp. 248–255.
[67] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, "Video enhancement with task-oriented flow," IJCV, vol. 127, no. 8, pp. 1106–1125, 2019.
[68] L. Davisson, "Rate distortion theory: A mathematical basis for data compression," IEEE Transactions on Communications, vol. 20, no. 6, pp. 1202–1202, 1972.
[69] Y. Blau and T. Michaeli, "Rethinking lossy compression: The rate-distortion-perception tradeoff," in International Conference on Machine Learning. PMLR, 2019, pp. 675–685.
[70] A. Mercat, M. Viitanen, and J. Vanne, "Uvg dataset: 50/120fps 4k sequences for video codec analysis and development," in ACM MM, 2020, pp. 297–302.
[71] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
[72] "VVC official test model VTM," https://jvet.hhi.fraunhofer.de.
[73] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[74] G. Lu, C. Cai, X. Zhang, L. Chen, W. Ouyang, D. Xu, and Z. Gao, "Content adaptive and error propagation aware deep video compression," arXiv preprint arXiv:2003.11282, 2020.
[75] H. Liu, L. Huang, M. Lu, T. Chen, Z. Ma et al., "Learned video compression via joint spatial-temporal correlation exploration," arXiv preprint arXiv:1912.06348, 2019.
[76] J. Lin, D. Liu, H. Li, and F. Wu, "M-lvc: Multiple frames prediction for learned video compression," arXiv preprint arXiv:2004.10290, 2020.
[77] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, "Video compression through image interpolation," in ECCV, 2018, pp. 416–431.
[78] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, "Neural inter-frame compression for video coding," in ICCV, 2019, pp. 6421–6429.