

Semantically Video Coding: Instill Static-Dynamic Clues into Structured Bitstream for AI Tasks

Xin Jin†, Ruoyu Feng†, Simeng Sun†, Runsen Feng, Tianyu He, Zhibo Chen∗, Senior Member, IEEE

arXiv:2201.10162v1 [cs.CV] 25 Jan 2022

†: The first three authors contributed equally to this work.
∗: Corresponding author.

Abstract—Traditional media coding schemes typically encode image/video into a semantics-unknown binary stream, which fails to directly support downstream intelligent tasks at the bitstream level. The Semantically Structured Image Coding (SSIC) framework [1] makes the first attempt to enable decoding-free or partial-decoding intelligent image analysis via a Semantically Structured Bitstream (SSB). However, SSIC only considers image coding, and its generated SSB only contains static object information. In this paper, we extend the idea of semantically structured coding to video coding and propose an advanced Semantically Structured Video Coding (SSVC) framework to support heterogeneous intelligent applications. Video signals contain richer dynamic motion information and exhibit more redundancy due to the similarity between adjacent frames. We therefore reformulate the semantically structured bitstream (SSB) in SSVC so that it contains both static object characteristics and dynamic motion clues. Specifically, we introduce optical flow to encode continuous motion information and reduce cross-frame redundancy via a predictive coding architecture; the optical flow and residual information are then reorganized into the SSB, which enables the proposed SSVC to adaptively support video-based downstream intelligent applications. Extensive experiments demonstrate that the proposed SSVC framework can directly support multiple intelligent tasks from only a partially decoded bitstream. This avoids full bitstream decompression and thus significantly saves bitrate/bandwidth for intelligent analytics. We verify this point on the tasks of image object detection, pose estimation, video action recognition, video object segmentation, etc.

Index Terms—video coding, semantically structured bitstream, media intelligent analytics.

Fig. 1: Motivation illustration: (a) traditional coding frameworks typically focus only on satisfying human perception; (b) a high-efficiency coding framework that serves the AI era should satisfy both human perception and machine analytics.

I. INTRODUCTION

THE multimedia industry, where image/video content plays a pivotal role, is developing rapidly [2]–[5]. The emergence of next-generation mobile networks will bring greater opportunities and challenges to the traditional multimedia industry. Meanwhile, as human society moves from informatization to intelligence, more and more image/video intelligent applications are being deployed in public safety monitoring, autonomous driving, remote machine control, Internet medical treatment, military defense, etc. In these scenarios, it is necessary to ensure the interpretability and interoperability of intelligent analysis results. Therefore, introducing new multimedia analytics paradigms for machine intelligence is attracting increasing attention and will become an important development trend of artificial intelligence in the future.

As a pivotal component of the modern multimedia industry, video occupies most of the communication bandwidth. To alleviate the transmission burden and save storage resources, video content is typically compressed into a compact representation (i.e., a bitstream) during transmission. Once the raw video content needs to be displayed to human eyes or employed in multimedia analysis applications, a reverse decoding operation is applied to recover the compact representation back to raw pixels. Specifically, traditional hybrid video coding (HVC) frameworks [6] have evolved over the decades by gradually integrating efficient transformation, quantization, and entropy coding into the compression procedure, achieving a trade-off via rate-distortion optimization [7]. In particular, the mainstream traditional video coding frameworks, e.g., MPEG-4 AVC/H.264 [8], [9], High Efficiency Video Coding (HEVC) [10]–[13], and the recently proposed Versatile Video Coding (VVC) [7], have achieved great success. They all aim to minimize the distortion between the raw and reconstructed image at a lower bit-rate cost.

To support the fast-developing intelligent tasks, these traditional codecs need to fully decode the compressed streams to reconstruct the raw data. However, the decoding procedure of existing HVC codecs inevitably incurs high computational complexity and long decoding time, which severely restricts the practical application of these coding schemes. For example, to support machine-learning-based multimedia algorithms, e.g., detection, recognition, and tracking, traditional coding frameworks typically must first decompress the entire encoded bitstream into the raw RGB/YUV format and then feed the decompressed video content into the downstream tasks for further analysis, which consumes a large amount of decoding computation when serving large-scale intelligent media applications at the edge server side (terminal). Therefore, as shown in Fig. 1, a high-efficiency coding framework should compress the captured media content into a more flexible format, which not only can be perceived by humans through

Fig. 2: Overview of our idea. The boxes marked in red denote the novel designs compared to existing codecs; these red solid boxes are the new designs that aim to achieve a semantically structured bitstream (SSB) for supporting machine analytics.

data decompression, but also can be directly handled by machine learning algorithms with much less decompression complexity, or even no decompression at all. This could significantly reduce the bitstream transmission and decoding cost. Recently, MPEG has also initiated the standard activity on video coding for machines (VCM)¹, which attempts to identify the opportunities and challenges of developing collaborative compression techniques for humans and machines, while establishing a new coding standard for both machine vision and hybrid machine-human vision scenarios.

¹ https://lists.aau.at/mailman/listinfo/mpeg-vcm

In recent years, with the fast development of deep-learning-based compression techniques [14]–[16], several studies have contributed new compression schemes that can directly support downstream intelligent tasks without decoding the whole compressed bitstream [17]. Torfason et al. [18] use a neural network to generate a compressed bitstream that directly serves as input to downstream tasks such as classification and segmentation, which bypasses decoding of the compressed representation into RGB space and thus reduces computational cost. Similar ideas can be found in the video-based schemes CoViAR [19] and DMC-Net [20], which directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no extra cost and to support the downstream action recognition task. However, these schemes are still task-specific, i.e., designed for a limited range of applications, and cannot meet higher and more general requirements for flexibility and efficiency, because they do not consider the intrinsic semantics contained in the compressed bitstream and cannot leverage different structured bitstreams for different tasks.

Sun et al. [1] first introduced the concept of semantically structured coding for image compression (abbreviated as SSIC) and generate a semantically structured bitstream (SSB), where each part of the bitstream represents a specific object and can be directly used for the aforementioned intelligent image tasks (including object detection, pose estimation, etc.). However, this work only considers the image coding framework; the generated SSB only contains the static object information of the image, which seriously limits its practical application to a larger scope, especially for video-based intelligent applications.

Therefore, in this paper, we extend the idea of semantically structured coding to the video coding perspective and propose a new paradigm of video coding for machines (VCM). Specifically, we introduce an advanced Semantically Structured Video Coding (SSVC) framework to directly support heterogeneous intelligent multimedia applications. As illustrated in Fig. 2, in order to generate a semantics-aware bitstream that can be directly used to support downstream intelligent analytics without full decoding, and that can also be reconstructed for human perception, the SSVC codec encodes the input media data (i.e., image or video) into a semantically structured bitstream (SSB). The SSB generally consists of hierarchical information: high-level features (e.g., the category and spatial location of each object detected in the video) and low-level features (e.g., the content information of each object or of the remaining background in the video).

In detail, for the video key frames, i.e., the intra-coded frames, we leverage a simple and effective object detection technique to help instantiate the static information of the SSB. We integrate the recently proposed CenterNet [21] in the encoder of our SSVC framework, which locates objects and obtains their corresponding class ID and spatial location (e.g., bounding box) information in the feature domain. We then re-organize such features to form a part of the SSB, from which specific objects can be reconstructed, and several image-based intelligent analysis tasks such as object classification/detection can achieve results similar to or better than those obtained on fully decompressed images.

Besides the static semantic information derived from the objects of i-frames, motion characteristics are also very important for video compression [22]. Therefore, our SSVC further integrates motion clues, represented by optical flow and content residues of the continuous video frames (i.e., p-frames, inter-coded using reference frames from the past), into the SSB to enable a wider range of video tasks. For example, for a video-based multimedia intelligent analysis task such as video action recognition, only the person-related content of the key frame (i.e., i-frame) and the corresponding optical flow of the continuous frames adjacent to the i-frame are required from the SSB, which further saves most of the decompression time and transmission bandwidth.

In short, our SSVC can directly support heterogeneous multimedia analysis tasks based only on partial data decoding, which is achieved by the semantics-structured coding process and bitstream deployment. We did not jointly train the entire compression framework and the subsequent AI application/task models, which differs from previous joint-training-based literature [19], [20], [23], [24].

Last but not least, we experimentally show how to leverage the semantically structured bitstream (SSB) to adaptively support downstream intelligent tasks in an adjustable manner (shown in Fig. 2 and Fig. 8). Such scalable functionality bridges the gap between high-efficiency video compression and machine vision support. In summary, the contributions of this paper can be summarized as follows:

• We propose an advanced Semantically Structured Video Coding (SSVC) framework to meet the fast-growing requirements of intelligent multimedia analysis. As a new paradigm for intelligent video compression, SSVC can support heterogeneous multimedia analysis tasks based only on partial data decoding, thus greatly reducing transmission bandwidth and storage resources. This is achieved by the semantics-structured coding process and bitstream deployment.
• In order to efficiently support video downstream tasks based on a partially decoded bitstream, we leverage optical flow and residuals to describe the dynamic temporal motion information of video and add them into the semantically structured bitstream (SSB), which goes beyond the image-based semantic compression framework [1] and makes our SSVC more general and scalable. We instantiate the SSVC framework with action recognition and video object segmentation as video-based embodiments to reveal the superiority of our coding scheme.
• Experimentally, we provide evidence that our SSVC is more flexible and scalable, and can better and adaptively support heterogeneous downstream intelligent tasks with the structured bitstream.

The remaining part of this paper is organized as follows: we introduce recent progress on video compression in Section II, including traditional hybrid coding pipelines and learning-based compression schemes. The details of the proposed Semantically Structured Video Coding (SSVC) framework are introduced in Section III. Comprehensive experiments are conducted and illustrated in Section IV and Section V. We conclude our coding architecture and discuss its future directions in Section VI.

II. RELATED WORK

In the current information age, fast-growing multimedia videos take up much of people's daily life. It is critical for humans to record, store, and view images/videos efficiently. Over the past decades, considerable academic and industrial effort has been devoted to video compression, which aims to achieve a good trade-off on the rate-distortion optimization problem. Below we first review the advances of traditional video coding frameworks, as well as the recently booming deep-learning-based compression schemes. Then, we introduce several task-driven coding schemes on visual data for machine vision in a general sense, revealing their growing importance.

A. Traditional Image/Video Coding Approach

Since the 1970s, the hybrid video coding architecture [25] has led the mainstream direction and occupied the major industry proportion over the following decades. Based on it, popular video coding standards have kept evolving through the development of the ITU-T and ISO/IEC standards, including H.261 [26], H.263 [27], MPEG-1 [28], MPEG-4 Visual [29], H.262/MPEG-2 Video [30], H.264/MPEG-4 Advanced Video Coding (AVC) [31], and the H.265/MPEG-H (Part 2) High Efficiency Video Coding (HEVC) [32] standards.

All these standards follow the block-based video coding strategy. Based on it, intra- and inter-prediction techniques are applied using the corresponding contexts, i.e., neighboring blocks in intra mode and reference frames in inter mode, to remove the temporal and spatial statistical redundancies of video frames. However, such hand-designed patterns, e.g., block partitioning, allow the prediction to cover only part of the context information, which limits modeling capacity. Besides, the block-wise prediction, along with transform and lossy quantization, causes blocking effects in the decoded results. As most traditional coding architectures generate the bitstream in units of the entire image or video, they cannot support partial bitstream decoding or partial object reconstruction for intelligent video analysis tasks. In addition, different from most codecs, MPEG-4 Visual decomposes video into video object planes (VOPs) and encodes them sequentially. Although MPEG-4 Visual attempts to achieve an object-oriented bitstream, its implementation must be based on accurate pixel-level segmentation results, which is difficult to achieve at the moment.

B. Learning Based Image/Video Coding Approach

The great success of deep learning techniques has significantly promoted the development of end-to-end learned video coding. Deep-learning-based coding methods do not rely on a partition scheme and support full-resolution coding, which naturally avoids blocking artifacts. Generally, a representative and powerful feature is extracted via a hierarchical network and jointly optimized with the reconstruction task for highly efficient coding. For instance, the early work [16] focuses on motion-predictive coding and proposes the concept of PixelMotionCNN (PMCNN) to model spatiotemporal coherence, so as to effectively perform predictive coding inside the learning network. Similarly, recurrent neural networks [33], [34], VAE generative models [14], [15] and non-local attention [35], [36] are employed to remove unnecessary spatial redundancy from the latent representations and make the features compact, thus leading to improved coding performance. In another mainstream branch, many efforts are devoted to improving the performance of neural-network-based video coding frameworks by increasing the prediction ability of deep networks for intra- [37], [38] or inter-prediction of the video codec [39]–[41]. Meanwhile, end-to-end learned video compression frameworks, such as DVC [42] and HLVC [43], further push compression efficiency along this route. All these methods can reduce the overall R-D cost on large-scale video data. Besides, as the entire coding pipeline is optimized in an end-to-end manner, it is also flexible to adapt the rate and distortion to accommodate a variety of end applications, e.g., machine vision analytics tasks.

However, the aforementioned learning-based compression methods typically fail to handle situations where tremendous volumes of data need to be processed and analyzed quickly, because they need to reconstruct the whole picture, and semantics-unknown data still constitutes the major part of the bitstream.

These methods therefore cannot fulfill the emerging requirement of real-time video content analytics when dealing with large-scale video data. Nevertheless, these learning-based coding frameworks do provide opportunities to develop effective VCM architectures that address these challenges.

C. Task-driven Image/Video Coding Approach

Deep learning algorithms have achieved great success in practical computer vision tasks, promoting the development of the media industry in recent years. Correspondingly, more and more captured videos are directly handled/analyzed by machine algorithms instead of being perceived by human eyes. Therefore, recent works tend to optimize their compression pipelines according to feedback derived from real task-driven applications rather than the original quality fidelity that aims to meet human perception.

Built upon traditional codecs, Pu et al. [44] apply a task-specific metric to JPEG 2000. Liu et al. [45] enhance the compression scheme for intelligent applications by minimizing the distortion of frequency features that are important to the neural network. CDVS [46] and CDVA [46] aim at efficiently supporting the search task through compact descriptors using both traditional and learning-based methods. Li et al. [47] implement semantic-aware bit allocation for the traditional codec based on reinforcement learning. On the other hand, based on learning-based coding schemes, Chen et al. [48] propose a learned facial image compression (LFIC) framework with a novel regionally adaptive pooling (RAP) module that can be automatically optimized according to gradient feedback from an integrated hybrid semantic fidelity metric. Alvar et al. [49] study a bit allocation method for feature compression in a multi-task problem. The traditional hybrid video coding frameworks and the aforementioned learning-based methods both encode the video into a binary stream without any semantic structure, which makes such bitstreams unable to directly support intelligent tasks. Zhang et al. [23] propose a hybrid content-plus-feature coding framework that jointly compresses the feature descriptors and the visual content; a novel rate-accuracy optimization technique is proposed to accurately estimate the retrieval performance degradation in feature coding. Duan et al. [50] explore the new video coding for machines (VCM) area by building a bridge between feature coding for machine vision and video coding for human vision; they propose a task-specific compression pipeline that jointly trains the feature compression and the intelligent tasks. Xin Li et al. [51] implement task-driven semantic coding via semantic bit allocation based on reinforcement learning (RL), designing semantic maps for different tasks to extract the pixel-wise semantic fidelity for videos/images. Ma et al. [24] provide a systematic overview and analysis of the joint feature and texture representation framework, which aims to smartly and coherently represent visual information with front-end intelligence in the scenario of video big data applications; they also envision a future joint coding scheme incorporating deep learning features and discuss future challenges toward seamless and unified joint compression.

These methods mostly adopt a joint training scheme to optimize not only the compression rate but also the accuracy of the AI applications. Such joint-training-based optimization lacks flexibility, because the compression encoder has to be adjusted according to each subsequent supported AI task. In actual applications, however, it is unrealistic to force the encoder and decoder to be coupled with the task: once such a coding framework is well trained on a specific task, it is difficult to adapt it to other vision tasks.

Therefore, in this paper, we present the concept of a semantically structured bitstream (SSB), which contains hierarchical information that represents the individual objects existing in the videos and can be directly used for various tasks. Note that the proposed SSVC video coding framework is an extension of our previous image coding pipeline SSIC reported in [1]. SSVC goes beyond SSIC [1] in at least four perspectives: 1) SSIC only supports image coding and can only be employed for image-based intelligent analytics. In contrast, our SSVC framework supports image and video coding together and can be directly employed for both image-based and video-based intelligent analysis. 2) The SSB of SSIC only contains static object information of the image, whereas the counterpart in our SSVC not only encodes the static object information contained in the key frames/images, but also integrates motion clues (i.e., optical flow between neighboring frames) and content residues into the bitstream. In general, the SSB of our SSVC framework combines static object semantics with dynamic motion clues between adjacent video frames. 3) Beyond SSIC, we replace the original backbone, which is based on a conditional probability model [52], with a stronger VAE-based backbone [53], thus improving the basic compression performance of SSVC. 4) In terms of validation experiments, we add more analysis and experiments on video-based intelligent tasks, revealing the superiority of SSVC compared to SSIC.

III. SEMANTICALLY STRUCTURED VIDEO CODING FRAMEWORK

In this section, we introduce the architecture of our proposed Semantically Structured Video Coding (SSVC) framework. The pipeline is illustrated in Fig. 3. In the following subsections, we first give an overview of the proposed SSVC framework and then introduce the details of each component sequentially.

Given a video X composed of multiple frames x1, x2, ..., xN, where N denotes the length of the video clip, the video compression process can be formulated as a rate-distortion (R-D) optimization (RDO) problem [54], [55]. The target of such RDO can be understood from two sides: one is minimizing the bit-rate cost, i.e., the transmission/storage cost, while not increasing the fidelity distortion; the other is minimizing the distortion at a fixed bit-rate. The Lagrangian formulation of this minimization problem is given by:

\min J, \quad \text{where } J = R + \lambda D, \qquad (1)

where the Lagrangian rate-distortion functional J is minimized for a particular value of the Lagrange multiplier λ.
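To make Eq. (1) concrete, the sketch below shows how such a Lagrangian objective is typically evaluated in a learned codec during training; it is an illustrative PyTorch-style snippet (the tensor shapes and the bpp computation are our assumptions), not the authors' implementation.

```python
import torch.nn.functional as F

def rd_loss(x, x_hat, bits, lam):
    """Lagrangian rate-distortion objective J = R + lambda * D, cf. Eq. (1).

    x, x_hat : original and reconstructed frames, shape (B, C, H, W)
    bits     : estimated total number of bits spent on the latents of the batch
    lam      : Lagrange multiplier trading rate against distortion
    """
    num_pixels = x.size(0) * x.size(2) * x.size(3)
    rate = bits / num_pixels              # R: bits per pixel
    distortion = F.mse_loss(x_hat, x)     # D: here measured with MSE
    return rate + lam * distortion        # J = R + lambda * D
```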

More details on Lagrangian optimization are discussed in [56]. We go beyond the traditional hybrid video coding framework by building our compression pipeline upon learning-based codecs, in which the modules can be jointly optimized to better implement R-D optimization. We attempt to define a video coding for machines (VCM) pipeline that bridges the gap between coding semantic features for machine vision tasks and coding pixel features for human vision.

Fig. 3: The overall pipeline of our proposed semantically structured video coding (SSVC) framework, with an illustration of example downstream intelligent task analytics.

As shown in Fig. 3, the compression process has two data encoding modes, intra-mode and inter-mode. Following traditional hybrid video coding codecs [57] and existing learning-based methods [42], [43], we first divide the original video sequence into groups of pictures (GoP). Let x = {x1, x2, ..., xt, xt+1, ..., xN} denote the frames of one GoP unit, where N is the GoP length. Assuming that xt has been coded by intra-mode, in the subsequent inter-mode coding process, xt+1, xt+2, ..., xN are encoded frame by frame in sequential order.

Then, a differentiable quantizer is applied to ẑt to obtain quantized features z̃t and reduce redundant information in the data. After the entropy coding module is applied, z̃t is encoded into the bitstream that can be transmitted or stored. Notably, the extracted high-level semantics (i.e., location and class information) are also saved into the bitstream as overhead, which can be used to directly support downstream intelligent analysis and also to guide partial/specific bitstream decoding (i.e., partial/specific reconstruction). In summary, the quantized features z̃t (which can be regarded as low-level content information) and the high-level features together constitute the semantically structured bitstream (SSB).

For the semantically structured bitstream (SSB) deployment, instead of adapting the bitstream generation to each downstream intelligent task, we pre-define a common/general semantic bitstream deployment. As shown in Fig. 3, we divide the bitstream into three groups: 1) a header that contains the object spatial location and category information, 2) the i-frame bitstream that contains the information of the different objects, and 3) the p-frame bitstream that includes the motion clues/information of the video.

A. Intra-mode Coding

Intra-mode coding is designed for key frames, i.e., the i-frames of traditional codecs, and can be regarded as a kind of image-based semantic feature compression. Given a key frame image, i.e., the t-th frame xt of a video clip X, it is first fed into two branches in parallel. One branch employs a feature extractor module to obtain a hidden feature zt, which is semantics-unknown and contains the raw content information. The other branch leverages an object parsing technique, such as CenterNet [21], to extract high-level semantic features from the key frame xt, containing the object spatial location and category information. Such high-level features are not only deployed in the bitstream, but are also used to partition the encoded hidden feature zt into different groups (i.e., different spatial areas) ẑt according to the different categories.
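The three-group deployment (header, i-frame object streams, p-frame motion streams) and the object-wise grouping described above can be pictured with a small container sketch. The type and field names below are purely illustrative assumptions, not the actual byte layout of SSVC.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectHeader:                  # high-level semantics stored as overhead
    class_id: int                    # category of the detected object
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) spatial location

@dataclass
class SemanticallyStructuredBitstream:
    # group 1: header with per-object class IDs and bounding boxes
    headers: List[ObjectHeader] = field(default_factory=list)
    # group 2: one independently decodable stream per object (plus background)
    iframe_object_streams: List[bytes] = field(default_factory=list)
    # group 3: per-p-frame motion clues, split into flow and residual streams
    pframe_flow_streams: List[bytes] = field(default_factory=list)
    pframe_residual_streams: List[bytes] = field(default_factory=list)
```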

1) Object Parsing: Given the t-th i-frame, denoted as xt ∈ R^{W×H×3}, of a video clip X, our goal is to extract semantic features from xt, represented by the bounding box (a_{k1}, b_{k1}, a_{k2}, b_{k2}) and the class ID c_k for each object k. Following the method in [21], xt is first fed into a deep layer aggregation (DLA) network [58] to predict a center-point heatmap Ŷ ∈ [0, 1]^{W/R×H/R×C}, where R is the output stride and C is the number of predefined object categories. In Ŷ, a prediction of 1 corresponds to a predicted object center point, while a prediction of 0 corresponds to predicted background. Notably, the DLA network can be replaced by other fully-convolutional encoder-decoder networks, such as the stacked hourglass network [59], [60] or up-convolutional residual networks (ResNet) [61], [62]. Based on the predicted heatmap, a branch network is introduced to regress the sizes of all objects in the image, Ŝ ∈ R^{W/R×H/R×2}. When the output stride R > 1, an additional branch is needed to predict a local offset Ô ∈ R^{W/R×H/R×2} to compensate for the error caused by rounding, following [21].

During the training stage, the ground-truth center point p ∈ R² is converted from the bounding box and further mapped to its low-resolution equivalent p̃ = ⌊p/R⌋. The ground-truth center point is then splat onto a heatmap Y ∈ [0, 1]^{W/R×H/R×C} using a Gaussian kernel, as done in [60]. The ground-truth object size is computed as s_k = (a_{k2} − a_{k1}, b_{k2} − b_{k1}). To optimize the center-point heatmap prediction network, we use a penalty-reduced pixel-wise logistic regression with focal loss [63], following [21]:

L_k = -\frac{1}{N}\sum_{abc}\begin{cases}(1-\hat{Y}_{abc})^{\alpha}\log(\hat{Y}_{abc}), & \text{if } Y_{abc}=1,\\ (1-Y_{abc})^{\beta}(\hat{Y}_{abc})^{\alpha}\log(1-\hat{Y}_{abc}), & \text{otherwise},\end{cases} \qquad (2)

where α and β are hyper-parameters and N is the number of center points in the image.

The predictions of size and local offset are learned by applying L1 losses:

L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{\tilde{p}_k} - s_k\right|, \qquad (3)

L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right)\right|. \qquad (4)

Therefore, the total loss function is the weighted sum of these loss terms with weights {1, λ_{size}, λ_{off}}.

In the inference stage, the peaks of the predicted heatmap are extracted independently for each category using a max-pooling operation. Let P̂_c = {(â_i, b̂_i)}_{i=1}^{n} denote the set of n detected center points of class ID c. Combined with the predicted size Ŝ_{â_i,b̂_i} = (ŵ_i, ĥ_i) and the local offset Ô_{â_i,b̂_i} = (Δâ_i, Δb̂_i), the predicted bounding box can be represented as:

\left(\hat{a}_i + \Delta\hat{a}_i - \hat{w}_i/2,\; \hat{b}_i + \Delta\hat{b}_i - \hat{h}_i/2,\; \hat{a}_i + \Delta\hat{a}_i + \hat{w}_i/2,\; \hat{b}_i + \Delta\hat{b}_i + \hat{h}_i/2\right). \qquad (5)
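For reference, the inference-time decoding of Eqs. (2)–(5) can be sketched as follows. It mirrors the standard CenterNet decoding recipe (peak extraction via max pooling, then combining the size and offset predictions into boxes); it is an illustrative PyTorch snippet, not the exact code used in the paper.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, size, offset, k=100):
    """Turn CenterNet-style predictions into bounding boxes, cf. Eq. (5).

    heatmap : (C, H, W) per-class center-point scores in [0, 1]
    size    : (2, H, W) predicted object width/height
    offset  : (2, H, W) predicted sub-pixel center offset
    """
    # keep only local maxima (a cheap NMS via 3x3 max pooling)
    pooled = F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0]
    heatmap = heatmap * (pooled == heatmap)

    scores, idx = heatmap.flatten().topk(k)
    C, H, W = heatmap.shape
    cls = idx // (H * W)
    y = (idx % (H * W)) // W
    x = idx % W

    w, h = size[0, y, x], size[1, y, x]
    dx, dy = offset[0, y, x], offset[1, y, x]
    cx, cy = x.float() + dx, y.float() + dy
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    return cls, scores, boxes   # boxes live on the R-times-downscaled feature grid
```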
2) Image Compression and Bitstream Deployment: The compression network for the i-frame xt can be divided into two sub-networks, as in [53]. One is a core autoencoder (including the Encoder and Decoder modules), and the other is a sub-network that contains a context model and a hyper-network (including the Hyper Encoder and Hyper Decoder modules), as shown in Fig. 4.

Specifically, the input xt is first transformed into a latent representation y by the Encoder module. Then y is re-organized and quantized as {ŷ_ob1, ŷ_ob2, ..., ŷ_obK, ŷ_bg} according to the K pairs of spatial location and category information extracted from the object parsing branch, where ŷ_bg represents the latent representation of the background. The Arithmetic Encoder (AE) module then codes the symbols coming from the quantizer into a binary bitstream for each ŷ_obi to generate the semantically structured bitstream (SSB), which is used for storage and transmission. Notably, the entropy encoding of the background ŷ_bg is improved to minimize the duplicated region when encoding the background, as in [1]: we fill the inside of each object region with the pixels at the left of its border, and in entropy coding the duplicated parts are coded only once. The Arithmetic Decoder (AD) can transform the bitstream back into the latent representation, which can be used for image analysis tasks, and the Decoder can also reconstruct a partial image or the whole image from the SSB [1].

During the training stage, only the compression of the whole image is considered, following [14], [53]. The RDO problem in Equation 1 can then be further formulated as:

R + \lambda \cdot D = \mathbb{E}_{x\sim p_x}\left[-\log_2 p_{\hat{y}}(\lfloor f(x)\rceil)\right] + \lambda \cdot \mathbb{E}_{x\sim p_x}\left[d\left(x, g(\lfloor f(x)\rceil)\right)\right], \qquad (6)

where p_x is the unknown distribution of natural images, ⌊·⌉ denotes quantization, f(·) and g(·) denote the encoder and decoder respectively, p_ŷ(·) is a discrete entropy model used to estimate the rate by approximating the real marginal distribution of the latent, d(·) is the metric used to measure the distortion, such as mean squared error (MSE) or MS-SSIM, and λ is the Lagrange multiplier that determines the desired trade-off between rate and distortion.

To estimate the rate for optimization, following [14], [53], each latent element ŷ_i is modeled as a Gaussian convolved with a unit uniform distribution to ensure a good match between the actual discrete entropy and the continuous entropy model used during training. The distribution of the latent is then modeled by predicting its mean and scale parameters conditioned on the quantized hyperprior ẑ and the causal context of each latent element ŷ_{<i} (e.g., the left and upper latent elements). The entropy model for the hyperprior is a non-parametric, fully factorized density model, as ẑ is shown to comprise only a very small percentage of the total bit-rate.

In the inference stage, to generate the SSB, given the set of latents from a specific input image {ŷ_ob1, ŷ_ob2, ..., ŷ_obK, ŷ_bg}, the AE codes each of them individually based on its respective hyperprior ẑ_obk (or ẑ_bg) and causal context ŷ_{obk,<i} (or ŷ_{bg,<i}). Notably, in order to reduce the coding redundancy caused by the re-organization of the latent, we introduce two optimization strategies: 1) when objects overlap each other, their union is fed into the AE; 2) when coding ŷ_bg, each spatially discontinuous part is padded with the left boundary of the current discontinuous part, as in [1].
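A minimal sketch of the latent re-organization step is given below: the quantized latent map is split into per-object groups and a background group before per-group entropy coding, and overlapping boxes are merged into their union as described above. The crop-based grouping and the simple pairwise union test are our own simplifications; the real encoder additionally handles the hyperprior, context model, and background padding.

```python
import torch

def group_latents(y_hat, boxes):
    """Split a quantized latent map into object groups and a background group.

    y_hat : (C, H, W) quantized latents of an i-frame
    boxes : list of (x1, y1, x2, y2) object boxes already mapped to the latent grid
    """
    # merge boxes that overlap, so the union region is coded only once
    merged = []
    for b in sorted(boxes):
        if merged and not (b[0] > merged[-1][2] or b[1] > merged[-1][3]
                           or b[2] < merged[-1][0] or b[3] < merged[-1][1]):
            m = merged.pop()
            b = (min(m[0], b[0]), min(m[1], b[1]), max(m[2], b[2]), max(m[3], b[3]))
        merged.append(b)

    groups, bg_mask = [], torch.ones_like(y_hat[:1], dtype=torch.bool)
    for (x1, y1, x2, y2) in merged:
        groups.append(y_hat[:, y1:y2, x1:x2])   # one entropy-coding unit per object region
        bg_mask[:, y1:y2, x1:x2] = False
    background = y_hat * bg_mask                # remaining latents form the background group
    return groups, background
```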

Fig. 4: Image encoder pipeline. The Encoder, Hyper Encoder, AE, and quantization operation are needed in the image encoder, while the Context Model, Entropy Parameters, AD, Factorized Entropy Model, Decoder, and Hyper Decoder are needed in the image decoder to recover an image from the bitstream.

B. Inter-mode Coding

Inter-mode coding is designed for the consecutive non-key frames. For the inter-mode coding of our SSVC, we focus on low-latency video streaming, which means all inter frames are coded as p-frames. Given the previously decoded frame x̂t (named the reference frame, following traditional codecs), the current frame xt+1 sequentially performs motion estimation and motion compensation with x̂t as the reference frame. As a consequence, we can obtain the motion clues, i.e., optical flow and content residue, from the encoded frames.

Fig. 5: The overall flowchart of our p-frame compression procedure in the inter coding mode.

We build our p-frame coding framework upon recent learning-based video coding methods [64]. As shown in Fig. 5, the overall coding pipeline contains four basic components: Motion Estimation (ME), Motion Compression (MC), Motion Compensation (MCP), and Residual Compression (RC). We employ the optical flow network PWC-Net [65] as our ME network. The original output of PWC-Net is in the 4× down-scaled domain, so we upsample it to the pixel domain using bilinear interpolation. For the compression of the optical flow (i.e., motion compression), we use the i-frame compression framework and simply change the number of input/output channels. The MCP module first warps the reference frame towards the current frame using the decoded optical flow and then refines the warped frame using a U-Net-like network.

Given the previously decoded frame x̂_{t−1} and the current frame x_t, the ME network generates the optical flow m_t. The MC network, which is similar to our image coding network, first non-linearly maps the optical flow m_t into quantized latent representations and then transforms them back into the reconstruction m̂_t. The latent representations are encoded into the bitstream by entropy coding. After reconstructing m̂_t, the reference frame x̂_{t−1} is first bilinearly warped towards the current frame and then refined with a processing network to obtain the motion-compensated frame x̄_t. Finally, we compress the feature residual between x_t and x̄_t to remove the remaining spatial redundancy, using the RC network proposed in [64]. More details can be found in [64].

The whole framework is end-to-end trainable. To adaptively allocate bits between i-frames and p-frames, we optimize the whole model (including our i-frame coding network) for the rate-distortion loss of a GoP:

R + \lambda D = \frac{1}{T}\sum_{t=1}^{T} R_t + \lambda\,\frac{1}{T}\sum_{t=1}^{T} D(x_t, \hat{x}_t), \qquad (7)

where R_t denotes the rate, D(x_t, x̂_t) denotes the distortion, and T is the length of the GoP. The rate term for a p-frame consists of the rate of the optical flow and of the residual. Note that the optical flow and the residual are separately encoded into the bitstream by two encoder-decoder networks, and can therefore be independently decoded from the corresponding parts of the bitstream. In other words, the bitstreams of motion information and content information are structured in our coding framework.
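Eq. (7) averages rate and distortion over a GoP during joint training; the following sketch shows how such a loss could be accumulated. The per-frame codec interface (returning a reconstruction and its bpp) is an assumption made for illustration.

```python
import torch.nn.functional as F

def gop_rd_loss(frames, codec, lam):
    """Rate-distortion loss of one GoP, cf. Eq. (7).

    frames : list of frames [x_1, ..., x_T]; the first one is coded in intra mode
    codec  : assumed model whose call returns (reconstruction, bpp) for a frame,
             given the previous reconstruction as reference for p-frames
    """
    rate, dist, ref = 0.0, 0.0, None
    for t, x in enumerate(frames):
        x_hat, bpp = codec(x, ref=ref, intra=(t == 0))
        rate = rate + bpp                    # optical-flow + residual bits for p-frames
        dist = dist + F.mse_loss(x_hat, x)   # D(x_t, x_hat_t)
        ref = x_hat                          # reconstruction becomes the next reference
    T = len(frames)
    return rate / T + lam * (dist / T)       # (1/T) sum R_t + lambda (1/T) sum D_t
```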
1) Training Procedure: It is difficult to train the whole model from scratch using the rate-distortion loss in Eq. (7). Thus, we separately pretrain the i-frame coding models (intra-mode of SSVC) and the p-frame coding models (inter-mode of SSVC). For the pretraining of our p-frame codec, we first fix the weights of the pretrained Motion Estimation (ME) network and then pretrain the Motion Compression (MC) network with the R-D loss of the motion-compensated frame x̄_t: R_{t,m} + λ_m D(x_t, x̄_t), where R_{t,m} denotes the rate of the optical flow, D is measured using MSE, and λ_m is empirically set to 512. Later, the weights of the ME network are relaxed and we add the Residual Compression (RC) network for joint training. In the end, we jointly fine-tune both the i-frame and p-frame models with the proposed R-D loss in Eq. (7).

2) Bitstream Deployment: As mentioned before and shown in Fig. 5, the compressed optical flow m̂_t and residual r̂_t are separately encoded into the bitstream by two encoder-decoder networks, and can therefore be independently decoded from the corresponding parts of the SSB, which enables our SSVC to directly support more video tasks. For example, based only on the objects of the key frames (i-frames) and their corresponding motion clues (i.e., optical flows), terminal users can successfully conduct action recognition.
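Because motion and content occupy separately decodable parts of the SSB, a receiver that only needs action recognition can fetch a small subset of the stream. A sketch under the illustrative container structure assumed earlier (decode_object and decode_flow stand in for the real entropy decoders):

```python
def decode_for_action_recognition(ssb, person_class_id, decode_object, decode_flow):
    """Decode only what a two-stream recognizer needs: person regions of the
    i-frame plus the optical-flow streams of the adjacent p-frames."""
    person_crops = [
        decode_object(stream, header.bbox)
        for header, stream in zip(ssb.headers, ssb.iframe_object_streams)
        if header.class_id == person_class_id     # skip non-person objects
    ]
    flows = [decode_flow(stream) for stream in ssb.pframe_flow_streams]
    # residual streams stay untouched: no full-frame reconstruction is required
    return person_crops, flows
```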

Fig. 6: Qualitative compression performance comparison with existing compression frameworks. Per-example BPP/MS-SSIM (the Raw column of the figure is the uncompressed reference):
            AVC             HEVC            VVC             SSVC
Example 1   0.2054/0.9695   0.2765/0.9807   0.2370/0.9844   0.2273/0.9830
Example 2   0.6195/0.9719   0.6358/0.9805   0.5805/0.9823   0.5770/0.9867
Example 3   0.5995/0.9557   0.5573/0.9660   0.5575/0.9713   0.4905/0.9774

Fig. 7: Quantitative compression performance comparison with existing learning-based video frameworks and mainstream traditional video codecs. The curve of SSVC contains all the overheads, including class ID and location.

IV. EXPERIMENTS ON COMPRESSION PERFORMANCE

A. Datasets

To train our i-frame compression models, we use a subset of the ImageNet database [66]. To train the whole video compression models, we use the Vimeo-90k septuplet dataset [67], which consists of 89,800 video clips with diverse content. To report the rate-distortion (R-D) performance [68], [69], we evaluate our proposed method on the UVG dataset [70], which includes seven 1080p video sequences, and on the HEVC standard Common Test Sequences [10], known as Class B (1920×1080), Class C (832×480), Class D (416×240), and Class E (1280×720).

B. Evaluation Metrics and Experimental Setup

We measure the quality of reconstructed frames using both PSNR and MS-SSIM [71]. Bits per pixel (bpp) is used to measure the coding cost. Following the common evaluation setting in [42], the GoP sizes for the UVG dataset and the HEVC standard Common Test Sequences are set to 12 and 10, respectively. Most previous methods for learned video compression evaluate H.264/H.265 using the FFmpeg implementation, whose performance is much lower than that of the official implementations. In this paper, we evaluate H.265 and H.266 using the standard reference software HM 16.21 [57] and VTM 8.0 [72], respectively. We would like to highlight that H.266 [VTM-8.0] is the latest mainstream video coding standard.

C. Implementation Details

We train four models optimized for MSE with different λ values (256, 512, 1024, 2048), and four models optimized for MS-SSIM with λ values (4, 8, 16, 32). That is, we experimentally obtain four bit-rate points for codec evaluation. The GoP size is set to 6 during training. We use the Adam optimizer [73]. In the pretraining procedure, we randomly crop the training data into 128×128 images/video clips and set the learning rate to 5e-5. In the fine-tuning procedure, the crop size is 192×192 and the learning rate is reduced to 1e-5. The batch size is set to 8 and 6 for the two procedures, respectively.
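For quick reference, the training setup of this subsection can be summarized in a single configuration sketch; the key names are ours, while the values are the ones reported above.

```python
TRAIN_CONFIG = {
    "lambda_mse":    [256, 512, 1024, 2048],   # four models optimized for MSE
    "lambda_msssim": [4, 8, 16, 32],           # four models optimized for MS-SSIM
    "gop_size": 6,
    "optimizer": "Adam",
    "pretrain": {"crop": 128, "lr": 5e-5, "batch_size": 8},
    "finetune": {"crop": 192, "lr": 1e-5, "batch_size": 6},
}
```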
D. Compression Performance Comparison and Analysis

1) Quantitative Analysis: We compare our model with many state-of-the-art video compression approaches, including learning-based coding frameworks and traditional video coding methods (e.g., H.264, H.265 and H.266). The compared learning-based video compression approaches include the p-frame based methods of [42], [74]–[76], the B-frame based methods of [43], [77], [78], and the transform-based method [79]. Among them, [42], [74], [76]–[78] are optimized for MSE and [75], [79] are optimized for MS-SSIM.

Fig. 8: We use the task of action recognition as an example to illustrate how to adaptively support downstream tasks based on the proposed SSVC framework (partially decoded i-frames, optical flows, and residuals).

Fig. 9: Performance comparison of compression efficiency and detection accuracy with the traditional video codec H.266. Our SSVC framework does not need to decode all the streams to perform object detection well, which gives it higher decoding efficiency while achieving satisfactory detection performance.

The corresponding quantitative comparison results are shown in Fig. 7, from which we make three observations. 1) Our proposed video coding framework significantly outperforms the existing learning-based video compression methods in both PSNR and MS-SSIM. 2) As an end-to-end learning-based codec, our coding framework outperforms the most mainstream traditional hybrid coding framework, H.266 [VTM-8.0], in terms of MS-SSIM. 3) Compared to H.265 [HM-16.21], our codec optimized with MSE provides competitive results in PSNR and better results in MS-SSIM. We attribute this to the fact that autoencoder-based compression modules are easy to train with MS-SSIM as an objective; the same trend has been observed in other learning-based codecs [15], [53].

2) Qualitative Analysis: Qualitative comparisons with the existing codecs are shown in Fig. 6, where we deliberately enlarge some regions with complex texture for clearer comparison. We can easily observe that the quality of the reconstruction produced by our compression framework is much better than that of AVC (H.264), and comparable with H.265 or H.266.

V. EXPERIMENTS ON SUPPORTING INTELLIGENT APPLICATIONS WITH PARTIAL BITSTREAM

It is important to show how to leverage the generated semantically structured bitstream (SSB) of our SSVC framework to adaptively support downstream intelligent machine tasks. Experimentally, in Fig. 8, we take the task of video action recognition as an example. Several i-frames (or partially decoded i-frames) and several optical flows already meet the needs of most computer vision tasks, and their number can be adjusted according to the performance-rate trade-off. The whole video can also be reconstructed completely for human eyes if needed. Such adjustable decompression helps achieve an optimal balance between video compression efficiency and the support of intelligent applications. Considering the practical industrial value and wide application prospects, we take multiple representative heterogeneous machine tasks, including image-based object detection and pose estimation as well as video-based action recognition and object segmentation, to validate the superiority and scalability of the proposed semantically structured video coding (SSVC) framework and the corresponding semantically structured bitstream (SSB).

A. Image-based Downstream Task Evaluation

1) Dataset and Implementation Details: We evaluate our semantically structured video coding (SSVC) framework on object detection, using COCO2014 [80] as the evaluation dataset. COCO2014 contains 82,783 samples for training and 40,504 samples for validation, covering 80 classes/categories, and each sample contains at least one object. Here we compare the performance of different coding frameworks on the minival set, a 5,000-sample subset selected from the validation set.

The intra codec of SSVC is implemented with PyTorch. The model is trained on the ImageNet dataset with the Adam optimizer and a learning rate of 1e-5 for 1.5 million iterations. During training, we randomly crop the input image into 256×256 patches and use a batch size of 8. When testing, the input is padded into an image whose length and width are both multiples of 64.

2) Results of Detection: In this subsection, we evaluate the performance of our SSVC codec on the object detection task. We compare with the mainstream traditional codec, the H.266 intra codec. For evaluating both our framework and H.266, the officially released object detection network CenterNet [81] with an hourglass (HG) backbone is adopted as the task network. For the VVC codec, we test the performance with QP set to 27, 32, 37 and 42. Correspondingly, we test the performance of our SSVC at four different bit-rates, where λ is set to 192, 512, 786 and 1024. The results are shown in Fig. 9; the detection performance is evaluated by mAP averaged over intersection-over-union (IoU) thresholds of 0.50:0.95. We have the following observation: thanks to the object detection network included in the encoder of our SSVC intra-mode codec, the object detection results (i.e., bounding box positions) are coded as part of the semantically structured bitstream (SSB). Therefore, our SSVC framework can directly perform object detection without decoding the whole bitstream, which gives our framework higher decoding efficiency and promises better object detection performance. Note that the execution of the object detection task with our semantically structured bitstream is a special case: the detection result can be decompressed directly from the header, and this part is independent of the compression rate. This is because the IDs and bounding boxes happen to be compressed losslessly and stored in the header, having been extracted on the encoder side for the decomposition and recombination of the latent features. Therefore, the dot for SSVC in Fig. 9 represents the BPP of the header and the accuracy of the IDs and bounding boxes already present in the header.
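Since the class IDs and bounding boxes are carried losslessly in the SSB header, object detection reduces to parsing that header; a sketch using the illustrative header structure from Section III (not the real byte layout):

```python
def detections_from_header(ssb):
    """Read object detection results straight from the SSB header, without
    entropy-decoding any latents or reconstructing pixels."""
    return [(h.class_id, h.bbox) for h in ssb.headers]
```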

Fig. 10: Performance comparison of compression efficiency and pose estimation accuracy with the traditional video codec H.266. Our SSVC framework achieves higher decoding efficiency with partial decompression and promises better pose estimation performance at the same time.

3) Results of Pose Estimation: Based on the COCO2014 dataset, we further take the pose estimation task to indicate the superiority of the proposed SSVC coding framework. Specifically, SSVC supports partial bitstream decompression, and we take these partially decompressed images to conduct pose estimation. Similarly, we take the VVC (H.266) codec as the anchor for comparison. The pose estimation network is the stacked hourglass network [59]. The QP setting is consistent with that of the object detection task. During training, we omit all data augmentation techniques and simply train the models from scratch on the original COCO2014 dataset. We use the RMSprop optimizer with the learning rate set to 0.0025. All the decompressed images are resized to 256×256.

As described in the Intra-mode Coding part of the methodology section, the high-level information (ID and bounding box) is stored in the header and the corresponding low-level features are stored in our semantically structured bitstream (SSB) as well, so we can search the entire bitstream for the parts related to persons. The partial bitstream is first entropy-decoded into latent features and fed into the decoder to obtain a pixel-level reconstruction. The partially decompressed images, in which almost only persons are included, can then be fed into the pose estimation task. In the inference stage, we take PCK (percentage of correct keypoints) as the metric to evaluate the performance of the different schemes. The results are shown in Fig. 10: the proposed SSVC framework greatly improves the coding efficiency. That is because our SSVC framework supports partially decoding only the task-specific regions (i.e., the regions containing the human body skeleton), which saves a large transmission bit cost in comparison with the fully decompressed VVC.

B. Video-based Downstream Task Evaluation

Besides image-based downstream tasks, our SSVC can also directly support heterogeneous video-based downstream intelligent tasks with the dynamic motion information included in the SSB. To prove this, we use two classic/representative video tasks, video action recognition and video object segmentation, to evaluate SSVC.

1) Dataset and Implementation Details: We evaluate our semantically structured video coding (SSVC) framework for video-based action recognition on the widely used UCF-101 dataset [82], which contains 13,320 video clips (mostly shorter than 10 seconds) covering 101 action categories. Each video clip is annotated with exactly one action label.

Since the original videos of UCF-101 are all in AVI format, we first utilize the FFmpeg tool to extract frames (i.e., RGB images) from the raw videos. We then leverage PWC-Net [65] to generate the corresponding optical flow for each frame. Following the previous temporal segment action recognition network TSN [83], we train two independent CNNs for the RGB images and the optical flow, respectively. The backbone of both streams is ResNet-152 [62]. The entire model, i.e., TSN, is first pretrained on the ImageNet dataset and then fine-tuned on UCF-101 using the Adam [73] optimizer with a batch size of 64. The learning rate starts from 0.001 and drops by a factor of 0.1 when the accuracy has stopped rising and this trend has persisted for several training epochs. We use color jittering and random cropping for data augmentation. In the inference phase, we take the average accuracy score of five tests as the final action recognition result.

When comparing with the traditional codecs, where optical flows are used to support some AI applications, all frames need to be reconstructed first before the optical flow can be estimated. For fairness, we use the same tool (PWC-Net) to estimate optical flow for both the traditional codecs and SSVC, and all the predicted results are inferred by the same model trained on uncompressed optical flow.

For the task of video object segmentation, we evaluate our proposed SSVC video coding framework on the DAVIS-16 [84] dataset, which contains 50 high-resolution videos with 3,455 frames in total, of which 30 sequences are for training and 20 sequences for online validation. The task of video object segmentation requires segmenting all the object instances from the background for each video sequence. Note that the segmentation result/mask for the first frame of each video sequence is provided in the setting of this task.

We use the OSVOS network [85] as the segmentation backbone, which is first pretrained on ImageNet [86] and then trained on the DAVIS training set. Finally, for each test sequence, OSVOS is fine-tuned on the provided segmentation result of the first frame.

During the evaluation, all the video frames reconstructed by the traditional codecs are directly sent into the OSVOS network. For the proposed SSVC framework, in each GoP we only need to decode the key frames (i.e., i-frames) of the video as input to generate the corresponding binary masks; we then refine these masks according to the decoded optical flows of the p-frames through a simple mask refinement module based on U-Net [87], which is inspired by [88]; please refer to [88] for more details.

TABLE I: The compression performance (BD-rate, %) of


different schemes. Note that we use the average bit-rate that
calculated over the entire action recognition dataset (i.e., UCF-
101) to get this BD-rate saving.
vs. H.264 vs. H.265
SSVC (45% i-frame + 10% flow) -69.31 -40.41
SSVC (15% i-frame + 10% flow) -88.12 -76.75
SSVC (5% i-frame + 10% flow) -94.06 -84.25

Fig. 11: Performance comparison of compression efficiency & action recognition accuracy against two traditional video codecs, H.264 and H.265. With only partial bitstream decoding, our coding framework achieves better compression efficiency and better recognition accuracy at the same time. Besides, the computational cost of fully reconstructing the video at the decoder is also reduced.
2) Results of Action Recognition: The proposed SSVC coding framework can directly provide both the i-frames and the optical flows of p-frames without full decompression. Thus, when evaluating the compression performance of the proposed SSVC codec on the action recognition task, the RGB stream of TSN takes the partially decoded i-frames (e.g., 5%, 15%, and 45% of the i-frames) as input, and the optical-flow stream of TSN takes the partially decoded optical flow (e.g., 10% of the optical flow) as input. We will show in the following sections that such a design achieves a better trade-off between the decompression computational cost and the action recognition accuracy.

For the traditional hybrid video coding framework, it is infeasible to reconstruct only partial i-frames and optical flow, because the encoded bitstreams are semantic-unknown. To get a satisfactory action recognition result, we have to decode the whole video sequence first and then estimate the optical flow using PWC-Net, a process that is both time-consuming and bandwidth-wasting.

To demonstrate the superiority of our framework in terms of compression efficiency and recognition accuracy, we employ two popular traditional video codecs, i.e., H.264 and H.265, as competitors. We test the performance under several QP settings to make the comparison curves easy to read.
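The sketch below illustrates this partial-input evaluation protocol under our own assumptions: a fraction of the decoded i-frames feeds the RGB stream, a fraction of the decoded optical flows feeds the motion stream, and the two class-score vectors are fused by a weighted average. The subsampling rule, the fusion weights, and the rgb_stream/flow_stream callables are illustrative placeholders; the actual TSN uses segment-based sampling.

import numpy as np

def subsample(items, fraction):
    """Uniformly keep roughly `fraction` of the decoded items (i-frames or flows)."""
    n = max(1, int(round(len(items) * fraction)))
    idx = np.linspace(0, len(items) - 1, n).astype(int)
    return [items[i] for i in idx]

def two_stream_prediction(i_frames, flows, rgb_stream, flow_stream,
                          frame_ratio=0.15, flow_ratio=0.10,
                          rgb_weight=1.0, flow_weight=1.5):
    """Late fusion of the two TSN streams fed with partially decoded inputs.
    rgb_stream / flow_stream are assumed callables returning class-score vectors."""
    rgb_scores = np.mean([rgb_stream(f) for f in subsample(i_frames, frame_ratio)], axis=0)
    flow_scores = np.mean([flow_stream(f) for f in subsample(flows, flow_ratio)], axis=0)
    fused = rgb_weight * rgb_scores + flow_weight * flow_scores
    return int(np.argmax(fused))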
The performance comparison of compression efficiency & action recognition accuracy is shown in Fig. 11, from which we can observe that: 1) Compared to the fully decompressed video codecs H.264 and H.265, directly performing action recognition on the partial bitstream (e.g., 5%/15%/45% i-frames and 10% optical flow) of the proposed SSVC framework achieves better compression efficiency and better action recognition accuracy at the same time. 2) As the bit-rate increases, the action recognition performance of H.264 and H.265 gradually converges. We attribute this to the fact that the raw videos of the UCF-101 dataset are themselves lossy; once the bit-rate is high enough, the quality of the reconstructed videos cannot be further improved. 3) For the proposed SSVC coding framework, the action recognition performance of SSVC (15% i-frames + 10% optical flow) and SSVC (45% i-frames + 10% optical flow) is very similar. Our analysis is that, once enough content information has been derived from the i-frames, further increasing it no longer consistently affects the action recognition task, since the motion clues (i.e., the optical flow) become more important at high bit-rates.
Besides, since the rate-distortion performance is the key performance indicator for video coding, the widely accepted BD-rate metric [90] is also adopted in our experiment. Note that here the “distortion” metric is replaced with “recognition accuracy”, so the BD-rate measures the equivalent bit-rate change under the same recognition accuracy (negative means performance improvement, and lower is better).

Table I reports the BD-rate results on the entire action recognition dataset (i.e., UCF-101). It can be seen that, compared with the traditional codecs H.264 and H.265, the proposed SSVC achieves on average over 40% BD-rate saving. This is a quite significant improvement in the video coding research area, since the traditional hybrid coding framework typically achieves about 50% BD-rate saving only once every 10 years [10].
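For reference, BD-rate numbers of this kind can be reproduced with the standard Bjontegaard procedure [90], simply substituting recognition accuracy for the usual distortion axis. The NumPy sketch below follows the common cubic-fit formulation and is only illustrative; the rate/accuracy points in the example call are made up.

import numpy as np

def bd_rate(rate_anchor, acc_anchor, rate_test, acc_test):
    """Bjontegaard-delta rate with recognition accuracy playing the role of the
    quality metric: average bit-rate change (%) of the test codec w.r.t. the
    anchor at equal accuracy. Negative values mean bit-rate savings."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # third-order polynomial fit of log-rate as a function of accuracy
    p_a = np.polyfit(acc_anchor, lr_a, 3)
    p_t = np.polyfit(acc_test, lr_t, 3)
    lo = max(min(acc_anchor), min(acc_test))
    hi = min(max(acc_anchor), max(acc_test))
    # integrate both fits over the overlapping accuracy interval
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# illustrative call with made-up points:
# bd_rate([500, 1000, 2000, 4000], [80.1, 84.3, 86.0, 86.9],
#         [300,  600, 1200, 2400], [80.5, 84.6, 86.2, 87.0])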

3) Results of Video Object Segmentation: Since the SSVC framework can separately decode out both the i-frames and the optical flows of p-frames without full decompression, we set up multiple cases to comprehensively evaluate the compression & segmentation performance of our SSVC framework. SSVC (One i-frame each GoP): use only one i-frame plus all the optical flows of the p-frames in each GoP to perform segmentation. SSVC (One i-frame every two GoPs): use only one i-frame in every two GoPs plus all the optical flows of the p-frames to perform segmentation. SSVC (Only optical flow): use only the optical flows of the p-frames to perform segmentation. Note that we can conduct object segmentation using only the optical flow because the segmentation result/mask of the first frame of each video sequence is provided.
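The three settings above boil down to different decode plans over the structured bitstream. The helper below sketches, under an assumed GoP indexing of our own, which units would be requested from the SSB in each case; it is illustrative only.

def frames_to_decode(num_gops, gop_size, variant):
    """Return, per GoP, which units to request from the SSB under the three
    evaluation settings described above (indices are illustrative)."""
    plan = []
    for g in range(num_gops):
        if variant == "one_iframe_each_gop":
            iframes = [g * gop_size]                      # the key frame of every GoP
        elif variant == "one_iframe_every_two_gops":
            iframes = [g * gop_size] if g % 2 == 0 else []
        elif variant == "only_optical_flow":
            iframes = []                                  # no i-frame payloads at all
        else:
            raise ValueError(variant)
        flows = list(range(g * gop_size + 1, (g + 1) * gop_size))  # flows of all p-frames
        plan.append({"i_frames": iframes, "flows": flows})
    return plan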
Fig. 12: Performance comparison of compression efficiency & video object segmentation mIoU against two traditional video codecs, H.265 and H.266. Our SSVC performs worse than the other two when reconstructing all video frames, but much better in the settings that transfer only partial i-frames and all optical flows.

The results are shown in Fig. 12, from which we have the following observations: 1) When directly using the completely reconstructed video frames to perform object segmentation, H.266 is slightly better than H.265. 2) When decoding out only partial i-frames and all optical flows (i.e., the bottom three SSVC variants shown in Fig. 12), the proposed SSVC framework performs much better than the traditional codecs, achieving a superior trade-off between compression efficiency and segmentation accuracy. 3) When decoding out only the optical flows for segmentation, which costs only a small fraction of the bitstream (i.e., the SSVC (Only optical flow) scheme), the bit-rate is very low while the segmentation performance is still satisfying.
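For completeness, the segmentation quality plotted in Fig. 12 is the mean intersection-over-union of the predicted and ground-truth binary masks. A minimal sketch of this measure is given below; protocol details of the DAVIS evaluation (e.g., which frames are excluded) are omitted.

import numpy as np

def sequence_miou(pred_masks, gt_masks, eps=1e-6):
    """Mean intersection-over-union of binary masks over one sequence."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append((inter + eps) / (union + eps))
    return float(np.mean(ious))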
Moreover, the coding backbone of SSVC still leaves large room for improvement, since learning-based video coding techniques are developing rapidly. Thus, we believe the overall performance of SSVC on the video object segmentation task could be further improved from at least two aspects: 1) using a more advanced video coding backbone; 2) addressing the fact that the optical flow estimation in SSVC is optimized with the R-D objective constraint, which may not be consistent with the “true motion” of the video objects [91] and can therefore lead to inaccurate segmentation results. We leave these challenges for future work.

Table II reports the BD-rate saving for the video object segmentation task. We observe that, compared with the traditional codecs H.265 and H.266, the proposed SSVC variants all consistently achieve an obvious BD-rate saving.

TABLE II: The compression performance (BD-rate, %) of different schemes for the video object segmentation task.

                                      vs. H.265    vs. H.266
SSVC (One i-frame each GoP)            -35.96       -34.37
SSVC (One i-frame every two GoPs)      -56.39       -54.28
SSVC (Only optical flow)               -69.81       -69.29
VI. CONCLUSION

As a response to the emerging MPEG standardization effort VCM, in this paper we propose a learning-based Semantically Structured Video Coding (SSVC) framework, which formulates a new paradigm of video coding for human and machine vision. SSVC encodes video into a semantically structured bitstream (SSB) that includes both the static object semantic characteristics and the dynamic object motion clues. The proposed SSVC coding framework, with a well-designed SSB, has the capability of explicitly supporting heterogeneous intelligent multimedia analytics without full decompression. Extensive experiments on multiple benchmarks demonstrate that the proposed SSVC framework not only provides a basic compression performance comparable to mainstream video coding schemes, but also directly supports intelligent tasks with a large saving in computational cost.

REFERENCES

[1] S. Sun, T. He, and Z. Chen, “Semantic structured image coding framework for multiple intelligent applications,” IEEE TCSVT, 2020.
[2] A. Liu, W. Lin, M. Paul, F. Zhang, and C. Deng, “Optimal compression plane for efficient video coding,” IEEE TIP, vol. 20, no. 10, pp. 2788–2799, 2011.
[3] R. L. de Queiroz and P. A. Chou, “Motion-compensated compression of dynamic voxelized point clouds,” IEEE TIP, vol. 26, no. 8, pp. 3886–3895, 2017.
[4] L.-H. Chen, C. G. Bampis, Z. Li, A. Norkin, and A. C. Bovik, “Proxiqa: A proxy approach to perceptual optimization of learned image compression,” IEEE TIP, vol. 30, pp. 360–373, 2020.
[5] M. Li, K. Ma, J. You, D. Zhang, and W. Zuo, “Efficient and effective context-based convolutional entropy modeling for image compression,” IEEE TIP, vol. 29, pp. 5900–5911, 2020.
[6] R. Forchheimer, “Differential transform coding: A new hybrid coding scheme,” in Proc. Picture Coding Symp. (PCS-81), Montreal, Canada, 1981, pp. 15–16.
[7] V. V. C. V. Standard, “Quantization and entropy coding in the versatile video coding (vvc) standard.”
[8] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the h. 264/avc video coding standard,” IEEE TCSVT, vol. 13, no. 7, pp. 560–576, 2003.
[9] M. Wang, K. N. Ngan, and L. Xu, “Efficient h. 264/avc video coding with adaptive transforms,” IEEE TMM, vol. 16, no. 4, pp. 933–946, 2014.
[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE TCSVT, vol. 22, no. 12, pp. 1649–1668, 2012.
[11] W. Zhu, W. Ding, J. Xu, Y. Shi, and B. Yin, “Screen content coding based on hevc framework,” IEEE TMM, vol. 16, no. 5, pp. 1316–1326, 2014.
[12] J. Zhang, S. Kwong, T. Zhao, and Z. Pan, “Ctu-level complexity control for high efficiency video coding,” IEEE TMM, vol. 20, no. 1, pp. 29–44, 2017.
[13] S.-H. Tsang, Y.-L. Chan, W. Kuang, and W.-C. Siu, “Reduced-complexity intra block copy (intrabc) mode with early cu splitting and pruning for hevc screen content coding,” IEEE TMM, vol. 21, no. 2, pp. 269–283, 2018.
[14] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in ICLR, 2017.
[15] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” ICLR, 2018.
[16] Z. Chen, T. He, X. Jin, and F. Wu, “Learning for video compression,” IEEE TCSVT, vol. 30, no. 2, pp. 566–576, 2019.
[17] L.-Y. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” arXiv preprint arXiv:2001.03569, 2020.
[18] R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Towards image understanding from deep compression without decoding,” arXiv preprint arXiv:1803.06131, 2018.
[19] C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl, “Compressed video action recognition,” in CVPR, 2018, pp. 6026–6035.
[20] Z. Shou, X. Lin, Y. Kalantidis, L. Sevilla-Lara, M. Rohrbach, S.-F. Chang, and Z. Yan, “Dmc-net: Generating discriminative motion cues for fast compressed video action recognition,” in CVPR, 2019, pp. 1268–1277.
[21] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[22] J. Mao and L. Yu, “Convolutional neural network based bi-prediction utilizing spatial and temporal information in video coding,” IEEE TCSVT, vol. 30, no. 7, pp. 1856–1870, 2019.

[23] X. Zhang, S. Ma, S. Wang, X. Zhang, H. Sun, and W. Gao, “A joint compression scheme of video feature descriptors and visual content,” IEEE TIP, vol. 26, no. 2, pp. 633–647, 2016.
[24] S. Ma, X. Zhang, S. Wang, X. Zhang, C. Jia, and S. Wang, “Joint feature and texture coding: Toward smart video representation via front-end intelligence,” IEEE TCSVT, vol. 29, no. 10, pp. 3095–3105, 2018.
[25] J. A. Roese and G. S. Robinson, “Combined spatial and temporal coding of digital image sequences,” in Efficient Transmission of Pictorial Information, vol. 66. International Society for Optics and Photonics, 1975, pp. 172–181.
[26] C. S. W. P. XV et al., “Video codec for audiovisual services at px64 kbit/s,” Draft Revision of Recommendation H, vol. 261, 1989.
[27] I.-T. SG15, “Video coding for low bitrate communication,” Draft ITU-T Rec. H. 263, 1996.
[28] I. 11172-2, “Information technology-coding of moving pictures and associated audio for digital storage media up to about 1.5 mbit/s: Part 2 video,” 1993.
[29] I. JTC, “Coding of audio-visual objects-part 2: Visual,” ISO/IEC, pp. 14 496–2.
[30] I. ITU-T and I. JTC, “Generic coding of moving pictures and associated audio information-part 2: video,” 1995.
[31] I. Telecom et al., “Advanced video coding for generic audiovisual services,” ITU-T Recommendation H. 264, 2003.
[32] V. Sze, M. Budagavi, and G. J. Sullivan, “High efficiency video coding (hevc),” in Integrated circuit and systems, algorithms and architectures. Springer, 2014, vol. 39, pp. 49–90.
[33] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in CVPR, 2017, pp. 5306–5314.
[34] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv preprint arXiv:1511.06085, 2015.
[35] H. Liu, T. Chen, Q. Shen, and Z. Ma, “Practical stacked non-local attention modules for image compression,” in CVPR Workshops, 2019, p. 0.
[36] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, “Non-local attention optimized deep image compression,” arXiv preprint arXiv:1904.09757, 2019.
[37] J. Li, B. Li, J. Xu, and R. Xiong, “Efficient multiple-line-based intra prediction for hevc,” IEEE TCSVT, vol. 28, no. 4, pp. 947–957, 2016.
[38] Y. Hu, W. Yang, M. Li, and J. Liu, “Progressive spatial recurrent neural network for intra prediction,” IEEE TMM, vol. 21, no. 12, pp. 3024–3037, 2019.
[39] S. Xia, W. Yang, Y. Hu, S. Ma, and J. Liu, “A group variational transformation neural network for fractional interpolation of video coding,” in 2018 Data Compression Conference. IEEE, 2018, pp. 127–136.
[40] N. Yan, D. Liu, H. Li, T. Xu, F. Wu, and B. Li, “Convolutional neural network-based invertible half-pixel interpolation filter for video coding,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 201–205.
[41] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, “Enhanced motion-compensated video coding with deep virtual reference frame generation,” IEEE TIP, vol. 28, no. 10, pp. 4832–4844, 2019.
[42] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “Dvc: An end-to-end deep video compression framework,” in CVPR, 2019, pp. 11 006–11 015.
[43] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, “Learning for video compression with hierarchical quality and recurrent enhancement,” in CVPR, 2020, pp. 6628–6637.
[44] L. Pu, M. W. Marcellin, A. Bilgin, and A. Ashok, “Image compression based on task-specific information,” in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 4817–4821.
[45] Z. Liu, T. Liu, W. Wen, L. Jiang, J. Xu, Y. Wang, and G. Quan, “Deepn-jpeg: A deep neural network favorable jpeg-based image compression framework,” in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
[46] D. Pau, G. Cordara, M. Bober, S. Paschalakis, K. Iwamoto, G. Francini, V. Chandrasekhar, and G. Takacs, “White paper on compact descriptors for visual search,” International Organization For Standardization ISO/IEC JTC1/SC29/WG11, Tech. Rep, 2013.
[47] X. Li, J. Shi, and Z. Chen, “Task-driven semantic coding via reinforcement learning,” IEEE TIP, vol. 30, pp. 6307–6320, 2021.
[48] Z. Chen and T. He, “Learning based facial image compression with semantic fidelity metric,” Neurocomputing, vol. 338, pp. 16–25, 2019.
[49] S. R. Alvar and I. V. Bajić, “Pareto-optimal bit allocation for collaborative intelligence,” IEEE TIP, vol. 30, pp. 3348–3361, 2021.
[50] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE TIP, vol. 29, pp. 8680–8695, 2020.
[51] X. Li, J. Shi, and Z. Chen, “Task-driven semantic coding via reinforcement learning,” IEEE TIP, vol. 30, pp. 6307–6320, 2021.
[52] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in CVPR, 2018, pp. 4394–4402.
[53] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in NeurIPS, 2018, pp. 10 771–10 780.
[54] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE signal processing magazine, vol. 15, no. 6, pp. 74–90, 1998.
[55] M. Kalluri, M. Jiang, N. Ling, J. Zheng, and P. Zhang, “Adaptive rd optimal sparse coding with quantization for image compression,” IEEE TMM, vol. 21, no. 1, pp. 39–50, 2018.
[56] A. Ortega and K. Ramchandran, “Rate-distortion methods for image and video compression,” IEEE signal processing magazine, vol. 15, no. 6, pp. 23–50, 1998.
[57] “Hevc offical test model hm. https://hevc.hhi.fraunhofer.de.”
[58] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in CVPR, 2018, pp. 2403–2412.
[59] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in ECCV. Springer, 2016, pp. 483–499.
[60] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in ECCV, 2018, pp. 734–750.
[61] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in ECCV, 2018, pp. 466–481.
[62] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
[63] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017, pp. 2980–2988.
[64] R. Feng, Y. Wu, Z. Guo, Z. Zhang, and Z. Chen, “Learned video compression with feature-level residuals,” in CVPR Workshops, 2020, pp. 120–121.
[65] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in CVPR, 2018, pp. 8934–8943.
[66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE, 2009, pp. 248–255.
[67] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” in IJCV, vol. 127, no. 8, pp. 1106–1125, 2019.
[68] L. Davisson, “Rate distortion theory: A mathematical basis for data compression,” IEEE Transactions on Communications, vol. 20, no. 6, pp. 1202–1202, 1972.
[69] Y. Blau and T. Michaeli, “Rethinking lossy compression: The rate-distortion-perception tradeoff,” in International Conference on Machine Learning. PMLR, 2019, pp. 675–685.
[70] A. Mercat, M. Viitanen, and J. Vanne, “Uvg dataset: 50/120fps 4k sequences for video codec analysis and development,” in ACM MM, 2020, pp. 297–302.
[71] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2. IEEE, 2003, pp. 1398–1402.
[72] “Vvc offical test model vtm. https://jvet.hhi.fraunhofer.de.”
[73] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[74] G. Lu, C. Cai, X. Zhang, L. Chen, W. Ouyang, D. Xu, and Z. Gao, “Content adaptive and error propagation aware deep video compression,” arXiv preprint arXiv:2003.11282, 2020.
[75] H. Liu, L. Huang, M. Lu, T. Chen, Z. Ma et al., “Learned video compression via joint spatial-temporal correlation exploration,” arXiv preprint arXiv:1912.06348, 2019.
[76] J. Lin, D. Liu, H. Li, and F. Wu, “M-lvc: Multiple frames prediction for learned video compression,” arXiv preprint arXiv:2004.10290, 2020.
[77] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, “Video compression through image interpolation,” in ECCV, 2018, pp. 416–431.
[78] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neural inter-frame compression for video coding,” in ICCV, 2019, pp. 6421–6429.

[79] A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen, “Video compression with rate-distortion autoencoders,” in ICCV, 2019, pp. 7033–7042.
[80] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV. Springer, 2014, pp. 740–755.
[81] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint triplets for object detection,” in ICCV, 2019, pp. 6569–6578.
[82] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[83] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks for action recognition in videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2740–2755, 2018.
[84] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in CVPR, 2016, pp. 724–732.
[85] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in CVPR, 2017, pp. 221–230.
[86] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[87] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[88] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung, “Learning video object segmentation from static images,” in CVPR, 2017, pp. 2663–2672.
[89] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in ECCV. Springer, 2020, pp. 402–419.
[90] G. Bjontegaard, “Calculation of average psnr differences between rd-curves,” VCEG-M33, 2001.
[91] Z. Chen, J. Xu, Y. He, and J. Zheng, “Fast integer-pel and fractional-pel motion estimation for h.264/avc,” Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 264–290, 2006.
