On Transform Coding Tools Under Development For VP10

Sarah Parker*, Yue Chen*, Jingning Han*, Zoe Liu*, Debargha Mukherjee*, Hui Su*, Yongzhe Wang*, Jim Bankoski*, Shunyao Li+

Email: {sarahparker, yuec, jingning, zoeliu, debargha, huisu, yongzhe, jimbankoski}@google.com*, [email protected]+

* Google, Inc., 1600 Amphitheatre Parkway, Mountain View, CA, USA 94043.
+ University of California, Santa Barbara, CA 93106.
ABSTRACT
Google started the WebM Project in 2010 to develop open-source, royalty-free video codecs designed specifically for
media on the Web. The second generation codec released by the WebM project, VP9, is currently served by YouTube,
and enjoys billions of views per day. Realizing the need for even greater compression efficiency to cope with the
growing demand for video on the web, the WebM team embarked on an ambitious project to develop a next edition
codec, VP10, that achieves at least a generational improvement in coding efficiency over VP9. Starting from VP9, a set of new experimental coding tools has already been added to VP10 to achieve decent coding gains. Subsequently,
Google joined a consortium of major tech companies called the Alliance for Open Media to jointly develop a new codec
AV1. As a result, the VP10 effort is largely expected to merge with AV1. In this paper, we focus primarily on new tools
in VP10 that improve coding of the prediction residue using transform coding techniques. Specifically, we describe tools
that increase the flexibility of available transforms, allowing the codec to handle a more diverse range of residue
structures. Results are presented on a standard test set.
Keywords: video coding, VP8, VP9, VP10, WebM, H.264, HEVC, prediction, motion, transform, DCT, DST, Identity.
1. INTRODUCTION
Google embarked on the WebM project [1] to develop open-source, royalty-unencumbered video codecs for the Web.
The first codec released as part of the project was called VP8 [2] and is still used extensively in Google Hangouts. The
next edition of the codec, entitled VP9 [3][4], was released in mid-2013 and is the current generation codec from the
WebM project. It achieves a coding efficiency similar to the latest video codec from MPEG entitled HEVC [5]. VP9 has
found huge success with adoption by YouTube, and has delivered big improvements to the YouTube service in terms of
quality of experience metrics such as watch-time and mean-time-to-rebuffer over the primary format H.264/AVC [6]. Specifically, VP9 streams delivered by YouTube today are not only 30-40% more compact than corresponding H.264/AVC streams but are also somewhat higher in quality. Consequently, even with predominantly software decoding on compatible browsers (Chrome, Firefox, Opera) on potent devices, the number of VP9 videos viewed daily by YouTube users today is on the order of billions. As VP9 hardware decoders become more readily available on mobile
devices we expect the proliferation of VP9 to accelerate even more.
Even though the gains achieved with VP9 are tangible and significant, the continued growth in online video consumption
has made the need for efficient video coding increasingly critical. The WebM project has been focusing on developing
the next generation video codec VP10 [7] since 2014, and modest gains in coding efficiency have already been achieved.
In 2015, Google joined a consortium of major tech companies called the Alliance for Open Media to jointly develop a
new royalty-free codec to be named AV1. The plan is to propose the experimental tools developed in VP10 to the AV1 process in due course. In this paper we primarily focus on the tools developed for VP10.
Though improvements in prediction modes can successfully decrease the prediction error, more than half of the bitrate in
modern video codecs is still spent coding the residual. In this paper we discuss the new transform coding tools that have
been added in VP10 to improve the coding of the residue. First, we discuss the supertransform, which allows the
application of one large transform to a predictor created by combining several prediction blocks using overlapped block
motion compensation. Next, we discuss two extensions to our transform sizes: recursive transform units and rectangular
transforms. Finally, we discuss an expanded bank of transform types available to Intra and Inter prediction blocks.
Overall, we find that increasing the flexibility of available transforms allows VP10 to better handle a wide range of
residue structures and leads to a significant reduction in BD-RATE.
2. VP9 TRANSFORM CODING FRAMEWORK
3. TRANSFORM TOOL ENHANCEMENTS IN VP10
SUPERTRANSFORMS
VP9 uses a recursive block-partitioning scheme for the purpose of prediction; however, the transform used to code the
prediction residue of a prediction block is restricted to be of a size no larger than the prediction block itself. VP10
attempts to remove this restriction for Inter modes by allowing transform blocks to span across multiple prediction
blocks. Specifically, at any level of the partition tree, the syntax can optionally indicate that a single large transform will
be used at that level, irrespective of how fine the partition tree may be below that level. Fig. 1 shows an example of a
partition tree with two supertransform blocks indicating that the prediction residue will be coded jointly with a large
transform at these sizes.
Fig. 1. Partition tree with supertransform blocks
Through our investigations, we found that a simple juxtaposition of predictors from different prediction blocks to create one large final predictor is often non-ideal. Instead, supertransform creates a new predictor based on a recursive application of overlapped block motion compensation [8]. In particular, predictors from the smallest blocks within the supertransform tree are aggregated together with overlapped block motion compensation successively in a recursive fashion, until the final predictor bubbles up to the supertransform level. Note that predictors at each level
need to be extended by a width equivalent to the width of the smoothing filter across prediction boundaries.
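To make the recursion concrete, the following C sketch gives one possible shape of this predictor construction, simplified to square splits only; the PartNode type and all helper functions (predict_leaf_extended, obmc_blend, alloc_extended_block, free_extended_block) are hypothetical stand-ins for illustration, not the libvpx implementation.

#include <stdint.h>

typedef struct PartNode {
  int is_leaf;
  struct PartNode *child[4];  /* quadtree children when not a leaf */
  /* motion vectors, reference frame, block geometry, ... (elided) */
} PartNode;

/* Hypothetical helpers: predict_leaf_extended() motion-compensates one leaf
 * prediction block, extended past its boundary by `overlap` pixels;
 * obmc_blend() smooths four extended sub-predictors across their shared
 * boundaries into one parent-sized predictor. */
void predict_leaf_extended(const PartNode *n, uint8_t *dst, int w, int h, int overlap);
void obmc_blend(uint8_t *dst, uint8_t *sub[4], int w, int h, int overlap);
uint8_t *alloc_extended_block(int w, int h, int overlap);
void free_extended_block(uint8_t *buf);

/* Bubble the predictor up to the supertransform level: leaves are motion
 * compensated, and each interior node OBMC-blends its children's outputs. */
void build_super_predictor(const PartNode *n, uint8_t *dst, int w, int h, int overlap) {
  if (n->is_leaf) {
    predict_leaf_extended(n, dst, w, h, overlap);
    return;
  }
  uint8_t *sub[4];
  for (int i = 0; i < 4; ++i) {
    sub[i] = alloc_extended_block(w / 2, h / 2, overlap);
    build_super_predictor(n->child[i], sub[i], w / 2, h / 2, overlap);
  }
  obmc_blend(dst, sub, w, h, overlap);  /* smooth across internal boundaries */
  for (int i = 0; i < 4; ++i) free_extended_block(sub[i]);
}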
RECURSIVE TRANSFORM UNITS
VP9 provides a wide range of available transform sizes, but each prediction block is limited to selecting only one of
these. In VP10, we remove this constraint and allow any Inter prediction block to use several different transform sizes.
Transforms within a single prediction block may now have recursive tree-structured partitions. A simple 2-way partition quadtree with only square-split types is used to produce these recursive units. We have found that this size flexibility allows finer targeting of high-energy regions in the residual signal. Fig. 2 illustrates the available partition types in the 2-way quadtree, as well as an example of a final transform partition tree within a single prediction block.
Fig. 2. Prediction block residue with recursively partitioned transform units using a 2-way partition tree
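A minimal C sketch of the resulting encoder-side decision follows, assuming a hypothetical helper tx_rd_cost() that returns the rate-distortion cost of coding a square residue block with a single transform of that size; signaling details are folded schematically into a per-node split_cost term.

#include <stdint.h>

/* Hypothetical: RD cost of coding a w x w residue block with one transform
 * (transform + quantize + rate estimate, elided). */
double tx_rd_cost(const int16_t *res, int stride, int w);

/* The 2-way decision at each node of the transform-unit quadtree: either
 * code the block whole, or square-split into four quadrants and recurse,
 * keeping whichever option has the lower total RD cost. */
double best_tu_rd(const int16_t *res, int stride, int w, double split_cost) {
  double none = tx_rd_cost(res, stride, w);  /* no-split option */
  if (w <= 4) return none;                   /* 4x4 is the smallest transform */
  int half = w / 2;
  double split = split_cost                  /* rate for signaling the split */
      + best_tu_rd(res, stride, half, split_cost)
      + best_tu_rd(res + half, stride, half, split_cost)
      + best_tu_rd(res + half * stride, stride, half, split_cost)
      + best_tu_rd(res + half * stride + half, stride, half, split_cost);
  return none < split ? none : split;
}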
RECTANGULAR TRANSFORMS
Fig. 3. Rectangular transform units within a superblock
EXTENDED TRANSFORM TYPES
To code Inter prediction residues, VP9 exclusively uses DCTs of different sizes, namely 4x4, 8x8, 16x16 and 32x32;
however, for coding of Intra prediction residues, a richer set of transforms that include a hybrid combination of DCTs
and Asymmetric DSTs (ADST) is used [9][10][11]. Intra prediction residues are likely to be smaller near the boundaries from which they are predicted. As such, the asymmetric DST is better suited to code them. Specifically, VP9 uses DST-IV, which is an approximation to the original ADST [9], but with a faster butterfly implementation [11]. For
ease of exposition, we still refer to this transform as the ADST. In VP9, for each Intra predicted block size, 4x4, 8x8 and
16x16, up to four different separable 2-D transforms may be used: DCT-DCT, DCT-ADST, ADST-DCT and ADST-ADST, where each transform pair listed denotes the horizontal and vertical transforms of a separable 2-D implementation
respectively.
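As a brief recap consistent with [9][11], the unnormalized 1-D basis functions involved can be written, for an N-point input, as:

DCT-II: X_k = sum_{n=0..N-1} x_n cos( (pi/N) (n + 1/2) k ), k = 0, ..., N-1
DST-IV (ADST): X_k = sum_{n=0..N-1} x_n sin( (pi/N) (n + 1/2) (k + 1/2) ), k = 0, ..., N-1

The lowest-frequency DST-IV basis vector rises from near zero at n = 0, which matches residues that are small near the predicted boundary; the DCT-II basis vectors, by contrast, have full magnitude there.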
For VP10, we are exploring a richer set of transforms for coding Inter and Intra prediction residues. Inter prediction
residues do not have a well-defined structure as in the Intra case, but we have found that using a bank of transforms, each
adapted to a specific type of residue profile within the block, is generally helpful. In VP10, we use not only the ADST
(DST-IV) but also a flipped version of the ADST (FlipADST) that applies the ADST in reverse order. Further, an identity transform (IDTX) is now available, which seems to be particularly useful for coding residue with sharp lines and edges.
Previously, we experimented with a symmetric DST, namely DST-II, but found the identity transform to be more
beneficial for coding efficiency. Finally, both Inter and Intra modes continue to make use of DCT. Thus, for each coded
block, we can choose to use one of up to 16 different transforms as follows:
{DCT, ADST, FlipADST, IDTX}_horizontal x {DCT, ADST, FlipADST, IDTX}_vertical
or, explicitly, as horizontal-vertical pairs:
DCT-DCT, DCT-ADST, ADST-DCT, ADST-ADST, DCT-FlipADST, FlipADST-DCT, FlipADST-FlipADST, ADST-FlipADST, FlipADST-ADST, IDTX-DCT, DCT-IDTX, IDTX-ADST, ADST-IDTX, IDTX-FlipADST, FlipADST-IDTX, IDTX-IDTX.
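The cross product above is mechanical, as the following C sketch illustrates; the enum and struct names are ours, chosen for exposition rather than taken from the libvpx source.

/* The four 1-D transform kernels available in each direction. */
typedef enum { DCT = 0, ADST, FLIPADST, IDTX, NUM_TX_1D } Tx1D;

/* A separable 2-D transform type: one kernel per direction. */
typedef struct { Tx1D hor, ver; } Tx2D;

/* Enumerate all NUM_TX_1D * NUM_TX_1D = 16 horizontal/vertical pairs. */
static void build_tx_set(Tx2D set[NUM_TX_1D * NUM_TX_1D]) {
  for (int v = 0; v < NUM_TX_1D; ++v)
    for (int h = 0; h < NUM_TX_1D; ++h)
      set[v * NUM_TX_1D + h] = (Tx2D){ (Tx1D)h, (Tx1D)v };
}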
As block sizes get larger, some of these transforms begin to act similarly. Thus, a reduced set of transforms is used for
16x16, 32x32 and 64x64 block sizes. In the transform selection process for Inter and Intra modes, the encoder does a
search over the entire set of transforms and selects the one that produces the best RD cost. Once a transform is selected, a
transform type symbol from the set of types available at that size is used to indicate the actual transform used in the
bitstream.
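A minimal sketch of this selection loop, reusing the Tx2D type from the sketch above; rd_cost_for_tx() is a hypothetical helper standing in for the encoder's transform, quantization and rate-estimation pipeline.

#include <stdint.h>

/* Hypothetical: RD cost (D + lambda * R) of coding this residue block with
 * the given 2-D transform type. */
double rd_cost_for_tx(const int16_t *res, int w, int h, Tx2D tx);

/* Exhaustive search over the available transform set; the winner's type
 * symbol is then written to the bitstream. */
Tx2D pick_best_tx(const int16_t *res, int w, int h, const Tx2D *set, int n) {
  Tx2D best = set[0];
  double best_cost = rd_cost_for_tx(res, w, h, set[0]);
  for (int i = 1; i < n; ++i) {
    double cost = rd_cost_for_tx(res, w, h, set[i]);
    if (cost < best_cost) { best_cost = cost; best = set[i]; }
  }
  return best;
}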
Note that the one-dimensional transforms DCT-IDTX and IDTX-DCT in the list above are similar in spirit to directional transforms [12] or 1-D transforms [13] in the literature. However, we chose to use only two directions, horizontal and vertical, since these seem to be the minimal set that provides the best gains. Also, note that IDTX-IDTX is equivalent to
transform skip, which yields substantial benefit for screen content.
While the multiple transforms do not add any decoding complexity, since all transform sizes and types are explicitly
signaled, there is significant added complexity needed on the encoder side to make the best rdbased decision by
searching over the set of available transform types. We are currently experimenting with methods to mitigate this
complexity. Specifically, we are exploring classification schemes based on simple features derived from the residue
signal, to prune out transform types from the rd search set. In particular, one classifier is trained to prune out either DCT
or IDTX, and a second classifier is trained to prune out either ADST or FlipADST in each direction. The DCT vs. IDTX classifier relies on features comprised of horizontal and vertical neighboring-pixel correlations in the residual, while the
ADST vs. FlipADST classifier relies on features comprised of the energy distribution in various regions of the residual
signal. We continue to explore different methods to reduce the added encoder complexity burden produced by this
expanded transform set.
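As an illustration, the two kinds of features could be computed along the following lines; this is our sketch of the idea, with the classifiers and thresholds themselves omitted, and is not the actual implementation.

#include <math.h>
#include <stdint.h>

/* Lag-1 sample correlation along rows of a w x h residue block (assumes
 * w >= 2); the vertical feature is analogous with rows and columns swapped.
 * High correlation suggests DCT; near-zero correlation suggests IDTX. */
static double horizontal_correlation(const int16_t *res, int w, int h, int stride) {
  double sxy = 0, sxx = 0, syy = 0, sx = 0, sy = 0;
  int n = 0;
  for (int r = 0; r < h; ++r)
    for (int c = 0; c + 1 < w; ++c) {
      double a = res[r * stride + c], b = res[r * stride + c + 1];
      sxy += a * b; sxx += a * a; syy += b * b; sx += a; sy += b; ++n;
    }
  double cov = sxy - sx * sy / n;
  double den = sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
  return den > 0 ? cov / den : 0.0;
}

/* Fraction of residue energy in the left half of the block; values far from
 * 0.5 suggest an asymmetric horizontal transform (ADST vs. FlipADST). */
static double left_energy_fraction(const int16_t *res, int w, int h, int stride) {
  double left = 0, total = 0;
  for (int r = 0; r < h; ++r)
    for (int c = 0; c < w; ++c) {
      double e = (double)res[r * stride + c] * res[r * stride + c];
      total += e;
      if (c < w / 2) left += e;
    }
  return total > 0 ? left / total : 0.5;
}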
4. CODING RESULTS
To evaluate our new tools, we performed a controlled bitrate test using 3 different video sets:
● lowres, which includes 40 videos of CIF resolution,
● midres, which includes 30 videos of 480p and 360p resolution, and
● hdres, which contains 38 videos at 720p and 1080p resolution.
where we code 150 frames of each video with a single keyframe. The coding results are shown in Tables 1-3 below.
For quality metrics we use average sequence PSNR and SSIM [14], computed as the arithmetic average of the combined PSNRs and SSIMs, respectively, over all frames. Combined PSNR for each frame is computed from the combined MSE of the Y, Cb and Cr components. In other words:
MSE_combined = (4 MSE_Y + MSE_Cb + MSE_Cr) / 6, assuming 4:2:0 sampling
PSNR_combined = min( 10 log10( 255^2 / MSE_combined ), 100 )
SSIM for each component of each frame is computed by averaging SSIM scores over 8x8 windows, without applying a windowing function. Combined SSIM for the frame is computed from the SSIMs
of the Y, Cb and Cr components as follows:
SSIM_combined = 0.8 SSIM_Y + 0.1 (SSIM_Cb + SSIM_Cr)
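A minimal C sketch of the combined per-frame metrics described above; the 4:2:0 luma weighting and the 100 dB cap are taken directly from the formulas.

#include <math.h>

/* Combined PSNR from per-plane MSEs, weighting luma 4x for 4:2:0 sampling. */
double combined_psnr(double mse_y, double mse_cb, double mse_cr) {
  const double mse = (4.0 * mse_y + mse_cb + mse_cr) / 6.0;
  if (mse == 0.0) return 100.0;        /* cap for lossless frames */
  const double psnr = 10.0 * log10(255.0 * 255.0 / mse);
  return psnr < 100.0 ? psnr : 100.0;  /* clip at 100 dB as above */
}

/* Combined SSIM with the 0.8 / 0.1 / 0.1 plane weighting given above. */
double combined_ssim(double ssim_y, double ssim_cb, double ssim_cr) {
  return 0.8 * ssim_y + 0.1 * (ssim_cb + ssim_cr);
}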
To compare RD curves obtained by two codecs we use a modified BD-RATE [15] metric that uses piecewise cubic Hermite polynomial interpolation (pchip) on the rate-distortion points before integrating the difference over a fine grid using the trapezoid method. The OVERALL number at the bottom of each table is the arithmetic average of the BD-RATE numbers over all the videos in the same column. The BD-RATE is computed separately based on the average sequence PSNR and SSIM metrics as computed above.
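For concreteness, the following C sketch shows one way to compute such a modified BD-RATE, assuming log10 bitrate is interpolated against quality with Fritsch-Carlson (pchip) slopes and the difference is integrated by the trapezoid rule over a 1000-point grid. The function names, the simplified endpoint slopes, and the 16-point capacity are our assumptions, not the actual test harness.

#include <math.h>

/* Fritsch-Carlson slopes for a monotone piecewise cubic Hermite interpolant
 * through (x[i], y[i]), x strictly ascending, 2 <= n <= 16. Endpoints use a
 * simplified one-sided rule. */
static void pchip_slopes(const double *x, const double *y, int n, double *m) {
  double h[16], d[16];
  for (int i = 0; i < n - 1; ++i) {
    h[i] = x[i + 1] - x[i];
    d[i] = (y[i + 1] - y[i]) / h[i];
  }
  m[0] = d[0];
  m[n - 1] = d[n - 2];
  for (int i = 1; i < n - 1; ++i) {
    if (d[i - 1] * d[i] <= 0) {
      m[i] = 0;  /* preserve monotonicity at local extrema */
    } else {
      double w1 = 2 * h[i] + h[i - 1], w2 = h[i] + 2 * h[i - 1];
      m[i] = (w1 + w2) / (w1 / d[i - 1] + w2 / d[i]);  /* weighted harmonic mean */
    }
  }
}

/* Evaluate the Hermite interpolant at xq, with xq inside [x[0], x[n-1]]. */
static double pchip_eval(const double *x, const double *y, const double *m,
                         int n, double xq) {
  int i = 0;
  while (i < n - 2 && xq > x[i + 1]) ++i;
  double h = x[i + 1] - x[i], t = (xq - x[i]) / h;
  double t2 = t * t, t3 = t2 * t;
  return (2 * t3 - 3 * t2 + 1) * y[i] + (t3 - 2 * t2 + t) * h * m[i] +
         (-2 * t3 + 3 * t2) * y[i + 1] + (t3 - t2) * h * m[i + 1];
}

/* BD-RATE (%) of codec B vs. codec A from (quality, bitrate) points sorted
 * by ascending quality; negative means B needs fewer bits. */
double bd_rate(const double *qa, const double *ra, int na,
               const double *qb, const double *rb, int nb) {
  double la[16], lb[16], ma[16], mb[16];
  for (int i = 0; i < na; ++i) la[i] = log10(ra[i]);
  for (int i = 0; i < nb; ++i) lb[i] = log10(rb[i]);
  pchip_slopes(qa, la, na, ma);
  pchip_slopes(qb, lb, nb, mb);
  double lo = fmax(qa[0], qb[0]), hi = fmin(qa[na - 1], qb[nb - 1]);
  const int grid = 1000;
  double sum = 0;  /* trapezoid rule over the fine quality grid */
  for (int i = 0; i < grid; ++i) {
    double q0 = lo + (hi - lo) * i / grid, q1 = lo + (hi - lo) * (i + 1) / grid;
    double d0 = pchip_eval(qb, lb, mb, nb, q0) - pchip_eval(qa, la, ma, na, q0);
    double d1 = pchip_eval(qb, lb, mb, nb, q1) - pchip_eval(qa, la, ma, na, q1);
    sum += 0.5 * (d0 + d1) * (q1 - q0);
  }
  double avg_log_diff = sum / (hi - lo);
  return (pow(10.0, avg_log_diff) - 1.0) * 100.0;  /* % bitrate change */
}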
For all the tables below, we use a slightly modified version of VP9 as the baseline, referred to as VP9+ for ease of exposition, which was also the starting point of the AV1 codec. VP9+ is better than VP9 by about 0.6% because it already incorporates multiple explicit transforms for Inter and Intra with the set of four original VP9 transforms as described in Section 2. Specifically, all the results below are generated on the nextgenv2 branch of the libvpx repository, where the configurations tested are as follows:
VP9+ baseline:
--enable-av1 [very similar to the AV1 baseline codec]
Extended Transform Set:
--enable-av1 --enable-experimental --enable-ext-tx
Extended Transform Set + Rectangular Transforms:
--enable-av1 --enable-experimental --enable-ext-tx --enable-rect-tx
Super Transform:
--enable-av1 --enable-experimental --enable-supertx
All new transform tools:
--enable-av1 --enable-experimental --enable-supertx --enable-ext-tx --enable-rect-tx
At the time of writing this paper we found some bugs in the Recursive Transform Units tool, so the results for that tool are excluded.
Table 1. VP10 BD-RATE results on the lowres set (VP9+ baseline)
Table 2. VP10 BD-RATE results on the midres set (VP9+ baseline)
5. CONCLUSION
In this paper we have presented a brief overview of the new transform coding tools that are being explored as part of
VP10 development. Preliminary results indicate that increasing transform flexibility can achieve at least a 3% decrease
in BD-RATE for both average PSNR and SSIM. Although this is an encouraging improvement, we are left with several avenues to explore within the space of transform flexibility, and still have a way to go before we reach a viable next-generation codec. VP10 development is an open-source project, and we invite the rest of the video coding community to join the effort to create tomorrow's royalty-free codec.
REFERENCES
[1] http://www.webmproject.org/
[2] J. Bankoski, J. Koleszar, L. Quillio, J. Salonen, P. Wilkins, Y. Xu, VP8 Data Format and Decoding Guide, RFC
6386, http://datatracker.ietf.org/doc/rfc6386/
[3] D. Mukherjee, J. Bankoski, R. S. Bultje, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, “The latest open-source video codec VP9: an overview and preliminary results,” Proc. IEEE Picture Coding Symp., pp. 390-393, San Jose, Dec. 2013.
[4] D. Mukherjee, J. Bankoski, R. S. Bultje, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, “A Technical Overview of VP9: the latest open-source video codec,” SMPTE Motion Imaging Journal, Jan/Feb 2015.
[5] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the High Efficiency Video
Coding (HEVC) Standard,” IEEE Trans. on Circuits and Systems for Video Technology, Vol. 22, No. 12, Dec 2012.
[6] Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.
[7] D. Mukherjee, H. Su, J. Bankoski, A. Converse, J. Han, Z. Liu, Y. Xu, “An overview of video coding tools under
consideration for VP10: the successor to VP9,” Proc. SPIE, Applications of Digital Image Processing XXXVIII, vol.
9599, Sep 2015.
[8] Y. Chen, K. Rose, J. Han, and D. Mukherjee, “A Pre-filtering Approach to Exploit Decoupled Prediction and
Transform Block Structures in Video Coding", Proc. IEEE International Conference on Image Processing (ICIP),
Oct. 2014.
[9] J. Han, A. Saxena, and K. Rose, “Towards jointly optimal spatial prediction and adaptive transform in video/image
coding,” Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 726–729, March 2010.
[10] J. Han, A. Saxena, V. Melkote, and K. Rose, “Jointly optimized spatial prediction and block transform for video and
image coding,” IEEE Transactions on Image Processing, vol. 21, pp. 1874-1884, April 2012.
[11] J. Han, Y. Xu, D. Mukherjee, “A butterfly structured design of the hybrid coding scheme,” Proc. IEEE Picture
Coding Symp., pp. 1-4, San Jose, Dec. 2013.
[12] C.-L. Chang, Mina Makar, Sam S. Tsai, B. Girod, “Direction-adaptive partitioned block transform for color image
coding,” IEEE Transactions on Image Processing, vol. 19, no. 7, July 2010.
[13] F. Kamisli and J. S. Lim, “1-D transforms for the motion compensated residual,” IEEE Transactions on Image
Processing, vol. 20, no. 4, April 2011.
[14] Zhou Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.
[15] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 13th VCEG meeting,
Austin, Texas, March 2001.