VP9
VP9 Bitstream & Decoding Process Specification, Version 0.6 (March 31, 2016), broken down into more detail for each key section:
The VP9 codec, developed by Google, is designed for efficient video compression and decoding,
intended to reduce bandwidth requirements for storing and transmitting high-resolution video (e.g.,
1920x1080 resolution at 30 frames per second). The specification defines the VP9 bitstream format
and the decoding process. While it fully describes the decoding process (which must be standardized
for playback compatibility), the encoding process is not explicitly defined, allowing flexibility in how
different VP9 encoders might choose to compress video data.
VP9 is widely used for online streaming, including platforms like YouTube, where bandwidth
efficiency and video quality are paramount.
The document provides a comprehensive list of technical definitions that are used throughout the
specification, which are important for understanding how VP9 operates. Key terms include:
Transform coefficients: Numerical values that represent frequency data for compressed pixel
blocks. They are divided into AC coefficients (non-zero frequencies) and DC coefficients (zero
frequency, average brightness).
Motion vectors: 2D vectors used to predict the movement of blocks between frames.
Intra-frame: A self-contained frame that does not rely on data from other frames.
Quantization: A lossy compression process where data is reduced by rounding, trading off
precision for smaller data sizes.
Understanding these terms helps grasp the functioning of the VP9 codec, where frames are divided
into blocks and compressed using various transform, prediction, and encoding techniques.
The bitstream is a structured sequence of bits that encodes the video data, divided into
several key elements:
1. Frames: Each frame in the bitstream can either be a keyframe (independent) or an inter-
frame (dependent on previous frames).
2. Headers: Each frame contains an uncompressed header and a compressed header that
store metadata (e.g., frame size, color space, quantization settings).
3. Prediction Blocks: Each frame is partitioned into blocks or superblocks (64x64 pixels).
Smaller blocks or sub-blocks are used for areas with more detail or motion.
VP9 allows for extensive customization in how frames and blocks are represented, with varying block
sizes, prediction modes, and quantization factors to balance compression efficiency and quality.
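As a rough sketch, the kind of metadata an uncompressed header carries can be modeled as a small record; the fields below are an illustrative subset, not the spec's full syntax:

```python
from dataclasses import dataclass

@dataclass
class FrameHeader:
    """Illustrative subset of the metadata in a VP9 frame header."""
    is_keyframe: bool   # keyframes decode independently of all other frames
    show_frame: bool    # whether the decoded frame is output for display
    width: int          # frame width in pixels
    height: int         # frame height in pixels
    base_q_idx: int     # base quantizer index controlling compression strength

# A keyframe carries everything needed to decode it on its own:
header = FrameHeader(is_keyframe=True, show_frame=True,
                     width=1920, height=1080, base_q_idx=100)
```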
The goal of VP9 is to provide an efficient way to store and transmit video by significantly reducing the
bandwidth required. A raw uncompressed video (e.g., 1920x1080 resolution, 30 frames per second)
could require over 700 million bits per second. VP9 compresses this by exploiting redundancies in
both spatial (within a frame) and temporal (across frames) domains.
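That figure follows from simple arithmetic, assuming 8-bit 4:2:0 sampling (an average of 12 bits per pixel):

```python
width, height, fps = 1920, 1080, 30
bits_per_pixel = 12  # 8-bit 4:2:0: 8 bits luma + 4 bits of subsampled chroma
raw_bitrate = width * height * fps * bits_per_pixel
print(raw_bitrate)   # 746496000 -> roughly 746 Mbit/s uncompressed
```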
VP9 applies transforms to pixel data, converting spatial data (pixel values) into frequency data. For
example, in a flat area of an image where most pixel values are the same, VP9 will compress this
efficiently by representing the region with fewer bits.
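The flat-region case is easy to verify with a textbook 2D DCT (VP9's own transforms are integer approximations, so this NumPy sketch is illustrative only): every coefficient except the DC term vanishes.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2D DCT-II computed as C @ block @ C.T."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses the smaller scale factor
    return c @ block @ c.T

flat = np.full((8, 8), 100.0)        # a perfectly flat 8x8 region
coeffs = dct2(flat)
print(coeffs[0, 0])                  # 800.0: the DC term holds everything
print(np.abs(coeffs).sum() - 800.0)  # ~0: all AC terms are zero
```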
VP9 uses quantization to achieve lossy compression. Transform coefficients are scaled and rounded
to reduce their precision, saving bits. The larger the quantization factor, the more aggressive the
compression and the greater the quality loss.
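A minimal sketch of the scale-and-round step (the real VP9 quantizer works from per-frame lookup tables, so the fixed step size here is an assumption for illustration):

```python
import numpy as np

def quantize(coeffs, q_step):
    # Divide and round: small coefficients collapse to zero
    return np.round(coeffs / q_step).astype(int)

def dequantize(levels, q_step):
    # The decoder can only recover multiples of the step size
    return levels * q_step

coeffs = np.array([800.0, 13.2, -7.9, 2.1, -0.4])
levels = quantize(coeffs, q_step=16)   # [50, 1, 0, 0, 0]
approx = dequantize(levels, 16)        # [800, 16, 0, 0, 0] -> lossy
```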
VP9 employs intra prediction to estimate the pixel values of a block based on neighboring pixels
within the same frame. Different prediction modes (e.g., vertical, horizontal, diagonal) are used to
match patterns in the image.
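For example, the vertical and horizontal modes simply propagate the reconstructed pixels bordering the block; a sketch:

```python
import numpy as np

def predict_vertical(above_row, size):
    """Vertical intra mode: repeat the row above the block down every row."""
    return np.tile(above_row, (size, 1))

def predict_horizontal(left_col, size):
    """Horizontal intra mode: repeat the column left of the block across."""
    return np.tile(left_col.reshape(-1, 1), (1, size))

above = np.array([120, 121, 119, 122])  # reconstructed pixels above a 4x4 block
pred = predict_vertical(above, 4)       # each of the 4 rows equals `above`
```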
Inter prediction allows blocks to be predicted from previous frames. Motion vectors are used to
track objects or camera motion between frames, further reducing the amount of data that needs to
be stored. Inter-frame prediction is a major contributor to bandwidth efficiency in VP9.
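At whole-pixel precision, inter prediction amounts to copying a displaced block out of a reference frame (sub-pixel motion is covered below under motion compensation); a sketch:

```python
import numpy as np

def inter_predict(ref_frame, row, col, mv_row, mv_col, size):
    """Fetch the block at (row, col) displaced by a whole-pixel motion vector."""
    r, c = row + mv_row, col + mv_col
    return ref_frame[r:r + size, c:c + size]

ref = np.random.randint(0, 255, (1080, 1920))
# Predict a 16x16 block from content that sat 3 px left, 1 px up in the reference:
pred = inter_predict(ref, row=64, col=128, mv_row=-1, mv_col=-3, size=16)
```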
VP9 divides frames into superblocks (64x64 pixels) that are further partitioned based on image
complexity. Each superblock can be split into smaller blocks, down to 4x4 pixels. This hierarchical
partitioning allows for flexible adaptation to different levels of detail within the frame.
Large, uniform areas (e.g., skies) can be encoded as a single large block, reducing overhead.
Detailed or fast-moving areas (e.g., faces, objects in motion) are split into smaller blocks to
capture detail and motion.
Partitioning strategies can vary based on the content and desired balance between compression
efficiency and image quality.
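The spec leaves the split decision to the encoder. One simple heuristic, sketched below, is to quadtree-split any block whose pixel variance suggests it is too complex to code as a unit (real encoders use rate-distortion cost, and VP9 also permits horizontal and vertical half-splits that this sketch omits):

```python
import numpy as np

def partition(block, top, left, min_size=4, var_threshold=100.0):
    """Recursively quadtree-split a square block while it remains 'busy'."""
    size = block.shape[0]
    if size <= min_size or np.var(block) < var_threshold:
        return [(top, left, size)]          # encode this block as one unit
    half = size // 2
    leaves = []
    for dr in (0, half):
        for dc in (0, half):
            sub = block[dr:dr + half, dc:dc + half]
            leaves += partition(sub, top + dr, left + dc, min_size, var_threshold)
    return leaves

sb = np.random.randint(0, 255, (64, 64)).astype(float)  # one superblock
print(len(partition(sb, 0, 0)))  # noisy content splits all the way to 4x4 (256 leaves)
```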
6. Transform Techniques and Inverse Transforms (Pages 16-19)
VP9 applies various transform techniques to compress image data, converting pixel values into
frequency data using:
1. Discrete Cosine Transform (DCT): Widely used for block-based compression, the DCT
captures spatial frequency information. Low-frequency components, which represent broad
image details, are prioritized, while high-frequency components (e.g., fine details or noise)
are often discarded.
2. Asymmetric Discrete Sine Transform (ADST): Used for intra-predicted blocks, where
prediction error tends to grow with distance from the known boundary pixels; the ADST's
basis functions match this one-sided behavior better than the DCT's.
After compression, the decoder uses inverse transforms to convert the frequency data back into
pixel values. The transform size (e.g., 4x4, 8x8, 16x16, or 32x32) is chosen based on the block size
and content.
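A round trip makes the decoder's role concrete: with an orthonormal transform matrix C, the inverse transform is just the transpose, and quantization remains the only lossy step. Again a floating-point sketch rather than VP9's bit-exact integer transforms:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are the basis vectors)."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix(8)
block = np.random.randint(0, 255, (8, 8)).astype(float)
coeffs = C @ block @ C.T            # forward transform (encoder side)
q = 16 * np.round(coeffs / 16)      # quantize + dequantize: the only lossy step
recon = C.T @ q @ C                 # inverse transform (decoder side)
print(np.abs(recon - block).max())  # small error, introduced solely by quantization
```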
VP9 allows up to three reference frames for inter prediction. These can be selected from any of eight
available reference frames stored in memory. This flexibility allows the codec to choose the best
reference frame(s) for each block, improving compression efficiency.
Golden frames and altref (alternate reference) frames are special reference frames that can
be preserved for multiple inter frames.
Compound prediction allows two reference frames to be blended for better prediction
accuracy in some cases.
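In its simplest form, compound prediction is a rounded average of the two single-reference predictors; a sketch:

```python
import numpy as np

def compound_predict(pred_a, pred_b):
    """Blend two inter predictors; averaging suppresses noise present in
    only one reference, often leaving a smaller residual to encode."""
    return (pred_a.astype(np.int32) + pred_b.astype(np.int32) + 1) >> 1

ref1_block = np.random.randint(0, 255, (8, 8), dtype=np.uint8)
ref2_block = np.random.randint(0, 255, (8, 8), dtype=np.uint8)
pred = compound_predict(ref1_block, ref2_block)  # rounded average of the two
```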
Motion vectors are used to indicate the movement of blocks from reference frames to the current
frame. VP9 supports sub-pixel motion estimation, providing fractional-pixel accuracy for improved
video quality.
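Sub-pixel positions do not exist in the reference frame, so they must be interpolated from whole pixels. VP9 uses 8-tap interpolation filters at up to 1/8-pel precision; the bilinear half-pel sketch below conveys the idea with far less machinery:

```python
import numpy as np

def half_pel_horizontal(ref_row, col):
    """Value midway between two horizontally adjacent reference pixels."""
    return (int(ref_row[col]) + int(ref_row[col + 1]) + 1) >> 1

row = np.array([100, 110, 130, 120], dtype=np.uint8)
print(half_pel_horizontal(row, 1))  # 120: interpolated between 110 and 130
```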
After decoding, loop filters are applied to reduce the visibility of block boundaries, improving the
overall image quality. Deblocking filters smooth the transition between blocks, particularly in areas
with high contrast, preventing "blockiness" artifacts that are common in compressed video.
Loop filtering is applied after each frame is decoded but before the frame is displayed or used as a
reference for future frames.
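A toy version of the deblocking idea: examine pixels straddling a block edge and, if the discontinuity is small enough to be a compression artifact rather than a real edge, pull the two sides together. VP9's actual filter adds adaptive strengths and flatness tests:

```python
def deblock_edge(p0, q0, threshold=20):
    """Smooth a vertical block edge between pixels ...p0 | q0..."""
    if abs(int(p0) - int(q0)) >= threshold:
        return p0, q0                     # a real edge: leave it alone
    avg = (int(p0) + int(q0) + 1) >> 1    # artifact: nudge both sides together
    return (int(p0) + avg) >> 1, (int(q0) + avg) >> 1

print(deblock_edge(102, 110))  # (104, 108): the step across the edge is softened
```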
VP9 uses arithmetic coding, which allows for highly efficient bit allocation, especially when some
symbols are more frequent than others. For example, motion vectors that are small or near-zero
occur more often, so fewer bits are allocated to represent them.
Additionally, VP9 continuously adapts the probability models used for encoding based on the actual
data being processed. These probabilities are updated in the frame headers, allowing the codec to
adjust to different types of video content dynamically.
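The savings can be made concrete: an ideal arithmetic coder spends about -log2(p) bits on a symbol of probability p, and a model that adapts its counts to the data keeps p accurate. The count-based update below is a stand-in for VP9's actual adaptation rules:

```python
import math

def bits_for(p):
    """Ideal arithmetic-coding cost of a symbol with probability p."""
    return -math.log2(p)

print(bits_for(0.90))  # ~0.15 bits: a very common symbol (e.g., a zero MV)
print(bits_for(0.05))  # ~4.3 bits: a rare symbol

# Naive adaptive model: re-estimate P(symbol == 0) from observed counts.
counts = {0: 1, 1: 1}             # start from a uniform prior
for symbol in [0, 0, 0, 1, 0, 0]:
    counts[symbol] += 1
p_zero = counts[0] / (counts[0] + counts[1])
print(p_zero)                     # 0.75: the model has adapted toward zeros
```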
10. Chroma Subsampling and High Bit Depth Support (Pages 22-23)
To reduce data size without a significant reduction in perceived quality, VP9 supports chroma
subsampling, where the color information (chroma) is sampled at a lower resolution than brightness
(luma). VP9 primarily uses the 4:2:0 format, where color information is halved in both horizontal and
vertical directions.
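The saving is easy to quantify: in 4:2:0, each 2x2 group of pixels shares a single pair of chroma samples, so the raw sample count is halved relative to full-resolution 4:4:4 color:

```python
width, height = 1920, 1080

samples_444 = 3 * width * height                          # full-resolution Y, U, V
samples_420 = width * height + 2 * (width // 2) * (height // 2)

print(samples_420 / samples_444)  # 0.5 -> 4:2:0 halves the raw sample count
```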
VP9 also supports high bit-depth video, allowing for color depths of 10 or 12 bits per pixel. This
enhances color fidelity, especially in HDR (High Dynamic Range) content, compared to the standard
8-bit color depth.
Superframes pack multiple coded frames, typically an invisible alternate-reference frame
together with a frame that is shown, into a single unit for easier transmission and decoding.
This is useful for applications such as streaming, where latency and synchronization are important.
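For orientation, a demuxer can detect a superframe by the index appended after the last frame; the sketch below follows the layout used by libvpx (a marker byte with top bits 110 repeated at both ends of the index, little-endian frame sizes in between):

```python
def parse_superframe_index(chunk: bytes):
    """Return the sizes of the frames packed in a VP9 superframe, or None."""
    marker = chunk[-1]
    if (marker & 0xE0) != 0xC0:              # top bits 110 identify the index
        return None
    frames = (marker & 0x07) + 1             # low 3 bits: frame count - 1
    size_bytes = ((marker >> 3) & 0x03) + 1  # bits 3-4: bytes per size - 1
    index_len = 2 + frames * size_bytes
    if len(chunk) < index_len or chunk[-index_len] != marker:
        return None                          # the marker must bracket the index
    sizes, pos = [], len(chunk) - index_len + 1
    for _ in range(frames):
        sizes.append(int.from_bytes(chunk[pos:pos + size_bytes], "little"))
        pos += size_bytes
    return sizes
```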
Tiles allow parts of the frame to be decoded independently, facilitating parallel processing. This is
particularly useful for multi-core processors, enabling faster decoding and playback.
This section covers the syntax elements of the VP9 bitstream, including how different types of data
(e.g., motion vectors, coefficients, and prediction modes) are parsed from the bitstream during
decoding. The syntax elements are often encoded using boolean arithmetic coding for efficient
storage.
Parsing follows a hierarchical structure, where higher-level elements like frame headers are parsed
first, followed by smaller elements like transform coefficients and motion vectors.
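The boolean coder VP9 builds on is essentially the one VP8 documents in RFC 6386: a (value, range) pair is compared against a split point derived from an 8-bit probability and then renormalized. A Python sketch of that scheme:

```python
class BoolDecoder:
    """Boolean arithmetic decoder in the style of VP8/VP9 (cf. RFC 6386)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 2
        self.value = (data[0] << 8) | data[1]  # 16-bit decoding window
        self.range = 255                       # current interval width
        self.bit_count = 0                     # bits shifted in from the stream

    def read_bool(self, prob: int) -> int:
        """Decode one bit; prob is P(bit == 0) scaled to 1..255."""
        split = 1 + (((self.range - 1) * prob) >> 8)
        big_split = split << 8
        if self.value >= big_split:            # value fell in the upper part
            bit, self.range, self.value = 1, self.range - split, self.value - big_split
        else:                                  # value fell in the lower part
            bit, self.range = 0, split
        while self.range < 128:                # renormalize: keep range >= 128
            self.value <<= 1
            self.range <<= 1
            self.bit_count += 1
            if self.bit_count == 8:            # window's low byte consumed:
                self.bit_count = 0             # pull in the next stream byte
                if self.pos < len(self.data):
                    self.value |= self.data[self.pos]
                    self.pos += 1
        return bit
```

Decoding any syntax element then reduces to a sequence of read_bool calls made against that element's probability model.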
Conclusion:
The VP9 Bitstream & Decoding Process Specification provides an in-depth look at how VP9 achieves
high-efficiency video compression through various techniques such as transform coding, prediction,
motion compensation, and probability modeling. It is a highly flexible codec designed to adapt to
a wide range of content types, resolutions, and bandwidth constraints.
Quality
Objective quality metrics such as PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity
Index) are used to compare visual quality after compression.
VP9 often outperforms H.264 in both PSNR and SSIM at the same bitrate, or delivers
comparable quality at a significantly lower bitrate.
Compared to H.265, VP9 offers similar PSNR and SSIM scores, though H.265 may outperform
VP9 slightly at higher resolutions (4K and beyond) and in low-bitrate scenarios.
Overall, H.265 tends to have a slight edge over VP9 in high-quality, high-resolution video
compression, while VP9 is comparable or even superior in web video streaming
contexts, where both quality and speed matter.
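For reference, PSNR is just a log-scaled mean squared error; a minimal 8-bit implementation:

```python
import numpy as np

def psnr(original, compressed, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the original."""
    mse = np.mean((original.astype(float) - compressed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")               # identical images
    return 10 * np.log10(max_val ** 2 / mse)

a = np.random.randint(0, 255, (64, 64))
noisy = np.clip(a + np.random.normal(0, 5, a.shape), 0, 255)
print(psnr(a, noisy))                     # roughly 34 dB for sigma = 5 noise
```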
Latency
Latency is critical for real-time applications such as video conferencing and live streaming.
H.264 has lower latency compared to VP9 and H.265 because it requires fewer
computational resources.
VP9 tends to have higher encoding and decoding latency compared to H.264, making it less
ideal for real-time applications unless optimized.
H.265 can also introduce more latency than H.264 due to its increased complexity, though
hardware acceleration can help mitigate this.
Licensing
H.264 is covered by patents and licensing fees, managed by MPEG LA. Most users are
covered under general-use terms, but licensing fees apply to certain commercial uses.
H.265 also carries licensing fees, which are widely considered more complex and costly than
those for H.264; the need to pay royalties has slowed H.265's adoption in some cases.
VP9, by contrast, is open source and royalty-free, which has helped its adoption for web streaming.
Performance
Metric-by-metric comparison table: VP9 vs. H.264 (AVC) vs. H.265 (HEVC).