Presented By: Priya Raina 13-516
The Moving Picture Experts Group (MPEG) is
a working group of experts that was formed
by ISO and IEC to set standards for audio and video
compression and transmission.
Founded in 1988 by Hiroshi Yasuda and Leonardo
Chiariglione
Introduction
The ISO standards produced by MPEG are identified by a five-digit number;
MPEG-1, for example, is ISO/IEC 11172.
Proposal of new work within a committee (NP)
NP approved at Subcommittee (SC29) and then at
Technical Committee (JTC1) level
Scope definition and Call for Proposals (CfP)
Birth of a standard
For audio and video coding standards the first
document produced is called a Test Model; it describes,
in a programming language, the operation of the
encoder and the decoder.
Used to carry out simulations to optimise the
performance of the coding scheme.
A Working Draft (WD) is produced, which is already in
the form of a standard; it is kept internal to MPEG for
revision.
A sufficiently solid WD becomes a Committee Draft (CD).
It is then sent to National Bodies (NB) for ballot. If
the number of positive votes is above the quorum,
the CD becomes Final Committee Draft (FCD) and is
again submitted to NBs for the second ballot after a
thorough review that may take into account the
comments issued by NBs. If the number of positive
votes is above the quorum the FCD becomes Final
Draft International Standard (FDIS). ISO will then hold
a yes/no ballot with National Bodies where no
technical changes are allowed. The document then
becomes International Standard (IS).
Process flow: NP → Approval by SC29 and JTC1 → Scope Definition and CfP → Test Model → WD → CD → FCD → FDIS → IS
High-Definition Television (HDTV)
1920x1080
30 frames per second (full motion)
8 bits for each of the three primary colors
Total 1.5 Gb/sec!
Each cable channel is 6 MHz
Max data rate of 19.2 Mb/sec
Reduced to 18 Mb/sec w/audio + control
Compression ratio must be 83:1! (see the quick calculation below)
The Need for Video
Compression
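As a quick check on the HDTV numbers above, a back-of-the-envelope calculation (a sketch in Python; the 18 Mbit/s channel figure is simply the one quoted on the slide):

```python
# Back-of-the-envelope check of the HDTV compression requirement above.
width, height = 1920, 1080          # pixels
fps = 30                            # frames per second (full motion)
bits_per_pixel = 3 * 8              # 8 bits for each of three primary colors

raw_rate = width * height * fps * bits_per_pixel      # bits/second
channel_rate = 18e6                                    # ~18 Mbit/s left after audio + control

print(f"Raw rate:       {raw_rate / 1e9:.2f} Gbit/s")  # about 1.49 Gbit/s
print(f"Required ratio: {raw_rate / channel_rate:.0f}:1")  # about 83:1
```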
CD-ROM and DAT are the key storage devices
1-2 Mbits/sec for 1x CD-ROM
Two types of application videos:
Asymmetric (encoded once, decoded many)
Video games, Video on Demand
Symmetric (encoded once, decoded once)
Video phone, video mail
(How do you think the two types might influence design?)
Video at about 1.5 Mbits/sec
Audio at about 64-192 kbits/channel
Compatibility Goals
Random Access, Reverse, Fast Forward, Search
At any point in the stream
Can reduce quality somewhat during task, if needed
Audio/Video Synchronization
Even when under two different clocks
Robustness to errors
Not catastrophic if bits lost
Coding/Decoding delay under 150ms
For interactive applications
Editability
Modify/Replace frames
Requirements
Standards at a glance
MPEG 1
MPEG 2
MPEG 3
MPEG 4
MPEG 7
MPEG 21
MPEG A
MPEG B
MPEG C
MPEG D
MPEG E
MPEG V
MPEG M
MPEG U
MPEG H
MPEG DASH
MPEG 1
ISO/IEC 11172
Coding of moving pictures and associated audio at up to about 1.5
Mbit/s
Horizontal picture size: 768 pixels
Vertical picture size: 576 lines
Number of macroblocks per picture: 396
Number of macroblocks × picture rate: 396 × 25 = 9,900
Picture rate: 30 pictures/s
VBV buffer size: 2,621,440 bits
Bit rate: 1,856,000 bits/s
The Systems layer provides the information about the audio and video
layers, with stream identification and synchronization information
essential to the decoding and subsequent rendering of each of them.
It is required to carry not only the multiplexed audio and video
information but all of the other non-audio/video (and in many cases
private) data needed for a successful and pleasing user experience.
It is designed for the error-free environment of CDs and optical discs.
Systems Layer
A unique multiplex (SYSMUX) is designed to deliver a clock
reference and the elementary streams precisely, in such a way as to
enable audio-video synchronization thanks to constant delay.
This is achieved by specifying the System Target Decoder
(STD) model. The STD is an idealized model of a
demultiplexing and decoding complex that precisely
specifies the delivery time of each byte in an MPEG
Systems multiplex and its distribution to the
appropriate decoder or resource in the complex.
Requirements: in the context of storage and replay of stored data,
these mainly relate to random access, i.e. forward/backward replay,
fast-forward mode and editing.
Basic principle: hybrid coding, the combination of block-wise
motion-compensated prediction and scalar-quantized DCT-based
coding of the residual. The same transform is applied
when intra-frame mode is selected for a whole picture or a
macroblock.
MPEG-1 exploits perceptual compression methods to significantly
reduce the data rate, i.e. it reduces or discards information in
certain frequencies and areas of the picture that the human eye
has limited ability to fully perceive.
It also exploits temporal (over time) and spatial (across a picture)
redundancy.
Video Layer
Structure of the Coded Bit-Stream
[Diagram: a sequence consists of GOPs (GOP-1, GOP-2, ..., GOP-n); each GOP is a series of I, B and P pictures (e.g. I B B B P B B ...); each picture is divided into slices (Slice-1 ... Slice-N); each slice into macroblocks (mb-1 ... mb-n); and each macroblock into six 8x8 blocks, numbered 0-3 (luma) and 4-5 (chroma). Layers, top to bottom: Sequence layer, GOP layer, Picture layer, Slice layer, Macroblock layer, 8x8 block.]
Sequence information
Video Params include width, height, aspect ratio of pixels,
picture rate.
Bitstream Params are bit rate, buffer size, and constrained
parameters flag (means bitstream can be decoded by most
hardware)
Two types of QTs: one for intra-coded blocks (I-frames) and
one for inter-coded blocks (P- and B-frames).
Group of Pictures (GOP) Information
Time code: bit field with SMPTE time code (hours, minutes,
seconds, frame).
GOP Params are bits describing structure of GOP.
Picture information
Type: I, P, or B-frame?
Buffer Params indicate how full decoder's buffer should be before
starting decode.
Encode Params indicate whether half pixel motion vectors are used.
Slice information
Vert Pos: what line does this slice start on?
QScale: How is the quantization table scaled in this slice?
Macroblock information
Addr Incr: number of MBs to skip.
Type: Does this MB use a motion vector? What type?
QScale: How is the quantization table scaled in this MB?
Coded Block Pattern (CBP): bitmap indicating which blocks are
coded.
The color-space is transformed to Y'CbCr (Y'=Luma, Cb=Chroma Blue,
Cr=Chroma Red). Luma(brightness, resolution) is stored separately
from chroma (color, hue, phase) and even further separated into red and
blue components. The chroma is also subsampled to 4:2:0, meaning it is
reduced by one half vertically and one half horizontally, to just one quarter
the resolution of the video.
Because the human eye is much more sensitive to small changes in
brightness (the Y component) than in color (the Cr and Cb
components), chroma subsampling is a very effective way to reduce the
amount of video data that needs to be compressed. Can manifest as
chroma aliasing artifacts
Because of subsampling, Y'CbCr video must always be stored using even
dimensions, otherwise chroma mismatch ("ghosts") will occur
MPEG-1 operates on 8x8 blocks for quantization. However, because chroma (color) is
subsampled by a factor of 4, each pair of (red and blue) chroma blocks
corresponds to 4 different luma blocks. This set of 6 blocks, covering a
16x16 pixel area, is called a macroblock.
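A minimal sketch of the colour-space conversion and 4:2:0 subsampling described above, assuming BT.601-style conversion coefficients and simple 2x2 averaging for the chroma reduction (MPEG-1 specifies the decoded 4:2:0 layout, not how an encoder derives it):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (H, W, 3) float RGB image in [0, 255] to Y', Cb, Cr planes (BT.601-style)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0
    return y, cb, cr

def subsample_420(plane):
    """Halve a chroma plane horizontally and vertically by averaging 2x2 blocks."""
    h, w = plane.shape
    assert h % 2 == 0 and w % 2 == 0, "4:2:0 needs even dimensions (else chroma 'ghosts')"
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rgb = np.random.randint(0, 256, (16, 16, 3)).astype(float)   # one macroblock worth of pixels
y, cb, cr = rgb_to_ycbcr(rgb)
cb420, cr420 = subsample_420(cb), subsample_420(cr)
# A 16x16 macroblock: four 8x8 luma blocks plus one 8x8 Cb and one 8x8 Cr block (6 blocks).
print(y.shape, cb420.shape, cr420.shape)   # (16, 16) (8, 8) (8, 8)
```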
Spatial Redundancy Reduction
Intra-frame encoding: quantization (the major reduction step, and the one that controls quality), followed by a zig-zag scan and run-length coding.
Frames: I, P, B, D. The D-frame is exclusive to MPEG-1: an I-frame encoded using DC
transform coefficients only; very low quality; never referenced by I-, P- or B-
frames; used for fast previews of video, for instance when seeking through a
video at high speed. Now obsolete.
Only blocks that change are updated (up to the maximum GOP size). This is
known as conditional replenishment.
Movement of the objects, and/or the camera may result in large portions of the
frame needing to be updated, even though only the position of the previously
encoded objects has changed. Through motion estimation the encoder can
compensate for this movement and remove a large amount of redundant
information.
The encoder compares the current frame with adjacent parts of the video from
the anchor frame (previous I- or P- frame) in a diamond pattern, up to an
(encoder-specific) predefined radius limit from the area of the current
macroblock. If a match is found, only the direction and distance (i.e.
the vector of the motion) from the previous video area to the current macroblock
need to be encoded into the inter-frame (P- or B- frame). The reverse of this
process, performed by the decoder to reconstruct the picture, is called motion
compensation.
A predicted macroblock rarely matches the current picture perfectly, however.
The difference between the estimated matching area and the real
frame/macroblock is called the prediction error.
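A toy full-search block-matching routine to illustrate motion estimation and the prediction error; MPEG-1 does not mandate a search strategy, so the exhaustive SAD search and the parameters below are illustrative only (real encoders use faster patterns such as the diamond search mentioned above, plus half-pixel refinement):

```python
import numpy as np

def motion_estimate(ref, cur, bx, by, block=16, radius=8):
    """Find the best-matching block in `ref` for the block at (bx, by) in `cur`.

    Exhaustive search over +/- radius using the sum of absolute differences (SAD).
    Returns the motion vector (dx, dy) and the prediction error block.
    """
    target = cur[by:by + block, bx:bx + block].astype(int)
    best, best_vec = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + block > ref.shape[1] or y + block > ref.shape[0]:
                continue                               # candidate falls outside the reference
            cand = ref[y:y + block, x:x + block].astype(int)
            sad = np.abs(target - cand).sum()
            if best is None or sad < best:
                best, best_vec = sad, (dx, dy)
    dx, dy = best_vec
    prediction = ref[by + dy:by + dy + block, bx + dx:bx + dx + block].astype(int)
    return best_vec, target - prediction               # motion vector + prediction error

ref = np.random.randint(0, 256, (64, 64))
cur = np.roll(ref, shift=(3, -2), axis=(0, 1))         # simulate a simple translation
vec, err = motion_estimate(ref, cur, bx=16, by=16)
print("motion vector:", vec, "residual energy:", np.abs(err).sum())
```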
Temporal Redundancy Reduction
I frames are independently encoded
P frames are based on previous I, P frames
B frames are based on previous and following I and P
frames
In case something is uncovered
Quantization is performed by taking each of the 64 frequency values of the DCT block,
dividing them by the frame-level quantizer, then dividing them by their
corresponding values in the quantization matrix. Finally, the result is rounded down.
This significantly reduces, or completely eliminates, the information in some
frequency components of the picture. Typically, high frequency information is less
visually important, and so high frequencies are much more strongly
quantized (drastically reduced). MPEG-1 actually uses two separate quantization
matrices, one for intra-blocks (I-blocks) and one for inter-blocks (P- and B-blocks),
so quantization of different block types can be done independently and thus more
effectively.
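A rough sketch of that quantization step; the quantization matrix below and the exact scaling formula are illustrative assumptions, not the normative MPEG-1 definition (which also special-cases the intra DC term):

```python
import numpy as np

def quantize(dct_block, quant_matrix, quantizer_scale):
    """Quantize an 8x8 block of DCT coefficients.

    Each coefficient is divided by the frame/macroblock-level quantizer scale and by
    the corresponding entry of the quantization matrix, then truncated toward zero.
    (Illustrative only; the normative formula has additional rounding details.)
    """
    return np.fix(dct_block / (quantizer_scale * quant_matrix / 8.0)).astype(int)

def dequantize(levels, quant_matrix, quantizer_scale):
    """Approximate inverse used by the decoder."""
    return levels * quant_matrix * quantizer_scale / 8.0

# Example matrix that quantizes high frequencies more coarsely than low ones
# (assumption: NOT the standard's default intra matrix).
u, v = np.meshgrid(np.arange(8), np.arange(8))
quant_matrix = 8 + 2 * (u + v)

dct_block = np.random.randn(8, 8) * 100
levels = quantize(dct_block, quant_matrix, quantizer_scale=4)
recon = dequantize(levels, quant_matrix, quantizer_scale=4)
print("non-zero coefficients after quantization:", np.count_nonzero(levels))
print("max reconstruction error:", np.abs(dct_block - recon).max())
```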
This is also the primary source of most MPEG-1 video compression artifacts,
like blockiness, color banding, noise, ringing and discoloration. These appear when
video is encoded with an insufficient bitrate, and the encoder is therefore forced to
use high frame-level quantizers (strong quantization) through much of the video.
The final coding steps are forms of entropy coding in the sense of information theory.
The coefficients of quantized DCT blocks tend to zero towards the bottom-right.
Maximum compression can be achieved by a zig-zag scanning of the DCT block
starting from the top left and using Run-length encoding techniques.
The DC coefficients and motion vectors are DPCM-encoded.
Run-length encoding (RLE) is a very simple method of compressing repetition. A
sequential string of characters, no matter how long, can be replaced with a few bytes,
noting the value that repeats, and how many times. For example, if someone were to
say "five nines", you would know they mean the number: 99999.
RLE is particularly effective after quantization, as a significant number of the AC
coefficients are now zero (called sparse data), and can be represented with just a
couple of bytes. This is stored in a special 2-dimensional Huffman table that codes
the run-length and the run-ending character.
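A small sketch of the zig-zag scan followed by (run, level) coding; the scan order is computed here rather than taken from the standard's table, and the plain (run, level) pairs stand in for the actual VLC tables:

```python
import numpy as np

def zigzag_order(n=8):
    """Return (row, col) index pairs in zig-zag order for an n x n block."""
    idx = [(r, c) for r in range(n) for c in range(n)]
    # Sort by anti-diagonal; alternate traversal direction on odd/even diagonals.
    return sorted(idx, key=lambda rc: (rc[0] + rc[1],
                                       rc[1] if (rc[0] + rc[1]) % 2 == 0 else -rc[1]))

def run_length_encode(block):
    """Zig-zag scan a quantized 8x8 block and emit (run-of-zeros, level) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, int(level)))
            run = 0
    pairs.append("EOB")            # end-of-block marker once only zeros remain
    return pairs

block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[2, 0] = 52, -3, 7     # sparse data typical after quantization
print(run_length_encode(block))   # [(0, 52), (0, -3), (1, 7), 'EOB']
```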
Huffman Coding is a very popular method of entropy coding, and used in MPEG-1
video to reduce the data size. The data is analyzed to find strings that repeat often.
Those strings are then put into a special table, with the most frequently repeating
data assigned the shortest code. This keeps the data as small as possible with this
form of compression.
Once the table is constructed, those strings in the data are
replaced with their (much smaller) codes, which reference the appropriate entry in
the table. The decoder simply reverses this process to produce the original data.
This is the final step in the video encoding process, so the result of Huffman
coding is known as the MPEG-1 video "bitstream".
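A compact illustration of Huffman coding over symbol frequencies; note that MPEG-1 uses fixed, pre-computed code tables rather than building one per stream, so this only demonstrates the principle:

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix code where frequent symbols get the shortest bit strings."""
    counts = Counter(data)
    if len(counts) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: [weight, tie_breaker, {symbol: partial_code}]
    heap = [[w, i, {sym: ""}] for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, lo = heapq.heappop(heap)        # the two least frequent subtrees
        w2, _, hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo.items()}
        merged.update({s: "1" + c for s, c in hi.items()})
        heapq.heappush(heap, [w1 + w2, tie, merged])
        tie += 1
    return heap[0][2]

data = "AAAAABBBCCD"
table = huffman_code(data)
encoded = "".join(table[s] for s in data)
print(table)                                   # e.g. {'A': '0', 'B': '10', 'D': '110', 'C': '111'}
print(len(encoded), "bits vs", 8 * len(data), "bits uncompressed")
```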
I-frame-only sequences give the least compression, but are
useful for random access, FF/FR and editability. I- and P-
frame sequences give moderate compression but add a
certain degree of random access and FF/FR functionality. I-, P-
and B-frame sequences give very high compression but also
increase the coding/decoding delay significantly. Such
configurations are therefore not suited for video-
telephony or video-conferencing applications.
The typical data rate of an I-frame is 1 bit per pixel, while
that of a P-frame is 0.1 bit per pixel and that of a B-frame is
0.015 bit per pixel.
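Using the per-frame figures quoted above, a quick estimate of the average video bitrate for a common 12-picture GOP at SIF resolution (the bits-per-pixel values are the rough averages from this slide, not guarantees of the standard):

```python
# Rough bitrate estimate from the per-frame figures above.
bits_per_pixel = {"I": 1.0, "P": 0.1, "B": 0.015}
gop = "IBBPBBPBBPBB"                  # a common 12-picture GOP structure
width, height, fps = 352, 288, 25     # SIF (625-line) resolution

bits_per_gop = sum(bits_per_pixel[f] * width * height for f in gop)
avg_bitrate = bits_per_gop * fps / len(gop)
print(f"average ~{avg_bitrate / 1e6:.2f} Mbit/s")   # roughly 0.3 Mbit/s for picture data alone
```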
MPEG: Video Encoding
[Encoder block diagram: Input → Pre-processing → Frame Memory → subtractor (+/−) → DCT → Quantizer (Q) → VLC Encoder → Buffer → Output, with a Regulator adjusting Q from buffer fullness; a reconstruction path (inverse quantizer Q⁻¹ → IDCT → adder → Frame Memory) feeds Motion Estimation and Motion Compensation, which supply the predictive frame and the motion vectors.]
Interframe predictive coding (P-pictures)
For each macroblock the motion estimator produces the
best matching macroblock
The two macroblocks are subtracted and the difference is
DCT coded
Interframe interpolative coding (B-pictures)
The motion vector estimation is performed twice, once against the
previous and once against the following reference picture
The encoder forms a prediction error macroblock from either
prediction or from their average
The prediction error is encoded using a block-based DCT
The encoder needs to reorder pictures, because a B-picture can only
be coded and decoded after the later reference picture it depends on
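A sketch of that reordering, under the simplifying assumptions of a closed GOP and every I/P picture acting as a reference:

```python
def display_to_coded_order(display_order):
    """Reorder pictures so every B-picture follows both of its reference pictures.

    Simplified model: each I/P picture is emitted before the run of B-pictures
    that immediately precedes it in display order.
    """
    coded, pending_b = [], []
    for pic in display_order:
        if pic.startswith("B"):
            pending_b.append(pic)          # hold B-pictures until the next reference arrives
        else:
            coded.append(pic)              # emit the reference first...
            coded.extend(pending_b)        # ...then the B-pictures that depend on it
            pending_b = []
    coded.extend(pending_b)                # trailing B-pictures would need the next GOP
    return coded

display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(display_to_coded_order(display))     # ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```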
MPEG-1 videos are most commonly seen
using Source Input Format (SIF) resolution: 352x240,
352x288, or 320x240. These low resolutions,
combined with a bitrate less than 1.5 Mbit/s, make
up what is known as a constrained parameters
bitstream (CPB), later renamed the "Low Level" (LL)
profile in MPEG-2. These are the minimum video
specifications any decoder should be able to handle
to be considered MPEG-1 compliant.
Audio Layer
Layer I: for applications that require both low-complexity decoding and
encoding. Layer II: higher compression efficiency for a slightly higher
complexity.
Layer II/MP2 is a time-domain encoder. It uses a low-delay 32-sub-
band polyphase filter bank for time-frequency mapping, with
overlapping ranges (i.e. polyphase) to prevent aliasing.
The 32 sub-band filter bank returns 32 amplitude coefficients, one for each
equal-sized frequency band/segment of the audio, which is about 700 Hz
wide (depending on the audio's sampling frequency). The encoder then
utilizes the psychoacoustic model to determine which sub-bands contain
audio information that is less important, and so, where quantization will
be inaudible.
Layer II can also optionally use intensity stereo coding, a form of joint stereo. This
means that the frequencies above 6 kHz of both channels are
combined/down-mixed into one single (mono) channel, but the "side
channel" information on the relative intensity (volume, amplitude) of each
channel is preserved and encoded into the bitstream separately. On
playback, the single channel is played through left and right speakers, with
the intensity information applied to each channel to give the illusion of
stereo sound.
This perceptual trick is known as stereo irrelevancy.
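A toy illustration of the intensity-stereo idea, working on per-band amplitudes; real Layer II operates on sub-band samples and scalefactors, and the band width and 6 kHz cut-off below are simplifying assumptions:

```python
import numpy as np

def intensity_stereo_encode(left, right, cutoff_band):
    """Above the cutoff, keep one mono signal per band plus per-channel intensity factors."""
    mono = (left + right) / 2.0
    eps = 1e-12
    scale_l = np.abs(left[cutoff_band:]) / (np.abs(mono[cutoff_band:]) + eps)
    scale_r = np.abs(right[cutoff_band:]) / (np.abs(mono[cutoff_band:]) + eps)
    return {"low_l": left[:cutoff_band], "low_r": right[:cutoff_band],
            "high_mono": mono[cutoff_band:], "scale_l": scale_l, "scale_r": scale_r}

def intensity_stereo_decode(s):
    """Play the single high-band signal through both channels, scaled by intensity."""
    left = np.concatenate([s["low_l"], s["high_mono"] * s["scale_l"]])
    right = np.concatenate([s["low_r"], s["high_mono"] * s["scale_r"]])
    return left, right

# 32 per-band amplitudes per channel; bands from index 8 (~6 kHz at 48 kHz, 750 Hz/band)
# are combined.  High bands keep only their intensity, not the original waveform.
left = np.random.rand(32)
right = np.random.rand(32) * 0.5
l2, r2 = intensity_stereo_decode(intensity_stereo_encode(left, right, cutoff_band=8))
print(np.allclose(left[:8], l2[:8]), np.allclose(right[:8], r2[:8]))  # low bands exact
```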
MP3 is a frequency-domain audio transform encoder.
It has worse temporal resolution than Layer II, which causes quantization
artifacts when transient sounds such as percussive events and other
high-frequency events spread over a larger window, resulting in
audible smearing and pre-echo.
Being forced to use a hybrid time-domain (filter bank) / frequency-domain
(MDCT) model to fit in with Layer II wastes processing
time and compromises quality by introducing aliasing artifacts.
MP3 can use middle/side (mid/side, m/s, MS, matrixed) joint
stereo. With mid/side stereo, certain frequency ranges of both
channels are merged into a single (middle, mid, L+R) mono
channel, while the sound difference between the left and right
channels is stored as a separate (side, L-R) channel. Unlike
intensity stereo, this process does not discard any audio
information.
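Mid/side stereo itself is a simple, lossless transform; a minimal sketch:

```python
import numpy as np

def ms_encode(left, right):
    """Mid/side: mid = (L+R)/2 carries the common content, side = (L-R)/2 the difference."""
    return (left + right) / 2.0, (left - right) / 2.0

def ms_decode(mid, side):
    return mid + side, mid - side            # exactly recovers L and R

left = np.random.randn(1024)
right = 0.8 * left + 0.1 * np.random.randn(1024)   # typical correlated stereo content
mid, side = ms_encode(left, right)
l2, r2 = ms_decode(mid, side)
print(np.allclose(left, l2) and np.allclose(right, r2))   # True: no audio information lost
print(f"side-channel energy is {np.sum(side**2) / np.sum(mid**2):.1%} of mid energy")
```

For typical stereo material the side channel carries far less energy than the mid channel, which is why coding mid and side separately is cheaper than coding left and right.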
MPEG 2
ISO/IEC 13818
Generic coding of moving pictures and associated audio
MPEG-2 Systems defines two container formats. One is the transport stream, a data packet
format designed to transmit one data packet in four ATM data
packets for streaming digital video and audio over fixed or
mobile transmission media, where the beginning and the
end of the stream may not be identified.
The other is the program stream, an extended version of
the MPEG-1 container format designed for random access
storage mediums such as hard disk drives, optical
discs and flash memory.
Applications include Standard Definition and High Definition television
broadcasting over terrestrial, satellite and cable networks, and
optical disc, specifically DVD for movie distribution.
Systems Layer
The Video section, part 2 of MPEG-2, is similar to the
previous MPEG-1 standard, but also provides support
for interlaced video, the format used by analog broadcast
TV systems. MPEG-2 video is not optimized for low bit-
rates, especially less than 1 Mbit/s at standard
definition resolutions. All standards-compliant MPEG-2
Video decoders are fully capable of playing back MPEG-1
Video streams conforming to the Constrained Parameters
Bitstream syntax. MPEG-2/Video is formally known as
ISO/IEC 13818-2 and as ITU-T Rec. H.262.
With some enhancements, MPEG-2 Video and Systems
are also used in some HDTV transmission systems.
Video Layer
Interlaced and non-interlaced frame
Different color subsampling modes e.g., 4:2:2, 4:2:0, 4:4:4
Flexible quantization schemes that can be changed at the
picture level
Scalable bit-streams
Profiles and levels
A number of levels and profiles have been defined for
MPEG-2 video compression. Each of these describes
a useful subset of the total functionality offered by
the MPEG-2 standards. An MPEG-2 system is
usually developed for a certain set of profiles at a
certain level. Basically:
Profile = quality of the video
Level = resolution of the video
Levels vs. Profiles (columns: SNR 4:2:0 | Spatial 4:2:0 | High 4:2:0, 4:2:2 | Multiview 4:2:0)

High level
  Enhancement: 1920 x 1152/60 | 1920 x 1152/60
  Lower: 960 x 576/30 | 1920 x 1152/60
  Bitrate (Mbit/s): 100, 80, 25 | 130, 50, 80
High-1440 level
  Enhancement: 1440 x 1152/60 | 1440 x 1152/60 | 1920 x 1152/60
  Lower: 720 x 576/30 | 720 x 576/30 | 1920 x 1152/60
  Bitrate (Mbit/s): 60, 40, 15 | 80, 60, 20 | 100, 40, 60
Main level
  Enhancement: 720 x 576/30 | 720 x 576/30 | 720 x 576/30
  Lower: 352 x 288/30 | 720 x 576/30
  Bitrate (Mbit/s): 15, 10 | 20, 15, 4 | 25, 10, 15
Low level
  Enhancement: 352 x 288/30 | 352 x 288/30
  Lower: 352 x 288/30
  Bitrate (Mbit/s): 4, 3 | 8, 4, 4
Multiview Profile
Stereoscopic view disparity prediction
Virtual walk-throughs composed from multiple
viewpoints
Supporting Interlaced
Video
MPEG-2 must support interlaced video as well since this
is one of the options for digital broadcast TV and HDTV
In interlaced video each frame consists of two fields,
referred to as the top-field and the bottom-field
In a Frame-picture, all scanlines from both fields are
interleaved to form a single frame, then divided into 16 x 16
macroblocks and coded using motion compensation (MC)
If each field is treated as a separate picture, then it is called
Field-picture
MPEG-2 defines Frame Prediction and Field Prediction, as
well as five prediction modes
Fig. 11.6: Field pictures and field prediction for field pictures in MPEG-2.
(a) Frame-picture vs. Field-pictures, (b) Field Prediction for Field-pictures
[Figure: Zigzag and alternate scans of DCT coefficients for progressive and
interlaced video in MPEG-2.]
MPEG-2 layered coding
The MPEG-2 scalable coding: A base layer and one or
more enhancement layers can be defined
The base layer can be independently encoded, transmitted
and decoded to obtain basic video quality
The encoding and decoding of the enhancement layer is
dependent on the base layer or the previous enhancement
layer
Scalable coding is especially useful for MPEG-2 video
transmitted over networks with the following characteristics:
Networks with very different bit-rates
Networks with variable bit rate (VBR) channels
Networks with noisy connections
MPEG-2 Scalabilities
MPEG-2 supports the following scalabilities:
1. SNR scalability: the enhancement layer provides higher
SNR
2. Spatial scalability: the enhancement layer provides
higher spatial resolution
3. Temporal scalability: the enhancement layer facilitates a
higher frame rate
4. Hybrid scalability: a combination of any two of the
above three scalabilities
5. Data partitioning: quantized DCT coefficients are
split into partitions
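As a rough picture of how layered coding works, a toy spatial-scalability split: a downsampled base layer plus an enhancement-layer residual against the up-sampled base. This is purely illustrative; MPEG-2 defines specific up-sampling filters and codes both layers with the normal DCT pipeline:

```python
import numpy as np

def split_spatial_layers(frame):
    """Base layer = 2x downsampled frame; enhancement = residual vs. the upsampled base."""
    base = frame[::2, ::2]                                         # crude 2x decimation
    upsampled = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1)   # crude 2x interpolation
    enhancement = frame - upsampled                                # what the enhancement layer must code
    return base, enhancement

def reconstruct(base, enhancement=None):
    upsampled = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1)
    return upsampled if enhancement is None else upsampled + enhancement

frame = np.random.randint(0, 256, (288, 352)).astype(int)
base, enh = split_spatial_layers(frame)
print(reconstruct(base).shape)                        # basic quality from the base layer alone
print(np.array_equal(reconstruct(base, enh), frame))  # True once the enhancement layer is added
```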
AAC is a multi-channel perceptual audio coder,
appropriate for applications involving storage or transmission of mono,
stereo or multi-channel music or other audio signals where quality of the
reconstructed audio is paramount.
AAC achieves coding gain primarily through three strategies. First, it uses a
high-resolution transform (1024 frequency bins) to achieve redundancy
removal: the invertible removal of information based on purely
statistical properties of a signal. Second, it uses a continuously signal-
adaptive model of the human auditory system to determine a threshold for
the perception of quantization noise and thereby achieve irrelevancy
reduction: the irretrievable removal of information based on the fact
that it is not perceivable. Third, entropy coding is used to match the actual
entropy of the quantized values with the entropy of their representation in
the bitstream. Additionally, AAC provides tools for the joint coding of
stereo signals and other coding tools for special classes of signals.
AAC is the default or standard audio format
for YouTube, iPhone, iPod, iPad
Audio Layer: Advanced
Audio Coding
MPEG 4
ISO/IEC 14496
Coding of audio-visual objects
Huge standard with 31 parts.
Targets multimedia for the fixed and mobile web.
Features:
Efficient across a variety of bit-rates ranging from a few kilobits per second to tens
of megabits per second. MPEG-4 provides the following functions:
Improved coding efficiency over MPEG-2
Ability to encode mixed media data (video, audio, speech)
Error resilience to enable robust transmission
Ability to interact with the audio-visual scene generated at the receiver
Subsets of the MPEG-4 tool sets have been provided for use in specific
applications. These subsets, called 'Profiles', limit the size of the tool set a
decoder is required to implement. In order to restrict computational
complexity, one or more 'Levels' are set for each Profile. A Profile and
Level combination allows:
A codec builder to implement only the subset of the standard needed, while
maintaining interworking with other MPEG-4 devices that implement the
same combination
Systems Layer:
The synchronized delivery of streaming information from source to destination,
exploiting different QoS as available from the network, is specified in terms of the
synchronization layer and a delivery layer containing a two-layer multiplexer, as
depicted in Figure 2.
The first multiplexing layer is managed according to the DMIF specification, part 6 of
the MPEG-4 standard. (DMIF stands for Delivery Multimedia Integration Framework)
This multiplex may be embodied by the MPEG-defined FlexMux tool, which allows
grouping of Elementary Streams (ESs) with a low multiplexing overhead.
Multiplexing at this layer may be used, for example, to group ESs with similar QoS
requirements, reduce the number of network connections or the end-to-end delay.
The TransMux (Transport Multiplexing) layer in Figure 2 models the layer that
offers transport services matching the requested QoS. Only the interface to this layer
is specified by MPEG-4 while the concrete mapping of the data packets and control
signaling must be done in collaboration with the bodies that have jurisdiction over the
respective transport protocol. Any suitable existing transport protocol stack such as
(RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2's Transport Stream over a suitable link
layer may become a specific TransMux instance. The choice is left to the end
user/service provider, and allows MPEG-4 to be used in a wide variety of operating
environments.
[Figure 2: The MPEG-4 delivery architecture. Elementary Streams cross the Elementary Stream Interface into the Sync Layer (SL), which produces SL-packetized streams; the DMIF Layer (FlexMux, below the DMIF Application Interface) groups them into FlexMux streams and channels; the TransMux Layer (reached through the DMIF Network Interface, not specified in MPEG-4) carries TransMux streams and channels over transports such as (RTP)/UDP/IP, (PES)/MPEG-2 TS, AAL2/ATM, H.223/PSTN or DAB mux, for file, broadcast and interactive delivery.]
The systems part of the MPEG-4 addresses the description of the
relationship between the audio-visual components that constitute a
scene. The relationship is described at two main levels:
The Binary Format for Scenes (BIFS) describes the spatio-temporal
arrangements of the objects in the scene. Viewers may have the
possibility of interacting with the objects, e.g. by rearranging them on
the scene or by changing their own point of view in a 3D virtual
environment. The scene description provides a rich set of nodes for 2-
D and 3-D composition operators and graphics primitives.
At a lower level, Object Descriptors (ODs) define the relationship
between the Elementary Streams pertinent to each object (e.g. the audio
and the video stream of a participant in a videoconference). ODs also
provide additional information such as the URL needed to access the
Elementary Streams, the characteristics of the decoders needed to parse
them, intellectual property and others.
Other issues addressed by MPEG-4 Systems:
A standard file format supports the exchange and authoring of MPEG-4 content
Interactivity, including: client and server-based interaction; a general event model for triggering
events or routing user actions; general event handling and routing between objects in the scene,
upon user or scene triggered events.
Java (MPEG-J) is used to query the terminal and its environment support, and there is also a
Java application engine to code 'MPEGlets'.
A tool for interleaving of multiple streams into a single stream, including timing information
(FlexMux tool).
A tool for storing MPEG-4 data in a file (the MPEG-4 File Format, MP4)
Interfaces to various aspects of the terminal and networks, in the form of Java APIs (MPEG-J)
Transport layer independence. Mappings to relevant transport protocol stacks, like (RTP)/UDP/IP
or MPEG-2 transport stream can be or are being defined jointly with the responsible standardization
bodies.
Text representation with international language support, font and font style selection, timing and
synchronization.
The initialization and continuous management of the receiving terminal's buffers.
Timing identification, synchronization and recovery mechanisms.
Datasets covering identification of Intellectual Property Rights relating to media objects.
DAI (DMIF Application Interface): interface between the
demultiplexer and the decoding buffer
ESI (Elementary Stream Interface): interface between the
decoding buffer and the decoder.
The DAI provides a series of packets, called SL packets, to the
decoding buffer of each elementary stream. An SL packet contains
one full access unit, or a fragment of one, as its payload, and also
carries the timing information of the payload for decoding and
composition in a header. Each access unit remains in the
decoding buffer until its decoding time arrives and then produces
a composition unit as a result of decoding, which remains in
the composition memory until the composition time arrives. By
using this conceptual model, the sender can guarantee that the stream
does not break the receiving terminal by causing overflow or
underflow of the decoding buffer or composition memory.
Terminal architecture
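A highly simplified sketch of that conceptual buffer model: access units wait in the decoding buffer until their decoding time, and the resulting composition units wait in composition memory until their composition time. The class and function names are illustrative, not the normative System Decoder Model:

```python
from dataclasses import dataclass

@dataclass
class AccessUnit:
    data: bytes
    decoding_time: float       # decoding timestamp carried in the SL packet header
    composition_time: float    # composition timestamp carried in the SL packet header

def run_decoder_model(access_units, clock_ticks):
    """Drain a decoding buffer and a composition memory against a shared clock."""
    decoding_buffer = sorted(access_units, key=lambda au: au.decoding_time)
    composition_memory, presented = [], []
    for now in clock_ticks:
        while decoding_buffer and decoding_buffer[0].decoding_time <= now:
            au = decoding_buffer.pop(0)            # "decode": AU becomes a composition unit
            composition_memory.append(au)
        while composition_memory and composition_memory[0].composition_time <= now:
            presented.append((now, composition_memory.pop(0)))
    return presented

aus = [AccessUnit(b"frame0", 0.0, 0.04), AccessUnit(b"frame1", 0.04, 0.08)]
for t, au in run_decoder_model(aus, clock_ticks=[0.0, 0.04, 0.08]):
    print(f"t={t:.2f}s present {au.data!r}")
```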
Audio
General Audio (transform coding techniques): from 6 kbit/s with
bandwidth below 4 kHz up to broadcast-quality audio, from mono up to
multichannel; low delays; Fine Granularity Scalability (FGS) with
scalability resolution down to 1 kbit/s per channel.
Speech signals: 2 kbit/s up to 24 kbit/s using the speech coding tools;
lower bitrates are possible with variable-rate coding. HVXC tools allow
speed and pitch to be modified under user control during playback;
CELP tools allow change of the playback speed.
Synthetic Audio: MPEG-4 Structured Audio is a language to describe
'instruments' (little programs that generate sound) and 'scores' (input
that drives those objects). These objects are not necessarily musical
instruments; they are in essence mathematical formulae that could
generate the sound of a piano, of falling water, or something 'unheard'
in nature.
Synthesized Speech: scalable Text-to-Speech (TTS) coders at 200 bit/s
to 1.2 kbit/s allow a text, or a text with prosodic parameters (pitch
contour, phoneme duration, and so on), as input to generate
intelligible synthetic speech.
Formats Supported
The following formats and bitrates are supported by MPEG-4
Visual:
Bitrates: typically between 5 kbit/s and more than 1 Gbit/s
Formats: progressive as well as interlaced video
Resolutions: typically from sub-QCIF to 'Studio' resolutions (4k x 4k
pixels)
Compression Efficiency
For all bit rates addressed, the algorithms are very efficient. This
includes the compact coding of textures with a quality adjustable
between "acceptable" for very high compression ratios up to "near
lossless".
Efficient compression of textures for texture mapping on 2-D and 3-D
meshes.
Random access of video to allow functionalities such as pause, fast
forward and fast reverse of stored video
Video
Content-Based Functionalities
Content-based coding of images and video allows separate
decoding and reconstruction of arbitrarily shaped video
objects.
Random access of content in video sequences allows
functionalities such as pause, fast forward and fast reverse
of stored video objects.
Extended manipulation of content in video sequences allows
functionalities such as warping of synthetic or natural text,
textures, image and video overlays on reconstructed video
content. An example is the mapping of text in front of a
moving video object where the text moves coherently with
the object.
Scalability of Textures, Images and Video
Complexity scalability in the encoder allows encoders of different complexity to
generate valid and meaningful bitstreams for a given texture, image or video.
Complexity scalability in the decoder allows a given texture, image or video
bitstream to be decoded by decoders of different levels of complexity. The
reconstructed quality, in general, is related to the complexity of the decoder
used. This may entail that less powerful decoders decode only a part of the
bitstream.
Spatial scalability allows decoders to decode a subset of the total bitstream
generated by the encoder to reconstruct and display textures, images and video
objects at reduced spatial resolution. A maximum of 11 levels of spatial
scalability are supported in so-called 'fine-granularity scalability', for video as
well as textures and still images.
Temporal scalability allows decoders to decode a subset of the total bitstream
generated by the encoder to reconstruct and display video at reduced temporal
resolution. A maximum of three levels are supported.
Quality scalability allows a bitstream to be parsed into a number of bitstream
layers of different bitrate such that the combination of a subset of the layers can
still be decoded into a meaningful signal. The bitstream parsing can occur either
during transmission or in the decoder. The reconstructed quality, in general, is
related to the number of layers used for decoding and reconstruction.
Fine Grain Scalability: a combination of the above in fine-grain steps, up to 11
steps
Shape and Alpha Channel Coding
Shape coding assists the description and composition of conventional
images and video as well as arbitrarily shaped video objects.
Applications that benefit from binary shape maps with images are
content-based image representations for image databases, interactive
games, surveillance, and animation. There is an efficient technique to
code binary shapes. A binary alpha map defines whether or not a pixel
belongs to an object. It can be on or off.
Gray Scale or alpha Shape Coding
An alpha plane defines the transparency of an object, which is not
necessarily uniform; it can vary over the object, so that, e.g., edges are
more transparent (a technique called feathering). Multilevel alpha
maps are frequently used to blend different layers of image sequences.
Other applications that benefit from associated binary alpha maps
with images are content-based image representations for image
databases, interactive games, surveillance, and animation.
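The blending an alpha plane enables can be sketched generically (this is plain alpha compositing, not MPEG-4's shape-coding syntax):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend an arbitrarily shaped object over a background using its alpha plane.

    alpha is in [0, 1]: 0 = fully transparent, 1 = fully opaque; intermediate
    values (e.g. feathered edges) mix the two layers.
    """
    return alpha * foreground + (1.0 - alpha) * background

fg = np.full((4, 4), 200.0)          # object pixels
bg = np.full((4, 4), 50.0)           # scene behind it
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0                # binary core of the object
alpha[0, :] = 0.5                    # a feathered (semi-transparent) edge row
print(composite(fg, bg, alpha))
```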
Coding of 2-D Meshes with Implicit Structure
2D mesh coding includes:
Mesh-based prediction and animated texture transfiguration
2-D Delaunay or regular mesh formalism with motion tracking of animated
objects
Motion prediction and suspended texture transmission with dynamic meshes.
Geometry compression for motion vectors:
2-D mesh compression with implicit structure & decoder reconstruction
Coding of 3-D Polygonal Meshes
MPEG-4 provides a suite of tools for coding 3-D polygonal meshes. Polygonal
meshes are widely used as a generic representation of 3-D objects. The
underlying technologies compress the connectivity, geometry, and properties
such as shading normals, colors and texture coordinates of 3-D polygonal
meshes.
The Animation Framework eXtension (AFX, see further down) will provide
more elaborate tools for 2D and 3D synthetic objects.
MPEG-7
A suite of standards for description and search of audio,
visual and multimedia content.
MPEG-21
A suite of standards that defines a normative open framework
for end-to-end multimedia creation, delivery and
consumption, providing content creators, producers,
distributors and service providers with equal opportunities
in the MPEG-21-enabled open market, and also benefiting
content consumers by giving them access to a
large variety of content in an interoperable manner.
MPEG-A
A suite of standards specifying application formats that
involve multiple MPEG and, where required, non-MPEG
standards
MPEG-B
A suite of standards for systems technologies that do
not fall in other well-established MPEG standards
MPEG-C
A suite of video standards that do not fall in other well-
established MPEG standards
MPEG-D
A suite of standards for Audio technologies that do not fall in
other MPEG standards
MPEG-E
A standard for an Application Programming Interface (API) of
Multimedia Middleware (M3W) that can be used to provide a
uniform view of an interoperable multimedia middleware
platform.
MPEG-V
MPEG-V outlines an architecture and specifies associated
information representations to enable interoperability between
virtual worlds (e.g., digital content providers of virtual worlds,
gaming, simulation), and between real and virtual worlds (e.g.,
sensors, actuators, vision and rendering, robotics).
MPEG-M
MPEG-M is a suite of standards to enable the easy design and
implementation of media-handling value chains whose devices
interoperate because they are all based on the same set of
technologies, especially MPEG technologies accessible from the
middleware and multimedia services
MPEG-U
MPEG-U provides a general-purpose technology with innovative
functionality that enables its use in heterogeneous scenarios such as
broadcast, mobile, home network and web domains.
MPEG-H
Suite of standards for heterogeneous environment delivery of
audio-visual information compressed with high efficiency
MPEG-DASH
DASH is a suite of standards providing a solution for the efficient
and easy streaming of multimedia using existing available HTTP
infrastructure (particularly servers and CDNs, but also proxies,
caches, etc.).