The H.263 Video Coding Standard
Bernd Girod, Khaled Ben Younes, Reinhard Bernstein, Peter Eisert, Uwe Horn, Eckehard Steinbach, Klaus Stuhlmuller, Niko Farber, Frank Hartung, Thomas Wiegand
Telecommunications Institute, University of Erlangen-Nuremberg
Cauerstrasse 7, 91058 Erlangen, Germany
[email protected]
To appear in Proceedings ISCAS-96

ABSTRACT

This paper gives an overview of recent advances in video compression and transmission. The new H.263 video compression standard is an important milestone towards widespread use of personal video communication. For mobile video applications and video transmission over networks, error robustness and scalability are important issues. Structure-from-motion methods, region-based coding, and model-based coding are research topics pointing into the future, towards MPEG-4.

1. INTRODUCTION

Digital video compression is one of the key issues in video coding, enabling efficient interchange and distribution of visual information. New applications in the field of communication, multimedia, and digital television broadcasting [1] require highly efficient, robust, and flexible digital video compression and encoding techniques [2]. Multimedia applications range, for example, from desktop videoconferencing and computer-supported cooperative work to interactive entertainment networks where video-on-demand, video games, and teleshopping are provided. The integration of motion video as an integral part of multimedia environments is technologically one of the most demanding tasks, due to the high data rates and real-time constraints.

Given the importance of the problem, there is a significant amount of ongoing research in the area of video compression. The rate required to encode a full-motion video signal at VHS quality has come down from 20 Mbit/s around 1980 to well below 1 Mbit/s today. For head-and-shoulders scenes, typical for videoconferencing applications, rates can be substantially lower.

In a parallel effort, international standardization organizations are actively developing standards for video compression, which facilitate interoperability among different video systems and motivate the development and production of VLSI systems and devices in addition to software solutions. Since standards reflect the state of the art, they serve as references for developing compatible extensions and novel algorithms.

One of the most recent video coding standards, ITU-T H.263, is discussed in section 2. To ensure robust video transmission, standard-compatible extensions are proposed in section 3. Potential tools for new standards, such as scalable video coding and region-based coding, are outlined in sections 4 and 5. 3D scene and motion estimation for model-based video coding schemes, as discussed in section 6, can be supported by structure-from-motion algorithms, finally described in section 7.
Many existing networks cannot provide a guaranteed quality of service. This may result from the underlying medium access control, as in IEEE 802.3-based Local Area Networks (Ethernet), or from the limitations of the transmission channel, e.g., in mobile environments where remaining errors cannot be avoided during fading periods. Networks with these limitations are characterized as "best effort" networks and require increased robustness for the transmission of video. This is especially true when motion-compensated prediction is utilized, which is essential for high coding efficiency but also causes spatio-temporal error propagation, i.e., the loss of information in one frame has considerable impact on the quality of the following frames.

Error propagation can be efficiently reduced when a feedback channel between transmitter and receiver is utilized. The system proposed in [10] tolerates errors, but limits their effect by error control techniques in the source codec. Error concealment is employed to hide visible distortion, and residual errors are compensated using acknowledgment information from the receiver. Compensation is achieved in a standard-compatible way by INTRA-coding those image regions that could not be concealed successfully.
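The feedback-driven compensation step can be illustrated with a minimal sketch. It assumes that the receiver reports the indices of macroblocks it could not conceal, and that the encoder simply switches those macroblocks to INTRA mode in the next frame. The function name `choose_modes` and the example indices are hypothetical, and the sketch deliberately ignores how errors spread to neighbouring macroblocks through motion compensation, which a real system such as the one in [10] would have to account for.

```python
from typing import Dict, Iterable, Set

INTRA = "INTRA"
INTER = "INTER"


def choose_modes(num_macroblocks: int, damaged: Iterable[int]) -> Dict[int, str]:
    """Pick a coding mode per macroblock for the next frame.

    Macroblocks reported as damaged by the receiver are coded in INTRA mode,
    which stops temporal error propagation; all others stay in INTER mode.
    Indices outside the frame are ignored.
    """
    damaged_set: Set[int] = {mb for mb in damaged if 0 <= mb < num_macroblocks}
    return {
        mb: (INTRA if mb in damaged_set else INTER)
        for mb in range(num_macroblocks)
    }


# A QCIF frame (176 x 144) holds 11 x 9 = 99 macroblocks of 16 x 16 pixels.
# The damaged indices below are made-up feedback for illustration only.
modes = choose_modes(99, damaged=[12, 13, 24])
intra_count = sum(1 for mode in modes.values() if mode == INTRA)
```

Because only the mode decision changes, the bitstream stays decodable by any standard decoder, which is what makes this kind of compensation standard-compatible.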
[Figure: comparison over time [s] of "Error Concealment Only" and "Error Compensation Based on Feedback Information"; the plot marks the point "Feedback Information Received".]
coders can be used for their efficient encoding, and multiscale motion compensation can be easily included [13]. Filters for spatial downsampling and interpolation operations can be kept very simple such that fast and efficient codecs become feasible. Since one is completely free in choosing appropriate filters, morphological filters can be used to further improve subjective image quality [14].

A pyramid coder is a multistage quantization scheme. Therefore, efficient compression requires careful bit allocation to the various quantizers depending on the image to code. Performing an optimal bit allocation for each frame becomes computationally infeasible in pyramids with more than two layers. A comparison between open-loop and closed-loop pyramid coders shows that closed-loop coders are better suited for practical applications, since they are less sensitive to suboptimal bit allocations, so that simpler heuristics can be used [8].

Multiscale motion compensation can be utilized in several ways [15]. Motion vectors can be efficiently computed and encoded by hierarchical motion estimation. All compensation schemes we have investigated so far perform very similarly under the condition of equal quantization noise in each spatial resolution layer [16].

A simulation model we set up for coding experiments is built upon a four-layer closed-loop pyramid coder combined with multiscale motion compensation [15]. For quantization, an E8-lattice vector quantizer is used [17]. Spatial and temporal resolution of the finest resolution layer correspond to the ITU-R 601-4 standard. In each subsequent layer, spatial and/or temporal resolution is reduced. With a single encoded bitstream, bit rates from 64 kbit/s up to 4 Mbit/s are supported. On a Sun SparcStation 5, real-time decoding with a software-only decoder is possible up to 300 kbit/s. Simulations have shown that real-time performance is roughly proportional to bit rate.
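The difference between open-loop and closed-loop pyramid coding can be made concrete with a small sketch on a one-dimensional signal. Everything here is a simplifying assumption for illustration: two layers instead of four, pair averaging and sample repetition as the filters, and a uniform scalar quantizer standing in for the E8-lattice vector quantizer. In the closed-loop variant the fine layer codes the residual against the quantized coarse layer that the decoder actually has, so coarse-layer quantization noise does not reach the output.

```python
from typing import List


def downsample(x: List[float]) -> List[float]:
    """Halve the resolution by averaging neighbouring pairs (even length assumed)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def upsample(x: List[float]) -> List[float]:
    """Double the resolution by repeating each sample."""
    return [v for v in x for _ in range(2)]


def quantize(x: List[float], step: float) -> List[float]:
    """Uniform scalar quantizer (a stand-in, not the quantizer used in the paper)."""
    return [step * round(v / step) for v in x]


def pyramid_codec(signal: List[float], coarse_step: float, fine_step: float,
                  closed_loop: bool) -> List[float]:
    """Encode and decode a two-layer pyramid; return the decoded signal."""
    coarse = downsample(signal)
    coarse_q = quantize(coarse, coarse_step)
    # Closed loop predicts from the quantized coarse layer (what the decoder sees);
    # open loop predicts from the unquantized coarse layer.
    reference = coarse_q if closed_loop else coarse
    prediction = upsample(reference)
    residual_q = quantize([s - p for s, p in zip(signal, prediction)], fine_step)
    # The decoder only ever has the quantized coarse layer and quantized residual.
    return [p + r for p, r in zip(upsample(coarse_q), residual_q)]


def max_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


signal = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0, 37.0, 45.0]
closed = pyramid_codec(signal, coarse_step=8.0, fine_step=1.0, closed_loop=True)
opened = pyramid_codec(signal, coarse_step=8.0, fine_step=1.0, closed_loop=False)
closed_err = max_error(signal, closed)
opened_err = max_error(signal, opened)
```

With a deliberately coarse first-layer quantizer, the closed-loop error stays within half the fine quantizer step, while the open-loop error inherits the coarse-layer quantization noise. This is one way to see why a poor bit allocation hurts a closed-loop coder less.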
[Plot: PSNR [dB] versus bit error rate (0, 10^-4, 10^-3, 5*10^-3, 10^-2) for the Flowergarden sequence; plot labels include "Equal Error Protection" and "decoding impossible".]
Figure 2. Scalable video coder combined with equal and unequal error protection
By adding an unequal forward error correction scheme to a scalable source coder, robustness against data loss can be noticeably increased. Fig. 2 shows simulation results for five different bit error rates on a binary symmetric channel for the Flowergarden test sequence. Unequal error protection is compared against equal error protection, where all layers are protected at the same level. In both cases the increase in bit rate due to error protection is 26%. From Fig. 2 it can be seen that scalability combined with unequal error protection leads to a gracefully degrading coding scheme.
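The effect of shifting protection towards the important layers can be illustrated with a toy calculation. The sketch below is an assumption-laden stand-in, not the scheme used for Fig. 2: it uses a simple repetition code with majority decoding (far more redundant than the 26% overhead quoted above), two layers of equal size, and made-up repetition factors. It only shows the principle that, for the same total redundancy, unequal protection lowers the residual error rate of the base layer at the expense of the enhancement layer.

```python
from math import comb


def residual_ber(p: float, n: int) -> float:
    """Bit error probability after majority decoding of an n-fold repetition
    code (n odd) on a binary symmetric channel with crossover probability p."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    first_failing = (n + 1) // 2  # this many flipped copies defeat the majority
    return sum(
        comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(first_failing, n + 1)
    )


channel_ber = 1e-2

# Hypothetical repetition factors; both schemes send 6 channel bits per pair of
# (base, enhancement) source bits, so the total redundancy is the same.
equal = {"base": 3, "enhancement": 3}
unequal = {"base": 5, "enhancement": 1}

equal_ber = {layer: residual_ber(channel_ber, n) for layer, n in equal.items()}
unequal_ber = {layer: residual_ber(channel_ber, n) for layer, n in unequal.items()}
```

Under these assumptions the base layer survives at bit error rates where equally protected data would already be badly corrupted, while the enhancement layer degrades first. That ordering is what produces graceful degradation in a scalable coder.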
Scalability can also be utilized for transmission over heterogeneous packet networks, since it offers an easy way to reduce the bit rate of transmitted video data in case of congestion [18]. By giving each packet a priority, the network layer itself is able to reduce the bit rate without informing the coder or knowing the content of a packet.

In current standards [3, 4, 5, 6], motion compensation and encoding of the prediction error are block-based. However, the emerging MPEG-4 standard is supposed to provide content-based functionalities in addition to efficient video coding [19]. To fulfill this task, a source model is needed that adapts better to the scene contents than blocks. In so-called object-based image coding techniques [20], the partition of images into a fixed block structure is replaced by arbitrarily shaped regions that are related either to texture/color [21] or to motion [22]. The shape of the regions has to be transmitted as additional side information. The efficiency of such a region-based coding scheme depends very much on properly adjusting the number of bits used for region coding.
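The trade-off between bits spent on region shape and the prediction gain they buy is commonly written as a Lagrangian cost J = D + lambda * R, where D is the distortion, R the rate, and lambda the weight given to rate. The following minimal sketch picks, among a few candidate region descriptions, the one with the lowest cost. The candidates, their distortion and rate values, and the names `Candidate` and `best_candidate` are invented for illustration and are not taken from [22].

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    distortion: float  # e.g. sum of squared prediction errors
    rate_bits: float   # bits for motion parameters plus region shape


def lagrangian_cost(c: Candidate, lam: float) -> float:
    """J = D + lambda * R for one candidate."""
    return c.distortion + lam * c.rate_bits


def best_candidate(candidates: List[Candidate], lam: float) -> Candidate:
    """Return the candidate that minimizes the Lagrangian cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c, lam))


# Made-up numbers: a finer contour lowers distortion but costs more bits.
candidates = [
    Candidate("single region, no contour", distortion=900.0, rate_bits=20.0),
    Candidate("coarse contour", distortion=500.0, rate_bits=60.0),
    Candidate("detailed contour", distortion=420.0, rate_bits=160.0),
]

cheap_rate_choice = best_candidate(candidates, lam=0.1)
balanced_choice = best_candidate(candidates, lam=1.0)
costly_rate_choice = best_candidate(candidates, lam=20.0)
```

A small lambda favours the detailed contour, a large lambda falls back to no contour at all, and intermediate values select the coarse contour. Such a cost-optimal shape need not follow the true object boundary.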
A solution for the optimum trade-off by applying rate-distortion theory has been presented recently for region-based motion compensation [22]. The regions are optimized with respect to a Lagrangian cost function by variation of the region contours. The resulting optimal contours do not necessarily coincide with the actual contours of the objects in the scene. However, for the optimized regions, the improvement in distortion and the region's rate are well balanced in a rate-distortion sense. The improvement that has been achieved in [22] with a coding scheme using optimized regions is shown in Fig. 3. It can be seen that motion compensation with rate-distortion optimized regions is about 2 dB better than block-based motion compensation. In both cases a block-based DCT coder is used for coding the prediction error image.

[Plot: PSNR [dB] (axis ticks 32 to 38) versus frame number (0 to 50).]

Figure 3. PSNR for the decoded QCIF sequence "Miss America" at 14 kbit/s, 8.33 Hz. '- -': block-based motion compensation, 'oo': region-based motion compensation

A very promising technique for achieving very low data rates in video coding is model-based coding. It has been reported [23] that rates of less than 1000 bit/s are possible without reducing the resolution of the image. In a model-based system the coder analyzes the scene using three-dimensional models of the objects. At the decoder the scene is synthesized with respect to the same models using parameters received from the coder. An important application for model-based coding is video telephony. The structure of such a system is shown in Fig. 4.

[Block diagram for Fig. 4. Coder: Video Sequence (Speech) -> Analysis: Extraction of Facial Expressions -> Parameter Coding (Shape / Texture, Motion Parameters). Decoder: Parameter Decoding -> Synthesis: Rendering of Head Model -> Video. Both sides use the Model: Shape, Texture, Dynamic Behavior.]
A natural test scene consists of two frames of the sequence Flowergarden. Fig. 5 shows the estimated depth map. White image regions have small depth values; black blocks represent scene parts far away.
8. CONCLUSIONS
REFERENCES
[8] K. Ramchandran, A. Ortega, and M. Vetterli. Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders. IEEE Trans. on Image Processing, 3(5):533-545, Sep. 1994.
[9] T. Wiegand, M. Lightstone, T.G. Campbell, D. Mukherjee, and S.K. Mitra. Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard. Submitted for publication to IEEE Trans. on Circuits and Systems for Video Technology.
[10] N. Farber, E. Steinbach, and B. Girod. Robust H.263 compatible video transmission over wireless channels. In Proc. PCS, Mar. 1996.
[11] B. Girod. Scalable video for multimedia systems. Computers & Graphics, 17(3):269-276, 1993.
[12] M.K. Uz, M. Vetterli, and D.J. LeGall. Interpolative multiresolution coding of advanced television with compatible subchannels. IEEE Trans. on Circuits and Systems for Video Technology, 1(1):86-99, Mar. 1991.
[13] M. Vetterli and K.M. Uz. Multiresolution coding techniques for digital television: A review. Multidimensional Systems and Signal Processing, 3:161-187, 1992.
[14] R. Bernstein and U. Horn. Linear and morphological pyramids for scalable image coding - A comparison. Submitted to 1996 International Conference on Image Processing, 1996.
[15] U. Horn, B. Girod, and B. Belzer. Scalable video coding with multiscale motion compensation and unequal error protection. In Proc. International Symposium on Multimedia Communications and Video Coding, New York, NY, Oct. 1995.
[16] U. Horn and B. Girod. Performance analysis of multiscale motion compensation techniques in pyramid coders. Submitted to 1996 International Conference on Image Processing, 1996.
[17] B. Girod, F. Hartung, and U. Horn. Subband image coding. In Ali N. Akansu and Mark J.T. Smith, editors, Subband and Wavelet Transforms, chapter 7. Kluwer Academic Publishers, Norwell, MA, 1995.
[18] U. Horn, B. Girod, and B. Belzer. Scalable video coding for multimedia applications and robust transmission over wireless channels. 7th International Workshop on Packet Video, Mar. 1996.
[19] R. Schafer and T. Sikora. Digital video coding standards and their role in video communications. Proc. IEEE, 83(6):907-924, Jun. 1995.
[20] H.G. Musmann, M. Hotter, and J. Ostermann. Object-oriented analysis-synthesis coding of moving images. Signal Processing: Image Communication, 1(2):117-138, Oct. 1989.
[21] P. Salembier, L. Torres, F. Meyer, and C. Gu. Region-based video coding using mathematical morphology. Proc. IEEE, 83(6):843-857, Jun. 1995.
[22] K. W. Stuhlmuller, A. Salai, and B. Girod. Rate-constrained contour-representation for region-based motion compensation. In Proc. Symp. on Visual Comm. and Image Proc. SPIE, Mar. 1996.
[23] B. Girod. Image sequence coding using 3D scene models. SPIE Symposium on Visual Communications and Image Processing 94, Sept. 1994.
[24] M. Hoch, G. Fleischmann, and B. Girod. Modeling and animation of facial expressions based on B-splines. The Visual Computer, pages 87-95, Nov. 1994.
[25] I. A. Essa and A. Pentland. A vision system for observing and extracting facial action parameters. IEEE Conference on Computer Vision and Pattern Recognition, Jun. 1994.
[26] E. Steinbach and B. Girod. Estimation of rigid body motion and scene structure from image sequences using a novel epipolar transform. IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP, May 1996.