An Embedded Merging Scheme for H.264/AVC Motion Estimation: Chuan-Yu Cho, Shiung-Yang Huang and Jia-Shung Wang
The emerging H.264/AVC video coding standard achieves a 50% performance improvement over H.263. One key factor in this success is the use of a variable block size matching algorithm in the motion compensation stage. However, its high computational complexity makes the codec too complex to implement in some real-time applications. In addition, the large memory needed to store intermediate results limits its use. In this paper, we propose an embedded merging scheme for the fast full search algorithm of H.264/AVC. The proposed scheme not only reduces the computational complexity but is also better suited to implementation in VLSI hardware.
1. INTRODUCTION
The emerging multimedia standard H.264/AVC emphasizes a back-to-basics concept. It introduces several coding features to improve coding performance. For example, in the motion compensation stage it uses the variable block size block matching algorithm, multiple reference frames, and sub-pixel motion vectors with an accuracy of up to 1/8 pixel. Moreover, 4x4 DCT coding and several new prediction modes for intra and inter blocks were employed to gain extra benefits. All of these coding ideas contribute differently, but they also add computational complexity. Motion compensation in H.264/AVC has a larger computational overhead than in older standards such as H.261, H.263 and MPEG-1/2 because it has to deal with seven different macroblock motion types (Fig. 1). Fortunately, the well-structured relation between the seven macroblock modes allows us to develop a merging scheme that accelerates motion estimation. In software solutions such as TML 9, the merging scheme is realized by a full search on the 4x4 partition (mode 7), whose results are then added up to obtain the remaining modes 1-6. When this merging scheme is implemented on a PC platform, there is no need to worry about a lack of memory for temporarily storing the comparison results. However, in many other applications memory may be a limited resource, so the merging scheme should be redesigned to avoid unnecessary waste. In this paper, we first describe the motion estimation process in H.264/AVC. Then an embedded merging scheme with a pipeline architecture is proposed. We also suggest a possible hardware architecture, based on a survey of existing full search VLSI architectures, to realize the proposed scheme.
[Fig. 1: the seven macroblock modes, Mode 1 to Mode 7]
This research is supported in part by the National Science Council, Taiwan, R.O.C., under Contract NSC-91-2213-E407462 and by the Program for Promoting Academic Excellence of Universities 89-EFA04-1-4.
The motion estimation algorithm tries to find the best matching motion vector for the macroblock currently being coded, and then makes a macroblock mode decision based on the given R-D constraints. Usually, the criterion for finding the best matching block is the MSE (Mean Square Error) between the searched candidate blocks and the current macroblock. To reduce the computational complexity in real-world applications, the MSE is usually replaced by the SAD (Sum of Absolute Differences).
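The difference between the two matching criteria can be illustrated with a minimal Python sketch. The function names and the 2x2 sample blocks are assumptions made for illustration only; they do not come from the reference software.

```python
def mse(current, candidate):
    """Mean square error between two equally sized blocks (lists of rows)."""
    n = len(current) * len(current[0])
    total = sum((a - b) ** 2
                for row_a, row_b in zip(current, candidate)
                for a, b in zip(row_a, row_b))
    return total / n


def sad(current, candidate):
    """Sum of absolute differences: subtract, abs and add, no multiplication."""
    return sum(abs(a - b)
               for row_a, row_b in zip(current, candidate)
               for a, b in zip(row_a, row_b))


# Hypothetical 2x2 luma blocks, values chosen only for illustration.
cur_block = [[10, 12], [14, 16]]
cand_block = [[11, 12], [10, 16]]
print(mse(cur_block, cand_block), sad(cur_block, cand_block))
```

The SAD avoids the per-pixel multiplication and the final division, which is why it is the cheaper criterion in hardware.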
The motion estimation algorithm of H.264/AVC can be divided into several stages. For each block or macroblock, the motion vector is determined by a full search on integer-pixel positions followed by sub-pixel refinement. The flow for making a macroblock mode decision is as follows:
1. Integer-pixel search with seven macroblock modes.
2. For each macroblock mode, perform the fractional-pixel search to refine the motion vector precision.
3. Determine the best motion vector over all of the reference frames.
4. Select the best macroblock mode and reference frame based on the R-D optimization equation.
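Step 1 of the flow above can be sketched for a single block as follows. This is a minimal illustration, assuming frames stored as lists of rows and using plain SAD as the cost; the names `block_sad` and `full_search`, the frame contents and the search range are all hypothetical.

```python
def block_sad(cur, ref, bx, by, dx, dy, size):
    """SAD between the size x size block of `cur` at (bx, by) and the block
    of `ref` displaced by the candidate motion vector (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(size) for x in range(size))


def full_search(cur, ref, bx, by, size, search_range):
    """Return (best_sad, (dx, dy)) over every integer displacement whose
    candidate block lies completely inside the reference frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue  # candidate falls outside the frame
            cost = block_sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Synthetic 8x8 reference frame; the current frame is the reference
# shifted by (2, 1), clamped at the border (illustrative data only).
ref_frame = [[(7 * x + 13 * y) % 31 for x in range(8)] for y in range(8)]
cur_frame = [[ref_frame[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)]
             for y in range(8)]
print(full_search(cur_frame, ref_frame, 2, 2, 4, 2))
```

The sketch covers only the integer-pixel search for one block size; the fractional-pixel refinement and the mode decision of steps 2 to 4 are not modelled.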
In the JVT Joint Model Number 1 [1], the motion estimation returns the motion vector that minimizes

$$J(\mathbf{m}, \lambda_{motion}) = SAD(s, c(\mathbf{m})) + \lambda_{motion} \cdot R(\mathbf{m} - \mathbf{p}) \qquad (1)$$

with $\mathbf{m} = (m_x, m_y)^T$ being the motion vector, $\mathbf{p} = (p_x, p_y)^T$ being the prediction for the motion vector, and $\lambda_{motion}$ being the Lagrange multiplier. The rate term $R(\mathbf{m} - \mathbf{p})$ represents the motion information only. For the integer-pixel search, the SAD is computed as

$$SAD(s, c(\mathbf{m})) = \sum_{x=1}^{B} \sum_{y=1}^{B} \left| s[x, y] - c[x - m_x, y - m_y] \right| \qquad (2)$$

with s being the original video signal and c being the coded video signal. In conventional block matching algorithms, the SAD is computed only for the 16x16 macroblock or four 8x8 sub-blocks. Computing the SAD in (2) for all seven modes requires about seven times the computational complexity. An intuitive way to reduce the complexity is to compute the SAD using a merging scheme, as in Fig. 2. Since the different macroblock modes overlap in the spatial domain, the SAD can be calculated only for the 4x4 block mode, followed by a sequence of merging steps that gather the results of mode 1 to mode 6. In Fig. 2, the labels in the squares correspond to the names of the sub-blocks in Fig. 1, and the links between two squares define the merging relations, where the squares on the right side of the links are derived from those on the left side. This idea has been realized in the JM and TML reference software, where it saves a large amount of computation.

There is still a problem in using the merging scheme of Fig. 2 to implement codec hardware. In traditional ASIC full search architectures, such as [2-8], the SAD results of the candidate blocks in the search area are output one result per cycle. Thus, to do the merging with the traditional search architectures, a large memory for storing the temporary merging results would be needed. Because of the dependencies between the merging stages, the merging work takes at least five cycles to complete. Moreover, in order to find the best motion vector and distortion pair in each macroblock mode, we also need many R-D comparators implementing (1) to locate the best motion vector.
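The merging idea described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the merge order of Fig. 2: the sixteen 4x4 SADs are assumed to be held in a raster-ordered 4x4 grid, the mode numbering (1: 16x16, 2: 16x8, 3: 8x16, 4: 8x8, 5: 8x4, 6: 4x8) follows the usual reference software convention, and the name `merge_sads` is hypothetical.

```python
def merge_sads(s):
    """Derive the SADs of modes 1-6 from a 4x4 grid `s` of 4x4-block SADs.

    `s[r][c]` is the SAD of the 4x4 block in row r, column c of the
    macroblock (raster layout assumed for this sketch)."""
    # Mode 5 (8x4): two horizontally adjacent 4x4 blocks, 8 partitions.
    m5 = [[s[r][2 * c] + s[r][2 * c + 1] for c in range(2)] for r in range(4)]
    # Mode 6 (4x8): two vertically adjacent 4x4 blocks, 8 partitions.
    m6 = [[s[2 * r][c] + s[2 * r + 1][c] for c in range(4)] for r in range(2)]
    # Mode 4 (8x8): two stacked 8x4 partitions, 4 partitions.
    m4 = [[m5[2 * r][c] + m5[2 * r + 1][c] for c in range(2)] for r in range(2)]
    # Mode 2 (16x8): two side-by-side 8x8 partitions, 2 partitions.
    m2 = [m4[r][0] + m4[r][1] for r in range(2)]
    # Mode 3 (8x16): two stacked 8x8 partitions, 2 partitions.
    m3 = [m4[0][c] + m4[1][c] for c in range(2)]
    # Mode 1 (16x16): the whole macroblock.
    m1 = m2[0] + m2[1]
    return {1: m1, 2: m2, 3: m3, 4: m4, 5: m5, 6: m6}


# Illustrative 4x4-block SADs for one candidate motion vector.
sad_4x4 = [[r * 4 + c for c in range(4)] for r in range(4)]
merged = merge_sads(sad_4x4)
print(merged[1], merged[2], merged[3])
```

In this sketch each candidate position costs 25 additions of partial sums on top of the 256 absolute differences of the 4x4 search, whereas evaluating the seven modes independently would visit each of the 256 pixels seven times (1792 absolute differences).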
The non-preemptive merging order and the dependencies make it infeasible to apply the merging scheme in traditional ASICs with a parallel scheme. Hence, our design arranges the merging scheme in a pipeline structure to save the required memory and comparator hardware.
3. SURVEY OF EXISTING MOTION ESTIMATION ARCHITECTURES
In this section, we survey the literature on existing fast full search block matching VLSI architectures. All of the surveyed architectures are designed for a full search algorithm. Moreover, since H.264/AVC is the latest standard, the existing block matching architectures were not designed for its variable block size search conditions. None of them is directly suitable as a block matching engine for H.264/AVC. For example, performing the block matching algorithm seven times requires either a longer time or more memory, whether the original architectures are parallelized or cascaded.
Several VLSI architectures for fast full search algorithms have been proposed in the past decades. In general, those methods can be grouped into three types by hardware structure: 1-D array, 2-D array and hierarchical style. For example, a well-known 1-D array design is the architecture proposed by Yang, K.-M., Sun, M.-T. and Wu, L. [2]. They proposed to broadcast the search area data through two memory ports and to use one memory port to feed the current block data into the architecture in a pipelined fashion. The architecture offers both low hardware cost and full PE utilization; however, its data reusability is much poorer than that of later architectures. For instance, Yeong-Kang Lai and Liang-Gee Chen proposed an improved version based on a 1-D PE array and two data-interlacing shift-register arrays [3]. It reuses data more efficiently to decrease external memory accesses. Komarek proposed a 2-D array architecture in [4]. The idea is to store the current block data in a 2-D PE array. For a block size of NxN, the architecture needs N ports to schedule the search area data flow, but only one port for the data flow of the current block. After that, Hsieh [5] designed a refined architecture with shift registers to improve Komarek's architecture. The number of ports scheduling the search area data is decreased to one, but the hardware utilization gets worse when switching between two consecutive blocks. Therefore, Yuan-Hau Yeh and Chen-Yi Lee [6] separated the search area data into LS and RS parts, and employed two ports to schedule the current block data flow into the PE array. This eliminates the idle cycles caused by switching between two neighboring blocks. To further reduce computational power consumption, Do, V.L. and Yun, K.Y. [7] added a distortion approximation unit before the computation of the exact distortion of each block. Yeu-Shen Jehng, Liang-Gee Chen and Tzi-Dar Chiueh [8] presented a hierarchical tree structure of binary adders to perform the block matching algorithm.
A binary adder tree is used to calculate the block SAD. Each tree level can be viewed as a parallel pipeline stage, which further shortens the computational data path. However, because data reusability is not taken into consideration, this design demands a very large memory bandwidth.
The three types of search architectures are designed only for the single block size full search algorithm of traditional video codecs. To apply them to H.264/AVC with the merging scheme in Fig. 2, the search architectures have to be operated in 4x4 block mode. This introduces more data redundancy in the search area; therefore, the architecture that best reduces the number of search area data accesses is the first candidate. Among the surveyed architectures, that of Lai and Chen is the most suitable because of its high performance and its ease of handling 4x4 block size matching. Hence, the proposed merging scheme uses it as the pre-stage for generating the 4x4 SAD results, and then merges the results for mode 1 to mode 6.
[Figure legend: Merging part; Comparator & latch for min(MV, Dist); Latch for min(MV, Dist)]
Based on the discussion in the previous sections, the proposed architecture can be described in two parts. The first part gathers the sixteen 4x4 block SADs of each macroblock and finds their best candidate motion vectors with the R-D optimization comparator using equation (1). The second part adds up the SADs of the first part to form the SAD results of mode 1 to mode 6 according to the structure in Fig. 2. The architecture proposed by Lai and Chen in [3] is selected as the SAD computing engine for each 4x4 block. The architecture can perform 4x4 block motion estimation and output the SAD results in order from block 0 to 15. To get the ordering expected in Fig. 2, we can use some latches as delay components to reorder the outputs. The merging scheme is designed as in Fig. 3.
Consider first the right half of Fig. 3. Suppose the first 4x4 block is generated at clock I. As shown in Fig. 3, all of the 4x4 blocks can share the same R-D comparator to look for the best pair of motion vector and distortion. After the comparison for a 4x4 block, the better result is kept in the latch chain placed at the rightmost side, and the same block is compared again after 16 cycles. When all candidate blocks in the search area have been compared, the final best motion vector and distortion value are sent to an R-D optimization mode selection module to determine the best macroblock mode.
Second, the adding-up module for the results of the other modes is located on the left side of Fig. 3. It is activated every 4 clock cycles because the merging procedure can be done only when four 4x4 blocks are ready. As plotted in Fig. 3, each red square consists of a built-in comparator and a latch for doing the R-D selection and saving the best motion result. The earliest time at which each merging module can start is shown by the clock indices beside the blocks. Moreover, since the merging work has a delay of four cycles, the clock indices of blocks at the same level in Fig. 2 are not the same; that is, the merging at the next level must wait until the last merging module of the current level is completed. For instance, if block 0c is scheduled to finish at the same time as block 3c, at clock I+17, the total time of the whole merging architecture is maintained, i.e., 0f can still be finished at clock I+19. Furthermore, in order to raise the comparator utilization in the left side of Fig. 3, the 25 R-D comparator squares, 0 to 0f, should be reconfigured. In fact, the 25 comparing components can easily be reduced to only 4 R-D comparators by adding some latches and a little control logic. This is because the four idle clock cycles already provide enough time to separate the operations of all blocks.
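The shared comparator and 16-entry latch chain on the right half of Fig. 3, as described above, can be modelled in software roughly as follows. This is a behavioural sketch only, assuming one (motion vector, SAD) result per clock in block order 0 to 15; the names `shared_comparator`, `demo_sad` and the rate function are hypothetical stand-ins, and the rate term is a placeholder for the second term of equation (1).

```python
from collections import deque


def shared_comparator(stream, rate_cost, n_blocks=16):
    """Simulate one R-D comparator shared by `n_blocks` 4x4 blocks.

    `stream` yields (mv, sad) once per clock, cycling over blocks
    0..n_blocks-1 for each candidate motion vector. `rate_cost(mv)` is a
    hypothetical stand-in for the rate term of equation (1)."""
    # The latch chain: the entry at the head belongs to the block whose
    # result arrives on the current clock.
    chain = deque([None] * n_blocks)
    for mv, sad in stream:
        best = chain.popleft()          # stored best for this block
        cost = sad + rate_cost(mv)
        if best is None or cost < best[0]:
            best = (cost, mv)
        chain.append(best)              # revisited n_blocks clocks later
    return list(chain)


def demo_sad(mv, block):
    """Synthetic SAD used only to drive the demonstration."""
    return abs(block - 5 * mv[0] - 9 * mv[1])


candidates = [(0, 0), (1, 0), (0, 1)]
stream = [(mv, demo_sad(mv, b)) for mv in candidates for b in range(16)]
best = shared_comparator(stream, lambda mv: abs(mv[0]) + abs(mv[1]))
print(best[5], best[9])
```

Because each block's stored result returns to the head of the chain exactly 16 clocks after it was written, a single comparator serves all sixteen 4x4 blocks without any addressable memory.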
A simple example of the possible connections using only four comparators and some sets of latches is given in Fig. 4. Only four comparators are needed, and two selectors are employed to distribute the latch block data to the matching comparing units. The additional hardware for the improved architecture is only five new comparators and 41 latches to store the intermediate comparison results.
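The figures of 25 comparing components and 41 latches are consistent with a simple count of partitions per macroblock mode. The sketch below performs that count; reading the 41 latches as one latch per partition over all seven modes is an interpretation, not something the text states explicitly, and `MODE_SHAPES` and `partitions` are hypothetical names.

```python
# (width, height) of one partition in each macroblock mode; the mode
# numbering is assumed to follow the reference software convention.
MODE_SHAPES = {1: (16, 16), 2: (16, 8), 3: (8, 16), 4: (8, 8),
               5: (8, 4), 6: (4, 8), 7: (4, 4)}


def partitions(mode):
    """Number of partitions of the given mode in one 16x16 macroblock."""
    width, height = MODE_SHAPES[mode]
    return (16 // width) * (16 // height)


merged_blocks = sum(partitions(m) for m in range(1, 7))  # modes 1-6
all_blocks = merged_blocks + partitions(7)               # plus the 4x4 mode
print(merged_blocks, all_blocks)
```

Under this reading, the 25 merged partitions of modes 1 to 6 correspond to the 25 R-D comparator squares, and adding the sixteen 4x4 blocks gives the 41 stored best results.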
5. CONCLUSION
In this paper, we extend the conventional VLSI full search architecture with an embedded merging scheme so that the new architecture can perform the variable block size block matching of H.264/AVC. The proposed scheme uses only five additional comparators and 41 intermediate latches, and can be integrated easily with a traditional full search architecture. Using 4x4 blocks instead of 16x16 ones in motion estimation introduces additional search area data overlap, but the total complexity is still far smaller than that of performing the search seven times.

[1] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Joint Model Number 1, Revision 1 (JM-1r1), ITU-T SG16 Q.6 (VCEG) and ISO/IEC JTC 1/SC 29/WG 11 (MPEG), JVT-A003r1.doc, 2002-01-18.
[2] Yang, K.-M., Sun, M.-T. and Wu, L., A family of VLSI designs for the motion compensation block-matching algorithm, Circuits and Systems, IEEE Transactions on, Vol. 36, Issue 10, pp. 1317-1325, Oct 1989.
[3] Yeong-Kang Lai and Liang-Gee Chen, A Data-Interlacing Architecture with Two-Dimensional Data-Reuse for Full-Search Block-Matching Algorithm, Circuits and Systems, IEEE Transactions on, Vol. 8, No. 2, pp. 124-127, April 1998.
[4] Komarek, T. and Pirsch, P., Array architectures for block matching algorithms, Circuits and Systems for Video Technology, IEEE Transactions on, Vol. 36, Issue 10, pp. 1301-1308, Oct 1989.
[5] Hsieh, C.-H. and Lin, T.-P., VLSI architecture for block-matching motion estimation algorithm, Circuits and Systems for Video Technology, IEEE Transactions on, Vol. 2, Issue 2, pp. 169-175, Jun 1992.
[6] Yuan-Hau Yeh and Chen-Yi Lee, Scalable VLSI architectures for full-search block matching algorithms, Image Processing, 1996. Proceedings., International Conference on, Vol. 1, pp. 1035-1038, 16-19 Sep 1996.
[7] Do, V.L. and Yun, K.Y., A low-power VLSI architecture for full-search block-matching motion estimation, Circuits and Systems for Video Technology, IEEE Transactions on, Vol. 8, Issue 4, pp. 393-398, Aug 1998.
[8] Yeu-Shen Jehng, Liang-Gee Chen and Tzi-Dar Chiueh, An efficient and simple VLSI tree architecture for motion estimation algorithms, Signal Processing, IEEE Transactions on, Vol. 41, Issue 2, pp. 889-900, Feb 1993.