A Dynamic Search Range Algorithm for H.264/AVC
Full-Search Motion Estimation
Yuan-Teng Chang and Wen-Hao Chung
Information & Communications Research Laboratories
Industrial Technology Research Institute, Hsinchu, Taiwan
[email protected]

Abstract: Motion estimation plays an important role in
inter-frame prediction in video coding standards such as
H.264/AVC, MPEG-2, MPEG-4, and VC-1. Its high computational
complexity, however, makes real-time coding of HDTV1080p
video difficult. In this paper, we propose a dynamic search
range algorithm that removes about 80% of the search points
of the full-search algorithm for H.264/AVC. In addition, we
design the corresponding VLSI architecture for integer motion
estimation. The proposed integer motion estimation achieves
real-time coding of 30fps HDTV1080p video when operating at
166 MHz.

Keywords: H.264/AVC, Integer Motion Estimation, Dynamic
Search Range
I. INTRODUCTION
H.264/AVC is a video coding standard developed by the
ITU-T Video Coding Experts Group and the ISO/IEC Moving
Picture Experts Group [1]. It has been widely adopted in
various applications, such as Blu-ray Disc, IPTV, and HDTV
broadcasting. H.264/AVC provides better image quality and a
higher compression ratio by adopting many new features, such
as variable-block-size motion estimation, 1/4-pel fractional
motion estimation, multiple reference frames, a de-blocking
filter, and context-adaptive binary arithmetic coding.
To reduce temporal redundancy, motion estimation finds, for
each macroblock (MB) of an inter frame, the best-matching MB
among the reference frames. H.264/AVC defines four MB modes:
16x16, 16x8, 8x16, and 8x8. Each partition of the 8x8 mode
can be further divided into one of four sub-macroblock modes:
8x8, 8x4, 4x8, and 4x4. An HD1080p frame contains 8160 MBs,
so 8160x(3+4x4) modes must be estimated per frame. This
results in a huge computation time and makes real-time coding
difficult on a RISC or DSP implementation. Therefore, to
achieve HD video coding, many hardware accelerators for
integer motion estimation (IME) have been proposed [2-3].
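The workload figure above follows from simple counting. The sketch below is only an illustration; it assumes the 1080p frame is padded to 1920x1088 so that it divides into whole 16x16 macroblocks, and the function names are hypothetical rather than taken from the paper.

```python
def macroblocks_per_frame(width=1920, height=1088, mb_size=16):
    """Number of 16x16 macroblocks in a frame whose size is a multiple of 16."""
    return (width // mb_size) * (height // mb_size)


def modes_per_macroblock():
    """3 large MB modes (16x16, 16x8, 8x16) plus 4 sub-macroblock modes
    for each of the four 8x8 partitions, as counted in the text."""
    return 3 + 4 * 4


if __name__ == "__main__":
    mbs = macroblocks_per_frame()
    print(mbs, modes_per_macroblock(), mbs * modes_per_macroblock())
```

With these assumptions a frame holds 120 x 68 macroblocks, each with 19 modes to estimate.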
Chen adopts the full search (FS) algorithm and designs
parallel SAD trees to estimate several search points
simultaneously [2]. In addition, variable-block-size motion
estimation (VBSME) is employed, which makes it possible to
process all the MB modes in parallel. Chen's design, however,
only supports HD720p real-time video coding. Afterward, Liu
proposes an IME architecture to achieve HD1080p real-time
coding [3]. Several algorithm-level optimizations are provided:
elimination of inter modes 4x4, 4x8, and 8x4, low-pass-filter
based 4:1 down-sampling, and coarse-to-fine search.
The FS algorithm is well suited to hardware realization
because it provides regular search patterns and makes
reference pixel data reuse easy. The IME checks all points
in an assigned search range. The search range is usually set
to 64 or more when an HD video is coded, and thus the IME
needs to check at least $(64 \times 2 + 1)^2 = 16641$ points.
To avoid redundant search in FS, Minocha proposes a dynamic
search range (DSR) algorithm [4]. It exploits the temporal
correlation of motion vectors in successive frames to predict
the search range. Afterward, Saponara takes both the temporal
and spatial correlation of motion vectors into account to
predict the search range [5]. However, neither DSR algorithm
is able to reduce search points effectively.
In this paper, we propose a new DSR algorithm to reduce
search points and embed it in our IME. It determines the
search range according to the maximum motion vector and the
average SAD of the previous frame, as well as the motion
vectors and SADs of neighbouring blocks. The DSR algorithm
brings the further advantage of reducing the internal and
external memory bandwidth needed to fetch reference pixels.
The resultant IME can achieve real-time coding of HD1080p
video sequences when operating at 166 MHz, with a slight PSNR
loss and bit-rate increase.
The rest of this paper is organized as follows. Section II
presents the proposed dynamic search range algorithm.
Section III presents the proposed integer motion estimation
architecture. Implementation results and comparisons are
shown in Section IV. Finally, Section V concludes the paper.
II. DYNAMIC SEARCH RANGE ALGORITHM
A. The Proposed Algorithm
The proposed DSR algorithm is described below:
Step 1. Calculate the maximum motion vector (max. $MV_{k-1}$)
and the average sum of absolute differences per MB
(avg. $SAD_{k-1}$) in the previous frame $k-1$.

D B C
A E

Fig. 1 The neighbouring blocks of the current block E
978-1-4244-7456-1/10/$26.00 © 2010 IEEE

Step 2. Calculate the search range of each block in the
current frame k by applying the following rule. Fig. 1 shows
the neighbouring blocks of a current block E. Note that
block C is replaced with block D if block C is not
available.

if (block B and block C are not available)
    search range = max. MV_{k-1}
else if (max. SAD > 1.75 * avg. SAD_{k-1})
    search range = max. search range
else
    search range = min(2 * max. MV, max. search range)

where
    max. SAD = max(SAD_A, SAD_B, SAD_C), and
    max. MV = max(MVX_A, MVX_B, MVX_C, MVY_A, MVY_B, MVY_C)

The concept of this algorithm is to exploit both the
temporal and spatial correlation of motion vectors to predict
the search range of a block. The initial search range of the
current frame is set to the maximum motion vector of the
previous frame, since motion usually changes gradually,
especially in low-motion scenes. A sudden large movement of
certain objects will, however, cause a wrong prediction of
the search range if only the temporal correlation is
considered. Therefore, by taking the spatial correlation into
account, the maximum motion vector of the neighbouring blocks
is also a factor in determining the search range. To avoid
being trapped in a local minimum, we set the average SAD per
MB in the previous frame as the threshold of the current
frame. Once the max. SAD among the neighbouring blocks is
greater than 1.75 times this threshold, the search range
returns to the maximum.
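The rule in Step 2 can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the authors' implementation: the `Neighbour` container and the function name are hypothetical, unavailable neighbours are simply skipped, and motion vector components are compared by magnitude, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Neighbour:
    """Result of an already coded neighbouring block (hypothetical container)."""
    mvx: int
    mvy: int
    sad: int


def dynamic_search_range(a: Optional[Neighbour], b: Optional[Neighbour],
                         c: Optional[Neighbour], d: Optional[Neighbour],
                         max_mv_prev: int, avg_sad_prev: float,
                         max_search_range: int = 32) -> int:
    # Block C is replaced with block D if C is not available.
    if c is None:
        c = d
    # Without B and C, fall back to the temporal prediction from frame k-1.
    if b is None and c is None:
        return max_mv_prev
    # Assumption: neighbours that are still missing are ignored.
    available = [n for n in (a, b, c) if n is not None]
    max_sad = max(n.sad for n in available)
    # A poorly matched neighbour resets the range to the maximum.
    if max_sad > 1.75 * avg_sad_prev:
        return max_search_range
    # Assumption: motion vector components are compared by magnitude.
    max_mv = max(max(abs(n.mvx), abs(n.mvy)) for n in available)
    return min(2 * max_mv, max_search_range)
```

For example, with neighbours whose largest motion vector component is 3 and whose SADs stay below the threshold, the predicted search range is 6 instead of the maximum of 32.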


Fig. 2 (a) Three pipelined stages of motion estimation (b) Example of
prefetching reference pixels (c) Timing diagram of the pipelined motion
estimation
B. The Consideration of Hardware Realization
Although this algorithm is simple and well suited to
hardware realization, some slight modifications are required.
First, variable-block-size motion estimation (VBSME) [2] is
generally employed in IME to achieve higher throughput
because it can estimate all MB modes in parallel. For VBSME
and our DSR algorithm to coexist in the IME, the search range
must be decided at MB level instead of block level. That is,
all block modes in an MB use the same search range.
Second, motion estimation is divided into three separate
pipelined stages to improve the throughput, as shown in
Fig. 2 (a). Consequently, because of the data dependency, the
correct motion vector of the left MB cannot be acquired at
the IME stage until the FME stage is completed. To overcome
this dilemma, the motion vector of the best mode of the left
MB at the IME stage may be used.
TABLE I PERFORMANCE OF THE PROPOSED ALGORITHM COMPARED WITH
FULL-SEARCH MOTION ESTIMATION (CIF SEQUENCES)

QP  Sequence   PSNR change  Bit-rate    Search      Search points
               (dB)         change (%)  points/MB   reduction (%)
0   akiyo      -0.01         0.00        504        88.07
0   container   0.01         0.01        266        93.70
0   foreman    +0.02         0.01        883        79.10
0   mobile     -0.02        -0.04        481        88.62
0   stefan      0.00        -0.15       1344        68.19
12  akiyo       0            0           238        94.44
12  container  -0.01         0.05        279        93.40
12  foreman     0            0.05        906        78.56
12  mobile      0.02        -0.07        474        88.78
12  stefan      0.01         0.21       1414        66.53
24  akiyo       0            0           213        94.96
24  container  -0.01         0.09        127        97.00
24  foreman    -0.01         0.29        750        82.25
24  mobile      0            0.06        410        90.30
24  stefan     -0.01         0.13       2029        52.00
36  akiyo       0           -0.02        505        88.05
36  container   0            0.02        339        92.00
36  foreman    -0.05         1.36        652        84.57
36  mobile      0.01        -0.06        206        95.12
36  stefan     -0.02         0.59       1193        71.76
48  akiyo       0            0.05        717        83.03
48  container  -0.01         0.06        582        86.22
48  foreman    -0.17         2.00        933        77.92
48  mobile     -0.04        -0.26         91        97.85
48  stefan     -0.08         0.83       1048        75.20

However, to lower the influence of external memory latency,
the design prefetches the reference pixels and determines the
search range ahead of the IME stage, as shown in Fig. 2 (a).
For instance, if the search range of MB0 is set to 32, the
operations PREF0~2 fetch the reference pixels in the red
dashed rectangles 0~2 before IME0 is performed, as shown in
Fig. 2 (b)(c). Because there is a latency of three MBs
between IME and DSR, it is impossible to acquire the SAD and
motion vectors of the left MB. Therefore, we replace MB A
with MB D so that the DSR algorithm can be realized in
hardware.
C. Performance Analysis
Table I shows the performance of the proposed DSR algorithm
compared with the full search (FS) algorithm in JM15.1 [6].
The simulation uses five 30fps CIF sequences with an IPPPPPPP
structure (intra frame period 8), RDO off, maximum search
range 32, no intra MB in inter frames, 1 reference frame, and
CAVLC entropy coding under five different QPs: 0, 12, 24, 36,
and 48. Of these, akiyo and container are low-motion
sequences, whereas foreman, mobile, and stefan are
high-motion sequences. The search points per MB are
calculated as $(2 \times \text{search range} + 1)^2$. Thus,
there are $(2 \times 32 + 1)^2 = 4225$ search points per MB
for the FS algorithm with search range 32.
The proposed DSR algorithm needs far fewer search points than
FS at nearly the same coding quality. In Table I it reduces
the search points by between 52.00% and 97.85%, and the
largest penalties are a PSNR decrease of 0.17 dB and a
bit-rate increase of 2.00% (foreman, QP=48). In addition, as
the number of search points decreases, the DSR algorithm
lowers the demand on external memory bandwidth and the power
consumed by redundant search.
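The relation between search range, search points, and the reduction percentages in Table I can be checked with a short sketch. The function names are illustrative; the only inputs are the search-point formula and the table values quoted in the text.

```python
def search_points(search_range):
    """Search points per MB for a square window of the given range."""
    return (2 * search_range + 1) ** 2


def reduction_percent(points_per_mb, max_search_range):
    """Search point reduction relative to full search over the maximum range."""
    full = search_points(max_search_range)
    return 100.0 * (1.0 - points_per_mb / full)


if __name__ == "__main__":
    # akiyo at QP=0 in Table I: 504 search points per MB, maximum range 32.
    print(search_points(32), round(reduction_percent(504, 32), 2))
```

Running it on the akiyo row at QP=0 reproduces the 88.07% listed in Table I.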
III. INTEGER MOTION ESTIMATION ARCHITECTURE
Fig. 3 depicts the proposed IME architecture, which is
mainly composed of a dynamic search range generator, a
reference pixel prefetch unit, two SAD trees, and three SRAM
modules. The dynamic search range generator generates a
predicted search range for each MB according to the proposed
algorithm. To avoid using extra memory to store SAD values,
the status register stores only the result of the comparison,
that is, whether or not the min. SAD is greater than
$1.75 \times \text{avg. } SAD_{k-1}$, after the IME of an MB
is completed. Since an HD1080p frame is 120 MBs wide, the
status register requires 120 bits.
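The idea of keeping one comparison bit per MB instead of the SAD itself can be modelled as follows. This is a hedged sketch of the storage only: the register is represented as a Python integer, the helper names are hypothetical, and the update order imposed by the hardware pipeline is not modelled.

```python
MBS_PER_ROW = 1920 // 16  # 120 macroblocks across an HD1080p frame


def sad_flag(min_sad, avg_sad_prev, factor=1.75):
    """One-bit status: did this MB's min. SAD exceed the threshold?"""
    return min_sad > factor * avg_sad_prev


def set_flag(register, column, flag):
    """Return the register with the bit for one MB column set or cleared."""
    mask = 1 << column
    return (register | mask) if flag else (register & ~mask)


def get_flag(register, column):
    """Read the status bit of one MB column."""
    return bool((register >> column) & 1)


def any_neighbour_flagged(register, columns):
    """True if any neighbouring MB column inside the frame has its bit set."""
    return any(get_flag(register, c) for c in columns if 0 <= c < MBS_PER_ROW)
```

Because the maximum of several SADs exceeds the threshold exactly when at least one of them does, OR-ing the stored bits gives the same decision as comparing max. SAD directly.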


Fig. 3 Architecture of integer motion estimation (IME)
The DSR generator calculates the search range according
to the motion vectors and SAD status of the neighbouring MBs,
fetched from the MV SRAM and the status register
respectively. Then, the prefetch unit loads the reference
pixels from external memory based on the search range.
Simultaneously, the control unit commands the SAD trees
to scan the allowed search area and calculate the
rate-distortion cost by the Lagrange equation:
$$\text{rate-distortion cost} = \lambda \cdot \text{rate}(mvd) + SAD$$
$\lambda$: Lagrange multiplier;
$\text{rate}(mvd)$: number of coded bits of the motion vector difference
While scanning the search area, the SAD trees calculate the
rate-distortion cost of each point for all MB modes in
parallel and keep the min. cost and its corresponding
position, namely the motion vector. Finally, the mode
decision determines the best mode, motion vectors, and SAD.
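A minimal sketch of this cost computation is given below. It assumes that rate(mvd) is the length of the signed Exp-Golomb code of each motion vector difference component, as used for motion vector differences under CAVLC; the paper does not spell out its rate model, and the function names and the candidate format are hypothetical.

```python
def se_golomb_bits(value):
    """Bit length of the signed Exp-Golomb code of an integer."""
    # Map the signed value to an unsigned code number: 0, 1, -1, 2, -2, ...
    code_num = 2 * abs(value) - (1 if value > 0 else 0)
    # Code length is 2 * floor(log2(code_num + 1)) + 1.
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def mvd_rate(mv, predictor):
    """Assumed rate model: bits for both components of mv - predictor."""
    return sum(se_golomb_bits(m - p) for m, p in zip(mv, predictor))


def rd_cost(sad, mv, predictor, lam):
    """Rate-distortion cost = lambda * rate(mvd) + SAD."""
    return lam * mvd_rate(mv, predictor) + sad


def best_candidate(candidates, predictor, lam):
    """Keep the (mv, sad) pair with the minimum rate-distortion cost."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[0], predictor, lam))
```

With lambda = 4 and a zero predictor, a candidate at (1, 0) with SAD 100 costs 116 and beats a candidate at (0, 0) with SAD 120, which costs 128.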
[Figure: panels (a)-(d); labels SR=16, SR=32, SR=64 and sub-sampling ratios 1:1, 4:1, 16:1]
Fig. 4 Different search point sub-sampling based on search location

For HD1080p coding, it is appropriate to set the search
range to at least 64. The IME must then check 16641 points
in the search area in the worst case; that is, it consumes
16641 clock cycles in total if it employs one SAD tree and
checks every point. To meet the real-time requirement, we
adopt multi-resolution motion estimation and use two SAD
trees, as shown in Fig. 4. The IME performs a fine search and
checks every point within search range 16, as shown in
Fig. 4 (b). Outside search range 16, it performs a coarse
search, which adopts 4:1 (Fig. 4 (c)) and 16:1 (Fig. 4 (d))
search point sub-sampling in the areas of search range 32 and
64 respectively. Therefore, it requires at most 1325 clock
cycles for a max. search range of 64. To reduce hardware
cost, the SAD tree adopts 2:1 pixel decimation and truncates
the two least significant bits of the luma pixels, as shown
in Fig. 5.
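The 1325-cycle figure can be reproduced with an idealised count. The sketch below assumes that each ring between two search ranges contributes its area divided by its sub-sampling ratio, and that each SAD tree evaluates one search point per clock cycle; both are simplifying assumptions, and the names are illustrative.

```python
import math


def search_points(search_range):
    """Search points in a square window of the given range."""
    return (2 * search_range + 1) ** 2


def multires_points(levels=((16, 1), (32, 4), (64, 16))):
    """Points checked when each ring (range, ratio) is sub-sampled by its ratio."""
    total = 0.0
    inner = 0
    for search_range, ratio in levels:
        area = search_points(search_range)
        total += (area - inner) / ratio  # only the ring outside the previous window
        inner = area
    return total


def worst_case_cycles(points, sad_trees=2):
    """Assumption: every SAD tree evaluates one search point per cycle."""
    return math.ceil(points / sad_trees)


if __name__ == "__main__":
    points = multires_points()
    print(points, worst_case_cycles(points))
```

Under these assumptions the three regions contribute 1089, 784, and 776 points, and two SAD trees need 1325 cycles, compared with 16641 cycles for one tree checking every point.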

Fig. 5 2:1 pixel decimation and 2-bit pixel truncation
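The effect of pixel decimation and truncation on the SAD can be illustrated as follows. The exact decimation pattern of Fig. 5 is not recoverable from the text, so this sketch assumes a checkerboard pattern that keeps every second pixel; the function name and the list-of-rows block format are also assumptions.

```python
def decimated_truncated_sad(cur, ref, drop_bits=2):
    """SAD over half of the pixels, using only the upper bits of each pixel.

    cur and ref are equally sized blocks given as lists of rows of luma values.
    """
    sad = 0
    for y, (cur_row, ref_row) in enumerate(zip(cur, ref)):
        for x, (c, r) in enumerate(zip(cur_row, ref_row)):
            # Assumed 2:1 decimation: keep pixels where x + y is even.
            if (x + y) % 2 == 0:
                # Drop the least significant bits before subtracting.
                sad += abs((c >> drop_bits) - (r >> drop_bits))
    return sad
```

Only half of the absolute differences are computed, and each one operates on 6-bit instead of 8-bit values, which is where the hardware saving comes from.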


TABLE II COMPARISON OF PSNR, BIT-RATE, AND AVERAGE SEARCH POINTS PER MACROBLOCK (QP=27)

Sequence            Method  PSNR   Bit-rate  Search     Search points
                            (dB)   (kb)      points/MB  reduction (%)
container (CIF)     FS      37.22   602.6     4225      -
container (CIF)     [4]     37.23   603.2     2043      51.64
container (CIF)     [5]     37.23   603.3     1747      58.65
container (CIF)     ours    37.22   603.5      134      96.83
mobile (CIF)        FS      36.07  3325.7     4225      -
mobile (CIF)        [4]     36.07  3326.4     1849      56.24
mobile (CIF)        [5]     36.07  3326.7     2098      50.34
mobile (CIF)        ours    36.07  3325.6      384      90.91
pedestrian (1080p)  FS      40.05  8969.2    16641      -
pedestrian (1080p)  [4]     40.04  9101.1    16641       0
pedestrian (1080p)  [5]     40.04  9101.1    16641       0
pedestrian (1080p)  ours    40.04  9112.0     3249      80.48
station2 (1080p)    FS      39.26  5642.3    16641      -
station2 (1080p)    [4]     39.25  5671.4    16589       0.31
station2 (1080p)    [5]     39.25  5671.9    16538       0.62
station2 (1080p)    ours    39.25  5718.9      729      95.62
IV. EXPERIMENTAL RESULTS
Table II compares our integer motion estimation with
three other methods in terms of PSNR, bit-rate, and search
points per MB. The simulation uses four 30fps video
sequences, namely two CIF sequences (container and mobile)
and two HD1080p sequences (pedestrian and station2), with an
IPPPPPPP structure (intra frame period 8), RDO off, QP 27, no
intra MB in inter frames, 1 reference frame, and CAVLC
entropy coding. The max. search range is set to 32 for the
CIF sequences and 64 for the HD1080p sequences. Our algorithm
combines the proposed dynamic search range algorithm, 2:1
pixel decimation, and 2-bit pixel truncation. The motion
estimation methods in [4] and [5] also adopt 2:1 pixel
decimation and 2-bit pixel truncation. The thresholds TH1 and
TH2 defined in [5] are set to 1024 and 2048 respectively.
Compared with the full search algorithm, our algorithm
reduces the search points per MB by more than 80% (80.48% to
96.83% in Table II) with only a slight PSNR loss and bit-rate
increase. Minocha's method [4] is much less effective at
reducing search points, and gives almost no reduction for the
HD1080p sequences, because it ignores the spatial correlation
of motion vectors. Although the method in [5] takes both
temporal and spatial correlation into account, its search
range is set to the maximum once it finds a high-motion
object in a frame. In addition, it is difficult to determine
appropriate values for the thresholds TH1 and TH2.
TABLE III COMPARISON WITH OTHER INTEGER MOTION ESTIMATION
ARCHITECTURES

Design           [2]         [3]          Proposed
technology       180 nm      180 nm       180 nm
cycles / MB      1000        817          677
clock freq.      108 MHz     200 MHz      166 MHz
max. resolution  720p 30fps  1080p 30fps  1080p 30fps
gate count       305 k       486 k        117 k
internal SRAM    13.71 KB    40 KB        20 KB

The proposed IME has been implemented in Verilog HDL and
synthesized in TSMC 180 nm CMOS technology for operation at
166 MHz. Table III shows the synthesis results compared with
two related works. The design in [2] processes an MB within
1000 cycles but only supports HD720p video coding. In order
to achieve HD1080p coding, the design in [3] uses thirty-two
sets of SAD trees to improve the throughput, at a much higher
hardware cost. By applying the DSR algorithm and
multi-resolution motion estimation, our IME employs just two
sets of SAD trees and processes an MB within 677 cycles.
Therefore, our IME not only has a lower hardware cost, but
also achieves real-time coding of HDTV1080p video when
operating at 166 MHz.
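The link between cycles per MB and the required clock frequency can be checked with a short calculation. The sketch assumes that the frame is padded to whole 16x16 macroblocks and that the IME cycle count per MB is the only throughput limit; the function name is illustrative.

```python
def required_clock_hz(width, height, fps, cycles_per_mb, mb_size=16):
    """Minimum clock frequency needed to process every MB of every frame."""
    mbs_wide = -(-width // mb_size)   # ceiling division
    mbs_high = -(-height // mb_size)
    return mbs_wide * mbs_high * fps * cycles_per_mb


if __name__ == "__main__":
    # Proposed IME: 1080p at 30 fps with 677 cycles per MB.
    print(required_clock_hz(1920, 1080, 30, 677) / 1e6, "MHz")
    # Design [2]: 720p at 30 fps with 1000 cycles per MB.
    print(required_clock_hz(1280, 720, 30, 1000) / 1e6, "MHz")
```

Under these assumptions, 677 cycles per MB requires about 165.7 MHz for 1080p at 30 fps, which fits within the 166 MHz clock, and 1000 cycles per MB for 720p at 30 fps gives the 108 MHz listed for [2].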
V. CONCLUSION
In this paper, we present a new dynamic search range
algorithm that takes both the temporal and spatial
correlation of motion vectors into consideration. We also
discuss the design issues that arise when it is implemented
in hardware. Furthermore, our IME adopts several optimization
methods, including multi-resolution motion estimation, 2:1
pixel decimation, and 2-bit pixel truncation, to improve the
throughput and save hardware cost. The resultant IME can
process an MB within 677 cycles, which is sufficient for
real-time coding of HDTV1080p video when operating at
166 MHz.
REFERENCES
[1] ITU-T, "Advanced Video Coding for Generic Audiovisual
Services," ITU-T Recommendation H.264 and ISO/IEC
14496-10 (MPEG4-AVC), March 2005.

[2] T.C. Chen, S.Y. Chien, Y.W. Huang, C.H. Tsai, C.Y. Chen,
T.W. Chen, and L.G. Chen, "Analysis and architecture design of
an HDTV720p 30 frames/s H.264/AVC encoder," IEEE
Trans. on CSVT, vol. 16, no. 6, pp. 673-688, June 2006.

[3] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Ishiwata,
M. Nakagawa, and S. Goto, "HDTV1080p H.264/AVC encoder chip
design and performance analysis," IEEE Journal of
Solid-State Circuits, vol. 44, no. 2, pp. 594-608, Feb. 2009.

[4] J. Minocha and N.R. Shanbhag, "A low power data-adaptive
motion estimation algorithm," in Proc. of MMSP, 1999.

[5] S. Saponara and L. Fanucci, "Data-adaptive motion estimation
algorithm and VLSI architecture design for low-power video
systems," Proc. IEE Computers and Digital Techniques, vol.
151, no. 1, pp. 51-59, 2004.

[6] Joint Video Team Reference Software JM 15.1